Practical kernel-based reinforcement learning (and other stories)
Joelle Pineau Reasoning and Learning Lab / Center for Intelligent Machines
School of Computer Science, McGill University
Joint work with: Andre Barreto, Mahdi Milani Fard, William Hamilton, Doina Precup
Workshop on Big Data and Statistical Machine Learning
Fields Institute, January 29, 2015
Reinforcement learning
1. A learning agent tries a sequence of actions (at).
2. It observes the outcomes (next state st+1, reward rt) of those actions.
3. It statistically estimates the relationship between action choices and outcomes, Pr(st | st-1, at).
After some time, it learns an action-selection policy, π(s), that optimizes the selected outcomes:
argmaxπ Eπ[ r0 + r1 + … + rT ]
[Bellman, 1957; Sutton, 1988; Sutton & Barto, 1998.]
2 Recent advances in Reinforcement Learning Joelle Pineau
http://en.wikipedia.org/wiki/Animal_training
Recent applications of RL
• Robotics
• Medicine
• Advertisement
• Resource management
• Game playing
• …
Goal: To create an adaptive neuro-stimulation system that can maximally reduce the incidence of epileptiform activity.
Adaptive deep-brain stimulation
Reinforcement learning agent
Closed-loop, non-periodic stimulation strategy learned for an in vitro model of epilepsy.
Adaptive deep-brain stimulation
RL vs. supervised learning

Supervised learning: inputs → outputs. Training signal = desired (target) outputs, e.g., class labels. Learning from i.i.d. samples.

Reinforcement learning: inputs → outputs ("actions"). Training signal = "rewards" from the environment. Jointly learning AND planning from correlated samples.
Learning problem
Learn the Q-function, defining the state-action values of policy π:

Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q^π(s', a')

(immediate reward + future expected sum of rewards)

Error used to adjust the Q-function, given a sample ⟨s, a, r, s'⟩:

δ = Q(s, a) − (r + γ max_{a'} Q(s', a'))
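The TD error above drives the tabular Q-learning update. A minimal sketch (the state/action sizes and learning rate are illustrative, not from the slides):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s,a) against the TD error
    delta = Q(s,a) - (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    delta = Q[s, a] - target
    Q[s, a] -= alpha * delta
    return delta

# Toy example: 2 states, 2 actions.
Q = np.zeros((2, 2))
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1 after one step toward the target value 1.0
```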
Batch reinforcement learning
Use regression analysis to estimate the long-term value of different actions from the training data. Regression can use linear functions, kernel functions, random forests, neural networks, etc. Important: the target function, Q, is the sum of future expected rewards.
[Figure: ensemble of regression trees, each splitting on tests of the form fi < ti.]
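The batch pattern — regress bootstrapped Q-targets with any supervised learner — is fitted Q-iteration. A hedged sketch with 1-D states and an ordinary-least-squares regressor standing in for the trees/kernels/nets the slides mention; the feature layout is an illustrative assumption:

```python
import numpy as np

def fitted_q_iteration(batch, n_actions, n_iters=20, gamma=0.9):
    """Fitted Q-iteration on a fixed batch of transitions (s, a, r, s').

    The regressor here is least squares on a per-action (bias, state)
    feature vector; any batch regressor could be swapped in.
    States are 1-D floats in this sketch.
    """
    S = np.array([t[0] for t in batch])
    A = np.array([t[1] for t in batch])
    R = np.array([t[2] for t in batch])
    S2 = np.array([t[3] for t in batch])

    def features(s, a):
        f = np.zeros((len(s), 2 * n_actions))
        for i, (si, ai) in enumerate(zip(s, a)):
            f[i, 2 * ai] = 1.0      # bias term for this action
            f[i, 2 * ai + 1] = si   # state feature for this action
        return f

    w = np.zeros(2 * n_actions)
    X = features(S, A)
    for _ in range(n_iters):
        # Bootstrapped targets: r + gamma * max_a' Q(s', a')
        q_next = np.stack([features(S2, np.full(len(S2), a)) @ w
                           for a in range(n_actions)], axis=1)
        y = R + gamma * q_next.max(axis=1)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```

On a toy batch where action 1 always pays 1 and action 0 pays 0, the learned weights rank action 1 above action 0, with values approaching the discounted sum 1/(1−γ).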
Algorithms for large-scale RL
[Figure: algorithm landscape organized by # features vs. # examples.]

• Algorithms with sample-complexity guarantees [Fard et al., NIPS'13, AAAI'12, UAI'11, NIPS'10]
• Algorithms with convergence guarantees [Barreto et al., JAIR'14, NIPS'12, NIPS'11]
Kernel-based RL (KBRL; Ormoneit & Sen, 2002)

Sample transitions associated with each action a.
KBRL builds an empirical transition model by placing a normalized kernel over the sampled states:

P̂^a(s^a_i | s) := κ^a_τ(s, s^a_i) = k_τ(s, s^a_i) / Σ_j k_τ(s, s^a_j),   with k_τ(s, s') = φ(‖s − s'‖ / τ)
The Q-function is then a kernel-weighted average over the sampled transitions for action a:

Q(s, a) = Σ_{i=1}^{n_a} κ^a_τ(s, s^a_i) [ r^a_i + γ V*(ŝ^a_i) ]
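KBRL's kernel-averaged Q-estimate can be sketched for a single action (a hedged illustration: 1-D states, a Gaussian mother kernel, and an arbitrary bandwidth τ are assumptions, not the slides' choices):

```python
import numpy as np

def kbrl_q(s, transitions, V, tau=0.5, gamma=0.9):
    """KBRL action-value estimate at state s for one action.

    `transitions` is a list of sampled (s_i, r_i, s_i') tuples for that
    action; `V` maps a successor state to its current value estimate.
    Uses the Gaussian mother kernel phi(z) = exp(-z^2).
    """
    S = np.array([t[0] for t in transitions], dtype=float)
    R = np.array([t[1] for t in transitions], dtype=float)
    S2 = [t[2] for t in transitions]
    k = np.exp(-((s - S) / tau) ** 2)   # k_tau(s, s_i)
    kappa = k / k.sum()                 # normalized kernel weights
    return float(kappa @ (R + gamma * np.array([V(s2) for s2 in S2])))
```

With two sampled transitions, one near the query state and one far away, the estimate is dominated by the nearby sample, as the normalized weights intend.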
• KBRL has good theoretical guarantees:
– It converges to a solution.
– It is consistent in the statistical sense: additional data improves the quality of the approximation.
– If kernel widths are decreased at an appropriate rate, the probability of selecting suboptimal actions converges to zero.

Problem: one application of the Q-function update is O(n²).
Stochastic factorization trick

A stochastic matrix P satisfies:
• p_ij ≥ 0
• Σ_j p_ij = 1
Stochastic factorization: write P as a product of two stochastic matrices, P = D K.
Stochastic-factorization trick: swap the factors to obtain the (smaller) stochastic matrix P̄ = K D.
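The trick can be checked numerically. A small numpy demonstration (the dimensions and random matrices are illustrative): swapping the factors of P = DK yields an m × m stochastic matrix KD, and the two products share their nonzero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols, rng):
    """Random row-stochastic matrix (each row is a probability distribution)."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

n, m = 6, 3
D = random_stochastic(n, m, rng)   # n x m
K = random_stochastic(m, n, rng)   # m x n

P = D @ K          # n x n stochastic matrix (the large model)
P_bar = K @ D      # m x m stochastic matrix (the small surrogate)

# Both products are row-stochastic, and KD shares the nonzero
# eigenvalues of DK, which is what lets the small MDP stand in
# for the large one.
print(np.allclose(P.sum(axis=1), 1.0), np.allclose(P_bar.sum(axis=1), 1.0))
```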
Kernel-based stochastic factorization (KBSF)
KBSF: select a set of representative states, and define D ∈ ℝ^{n×m}, K ∈ ℝ^{m×n}.
KBSF algorithm
1. For each a ∈ A:
   a. Compute matrix D^a: d^a_ij = κ̄(ŝ^a_i, s̄_j)
   b. Compute matrix K^a: k^a_ij = κ(s̄_i, s^a_j)
   c. Compute r̄^a: r̄^a_i = Σ_j k^a_ij r^a_j
   d. Compute P̄^a = K^a D^a
2. Solve the reduced MDP described by P̄^a and r̄^a.
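A minimal numpy sketch of the per-action model construction, under stated assumptions: 1-D states, a Gaussian kernel, and D built by comparing sampled successor states to representative states (the indexing conventions are my reading of the algorithm, not verbatim from the slides):

```python
import numpy as np

def gauss_kernel(X, Y, tau):
    """Normalized Gaussian kernel matrix: row i is a distribution over Y."""
    K = np.exp(-((X[:, None] - Y[None, :]) / tau) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def kbsf_model(starts, rewards, successors, rep_states, tau=0.5):
    """Build the reduced KBSF model for one action (1-D states).

    D: sampled successor states -> representative states   (n x m)
    K: representative states    -> sampled start states    (m x n)
    P_bar = K @ D is the m x m transition model over representative
    states; r_bar is the kernel-averaged reward at each of them.
    """
    D = gauss_kernel(successors, rep_states, tau)   # n x m
    K = gauss_kernel(rep_states, starts, tau)       # m x n
    P_bar = K @ D                                   # m x m, row-stochastic
    r_bar = K @ rewards                             # length m
    return P_bar, r_bar
```

The returned P̄^a is row-stochastic by construction, so standard dynamic programming can be run on the m-state model.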
Properties

Computational requirements (n = number of sample transitions; m = number of representative states):

                     KBRL       KBSF
Model generation     O(n²|A|)   O(nm²|A|)
Policy evaluation    O(n³)      O(m³)
Policy improvement   O(n²|A|)   O(m²|A|)
Advantages of KBSF
• It always converges to a solution.
• It has good theoretical guarantees.
• It is fast.
• It is simple.
• Its behavior is predictable, generally improving in performance as m → n or n → ∞.
Epilepsy suppression in silico

Epilepsy suppression in vitro
[Figure: fraction of seizures (≈0.10–0.20) vs. fraction of stimulation (0–0.35) for fixed-frequency baselines (0 Hz, 0.5 Hz, 1 Hz, 1.5 Hz) and learned policies (T20-x, T200-x, LSPI-x, KBSF-x for x ∈ {10, 20, 40}), together with running times in seconds on a log scale (≈50 to 5000 s).]
Online / parallel KBSF: incremental KBSF
Triple pole-balancing

[Figure: fraction of successful episodes (≈0.4–0.8) and running time in seconds (0–1500) vs. number of sample transitions (2×10⁶ to 10⁷), comparing batch KBSF with KBSF-tree.]
Incremental KBSF
Helicopter task (from the 2013 RL competition)

[Figure: steps flying vs. number of sample transitions (×30000, up to 20).]
2013 RL competition results for KBSF: 1st in the polyathlon task; 2nd in the helicopter task (best value-based, online learning method).
High-dimensional input data?

Learning representations for RL

[Figure: mapping from the original state s to a learned representation φ(s), used as input to Q(φ(s), a).]
Partially observable MDP

[Figure: graphical model with hidden states S_{t−1}, S_t, S_{t+1}, visible observations Z_t and actions A_t, rewards r_t, transition function T, and observation function O.]

• A POMDP is defined by the n-tuple ⟨S, A, Z, T, O, R⟩, where ⟨S, A, T, R⟩ are the same as in an MDP.
• States are hidden → the agent instead receives observations (Z).
• Observation function: O(s, a, z) := Pr(z | s, a)
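Given the tuple ⟨S, A, Z, T, O, R⟩, the agent tracks a belief b(s) with a Bayes filter. A minimal numpy sketch (the dict-of-matrices layout and the two-state example are illustrative assumptions):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes filter for a POMDP: b'(s') ∝ O(s', a, z) * sum_s T(s, a, s') b(s).

    T[a] is an |S| x |S| transition matrix and O[a] an |S| x |Z|
    observation matrix over successor states.
    """
    pred = b @ T[a]                 # predicted state distribution
    unnorm = pred * O[a][:, z]      # weight by observation likelihood
    return unnorm / unnorm.sum()

# Toy 2-state example with one action and two observations.
T = {0: np.array([[0.9, 0.1], [0.1, 0.9]])}
O = {0: np.array([[0.8, 0.2], [0.2, 0.8]])}
b2 = belief_update(np.array([0.5, 0.5]), a=0, z=0, T=T, O=O)
print(b2)  # [0.8 0.2]: observation 0 shifts belief toward state 0
```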
POMDP planning in high dimensions

• Poc-Man problem: |S| = 10^56, |A| = 4, |Z| = 2^10 = 1024.
http://jveness.info/publications/default.html
Learning representations for RL [Hamilton, Fard & Pineau, 2013]

Random projection: map the original state s to φ(s), and learn Q(φ(s), a).
Latent-state approaches to learning
o Expectation-maximization
  » HMM learning (Rabiner, 1990)
  » Online nested EM (Liu, Liao & Carin, 2013)
  » Model-free RL as mixture learning (Vlassis & Toussaint, 2009)
o Bayesian learning
  » Kalman filtering
  » Bayes-Adaptive POMDP (Ross, Chaib-draa & Pineau, 2007)
  » Infinite POMDP (Doshi-Velez, 2009)
Event-based approaches to learning
• An alternative is to work directly with observable events: estimate probabilities of the form p(o_{t:t+k} | o_{1:t-1}) and represent them compactly.
o Merge-split histories
  » U-Tree (McCallum, 1996)
  » MC-AIXI (Veness, Ng, Hutter, Uther & Silver, 2011)
o Predictive state representations
  » Observable operator models (Jaeger, 2000)
  » PSRs (Littman, Sutton & Singh, 2002)
o Methods of moments
  » Spectral HMMs (Hsu et al., 2008)
  » TPSRs (Boots & Gordon, 2010)
  » Tensor decomposition (Anandkumar, 2012)
The PSR system-dynamics matrix, D

We want P(τ_i | h_j) ∀i, ∀j. The rank of D is finite and bounded [Littman et al., 2002]. The tests corresponding to a row basis are called core tests.
Learning PSRs from data
• Spectral approach [Rosencrantz et al., 2004; Boots et al., 2009]:
  o Estimate a large sub-matrix of D.
  o Project it to a low-dimensional subspace using SVD.
  o Globally optimal, if you know rank(D).
• Still computationally expensive: O(|T|²|H|).
Regularity in large observation spaces
• We say that D is k-sparse if only k tests are possible for any given history h_i.
• Is this a valid assumption? In the Poc-Man domain, large sub-matrix estimates of D were empirically observed to have an average of 99.9% column sparsity.
Exploiting sparsity
• Sparse structure can be exploited using random projections: compress the m × n matrix Y to a d × n matrix X, where d ≪ m. Let X = ΦY, where Φ is a d × m projection matrix with entries drawn from N(0, 1/d).
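A small numpy demonstration of this compression (the dimensions, sparsity level, and seed are illustrative): with entries drawn from N(0, 1/d), the projection approximately preserves column norms and inner products of sparse columns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ambient dimension m, number of columns n, compressed dimension d.
m, n, d = 10_000, 5, 100

# Y has sparse columns (the k-sparse regime described above).
Y = np.zeros((m, n))
Y[rng.choice(m, size=20, replace=False), 0] = 1.0
Y[rng.choice(m, size=20, replace=False), 1] = 1.0

# Projection matrix with entries from N(0, 1/d); X = Phi @ Y is d x n.
Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))
X = Phi @ Y

# Norm of a compressed column stays close to the original's norm.
ratio = np.linalg.norm(X[:, 0]) / np.linalg.norm(Y[:, 0])
print(X.shape, round(ratio, 2))
```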
The CPSR algorithm

Note: the multiplication is done "online", and the D matrix is never held in memory.

1. Obtain compressed estimates of sub-matrices of D (ΦP_{T,H}, ΦP_{T,o_l,H}, and P_H) by sampling time-series data.
2. Estimate ΦP_{T,H} in the compressed space by adding φ_i to column j each time test t_i is observed after history h_j (likewise for ΦP_{T,o_l,H}).
3. Compute the CPSR model:
   c_0 = ΦP(τ | ∅)
   C_o = ΦP_{T,o_l,H} (ΦP_{T,H})⁺
   c_∞ = (ΦP_{T,H})⁺ P_H
Using the compact representation

State definition and necessary equations:
• c_0 serves as the initial prediction (state) vector.
• Update the state vector after seeing observation o with:
  c_{t+1} = C_o c_t / (c_∞^⊤ C_o c_t)
• Predict k steps into the future using:
  P(o^j_{t+k} | h_t) = c_∞^⊤ C_{o_j} (C_*)^{k−1} c_t,   where C_* = Σ_{o_i ∈ O} C_{o_i}
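The PSR-style filtering and prediction steps can be sketched generically (notation follows the standard PSR update; the one-dimensional i.i.d. toy model in the test is an illustrative assumption, not a learned CPSR):

```python
import numpy as np

def psr_filter(c, C_o, c_inf):
    """One PSR/CPSR state update after observing o:
    c_{t+1} = C_o c_t / (c_inf^T C_o c_t)."""
    num = C_o @ c
    return num / (c_inf @ num)

def predict_k_steps(c, C_obs, c_inf, o_j, k):
    """k-step prediction: P(o_j at t+k | h_t) = c_inf^T C_{o_j} C_*^{k-1} c_t,
    where C_* sums the per-observation operators."""
    C_star = sum(C_obs.values())
    v = np.linalg.matrix_power(C_star, k - 1) @ c
    return float(c_inf @ (C_obs[o_j] @ v))
```

For a proper model, c_∞^⊤ C_* acts as c_∞^⊤, so the one-step predictions over all observations sum to 1.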
Properties of CPSR
• Computationally efficient (the projection has low cost).
• A projection size of d = O(k log |Q|) suffices in most systems.
• Allows projection to subspaces of dimension d < rank(D); if d ≥ rank(D), the model is trivially consistent.
• Regularizes the solution (i.e., the learned model parameters).
• The total propagated error over T steps is bounded.
GridWorld prediction error

GridWorld: increased time-efficiency in small, simple systems.

[Figure: runtime in minutes (CPSR vs. TPSR, bar chart) and prediction error (0–0.016) vs. steps ahead to predict (0–11) for both models.]
Poc-Man prediction error

Poc-Man: better model quality in large, difficult systems. A partially observable variant of the Pac-Man video game with |S| = 10^56 and |O| = 2^10 [Silver and Veness, 2010].

[Figure: prediction error (0–0.3) vs. number of steps ahead to predict (0–21), comparing CPSR (all tests) with TPSR (single-length tests).]
Planning in CPSRs

• Point-based value iteration (traditional POMDP approach): the CPSR state vector plays the role of the POMDP belief state.
• Value-function approximation (common MDP approach): the CPSR state vector is the input to the value-function approximator.
The basic algorithm: point-based value iteration

Approach:
• Select a small set of belief points (b_0, b_1, b_2, …).
• Plan for those belief points only ⇒ learn the value and its gradient.
• Pick the action that maximizes value ⇒ V(b) = max_α α · b
⇒ Which beliefs are most important?

Selecting belief points
• Earlier strategies: fully observable states; points at regular intervals; random samples.
• Can we use theoretical properties to pick important beliefs?
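The alpha-vector maximization V(b) = max_α α · b can be sketched directly (a toy two-state example; the alpha vectors and action labels are illustrative):

```python
import numpy as np

def pbvi_value(b, alphas):
    """Value of belief b under a point-based value function:
    V(b) = max over alpha-vectors of alpha . b.
    `alphas` is a list of (alpha_vector, action) pairs; returns
    the best (value, action) pair."""
    vals = [(float(np.dot(a_vec, b)), act) for a_vec, act in alphas]
    return max(vals, key=lambda p: p[0])

b = np.array([0.7, 0.3])
alphas = [(np.array([1.0, 0.0]), "listen"),
          (np.array([0.0, 2.0]), "open")]
print(pbvi_value(b, alphas))  # → (0.7, 'listen')
```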
Poc-Man results
• S-PocMan: drop the parts of the observation vector that allow the agent to sense in which direction food lies.
[Figure 7: Average return per episode achieved in the PocMan (a) and S-PocMan (b) domains using different models and baselines (Hashed-250, Memoryless, Rademacher-250, Random Obs., Spherical-250). Compressed dimension sizes are listed next to the model names. 95% confidence interval error bars are shown.]
For both PocMan and S-PocMan, we set d0 = 25 and examined compressed dimensions in the range [250, 500] (selecting only the top performer via a validation set); no TPSR models were used on these domains, as their construction exhausted the memory capacity of the machine used. Following Veness et al. (2011), for these domains we use γ = 0.99999 as a discount factor.

Figure 7 details the performance of the CPSR algorithms on the PocMan and S-PocMan domains. In these domains, we see a much smaller performance gap between the CPSR approaches and the memoryless baseline. In fact, in the PocMan domain, the memoryless controller is the top performer. This demonstrates, first and foremost, that the PocMan domain is not strongly partially observable. Though the observations do not fully determine the agent's state, the immediate rewards available to an agent (with the exception of the reward for eating the power pill and catching a ghost) are discernible through the observation vector (e.g., the agent can see locally where food is). Thus, the memoryless controller is able to formulate successful plans despite the fact that it is treating the domain as if it were fully observable. Moreover, a qualitative comparison with the Monte-Carlo AIXI approximation (Veness et al., 2011) reveals that the quality of the memoryless controller's plans is actually quite good. In that work, they use a slightly different optimization criterion of optimizing for average transition reward, and with on the order of 50000 transitions they achieve an average transition reward in the range [−1, 1] (depending on parameter settings). With on the order of 250000 transitions they achieve an average transition reward in the range [1, 1.5]. In this work, the memoryless controller achieves an average transition reward of
Adaptive migratory bird management

IJCAI 2013 challenge problem (Nicol et al., 2013): observe bird population levels at different breeding sites over time, plus the protection level on intertidal sites along their migratory path.
Hamilton, Milani Fard and Pineau
[Figure 8: Average discounted reward per episode (≈9000–9800) achieved in the AMM domain using different methods (Hashed-10, Memoryless, Rademacher-50, Random Obs., Spherical-10) over 100000 test episodes (each of length 50). The numbers beside the CPSR method names denote the projected dimension size. 95% confidence intervals are too small to be visible.]
−0.2 (despite the fact that it is actually optimizing for average return per episode), and it is thus competitive given the same magnitude of samples, as approximately 50000 transitions were used in this work. It is also important to note that PSR-type models may be combined with memoryless controllers as memory PSRs (described in Section 7.2), and so it should be possible to boost the performance of the CPSR models to match that of the memoryless controller in that way.

Importantly, in S-PocMan, where part of the observation vector is dropped and the rewards are sparsified, we see that the top performer is again a CPSR-based model (which in this case uses spherical projections). This matches expectations, since the food rewards are no longer fully discernible from the observation vector, and thus the domain is significantly less observable. It is also worth noting that building naive TPSRs (without compression or domain-specific feature selection) is computationally infeasible in these PacMan-inspired domains, and thus the use of a PSR-based reinforcement learning agent (via the compression techniques used) in these domains is a considerable advancement.

A final observation is that performance is quite sensitive to the choice of projection matrices in these results. For example, in the S-PocMan domain, the Rademacher projections perform no better than the memoryless baseline, whereas in PocMan the Rademacher projections outperform the other projection methods. The exact cause of this performance change is unclear. Nevertheless, it highlights the importance of evaluating different projection techniques when applying this algorithm in practice.
Learning (more!) representations for RL

[Figure: mapping from the original state s to a learned representation φ(s), used as input to Q(φ(s), a).]
Deep Q-learning

Reinforcement learning + deep learning in the Arcade Learning Environment [Mnih et al., 2013; Guo et al., 2014] (www.arcadelearningenvironment.org). Many interesting open questions:
• Efficient exploration
• Temporally extended actions
• Transfer learning between games
• …
Research team @ McGill
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC), Canadian Institutes of Health Research (CIHR), National Institutes of Health (NIH), Fonds Québécois de Recherche Nature et Technologie (FQRNT)