Practical kernel-based reinforcement learning (and other stories)
Joelle Pineau Reasoning and Learning Lab / Center for Intelligent Machines
School of Computer Science, McGill University
Joint work with: Andre Barreto, Mahdi Milani Fard, William Hamilton, Doina Precup
Workshop on Big Data and Statistical Machine Learning
Fields Institute, January 29, 2015
Reinforcement learning
1. A learning agent tries a sequence of actions (at).
2. It observes the outcomes (next state st+1, reward rt) of those actions.
3. It statistically estimates the relationship between action choices and outcomes, Pr(st | st-1, at).
After some time, it learns an action-selection policy, π(s), that optimizes the selected outcomes:
argmaxπ Eπ[ r0 + r1 + … + rT ]
[Bellman, 1957; Sutton, 1988; Sutton & Barto, 1998.]
2 Recent advances in Reinforcement Learning Joelle Pineau
http://en.wikipedia.org/wiki/Animal_training
Recent applications of RL
• Robotics
• Medicine
• Advertisement
• Resource management
• Game playing
• …
Goal: To create an adaptive neuro-stimulation system that can maximally reduce the incidence of epileptiform activity.
Adaptive deep-brain stimulation
Reinforcement learning agent
Closed-loop, non-periodic stimulation strategy learned for an in vitro model of epilepsy.
Adaptive deep-brain stimulation
RL vs. supervised learning

Supervised learning: inputs → outputs. Training signal = desired (target) outputs, e.g., class labels. Learning from i.i.d. samples.

Reinforcement learning: inputs → outputs ("actions"). Training signal = "rewards" from the environment. Jointly learning AND planning from correlated samples.
Learning problem
Learn the Q-function, defining the state-action values of policy π:

Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q^π(s', a')

(immediate reward + future expected sum of rewards)

Error used to adjust the Q-function, given a sample ⟨s, a, r, s'⟩:

δ = Q(s, a) − (r + γ max_{a'} Q(s', a'))
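The TD error above drives the tabular Q-learning update. A minimal sketch (the state/action sizes and learning rate are illustrative, not from the slides):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s,a) against the TD error
    delta = Q(s,a) - (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    delta = Q[s, a] - target
    Q[s, a] -= alpha * delta
    return delta

# Toy example: 2 states, 2 actions.
Q = np.zeros((2, 2))
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1 after one step toward the target value 1.0
```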
Batch reinforcement learning
Use regression analysis to estimate the long-term value of different actions from the training data. Regression can use linear functions, kernel functions, random forests, neural networks, etc. Important: the target function, Q, is the sum of future expected rewards.
[Figure: ensemble of regression trees, each splitting on tests of the form fi < ti.]
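The batch pattern — regress bootstrapped Q-targets with any supervised learner — is fitted Q-iteration. A hedged sketch with 1-D states and an ordinary-least-squares regressor standing in for the trees/kernels/nets the slides mention; the feature layout is an illustrative assumption:

```python
import numpy as np

def fitted_q_iteration(batch, n_actions, n_iters=20, gamma=0.9):
    """Fitted Q-iteration on a fixed batch of transitions (s, a, r, s').

    The regressor here is least squares on a per-action (bias, state)
    feature vector; any batch regressor could be swapped in.
    States are 1-D floats in this sketch.
    """
    S = np.array([t[0] for t in batch])
    A = np.array([t[1] for t in batch])
    R = np.array([t[2] for t in batch])
    S2 = np.array([t[3] for t in batch])

    def features(s, a):
        f = np.zeros((len(s), 2 * n_actions))
        for i, (si, ai) in enumerate(zip(s, a)):
            f[i, 2 * ai] = 1.0      # bias term for this action
            f[i, 2 * ai + 1] = si   # state feature for this action
        return f

    w = np.zeros(2 * n_actions)
    X = features(S, A)
    for _ in range(n_iters):
        # Bootstrapped targets: r + gamma * max_a' Q(s', a')
        q_next = np.stack([features(S2, np.full(len(S2), a)) @ w
                           for a in range(n_actions)], axis=1)
        y = R + gamma * q_next.max(axis=1)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```

On a toy batch where action 1 always pays 1 and action 0 pays 0, the learned weights rank action 1 above action 0, with values approaching the discounted sum 1/(1−γ).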
Algorithms for large-scale RL
[Figure: algorithm landscape organized by # features vs. # examples.]

• Algorithms with sample-complexity guarantees [Fard et al., NIPS'13, AAAI'12, UAI'11, NIPS'10]
• Algorithms with convergence guarantees [Barreto et al., JAIR'14, NIPS'12, NIPS'11]
Kernel-based RL (KBRL; Ormoneit & Sen, 2002)

Sample transitions associated with each action a.
KBRL builds an empirical transition model by placing a normalized kernel over the sampled states:

P̂^a(s^a_i | s) := κ^a_τ(s, s^a_i) = k_τ(s, s^a_i) / Σ_j k_τ(s, s^a_j),   with k_τ(s, s') = φ(‖s − s'‖ / τ)
The Q-function is then a kernel-weighted average over the sampled transitions for action a:

Q(s, a) = Σ_{i=1}^{n_a} κ^a_τ(s, s^a_i) [ r^a_i + γ V*(ŝ^a_i) ]
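KBRL's kernel-averaged Q-estimate can be sketched for a single action (a hedged illustration: 1-D states, a Gaussian mother kernel, and an arbitrary bandwidth τ are assumptions, not the slides' choices):

```python
import numpy as np

def kbrl_q(s, transitions, V, tau=0.5, gamma=0.9):
    """KBRL action-value estimate at state s for one action.

    `transitions` is a list of sampled (s_i, r_i, s_i') tuples for that
    action; `V` maps a successor state to its current value estimate.
    Uses the Gaussian mother kernel phi(z) = exp(-z^2).
    """
    S = np.array([t[0] for t in transitions], dtype=float)
    R = np.array([t[1] for t in transitions], dtype=float)
    S2 = [t[2] for t in transitions]
    k = np.exp(-((s - S) / tau) ** 2)   # k_tau(s, s_i)
    kappa = k / k.sum()                 # normalized kernel weights
    return float(kappa @ (R + gamma * np.array([V(s2) for s2 in S2])))
```

With two sampled transitions, one near the query state and one far away, the estimate is dominated by the nearby sample, as the normalized weights intend.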
• KBRL has good theoretical guarantees:
– It converges to a solution.
– It is consistent in the statistical sense: additional data improves the quality of the approximation.
– If kernel widths are decreased at an appropriate rate, the probability of selecting suboptimal actions converges to zero.

Problem: one application of the Q-function update is O(n²).
Stochastic factorization trick

A stochastic matrix P satisfies:
• p_ij ≥ 0
• Σ_j p_ij = 1
Stochastic factorization: write P as a product of two stochastic matrices, P = D K.
Stochastic-factorization trick: swap the factors to obtain the (smaller) stochastic matrix P̄ = K D.
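The trick can be checked numerically. A small numpy demonstration (the dimensions and random matrices are illustrative): swapping the factors of P = DK yields an m × m stochastic matrix KD, and the two products share their nonzero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols, rng):
    """Random row-stochastic matrix (each row is a probability distribution)."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

n, m = 6, 3
D = random_stochastic(n, m, rng)   # n x m
K = random_stochastic(m, n, rng)   # m x n

P = D @ K          # n x n stochastic matrix (the large model)
P_bar = K @ D      # m x m stochastic matrix (the small surrogate)

# Both products are row-stochastic, and KD shares the nonzero
# eigenvalues of DK, which is what lets the small MDP stand in
# for the large one.
print(np.allclose(P.sum(axis=1), 1.0), np.allclose(P_bar.sum(axis=1), 1.0))
```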
Kernel-based stochastic factorization (KBSF)
KBSF: select a set of representative states, and define D ∈ ℝ^{n×m}, K ∈ ℝ^{m×n}.
KBSF algorithm
1. For each a ∈ A:
   a. Compute matrix D^a: d^a_ij = κ̄(ŝ^a_i, s̄_j)
   b. Compute matrix K^a: k^a_ij = κ(s̄_i, s^a_j)
   c. Compute r̄^a: r̄^a_i = Σ_j k^a_ij r^a_j
   d. Compute P̄^a = K^a D^a
2. Solve the reduced MDP described by P̄^a and r̄^a.
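A minimal numpy sketch of the per-action model construction, under stated assumptions: 1-D states, a Gaussian kernel, and D built by comparing sampled successor states to representative states (the indexing conventions are my reading of the algorithm, not verbatim from the slides):

```python
import numpy as np

def gauss_kernel(X, Y, tau):
    """Normalized Gaussian kernel matrix: row i is a distribution over Y."""
    K = np.exp(-((X[:, None] - Y[None, :]) / tau) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def kbsf_model(starts, rewards, successors, rep_states, tau=0.5):
    """Build the reduced KBSF model for one action (1-D states).

    D: sampled successor states -> representative states   (n x m)
    K: representative states    -> sampled start states    (m x n)
    P_bar = K @ D is the m x m transition model over representative
    states; r_bar is the kernel-averaged reward at each of them.
    """
    D = gauss_kernel(successors, rep_states, tau)   # n x m
    K = gauss_kernel(rep_states, starts, tau)       # m x n
    P_bar = K @ D                                   # m x m, row-stochastic
    r_bar = K @ rewards                             # length m
    return P_bar, r_bar
```

The returned P̄^a is row-stochastic by construction, so standard dynamic programming can be run on the m-state model.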
Properties

Computational requirements (n = number of sample transitions; m = number of representative states):

                     KBRL       KBSF
Model generation     O(n²|A|)   O(nm²|A|)
Policy evaluation    O(n³)      O(m³)
Policy improvement   O(n²|A|)   O(m²|A|)
Advantages of KBSF
• It always converges to a solution.
• It has good theoretical guarantees.
• It is fast.
• It is simple.
• Its behavior is predictable, generally improving in performance as m → n or n → ∞.
Epilepsy suppression in silico

Epilepsy suppression in vitro
[Figure: fraction of seizures (≈0.10–0.20) vs. fraction of stimulation (0–0.35) for fixed-frequency baselines (0 Hz, 0.5 Hz, 1 Hz, 1.5 Hz) and learned policies (T20-x, T200-x, LSPI-x, KBSF-x for x ∈ {10, 20, 40}), together with running times in seconds on a log scale (≈50 to 5000 s).]
Online / parallel KBSF: incremental KBSF
Triple pole-balancing

[Figure: fraction of successful episodes (≈0.4–0.8) and running time in seconds (0–1500) vs. number of sample transitions (2×10⁶ to 10⁷), comparing batch KBSF with KBSF-tree.]
Incremental KBSF
Helicopter task (from the 2013 RL competition)

[Figure: steps flying vs. number of sample transitions (×30000, up to 20).]
2013 RL competition results for KBSF: 1st in the polyathlon task; 2nd in the helicopter task (best value-based, online learning method).
High-dimensional input data?

Learning representations for RL

[Figure: mapping from the original state s to a learned representation φ(s), used as input to Q(φ(s), a).]
Partially observable MDP

[Figure: graphical model with hidden states S_{t−1}, S_t, S_{t+1}, visible observations Z_t and actions A_t, rewards r_t, transition function T, and observation function O.]

• A POMDP is defined by the n-tuple ⟨S, A, Z, T, O, R⟩, where ⟨S, A, T, R⟩ are the same as in an MDP.
• States are hidden → the agent instead receives observations (Z).
• Observation function: O(s, a, z) := Pr(z | s, a)
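Given the tuple ⟨S, A, Z, T, O, R⟩, the agent tracks a belief b(s) with a Bayes filter. A minimal numpy sketch (the dict-of-matrices layout and the two-state example are illustrative assumptions):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes filter for a POMDP: b'(s') ∝ O(s', a, z) * sum_s T(s, a, s') b(s).

    T[a] is an |S| x |S| transition matrix and O[a] an |S| x |Z|
    observation matrix over successor states.
    """
    pred = b @ T[a]                 # predicted state distribution
    unnorm = pred * O[a][:, z]      # weight by observation likelihood
    return unnorm / unnorm.sum()

# Toy 2-state example with one action and two observations.
T = {0: np.array([[0.9, 0.1], [0.1, 0.9]])}
O = {0: np.array([[0.8, 0.2], [0.2, 0.8]])}
b2 = belief_update(np.array([0.5, 0.5]), a=0, z=0, T=T, O=O)
print(b2)  # [0.8 0.2]: observation 0 shifts belief toward state 0
```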
POMDP planning in high dimensions

• Poc-Man problem: |S| = 10^56, |A| = 4, |Z| = 2^10 = 1024.
http://jveness.info/publications/default.html
Learning representations for RL [Hamilton, Fard & Pineau, 2013]

Random projection: map the original state s to φ(s), and learn Q(φ(s), a).
Latent-state approaches to learning
o Expectation-maximization
  » HMM learning (Rabiner, 1990)
  » Online nested EM (Liu, Liao & Carin, 2013)
  » Model-free RL as mixture learning (Vlassis & Toussaint, 2009)
o Bayesian learning
  » Kalman filtering
  » Bayes-Adaptive POMDP (Ross, Chaib-draa & Pineau, 2007)
  » Infinite POMDP (Doshi-Velez, 2009)
Event-based approaches to learning
• An alternative is to work directly with observable events: estimate probabilities of the form p(o_{t:t+k} | o_{1:t-1}) and represent them compactly.
o Merge-split histories
  » U-Tree (McCallum, 1996)
  » MC-AIXI (Veness, Ng, Hutter, Uther & Silver, 2011)
o Predictive state representations
  » Observable operator models (Jaeger, 2000)
  » PSRs (Littman, Sutton & Singh, 2002)
o Methods of moments
  » Spectral HMMs (Hsu et al., 2008)
  » TPSRs (Boots & Gordon, 2010)
  » Tensor decomposition (Anandkumar, 2012)
The PSR system-dynamics matrix, D

We want P(τ_i | h_j) ∀i, ∀j. The rank of D is finite and bounded [Littman et al., 2002]. The tests corresponding to a row basis are called core tests.
Learning PSRs from data
• Spectral approach [Rosencrantz et al., 2004; Boots et al., 2009]:
  o Estimate a large sub-matrix of D.
  o Project it to a low-dimensional subspace using SVD.
  o Globally optimal, if you know rank(D).
• Still computationally expensive: O(|T|²|H|).
Regularity in large observation spaces
• We say that D is k-sparse if only k tests are possible for any given history h_i.
• Is this a valid assumption? In the Poc-Man domain, large sub-matrix estimates of D were empirically observed to have an average of 99.9% column sparsity.
Exploiting sparsity
• Sparse structure can be exploited using random projections: compress the m × n matrix Y to a d × n matrix X, where d ≪ m. Let X = ΦY, where Φ is a d × m projection matrix with entries drawn from N(0, 1/d).
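A small numpy demonstration of this compression (the dimensions, sparsity level, and seed are illustrative): with entries drawn from N(0, 1/d), the projection approximately preserves column norms and inner products of sparse columns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ambient dimension m, number of columns n, compressed dimension d.
m, n, d = 10_000, 5, 100

# Y has sparse columns (the k-sparse regime described above).
Y = np.zeros((m, n))
Y[rng.choice(m, size=20, replace=False), 0] = 1.0
Y[rng.choice(m, size=20, replace=False), 1] = 1.0

# Projection matrix with entries from N(0, 1/d); X = Phi @ Y is d x n.
Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))
X = Phi @ Y

# Norm of a compressed column stays close to the original's norm.
ratio = np.linalg.norm(X[:, 0]) / np.linalg.norm(Y[:, 0])
print(X.shape, round(ratio, 2))
```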
The CPSR algorithm

Note: the multiplication is done "online", and the D matrix is never held in memory.

1. Obtain compressed estimates of sub-matrices of D (ΦP_{T,H}, ΦP_{T,o_l,H}, and P_H) by sampling time-series data.
2. Estimate ΦP_{T,H} in the compressed space by adding φ_i to column j each time test t_i is observed after history h_j (likewise for ΦP_{T,o_l,H}).
3. Compute the CPSR model:
   c_0 = ΦP(τ | ∅)
   C_o = ΦP_{T,o_l,H} (ΦP_{T,H})⁺
   c_∞ = (ΦP_{T,H})⁺ P_H
Using the compact representation

State definition and necessary equations:
• c_0 serves as the initial prediction (state) vector.
• Update the state vector after seeing observation o with:
  c_{t+1} = C_o c_t / (c_∞^⊤ C_o c_t)
• Predict k steps into the future using:
  P(o^j_{t+k} | h_t) = c_∞^⊤ C_{o_j} (C_*)^{k−1} c_t,   where C_* = Σ_{o_i ∈ O} C_{o_i}
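The PSR-style filtering and prediction steps can be sketched generically (notation follows the standard PSR update; the one-dimensional i.i.d. toy model in the test is an illustrative assumption, not a learned CPSR):

```python
import numpy as np

def psr_filter(c, C_o, c_inf):
    """One PSR/CPSR state update after observing o:
    c_{t+1} = C_o c_t / (c_inf^T C_o c_t)."""
    num = C_o @ c
    return num / (c_inf @ num)

def predict_k_steps(c, C_obs, c_inf, o_j, k):
    """k-step prediction: P(o_j at t+k | h_t) = c_inf^T C_{o_j} C_*^{k-1} c_t,
    where C_* sums the per-observation operators."""
    C_star = sum(C_obs.values())
    v = np.linalg.matrix_power(C_star, k - 1) @ c
    return float(c_inf @ (C_obs[o_j] @ v))
```

For a proper model, c_∞^⊤ C_* acts as c_∞^⊤, so the one-step predictions over all observations sum to 1.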
Properties of CPSR
• Computationally efficient (the projection has low cost).
• A projection size of d = O(k log |Q|) suffices in most systems.
• Allows projection to subspaces of dimension d < rank(D); if d ≥ rank(D), the model is trivially consistent.
• Regularizes the solution (i.e., the learned model parameters).
• The total propagated error over T steps is bounded.
GridWorld prediction error

GridWorld: increased time-efficiency in small, simple systems.

[Figure: runtime in minutes (CPSR vs. TPSR, bar chart) and prediction error (0–0.016) vs. steps ahead to predict (0–11) for both models.]
Poc-Man prediction error

Poc-Man: better model quality in large, difficult systems. A partially observable variant of the Pac-Man video game with |S| = 10^56 and |O| = 2^10 [Silver and Veness, 2010].

[Figure: prediction error (0–0.3) vs. number of steps ahead to predict (0–21), comparing CPSR (all tests) with TPSR (single-length tests).]
Planning in CPSRs

• Point-based value iteration (traditional POMDP approach): the CPSR state vector plays the role of the POMDP belief state.
• Value-function approximation (common MDP approach): the CPSR state vector is the input to the value-function approximator.
The basic algorithm: point-based value iteration

Approach:
• Select a small set of belief points (b_0, b_1, b_2, …).
• Plan for those belief points only ⇒ learn the value and its gradient.
• Pick the action that maximizes value ⇒ V(b) = max_α α · b
⇒ Which beliefs are most important?

Selecting belief points
• Earlier strategies: fully observable states; points at regular intervals; random samples.
• Can we use theoretical properties to pick important beliefs?
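The alpha-vector maximization V(b) = max_α α · b can be sketched directly (a toy two-state example; the alpha vectors and action labels are illustrative):

```python
import numpy as np

def pbvi_value(b, alphas):
    """Value of belief b under a point-based value function:
    V(b) = max over alpha-vectors of alpha . b.
    `alphas` is a list of (alpha_vector, action) pairs; returns
    the best (value, action) pair."""
    vals = [(float(np.dot(a_vec, b)), act) for a_vec, act in alphas]
    return max(vals, key=lambda p: p[0])

b = np.array([0.7, 0.3])
alphas = [(np.array([1.0, 0.0]), "listen"),
          (np.array([0.0, 2.0]), "open")]
print(pbvi_value(b, alphas))  # → (0.7, 'listen')
```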
Poc-Man results
• S-PocMan: drop the parts of the observation vector that allow the agent to sense in which direction food lies.
[Figure 7: Average return per episode achieved in the PocMan (a) and S-PocMan (b) domains using different models and baselines (Hashed-250, Memoryless, Rademacher-250, Random Obs., Spherical-250). Compressed dimension sizes are listed next to the model names. 95% confidence interval error bars are shown.]
For both PocMan and S-PocMan, we set d0 = 25 and examined compressed dimensions in the range [250, 500] (selecting only the top performer via a validation set); no TPSR models were used on these domains, as their construction exhausted the memory capacity of the machine used. Following Veness et al. (2011), for these domains we use γ = 0.99999 as a discount factor.

Figure 7 details the performance of the CPSR algorithms on the PocMan and S-PocMan domains. In these domains, we see a much smaller performance gap between the CPSR approaches and the memoryless baseline. In fact, in the PocMan domain, the memoryless controller is the top performer. This demonstrates, first and foremost, that the PocMan domain is not strongly partially observable. Though the observations do not fully determine the agent's state, the immediate rewards available to an agent (with the exception of the reward for eating the power pill and catching a ghost) are discernible through the observation vector (e.g., the agent can see locally where food is). Thus, the memoryless controller is able to formulate successful plans despite the fact that it is treating the domain as if it were fully observable. Moreover, a qualitative comparison with the Monte-Carlo AIXI approximation (Veness et al., 2011) reveals that the quality of the memoryless controller's plans is actually quite good. In that work, they use a slightly different optimization criterion of optimizing for average transition reward, and with on the order of 50000 transitions they achieve an average transition reward in the range [−1, 1] (depending on parameter settings). With on the order of 250000 transitions they achieve an average transition reward in the range [1, 1.5]. In this work, the memoryless controller achieves an average transition reward of
Adaptive migratory bird management

IJCAI 2013 challenge problem (Nicol et al., 2013): observe bird population levels at different breeding sites over time, plus the protection level on intertidal sites along their migratory path.
Hamilton, Milani Fard and Pineau
[Figure 8: Average discounted reward per episode (≈9000–9800) achieved in the AMM domain using different methods (Hashed-10, Memoryless, Rademacher-50, Random Obs., Spherical-10) over 100000 test episodes (each of length 50). The numbers beside the CPSR method names denote the projected dimension size. 95% confidence intervals are too small to be visible.]
−0.2 (despite the fact that it is actually optimizing for average return per episode), and it is thus competitive given the same magnitude of samples, as approximately 50000 transitions were used in this work. It is also important to note that PSR-type models may be combined with memoryless controllers as memory PSRs (described in Section 7.2), and so it should be possible to boost the performance of the CPSR models to match that of the memoryless controller in that way.

Importantly, in S-PocMan, where part of the observation vector is dropped and the rewards are sparsified, we see that the top performer is again a CPSR-based model (which in this case uses spherical projections). This matches expectations, since the food rewards are no longer fully discernible from the observation vector, and thus the domain is significantly less observable. It is also worth noting that building naive TPSRs (without compression or domain-specific feature selection) is computationally infeasible in these PacMan-inspired domains, and thus the use of a PSR-based reinforcement learning agent (via the compression techniques used) in these domains is a considerable advancement.

A final observation is that performance is quite sensitive to the choice of projection matrices in these results. For example, in the S-PocMan domain, the Rademacher projections perform no better than the memoryless baseline, whereas in PocMan the Rademacher projections outperform the other projection methods. The exact cause of this performance change is unclear. Nevertheless, it highlights the importance of evaluating different projection techniques when applying this algorithm in practice.
Learning (more!) representations for RL

[Figure: mapping from the original state s to a learned representation φ(s), used as input to Q(φ(s), a).]
Deep Q-learning

Reinforcement learning + deep learning in the Arcade Learning Environment [Mnih et al., 2013; Guo et al., 2014] (www.arcadelearningenvironment.org). Many interesting open questions:
• Efficient exploration
• Temporally extended actions
• Transfer learning between games
• …
Research team @ McGill
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC), Canadian Institutes of Health Research (CIHR), National Institutes of Health (NIH), Fonds Québécois de Recherche Nature et Technologie (FQRNT)