
Spatio-Temporal Abstractions in Reinforcement Learning Through Neural Encoding


Nir Baram 1 [email protected]

Tom Zahavy 1 [email protected]

Shie Mannor [email protected]

Technion - Israel Institute of Technology, Haifa, Israel

Abstract

Recent progress in the field of Reinforcement Learning (RL) has made it possible to tackle bigger and more challenging tasks. However, the increasing complexity of the problems, as well as the use of more sophisticated models such as Deep Neural Networks (DNNs), impedes the understanding of artificial agents' behavior. In this work, we present the Semi-Aggregated Markov Decision Process (SAMDP) model. The purpose of SAMDP modeling is to describe and allow a better understanding of complex behaviors by identifying temporal and spatial abstractions. In contrast to other modeling approaches, the SAMDP is built in a transformed state space that encodes the dynamics of the problem. We show that working with the right state representation mitigates the problem of finding spatial and temporal abstractions. We describe the process of building the SAMDP model from observed trajectories and give examples of using it on a toy problem and on complicated DQN policies. Finally, we show how the SAMDP can be used to monitor the policy at hand and make it more robust.

Keywords: Interpretability, Neural Networks, SMDP

1. Introduction

Policy interpretation is the ability to reason about the behavior of agents. Understanding the mechanisms that govern an agent's choice of actions can help to improve it, and is mandatory before adopting it in autonomous systems. In the early days of RL, understanding the behavior of agents was rather easy (Sutton, 1990); researchers focused on simpler problems (Peng and Williams, 1993), and policies were built using lighter models (Tesauro, 1994). An effective analysis was possible even when working with the original state representation and reasoning about primitive actions and individual states. In recent years, research has made a huge step forward. The RL community now tackles bigger and more challenging problems (Silver et al., 2016), and fancier models such as DNNs have become a commodity (Mnih et al., 2015). As a result, the need for interpretability has soared. Understanding a policy modeled by a DNN, either graphically using the state-action diagram or in any other way, is practically impossible: problems consist of an immense number of states, and policies often rely on skills (Mankowitz et al., 2014), creating more than a single level of planning. The resulting Markov reward process induced by such policies is too complicated to comprehend through observation. Simplifying the behavior requires finding a suitable representation of the state space, a long-standing problem in machine learning where extensive research has been conducted (Boutilier et al., 1999). In the field of RL, this difficulty is exacerbated even more since a task also includes a temporal dimension, and a good state representation needs to account for the dynamics of the problem as well.

1. These authors contributed equally.


Representation learning aims at learning better representations (Ng, 2011; Lee et al., 2006). Such algorithms are used to transform the state space, φ : s → s′, usually as a pre-processing stage before performing some prediction. DNNs are very useful in this context since they automatically build a hierarchy of representations with an increasing level of abstraction along the layers. In this work, we show that the state representation learned automatically by DNNs is suitable for building abstractions in RL. To this end, we introduce the SAMDP model, a modeling approach that creates abstractions both in space and time. In contrast to other modeling approaches, the SAMDP is built in the transformed state space, where the problem of creating spatial abstractions (i.e., state aggregation) and temporal abstractions (i.e., identifying skills) is facilitated using spatiotemporal clustering.

2. The Semi-Aggregated MDP

Reinforcement Learning problems are typically modeled using the MDP formulation. Given an MDP, a variety of algorithms have been proposed to find an optimal policy. However, when one wishes to analyze a policy, MDP models may not be the best modeling choice due to the size of the state space and the length of the planning horizon. In this section, we present the SMDP and AMDP models, which simplify the analysis by using temporal and spatial abstractions respectively. We also propose a new model, the SAMDP, that combines these models in a novel way and allows analysis using spatiotemporal abstractions. The SMDP (Sutton et al., 1999) simplifies the analysis by using temporal abstractions. The model extends the MDP action space A and allows the agent to plan with temporally extended actions Σ (i.e., skills). Analyzing policies using the SMDP model shortens the planning horizon and simplifies the analysis. However, there are two problems with this approach. First, one must still consider the high complexity of the state space, and second, the SMDP model requires identifying a set of skills, a challenging problem with no easy solution (Mankowitz et al., 2016). Another approach is to analyze a policy using spatial abstractions in the state space. If there is reason to believe that groups of states share common attributes, such as a similar policy or value function, it is possible to use State Aggregation (Moore, 1991). State Aggregation is a well-studied problem that typically involves identifying clusters as the new states of an Aggregated MDP (AMDP), where the set of clusters C replaces the MDP states S. Applying RL on aggregated states is potentially advantageous since the dimensions of the transition matrix P, the reward signal R, and the policy π are decreased, e.g., Singh et al. (1995).

Figure 1: Left: Illustration of state aggregation and skills (clusters C1, C2, C3; skills σ1,2, σ1,3 defined over states and actions). Right: Analysis approaches arranged by spatial and temporal hierarchy: MDP ⟨S, A, P_S^A, R_S^A, γ⟩; skill identification yields the SMDP ⟨S, Σ, P_S^Σ, R_S^Σ, γ⟩; state aggregation, performed in the transformed space φ(S), yields the AMDP ⟨C, A, P_C^A, R_C^A, γ⟩; combining both yields the SAMDP ⟨C, Σ, P_C^Σ, R_C^Σ, γ⟩.


However, the AMDP modeling approach has two drawbacks. First, the action space A is not modified and therefore the planning horizon remains intractable, and second, AMDPs are not necessarily Markovian (Bai et al., 2016).

In this paper, we propose a model we denote as SAMDP that combines the advantages of the SMDP and AMDP approaches. Under SAMDP modeling, aggregation defines both the states and the set of skills, allowing analysis with spatiotemporal abstractions (both the state-space dimensions and the planning horizon are reduced). However, SAMDPs are still not necessarily Markovian. We summarize the different modeling approaches in Figure 1. We now describe the five stages of building an SAMDP:

(0) Feature selection. We define the mapping from MDP states to features by a mapping function φ : s → s′ ⊂ R^m. The features may be raw (e.g., spatial coordinates, frame pixels) or higher-level abstractions (e.g., the last hidden layer of a NN). The feature representation has a significant effect on the quality of the resulting SAMDP model, and vice versa: a good model can point out a good feature representation.

(1) Aggregation via spatio-temporal clustering. The goal of aggregation is to find a mapping (clustering) from the MDP feature space S′ ⊂ R^m to the AMDP state space C. Clustering algorithms assume that the data is drawn from an i.i.d. distribution. However, in our problem the data is generated from an MDP, which violates this assumption. We alleviate this problem using two different approaches. First, we decouple the clustering step from the SAMDP model by creating an ensemble of clustering candidates and building an SAMDP model for each (stage 4). Second, we introduce a novel extension of the celebrated K-means algorithm (MacQueen et al., 1967) which enforces temporal coherency along trajectories. In the vanilla K-means algorithm, a point x_t is assigned to cluster c_i with mean μ_i if μ_i is the closest cluster center to x_t. We modify this step as follows:

c(x_t) = { c_i : ‖X_t − μ_i‖²_F ≤ ‖X_t − μ_j‖²_F, ∀j ∈ [1, K] }.

Here, ‖·‖_F stands for the Frobenius norm, K is the number of clusters, t is the time index of x_t, and X_t is a set of 2w + 1 points centered at x_t from the same trajectory: {x_j ∈ X_t ⟺ j ∈ [t − w, t + w]}. The dimensions of μ_i correspond to a single point, but it is expanded (broadcast) to the dimensions of X_t. In this way, a point x_t is assigned to a cluster c_i only if its neighbors along the trajectory are also close to μ_i, thus enforcing temporal coherency (a short code sketch of this assignment step is given after stage (3) below).

(2) Skill identification. We define an SAMDP skill σ_{i,j} ∈ Σ uniquely by a single initiation state c_i ∈ C and a single termination state c_j ∈ C: σ_{i,j} = ⟨c_i, π_{i,j}, c_j⟩. More formally, at time t the agent enters an AMDP state c_i at an MDP state s_t ∈ c_i. It chooses a skill according to its SAMDP policy and follows the skill policy π_{i,j} for k time steps until it reaches a state s_{t+k} ∈ c_j, such that i ≠ j. We define neither the skill length k a priori nor the skill policy; instead, we infer the skill length from the data. As for the skill policies, our model does not define them explicitly, but we will observe later that it successfully identifies skills that are localized in time and space.

(3) Inference. Given the SAMDP states and skills, we infer the skill length, the SAMDP reward, and the SAMDP probability transition matrix from observations. The skill length k is inferred for skill σ_{i,j} by averaging the number of MDP states visited from entering SAMDP state c_i until transiting to SAMDP state c_j. The skill reward is given by R^σ_s = E[r^σ_s] = E[r_{t+1} + γ r_{t+2} + · · · + γ^{k−1} r_{t+k} | s_t = s, σ].
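The windowed assignment step in stage (1) can be sketched as follows. This is a minimal illustration, assuming each trajectory is given as an array of feature vectors; the function and argument names are illustrative, not the exact implementation used here.

```python
import numpy as np

def assign_temporal_kmeans(trajectory, centroids, w=2):
    """Assign each point x_t to the cluster whose mean mu_i is closest to the
    whole window X_t = {x_{t-w}, ..., x_{t+w}} in squared Frobenius norm,
    i.e. the temporally coherent variant of the K-means assignment step.

    trajectory: (T, m) array of feature vectors from a single trajectory.
    centroids:  (K, m) array of current cluster means.
    Returns an integer array of length T with the cluster index of each point.
    """
    T = trajectory.shape[0]
    labels = np.empty(T, dtype=int)
    for t in range(T):
        lo, hi = max(0, t - w), min(T, t + w + 1)   # truncate window at trajectory ends
        window = trajectory[lo:hi]                   # X_t, shape (|X_t|, m)
        # ||X_t - mu_i||_F^2, with mu_i broadcast to the window's shape
        dists = ((window[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=(1, 2))
        labels[t] = int(np.argmin(dists))
    return labels
```

The rest of the algorithm is unchanged: the assignment above alternates with the usual centroid update (the mean of the points assigned to each cluster) until convergence.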


Figure 2: State-action diagrams for a gridworld problem: (a) MDP, (b) SMDP, (c) AMDP, (d) SAMDP. Clusters are found after transforming the state space. Intra-cluster transitions (dashed arrows) explain the skills, while inter-cluster transitions (big red arrows) explain the governing policy.

The inference of the SAMDP transition matrices is a bit more puzzling, since the probability of seeing the next SAMDP state depends both on the MDP dynamics and on the agent's policy in the MDP state space. Our goal is to infer two quantities: (a) The SAMDP transition probability matrices P^Σ: P^σ_{i,j} = Pr(c_j | c_i, σ), which measure the probability of moving from state c_i to c_j given that skill σ is chosen. These matrices are defined uniquely by our definition of skills as deterministic probability matrices. (b) The probability of moving from state c_i to c_j given that the skill is chosen according to the agent's SAMDP policy: P^π_{i,j} = Pr(c_j | c_i, σ = π(c_i)). This quantity involves both the SAMDP transition probability matrices and the agent's policy. However, since the SAMDP transition probability matrices are deterministic, it is equivalent to the agent's policy in the SAMDP state space. Therefore, by inferring transitions between SAMDP states, we directly infer the agent's SAMDP policy. Given an MDP with a deterministic environment and an agent with a nearly deterministic MDP policy (e.g., a deterministic policy that uses ε-greedy exploration with ε ≪ 1), it is intuitive to assume that we would observe a nearly deterministic SAMDP policy. However, two mechanisms cause stochasticity of the SAMDP policy: (1) stochasticity that is accumulated along skill trajectories, and (2) approximation errors in the aggregation process. A given SAMDP state may contain more than one "real" state and therefore more than one skill. Performing inference in this setup, we might observe a stochastic policy that chooses randomly between skills. Therefore, it is very likely to infer a stochastic SAMDP transition matrix, even though the SAMDP transition probability matrices and the MDP environment are deterministic and the MDP policy is nearly deterministic.

(4) Model selection. There are two reasons to choose between multiple SAMDPs. First, there are different hyperparameters to tune, and second, there is randomness in the aggregation step. Hence, clustering multiple times and picking the best result will potentially yield better models. We developed evaluation criteria that allow us to select the best model. Motivated by Hallak et al. (2013), we define the Value Mean Square Error (VMSE), which measures the consistency of the model with the observations. The estimator is given by VMSE = ‖v − v_SAMDP‖ / ‖v‖, where v stands for the SAMDP value function of the given policy, and v_SAMDP is given by v_SAMDP = (I − γ^k P)^{-1} r, where P is measured under the SAMDP policy.
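The inference and model-selection computations above can be illustrated with a short sketch. It assumes clustered trajectories are given as sequences of SAMDP state indices and uses a single mean skill length for discounting; the names and smoothing choices are illustrative rather than the exact procedure used in the paper.

```python
import numpy as np

def infer_samdp(cluster_seqs, num_clusters):
    """Stage (3): infer the SAMDP policy transition matrix and mean skill length.

    cluster_seqs: list of sequences of SAMDP state indices, one entry per MDP step.
    Returns (P, k_mean): row-normalized transition counts between SAMDP states,
    and the average number of MDP steps spent inside a skill.
    """
    counts = np.zeros((num_clusters, num_clusters))
    total_len, total_skills = 0, 0
    for seq in cluster_seqs:
        start = 0
        for t in range(1, len(seq)):
            if seq[t] != seq[start]:              # the current skill terminated here
                counts[seq[start], seq[t]] += 1
                total_len += t - start
                total_skills += 1
                start = t
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    k_mean = total_len / max(total_skills, 1)
    return P, k_mean

def samdp_value(P, r, k_mean, gamma=0.99):
    """Solve v = r + gamma^k * P v, i.e. v = (I - gamma^k P)^{-1} r."""
    C = P.shape[0]
    return np.linalg.solve(np.eye(C) - (gamma ** k_mean) * P, r)

def vmse(v_ref, v_samdp):
    """Stage (4) criterion: VMSE = ||v - v_SAMDP|| / ||v||."""
    return np.linalg.norm(v_ref - v_samdp) / np.linalg.norm(v_ref)
```

Here v_ref would be the per-cluster value estimated directly from the agent (e.g., averaged DQN values, as in Section 4), and r the inferred skill reward per SAMDP state.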


Figure 3: SAMDP visualization for Breakout. SAMDP states are visualized by their mean state (with frame pixels) at the mean t-SNE coordinate, and the numbers above the edges correspond to the inferred SAMDP policy.

3. SAMDP for Gridworld

We first illustrate the advantages of the SAMDP in a basic gridworld problem (Figure 2). In this task, an agent is placed at the origin (marked by X), and its goal is to reach the green ball and return. The state s ∈ R³ is given by s = {x, y, b}, where (x, y) are the coordinates and b ∈ {0, 1} indicates whether the agent has already reached the ball. The policy is trained to find skills following Mankowitz et al. (2014). We are given trajectories of the trained agent and wish to analyze its behavior by building the state-action graph for all four modeling approaches. For clarity, we plot the graphs on the maze using the coordinates of the state. The MDP graph (Figure 2(a)) consists of a vast number of states, and it is difficult to understand what skills the agent is using. In the SMDP graph (Figure 2(b)), the number of states remains high; however, coloring the edges by the skills helps to understand the agent's behavior. Unfortunately, producing this graph is seldom possible because we rarely receive information about the skills. On the other hand, abstracting the state space can be done more easily using state aggregation. However, in the AMDP graph (Figure 2(c)), the clusters are not aligned with the played skills because the routes leading to and from the ball overlap. For building the SAMDP model (Figure 2(d)), we transform the state space in a way that disentangles the routes:

φ(x, y) = (x, y) if b = 0, and φ(x, y) = (2L − x, y) if b = 1,

where L is the maze width. The transformation φ flips and translates the states where b = 1. Now that the routes to and from the ball are disentangled, the clusters are aligned with the skills. Understanding the behavior of the agent is now possible by examining inter-cluster and intra-cluster transitions.
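For concreteness, a one-line sketch of this transformation (the function name and the explicit maze-width argument are illustrative):

```python
def phi(x, y, b, L):
    """Disentangle outbound and return routes: states visited after reaching
    the ball (b == 1) are mirrored about the maze width L."""
    return (x, y) if b == 0 else (2 * L - x, y)
```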


4. SAMDPs for DQNs

We evaluate a pre-trained DQN agent for multiple trajectories with an ε-greedy policy on three Atari 2600 games: Pacman, Seaquest, and Breakout. During evaluation we record the neural activations of the last hidden layer and the DQN value estimates for 120k game states. We then apply t-SNE to the neural activation data and use the two t-SNE coordinates and a value coordinate as the MDP state features, where each coordinate is normalized to have zero mean and unit variance. Examining the resulting SAMDP (Figure 3), it is interesting to note the sparsity of transitions, which implies low entropy. Inspecting the mean image of each cluster reveals insights about the nature of the skills hiding within and uncovers the policy hierarchy, as described in Zahavy et al. (2016). The agent begins to play in low-value (blue) clusters (e.g., 1, 5, 8, 9, 13, 16, 18, 19). These clusters are well connected among themselves and are disconnected from the others. Once the agent transitions to the "tunnel-digging" option in clusters 4, 12, and 14, it stays there until it finishes carving the tunnel, and then it transitions to cluster 11. From cluster 11 the agent progresses through the "left banana" and hops between clusters 2, 21, 5, 10, 0, 7 and 3, in that order.
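A minimal sketch of this feature-construction step, assuming the last-hidden-layer activations and the DQN value estimates have already been recorded; the scikit-learn t-SNE call and its parameters are illustrative choices, not necessarily the exact settings used here:

```python
import numpy as np
from sklearn.manifold import TSNE

def build_samdp_features(activations, values, perplexity=30, seed=0):
    """activations: (N, d) last-hidden-layer activations recorded during evaluation.
    values:      (N,)  DQN value estimates for the same states.
    Returns an (N, 3) feature matrix (two t-SNE coordinates plus a value
    coordinate), each column normalized to zero mean and unit variance."""
    tsne_xy = TSNE(n_components=2, perplexity=perplexity,
                   random_state=seed).fit_transform(activations)
    feats = np.column_stack([tsne_xy, values])
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)
```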

Figure 4: VMSE for Breakout. Panels (top to bottom): DQN value vs. SAMDP value per cluster index; correlation between the greedy policy and the trajectory reward per cluster index; greedy policy weight in good/bad (high-/low-reward) trajectories as a function of the percentage of extremum trajectories used.

Model Evaluation: We measure the VMSE criterion (Figure 4, top), as defined in Section 2. We infer v by averaging the DQN value estimates in each cluster, v_DQN(c_j) = (1/|C_j|) Σ_{i: s_i ∈ c_j} v_DQN(s_i), and evaluate v_SAMDP as defined above. Since the Atari environment is deterministic, the v_SAMDP estimate is accurate with respect to the DQN policy. Therefore, the VMSE criterion measures how well the SAMDP model approximates the true MDP. We observe that the DQN values and the SAMDP values are very similar, indicating that the SAMDP model fits the data well.

Eject Button: The motivation for this experiment stems from the idea of shared autonomy (Pitzer et al., 2011), which allows an operator to intervene in the decision loop at critical times. In the following experiment, we show how the SAMDP model can help to identify where the agent's behavior deteriorates. Setup: (a) Evaluate a DQN agent, create a trajectory data set, and evaluate the features for each state (stage 0). (b) Divide the data into two groups, train (100 trajectories) and test (60), then build an SAMDP model (stages 1-4) using the train data. (c) Split the train data into the k top and bottom rewarded trajectories, T+ and T−, and re-infer the model parameters separately for each (stage 3). (d) Project the test data onto the SAMDP model (mapping each state to the nearest SAMDP state). (e) Eject when the transitions of the agent are more likely under T− than under T+. (f) Average the trajectory reward on (i) the entire test set and (ii) the subset of un-ejected trajectories. We measure 36%, 20%, and 4.7% performance gains for Breakout, Seaquest, and Pacman, respectively. The eject experiment indicates that the SAMDP model can be used to make a given DQN policy more robust by identifying when the agent is not going to perform well and returning control to a human operator or some other AI agent.


5. Discussion

SAMDP modeling offers a way to present a trained policy concisely by creating abstractions that relate to the spatiotemporal structure of the problem. We showed that, using the right representation, time-aware state aggregation can be used to identify skills. This implies that the crucial step in building an SAMDP is the state aggregation phase. The aggregation depends on the state features and the clustering algorithm at hand. In this work, we presented a basic K-means variant that uses temporal information; however, other clustering approaches are possible. In future work we will consider other clustering methods, such as spectral clustering, that better relate to topology. Regarding the state features, in the DQN example we used the 2D t-SNE map. This map, however, is built under the i.i.d. assumption, which overlooks the temporal dimension of the problem. An interesting line of future work would be to modify the t-SNE algorithm to take into account temporal distances as well as spatial ones.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Program (FP/2007-2013) / ERC Grant Agreement n. 306638.

References

Aijun Bai, Siddharth Srivastava, and Stuart Russell. Markovian state and action abstractions for MDPs via hierarchical MCTS. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, 2016.

Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.

Assaf Hallak, Dotan Di-Castro, and Shie Mannor. Model selection in Markovian processes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.

Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801-808, 2006.

James MacQueen et al. Some methods for classification and analysis of multivariate observations. 1967.

Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. Time regularized interrupting options. International Conference on Machine Learning, 2014.

Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. Adaptive skills, adaptive partitions (ASAP). Advances in Neural Information Processing Systems (NIPS-30), 2016.


Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.

Andrew Moore. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Conference. Morgan Kaufmann, June 1991.

Andrew Ng. Sparse autoencoder. CS294A Lecture Notes, 72:1-19, 2011.

Jing Peng and Ronald J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437-454, 1993.

Benjamin Pitzer, Michael Styer, Christian Bersch, Charles DuHadway, and Jan Becker. Towards perceptual shared autonomy for robotic mobile manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2011.

David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-503, 2016.

Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems, pages 361-368, 1995.

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224. Morgan Kaufmann, 1990.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181-211, 1999.

Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215-219, 1994.

Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Tom Zahavy, Nir Ben Zrihem, and Shie Mannor. Graying the black box: Understanding DQNs. Proceedings of the 33rd International Conference on Machine Learning (ICML-16), JMLR volume 48, 2016.
