Visual Sparse Bayesian Reinforcement Learning: A Framework for Interpreting What an Agent Has Learned
Indrajeet Mishra, Giang Dao, and Minwoo Lee
Department of Computer Science
University of North Carolina at Charlotte
{imishra, gdao, minwoo.lee}@uncc.edu
Abstract— This paper presents a Visual Sparse Bayesian Reinforcement Learning (V-SBRL) framework for recording images of the most important memories from past experience. The key idea is to maintain an image snapshot storage that helps us understand and analyze the learned policy. In this extension of the SBRL framework [1], the agent perceives the environment as image state inputs, encodes each image into a feature vector, trains the SBRL module, and stores the raw images. In this process, the snapshot storage keeps only the relevant memories that are important for future decisions and discards the unimportant ones. Visualizing the stored snapshot images enables us to understand the agent's learning process, and the snapshots also explain the exploited policy under different conditions. A navigation task with static obstacles is examined for snapshot analysis.
I. INTRODUCTION
There have been many recent developments that apply deep learning to reinforcement learning (RL) problems and drive innovation at the cutting edge of machine learning. Mnih et al. [2] proposed the deep Q-network (DQN), a function approximator for playing Atari games; DQN uses convolutional neural network (CNN) layers that receive video frames as state inputs to develop a human-level control policy. Silver et al. [3] suggested the deterministic policy gradient, an actor-critic algorithm for continuous action spaces in which both the actor and the critic are neural networks: the actor adjusts the policy in the direction of the action-value gradient, and the critic updates the action-value function. Mnih et al. [4] discuss an asynchronous variant of the actor-critic algorithm in which multiple agents are trained in parallel on multiple instances of the environment, which speeds up learning and achieved state-of-the-art performance on Atari games. Lillicrap et al. [5] present an actor-critic, model-free algorithm based on the deterministic policy gradient that operates over continuous action spaces; it uses a CNN to learn directly from raw image pixels and predicts the optimal action to take. These advancements have led to a large number of successful applications in motion planning [6], game playing [7], natural language processing [8], and self-driving cars [6].
Although deep reinforcement learning models have been highly successful at solving complex problems in many areas, their black-box nature makes it difficult to interpret what an agent learns. Yosinski et al. [9] pointed out that a major problem of deep learning is the lack of visualization tools for understanding the computation performed in the hidden layers. For this reason, many recent studies [10], [11], [12], [13], [14] have explored different types of visualization to examine the learning process of a deep learning model. However, these methods focus on visualizing how features are computed in the intermediate layers in supervised learning, so they are not enough to fully explain what a reinforcement learning agent has learned and what the learned knowledge represents. Although recent visualization efforts for reinforcement learning [15], [16] provide ways to interpret the learning of RL agents, there is still a need for tools that can explain the learning and exploitation process, enabling trustworthy and robust model construction through interpretation.
Recently, Lee [1] proposed a Sparse Bayesian Reinforcement Learning (SBRL) approach that memorizes past experiences during the training of a reinforcement learning agent for knowledge transfer [17] and continuous action search [18]. SBRL provides an efficient understanding of the state-action space by analyzing the relevant data samples after training, which can explain how learning has been influenced by the agent's past experience. Although SBRL can address the lack-of-interpretability issue in deep reinforcement learning, it requires handcrafted feature engineering for the state inputs, so it is not widely applicable, especially to vision-based tasks where the agent perceives the environment through vision sensors.
In this research, we extend SBRL to vision-based RL applications by creating an end-to-end model with a visual representation of experience, retrieving the significant snapshot images, and storing them in a snapshot storage. The proposed framework, named Visual SBRL (V-SBRL), addresses the lack of interpretability in various vision-based reinforcement learning problems and complements previous reinforcement learning visualization approaches. The V-SBRL agent maintains a snapshot storage of important experiences that can easily be visualized to understand what the agent remembers from the past and why it makes its decisions. For instance, if an agent has no relevant snapshot for a state, or for neighboring states, that requires significant decision making, the agent does not know how to behave properly in this new, unexperienced situation because of insufficient knowledge. This can be caused by lack of training, a poor training strategy, or poor modeling of the environment, and it can lead to mistakes or tragic failures. On the other hand, a well-established snapshot storage lets the agent develop an optimal policy that selects the right action at each time step, as explained in Section V.
The V-SBRL framework learns to remember important experiences and to discard trivial ones so as to maintain a sparse set of images. The framework contains a sparsity filter that sifts out only the relevant states. This helps tackle other reinforcement learning problems efficiently by lowering the computational cost, and the small set of images helps us analyze the learning process and its outcomes more efficiently. For this, the framework has three layers of sparsity control. First, the kernel function in the relevance vector machine function approximator can be adjusted to set the level of sparsity. Next, a sparsity filter applied after each mini-batch training of SBRL checks the validity of the trained model to determine the new snapshot candidates for the storage. Last, the storage filters out samples that are similar to existing snapshots by measuring similarity with the kernel function.
The main contributions of this paper are that 1) it extends the SBRL framework to vision-based problem domains; 2) for this, the framework adopts an autoencoder that encodes the raw image state inputs for faster training; 3) it stores the past experiences in an image snapshot storage, which is used to visualize what the agent has learned from its interaction with an environment; and 4) lastly, we present an interpretation that explains the learning process, the learned policies, and the learned behavior in different environments.
In Section II, we review the visualization methods that have been proposed to understand the computational processes of deep learning. Section III explains the different components of the framework, the State Image Encoder, SBRL, and the Snapshot Storage, and how they are used in the framework as a whole. Section IV discusses the results achieved by the framework on the navigation task and how the learning improves during training. Section V presents a detailed interpretation of the learned policy and how the collected snapshots affect the decisions made by the agent in different situations; it also discusses which experience plays an important role at any point in time. Finally, we conclude in Section VI.
II. RELATED WORK
Olah et al. [10] combine existing interpretation methods such as feature visualization [14] and attribution [19], [20] to show what a CNN detects (feature visualization) and to explain how the detected features affect the outputs (attribution). This facilitates understanding what a neural network has learned by visualizing the value computed by each neuron in all the network layers; for instance, the visualization shows that some neurons are responsible for detecting a cat's eye or a dog's ear.
Ribeiro et al. [11] explain the predictions of a learning model with an approach called LIME. This method focuses on finding an interpretable model that behaves approximately the same as the original model; that is, it seeks an interpretable representation of the prediction model. In addition, the authors provide a method, SP-LIME, that efficiently selects a set of instances representing the task domain and explains the predictions with the selected instances.
Selvaraju et al. [21] present a generic method to generate visual explanations for understanding the decisions made by a CNN. This method uses the gradient that flows through the last convolutional layer to create a localization map highlighting the important regions in the image, which helps in understanding the significance of the connections between the local receptive fields and the corresponding neurons.
For reinforcement learning interpretation, Zahavy et al. [22] propose a model that enables us to visualize the learning strategy of a DQN agent, using a Semi-Aggregated Markov Decision Process (SAMDP) to identify spatio-temporal abstractions and applying t-SNE to the neural activation data for a well-separated representation. This helps us understand groups of similar activations related to an action, but it requires handcrafted feature engineering.
To avoid such feature engineering, Greydanus et al. [16] propose a visualization strategy for understanding the policy learned by a reinforcement learning agent using a perturbation-based saliency method. It was applied to Atari games, where the authors observe the change in the behavior policy caused by changes in the input images in order to identify the salient pixel locations.
The methods proposed in [10], [11], [21] are more suitable for classification problems, and the reinforcement learning visualization methods [15], [16] are not sufficient to interpret the learning progress and the learned policy. V-SBRL proposes a complementary approach that visualizes what the agent has remembered from the past and how it helps the agent take actions in the future.
Our proposed framework is analogous to human learning and development. Humans often encounter situations that are similar to prior experience; instead of learning from scratch, we utilize our past experiences to adapt to or master new environments or tasks. Focusing on this knowledge retention step, the V-SBRL framework models human-like learning to help a reinforcement learning agent act the way humans do by remembering past experiences. It is very difficult for humans as well as machines to remember everything from the past because of limited memory capacity and computational/retrieval time, so we remember only the important experiences and use them in new, related tasks. Details of the proposed method are described in the following section.
III. METHODS
Fig. 1 illustrates the proposed Visual Sparse Bayesian
Reinforcement Learning (V-SBRL). It consists of three major
components. First, the State Image Encoder deals with the
high dimensionality of the image data. Second, the learning
workhorse, SBRL [1], trains with mini-batch samples to
estimate the Q values for the relevant states and action
inputs. Using relevance vector machines [23], it captures the significant samples (relevance vectors) to be examined as candidates for snapshot recording. Third, the Snapshot Storage memorizes the relevant state-action pairs from the candidates found in each episode of training. The storage keeps re-evaluating the importance of relevant states found in earlier episodes. The stored images are analyzed to enhance understanding of the learning experience.

Fig. 1: Visual Sparse Bayesian Reinforcement Learning Framework
A. State Image Encoder
The State Image Encoder (SIE) reduces the dimensionality of the state image representation to a 64-dimensional feature vector without losing any important information. The reduced dimensionality of the feature representation enables efficient computation when training V-SBRL. Table I presents the network architecture of the SIE with its encoder and decoder units. The SIE is a CNN-based autoencoder that uses 4 convolutional layers with 4 max-pooling layers and 1 fully connected layer for encoding, and a symmetric architecture consisting of 4 convolutional layers with 4 up-sampling layers and 1 fully connected layer for decoding. Turchenko et al. [24] presented a CNN-based autoencoder that used a similar configuration of convolution, pooling, and up-sampling layers and provided state-of-the-art performance. The SIE is pre-trained with a number of randomly generated images of environment states, using the same image as both the input to the encoder and the target output of the decoder. The difference between the pixel values of the original image and the reconstructed image forms the loss function used to train the SIE network. The SIE, pre-trained with these simulated samples, converts the image inputs into feature vector representations as in Fig. 3. The encoding appears to be lossless because of the simple environment setup; however, the framework does not make any assumption of or restriction to lossless encoding.
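For concreteness, a minimal Keras sketch of such an encoder/decoder pair is given below, following the layer output shapes reported in Table I. The kernel sizes, activations, and optimizer are our own assumptions, since the paper specifies only the layer types and shapes; this is not the authors' implementation.

    # Minimal sketch of the State Image Encoder (SIE) autoencoder.
    # Layer output shapes follow Table I; kernel sizes, activations, and the
    # optimizer are assumptions not specified in the paper.
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_sie():
        inp = keras.Input(shape=(32, 32, 3))
        x = inp
        for filters in (32, 64, 128, 256):        # 4 convolution + max-pooling blocks
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(2)(x)
        x = layers.Flatten()(x)                   # 2x2x256 -> 1x1024
        code = layers.Dense(64, name="feature")(x)  # 64-dimensional feature vector

        x = layers.Dense(1024, activation="relu")(code)
        x = layers.Reshape((2, 2, 256))(x)
        for filters in (256, 128, 64):            # up-sample back toward 32x32
            x = layers.UpSampling2D(2)(x)
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2)(x)
        out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)

        autoencoder = keras.Model(inp, out)       # trained on image reconstruction
        encoder = keras.Model(inp, code)          # used by V-SBRL after pre-training
        autoencoder.compile(optimizer="adam", loss="mse")
        return autoencoder, encoder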
B. Sparse Bayesian Reinforcement Learning (SBRL)
Sparse Bayesian Reinforcement Learning [1] is a learning framework that follows the human traits of decision making via knowledge acquisition and retention. It refers to the past experiences stored in the snapshot storage and, finding those similar to the current state, evaluates the value of actions to select one in a greedy manner.

TABLE I: SIE Architecture

Layer (Type)                  Output Shape
input                         32x32x3
Convolution                   32x32x32
Max-Pooling                   16x16x32
Convolution                   16x16x64
Max-Pooling                   8x8x64
Convolution                   8x8x128
Max-Pooling                   4x4x128
Convolution                   4x4x256
Max-Pooling                   2x2x256
Flatten                       1x1024
Fully Connected               1x64
Fully Connected               1x1024
Reshape                       2x2x256
Up-Sampling                   4x4x256
Convolution                   4x4x256
Up-Sampling                   8x8x256
Convolution                   8x8x128
Up-Sampling                   16x16x128
Convolution                   16x16x64
Up-Sampling                   32x32x64
Convolution                   32x32x3
Total number of parameters    1,481,219

It collects the
training samples and then identifies the important moments (state-action pairs), called snapshots, in each training episode. It adds the previously found relevant snapshots to the collected samples and re-evaluates their importance. SBRL uses the relevance vector machine (RVM) [23] to identify the relevant snapshots and compute their posterior distribution. SBRL transforms each state-action pair into kernel features for the nonlinear mapping. Since the state and action domains are orthogonal, the kernel function in Eq. (1) is defined as a product of separately defined Gaussian kernels $k_s$ for the state and $k_a$ for the action. The computed kernel
feature vector is denoted as φ:
$$\phi = k_s(s, s') \times k_a(a, a') \qquad (1)$$

where

$$k_s(s, s') = e^{-\gamma_s \|s - s'\|^2}, \qquad k_a(a, a') = e^{-\gamma_a \|a - a'\|^2}.$$

Here, the Gaussian kernel parameters $\gamma_s$ and $\gamma_a$ can be defined separately.
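As an illustration, the product kernel of Eq. (1) can be written in a few lines of Python; the kernel widths below are placeholder values, not the ones used in the paper.

    import numpy as np

    # Product Gaussian kernel of Eq. (1): a state kernel times an action kernel,
    # each with its own width parameter (gamma values here are placeholders).
    def product_kernel(s, s_prime, a, a_prime, gamma_s=1.0, gamma_a=1.0):
        k_s = np.exp(-gamma_s * np.sum((s - s_prime) ** 2))
        k_a = np.exp(-gamma_a * np.sum((a - a_prime) ** 2))
        return k_s * k_a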
SBRL [1] assumes that the target Q is a weighted sum of
the feature vectors Φw such that:
$$Q_{\mathrm{actual}} = Q_{\mathrm{prediction}} + \varepsilon = \Phi w + \varepsilon$$

where $\varepsilon$ is zero-mean Gaussian noise with variance $\sigma^2$. Tipping et al. [25] suggested the initial variance $\sigma^2 = 0.1 \times \mathrm{var}(Q_{\mathrm{actual}})$.
Let the vector $\alpha$, a set of hyper-parameters controlling the strength of the prior over the corresponding weights, be infinite except for one entry, which is initialized as

$$\alpha_i = \frac{\|\phi_i\|^2}{\|\phi_i^\top Q_{\mathrm{actual}}\|^2 / \|\phi_i\|^2 - \sigma^2}.$$

We compute the posterior mean and covariance of the weights $w$ with $A = \alpha I$:

$$\Sigma = (A + \sigma^{-2}\Phi^\top\Phi)^{-1} \quad \text{and} \quad \mu = \sigma^{-2}\Sigma\Phi^\top Q_{\mathrm{actual}}.$$
$\alpha_i$ (when $q_i^2 > s_i$) and the noise level $\sigma$ are updated as follows:

$$\alpha_i = \frac{s_i^2}{q_i^2 - s_i}, \qquad \sigma^2 = \frac{\|Q_{\mathrm{actual}} - Q_{\mathrm{prediction}}\|^2}{N - M + \sum_m \alpha_m \Sigma_{mm}}$$

where $s_i = \phi_i^\top C_{-i}^{-1}\phi_i$ and $q_i = \phi_i^\top C_{-i}^{-1} Q_{\mathrm{actual}}$, with $C_{-i}$ the covariance of the marginal likelihood without the contribution of the $i$-th sample, as defined in [25].
The algorithm re-computes $\Sigma$ and $\mu$ and repeats the above process until a convergence condition is met. The relevant state-action pairs can be traced back through the $\phi$ remaining in the model. The predicted Q-value for new data $x_{\mathrm{new}}$ is

$$Q^{\mathrm{new}}_{\mathrm{prediction}} = k(x_{\mathrm{base}}, x_{\mathrm{new}})\, w_{\mathrm{base}}$$

where $x_{\mathrm{base}}$ are the significant samples and $w_{\mathrm{base}}$ are the weights associated with those samples.
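To make the computation above concrete, the sketch below implements the posterior weight statistics and the prediction step in NumPy under the simplifying assumption of a fixed set of basis functions; the sequential adding, re-estimating, and pruning of basis functions from [25] is omitted, so this is not the full SBRL training loop.

    import numpy as np

    # Simplified sparse Bayesian regression step used by SBRL: given the kernel
    # design matrix Phi (one column per candidate snapshot), targets q_actual,
    # prior precisions alpha, and noise variance sigma2, compute the posterior
    # weight statistics. The incremental basis selection of [25] is omitted.
    def posterior_weights(Phi, q_actual, alpha, sigma2):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + Phi.T @ Phi / sigma2)   # posterior covariance
        mu = Sigma @ Phi.T @ q_actual / sigma2            # posterior mean
        return mu, Sigma

    # Prediction for new inputs: Q_prediction = k(x_base, x_new) w_base, where
    # K_new[i, j] = k(x_base_j, x_new_i) and w_base is the learned weight vector.
    def predict_q(K_new, w_base):
        return K_new @ w_base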
To make SBRL sparse, we have added a filter that checks whether the newly found relevant snapshots are good by comparing the mean squared error (MSE) between the values predicted with the new snapshots and the actual targets against the MSE between the mean of the targets and the actual targets. The snapshots are added to the storage if the prediction MSE is less than a threshold value (a user-selected parameter) times the MSE of the target mean, which leads to a sparser solution:
$$\frac{1}{m}\sum_{i=0}^{m}\big(Q_{\mathrm{actual}}(s_t, a_t) - Q_{\mathrm{prediction}}(s_t, a_t)\big)^2 < t_h \times \frac{1}{m}\sum_{i=0}^{m}\big(Q_{\mathrm{actual}}(s_t, a_t) - Q_{\mathrm{mean}}\big)^2 \qquad (2)$$

where $Q_{\mathrm{actual}}$ denotes the actual target Q-values, $Q_{\mathrm{prediction}}$ the predicted Q-values, $Q_{\mathrm{mean}}$ the mean of all the actual Q-values, $t_h$ the threshold value, and $m$ the number of samples.
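A sketch of this filter, assuming a hypothetical threshold value, is shown below.

    import numpy as np

    # Sparsity filter of Eq. (2): keep the newly found snapshots only if the
    # model built from them beats predicting the target mean by a factor th.
    # th = 0.5 is an illustrative value; the paper leaves it user-selected.
    def passes_mse_filter(q_actual, q_prediction, th=0.5):
        mse_model = np.mean((q_actual - q_prediction) ** 2)
        mse_mean = np.mean((q_actual - np.mean(q_actual)) ** 2)
        return mse_model < th * mse_mean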
Apart from the MSE filtering, we also reduce the value of the kernel parameters, which results in an increase in the variance of the Gaussian kernel and eventually increases the number of samples with high similarity measurements. As more similar samples are captured, the RVM and the Snapshot Storage can maintain a certain level of sparsity.
After the SBRL training is completed, the collected snapshots are used to predict the Q-value of any given state-action pair by applying the kernel to obtain the kernel features and then multiplying by the learned weights:

$$Q(s_t, a_t) = \sum_{i=0}^{n} k(x_i, x_t) \times w_i$$

where $x_t = [s_t, a_t]$ and $w_i$ denotes the weight of the $i$-th snapshot $x_i$.
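In code, this amounts to a kernel-weighted sum over the stored snapshots. The sketch below reuses the product kernel sketched earlier, assumes the storage holds (state, action, weight) triples, and shows the greedy action choice over the four moves; these representational choices are ours, not the paper's.

    import numpy as np

    # Snapshot-based Q estimate: a kernel-weighted sum over stored snapshots.
    # `storage` is assumed to be a list of (s_i, a_i, w_i) triples and `kernel`
    # the product Gaussian kernel sketched earlier.
    def q_value(s, a, storage, kernel):
        return sum(w_i * kernel(s, s_i, a, a_i) for s_i, a_i, w_i in storage)

    # Greedy exploitation over the four moves (0: up, 1: down, 2: left, 3: right).
    def greedy_action(s, storage, kernel):
        qs = [q_value(s, np.array([a]), storage, kernel) for a in range(4)]
        return int(np.argmax(qs))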
C. Snapshot Storage
The Snapshot Storage maintains the collection of all the snapshots (important state-action pairs) learned by SBRL during training. We add new snapshots from each episode to the storage after filtering and similarity checks. The snapshots in the storage are appended to the new training samples in each epoch of SBRL training so that we keep re-evaluating the relevance of snapshots found in earlier epochs, because many of the snapshots from the early stage of training can be highly irrelevant. Moreover, this process implements relevant experience replay [1].
To control the sparsity of V-SBRL, we use three methods; the sparse kernel and the filter were discussed in the previous section. The third and last sparsity control is applied when the Snapshot Storage encounters a snapshot candidate that is already in the storage or for which a similar snapshot exists. For an identical entry, the Snapshot Storage updates the corresponding weight using the following convex combination:
$$w_i = (1 - c)\times w_i^{\mathrm{old}} + c \times w_i^{\mathrm{new}},$$

where $w_i^{\mathrm{old}}$ is the previous weight of the $i$-th snapshot, $w_i^{\mathrm{new}}$ is the new estimate of that weight, and $c$ is a convex combination coefficient that controls the effect of the newly estimated weight on the old weight of the snapshot to be merged. For a highly similar (but not exactly matching) entry, it additionally weights the new estimates by kernel similarity:

$$w_i = (1 - c)\times w_i^{\mathrm{old}} + c \times \sum_j k(x_j, x_i)\times w_j^{\mathrm{new}},$$

where $w_j^{\mathrm{new}}$ are the weights of the snapshots $x_j$ that are similar to the current snapshot $x_i$.
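A sketch of these two update rules follows; the coefficient value and the way similar snapshots are gathered are assumptions, since the paper does not report them.

    # Weight updates used by the Snapshot Storage when a candidate matches an
    # existing snapshot. c is a convex-combination coefficient; the value used
    # in the paper is not reported, so 0.3 here is only illustrative.
    def merge_identical(w_old, w_new, c=0.3):
        return (1.0 - c) * w_old + c * w_new

    # For a highly similar (not identical) entry x_i, the new estimates of the
    # similar snapshots x_j are additionally weighted by kernel similarity.
    # `similar` is assumed to be a list of (x_j, w_j_new) pairs and `kernel`
    # a similarity function on state-action pairs.
    def merge_similar(w_old, x_i, similar, kernel, c=0.3):
        weighted_new = sum(kernel(x_j, x_i) * w_j_new for x_j, w_j_new in similar)
        return (1.0 - c) * w_old + c * weighted_new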
This extra filtering leads to sparser snapshot solutions and helps avoid having too many similar snapshots in the storage. At the end of SBRL training, the storage contains a manageable number of snapshots for analysis. The snapshot images are stored with the corresponding actions taken and weight distributions, and the magnitude of a weight reflects the significance of the captured moment. Thus, for analysis, we read the weight distributions and their relation to the learned policy.
IV. EXPERIMENTS
We have applied the V-SBRL framework to a navigation task with static obstacles. The environment contains 64 cell blocks that can be obstacles, a goal, or open spaces. Fig. 2 shows one of the randomly chosen navigation environments used for training, testing, and analysis. In this navigation environment, an agent interacts with the environment to learn the optimal policy, which helps the agent reach the goal as quickly as possible. There are four possible moves the agent can take: moving up, moving down, moving left, and moving right. The agent receives a positive reward of +30 if it reaches the goal and a negative reward of −5 when it hits an obstacle. A reward of −1 is given as the free-space navigation cost.
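A minimal sketch of this reward structure is given below; the grid representation, the handling of moves that leave the grid, and whether the agent stays in place after hitting an obstacle are our assumptions, not details reported in the paper.

    # Minimal sketch of the navigation environment's reward structure.
    # Cell labels, out-of-bounds handling, and staying in place after a
    # collision are illustrative assumptions.
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def step(grid, pos, action):
        r, c = pos
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < 8 and 0 <= nc < 8) or grid[nr][nc] == "obstacle":
            return pos, -5, False          # collision: stay in place, reward -5
        if grid[nr][nc] == "goal":
            return (nr, nc), 30, True      # reaching the goal ends the episode
        return (nr, nc), -1, False         # free-space navigation cost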
We run the experiment on the 8x8 cell navigation task
environment. We have created a visual representation of the
navigation task environment using the following notations:
• Each state of the navigation task environment (e.g., (0,0) or (5,3)) is represented by a 4x4x3 pixel block.
• All 4x4 pixels at that position have pixel values (0, 0, 255) if it is the current position of the agent.
• All 4x4 pixels at that position have pixel values (0, 255, 0) if it is the goal.
• All 4x4 pixels at that position have pixel values (255, 0, 0) if it contains an obstacle.
• All 4x4 pixels at that position have pixel values (255, 255, 255) if it is free space.
• Each state image is encoded into a 64-dimensional feature vector (see the rendering sketch below).

Fig. 2: A navigation task with static obstacles. The blue block represents the agent, the green block represents the goal, and the red blocks represent obstacles.
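The rendering rules above translate directly into a short helper; the cell labels are the same illustrative ones used in the environment sketch earlier, with the agent position passed separately.

    import numpy as np

    # Render the 8x8 grid as a 32x32x3 image: each cell becomes a 4x4x3 block
    # colored according to its content, matching the notation listed above.
    COLORS = {"agent": (0, 0, 255), "goal": (0, 255, 0),
              "obstacle": (255, 0, 0), "free": (255, 255, 255)}

    def render_state(grid, pos):
        image = np.zeros((32, 32, 3), dtype=np.uint8)
        for r in range(8):
            for c in range(8):
                label = "agent" if (r, c) == pos else grid[r][c]
                image[4 * r:4 * r + 4, 4 * c:4 * c + 4] = COLORS[label]
        return image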
Fig. 3: Encoding and decoding with State Image Encoder
Fig. 3 shows some samples of encoding and decoding from pre-training. The outputs of the encoder are decoded to recover the original state inputs. In the figure, the first column displays randomly selected original image inputs, the middle column displays the encoded features, and the last column shows the reconstructed, decoded images.
Fig. 4 shows the snapshots collected while training a V-SBRL agent. The curve represents the total reward earned by the agent in each episode. We observe that V-SBRL performs on par with one of the state-of-the-art algorithms, DQN. The images stored in the Snapshot Storage are shown in the plot, and the snapshot image set grows as the agent gains more experience. The agent initially has a very small number of image snapshots, which increases as the agent gains more experience in the environment. However, the set stops growing after about 250 epochs, as the agent has identified the important snapshots to remember and converges to near-optimal policies.
V. SNAPSHOT IMAGE ANALYSIS
We analyze both the training and testing stages to examine the efficacy of the recorded images in the Snapshot Storage. Fig. 5a shows the snapshot images just after the first epoch of training. Due to its limited experience, the agent has a small number of snapshots, and their weights are still premature (mostly negative, denoted by the red frames); they increase as training proceeds. At the end of training, the Snapshot Storage contains an increased number of snapshots with trained weights. The snapshots that are closely related to the important moments for optimal decisions have positive weights, and the other snapshots have negative or close-to-zero weights (Fig. 5b).
Fig. 6 presents the means and standard deviations of the
weights for the corresponding snapshots. This information
reveals how important each snapshot is, how closely it is
related to the learned optimal policy, and how confident
the agent is about the impact of each snapshot. The agent may remember the same state with different actions as separate snapshots, but the action most worth taking has the higher weight and is selected as the optimal action. On the other hand, snapshots with poor actions (i.e., hitting an obstacle or moving out of the working space) have negative weights, which warn the agent against, or prevent it from, taking those actions.
We observe that the weight variances of the snapshots that are part of the optimal trajectory from the training start location are very low. On the contrary, the snapshots that do not fall on the optimal trajectory have larger variances, which implies that the agent is not confident about those areas of the environment. This comes from a lack of exploration and thus a lack of experience. This analysis therefore gives us a hint for improving the exploration strategy or the experience replay to increase sampling in the under-sampled areas.
The Q-values estimated after training for all the states are presented in Fig. 7. The contour plot shows, for each state, the maximum Q-value over all the possible actions at that state. For a better understanding of how the snapshots affect the Q-value estimates, some of the snapshots are placed over the Q plot. They indicate that snapshots with higher weights cause the Q-values to be higher and snapshots with lower weights lead to small Q-values. The low variance of the snapshots in the high-Q-value region illustrates that the agent has explored this region enough and hence is certain about the weights it has learned, whereas the snapshots in unexplored territory have high variance. This plot gives a lucid understanding of how the learned policy is developed from the stored snapshots: the agent takes the actions that lead to higher Q-values, and hence the snapshots with high similarity and high weights contribute most to that decision making.
Fig. 8 shows the exploitation trajectories followed by the agent from start location [1,3] and from start location [0,0] (the same as in training), along with the snapshots at some of the locations where the agent made an important decision by changing its direction of motion.
Fig. 4: Snapshot recording while training: the curve shows the sum of rewards for V-SBRL and DQN during training, along with the snapshot storage at some epochs of V-SBRL. Each snapshot contains the image of the environment state and the action taken from that state (denoted by an arrow).
(a) Snapshots after 1 epoch

(b) Snapshots after 600 epochs

Fig. 5: Collected snapshots during training. A red boundary represents negative weights and a green boundary represents positive weights. The arrow represents the action taken in that state.
Fig. 6: The obtained snapshots after training with the corresponding weight distributions: the numbers below each snapshot represent the mean and standard deviation of the weight.
Fig. 7: Relevant snapshots over the Q contour plot: the plot also shows a few of the snapshots from the storage, indicating that the snapshots that are part of the optimal trajectory contribute positively to decision making and the agent is more confident in this learning, whereas the snapshots on non-optimal or unexplored paths do not make the same contribution.
The snapshots in Fig. 8a show an optimal trajectory followed by the agent. The three snapshots shown connected to the denoted state are the snapshots most relevant to the current state, as evaluated by the agent with the kernel function. Correctly capturing the most significant decisions at the right moments, with corresponding positive weights, leads the agent to follow the optimal path. On the other hand, Fig. 8b shows a non-optimal trajectory from a different start location (not the same as in training). The first row of snapshots does not contain proper decision records, which leads the agent to make a wrong decision at the first junction. At this first circled point in the trajectory, all the snapshots have very low similarity, and hence the wrong decision is expected. This tells us that training from the same starting point generates biased samples, which causes a lack of exploration away from the optimal path (the dashed line).
V-SBRL enables us to understand the policy learned by the agent and how the learning evolved over training. Having a tool to visualize the learned policy helps us better understand "why" and "how" the agent makes a decision. For instance, Fig. 8b shows a non-optimal trajectory followed by the agent, which can be attributed to the limited exploration of that region: the related snapshots bear low similarity with the state, showing that the agent does not have similar experience from which to make the right decision.
Though the proposed framework works well on this task, it collected 89 snapshots, which could be reduced further. This can be explained by the fact that the environment is discrete, so state-action pairs bear low similarity to one another, which results in a larger number of snapshots. V-SBRL will be more useful in a continuous state-action environment, which can lead to a sparser snapshot storage. Continuous domains produce a large number of similar state-action samples, which can easily be reduced to a single snapshot representation using the suggested sparsity controls, with further improvements possible in the future.

(a) Optimal path selection

(b) A wrong decision causes a suboptimal solution.

Fig. 8: Exploitation of the learned policy with different starting positions: the small images on the left are the snapshots for each circled moment. The numbers under each snapshot are weights and kernel similarity values.
When the quality of the snapshots is ensured, the recorded image snapshots can be a good knowledge representation for analyzing transfer learning. The agent will be able to adapt to similar task environments using previous experiences that may come from a similar but different task. For example, if we change the positions of some of the obstacles in this navigation task, the agent would still be able to find a solution if the relevant snapshots are well collected. The agent will easily recall and reference past experiences with high similarity, especially when the new task does not alter the source task to a very large extent (smooth shaping [26]).
VI. CONCLUSIONS
In this paper, we present a visualizable and explainable reinforcement learning framework, V-SBRL, which extends SBRL [1] with visual perception to store important past experiences as images. This helps us understand how the agent makes decisions based on previously memorized experiences. Finding the snapshots most similar to the given state, the agent retakes the remembered decisions of the same or similar situations (this action choice is determined through Q-function evaluation). The framework contains the State Image Encoder, a CNN autoencoder that reduces the dimensionality of the image inputs to a lower-dimensional feature vector for efficient computation. This automated feature mapping enables SBRL to capture the significant samples among the mini-batch samples in the reduced feature space. The Snapshot Storage stores the relevant experiences found in each episode, and at the same time it updates the weight of each snapshot if it changes in later episodes. We have proposed three techniques to control the sparsity of the snapshot storage: 1) evaluating snapshots using MSE to check whether they are good, 2) capturing similar samples efficiently via the choice of the kernel and its parameters, and 3) filtering out similar snapshots to avoid unnecessary entries in the storage. V-SBRL provides the small number of experiences an agent needs to remember in order to develop an optimal policy in subsequent tasks. This leads us to the next step of our research: further investigation of sparse module/kernel/filter design. Improved sparsity is expected to make the analysis easier and lower the computational complexity. To examine the sparsity and efficacy of the model, we will extend the experiments to complex problems in continuous state-action spaces (e.g., Five, Atari, or Flash games). We believe that V-SBRL would give more meaningful analysis based on a sparser set of snapshot images.
ACKNOWLEDGMENT
This work was supported, in part, by funds provided by the
University of North Carolina at Charlotte. The Titan Xp used
for this research was donated by the NVIDIA Corporation.
REFERENCES
[1] M. Lee, "Sparse Bayesian reinforcement learning," Ph.D. dissertation, Colorado State University, 2017.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[3] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
[4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016, pp. 1928–1937.
[5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," CoRR, vol. abs/1509.02971, 2015. [Online]. Available: http://arxiv.org/abs/1509.02971
[6] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
[8] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky, "Deep reinforcement learning for dialogue generation," CoRR, vol. abs/1606.01541, 2016. [Online]. Available: http://arxiv.org/abs/1606.01541
[9] J. Yosinski, J. Clune, A. M. Nguyen, T. J. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," CoRR, vol. abs/1506.06579, 2015. [Online]. Available: http://arxiv.org/abs/1506.06579
[10] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev, "The building blocks of interpretability," Distill, 2018, https://distill.pub/2018/building-blocks.
[11] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144.
[12] H. Li, Z. Xu, G. Taylor, and T. Goldstein, "Visualizing the loss landscape of neural nets," CoRR, vol. abs/1712.09913, 2017. [Online]. Available: http://arxiv.org/abs/1712.09913
[13] G. Montavon, W. Samek, and K. Muller, "Methods for interpreting and understanding deep neural networks," CoRR, vol. abs/1706.07979, 2017. [Online]. Available: http://arxiv.org/abs/1706.07979
[14] C. Olah, A. Mordvintsev, and L. Schubert, "Feature visualization," Distill, 2017, https://distill.pub/2017/feature-visualization.
[15] T. Zahavy and N. Baram, "Graying the black box: Understanding DQNs," 2017. [Online]. Available: https://arxiv.org/pdf/1602.02658.pdf
[16] S. Greydanus, A. Koul, J. Dodge, and A. Fern, "Visualizing and understanding Atari agents," arXiv preprint arXiv:1711.00138, 2017.
[17] M. Lee and C. W. Anderson, "Can a reinforcement learning agent practice before it starts learning?" in International Joint Conference on Neural Networks (IJCNN), 2017, pp. 4006–4013.
[18] M. Lee and C. W. Anderson, "Relevance vector sampling for reinforcement learning in continuous action space," in Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp. 774–779.
[19] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
[20] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[21] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, "Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization," CoRR, vol. abs/1610.02391, 2016.
[22] T. Zahavy, N. Ben-Zrihem, and S. Mannor, "Graying the black box: Understanding DQNs," CoRR, vol. abs/1602.02658, 2016. [Online]. Available: http://arxiv.org/abs/1602.02658
[23] M. E. Tipping, "The relevance vector machine," in Advances in Neural Information Processing Systems. MIT Press, 2000, vol. 12, pp. 652–658. [Online]. Available: http://papers.nips.cc/paper/1719-the-relevance-vector-machine.pdf
[24] V. Turchenko, E. Chalmers, and A. Luczak, "A deep convolutional auto-encoder with pooling-unpooling layers in Caffe," arXiv preprint arXiv:1701.04949, 2017.
[25] M. E. Tipping, A. C. Faul et al., "Fast marginal likelihood maximisation for sparse Bayesian models," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2003.
[26] T. Erez and W. D. Smart, "What does shaping mean for computational reinforcement learning?" in 7th IEEE International Conference on Development and Learning (ICDL), 2008, pp. 215–219.