A FRAMEWORK FOR PATTERN CONSOLIDATION IN COGNITIVE ARCHITECTURES
By
EWALDO EDER CARVALHO SANTANA JR.
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2017
© 2017 Ewaldo Eder Carvalho Santana Jr.
To the most lovely of all, my mother Oxum
ACKNOWLEDGMENTS
I thank my advisor Jose C. Príncipe for giving me the opportunity to fulfill my dream
of becoming the very best, the best there ever was. I also thank the University of Florida
for the graduate scholarship.
I am also very thankful to my friends in CNEL. I especially thank my homies Ryan,
Evan, Matt, Goktug, Mihael and Austin for the friendship, support and occasional
babysitting.
Most importantly, I thank my family for the love and for allowing me to stay so long
abroad. Inez, Lucas, Livia and Zion, you are the most important people in the world. I
love you!
Lastly, I would not be able to pull off any science without the ever-constant love
of God. Thanks to my mother Oxum and my guides da Ilha, D. Maria and all the
anonymous supporters.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Deep Feedforward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Faster Computers and Big Data . . . . . . . . . . . . . . . . . . . . . 20
2.1.2 Elaborate Initialization Techniques and Learning Algorithms . . . . . 21
2.1.3 Task Specific Architectures . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Temporal Processing with Neural Networks . . . . . . . . . . . . . . . . . . 25
2.3 Deep Predictive Coding Networks . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Combining CNNs and RNNs: Convolutional Recurrent Neural Networks . . 36
2.7 Content Addressable Memories . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 A FRAMEWORK FOR DYNAMIC ADDRESSABLE MEMORIES . . . . . . . . 42
3.1 Memory Reading and Writing in Recurrent Neural Networks . . . . . . . . . 45
3.1.1 Type I: DiffRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.2 Type II: CAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Content Addressable Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Differentiable Random Access Memory . . . . . . . . . . . . . . . . . . . . 51
3.4 Hybrid Access Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.1 Initialization and Algorithmic Choices . . . . . . . . . . . . . . . . . . 57
3.5.2 Adding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.3 Copy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.4 Sequence Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 ADDRESSABLE MEMORIES AS PART OF A DIFFERENTIABLE GRAPHICS PIPELINE FOR VIDEO PREDICTION . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 On the Need of a Differentiable Computer Graphics Pipeline . . . . . . . . . 65
4.2 A 2D Statistical Graphics Pipeline . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Preliminary Considerations and Relevant Literature Review . . . . 69
4.2.2 Variational Autoencoding Bayes . . . . . . . . . . . . . . . . . . . . . 72
4.2.3 Proposed Statistical Framework . . . . . . . . . . . . . . . . . . . . . 76
4.3 Perception Updating Networks . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Bouncing Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.2 Moving MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.3 Visualizing the RNN-to-CAM Connections . . . . . . . . . . . . . . . 88
4.4.4 Snapshotting “What” Directly from Pixels . . . . . . . . . . . . . . . . 89
4.5 Rules of Thumb for Model Choice . . . . . . . . . . . . . . . . . . . . . . 92
5 SCALING UP PERCEPTION UPDATING NETWORKS . . . . . . . . . . . . . 94
5.1 Convolutional Recurrent Neural Networks for Unsupervised Learning of Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 ConvRNN + PUN: Combining Convolutional RNNs and Perception Updating Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.1 Moving MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Real Videos: Kitti Dataset . . . . . . . . . . . . . . . . . . . . . . . . 106
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
3-1 Adding Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-2 Copy Problem: Percentage of correctly copied bits. . . . . . . . . . . . . . . . . 61
3-3 Sequence generation cost function (negative log likelihood, NLL) on the test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3-4 Classification accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4-1 Comparison between Snapshot PUN and conv PUN on the single digit moving MNIST benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5-1 Hyperparameter choices per experiment . . . . . . . . . . . . . . . . . . . . . . 99
5-2 Recognition rate (in percentage %) for object recognition in Coil-100 dataset . 101
5-3 Recognition rate (in percentage %) for face recognition in Honda/UCSD dataset 101
5-4 Experiments with hidden PUN. Average negative log-likelihoods (nats) on videoprediction experiments with the Moving MNIST benchmark. . . . . . . . . . . . 105
5-5 Hyperparameters and quantitative results on the test set of Kitti Dataset. . . . . 109
LIST OF FIGURES
Figure page
1-1 Felleman & Van Essen (1991) diagram of wiring in the visual cortex. . . . . . . 14
2-1 Deep Neural Network for temporal processing. . . . . . . . . . . . . . . . . . . 26
2-2 Principe & Chalasani (2014) schematic diagram of a Deep Predictive Coding Network with two layers showing bottom-up and top-down information flow. . . 29
2-3 Example of convolutional neural network (CNN) layer computation. In this example a single channel input, filter and output are illustrated. . . . . . . . . . 34
2-4 Example of convolutional neural network (CNN) layer computation with zero padding and strides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2-5 Visual representation of a convolutional recurrent neural network. . . . . . . . . 38
2-6 Diagram of Differentiable Random Access Memory (DiffRAM). . . . . . . . . . 41
3-1 Schematic diagram of a memory augmented recurrent neural network. . . . . . 46
3-2 Adding problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-3 Neural Turing Machine operations in the Copy problem. . . . . . . . . . . . . . 62
3-4 Sample desired and generated sequences using NTM2 and LSTM. . . . . . . . 64
4-1 Steps of the 2D graphics or rendering pipeline that inspired our model. . . . . . 66
4-2 How to get similar results using convolutions with delta-functions and spatial transformers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-3 Variational autoencoder graphical model. . . . . . . . . . . . . . . . . . . . . . 73
4-4 Block diagram of a Variational Autoencoder with Gaussian prior and reparametrization trick. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-5 A schematic block diagram for a Perception Updating Network. . . . . . . . . . 80
4-6 Results on the Bouncing Shapes dataset. . . . . . . . . . . . . . . . . . . . . . 83
4-7 Results of a Convolutional Perception Updating Network. . . . . . . . . . . . . 85
4-8 Performance curves in the test task of two implementations of the proposed architecture (conv PUN and STN PUN) and an equivalent LSTM baseline. . . . 86
4-9 Sample rollouts of a 2 layer LSTM convolutional Perception Updating Network. 87
4-10 A piece of the schematic block diagram for a Perception Updating Network and t-SNE embedding of the codes sent from the RNN controller to CAM. . . . 90
4-11 Snapshot Perception Updating Network. See Figure 4-5 and compare it to the convolutional Perception Updating Network model. . . . . . . . . . . . . . . 91
5-1 Convolutional Perception Updating Network as a hidden layer of a deep convnet. 96
5-2 Schematic diagram of the Recurrent Winner-Take-All (RWTA) network. . . . . . 98
5-3 Sample videos from Coil and Honda/UCSD datasets. . . . . . . . . . . . . . . 99
5-4 128 decoder weights of 7x7 pixels learned on Coil-100 videos. . . . . . . . . . 100
5-5 Deep residual U-net with Perception Updating Networks output. . . . . . . . . . 106
5-6 Definition of a single resnet block used in the experiments. . . . . . . . . . . . 107
5-7 Dilated convolution with a filter of 3x3 pixels with dilation rate of 1x1. . . . . . . 108
5-8 Qualitative results on the test set of Kitti Dataset. . . . . . . . . . . . . . . . . . 108
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A FRAMEWORK FOR PATTERN CONSOLIDATION IN COGNITIVE ARCHITECTURES
By
Ewaldo Eder Carvalho Santana Jr.
May 2017
Chair: Jose C. Príncipe
Major: Electrical and Computer Engineering
One of the most essential functions in human sensory processing is the ability to
group a sequence of stimuli that impresses the senses into a single coherent experience
of “what” happened in the world. For example, humans interpret another person’s
actions as a whole, not as a sequence of independent poses in different scenes.
Although the interpretation of individual scenes is fundamental, understanding the
complete experience relies on appropriate temporal context.
In order to capture sensory inputs in video, recent approaches such as Deep
Predictive Coding Networks and Hierarchical Temporal Memories rely on single temporal
step prediction and optimization. Unfortunately, this approach is not generic enough for
most real world video, audio and text analysis.
The proposed work enhances the Cognitive Architecture for Sensory Processing
by organizing its internal representations in explicit “what” and “where” components,
which has not been addressed in the literature. Currently, very long vectors of internal
states of the top layers are fed to external classifiers to decide what has been presented
to the system input; however, this method may not scale up when millions of images (or
videos) have been processed. Our hypothesis is that the combination of a recurrent
neural network, which models short term working memories, and a long-term content
addressable memory, inspired by the functional connections between the neocortex
and the hippocampus, will provide a solution that scales better.
We investigate recurrent neural networks that can be trained with backpropagation
through time with an external addressable memory. With recurrent architectures and
backpropagation through time we do not have to rely on the Markov assumption for
learning models of sequential data. Also, with addressable memory extensions we
can decouple the memory capacity of the proposed architecture from the number of
adaptive parameters, thus scaling better for practical engineering applications. And most
importantly, the addressable memory can be used to consolidate the decaying dynamic
states of recurrent neural networks.
To illustrate this framework in practice, we investigate the task of video prediction.
In video prediction we have to calculate future frames using previous frames as context.
We do so by explicitly modeling the moving objects in the scene and their dynamics as
separate components. This way, we can snapshot representations of “what” is in the
scene and model its dynamics, or “where” the object is in the scene, as separate latent
factors. For this part of our work, beyond the inspiration of addressable memories from
neuroscience, we took inspiration from computer graphics pipelines while designing our
network, which is also an example of how to incorporate efficient computer science
discoveries into neural network design.
From a practitioner's perspective, our contributions to the neural network literature
are about novel architectures, architectural constraints and design guidelines. We
encourage the interested reader to find inspiration in this work beyond the direct
applications of image and video processing, as an example of how to convert
consolidated computer science and physics knowledge into neural networks that
can be optimized with backpropagation.
To summarize, the main contributions of this thesis to the literature are organized as
follows:
• In Chapter 3 we discuss a generalized framework for memory augmented recurrent neural networks (Santana et al., 2017). There we show networks that learn to read and write to content addressable memories.
• We focus on memory reading mechanisms and video prediction in Chapter 4. This is the chapter where we propose a novel variational statistical framework for videos with decoupled “what” and “where” object representations. This is also the chapter where we introduce the algorithm of Perception Updating Networks (Santana & Principe, 2016).
During the time we were working on the present thesis we published other
papers on unsupervised learning for self-driving cars data (Santana & Hotz, 2016),
image filtering (Santana & Principe, 2015), video-audio sensor fusion (Santana et al.,
2015), information theoretic learning autoencoders (Santana et al., 2016a), etc. That
work contributed to the knowledge presented here but will not be discussed in detail,
unlike the papers itemized above.
CHAPTER 1
INTRODUCTION
Modeling sequential data is a challenge in many different fields. Examples are text
in Language Modeling, speech waveforms in voice recognition and image sequences in
video analysis. Humans can interpret the context of these different signals despite noise,
scaling, rotation and some degrees of deformation. Attempts to model this cognitive
ability from a statistical perspective usually rely on the Markov assumption, which is valid
only for very simple cases (Rabiner, 1989). When that assumption is valid, we can factor
the joint distribution of the sensory input as a product of transition probabilities.
Recently (Principe & Chalasani, 2014) proposed a Cognitive Architecture for
Sensory Processing inspired by human sensory processing, called Deep Predictive
Coding Networks (DPCN). Figure 1-1a represents the anatomical diagram of the
visual cortex (Felleman & Van Essen, 1991) showing the flow of information from the
retina (bottom) until it reaches the hippocampus, the brain structure that consolidates
memories. We can see a distributed, hierarchical set of highly interconnected subsystems
that extract distinct information from the visual scene that constitutes the percept. In the
top of the architecture we have the hippocampus that consolidates these features as
long term memories that can be associatively recalled. These recollected features
act as causes or priors for the representations in the lower layers. Thus, the overall
architecture tries to predict the outer world based on its internal representations.
Whenever the model falsifies its hypothesis with a poor prediction it updates its memory
and consequently its perception of the world (Fuster, 2003). This provides the amazing
cognitive capabilities of humans and other higher order animals, but there is still the
open problem of what comes first, perception or memory?
The DPCN model attempts to preserve some of this distributed, hierarchical,
bidirectional, online and self-organizing flow of information in Figure 1-1b. DPCNs
are built from identical blocks containing a generalized state space model, which are
organized hierarchically.

Figure 1-1. Felleman & Van Essen (1991) diagram of wiring in the visual cortex. In the Cognitive Architectures for Sensory Processing framework the hippocampus generates causes that guide the feature extraction in the lower layers.

DPCN uses an empirical Bayes framework, with top-down
and bottom-up information flow, which projects the incoming video frames on an
overcomplete and sparse basis, learned from the data (similar to V1). Through inference
these features represent the input video frames at different scales and the causes and
states of the top layer successfully discriminate objects in a self organizing way, even
under shift and rotation transformations. In Principe & Chalasani (2014)’s framework,
perceptions are represented by the single step predictions of DPCNs, which are
multidimensional transient vectors in time that only exist while the image is present
at the input retina. Therefore in our previous work, the user has to capture the top states
and causes synchronously with the presentation of the images and all the published
results with DPCN utilize a classifier, following the trend in deep learning. Performance
was as good or better than other unsupervised convolutional models presented in the
literature at the time, but the user must be in the loop and the classifier framework is
weak because one must know how many classes exist, and train the classifiers. This
dissertation will continue the research in the cognitive architecture and implement
the time to space mapping involved in consolidating the DPCN transient causes and
states in permanent memory that can be organized by content, in a self organizing way,
following the spirit of the cognitive architecture.
What is missing in DPCNs is the ability to take snapshots to consolidate the
perceptions of the sensory processing networks and an associative recall mechanism
for bringing those perceptions back to the context dynamics. To give a practical example
of the limitations of a network without memory consolidation, we can think about a video
where a person appears and disappears from the camera sight. When the person is
present, Principe & Chalasani (2014) showed that the DPCN could generate online
representations of its facial features that could be used for classification. Nevertheless,
when the person leaves the scene, the DPCN continues to represent the background
without explicit knowledge that the person has left; it will simply continue to represent
the new sensory input. If the subject ever comes back, the internal representations of
its face will be already blurred with representations of the background. With a memory
consolidation mechanism, when the facial features were first represented, they could be
stored and compared with the representations of future sensory input.
The issue we still have to face is what is the best way to model the combination
of the working memory with the Content Addressable Memory (CAM). In neural
network theory there are two basic types of memory mechanisms (De Vries & Principe,
1992): the finite window memory that can be implemented by a tap delay line, and the
exponentially decaying memories that can be implemented by infinite impulse response filters, with
the gamma memory as a hybrid. The tap delay line is very constraining because one
has to know a priori the memory depth. For instance, Karpathy et al. (2014) showed that
a deep convolutional network, with filters convolving both space and time, did not have
better results than simply classifying each frame and voting to classify the entire video.
Alternatively, Recurrent Neural Networks (RNN) are a more appropriate model
to represent the past information, since they implement nonlinear Infinite Impulse
Response (IIR) feature extractors that were successful in problems such as transcribing
speech signals to text (Hannun et al., 2014), text translation (Sutskever et al., 2014),
etc. Recently, Li & Príncipe (2016) showed that RNNs can also be implemented in
Reproducing Kernel Hilbert Spaces (RKHS). RNNs trained with features extracted with
feedforward neural networks (Donahue et al., 2014) (Srivastava et al., 2015) had better
results than the purely convolutional neural network proposed in (Karpathy et al., 2014).
Unfortunately, even RNN memories are reliable only up to a certain point. Later,
we will discuss how IIR models still have uniformly decaying memories. Another problem
with RNN memories is that they do not scale well and backpropagation through time
(their main learning algorithm) is only reliable up to a certain input length (usually 100
time points) (Pascanu et al., 2012). Backpropagation Through Time (BPTT) behavior
can be bounded by limiting the length of the input training batch and using modern
stochastic gradient optimization techniques such as gradient clipping (Pascanu et al.,
2012) and Adam (Kingma & Ba, 2014). But the number of trainable parameters grows
quadratically with the size of the dynamic state of the RNN. For instance, the sequence
translation model cited above (Sutskever et al., 2014) used a 3 layer RNN, each one
with a memory of size 1000, which amounts to 3 million adaptive parameters just for the
hidden-to-hidden transition matrices.
Here we look to brain sciences for inspiration to solve the memory problem
of RNNs. Specifically, we comment on the interactions between the neocortex and
the Hippocampus. Hippocampus region III (CA3) can arguably be regarded as an
autoassociation or attractor network involved in spatial functions and memory (Rolls,
2007). Region I (CA1), on its turn, records information from CA3 and back-projects it to
the neocortex. Thus, the Hippocampus and the neocortex implement complementary
memory types, with the latter being used for rapidly changing unstructured memorization
and the former for building semantic representations of what has been stored and how
to retrieve that information (Rolls, 2007). This will be our inspiration to design the top
layer of our cognitive sensory processing system, where we can associate the outputs of
DPCNs with the sensory cortex, extracting features from the input data. These features
are fed to the hippocampus, where the CA3 plays the role of the RNN that is capable
of representing its variable length input in its state within the short term past. However
this is not sufficient as we discussed above, because we would like to consolidate this
information permanently and organize it with the past stored representations that the
system acquired previously.
The RNN research community has recently started to pay more attention to this
missing architectural feature and proposed developments along the following lines: an
RNN should not rely solely on its dynamic hidden states when computing its next state
and output, but it should be able to store and retrieve previous states using an attention
mechanism that considers the context of the current input and next desired output.
On that line of thought, Bahdanau et al. (2014) proposed RNNSearch, a recurrent
encoder-decoder where all the hidden dynamic states of the encoder are saved and
partially retrieved by the decoder. Graves et al. (2014) proposed Neural Turing Machines
that explicitly define a content and location addressable memory inspired by memory
tapes in Turing Machines. Memory Networks (Weston et al., 2014), on the other hand,
store variable length inputs and learn to rank and retrieve the relevant ones at query
time. RNNSearch, although it has been successfully applied to automatic sentence
translation, has the same downside as feedforward neural networks of representing
variable length inputs by other variable length outputs since its memory is just a non
self-organizing stack, which for instance makes its application to unsupervised clustering
harder. Memory networks provide a general framework for using memory appends,
but their approach of storing input sequences as they are is not easily scalable or
biologically plausible. Memory Network memories can also be interpreted as storing
temporal windows, similarly to FIRs. In this work we develop our contributions
reasoning more closely to NTMs, instead.
Here, we propose a general framework for reinterpreting RNNs in a way that allows
us to propose a family of architectures where addressable memories augment dynamic
neural networks. We show that the NTM can be seen as a specific case of this framework,
and we also propose alternative architectures.
Afterwards, we use our findings on memory augmented RNNs to propose a
neural 2D graphics pipeline. This pipeline will be used for modeling videos with explicit
snapshotting of “what” is in the scene independent of “where” it is. The memory in
this new system can be interpreted as a sprite (or object) database. The remaining
components of the architecture learn where to place the sprite in the scene and model
its movement for video prediction. The final result is an architecture that memorizes
perceptions and updates its representation, which motivated us to call the system
Perception Updating Networks.
In the next chapter we review the relevant literature, especially memory structures
in neural networks, Deep Learning and RNN, on top of which we propose to build our
contributions.
CHAPTER 2
BACKGROUND
Here we are interested in analyzing a vector valued sequence xt where the (usually
time) index t may have different length for different realizations. For example, in a
video xt are the pixels of a frame; it could also be a vectorial representation of words
for text analysis or pieces of speech signal. We will review the relevant literature of
Signal Processing and Deep Learning for this thesis, starting with deep feedforward
neural networks that deal with each sample xt independently of the temporal
context. Deep feedforward neural networks (DFNN) are relevant for feature extraction,
for instance, but we are mostly interested in finding structure in variable-length data
and recursive processing, which is fundamental to enable language understanding
(Fitch et al., 2005), action recognition (Donahue et al., 2014) and voice transcription
(Hannun et al., 2014) among other applications. For that objective, we will also review
the literature on temporal signal processing, focused on Time Delayed and Recurrent
Neural Networks, later in this chapter.
2.1 Deep Feedforward Neural Networks
Typically in Machine Learning a multiple step pipeline of signal processing is
required from the input data to the final task output (Bishop, 2006). For example,
removing the mean and renormalization, outlier removal, dimension reduction, and
finally the classification or regression task. The main philosophy motivating Deep
Learning is to learn all the preprocessing steps and the ultimate task directly from
data (Bengio, 2009). Since this entire process mostly uses Artificial Neural Networks
(ANN), Deep Learning can be also considered the third generation of ANNs. This is the
generation with networks several hidden layers deep, also called Deep Neural Networks
(DNN). The first generation of ANNs introduced several adaptive artificial neurons
(here called processing elements-PEs) and the Perceptron learning rule (Rosenblatt,
1958). The second generation was sparked by the backpropagation algorithm applied
to Multilayer Perceptrons (MLP) (Rumelhart et al., 1988), which is nothing but an
application of the chain rule from basic Calculus, to gradient computation in nonlinear
multilayer architectures. This generation provided the first class of Universal Learning
Machines (ULMs) that can be trained directly from data. The generation branded as
Deep Learning or DNNs, can be largely credited to three main factors being used
together:
• Faster computers and huge amounts of training data (Big Data)
• Elaborate initialization techniques and learning algorithms
• Task specific architectures
2.1.1 Faster Computers and Big Data
With faster CPUs and General Purpose Graphics Processing Units (GPGPU)
it became possible to train larger neural networks in practical time. It is well known that
ANNs with at least one hidden layer are ULMs. Thus, an ANN can theoretically learn
any function as long as it has enough PEs on its hidden layers. However, the system
still uses the same basic representation space to construct the input-output map.
Discriminative applications such as object recognition in pictures, word spotting in audio
streams, etc. benefit from a more versatile representation structure at multiple spatial
or temporal scales that better mimics the structure of the input space. This asks for
several hidden layers with many PEs and hundreds of thousands of parameters to reliably
approximate real world problems and generalize well. Such architectures also require
a huge amount of training data to construct the internal representations. Only very
recently has there been sufficient data to accomplish proper training of such large architectures.
In such cases, ANNs with hundreds of thousands or even millions (Krizhevsky et al.,
2012) of parameters can now be trained with the current version of cloud computing or
special GPU clusters, which was unthinkable a few years ago.
We should also consider novel programming frameworks that leverage this extra
computational power. Especially the open source libraries that allow fast prototyping of
different architectures with code that compiles seamlessly to CPU or GPU. The number
of scientific citations to these libraries makes it clear how important it is to provide a
simple abstraction for fast scientific experimentation for deep neural network research.
In this work, we used Theano (Bergstra et al., 2010) and Tensorflow (Abadi et al., 2015),
which are Python frameworks for Machine Learning that, among many other features,
provide GPU abstraction and automatic differentiation, which guarantees that we are
using the correct gradients even when testing new complex architectures, such as the
ones proposed here. We build our models using Keras (Chollet, 2015), an open source
deep learning library on top of Tensorflow and Theano, to which the present author is also
a voluntary contributor. All the code used in this thesis will be made open source for fast
reproducibility.
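As a rough illustration of the level of abstraction these libraries provide, the short Python sketch below defines, compiles and runs a small fully connected classifier in Keras; the layer sizes and the random placeholder data are arbitrary choices for this example only and are not models used in this thesis.

# Minimal Keras sketch (illustrative only): define and compile a small
# fully connected classifier, then run a forward pass on random data.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=784))  # hidden layer
model.add(Dense(10, activation='softmax'))                # class probabilities
model.compile(optimizer='adam', loss='categorical_crossentropy')

x = np.random.rand(32, 784).astype('float32')             # placeholder batch
predictions = model.predict(x)                             # shape (32, 10)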
2.1.2 Elaborate Initialization Techniques and Learning Algorithms
When the parameters of an ANN are initialized with random values, the function
approximated by the network may be too distant from the required mapping. In such
cases the ANN may need too much time to be trained, especially when using sigmoidal
activation functions, where the vanishing gradient problem (Hochreiter et al., 2001) slows
convergence down. The existence of very low gradient regions, because of the non-convex
nature of the problem, makes this a daunting task. One of the first solutions to overcome
this problem was unsupervised pre-training with Auto-Encoder (AE) (Vincent et al.,
2010). An AE consists of two parts: an encoding function that transforms the data into
a code space, and a decoding function that reconstructs the original input from its code.
For the simple case where the encoder is a linear function, the decoder is simply its
transpose, and the hidden codes are the principal components of the input. Thus AE
pre-training places the weights of the ANN in the directions of significant statistical
properties of the data. Note that initialization is not a problem for Kernel Machines and
RBFs because their transformation functions are always centered in the data, which is
an advantage. But for randomly initialized DNNs, starting far from the data domain is a
problem. The process of initializing a DNN using unsupervised pre-training, in its most
practical form, is a layer-wise procedure. First, starting from the original data input, an
encoder/decoder function pair is trained to represent the data as a first layer code set.
Afterwards, another encoder/decoder pair is trained, but this time using the codes from
the previous pair as input. After enough encoder/decoder pairs are trained, the encoders
are stacked up and used as an initialized DNN, which, in turn, can be fine-tuned
using backpropagation of error for task specific problems. Hinton & Salakhutdinov
(2006) used a Restricted Boltzmann Machine (RBM) to train the encoder/decoder pairs.
RBMs are undirected graphical models and can be trained using Gibbs sampling and an
approximate Markov chain Monte Carlo (MCMC) method called contrastive divergence.
Vincent et al. (2010) also showed that it is possible to train the encoder/decoder pairs
using a technique similar to non-linear component analysis and obtained initializations of
similar quality.
However, we should note that with larger datasets such sensible initializations are
not mandatory. Glorot & Bengio (2010) showed that it suffices to initialize the random
weights to cover the linear region of the sigmoidal nonlinearities (or to use piecewise
linear activations) and train the network with appropriate extensions of the Stochastic
Gradient Descent (SGD) rule and mini-batches. This is a common misconception
among practitioners and neophytes to Deep Learning. They sometimes refer to older
approaches, such as the debuting paper by Hinton & Salakhutdinov (2006), that focus
on pre-training and batch mode fine-tuning, while recent approaches have almost
completely abandoned pre-training in favor of appropriate random initializations and
mini-batches. Also, instead of the second order methods, such as Conjugate Gradient
(Hinton & Salakhutdinov, 2006), recent literature has focused on SGD approaches that
only approximate the diagonal of the Hessian of the cost function. Examples of such
algorithms are RMSprop (Tieleman & Hinton, 2012) and ADAM (Kingma & Ba, 2014),
that rely on moving average estimates of second order statistics of the gradient and
momentum to optimize through slow gradient regions.
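The sketch below shows, in plain numpy, one ADAM update as described by Kingma & Ba (2014): moving averages of the gradient and of its element-wise square, with bias correction. The function and variable names are ours, and the default constants follow the original paper.

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; m and v are the running moments, t the step count (t >= 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # second moment (per-parameter scale)
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v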
Here we reproduce the most common nonlinearities used for deep learning, to
make clear the statements above:
$$\mathrm{tanh:}\;\; \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \mathrm{logistic:}\;\; \sigma(x) = \frac{1}{1 + e^{-x}},$$
$$\mathrm{ReLU:}\;\; \mathrm{relu}(x) = \max(0, x), \qquad \mathrm{softmax:}\;\; \mathrm{softmax}_i(x) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \qquad \text{(2–1)}$$
where ReLU stands for rectified linear unit, which is the nonlinearity that provides the best
results in practice and is also less sensitive to precise initialization, since its linear region
is obviously larger than that of the sigmoids. Softmax returns a probability distribution
and is usually used as the last nonlinearity to represent class probabilities. They can
also be used as weights to average a vector, which is an application that we will discuss
later in this chapter.
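The nonlinearities in (2–1) translate directly to numpy; the only practical caveat, reflected in the sketch below, is that softmax is usually computed after subtracting the maximum input value for numerical stability.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

# tanh is available directly as np.tanh(x)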
For more complicated problems, such as video or image analysis with little
training data, transfer learning (Bengio, 2012) is the state-of-the-art approach. For
instance, Razavian et al. (2014) showed that the penultimate layer of
a Deep Neural Network trained to classify the IMAGENET dataset (Deng et al., 2009)
provides features for other image analysis tasks that surpass all the previously proposed
feature extraction techniques such as SIFT (Lowe, 2004), HOG (Zhu et al., 2006) and
unsupervised learning such as Sparse Coding and Independent Component Analysis
(ICA) (Hyvarinen et al., 2004).
In a nutshell, initialization is fundamental and the most up to date suggestion
to initialize a Deep Neural Network is to use a network that has been trained to
classify a large dataset in a similar domain or to use random initialization and train
the network with mini-batches on a large dataset. Although theoretically sound, for
practical applications unsupervised learning and auto-encoders should be only the third
option of initialization.
2.1.3 Task Specific Architectures
Even though DNNs are also ULMs, correlations in the input data make the
training process complicated. For example, for dealing with temporally correlated
data Recurrent Neural Networks (Haykin, 2004) were proposed in the second ANN
generation mentioned above. On the other hand this third-generation exploited very
effectively convolution based architectures, also called Convolutional Neural Networks
(CNN) (LeCun et al., 1998). CNNs were developed to exploit local dependencies in
data such as images, where neighboring pixels present strong correlations (LeCun
et al., 1998). This is implemented using PEs that receive only part of the input, much
like the Neocognitron (Fukushima, 1980) architecture that was inspired by the space
selective receptive fields of the simple-cells of the visual cortex (Hubel & Wiesel, 1968).
The main difference between the Neocognitron architecture and CNNs is that the
latter use shared parameters (or weights) for all the local receptive fields (LeCun et al.,
1998). Using shared parameters across local receptive fields implements a convolution
operation, thus the name CNN. Invariance to local shifts in space is achieved in CNNs
through local pooling or strided convolutions (Springenberg et al., 2014). In other words,
a pooling layer downsamples the output map of a convolutional layer by forwarding
only the maximum activation of local regions, which approximates the behavior of
complex cells in the cortex (Hubel & Wiesel, 1968). This new type of architecture was
the core technology that powered most of the large scale image processing networks,
such as the DNN that won the IMAGENET 2012 (Krizhevsky et al., 2012), and also
the applications for speech recognition (Abdel-Hamid et al., 2014) in which case the
convolutions were applied to the spectrogram of the audio data. More recently, CNNs
have also been applied to text processing (Kim, 2014), where the convolutions are
applied over vectorial representations of words or characters.
2.2 Temporal Processing with Neural Networks
When the input has a temporal dimension, and the task is temporal pattern
recognition, DFNNs are no longer an appropriate model. De Vries & Principe (1992)
proposed a unifying framework for neural networks that can solve such problems. The
basic architectures can be interpreted as either defining a fixed length window in the
input series or implementing an infinite convolution through time. For the case where
we define a fixed length window in time we have finite impulse response (FIR) filters,
moving average (MA) and time-delayed neural networks (TDNN). We can write the MA
model as
$$h_t = x_t + \sum_{i=1}^{N} w_i\, x_{t-i} + b, \qquad \text{(2–2)}$$
where N is the size of the temporal window, ht is the generated signal and wi and b are
free parameters of the model.
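A direct numpy translation of the MA model in (2–2) is given below: the output at time t is the current sample plus a weighted sum of the previous N samples; time steps for which the full window is not yet available are simply skipped in this sketch, and the names are ours.

import numpy as np

def moving_average_features(x, w, b=0.0):
    """h_t = x_t + sum_{i=1..N} w_i * x_{t-i} + b for a 1D signal x (Eq. 2-2)."""
    N = len(w)
    h = np.zeros_like(x, dtype=float)
    for t in range(N, len(x)):
        window = x[t - N:t][::-1]           # x_{t-1}, ..., x_{t-N}
        h[t] = x[t] + np.dot(w, window) + b
    return h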
Sandberg & Xu (1997) showed that the functions generated by any tap-delay
line followed by an MLP or DNN (essentially what a TDNN is) are myopic maps, in
other words, they are universal approximators on a functional space of decaying time
functions and their memory depth has to be preselected to the application. In practice
this means that TDNNs have limited memory capacity, which could be increased by a
longer tap-delay line but with the cost of a huge increase in the number of parameters.
We illustrate TDNNs and how their tap-delay line can be calculated with the Gamma
memory in Figure 2-1.
On the other hand, with an infinite impulse response (IIR) we have models that
implement an infinite convolution over the input sequence. These models can also be
rewritten in a recursive form such as auto-regressive (AR) models
$$h_t = x_t + \sum_{i=1}^{N} w_i\, h_{t-i} + b \qquad \text{(2–3)}$$
Figure 2-1. Deep Neural Network for temporal processing. (a) Time-delayed neural network. (b) Gamma memory; for µ = 0 we have the regular tap-delay line.
where N is now the depth of recursion. De Vries & Principe (1992) showed that an IIR
can also be implemented with the Gamma memory defined as a cascade arrangement
of generalized delay operators given by
$$x^{(k)}_t = \mu\, x^{(k)}_{t-1} + (1-\mu)\, x^{(k-1)}_{t-1} \qquad \text{(2–4)}$$
for the case where µ = 0, we have a regular tap-delay and the MLP using the Gamma
memory as the first layer would reduce to a nonlinear FIR. For the other values of µ the
memory has an IIR impulse response itself with time constant controlled by µ.
It is important to understand the value of the recurrent parameter µ for memory
based applications (De Vries & Principe, 1992). The recursive parameter in (2–4) acts as
a control of the time axis scale, i.e. as a compromise between memory depth (D) and
its resolution (R). For an L-th order filter built with generalized delay operators, depth and
resolution trade against each other: depending on the value of µ (with 0 < µ < 1), the
memory can have high temporal resolution but low depth, or long depth with poor
resolution. This is a fundamental
feature of linear recurrent systems when used as memory systems. The only way to go
beyond this limitation is with nonlinear memory functions which use gating such as the
Long-Short Term Memory (Hochreiter & Schmidhuber, 1997).
The combination of both approaches is the auto-regressive moving average
(ARMA). For shift-invariant linear models, the fact that the transfer function of ARMA models can
be defined as a linear combination of decaying complex exponentials reiterates the
assertion that all these transformations are based on linear combinations of complex
exponential functions with uniformly decaying memories.
Possible ways to implement nonlinear IIRs are using recursive equations such as
those proposed by DPCNs and RNNs. DPCNs are also task specific architectures that
combine the power of convolutional neural networks and the Markov assumption to
exploit temporally varying signals. RNNs can be interpreted as nonlinear IIRs and exploit
temporal structures in a longer time scale beyond the Markov assumption. We devote
the next two sections to the details of DPCNs and RNNs.
2.3 Deep Predictive Coding Networks
DFNNs do not have a temporal context and the feature extraction is implemented
in a single sweep through the architecture. This means that all the feedback provided by
the upper layers to the lower layers is that of backpropagation. During the feedforward
pass, there is no feedback at all. DPCNs on the other hand propose to combine a
bottom-up flow, similar to that of DFNNs, but driven by priors given by a top-down flow.
This bottom-up plus top-down flow is leveraged using the temporal context of the input
sequence across different time steps t.
DPCNs are hierarchical systems of equally defined layers. A layer is defined by a
set of adaptive weights A,B,C , which are respectively the transition, causes rescaling
and observation matrix, and a pair of dynamic variables xt , ut which are called states
and causes and evolve in time t. An important difference between DPCNs and DFNNs
is that here the layer outputs are the dynamic variables xt , ut and they are not calculated
from a single projection followed by a nonlinearity; they are instead optimized with
Expectation-Maximization (EM) to fit a generative model of the input data. A block
diagram of DPCNs is shown in Figure 2-2.
The parameters and outputs of the l-th layer of a DPCN are alternately optimized
to minimize the following energy function:
$$E(x_t, u_t, \theta) = \sum_{n=1}^{N} \left( \frac{1}{2}\left\| u^{(l-1,n)}_t - C^{(l)} x^{(l)}_t \right\|_2^2 + \lambda \left\| x^{(l)}_t - A^{(l)} x^{(l)}_{t-1} \right\|_1 + \sum_{k=1}^{K} \left| \gamma^{(l)}_{t,k} \cdot x^{(l)}_{t,k} \right| \right)$$
$$\qquad + \beta \left\| u^{(l)}_t \right\|_1 + \frac{1}{2} \left\| u^{(l)}_t - u^{(l+1)}_t \right\|_2^2 - \log P(\theta), \qquad \text{(2–5)}$$

where

$$\gamma^{(l)}_{t,k} = \gamma_0 \left[ \frac{1 + \exp\left(-[B^{(l)} u^{(l)}_t]_k\right)}{2} \right] \quad \text{and} \quad \theta = \{A, B, C\}, \qquad \text{(2–6)}$$
where l = 0 represents the input data. Matrix C reconstructs the layer input $u^{(l-1)}_t$ from
sparse codes $x^{(l)}_t$. These codes evolve sparsely from previous representations $A^{(l)} x^{(l)}_{t-1}$,
where A are transition matrices. The sparseness of the codes x is controlled by the
components u that are also sparse but evolve from priors coming from upper layers in a
top-down flow $u^{(l+1)}_t = C^{(l+1)} x^{(l+1)}_{t-1}$. The prior probability $\log P(\theta)$ is used as an $\ell_2$-norm
regularization.

Figure 2-2. Principe & Chalasani (2014) schematic diagram of a Deep Predictive Coding Network with two layers showing bottom-up and top-down information flow.
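To make the cost in (2–5)–(2–6) concrete, the sketch below evaluates the energy of a single DPCN layer for one sample at one time step, given states, causes and parameters; the inference and EM updates themselves are not shown, the parameter prior term is omitted, and all names are ours.

import numpy as np

def dpcn_layer_energy(u_prev, x, x_tm1, u, u_above, A, B, C,
                      lam=0.1, beta=0.1, gamma0=1.0):
    """Energy of one DPCN layer, single sample and time step (Eqs. 2-5 and 2-6).
    u_prev : input to this layer (causes of the layer below)
    x, x_tm1 : current and previous states of this layer
    u, u_above : causes of this layer and of the layer above
    A, B, C : transition, cause-rescaling and observation matrices."""
    reconstruction = 0.5 * np.sum((u_prev - C.dot(x)) ** 2)       # bottom-up reconstruction
    transition = lam * np.sum(np.abs(x - A.dot(x_tm1)))           # sparse state transition
    gamma = gamma0 * (1.0 + np.exp(-B.dot(u))) / 2.0              # cause-modulated weights (Eq. 2-6)
    state_sparsity = np.sum(np.abs(gamma * x))                    # weighted sparsity of the states
    cause_sparsity = beta * np.sum(np.abs(u))                     # sparsity of the causes
    top_down = 0.5 * np.sum((u - u_above) ** 2)                   # top-down prior from the layer above
    # the parameter prior -log P(theta) (l2 regularization) is left out of this sketch
    return reconstruction + transition + state_sparsity + cause_sparsity + top_down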
Here the EM algorithm is implemented with Stochastic Gradient Descent of the cost
function above. In (2–5) another difference between DPCNs and DFNNs is clear: to
calculate the higher layer activations ut, the latter would only use the current
context xt. On the other hand, DPCNs not only implement a temporal context through
Axt−1, but also a top-down flow via u(l+1). DPCNs provided better results on simple video
classification than DFNNs (Principe & Chalasani, 2014).
Nevertheless, the temporal context provided by the very previous time step t − 1 is
only sufficient when the data has homogeneous dynamics through time, since DPCNs
encode variable length time series as variable length causes with uniformly decaying
memories. For more complex structures, we have to go beyond these temporal
limitations. RNNs on the other hand can learn longer term dependencies. We will
talk about RNNs in the next section.
2.4 Recurrent Neural Networks
RNNs are networks with PEs forming a directed cycle. They were initially proposed
as a cognitive model by several authors such as Jeff Elman and Michael I. Jordan, see
(Principe et al., 1999). From a statistical perspective, several authors proposed the
backpropagation through time (BPTT) (Robinson & Fallside, 1987), (Werbos, 1988) and
the Real-Time Recurrent Learning (RTRL) algorithms (Williams & Zipser, 1989) to train
RNNs for sequential data prediction and temporal pattern recognition. In its simplest
form, given a vector valued input sequence xt , the dynamics of a RNN evolves as
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b), \qquad \text{(2–7)}$$
where ht is the dynamic state of the RNN, Whh is the hidden-to-hidden connection
matrix, Wxh is the input-to-hidden connection matrix, b is a bias vector and f is
a differentiable nonlinearity such as the hyperbolic tangent. RNNs are Universal
Computers (Siegelmann & Sontag, 1995); from the myopic maps perspective (Sandberg &
Xu, 1997), this means that RNNs have an infinite memory extent adapted directly from data.
This concept is more general than ULMs, nonetheless it is equally hard to fully explore in
practice with finite connections. As universal computers, RNNs can implement arbitrary
sequence to sequence mappings. In practice, RNNs have difficulty in learning long-term
dependencies due to the vanishing gradient problem, which is a consequence of the
uniformly decaying nature of myopic maps, since the derivative of a myopic map is also
myopic. Notice that in the following derivative
$$\frac{\partial h_t}{\partial W_{hh}} = f'\, h_{t-1}\, \frac{\partial h_{t-1}}{\partial W_{hh}} \qquad \text{(2–8)}$$
the derivatives f ′ are small numbers with absolute values between 0 and 1, thus the
total gradient vanishes at each time step t. To combat that, second order RNNs were
proposed. The first and most popular solution is named Long Short Term Memory
(LSTM) network (Hochreiter & Schmidhuber, 1997), where the gradients are kept using
gating connections, similar to digital logic gates, but differentiable and trainable with
BPTT. The LSTM equations in their most recent formulation (Gers et al., 2000) can be written
as
$$\begin{aligned}
i_t &= \mathrm{logistic}(W_{hi} h_{t-1} + W_{xi} x_t + b_i) \\
f_t &= \mathrm{logistic}(W_{hf} h_{t-1} + W_{xf} x_t + b_f) \\
o_t &= \mathrm{logistic}(W_{ho} h_{t-1} + W_{xo} x_t + b_o) \\
g_t &= \tanh(W_{hg} h_{t-1} + W_{xg} x_t + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \qquad \text{(2–9)}$$
where i, f, o and g are respectively the input, forget, output and add gates. They control
how much information is accepted, forgotten and exposed from the cell unit c. The ⊙ is
the element-wise multiplication. Thus, LSTMs have two dynamic states, the cell ct and
the output state ht . LSTMs can learn long term dependencies by storing information in
ct and short, rapidly changing dependencies in its output ht , hence its name. Also, due
to the multiplicative connections between states and inputs, LSTMs can be mapped to
a finite state machine both in training and representation (Giles et al., 1992), (Omlin &
Giles, 1996).
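The LSTM update in (2–9) maps to a few lines of numpy; the sketch below performs a single time step given the previous output and cell states, with weight names and shapes of our choosing.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_h, W_x, b):
    """One LSTM step (Eq. 2-9). W_h, W_x and b are 4-tuples holding the
    input, forget, output and add gate parameters, in that order."""
    (W_hi, W_hf, W_ho, W_hg), (W_xi, W_xf, W_xo, W_xg), (b_i, b_f, b_o, b_g) = W_h, W_x, b
    i = logistic(W_hi.dot(h_prev) + W_xi.dot(x_t) + b_i)   # input gate
    f = logistic(W_hf.dot(h_prev) + W_xf.dot(x_t) + b_f)   # forget gate
    o = logistic(W_ho.dot(h_prev) + W_xo.dot(x_t) + b_o)   # output gate
    g = np.tanh(W_hg.dot(h_prev) + W_xg.dot(x_t) + b_g)    # candidate (add) values
    c = f * c_prev + i * g                                  # new cell state
    h = o * np.tanh(c)                                      # new output state
    return h, c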
Other approaches to avoid the vanishing gradient problem are Echo State Network
(ESN) (Jaeger, 2002) and Gated Recurrent Units (GRU) (Chung et al., 2014). ESNs
avoid the vanishing gradient by not adapting the hidden-to-hidden connections of the
network, focusing only on the hidden-to-output connections and appropriate weight
initialization. Sutskever et al. (2013) showed that using ESN-like initialization and
gradient descent adaptation provides better results than using fixed weights. The
ESN literature was the first to propose to initialize the hidden-to-hidden connections
to orthogonal matrices with spectral radius bounded to be close to one. Orthogonal
initialization will be the default choice for the Whh throughout this work, unless explicitly
stated otherwise.
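A common recipe for such an orthogonal initialization, sketched below, is to take the Q factor of the QR decomposition of a random Gaussian matrix and fix its signs; the gain argument allows scaling the spectral radius. The function name is ours.

import numpy as np

def orthogonal_init(size, gain=1.0, rng=np.random):
    """Return a (size x size) orthogonal matrix scaled by `gain`, e.g. for Whh."""
    a = rng.normal(0.0, 1.0, (size, size))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # make the decomposition unique / well conditioned
    return gain * q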
GRUs provide a similar approach to LSTMs to keep the gradients from vanishing,
i.e. by using gating connections much like the leaky integrators in the Gamma memory
shown above. Just like the Gamma memory leaky integrators can be adapted
to focus on the appropriate time scale of the input data, GRUs implement a leaky
integrator in the hidden state, thus controlling the time scale of the dynamic representation,
similarly to LSTM cells. The GRU equations are the following:
$$\begin{aligned}
r_t &= \mathrm{logistic}(W_{hr} h_{t-1} + W_{xr} x_t + b_r) \\
z_t &= \mathrm{logistic}(W_{hz} h_{t-1} + W_{xz} x_t + b_z) \\
\tilde{h}_t &= \tanh(W_{hh}(r_t \cdot h_{t-1}) + W_{xh} x_t + b_h) \\
h_t &= (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t,
\end{aligned} \qquad \text{(2–10)}$$
where rt is the reset gate that defines how much of the previous state ht−1 will be
exposed to the proposed state h̃t, and the update gate zt interpolates between the
previous and proposed states to generate the new state ht. Thus, zt adaptively controls
the temporal scale of the hidden state, while rt works as a forgetting factor.
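For completeness, one GRU time step (2–10) in the same numpy style: the reset gate gates the previous state inside the candidate, and the update gate interpolates between the old and candidate states. Names are ours.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_hr, W_xr, b_r, W_hz, W_xz, b_z, W_hh, W_xh, b_h):
    """One GRU step (Eq. 2-10)."""
    r = logistic(W_hr.dot(h_prev) + W_xr.dot(x_t) + b_r)            # reset gate
    z = logistic(W_hz.dot(h_prev) + W_xz.dot(x_t) + b_z)            # update gate
    h_tilde = np.tanh(W_hh.dot(r * h_prev) + W_xh.dot(x_t) + b_h)   # proposed state
    return (1.0 - z) * h_prev + z * h_tilde                          # leaky-integrator update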
Although LSTMs and GRUs were originally developed as ad hoc solutions to the vanishing
gradient problem, Jozefowicz et al. (2015) implemented an extensive search for different
architectures to better solve problems with long-term temporal dependencies and the
best architectures were not much different or significantly better than those two. Thus, in
this work we will either use LSTMs or GRUs wherever an RNN is necessary.
Specific cases of RNNs are sequence to vector and sequence to sequence
mappings. In sequence to vector we have an input xt, t ∈ {1, ... ,T}, that is transformed
by the RNN to ht, t ∈ {1, ... ,T}; after that, either hT or a (weighted) average of all ht is
used as a vector representation that can be input to an MLP for classification, for example.
In sequence to sequence applications that initial vector representation is used as input
(also called key or conditioning) to another RNN that generates an output sequence.
Examples of sequence to sequence applications are recurrent encoder-decoders (Socher
et al., 2011), (Sutskever et al., 2014).
However, as we discussed above, the problem is not only training with vanishing
gradients, but also that an application may require high resolution for some components
of the history and not for others, in a given memory depth. Neither LSTM nor GRU allows
for learning these two characteristics of memories independently, so they
are not general mechanisms for storing information in time. RNNs potentially bring the
capability of configuring the memory requirements for time processing, but they still
suffer from the problem of efficient training and they are unable to control the resolution
of the memory trace independently of where it occurs in time.
RNNs can cope with dependencies in time. But, just like fully connected neural
networks, they are not efficient at learning spatially invariant transformations. A better suited
architecture for this task, as mentioned before, is the convolutional neural network. We
discuss convolutional neural networks (CNN) in more detail in the next section.
2.5 Convolutional Neural Networks
A convolution operation in neural networks can be expressed as follows. Assume an
input batch Xb,h,w,c with dimensions b images per batch, where each image in the batch
has h rows, w columns and c channels. That batch is to be convolved with a set of filters
Wi,j,c,k with i rows and j columns, where k is the number of output channels. Each one of the k
channels in the convolutional filter operates over all the c channels in the input at once. In
an equation, the result of the neural network convolution is given by
$$Y_{b,h_y,w_y,k} = (X \star W)_{b,h_y,w_y,k} = \sum_{\alpha=0}^{i-1}\sum_{\beta=0}^{j-1}\sum_{\gamma=1}^{c} X_{b,\,h_y-i/2+\alpha,\,w_y-j/2+\beta,\,\gamma}\; W_{\alpha,\beta,\gamma,k}, \qquad \text{(2–11)}$$
where ⋆ denotes the convolution operation. We can visualize the convolutional filter
W sliding over the images X as depicted in Figure 2-3. The size of the output map
depends on several choices; for example, we may only want to calculate the values where
the convolutional filter totally overlaps with the input, in which case the number of rows
and columns in the output are smaller than in the input.

Figure 2-3. Example of convolutional neural network (CNN) layer computation. In this example a single channel input, filter and output are illustrated. The inputs are represented in blue and the outputs in green. The darker shade of blue are the convolutional filter values being operated on that spatial location. This image has been adapted from (Dumoulin & Visin, 2016).

Another choice is to force the output
to be of the same size as the input, by zero padding the former before the convolution.
There are also strided convolutions, where not all the values in the double summation
in (2–11) are calculated. This is a more efficient way to do downsampling, since it avoids
the computations altogether instead of pooling over regions. Strided convolutions with zero padded
inputs are illustrated in Figure 2-4.
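The summation in (2–11) can be written as explicit loops over output positions and channels, which makes the indexing concrete. The sketch below assumes a "valid" convolution (no padding, stride one), so the output is smaller than the input, as discussed above.

import numpy as np

def conv2d_valid(X, W):
    """Naive CNN-style convolution (a correlation, as in Eq. 2-11).
    X: input batch of shape (b, h, w, c); W: filters of shape (i, j, c, k).
    Returns Y of shape (b, h - i + 1, w - j + 1, k)."""
    b, h, w, c = X.shape
    i, j, _, k = W.shape
    hy, wy = h - i + 1, w - j + 1
    Y = np.zeros((b, hy, wy, k))
    for n in range(b):                  # image in the batch
        for y in range(hy):             # output row
            for x in range(wy):         # output column
                patch = X[n, y:y + i, x:x + j, :]   # local receptive field
                for f in range(k):      # output channel
                    Y[n, y, x, f] = np.sum(patch * W[:, :, :, f])
    return Y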
It is interesting to note the complexity of the current architecture of the CNNs versus
the RNN architectures. The current implementations of CNNs in the literature are trying
to achieve universal features for static images to cope with the variability of how a given
object can appear in the scene (due to rotations, translations and scale). Likewise, we
should improve the architectures of RNNs to achieve a similar goal: to process events in
time. The result of this combination are Convolutional Recurrent Neural Networks, which
we review in the next section. Also, bear in mind that further improvements to RNNs are
the essence of this thesis.

Figure 2-4. Example of convolutional neural network (CNN) layer computation with zero padding and strides. Each frame in the sequence above represents how each value of the output is calculated. Note that the convolutional filter (gray) strides 2 pixels at a time, as opposed to one pixel at a time as depicted in Figure 2-3. In this example a single channel input, filter and output are illustrated. The input is padded with zero values (white squares). The inputs are represented in blue and the outputs in green. This image has been adapted from https://github.com/vdumoulin/conv_arithmetic.
2.6 Combining CNNs and RNNs: Convolutional Recurrent Neural Networks.
In order to generalize the shift and scale invariant properties of CNNs to temporal
series, Convolutional Recurrent Neural Networks (ConvRNN) were proposed. ConvRNNs
were initially used in the context of supervised learning for classifying static images
(Liang & Hu, 2015), in which case the same input batch is represented as input
several times instead of using a time series. ConvRNNs were also applied for weather
forecasting (Xingjian et al., 2015). After these first applications, several other papers
followed up with unsupervised learning applications of ConvRNNs. In unsupervised
learning ConvRNNs were used for video prediction (Kalchbrenner et al., 2016) (Lotter
et al., 2016) (Finn et al., 2016a), optical flow estimation (Patraucean et al., 2015),
algorithmic learning (Kaiser & Sutskever, 2015) and feature extraction from videos
(Santana et al., 2016b).
The motivation of using ConvRNNs for unsupervised learning is that they can be
interpreted as locally connected RNNs with shared parameters across the input images,
similarly to how CNNs are interpreted as locally connected, shared-parameter MLPs.
We show a visualization of a ConvRNN in Figure 2-5. Also, we can rewrite all the RNN
equations mentioned above using convolutions, which are as follows:
ConvRNN

$$H_t = f(W_{hh} \star H_{t-1} + W_{xh} \star X_t + b), \qquad (2\text{–}12)$$

ConvGRU

$$\begin{aligned}
R_t &= \mathrm{logistic}(W_{hr} \star H_{t-1} + W_{xr} \star X_t + b_r)\\
Z_t &= \mathrm{logistic}(W_{hz} \star H_{t-1} + W_{xz} \star X_t + b_z)\\
\tilde{H}_t &= \tanh(W_{hh} \star (R_t \odot H_{t-1}) + W_{xh} \star X_t + b_h)\\
H_t &= (1 - Z_t) \odot H_{t-1} + Z_t \odot \tilde{H}_t,
\end{aligned} \qquad (2\text{–}13)$$

ConvLSTM

$$\begin{aligned}
I_t &= \mathrm{logistic}(W_{hi} \star H_{t-1} + W_{xi} \star X_t + b_i)\\
F_t &= \mathrm{logistic}(W_{hf} \star H_{t-1} + W_{xf} \star X_t + b_f)\\
O_t &= \mathrm{logistic}(W_{ho} \star H_{t-1} + W_{xo} \star X_t + b_o)\\
G_t &= \tanh(W_{hg} \star H_{t-1} + W_{xg} \star X_t + b_g)\\
C_t &= F_t \odot C_{t-1} + I_t \odot G_t\\
H_t &= O_t \odot \tanh(C_t).
\end{aligned} \qquad (2\text{–}14)$$
The ConvLSTM is the most widely used ConvRNN architecture. Later in this thesis, we will use ConvRNNs to write an efficient end-to-end differentiable reinterpretation of DPCNs. We will also use them to scale our memory augmented models to large images.
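As a minimal sketch of a single ConvLSTM step following (2–14), assuming PyTorch (the channel counts and kernel size are illustrative, not the configurations used later in this thesis):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # "same" padding keeps H and C the size of the input
        # A single convolution produces all four gates (i, f, o, g) at once.
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g        # cell state update, as in (2-14)
        h_t = o * torch.tanh(c_t)       # hidden state: an image, not a vector
        return h_t, c_t

# Usage: inputs and states are (batch, channels, height, width) tensors.
cell = ConvLSTMCell(in_channels=1, hidden_channels=8)
x_t = torch.zeros(2, 1, 28, 28)
h = c = torch.zeros(2, 8, 28, 28)
h, c = cell(x_t, h, c)
```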
In the next chapter, we will augment RNNs with content addressable memories
(CAM), but first, in the next section we discuss what CAMs are in the context of neural
networks.
2.7 Content Addressable Memories
Content Addressable Memories (CAM) in the context of computer hardware are
memories used for very high speed searching applications (Pagiamtzis & Sheikholeslami,
2006). Another important special type of memory is the Random Access Memory (RAM)
that is addressed with a location array and returns the stored value. On the other hand, CAMs are addressed with a data word and return the addresses where similar values are stored.
Figure 2-5. Visual representation of a convolutional recurrent neural network. Here we show a model unfolded for 3 time steps and represent a locally connected convolutional filter being applied to a region of the input frames. This same operation, with shared parameters, is used throughout the input image. This way, for each time step, we have images as inputs, images as hidden states and images as outputs. This is contrary to conventional recurrent neural networks, where all the operations are based on dot products between vectors.
In the field of neural networks, CAMs are also called Neural Associative Memories
(Palm et al., 1997). CAMs were used as a correlation model to store data (Kohonen,
2012). In the simple form without internal dynamics, CAMs receive an input vector x and
return an output vector y calculated as
y = f (Wx), (2–15)
where f is a function such as the sign function, sign(x) = 1 if x > 0 and −1 otherwise, for binary CAMs, or the identity function for linear CAMs. When the network is required to store a sequence
of input vectors $x_1, x_2, \ldots, x_N$, the mean squared error solution for the CAM weights $W$ is the autocorrelation

$$W = \frac{\sum_{i=1}^{N} x_i x_i^T}{N}, \qquad (2\text{–}16)$$

where $x_i^T$ is a transposed row vector. Similarly, when the network is required to output a desired vector $y_i$ whenever an input $x_i$ is presented, the solution for $W$ is the cross-correlation

$$W = \frac{\sum_{i=1}^{N} y_i x_i^T}{N}. \qquad (2\text{–}17)$$
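A minimal NumPy sketch of the cross-correlation CAM in (2–17), with toy data chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16
X = rng.standard_normal((N, D))   # input patterns x_i (one per row)
Y = rng.standard_normal((N, D))   # desired associations y_i

W = Y.T @ X / N                   # cross-correlation memory, eq. (2-17)
y_hat = W @ X[0]                  # linear recall for the first stored pattern
# Recall is exact only for orthonormal inputs; crosstalk grows as patterns correlate.
```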
Unfortunately, the memory capacity of correlation-based CAMs for perfect recall is N patterns (the number of orthogonal vectors in N dimensions). Also, they can only reliably recollect about 0.14 times the number of PEs (Amari, 1988). Hasanbelliu &
Principe (2008) proposed a CAM memory implemented using the kernel trick and
Reproducing Kernel Hilbert Spaces (RKHS) that has an unconstrained memory capacity
only limited by the physical memory of the machine. Given an input vector x , the kernel
CAM (KCAM) output is

$$y = \sum_{n=1}^{N} y_n \kappa(x_n, x), \qquad (2\text{–}18)$$

where $\kappa$ is a Mercer kernel, such as the Gaussian

$$\kappa_\sigma(x_i, x_j) = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sqrt{2\pi\sigma^2}}, \qquad (2\text{–}19)$$

and $(x_n, y_n)$ are the associated input-output pairs. KCAMs were shown to have a larger memory capacity and better quality data recollection than linear CAMs (Hasanbelliu & Principe, 2008).
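A corresponding NumPy sketch of the kernel CAM recall in (2–18)–(2–19), again with illustrative stored pairs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 16))   # stored inputs x_n
Y = rng.standard_normal((4, 16))   # stored outputs y_n

def gaussian_kernel(A, b, sigma):
    d2 = np.sum((A - b) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

def kcam_recall(x, X, Y, sigma=0.5):
    weights = gaussian_kernel(X, x, sigma)   # similarity of x to every stored x_n
    return weights @ Y                       # weighted sum of stored outputs, eq. (2-18)

y_hat = kcam_recall(X[0], X, Y)              # dominated by Y[0] for a small kernel size
```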
It is important to note that the main difference between associative memories and
regressors is in the number of exemplars. Principe et al. (1999) argued that regressors
are meant to find a single optimal hyperplane describing as much information as possible about the entire dataset, while linear associative memories want to output a response that is as close as possible to each memorized input-output pair. Thus, for regressors we need more data than free parameters, whereas for associative memories we want the opposite.
A different implementation of CAMs involves internal dynamics. The most prominent example of such a dynamic CAM is the Hopfield network (Hopfield, 1982). The dynamics of a Hopfield network can be described as follows. The input signal is the initial memory state, $s_0 = x_i$. The CAM state then evolves following the recursive equation $s_t = f(W s_{t-1})$:

$$s_t = \begin{cases} +1 & \text{if } W s_{t-1} \geq \theta_i \\ -1 & \text{otherwise,} \end{cases} \qquad (2\text{–}20)$$

where $\theta_i$ is a per-unit threshold. After $t = N$ such recursions, the network converges to the recollected pattern $y_i = s_N$. Again, the parameters $W$ can be initialized to the cross-correlation between the pairs $(y_i, x_i)$.
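A small NumPy sketch of the Hopfield recursion in (2–20): start from a corrupted pattern and iterate the thresholded update (the stored patterns and the zeroed diagonal are illustrative choices, not prescribed by the text):

```python
import numpy as np

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1,  1, 1, -1, -1, -1]], dtype=float)
W = patterns.T @ patterns / len(patterns)   # auto-correlation storage, eq. (2-16)
np.fill_diagonal(W, 0.0)                    # common choice: no self-connections

s = np.array([1, -1, 1, -1, 1, 1], dtype=float)   # noisy version of the first pattern
for _ in range(6):                                # a few recursions of s_t = f(W s_{t-1})
    s = np.where(W @ s >= 0, 1.0, -1.0)           # threshold theta_i = 0
print(s)   # recovers the stored pattern [1, -1, 1, -1, 1, -1]
```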
A differentiable RAM (DiffRAM) for neural networks can be implemented using a discrete distribution vector $p$ with $A$ values. Each value indicates the weight of an address, and the expectation under $p$ is the retrieved content. In an equation, given a memory $M \in \mathbb{R}^{A \times B}$, where each of the $A$ memory slots stores a word of length $B$, we can retrieve a word as

$$m = \sum_{i \in A} M_i p_i, \qquad (2\text{–}21)$$

see Figure 2-6 for an illustration.
This mechanism of using a distribution to weight different values in a matrix (or tensor) and retrieving values using moments, as above, is referred to as differentiable attention in the Deep Learning literature (Xu et al., 2015), (Bahdanau et al., 2014), (Gregor et al., 2015).
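A NumPy sketch of the differentiable location addressing in (2–21); the memory and addressing values mirror the example of Figure 2-6:

```python
import numpy as np

M = np.array([[0.5, 0.4, 0.6, 0.0, 0.2],
              [0.5, 1.1, 5.0, 2.5, 1.5],
              [0.0, 0.0, 0.0, 1.9, 0.0]])   # A = 3 memory slots, word length B = 5
p = np.array([0.9, 0.0, 0.1])               # addressing distribution, sums to 1

m = p @ M                                    # expectation over rows, eq. (2-21)
print(m)                                     # [0.45, 0.36, 0.54, 0.19, 0.18]
```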
Figure 2-6. Diagram of a Differentiable Random Access Memory (DiffRAM). Here we show a memory containing 3 elements, each one a 5-dimensional vector (red matrix). The differentiable addressing signal (green vector) has one value for each memory element and must add up to 1. The addressing signal here retrieves the first row almost perfectly. Note that this kind of leak, where the addressing signal is not zero for all elements but one, sometimes happens in practice. The reason for such leaks is the smooth nature of the operation, which allows differentiability but admits such undesired cases.
Unfortunately, CAM, KCAM, RAM and Hopfield networks only work when the inputs
are fixed length patterns. To extend associative memories to deal with variable length
data, while mimicking the neocortex-hippocampus operation, we have to embed such
memories in a dynamic architecture. In the next chapter we propose a framework for
defining such architectures.
CHAPTER 3
A FRAMEWORK FOR DYNAMIC ADDRESSABLE MEMORIES
Let a vector valued time series xt ∈ RD be an input to a Recurrent Neural Network.
In the present framework, this input time series can be the top most dynamic causes
generated by DPCNs. It can also be any simple time-series such as text, simple videos
or any other data which regular RNNs may be successfully applied to. We are interested
in the dynamic states ht generated by the recurrent network when it is programmed to
model the time structure of xt . Also, at any given point during the presentation of the
time series to the RNN, we want to store the most interesting state features ht which
may range from a few to an unbounded number of samples. The maximum number of stored state features, N, is the capacity of the system. Here, we refer to this collection
of N state features as our time-to-space embedding. In the Cognitive Architecture
framework applied to video, these stored states may correspond, for instance, to stable
face representations. Such stored states can be used to cluster different DPCN cause series, or as a substitute for transient causes when the DPCN is in predictive mode, where we use it as a generative model to sample data.
Conventional RNNs only have access to the last ht at each point of time. On the
other hand, in this new model the N-valued list of interesting states can be read from or written to during processing.
We introduce a theoretical framework that formalizes this idea as a nonlinear
generalization of the model presented in De Vries & Principe (1992). More precisely, the
proposed model is a nonlinear time-variable gating AR model. Mathematically, we can define it as

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b + g_{t,h}(h_{t-2}, h_{t-3}, \ldots)), \qquad (3\text{–}1)$$
where gt,h is an adaptive auto-regressive time-variable gating function learned from
data. A motivation for using the gating element g is to allow mathematical expressions such as the CAM, DiffRAM, etc. to be plugged in as pieces of conventional RNNs. These RNNs can be interpreted as simple for-loops where the internal steps are solely inner products. With the inclusion of g we can represent complex operations such as nested for-loops, memory addressing, table lookup, etc. This extension is done to solve the fundamental problem with all recurrent structures, which is the lack of flexibility to control the resolution of a memory trace independently of where it occurs in time. This model can be generalized to also store interesting inputs, thus becoming a nonlinear time-variable gating ARMA model:
$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b + g_{t,x}(x_{t-1}, x_{t-2}, \ldots) + g_{t,h}(h_{t-2}, h_{t-3}, \ldots)), \qquad (3\text{–}2)$$
where gt,x is the moving-average time-variable gating function also learned from
data. We present this ARMA generalization for theoretical completeness. We will not
focus on ARMA models in the following experiments. Examples for the MA part are
Question-Answering tasks in Natural Language Processing (NLP), where the output answers are words in the input (Weston et al., 2014), and, consequently, the model benefits from having stored copies of input keywords.
At each time step these functions g bring a small number of previous stored inputs
and states to the calculation of the current ht . To illustrate this, imagine that at h10, the
value h1 is relevant in the dynamics, so a conventional AR model could implement this
with a term ht−9; but for h11, only a blurred version of h1 would be present in h2. A time-variable function gt,h could instead capture a snapshot of h1 and present it with arbitrary precision for all ht, thereby decoupling temporal resolution from time constants.
Similarly, we could interpret gt,x as arbitrarily long temporal windows in the input series
with time-varying sparse connections. Since both g functions are time varying, they
are more appropriate for real-world non-stationary signal modeling; in other words, g could keep presenting a relevant previous state hi while the input is stationary, and it could change the relevant previous state under top-down control. Back to our Cognitive
Framework, this is why we need to keep representations of different individuals in focus
when predicting their behavior in a video. As soon as a person leaves the camera’s
field of view, the system no longer has to keep its facial representation as part of the
dynamics.
Also, this formulation works as a unifying signal processing interpretation of models
such as Memory Networks Weston et al. (2014) and NTMs Graves et al. (2014). For
Memory Networks, gt,x is implemented as a theoretically infinite window (in practice it is
just as long as the number of input sequences) and the rule for choosing relevant inputs
is learned from data. For NTM gt,h is learned from data as a fixed length addressable
memory. Since we are not interested in storing unprocessed input values, here we focus
in models based on NTM. Note that NTM (Graves et al., 2014) does not implement a
MA memory, gt,x , it only has an AR memory, nor it accepts directly information from a
top down input representing the past information stored by the system in its interaction
with the world. Another missing information in Graves et al. (2014) is that they do not
report how to design an NTM to be used for feature extraction in unsupervised learning.
They only focus in cases where the memory reservoir is large enough to learn simple
copy-pasting operations, which, in our application, is not desired and means overfitting
the input.
The Reinforcement Learning NTM (Zaremba & Sutskever, 2015) has an MA memory that was trained with hybrid reinforcement and supervised learning, but its authors do not formalize their model under a signal processing framework as proposed here. The stability conditions and the space of solutions for the above mentioned models are not well defined, but the formulation in (3–2) could help us define some necessary conditions, which we plan to investigate in future work as well.
In summary, our problem has two complementary sides: the necessity of a temporal resolution that goes beyond the one provided by decaying exponentials, and a time-to-space mapping (xt, ∀t) → z to extract events in time. The temporal
resolution could be solved by appropriate snapshots of states while the space mapping
is simply the concatenated snapshots. In a generative model the input time series is
modeled as

$$P(X) = \prod_{t=1}^{T} P(x_t \mid H_{t-1}, z), \qquad (3\text{–}3)$$
where X = (x1, ... , xT ) and Ht is the sequence’s history up to point t. Note that we do
not conform to the Markov assumption, which would simplify to Ht = xt−1.
Using the associative memory consolidation of the Cognitive Architectures shown
in Figure 1-1b as guidance, in the next sections we present methods for using addressable memories to implement gt,h in a differentiable way that can be learned from data with backpropagation through time.
3.1 Memory Reading and Writing in Recurrent Neural Networks
Here we introduce a reinterpretation of the NTM model under the light of (3–1). We
want to implement gt,h as an addressable memory Mt; in order to do so, we have to define how to address (or read) and write to specific memory locations. Also,
reading and writing should be dynamic and possibly switch during the presentation of
each time point in the input series as in:
$$M_t = \mathrm{write}\left(\mathrm{read}(\Theta_{MM}, M_{t-1}),\, x_t,\, \Theta_{xM}\right). \qquad (3\text{–}4)$$
Here Mt can be a matrix Mt ∈ R^{A×B}, where the A different rows (or memory locations) are separate B-dimensional words. More generally, the memory can be a tensor Mt ∈ R^{A1,A2,...,AN,B}, where we have a spatial organization of N such addressable dimensions Ai. A schematic diagram illustrating (3–1) in such a case is shown in Figure 3-1.
The choice of the appropriate type of memory depends on the properties of the
temporal structure we are trying to capture. We do not have an extensive list of available
Figure 3-1. Schematic diagram of a memory augmented recurrent neural network. We show a representation unfolded in time for a total of 5 time steps; for 3 of these the model receives new inputs and for 3 we observe its outputs, with one time step of overlap between the input and output stages. Notice the interconnection between the recurrent neural network states h and the memory module M. The arrows between them indicate content addressable read and write operations. The intermediate state z, where the input and output stages overlap, can be used as a fixed length representation of the entire input signal, since it has to contain all the information necessary for calculating the output once the architecture stops receiving new inputs.
options and we leave the investigation about the best memory architectures for future
work. Here we define two types of memory.
3.1.1 Type I: DiffRAM
The DiffRAM Type is useful for signals where the characteristics of the events are
homogeneous in time. For example, in the simple Question-Answering task (Weston
et al., 2014) where specific vector valued inputs are the answers. Another use for
DiffRAMs is implementing dynamical systems with simple arithmetic operations on the
input or states.
Note that an issue arises when the events are not homogeneous; since the system does not know what type they are before encountering them, the user must know when it is appropriate to use this type of network. Assume, for example, that we need to store in memory a linear combination of two inputs. In this case it is sufficient to assign one memory location and add the interesting inputs to it when they are presented. The Adding problem in the experimental section illustrates such a case.
3.1.2 Type II: CAM
In the most general case of answering unknown questions after the data has been
presented, it is essential to have knowledge of what is being stored to memory. This
can be implemented using Content Addressable Memories controlled as part of an
RNN dynamics. Also content addressing can implement location addressing using
key-value mapping techniques. Thus, Type II memories can theoretically implement
Type I memories. Back to the face recognition in video example, given in the beginning
of this chapter, a representation h of each person in the video could be stored in a
memory location. To retrieve relevant statistics for classification we should address
these memories by content.
Thus, a motivation for choosing between these two types of memory would be DiffRAMs (Type I) when we are interested in where information is stored, and CAMs (Type II) when the important information itself should be used in the retrieval process. But a final word
about what is the best option for each problem can only be given experimentally. This is
why in the next sections we derive all the options so they can be tested experimentally.
Let us now present the general algorithm that represents the operation of 3–1 when
gt,h is an addressable memory. Given a time t, a multidimensional input time series xt
and the previous states of the architecture, where ht−1 is the AR dynamic state of the
model, rt−1 the previous vector from the read function and wt−1 the previous write vector,
the architecture operates as:
1. Using ht−1, update the reading vector rt = fr(rt−1, ht−1).

2. Read from memory: mt = read(rt, Mt−1).

3. Using the input and the read vector, update the semantic representation RNN: ht = RNN(xt, mt, ht−1).

4. Using ht, update the writing vector wt = fw(wt−1, ht).

5. Write to memory: Mt = write(Mt−1, ht, wt) (see the sketch below).
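A high-level Python sketch of these five steps; every function here (f_r, read, rnn, f_w, write) is a placeholder standing for the learned modules defined in the following sections, not the thesis implementation:

```python
def memory_augmented_step(x_t, h_prev, r_prev, w_prev, M_prev,
                          f_r, read, rnn, f_w, write):
    r_t = f_r(r_prev, h_prev)      # 1. update the reading vector
    m_t = read(r_t, M_prev)        # 2. read from memory
    h_t = rnn(x_t, m_t, h_prev)    # 3. update the semantic representation RNN
    w_t = f_w(w_prev, h_t)         # 4. update the writing vector
    M_t = write(M_prev, h_t, w_t)  # 5. write to memory
    return h_t, r_t, w_t, M_t
```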
Note that although (3–1) uses conventional RNN update equations, we could also use GRU or LSTM-like updates to calculate ht. In the next section, we present in more
detail how to implement the read and write operations for using CAMs, DiffRAM and the
hybrid of both as in NTM.
We would like to mention that in the next sections whenever we define an affine
transformation or a specific nonlinearity, the choices can be substituted by Multilayer
Perceptrons or other appropriate neural networks. In any case, all the parameters of the network should be learned from data using backpropagation through time, where the cost
function depends on the problem as well. This implies that what is stored, and when it is
stored to the content addressable memory is also learned to minimize the cost function.
3.2 Content Addressable Memory
Here we discuss Algorithm 1. To implement content addressing, given the context representation ht−1 as defined in the algorithm above, we calculate an addressing B-dimensional data word as

$$r_t = W_k h_{t-1} + b_k. \qquad (3\text{–}5)$$

Given rt, and using reasoning similar to that behind LSTMs and GRUs, we implement a gating mechanism that allows the network to switch focus
Data: input sequences xt, initial values h0, r0. Weights (W, b) for each neural network.
Result: memorized representations M
for t ∈ {0, 1, ...} do
    ht ← RNN(xt, ht−1, rt−1)
    rt = Wk ht−1 + bk
    gt = logistic(Wg ht−1 + bg)
    rt = gt rt + (1 − gt) rt−1
    (projection-based read)  mt = K(Mt, rt), with mt,i = Mt,i rt (linear) or mt,i = κσ(Mt,i, rt) (Gaussian), σt = σ0 · logistic(Wσ ht−1 + bσ)
    (matching-based read)    βt = relu(Wβ ht−1 + bβ),  rt = softmax(βt rt),  mt = Σi Mt,i rt,i
    wt = logistic(Ww ht + bw)
    et = logistic(We ht + be)
    at = tanh(Wa ht + ba)
    Mt = Mt−1 (1 − wt ⊗ et)
    Mt = Mt + wt ⊗ at
end
Algorithm 1: Dynamic Content Addressable Memory Network.
between long and short term memories using second order interactions. Such gating
mechanism can be implemented as
$$\begin{aligned}
g_t &= \mathrm{logistic}(W_g h_{t-1} + b_g),\\
r_t &= g_t r_t + (1 - g_t) r_{t-1}.
\end{aligned} \qquad (3\text{–}6)$$
Given rt we can use it to read from memory in two ways, with a projection or
retrieving the closest matching memory slot. Depending on the choice we have
either projection-based CAM (CAMp) or matching-based CAM (CAMm). If we use
the projection retrieved value from memory, it is
$$m_t = K(M_t, r_t), \qquad (3\text{–}7)$$

where K is an appropriate kernel, which can be, for example, the linear kernel

$$m_{t,i} = M_{t,i}\, r_t, \qquad (3\text{–}8)$$

or the Gaussian kernel

$$m_{t,i} = \kappa_\sigma(M_{t,i}, r_t), \qquad (3\text{–}9)$$
where Mt,i denotes the i -th row of the matrix Mt and mt,i the i -th value of vector mt .
An advantage of the present framework, compared to other RKHS methods, is that
the kernel size can also be easily learned from data using the same cost function that is
used to train all the other parameters and backpropagation through time. For instance,
here we can calculate the kernel size as
σt = σ0 · logistic(Wσht−1 + bσ), (3–10)
where we used the logistic function to enforce positivity and σ0 is the maximum kernel
size allowed, here fixed to σ0 = 1.
The matching access is based on a scale invariant projection, in other words
we normalize the projection between the generated key rt and the memory M. This
generates a probability distribution over the addressable dimensions:
$$\beta_t = \mathrm{relu}(W_\beta h_{t-1} + b_\beta), \qquad r_t = \mathrm{softmax}(\beta_t r_t), \qquad (3\text{–}11)$$
where βt is the inverse of the temperature of the Softmax and defines how spread the
probability distribution over addressable dimensions is. Addressing is the expectation of
Mt−1 given that distribution
$$m_t = \sum_i M_{t,i}\, r_{t,i}. \qquad (3\text{–}12)$$
Getting mt completes the read function. Now let us talk about how to write. In order to bound the number of adaptive parameters of the write operation, here we use a low-rank outer product, ⊗. Given the updated semantic state ht, we calculate three vectors wt, et and at, which represent the address, erase, and add vectors, respectively; wt is defined over the addressable dimensions A1, ..., AN, while et and at are defined over the word length B. We calculate them as follows:

$$\begin{aligned}
w_t &= \mathrm{logistic}(W_w h_t + b_w),\\
e_t &= \mathrm{logistic}(W_e h_t + b_e),\\
a_t &= \tanh(W_a h_t + b_a).
\end{aligned} \qquad (3\text{–}13)$$
Both wt and et are bounded to [0, 1] because they define respectively if an address
will be affected or erased, which are supposed to be approximately binary operations.
Finally, we write to Mt as
$$\begin{aligned}
M_t &= M_{t-1}(1 - w_t \otimes e_t),\\
M_t &= M_t + w_t \otimes a_t.
\end{aligned} \qquad (3\text{–}14)$$
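A NumPy sketch of one way to realize the matching-based read (3–11)–(3–12) and the erase/add write (3–13)–(3–14). The controller outputs are sampled randomly here for brevity; in the model they are affine functions of h, as above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
A, B = 4, 6                              # memory slots and word length
M = rng.standard_normal((A, B))

# Read: distribution over rows from the (temperature-scaled) projection on the key r.
r = rng.standard_normal(B)
beta = 2.0                               # inverse softmax temperature, eq. (3-11)
p = softmax(beta * (M @ r))
m = p @ M                                # expectation over rows, eq. (3-12)

# Write: bounded address/erase vectors and an add vector, eqs. (3-13)-(3-14).
w = 1.0 / (1.0 + np.exp(-rng.standard_normal(A)))   # logistic -> [0, 1]
e = 1.0 / (1.0 + np.exp(-rng.standard_normal(B)))
a = np.tanh(rng.standard_normal(B))
M = M * (1.0 - np.outer(w, e))           # erase
M = M + np.outer(w, a)                   # add
```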
In the same way that Hopfield networks extend CAMs with an internal dynamic mechanism, we could also extend the previously proposed networks. If we think of
RNNs being implemented as for loops, dynamic CAMs in this context are essentially
nested for loops, where the external loop runs over time t and the internal loop runs for
a fixed number of iterations or until the addressable memory converges to an attractor.
Note that although it has been argued that modern second order RNNs do not rely on attractors (Jozefowicz et al., 2015), here we hypothesize that pairing RNNs with an architecture that does converge has the power to augment the resulting model's capacity. We leave experimental validation of this type of architecture for future work.
3.3 Differentiable Random Access Memory
This section discusses Algorithm 2. To implement random addressing, as
mentioned in the previous chapter, we need to define a probability distribution over
valid address locations that is independent of their content. In a previous work, we
Data: input sequences xt, initial values h0, r0. Weights (W, b) for each neural network.
Result: memorized representations M
for t ∈ {0, 1, ...} do
    ht ← RNN(xt, ht−1, rt−1)
    (Gaussian addressing)     µt = Wµ ht−1 + bµ,  σt² = σ0² · logistic(Wσ² ht−1 + bσ²),  r^c_{t,i} = (1/√(2πσt²)) exp(−(i − µt)²/(2σt²))
    (multinomial addressing)  r^c_t = softmax(Ww ht−1 + bw)
    gt = logistic(Wg ht−1 + bg)
    rt = gt r^c_t + (1 − gt) rt−1
    mt = Σi Mt,i rt,i
    wt = logistic(Ww ht + bw)
    et = logistic(We ht + be)
    at = tanh(Wa ht + ba)
    Mt = Mt−1 (1 − wt ⊗ et)
    Mt = Mt + wt ⊗ at
end
Algorithm 2: Differentiable Random Access Memory Network.
proposed to use a Gaussian distribution (see Santana et al. (2017)), in which case,
the memory addressing resembles the reading operation of Deep Recurrent Attentive
Write (DRAW) networks Gregor et al. (2015). Resembling RNNSearch Bahdanau
et al. (2014), we could also use a multinomial distribution. Each distribution has a
number of free parameters and those parameters should be calculated from h, for the
case of an isotropic Gaussian distribution, we only need to calculate a mean µt and
a variance σ2t . To simplify the equations, we assume a single addressable dimension
A, but all the equations can be easily extended to multiple dimensions by defining
the addressing distribution over all A1, ... ,AN . We can calculate the parameters for a
Gaussian addressing as
$$\begin{aligned}
\mu_t &= W_\mu h_{t-1} + b_\mu,\\
\sigma_t^2 &= \sigma_0^2 \cdot \mathrm{logistic}(W_{\sigma^2} h_{t-1} + b_{\sigma^2}),\\
r^c_{t,i} &= \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\left(-\frac{(i - \mu_t)^2}{2\sigma_t^2}\right),
\end{aligned} \qquad (3\text{–}15)$$
where r^c_t is the addressing distribution, with i ranging over all valid addresses.
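A NumPy sketch of the Gaussian location addressing in (3–15); the number of addresses and the controller outputs µ and σ² are illustrative values rather than learned quantities:

```python
import numpy as np

A = 8                                   # number of addressable locations
mu, sigma2 = 2.3, 0.5                   # in the model these come from affine/logistic maps of h
i = np.arange(A)

r_c = np.exp(-(i - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
r_c = r_c / r_c.sum()                   # renormalize over the finite set of addresses
print(r_c.round(3))                     # mass concentrated around location 2
```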
For a multinomial distribution, there is more freedom in the distribution shape over
the attended addresses. On the other hand, the number of trainable parameters is
proportional to the number of addresses. In an equation, we have
$$r^c_t = \mathrm{softmax}(W_w h_{t-1} + b_w). \qquad (3\text{–}16)$$
Again, we propose to use gating over r to combat vanishing gradients and ease the
process of switching between long and short term dependencies:
$$\begin{aligned}
g_t &= \mathrm{logistic}(W_g h_{t-1} + b_g),\\
r_t &= g_t r^c_t + (1 - g_t) r_{t-1}.
\end{aligned} \qquad (3\text{–}17)$$
For the write operation, wt can be calculated similarly to rt, but from ht instead of ht−1. Given rt, the read is simply the expectation over the valid address values

$$m_t = \sum_{i \in A_1, \ldots, A_N} M_{t-1,i}\, r_{t,i}. \qquad (3\text{–}18)$$

Given wt, et and at, the write operations are similar to the DCAM's, as in (3–14).
3.4 Hybrid Access Memory
Here we extend Graves et al. (2014)'s approach to use multidimensional addressable memories. The model we will describe in detail is shown in Algorithm 3. Hybrid access combines matching-based addressing with differentiable random addressing
Data: input sequences xt, initial values h0, r0. Weights (W, b) for each neural network.
Result: memorized representations M
for t ∈ {0, 1, ...} do
    ht ← RNN(xt, ht−1, rt−1)
    kt = tanh(Wk ht−1 + bk)
    βt = relu(Wβ ht−1 + bβ)
    r^c_t = softmax(βt K(Mt, kt)),  with Ki(X, y) = Xi y / (‖Xi‖ · ‖y‖) or Ki,σ²(X, y) = κσ²(Xi, y) / (Σj κσ²(Xj, Xj) · Σl κσ²(yl, yl))
    gt = logistic(Wg ht−1 + bg)
    r^g_t = gt r^c_t + (1 − gt) rt−1
    s^{Ai}_t = softmax(Ws ht−1 + bs)
    r^{si}_t = r^g_t ⋆ s^{Ai}_t   (repeated for each addressable dimension Ai)
    mt = Σi Mt,i rt,i
    wt = logistic(Ww ht + bw)
    et = logistic(We ht + be)
    at = tanh(Wa ht + ba)
    Mt = Mt−1 (1 − wt ⊗ et)
    Mt = Mt + wt ⊗ at
end
Algorithm 3: Hybrid Access Memory Network.
around the location retrieved by content. They can do so by first retrieving the address
of a given content word and shifting around that address. In equations, we start with a
content key calculated as
kt = tanh(Wkht−1 + bk). (3–19)
Note that here we are bounding the content key to [−1, 1]. Again, the content
addressing is done as an expectation over valid addresses
$$\beta_t = \mathrm{relu}(W_\beta h_{t-1} + b_\beta), \qquad r^c_t = \mathrm{softmax}(\beta_t K(M_t, k_t)), \qquad (3\text{–}20)$$
also β > 0 works as the inverse temperature of the Softmax, and controls how spread
the distribution is. This time, K should be a scale invariant similarity measure. One can
use, for example, either the cosine similarity as in Graves et al. (2014)
$$K_i(X, y) = \frac{X_i\, y}{\|X_i\| \cdot \|y\|}, \qquad (3\text{–}21)$$

or the Cauchy-Schwarz divergence, which is similar to the cosine similarity but computed in an RKHS (Principe et al., 2000),

$$K_{i,\sigma^2}(X, y) = \frac{\kappa_{\sigma^2}(X_i, y)}{\sum_j \kappa_{\sigma^2}(X_j, X_j) \cdot \sum_l \kappa_{\sigma^2}(y_l, y_l)}. \qquad (3\text{–}22)$$
To allow the network to choose between this new proposed address distribution and
the one used in the previous time step, we can integrate those values as
$$\begin{aligned}
g_t &= \mathrm{logistic}(W_g h_{t-1} + b_g),\\
r^g_t &= g_t r^c_t + (1 - g_t) r_{t-1}.
\end{aligned} \qquad (3\text{–}23)$$
Once we have r gt , we compute a multidimensional shift in the address space.
$$\begin{aligned}
s^{A_i}_t &= \mathrm{softmax}(W_s h_{t-1} + b_s),\\
r^{s_i}_t &= r^g_t \star s^{A_i}_t,
\end{aligned} \qquad (3\text{–}24)$$
where ⋆ denotes circular convolution in the Ai -th addressable dimension. We sequentially
repeat (3–24) for all the addressable dimensions to get to the distribution r st . Thus, this
operation is similar to differentiable RAM centered around the content addressed
locations. Since (3–24) smooths the address weights, Graves et al. (2014) suggested
to sharpen them using element-wise power and renormalization. This last step can be
implemented as:
$$\begin{aligned}
\gamma_t &= 1 + \mathrm{relu}(W_\gamma h_{t-1} + b_\gamma),\\
w_t &= \frac{(r^s_t)^{\gamma_t}}{\sum (r^s_t)^{\gamma_t}},
\end{aligned} \qquad (3\text{–}25)$$

where the summation above is across all the address dimensions Ai. Note that (3–24)
is an extended version of the architecture implemented in Graves et al. (2014). Since
the original formulation has only one dimension to address, when storing information to
memory about complex data structures, NTMs have to rely on the content addressable
memory to jump to different locations or force larger shifts. We hypothesize that allowing M to be organized in several dimensions makes it easier to store complex data structures. In the experimental section we validate this hypothesis. For brevity, we will only use the term NTM when referring to this specific implementation. From a practical point of view, st is also a probability distribution calculated with a softmax, and probability distributions are ill defined in high dimensional spaces (van Handel, 2014). When using single precision on GPUs, st can become unstable. In other words, refining st across
several dimensions not only potentially helps to store complex data structures, but it is
also useful for numerical stability reasons.
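A NumPy sketch of the one-dimensional circular-convolution shift (3–24) followed by the sharpening step (3–25); the address weights, shift distribution and γ are illustrative:

```python
import numpy as np

r_g = np.array([0.05, 0.8, 0.1, 0.05])   # address weights after content addressing
s = np.array([0.1, 0.8, 0.1])            # P(shift = -1), P(shift = 0), P(shift = +1)
offsets = [-1, 0, 1]

# Circular convolution: each allowed offset rotates the address weights.
r_s = sum(s_k * np.roll(r_g, k) for k, s_k in zip(offsets, s))

gamma = 2.0                               # sharpening exponent, eq. (3-25)
w = r_s ** gamma
w = w / w.sum()
print(w.round(3))                         # re-concentrated around location 1
```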
The write probability, wt can be calculated similarly to rt , but from the updated state
ht , instead of ht−1. Also, the read and write functions are the same as those for DRAM,
see (3–18) and (3–14), respectively.
Note that whenever we refer to NTM1D , we mean an NTM with a single addressable
dimension as implemented in Graves et al. (2014). For two dimensions, we refer to it
as NTM2D . For simplicity we use CAM and DiffRAM instead of “NTM with only content
addressing” or “NTM with random access only”, respectively.
3.5 Experiments
In this section we compare the proposed architectures against each other and against the RNN architectures proposed in the previous chapter. We are especially interested in problem-specific cost functions during training and in generalization to unseen samples. The success of applying the studied memory-augmented RNNs to complex problems, such as consolidating and clustering DPCN causes, depends on their ability to learn simple operations such as memorization, copying, memory transformation, etc. The next
experiments were designed by this author and others to test some of these abilities as
much as possible while keeping the tasks simple and reproducible.
The preliminary results are focused on a modified Adding problem Hochreiter &
Schmidhuber (1997) to quickly validate the benefits of adding an external memory
to simple RNNs. The Copy problem (Graves et al., 2014) tests the ability of the networks to remember variable-length sequences as well as to learn simple algorithms.
Finally, we propose the Rotation problem to test the ability of the proposed architectures
to work as complex content addressable memories and generative models for time
series. The Rotation problem is focused around the MNIST dataset to also investigate
the clustering properties of the internal self-organized representations by these
architectures.
3.5.1 Initialization and Algorithmic Choices
In all the following experiments, we initialized the memory tensors M0 with zeros
for DiffRAM and CAMp only. Note that for CAMm and NTM the norm of M is part of the
denominator when calculating the cosine similarity, in that case, we initialize the memory
with a small constant with value 0.001. The biases b were initialized with zeros and the
weight matrices using Glorot initialization Glorot & Bengio (2010). The hidden-to-hidden
matrices to the RNNs were initialized with random orthonormal matrices. Another
exception is that the bias utilized to calculate the addressing word for NTM is initialized
to random values to avoid zeros in the denominator of (3–21).
For NTM1 we limited the address shifts to (−2,−1, 0, 1, 2). The shifts for NTM2 is
limited to (−1, 0, 1) for each dimension to guarantee that both methods have similar
capabilities and facilitate comparison.
We tested CAMm using the cosine similarity measure as in NTM, and CAMp with
Gaussian kernel with adaptive kernel size.
All recurrent networks had 100 hidden neurons. This value ties the sizes of all the
other layers that get the outputs of the RNNs as inputs. We trained the models with
ADAM optimization algorithm with learning rate 0.0001.
3.5.2 Adding
In this modified adding problem we force the networks to have very few hidden states and test their ability to generalize to completely unseen input lengths. Here we compared NTMs using a GRU (2–10) as the internal RNN to generate the states h. The final output of each network is yt = logistic(Wy ht + by). The other compared methods are the simple RNN, the GRU and an LSTM. All the NTM models have only 10 hidden states. The memory tensor is M ∈ R^{10,1} for NTM1, CAMm and CAMp. It is M ∈ R^{3,3,1} for DiffRAM and
NTM2. The conventional RNN models have 20 hidden states.
The input signal for the adding problem is made of two variable length sequences,
the first one with real values between -1 and 1 and the second with flags -1, 0 or 1,
where only two elements assume the value 1. These two elements mark the numbers in
the first sequence that should be added. Letting the two marked values be called X1 and
X2, the target for a given input is $0.5 + \frac{X_1 + X_2}{4}$, which ensures the target is a real value
between 0 and 1. The compared models were trained with Adam Kingma & Ba (2014)
with learning rate equal to 10−3 for 10000 gradient updates with batch size 100, which is
a total of 106 random sequences for training. The gradients for training were clipped to
maximum norm 10 Pascanu et al. (2012) to avoid exploding gradients. During training,
the minimum length of the input sequences is 50, the maximum is 70 and the error
signal is only emitted after the entire sequence is presented. During test, the length is
100 to test the ability of the compared methods to generalize to harder problems. Target values between 0 and 1 justify the choice of the output nonlinearity. The cost function for training is the negative log-likelihood $L = -d_t \log(y_t) - (1 - d_t)\log(1 - y_t)$. According
Figure 3-2. Adding problem. (a) and (b) are from a sample input sequence of the test set. (c) shows the values of Mt for the best performing method, DRAM, when that input sequence was presented. Notice the change in contrast in the third row when the second positive peak in the second row is presented. That second peak indicates the second value to be added, and the change in contrast is the result of the model accumulating its value in the memory unit to complete the addition.
Table 3-1. Adding problem. This table compares our proposed architectures with the original Neural Turing Machines and classic recurrent neural network architectures. Notice that the model with differentiable random access memory performs best, as expected in arithmetic applications.

          NTM1   NTM2   CAMm   CAMp   DiffRAM   Simple RNN   GRU    LSTM
ACC (%)   81.2   82     88.8   63.2   92.2      13.4         64.4   65.4
to Hochreiter & Schmidhuber (1997), a sequence is considered correctly added if the output-target absolute error is lower than 0.04. We show in Table 3-1 the accuracy for the test set. Accuracy is calculated as the percentage of outputs with error less than 0.04.
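For reproducibility, a NumPy sketch of how one such training sequence can be generated; the number of −1 flags is an assumption, since the text only constrains the two +1 markers:

```python
import numpy as np

def adding_sequence(length, rng):
    values = rng.uniform(-1.0, 1.0, size=length)         # first sequence: real values in [-1, 1]
    flags = np.zeros(length)                              # second sequence: -1, 0 or 1
    flags[rng.choice(length, size=length // 3, replace=False)] = -1.0   # assumed proportion of -1s
    i, j = rng.choice(length, size=2, replace=False)      # exactly two positions marked with 1
    flags[[i, j]] = 1.0
    target = 0.5 + (values[i] + values[j]) / 4.0          # lies in [0, 1] by construction
    return np.stack([values, flags], axis=1), target

rng = np.random.default_rng(0)
x, d = adding_sequence(rng.integers(50, 71), rng)          # training lengths lie in [50, 70]
```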
Before discussing the results, it is important to note that those methods could achieve better results with more training, since the ADAM algorithm performs an automatic form of learning rate annealing, or with larger hidden states. This problem can
be easily solved by LSTMs Hochreiter & Schmidhuber (1997) and even simple RNNs
when properly initialized and given enough hidden states Sutskever et al. (2013). But,
previous works have not investigated how the trained networks generalize to sequence
lengths not present in the training set. Nevertheless, it is interesting to note the speed
of convergence for such a simple setting. Here we observed that NTM2 learned faster
than NTM1, which is surprising given the problem's simplicity, but it also did not reject our hypothesis that smaller shifts across several dimensions are better than larger addressing shifts across a single dimension. We believe that this is due to too much spreading in
the shifting operation and we plan to investigate solutions for this in the future. Also,
CAMm had better results than CAMp which makes sense, since here we want to store
values in the memory and do not need the nonlinear transformation performed in RKHS
to solve this problem.
In the first and second rows of Figure 3-2 we show a sample test sequence. In
the third row we show the memory of the best method DRAM evolving through time. In
this third row we can note a sudden change in color contrast when the second element
to be added is presented (a little after t = 40). Thus, this network was the fastest to
learn to simply store values to its extra memory and retrieve the result by the end of the
sequence, just like one would add values using an ALU in digital circuits.
From this first experiment we conclude that although we may feel tempted to always
use NTMs with both content and random addressing, which seems to be the most
complete memory architecture, we should still account for the problem complexity and
the number of adaptive parameters we are allowed to use.
3.5.3 Copy Problem
Graves et al. (2014) proposed the copy problem as a sequence-to-sequence
transformation. The input sequence is a sequence of 8-bit vectors with a minimum length of 1 and a maximum length of 10 vectors, followed by an "end of sequence" marker. That
Table 3-2. Copy problem: percentage of correctly copied bits. Comparison between two of our proposed methods and the Neural Turing Machines (NTM). Copying values from memory is a purely content addressable challenge, where the value to be copied is retrieved by content; thus NTM1 performs best. But we show that our NTM2 performs better than the original NTM1 when a larger region of the memory space needs to be addressed in a single step.

          NTM1 (3 shifts)   NTM1 (7 shifts)   NTM2   DRAM
ACC (%)   99.2              50.1              80.3   64.1
pattern is concatenated with a zero-valued sequence of maximum length
20. During the presentation of the zeros, the network is expected to output the same
sequence presented before the marker in the same order as the original input. In the
test set, the length of the input sequences is 100. The cost function for training was the
negative log-likelihood. The test accuracy was measured as the average number of bits
correctly represented in the output.
This problem does not fit well our generative model (3–3) since all the samples of
the sequence are statistically independent vectors, in which case z could not be smaller
than the input sequence itself for an exact generation. Nevertheless, this problem
revealed the advantages and limitations of the models we studied.
First, as shown in Graves et al. (2014), LSTMs fail to completely learn the training set and do not generalize well to the test set. We successfully reproduced the results using NTM1 when the location shift was fixed to a maximum of 3 values (i.e. no shift, 1 to the left or 1 to the right). The model learned the training set and
generalized well to the test set. NTM2 and DiffRAM learned the training set faster, but
didn’t generalize well to the test set. We observed that NTM2 overfitted the training
set, using a small segment of the multidimensional memory to implement a simple 1D
memory. In other words, we used a square memory of size 11x11 to train NTM2 and the
model focused on its main diagonal, running out of space when larger sequences were
used as input. We could obtain better results with different 2D memory configurations
such as 5x20, which made it easier for NTM2 to generalize to longer sequences, but this
Figure 3-3. Neural Turing Machine operations in the Copy problem. Upper row: write and read distributions over memory locations. Lower row: desired and actual output of the network; the cost is only calculated over the second half of the output. The first half of the input (from rows 0 to 20) is ignored because those are the time steps when the input is being presented.
only confirms that the linear nature of this problem does not require multidimensional
memories and is not the appropriate choice to compare these architectures.
We summarize the results in the test set in Table 3-2. In Figure 3-3 we can see the
position of the write and read distributions of the NTM1 that obtained the best results
in our experiments. Note how the network first writes to a linear sequence of locations and later reads from the very same positions in order, just like one would write a simple program for copy-pasting a sequence using an intermediate memory.
Nevertheless, NTM1 also had problems when the input series was larger than the
number of memory locations. In such cases, the network didn’t readapt its memory.
Table 3-3. Sequence generation cost function (negative log-likelihood, NLL) on the test set.

      NTM1    NTM2    LSTM
NLL   0.733   0.654   0.748
Further investigation is necessary to test whether this problem is due only to the lack of structure in the input or whether the model is also overfitting.
3.5.4 Sequence Generation
In this experiment we wanted to test the ability of the networks to generate
sequences given only the first frame and how long the output sequence should be. This
is to test how well the studied architectures work as associative memories themselves.
The input frame is a 784-point vector made of a flattened 28 by 28 pixels image from
the MNIST dataset. The 785th point of the input vector denotes how long should be the
output video. This last input point was calculated as
$$\frac{l - 5}{10}, \qquad (3\text{–}26)$$

where $l \in \{1, 2, \ldots, 10\}$ for training and $l = 20$ for testing. The desired sequence is a
smooth rotation of the input digit from 0 to 180 degrees. The spacing between each
angle varies with the sequence length. In this experiment we compared NTM1, NTM2
and LSTM trained using the negative log-likelihood between output and desired time
series as cost function. The final cost for the test set is shown in Table 3-3. Sample
generated sequences are shown in Figure 3-4. To further understand the operations of
these networks, we trained a classifier using the representations of the dynamic states
right after the presentation of the input sequence z = h1. The training was done using
the first 3000 samples from the MNIST training set; testing was done on the first 3000
samples of the test set. The classification accuracy is shown in Table 3-4.
The performance of both NTMs was similar, with NTM2 slightly better. Both were
better than the compared LSTM. We noticed that the classification accuracy using the
first hidden states was poor when compared to supervised DNNs for classification.
Figure 3-4. Sample desired and generated sequences using NTM2 and LSTM (rows: desired, NTM2, LSTM). The compared methods never had to generate sequences longer than 10 frames during training. Notice, nevertheless, that the proposed NTM2 was able to continue dreaming about more images, while the LSTM overfits to the training sequence length.
Table 3-4. Classification accuracy. We used the sequence generators' dynamic states as features. In other words, we classified the z representations; see Figure 3-1 for context.

          NTM1   NTM2   LSTM
ACC (%)   85     86     83
Visualizing the feature embedding, we noticed an overlap between the representations of the digits "4" and "9". Further testing is needed to understand whether the network was not good for classification because the representations were independent of the input shape; in other words, we have to test whether the network learned a shape-invariant rotation operation. In future work, we should test this hypothesis by rotating images from an external dataset.
In the next chapter, we use the memory mechanism studied here as part of a novel
architecture for video prediction.
CHAPTER 4
ADDRESSABLE MEMORIES AS PART OF A DIFFERENTIABLE GRAPHICS PIPELINE FOR VIDEO PREDICTION
The ability to predict future frames in video has several applications, for example
we can cite video compression, planning for robotics and image enhancement. Video
prediction was one of the original goals of DPCNs (Principe & Chalasani, 2014). In
this chapter we investigate a neural network architecture and statistical framework that
models frames in videos using principles inspired by computer graphics pipelines. The
proposed model explicitly represents "sprites", or percepts inferred from the maximum likelihood of the scene, and infers their movement independently of their content. We impose architectural constraints that force the resulting architecture to behave as a recurrent
what-where prediction network. The sprites in the scene are stored in an addressable
memory similarly to those investigated in the previous chapter, thus avoiding the
necessity of explicitly recalculating the shape of objects in the scene at every time step,
as usually done when using conventional RNNs for video prediction. We snapshot the
sprites using several mechanisms and address them using the methodology explained
in the previous chapter. Thus, the model specified here can be seen as a member of the
family of models described in the previous chapter. Nevertheless, this new architecture
has modules specific for video generation which we describe later in this chapter.
Developing what-where prediction networks with snapshot perceptions was one of the main goals set for this thesis; this chapter shows how we achieved it. We call this model Perception Updating Networks.
4.1 On the Need of a Differentiable Computer Graphics Pipeline
The current computer graphics pipelines are the result of efficient implementations
required by limited hardware and high frequency output requirements. These requirements
were also achieved with the use of explicit physics and optic constraints and modeling
with constantly improving data structures (Shirley et al., 2015).
Figure 4-1. Steps of the 2D graphics or rendering pipeline that inspired our model. We start with geometric primitives such as sprites, vectors, points, etc. The first step of the pipeline is the modeling transformation, which represents the geometric primitives in world coordinates. Clipping is the process of discarding the world representations that will not appear in the final image due to the limited view angle. The viewing transformation is the act of rotating, translating and deforming the objects in the view to comply with the point of view of the camera. Scan conversion finalizes the image generation with an array of values that compose the image to be displayed.
In contrast, Convolutional Neural Networks use brute-force search and matching to obtain features that are scale, rotation and translation invariant. Also, for a long time in machine
learning, image (Olshausen et al., 1996) and video (Hurri & Hyvarinen, 2003) generative
models had been investigated with statistical approaches that model images down to
the pixel level (Simoncelli & Olshausen, 2001), sometimes assuming neighborhood
statistical dependencies (Osindero & Hinton, 2008). In video prediction, the current state
of the art uses variations of deep convolutional recurrent neural networks (Kalchbrenner
et al., 2016) (Lotter et al., 2016) (Finn et al., 2016b).
As a parallel to the classic machine learning approach to image interpretation and
prediction, there is a growing trend in the deep learning literature for modeling vision
as inverse graphics (Kulkarni et al., 2015)(Rezende et al., 2016)(Eslami et al., 2016).
These approaches can be interpreted into two groups: supervised and unsupervised
vision as inverse graphics. The supervised approach assumes that during training an
image is provided with extra information about its rotation, translation, illumination,
etc. The goal of the supervised model is to learn an auto-encoder that explicitly factors
out the content of the image and its physical properties. The supervised approach is
illustrated by Kulkarni et al. (2015).
The unsupervised approach requires extra architectural constraints, similar to
those assumed in computer graphics. For example, Reed et al. (2016) modeled the
content of a scene with a Generative Adversarial Network (Goodfellow et al., 2014)
and its location with Spatial Transformer Networks (Jaderberg et al., 2015). The full
model is adapted end-to-end to generate images whose appearance can be changed
by independently modifying the ”what” and/or ”where” variables. A similar approach
was applied to video generation with volumetric convolutional neural networks (Vondrick
et al., 2016). In two papers by Google DeepMind (Rezende et al., 2016) (Eslami et al.,
2016) they improved the ”where” representations of the unsupervised approach and
modeled the 3D geometry of the scene. This way they explicitly represented object
rotation, translation, camera pose, etc. Their approaches were also trained end-to-end
with REINFORCE-like stochastic gradients to backpropagate through non-differentiable
parts of the graphics pipeline (Rezende et al., 2016) or to count the number of objects in
the scene (Eslami et al., 2016). Those papers also used Spatial Transformer Networks
to model the position of the objects in the scene, but they extended it to 3D geometry so
it could also model rotation and translation in a volumetric space.
Other approaches in machine learning inspired by the graphics pipeline and computer vision geometry use physical constraints to estimate the depth of each pixel in the scene and camera pose movements to predict frames in video (Mahjourian et al.,
2016) (Godard et al., 2016).
The new approach we developed is closer to the unsupervised approach of vision
as inverse graphics. More precisely, here we investigate frame prediction in video.
Contrary to the work by Reed et al. (2016) here we first limit ourselves to simple
synthetic 2D datasets and learning models whose representations can be visually
interpreted. This way we can investigate exactly what the neural network is learning
and validate our statistical assumptions. Most importantly, we can verify what the
memory unit of our architecture is able to snapshot and memorize from the scene. Also,
we investigate the behavior of Spatial Transformer Networks and question it as the
default choice when limited compute resources are available and no scale invariance is
required.
First in the next section we will pose a statistical model that is appropriate for
machine learning but inspired by the graphics pipeline. This will allow us to train a
memory augmented neural network using end-to-end backpropagation, just like we
did in the last chapter. From an experiment perspective, here instead of learning to
represent variable length video streams as fixed length vectors, we want to learn to
predict future frames in video using the extra power of addressable memories to avoid
redundant computations.
4.2 A 2D Statistical Graphics Pipeline
This section starts with a high level description of the 2D graphics pipeline, followed
by a discussion of how to implement it with neural network modules, and finally we
define a formal statistical model.
Figure 4-2. How to get similar results using convolutions with delta functions and spatial transformers (panels: convolution and spatial transformer, with their respective results). The input sprite is 8 × 8 pixels and the outputs are 64 × 64 pixels. Note that in the convolution the result shape is rotated 180 degrees and its center is where the delta equals one, at pixel (x = 16, y = 16). Note also that the edges of the spatial transformer results are blurred due to bilinear interpolation. The A matrix can be read as "zoom out" 8 times and translate up and left by a quarter of the resulting size.
4.2.1 Preliminary Considerations and Relevant Literature Review
The 2D graphics pipeline starts from geometric primitives and follows with modeling
transformations, clipping, viewing transformations and finally scan conversion for
generating an image, see Figure 4-1. Here, we will deal with previously rasterized
bitmaps, i.e. sprites, and will model the translation transformations, rotation and clipping with differentiable operations. This way, the steps in the pipeline can be defined as layers
of a neural network and the free parameters can be optimized with backpropagation.
For our neural network implementation, we assume a finite set of sprites (later we
generalize it to infinite sprites) that will be part of the frames in the video. The image
generation network selects a sprite, s, from a memorized sprite database $S_{i \in \{1, \ldots, K\}}$ using an addressing signal c:

$$s = \sum_j c_j S_j, \quad \text{where} \quad \sum_j c_j = 1. \qquad (4\text{–}1)$$
(4–1)
Note that this is the same location addressing mechanism discussed in the previous
chapter. For interpretable results it would be optimal to do one-hot memory addressing
where cj = 1 for Sj = S and cj = 0 otherwise. Note that (4–1) is differentiable w.r.t
to both cj and Sj so we can learn the individual sprites from data. We can force cj add
up to 1 using the softmax nonlinearity. This approach was inspired by the recent deep
learning literature on attention modules (Bahdanau et al., 2014) (Graves et al., 2014)
and a more detailed discussion is shown in the previous chapter.
When the number of possible sprites is too large it is more efficient to do a
compressed representation. Instead of using an address value c we use a content
addressable memory where the image generator estimates a code z that is then
decoded to the desired sprite with a (possibly nonlinear) function d(z). The addressing value z can be interpreted as a latent representation and d(z) as a decoder, which is essentially a content addressable memory as discussed in previous chapters. Also, we can use the recent advances in neural networks for generative models to set up our statistical model. We will revisit this later in this section.
The translation transformation can be modeled with a convolution with a Delta
function or using spatial transformers. Note that the translation of an image I (x , y) can
be defined as
$$I(x - \tau_x, y - \tau_y) = I(x, y) \star \delta(x - \tau_x, y - \tau_y), \qquad (4\text{–}2)$$
where ⋆ denotes the image convolution operation. Clipping is naturally handled in such a
case. If the output images have finite dimensions and δ(x−τx , y−τy) is non-zero near its
border, the translated image I (x−τx , y−τy) will be clipped. Another way of implementing
the translation operation is using Spatial Transformer Networks (STN) (Jaderberg et al.,
2015). An implementation of STN can be defined in two steps: resampling and bilinear
interpolation. Resampling is defined by moving the position of the pixels $(x, y)$ in the original image using a linear transform to new positions $(\tilde{x}, \tilde{y})$ as

$$\begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} = A \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad \text{where} \quad A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \end{bmatrix}. \qquad (4\text{–}3)$$
We assume the coordinates in the original image are integers 0 ≤ x < M and
0 ≤ y < N, where M × N is the size of the image I . Once the new coordinates are
defined, we can calculate the values of the pixels in the new image $\tilde{I}$ using bilinear interpolation:

$$\tilde{I}(\tilde{x}, \tilde{y}) = w_{x_1,y_1} I(x_1, y_1) + w_{x_1,y_2} I(x_1, y_2) + w_{x_2,y_1} I(x_2, y_1) + w_{x_2,y_2} I(x_2, y_2), \qquad (4\text{–}4)$$

where $(x_1, x_2, y_1, y_2)$ are integers, $x_1 \leq \tilde{x} < x_2$, $y_1 \leq \tilde{y} < y_2$, and
$$\begin{aligned}
w_{x_1,y_1} &= (\lfloor \tilde{x} \rfloor + 1 - \tilde{x})(\lfloor \tilde{y} \rfloor + 1 - \tilde{y})\\
w_{x_1,y_2} &= (\lfloor \tilde{x} \rfloor + 1 - \tilde{x})(\tilde{y} - \lfloor \tilde{y} \rfloor)\\
w_{x_2,y_1} &= (\tilde{x} - \lfloor \tilde{x} \rfloor)(\lfloor \tilde{y} \rfloor + 1 - \tilde{y})\\
w_{x_2,y_2} &= (\tilde{x} - \lfloor \tilde{x} \rfloor)(\tilde{y} - \lfloor \tilde{y} \rfloor).
\end{aligned} \qquad (4\text{–}5)$$
To avoid sampling from outside the image we clip the values $\lfloor \tilde{x} \rfloor$ and $\lfloor \tilde{x} \rfloor + 1$ between 0 and M, and the values $\lfloor \tilde{y} \rfloor$ and $\lfloor \tilde{y} \rfloor + 1$ between 0 and N. We omitted that in (4–5) for conciseness. Note that (4–4) is piecewise differentiable w.r.t. $I$.
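A NumPy sketch of the resampling and bilinear interpolation of (4–3)–(4–5), using inverse mapping and clamping out-of-range coordinates; the function name and the example transform are illustrative:

```python
import numpy as np

def affine_warp(I, A):
    M, N = I.shape
    out = np.zeros_like(I)
    for yo in range(M):
        for xo in range(N):
            xs, ys = A @ np.array([xo, yo, 1.0])          # source coordinates, eq. (4-3)
            x1, y1 = int(np.floor(xs)), int(np.floor(ys))
            wx, wy = xs - x1, ys - y1                     # fractional offsets
            x1c, x2c = np.clip([x1, x1 + 1], 0, N - 1)    # clamp to stay inside the image
            y1c, y2c = np.clip([y1, y1 + 1], 0, M - 1)
            out[yo, xo] = ((1 - wx) * (1 - wy) * I[y1c, x1c] + (1 - wx) * wy * I[y2c, x1c]
                           + wx * (1 - wy) * I[y1c, x2c] + wx * wy * I[y2c, x2c])
    return out

I = np.zeros((64, 64)); I[28:36, 28:36] = 1.0             # an 8 x 8 sprite in a 64 x 64 frame
A = np.array([[1.0, 0.0, -16.0], [0.0, 1.0, -16.0]])      # samples I(x-16, y-16): shifts the sprite by (+16, +16)
warped = affine_warp(I, A)
```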
We can define translation with

$$A = \begin{bmatrix} 1 & 0 & \tau_x \\ 0 & 1 & \tau_y \end{bmatrix}. \qquad (4\text{–}6)$$

Also, we can rotate the image $\rho$ radians counterclockwise with

$$A = \begin{bmatrix} \cos\rho & \sin\rho & 0 \\ -\sin\rho & \cos\rho & 0 \end{bmatrix}. \qquad (4\text{–}7)$$
Image rescaling is achieved in this framework by rescaling the square submatrix $A_{1:2,1:2}$. We illustrate in Fig. 4-2 how to get similar results using convolutions
with a delta-function and spatial transformers.
Our proposed statistical framework is based on the Variational Autoencoding Bayes
framework (Kingma & Welling, 2013). In the next subsection we review the Gaussian
and Gumbel-Softmax variational autoencoders (Jang et al., 2016) (Maddison et al.,
2016).
4.2.2 Variational Autoencoding Bayes
The Variational Autoencoding Bayes framework, also known as the variational autoencoder (VAE), proposed by Kingma & Welling (2013), uses neural networks to invert an intractable generative model $p_\theta(z) p_\theta(x|z)$ with unknown parameters $\theta$ and unobserved
Figure 4-3. Variational autoencoder graphical model. An observable variable x is generated from unobserved factors z according to pθ(z)pθ(x|z), where the parameters θ are not observed either. We approximate z by learning a tractable recognition model qϕ that approximates the posterior pθ(z|x), given a known prior pθ(z); here both take the form of neural networks. Solid lines represent the generative model and dashed lines the learnt recognition (or inference) model.
latent variables z , as depicted in Figure 4-3. The neural network is used to infer qϕ(z |x)
that approximates the true posterior pθ(z |x), assuming a known prior pθ(z).
Given a set of observations x ∈ {x1, x2, ... , xM}, the recognition model qϕ(z |x) is
trained to optimize the evidence lowerbound (ELBO):
L(θ,ϕ; xi) = −DKL(qϕ(z |xi)||pθ(z)) + Eqϕ(z |xi ) [log pθ(xi |z)] , (4–8)
where DKL(q||p) = Σi q(i) log(q(i)/p(i)) is the Kullback–Leibler divergence between two
distributions and Eqϕ(z|xi)[log pθ(xi|z)] is the autoencoder reconstruction cost (e.g. mean
square error for continuous variables and binary cross-entropy for discrete variables). In
practice (4–8) assumes different forms depending on the prior distribution pθ(z). The
original formulation of VAE (Kingma & Welling, 2013) assumed a Gaussian prior. Also of
interest for the present work is the Categorical distribution (Jang et al., 2016) (Maddison
et al., 2016).
In order to make −DKL(qϕ(z|xi)||pθ(z)) differentiable with respect to the parameters
ϕ, the reparametrization trick was proposed (Kingma & Welling, 2013), which allows
us to sample and differentiate through qϕ(z|x).
Figure 4-4. Block diagram of a Variational Autoencoder with Gaussian prior and
reparametrization trick.
The trainable parameters are in the encoder and decoder networks (MLP or CNN) and in
the linear layers (conventional fully connected) that generate the mean m and standard
deviation v. The cost function is given by equations (4–8) and (4–10). Notice that in
practice, as represented, the weights of the encoder network are shared between the m
and v pathways.
For the Gaussian prior with zero mean and identity covariance, the reparametrization
trick takes the form:
Gaussian: z ∼ qϕ(z |x) = mϕ(x) + vϕ(x)ξ, (4–9)
where mϕ and vϕ are learned mean and standard deviation functions and ξ ∼ N(0, I).
For this zero mean, identity covariance Gaussian case, the Kullback–Leibler divergence
becomes:

DKL(qϕ(z|xi)||pθ(z)) = −(1/2) Σj (1 + 2 log vj(xi) − mj(xi)² − vj(xi)²), (4–10)

where the sum runs over the dimensions of the latent vector z.
For illustration purposes and to make clear how to use VAE in practice, we show a
block diagram of an autoencoder that uses shared layers to calculate m and v as usually
done in practice (Kingma & Welling, 2013), see Figure 4-4.
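As a concrete illustration of (4–9) and (4–10), the following Python sketch (our own illustration; the encoder outputs m and v are taken as given) draws a latent sample with the reparametrization trick and evaluates the Gaussian Kullback–Leibler term:

import numpy as np

def sample_gaussian_latent(m, v):
    """Reparametrization trick of equation (4-9): z = m + v * xi, with xi ~ N(0, I)."""
    xi = np.random.randn(*m.shape)
    return m + v * xi

def gaussian_kl(m, v, eps=1e-8):
    """KL(q(z|x) || N(0, I)) summed over latent dimensions, as in (4-10)."""
    return -0.5 * np.sum(1.0 + 2.0 * np.log(v + eps) - m ** 2 - v ** 2, axis=-1)

# Example with a batch of 4 inputs and a 10-dimensional latent code.
m = np.random.randn(4, 10)          # mean produced by the encoder
v = np.abs(np.random.randn(4, 10))  # standard deviation produced by the encoder
z = sample_gaussian_latent(m, v)
kl = gaussian_kl(m, v)              # one KL value per example in the batch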
Another prior distribution pθ(z) relevant for the present work is the Categorical
distribution. For example, z may be a vector of one-hot encoded variables,
in other words a sparse vector where all elements are 0 except one, which has the value
1. The Gaussian reparametrization trick is not a good fit in this case and, complicating the
issue further, sampling discrete random variables is not a differentiable operation. To
cope with this problem, an approximation using Softmax distributions and Gumbel noise
was proposed (Jang et al., 2016) (Maddison et al., 2016). This approximation, called
Gumbel-Softmax uses the following reparametrization trick (Jang et al., 2016):
Gumbel-Softmax: z ∼ qϕ(z |x) = softmax(mϕ(x) + ζ), (4–11)
where ζ is a random variable sampled from the Gumbel distribution using the inverse
CDF method
ζ = −log(−log u), u ∼ U(0, 1). (4–12)
Using the Gumbel-Softmax reparametrization, the Kullback–Leibler divergence part of
the ELBO is:
DKL(qϕ(z|x)||pθ(z)) = softmax(mϕ(x)) · (log softmax(mϕ(x)) − log(1/M)), (4–13)
where M is the dimensionality of the latent space z and the product is summed over the
M categories. A block diagram similar to the one presented in Figure 4-4 can be used to
train a Gumbel-Softmax autoencoder (Jang et al., 2016), with the main differences being
in how the reparametrization trick is defined and in this new cost function.
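The Gumbel-Softmax reparametrization of (4–11)–(4–13) can be sketched in the same way (a temperature that sharpens the softmax is commonly used in practice but is omitted here to match (4–11)):

import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_sample(logits, eps=1e-20):
    """Reparametrization of equations (4-11)-(4-12): softmax(logits + Gumbel noise)."""
    u = np.random.uniform(size=logits.shape)
    zeta = -np.log(-np.log(u + eps) + eps)   # Gumbel(0, 1) via the inverse CDF
    return softmax(logits + zeta)

def categorical_kl_to_uniform(logits, eps=1e-20):
    """KL between softmax(logits) and a uniform prior over M categories, as in (4-13)."""
    M = logits.shape[-1]
    q = softmax(logits)
    return np.sum(q * (np.log(q + eps) - np.log(1.0 / M)), axis=-1)

logits = np.random.randn(4, 20)        # e.g. unnormalized scores over 20 positions
z = gumbel_softmax_sample(logits)      # soft one-hot samples
kl = categorical_kl_to_uniform(logits)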
In the next subsection we use our preliminary considerations and the VAE
definitions to propose the statistical framework we want to optimize here.
4.2.3 Proposed Statistical Framework
This section states the main theoretical contributions we developed in this chapter.
Considering the tools defined above, we can define a statistical model of 2D images that
explicitly represents sprites and their positions in the scene. We can use the free energy
of this statistical model to optimize a neural network. Let us start with a static single
frame model and later generalize it to video.
Let an image I ∼ pθ(I) be composed of a sprite s ∼ pθ(s) centered at coordinates (x, y)
in the larger image I. Denote these coordinates as a random variable
δxy ∼ pθ(δxy), where θ are the model parameters. pθ(δxy) can be factored into two marginal
categorical distributions Cat(δx) and Cat(δy) that model the probability of each
coordinate of the sprite independently. For the finite sprite dataset, pθ(s) is also a
categorical distribution conditioned on the true sprites. For this finite case the generative
model can be factored as
pθ(I , s, δ) = pθ(s)pθ(δxy)p(I |s, δxy), (4–14)
assuming that “what”, s, and “where”, δxy , are statistically independent. Also, in such
case the posterior
pθ(s, δ|I ) = pθ(s|I )p(δxy |I ) (4–15)
is tractable. One could use for instance Expectation-Maximization or greedy
approaches like Matching Pursuit to alternate between the search for the position and
fitting the best matching shape. For the infinite number of sprites case, we assume
that there is a hidden variable z from which the sprites are generated as p(s, z) =
pθ(z)pθ(s|z). In such case our full posterior becomes
pθ(z, s, δ|I) = pθ(z, s|I) p(δxy|I) = pθ(z|I) pθ(s|I, z) p(δxy|I). (4–16)
We can simplify (4–16) assuming pθ(z |s) = pθ(z |I ) for simple images without
ambiguity and no sprite occlusion. For a scalable inference in the case of unknown θ
and z and intractable pθ(z |s) we can use the auto-encoding variational Bayes (VAE)
approach (Kingma & Welling, 2013). Using VAE we define an approximate recognition
model qϕ(z|s). In such case, the log-likelihood of the i.i.d. images I is log pθ(I1, ..., IT) = Σi log pθ(Ii) and
log pθ(Ii) = DKL(qϕ(z|si)||pθ(z|si)) + DKL(pθ(z|si)||pθ(z|Ii)) + L(θ, ϕ, δxy, Ii). (4–17)
Again, assuming the approximation pθ(z|s) = pθ(z|I), we have DKL(pθ(z|si)||pθ(z|Ii)) =
0 and the free energy (or variational lower bound) term equal to
L(θ, ϕ, δ, I) = −DKL(qϕ(z|si)||pθ(z)) + Eqϕ(z|s,δ)pθ(δ|I)[log pθ(I|z, δ)], (4–18)
where we dropped the subindices xy and i to simplify reading. Here we would like to
train our model by maximizing the lower bound (4–18), again inspired by VAE. We
can do so using the reparametrization trick assuming qϕ(z |s) and the prior pθ(z) to be
Gaussian and sampling (4–9) as:
z = mϕ(I ) + vϕ(I ) · ξ, (4–19)
where ξ ∼ N(0, σI), with I here denoting the identity matrix, and the functions m(I) and v(I) are deep neural
networks learned from data.
One can argue that given z and a good approximation to the posterior qϕ, estimating
δ is still tractable. Nevertheless, we preemptively avoid Expectation-Maximization or
other search approaches and use instead neural network layers lx and ly :
δxy = softmax(lx(I ))⊗ softmax(ly(I )), (4–20)
with ⊗ denoting the outer product of marginals. We also use a variational approximation
for qϕ(δxy|I) ≈ pθ(δxy|I). Since the position variables given by lx(I) and ly(I) are categorical
random variables, in this case we use the Gumbel-Softmax variational trick (4–11) for
sampling. With this extra reparametrization, the final form of our evidence lower bound
becomes:
L(θ, ϕ, δ, I) = −DKL(qϕ(z|si)||pθ(z)) − DKL(qϕ(δx|I)||pθ(δx)) − DKL(qϕ(δy|I)||pθ(δy))
+ Eqϕ(z|s,δ)qϕ(δx|I)qϕ(δy|I)[log pθ(I|z, δ)], (4–21)
where we show the factored, statistically independent marginals qϕ(δx|I) and qϕ(δy|I)
to make explicit what the final cost functions will look like. We can substitute (4–10) in
(4–21) for the Kullback–Leibler divergence of the Gaussian model qϕ(z|s) and (4–13)
twice for the Categorical models qϕ(δx|I) and qϕ(δy|I).
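Putting the terms together, a minimal sketch of the resulting cost is given below. The reconstruction term is written as a pixel-wise binary cross-entropy, which is one of the choices mentioned after (4–8), and the inputs m, v, logits_x, logits_y and the predicted image are assumed to come from the recognition networks and the generator:

import numpy as np

def _softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pun_negative_elbo(I_true, I_pred, m, v, logits_x, logits_y, eps=1e-8):
    """Negative of the lower bound (4-21): Gaussian KL for the sprite code z (4-10),
    two categorical KLs for the positions delta_x and delta_y (4-13), plus a
    reconstruction term (here pixel-wise binary cross-entropy)."""
    kl_z = -0.5 * np.sum(1 + 2 * np.log(v + eps) - m ** 2 - v ** 2, axis=-1)
    qx, qy = _softmax(logits_x), _softmax(logits_y)
    kl_dx = np.sum(qx * (np.log(qx + eps) - np.log(1.0 / qx.shape[-1])), axis=-1)
    kl_dy = np.sum(qy * (np.log(qy + eps) - np.log(1.0 / qy.shape[-1])), axis=-1)
    recon = -np.sum(I_true * np.log(I_pred + eps)
                    + (1 - I_true) * np.log(1 - I_pred + eps), axis=(-2, -1))
    return recon + kl_z + kl_dx + kl_dy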
Such amortized inference is also faster in training and test time than EM and will
also cover the case where I is itself a learned low dimensional or latent representation
instead of an observable image. Bear this in mind while we use this approach even in
simple experiments such as those with moving shapes in the Experiments section. This
will help us to understand what can be learned from this model. Also, this will be crucial
when we scale our model in the next chapter.
Beyond images, we extend the model above to videos, i.e. sequences of images
I (t) = {I (0), I (1), ...}, assuming that the conditional log-likelihood log pθ(It |HIt) =
log pθ(It |Hδt ,Hzt) follows (4–17), where HIt is the history of video frames prior to time
point t. Also Hδt and Hzt are the history of position coordinates and the history of
latent variables of the sprites respectively. We should observe that one can make
the assumption that the sprites don’t change for a given video I (t) and only estimate
one sprite st=0 or hidden variable zt=0. This assumption can be useful for long term
predictions, but requires that the main object moving in the scene doesn’t change.
In the next section, we propose a neural network architecture for maximizing our
approximate variational lower bound for 2D videos.
4.3 Perception Updating Networks
This section proposes a group of neural architectures for optimizing the lower
bound (4–18). This is a specific case of the more general framework presented in the
previous chapter, but with modules specifically tuned for video and image generation,
such as convolutions and spatial transformers. A schematic diagram is represented in
Fig. 4-5. The core of our method is a Recurrent Neural Network (RNN) augmented
with task specific modules, namely a sprite addressable memory and layers that model the transformations.
Figure 4-5. A schematic block diagram for a Perception Updating Network.
This configuration uses both convolutions with delta functions for translation and spatial
transformers for rotation. It also shows the optional background underlay. Here, the
sprites module is an external memory that is addressed by the RNN. Thus, Perception
Updating Networks are a specific case of the memory augmented framework presented
in the previous chapter. For an equivalent schematic diagram unfolded in time, we refer
the reader to Figure 3-1.
RNNs augmented with task specific units were popularized
by Graves et al. (2014) in the context of learning simple differentiable algorithms and
served as inspiration for us as well. Here, since we explicitly model the perceived sprites
as s or z and update them and their location and/or rotation through time, we decided to call our
method simply Perception Updating Networks.
Here an input frame at time t, It , is fed to the RNN that emits 2 signals: a memory
address that selects a relevant sprite and transformation parameters. If we are doing
the translation transformation using convolutions and delta functions this output is equal
to (4–20), see Algorithm 4 and Algorithm 5. If using STN, the translation operation
returns the matrix A used in (4–3), see Algorithm 6. Note that we could use both,
letting convolutions with δ do the translation while constraining A as in (4–7) to do rotation
transformations only. We describe the general case where both δxy and STNs are used
in Algorithm 7.
Beyond deciding between STNs vs δxy , a few other free parameters of our method
are the type of RNN (e.g. vanilla RNN, LSTM, GRU, ConvRNN, etc), the number of
neurons in the hidden state of the RNN, and the neural network architectures that infer the
correct sprite and the transformation parameters. Our hyperparameter choices are
investigated separately in each experiment in the next Section.
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers m, lx, ly, and a content addressable memory CAM as defined in (4–1)
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    ξ ∼ pθ(z)
    ct = m(ht)
    st = CAM(ct)
    It+1 = st ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 4: Convolutional Perception Updating Networks (conv PUN) with Content Addressable Memory.
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, lx, ly
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    ξ ∼ pθ(z)
    zt = mϕ(ht) + vϕ(ht) · ξ
    st = d(zt)
    It+1 = st ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 5: Convolutional Perception Updating Networks (conv PUN) with scalable sprites memory in the form of a variational decoder.
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, f
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    a = f(ht)
    A = [a11 a12 a13; a21 a22 a23]
    ξ ∼ pθ(z)
    zt = mϕ(ht) + vϕ(ht) · ξ
    st = d(zt)
    It+1 = STN(st, A)
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 6: Spatial Transformer Perception Updating Networks (STN PUN).
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, lx, ly, f
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    ρ = f(ht)
    A = [cos ρ  sin ρ  0; −sin ρ  cos ρ  0]
    ξ ∼ pθ(z)
    zt = mϕ(ht) + vϕ(ht) · ξ
    st = d(zt)
    at = STN(st, A)
    It+1 = at ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 7: Convolutional Perception Updating Networks with Spatial Transformer rotations.
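To make the inner loop of Algorithms 4–7 concrete, the sketch below implements one generation step in the style of Algorithm 5. The RNN update is assumed to have already produced the hidden state ht, the decoder d and the location layers lx, ly are replaced by random linear stand-ins rather than the architectures used in the experiments, and µ is treated as a scalar although in the model it is a learned mask:

import numpy as np
from scipy.signal import convolve2d

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conv_pun_step(h, params, background, mu, rng):
    """One step of a convolutional PUN with a variational sprite decoder (Algorithm 5)."""
    W_lx, W_ly, W_m, W_v, W_d = params           # stand-in linear layers
    delta_x = softmax(W_lx @ h)                  # "where": marginal over rows
    delta_y = softmax(W_ly @ h)                  # "where": marginal over columns
    delta_xy = np.outer(delta_x, delta_y)        # equation (4-20)
    xi = rng.standard_normal(W_m.shape[0])
    z = W_m @ h + np.abs(W_v @ h) * xi           # reparametrized code (4-19); abs is a
                                                 # stand-in for a positivity constraint
    sprite = (W_d @ z).reshape(8, 8)             # decoded "what", s_t = d(z_t)
    canvas = convolve2d(delta_xy, sprite, mode='same')   # place the sprite in the frame
    return mu * canvas + (1.0 - mu) * background         # optional background underlay

rng = np.random.default_rng(0)
h = rng.standard_normal(100)                     # RNN hidden state at time t
params = (rng.standard_normal((20, 100)), rng.standard_normal((20, 100)),
          rng.standard_normal((16, 100)), rng.standard_normal((16, 100)),
          rng.standard_normal((64, 16)))
background = rng.uniform(size=(20, 20))
frame = conv_pun_step(h, params, background, mu=0.9, rng=rng)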
In the next section we present experiments with the proposed architecture on
synthetic datasets.
4.4 Experiments
In this section we experiment with several implementations of the proposed
Perception Updating Networks. We start with a simple synthetic dataset made of
videos where one of 3 shapes moves with constant speed, bouncing off the
edges of the image. This illustrates the working of the finite memory and the addressing
scheme in (4–1) and Algorithm 4. Afterwards we show results on the moving MNIST
dataset (Srivastava et al., 2015) commonly used in the literature of generative neural
network models of videos.
4.4.1 Bouncing Shapes
In this first experiment we generate videos of one of three shapes moving on a
non-zero background. The shapes are a square, triangle and cross. The image size is
20 × 20 pixels and the shapes are 8 × 8 pixels. The pixel values are between 0 and 1.
The shapes are picked with equal probability and they move at constant speed of 1 pixel
Figure 4-6. Results on the Bouncing Shapes dataset.
Three 8x8 sprites (a square, a cross and a triangle) were used to generate videos. The
shapes move in a 20x20 pixels canvas with a Toeplitz background and bounce on the
corners. a) One step ahead predictions with the compared methods (ground truth,
convolutional PUN, LSTM and spatial transformer PUN). b) The learned sprites for the
convolutional implementation of the proposed Perception Updating Networks when we
over-estimate (10x10) and under-estimate (6x6) the size of the desired sprites, together
with sample δxy maps in each case. The internal RNN for both methods had 100 neurons.
The sprite selection layer is a single layer connecting 100 inputs to 3 outputs.
per frame. The shapes start from random initial positions and their movement directions are
random as well.
We tested two implementations of the proposed architecture: one using only
convolutions, referred to as convolutional PUN (conv PUN) in the figures, and another
using spatial transformers, called Spatial Transformer PUN. For the parameters
of the convolutional PUN the RNN used was a Long Short Term Memory (LSTM) with
100 cells. The RNN in the Spatial Transformer PUN had 256 cells. In the convolutional
PUN, the location layers used to calculate δxy , lx and ly , output vectors of size 20 pixels
and we used the finite addressable memory described in (4–1). The background is also
learned from data as weights of the neural network. This background served to make the
task more difficult and force the network to avoid just exploiting any non-zero value. After
the convolutional composition It = st ⋆ δxy , we added the background to form a new
image using It = µ · It + (1 − µ)B, where µ is a differentiable mask that accounts for the
“transparency” of the image It . B is the learned 20 × 20 pixels background image. For
complex shapes this mask shape could be calculated as another module in the network,
similarly to the approach in Vondrick et al. (2016). See Algorithm 4.
In the following experiments, the training videos were 10 frames long. At test time
the network is fed the first 10 frames of a video and asked to predict the next 10. Results
for the compared methods are shown in Fig. 4-8. For the baseline method, we did a
hyperparameter search on conventional LSTMs with a single linear output layer until we
found one that had comparable results at test time. That network had 256 hidden cells.
Also, note that although the scale of the mean square error is the same, the results from
our proposed architecture look smoother than those learned by the LSTM as shown in
Fig. 4-6.
Given such a simple experiment, it is elucidating to visualize values learned by
each piece of the network. As expected, the sprite memory learned the 3 investigated
shapes in flipped orientation, since they are reversed by the convolution operation used to
compose the frame. We also experimented with choosing the size of the learned sprites
st smaller and larger than the true shapes. We observed that for larger shapes such
as 10 × 10 the sprites converge to the correct shapes but just using part of the pixels.
For smaller shapes such as 6 × 6 pixels, instead of learning a part of the correct shape,
the convolutional Perception Updating Network learned to compensate for the lack of
enough pixels with more than one non-zero value in the location operation δxy (see
Fig. 4-6). This allows us to suggest to the interested practitioner that in order to get
interpretable results it is better to use sprites larger than the expected size rather than smaller.
For the spatial transformer PUN the image is calculated as:
Figure 4-7. Results of a Convolutional Perception Updating Network.The first row show the predicted video of a bouncing triangle and the second row shows
“where” variable that encoder the position of the sprite in the scene. This decouplingbetween “what” (the triangle) and “where” is what gives Perception Updating Network
interpretability, efficiency and generalization. It is important to notice on this image thatthe quality of the first predicted frame is not as good as the others, the reason is that weinitialize the internal RNN state with a vector zeros and it takes one step for the state tobe updated to a more useful value. Future work should address training the initial state
as well.
A = f (ht),
It+1 = STN(st ,A),
(4–22)
see Algorithm 6 for context.
We noticed that the spatial transformer PUN was not able to learn the training
videos using an equivalent architecture to the convolutional PUN one. We had to use
multiple layers to define the function f (ht). In other words, in the convolution based
method δxy can be estimated by a single affine transformation of the state ht but A
cannot. We also had to use smaller learning rates to guarantee convergence: 0.0001 for
STN while the δxy -based model worked with a value 10 times larger.
If we don’t use the softmax nonlinearity to construct δxy the representations learned
by the convolutional PUN are still interpretable, but the performance of the overall model
on the training and test sets is worse. Overall, it is interesting to conclude that under this
framework the “what” and “where” can only be distinguished if we impose architectural
constraints. The reason is the commutative property of the convolution operation. In
Figure 4-7 we show a predicted video with the corresponding learned δxy that exactly
represents the center of the moving object.
Figure 4-8. Performance curves in the test task of two implementations of the proposedarchitecture (conv PUN and STN PUN) and an equivalent LSTM baseline.
Note that the spatial transformer based PUN was not able to generalize to the test set,
i.e. it did not work well for generating videos when fed its own previous outputs as
next step inputs. Final errors in the test set are conv PUN: 0.033, STN PUN: 0.227 and
LSTM: 0.035. Notice that we fixed a small 100 hidden neurons PUN and increased the
baseline LSTM until it had equivalent performance. We had to increase the total number
of trainable parameters by using a baseline with 256 hidden neurons.
As a note on rotation, we ran experiments where the sprites are rotated by a random
angle before being placed in the image. This new type of video cannot be learned
using only convolutional based Perception Updating Networks unless we increase the
number of sprites proportionally to the number of possible angles. Spatial transformer
based Perception Updating Networks can handle this new type of video naturally.
Nevertheless, if the number of rotation angles is finite or can be discretized we found
that we could learn to generate the videos faster if we combined the convolutional
approach with a mechanism to select the appropriate angle from a set of possibilities.
4.4.2 Moving MNIST
The Moving MNIST benchmark uses videos generated by moving 28 × 28 pixel
images of hand written digits in a 64 × 64 pixels canvas. Just like in the Bouncing
Shapes dataset, the digits move with different speeds in different directions and
can bounce off the walls. Unlike the Bouncing Shapes dataset, there are 60000 different
Figure 4-9. Sample rollouts of a 2 layer LSTM convolutional Perception UpdatingNetwork.
Notice that the quality of the predicted sprite doesn't change, contrary to other methods
that get blurry with time. On the other hand, our proposed network forgets the correct
movement. A possible solution could be a parametrization of the movement, in other
words, parameterizing the update of the "where" variables.
sprites for training and 10000 for test, making it impractical to use a discrete memory
module. Instead, we use the memory representation denoted by (4–19) followed by
st = d(zt) as written in Algorithm 5.
We trained a convolutional Perception Updating Network using 2 layer LSTMs each
one with 1024 cells for 200 epochs, with 10000 gradient updates per epoch. The latent
variable z had 100 dimensions and the decoder d(·) was a single hidden layer MLP
with 100 hidden neurons and softplus activation function. The output layer of this MLP
has 784 neurons, which is the size of an MNIST image, and sigmoid activation function.
In the test set we obtained a negative log-likelihood of 239 nats with the proposed
architecture, while a 2 layer LSTM baseline had 250 nats. Note that our method was
optimized using the lower bound (4–18), not only the likelihood. These
results are not as good as those obtained by the Video Pixel Networks (Kalchbrenner
et al., 2016) that obtained 87 nats on the test set. Nevertheless, both approaches are
not mutually exclusive and instead of a fully connected decoder we could use a similar
PixelCNN decoder to generate sprites with higher likelihood. In this first study we
decided instead to focus on defining the statistical framework and the interpretable "what"
and “where” decoupling.
When running the proposed method in rollout mode, feeding the outputs back as
next time step inputs, we were able to generate high likelihood frames for more time
steps than with a baseline LSTM. Also, since the sprite to be generated and its position
in the frame are decoupled, in rollout mode we can fix the sprite and only use the δxy
coming from the network. This way we can generate realistic looking frames for even
longer, but after a few frames we observed the digits stopped moving or moved in the
wrong direction (see video in the companion code repository). This means that the
LSTM RNN was not able to maintain its internal dynamics for too long, thus, there is still
room for improvement in the proposed architecture.
In Fig. 4-9 we show sample rollout videos. The network was fed with 10 frames and
asked to generate 10 more, getting its own outputs back as inputs; see the companion
code repository for an animated version of this figure.
This experiment also suggests several improvements in the proposed architecture.
For example, we assumed that the internal RNN has to calculate a sprite at every time
step, which is inefficient when the sprites don't change in the video. We should improve
the architecture with an extra memory unit that snapshots the sprites and avoids the
burden of recalculating the sprites at every step. We believe this would be a possible way
to free representation power that the internal RNN could use to model the movement
dynamics for even more time steps. We investigate that later in this chapter.
4.4.3 Visualizing the RNN-to-CAM Connections
One of the most valid criticisms of the architectural models we presented in the
last chapter is that the external memory unit needs to be at least as big as the input
source for complete sequence reconstruction. A consequence of a large content
addressable memory is repeated memorized values. In the PUN framework with CAM,
we force the model to learn a single decoder-like memory system. Here we visualize
the addressing signal from the controller RNN to the CAM to show that it learns to cluster
redundant samples into meaningful compact representations.
In this experiment we continue working with the Moving MNIST dataset, but this
time we are not interested in the video prediction benchmark directly and focus on
movies with a single digit in the scene. We train a convolutional PUN for predicting
the next frames for 200 epochs. The size of the hidden code zt is 100 dimensions
and we reduce it to 2 dimensions using t-SNE. Results are shown in Figure 4-10. For
quantitative results, we compared the performance of linear classifiers on the codes
zt vs. on the raw MNIST images. We obtained an almost 3 times smaller error
probability using the learned codes zt from the moving digits in a larger scene than
using the raw images focused around the digit (see Figure 4-10).
This allows us to conclude that Perception Updating Networks can double as a video
predictor and an unsupervised object detection/feature extraction system.
4.4.4 Snapshotting “What” Directly from Pixels
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, lx, ly, f, and a localization network LN
Result: video predictions It, t ∈ {1, 2, 3, ...}
A = LN(I0)
s = STN(I0, A)
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    It+1 = s ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 8: Snapshot Perception Updating Networks.
Although powerful in our experiments, a limitation of learning a limited set of
sprites is that the model does not learn how to snapshot what is important for scene
reconstruction on the go. In this experiment we propose an alternative PUN model that
does not have a content addressable memory. Instead, this new model snapshots a
sprite proposal from the first frame of the video itself and uses it for frame prediction.
This new model, named Snapshot PUN, is represented in Figure 4-11 and detailed in
Algorithm 8.
Figure 4-10. A piece of the schematic block diagram for a Perception Updating Network
and t-SNE embedding of the 100-dimensional codes zt sent from the RNN controller to the CAM.
Note that the codes cluster by label. We represent in dashed lines the parts of the
PUN that do not contribute to the t-SNE embedding. For a quantitative depiction of the
visually separated clusters, we trained a linear classifier on the hidden codes zt, in which
case we obtained 98.1% accuracy. A linear classifier on the raw 28x28 pixels MNIST
obtains 94.5% accuracy.
Note that this new Snapshot algorithm is not mutually exclusive with the previous
PUN versions and could be used in tandem with them depending on the application. But note that
this new algorithm requires that the object of interest must be totally visible when the
snapshot is taken. The snapshot is taken by calculating a transformer matrix A using a
localization network, LN, that consists of a 5 layer convolutional neural network that takes
Figure 4-11. Snapshot Perception Updating Network. See Figure 4-5 and compare it to
the convolutional Perception Updating Network model.
Note that this model does not update the snapshot sprite. This means both efficiency,
by avoiding extra calculations, and less flexibility, since it does not adapt in case of
changes in the object in the video (e.g. rotations, deformations, new objects, etc.). Future
work should address a strategy to know when to snapshot objects in the scene and
when to update them.
as input the very first frame of the video and outputs the cropped digit. That digit is used
as a fixed sprite throughout the entire video prediction.
Since the snapshot is based solely on spatial transformer networks, it can converge
faster, but it is also more unstable and harder to use. We compare this snapshot model
with the convolutional PUN on video prediction. Results are shown in Table 4-1. Note
that the conv PUN generalizes better, which is why we propose it as our main model.
Table 4-1. Comparison between Snapshot PUN and conv PUN on the single digit moving
MNIST benchmark.
Results show negative log-likelihood (smaller is better) on the test set. Note that
Snapshot PUN captures the sprite from the scene using spatial transformers that are
based on bilinear interpolation; better results could be obtained with extra rescaling or
cleaning of the sprite before composing the scene.

Snapshot PUN    Conv PUN
91.474          85.998
4.5 Rules of Thumb for Model Choice
In this chapter, we presented 5 different PUN algorithms. In the previous chapter we
presented another 3 memory augmented neural networks. Here we summarize rules of
thumb that should help choosing which of those algorithms is more appropriate for new
problems.
The first question to be asked is what is the purpose of the model being trained.
If the objective is to learn an autoencoder, i.e. learn a fixed length representation for
a variable length video, the methods from Chapter 3 are more appropriate. When
the fixed length representation is the result of an arithmetic operation, DiffRAM is
more appropriate. If the fixed length representation should be useful for sequence
reconstruction, NTM, NTM2 or DCAM must be preferred.
Perception Updating Networks should be the model of choice if the user is
interested in either video prediction (for example, for model based controls, video
compression, filtering, etc.), object detection or interpretable representations of videos.
In the PUN family, conv PUN is the simplest and most efficient model since it converges
faster and learns clustered representations of the objects in the scene. If generalization is
important but the object of interest appears completely in the scene, Snapshot PUN is a
more robust model. A combination of Snapshot PUN and conv PUN could help in cases
where the object of interest changes during the video. For example, while the object
in the scene is constant, we can use Snapshot PUN; when it changes, we can use the
conv PUN decoder-like memory to draw the new object and the Snapshot PUN can
continue the generation from there. Notice, though, that learning how to detect changes in
the object of interest on the fly is left for future work.
In the next chapter we focus on scaling this video prediction network to real world
videos. We do so by proposing a new type of DPCN, called Recurrent Winner Take
All Networks and later combining it with the memory and graphics pipeline modules
proposed in this chapter.
CHAPTER 5
SCALING UP PERCEPTION UPDATING NETWORKS
One of the main criticisms of Perception Updating Networks (PUN)1 was regarding
its strong 2D assumptions and potential weakness to represent 3D videos. The 2D
assumptions and 2D graphics pipelines should represent well flat scenes, platform
based video games and top down views such as those of drones, satellites, robots, etc.
On the other hand, the 2D assumption would fail to represent 3D videos of roads, robot
navigation, human actions, etc, which are the vast majority of videos available.
To extend PUNs to 3D one might generalize the graphics pipeline proposed in
the previous chapter to represent the extra dimension of space. We could define
volumetric convolutions to place 3D voxels (the equivalent of the 2D sprites) in space.
With this added extra dimension one would also have to model 3D translation, rotation,
perspective transformations, and also deal with occlusion, view angle, etc. While this
might be the most complete extension of PUNs, the required neural network technology
would be beyond the scope of the analysis presented here. It would also require
innovations only recently developed in the literature (Wu et al., 2016)(Ravanbakhsh
et al., 2016)(Dai et al., 2016)(Gadelha et al., 2016)(McCormac et al., 2016)(Yan et al.,
2016).
We leave the development of 3D-PUN for future work. Here we propose an
alternative extension based on spatio-temporal embeddings with Convolutional
Recurrent Neural Networks (ConvRNN) (Santana et al., 2016b) and Perception
Updating Networks as an extension of Convolutional layers. Our hypothesis is that we
can project the input videos to a non-linear manifold where the flat spatial dimensions
assumption holds true. We implement PUN on that manifold and finally perform the
decoding transformation back to the original video space. The PUN layer implements
1 These comments were kindly provided by ICLR 2016 anonymous reviewers
different nonlinearities than a conventional convolutional layer, thus learning more
complex mappings when both are used in tandem.
The motivation for this hypothesis was our recent success in unsupervised learning
with ConvRNNs (Santana et al., 2016b). We were able to learn projections of videos
of 3D rotating shapes that were linearly separable for object recognition. Here we will
first review those results and later learn a PUN in the latent space of ConvRNNs. We
can interpret that as learning with a PUN a spatio-temporal trajectory in a hyperplane
of a convolutional auto-encoder. In this new space, a PUN memory no longer means
only a sprite or a position, but a video memory in the embedded space. We illustrate the
architecture of this proposed extension in Figure 5-1.
Both the ConvRNN results and ConvRNN+PUN presented in this chapter are
original work. The ConvRNN results in this chapter were already submitted for
publication (Santana et al., 2016b). The ConvRNN+PUN combination is unpublished
work. In the next section we review the ConvRNN architecture already discussed in
Chapter 2. We show the modifications we made to that model and show some results
in unsupervised learning of videos. Those results are the main motivation for combining
ConvRNN+PUN and for the experiments that conclude this chapter.
5.1 Convolutional Recurrent Neural Networks for Unsupervised Learning of Videos
In this section we revisit our published results with ConvRNNs (Santana et al.,
2016b) to illustrate their ability to embed multi-dimensional time series in linearly
separable spaces. Furthermore, in place of using computationally expensive EM
algorithms to compute the sparse states and causes, just like DPCNs, our method
(Santana et al., 2016b) uses convolutional recurrent autoencoders with Winner-Take-All
(Makhzani & Frey, 2015) regularization to encode the states in a feedforward manner.
Makhzani and Frey proposed Winner-Take-All (WTA) Autoencoders (Makhzani &
Frey, 2015) which use aggressive Dropout, where all the elements but the strongest of
Figure 5-1. Convolutional Perception Updating Network as a hidden layer of a deep
convnet.
Free parameters: k (convolutional channels), c (channels in the output, 1 for gray and 3
for color), H (hidden neurons of the PUN's LSTM), N (memory neural network).
(Optionally) Using a PUN in the hidden space of a convnet transforms it into a Convolutional
Recurrent Neural Network (green). If used as a hidden layer, the PUN's memory no
longer has a one-to-one relationship with pixels in the output scene, but with
spatio-temporal memories, similarly to the gists of Deep Predictive Coding Networks. To
reduce the number of computations, we implemented the convolutions in the
pre-processing sub-network (blue) with k filters of 5x5 pixels with strides 2x2 that
downsample the output. Equivalently, the post-processing sub-network (yellow) has
convolutions with k filters of 5x5 pixels with depth-to-space upsampling. The PUN layer
output is merged with conventional convolutional layers via gated addition (red),
m * p + (1 − m) * b, where the soft-binary mask m balances the contribution of each layer.
a convolutional map are zeroed out. This forces sparseness in the latent codes and the
convolutional decoder to learn robust features. In our method we extended convolutional
Winner-Take-All autoencoders through time using convolutional RNNs. WTA for a map
xf,r,c in the output of a convolutional layer can be expressed as in (5–1). The indices f, r, c
represent, respectively, the channel, the row, and the column of the map.
WTA(xf,r,c) = xf,r,c,  if xf,r,c = max over (r, c) of xf,r,c
              0,       otherwise. (5–1)
Thus, WTA(xf,r,c) has only one non-zero value for each channel f. To backpropagate
through (5–1) we use ∇WTA(xf,r,c) = WTA(∇xf,r,c). In the present work, we apply
(5–1) to the output of the convolutional maps of the ConvRNNs after they have been
calculated. In other words, the full convolutional map hidden state is used inside the
dynamics of the ConvRNN; WTA is applied only before the maps are fed as input to the
convolutional decoder.
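A minimal Python sketch of the spatial winner-take-all operation in (5–1); the channels-first array layout is our own choice for the example:

import numpy as np

def spatial_wta(x):
    """Winner-take-all of equation (5-1): for every channel f, keep only the single
    largest value over the spatial positions (r, c) and zero out the rest.
    `x` has shape (channels, rows, cols)."""
    flat = x.reshape(x.shape[0], -1)
    winners = flat.argmax(axis=1)
    mask = np.zeros_like(flat)
    mask[np.arange(flat.shape[0]), winners] = 1.0
    return (flat * mask).reshape(x.shape)

x = np.random.randn(128, 16, 16)   # e.g. a ConvRNN feature map
sparse = spatial_wta(x)            # one non-zero entry per channel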
We also proposed to learn smoothness in time with architectural constraints using
a two-stream encoder as shown in Figure 5-2. In the present work, this two stream
approach inspired the skip-network we used for the ConvRNN+PUN. Originally, this
two-stream architecture was inspired by the dorsal and ventral streams hypothesis in
the human visual cortex Goodale & Milner (1992). Roughly speaking, the dorsal stream
models ”vision for action” and movements and the ventral stream represents ”vision
for perception”. In our proposed architecture one stream is a stateless convolutional
encoder-decoder and the other stream has a convolutional RNN encoder, thus a
dynamic state. Using Siamese decoders for both streams, we force the stateless
encoder and the convolutional RNN to project into the same space—one which can be
reconstructed by the shared weights decoder. It is important to stress that from the point
of view of spatiotemporal feature extraction with the ConvRNN, the stateless stream
works as regularization. As with any other sort of regularization, its usefulness can only
be fully assessed in practice and the practitioner might optionally not use it. Nevertheless,
we opted for using the full architecture in all the experiments of this work. In Appendix
Figure 5-2. Schematic diagram of the Recurrent Winner-Take-All (RWTA) network.
This is the modified architecture we used to investigate ConvRNNs' ability to embed
videos into a separable space. The upper stream is the static encoder-decoder, which
reconstructs frame 1. The lower stream is the temporal, dynamic encoder based on a
ConvRNN, which predicts frame 2. Both streams apply WTA before Siamese decoders
with shared parameters.
A, we show how this proposed architecture enforces spatiotemporal smoothness in the
embedded space.
Given an input video stream xt , denoting the stateless encoder by E , the decoder
D, and the convolutional RNN by R, the cost function for training our architecture is the
sum of reconstruction and prediction errors:
Lt = E[(xt−1 − D(E(xt−1)))² + (xt − D(R(xt−1)))²], (5–2)
where E denotes the expectation operator. Notice that as depicted in Figure 5-2,
E and R have shared parameters. During training, we observe a few input frames
t = [1, 2, ...,T ] and adapt all the parameters using backpropagation through time
(BPTT) (Werbos, 1990). Notice that due to BPTT both streams of our architecture are
adapted while considering temporal context. Thus, the stateless encoder E will learn
richer features than it would if trained on individual frames.
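For reference, the cost (5–2) can be sketched as follows, with encoder, decoder and conv_rnn as placeholders for the actual networks E, D and R (the identity stand-ins in the usage example are only there to make the sketch executable):

import numpy as np

def rwta_loss(x_prev, x_next, encoder, decoder, conv_rnn, state):
    """Cost of equation (5-2): reconstruction of the current frame by the stateless
    stream plus prediction of the next frame by the recurrent stream, both decoded
    by the shared (Siamese) decoder."""
    recon = decoder(encoder(x_prev))       # static stream reconstructs x_{t-1}
    h, state = conv_rnn(x_prev, state)     # dynamic stream updates its state
    pred = decoder(h)                      # shared decoder predicts x_t
    loss = np.mean((x_prev - recon) ** 2) + np.mean((x_next - pred) ** 2)
    return loss, state

# Toy usage with identity stand-ins just to exercise the function.
identity = lambda x: x
dummy_rnn = lambda x, s: (x, s)
loss, state = rwta_loss(np.zeros((8, 8)), np.ones((8, 8)), identity, identity, dummy_rnn, None)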
To illustrate the capabilities of such proposed architecture we applied it to two
datasets, the Coil100 and Honda/UCSD Faces Dataset for a direct comparison with
DPCN and other unsupervised learning techniques. Sample videos of both datasets are
Figure 5-3. Sample videos from the Coil and Honda/UCSD datasets.
a) Coil-100 dataset (Nene et al., 1996) and b) Honda/UCSD face dataset (Lee et al., 2005).
Table 5-1. Hyperparameter choices per experiment

                        COIL100    Honda Faces
Channels per layer      128        256
Filter size (encoder)   5x5        5x5
Filter size (decoder)   7x7        7x7

All models were trained using the ADAM optimization rule with learning rate 0.001.
All models were 4 layers deep.
All models had 2 convolutional layers before the ConvRNN layer.
WTA was applied only right before the last layer.
shown in Figure 5-3. A list of the hyperparameters used in those experiments is shown
in Table 5-1.
The COIL-100 dataset (Nene et al., 1996) consists of 100 videos of different
objects. Each video is 72 frames long and was generated by placing the object on a
turntable and taking a picture every 5◦. The pictures are 128x128 pixels RGB. For our
experiments, we rescaled the images to 32x32 pixels and used ZCA pre-processing.
Figure 5-4. 128 decoder weights of 7x7 pixels learned on Coil-100 videos.
The classification protocol proposed for COIL-100 by Nene et al. (1996) uses 4
frames per video as labeled samples, the frames corresponding to angles 0◦, 90◦, 180◦
and 270◦. Chalasani and Principe (Chalasani & Principe, 2015) and Mobahi et al.
(Mobahi et al., 2009) used the entire dataset for unsupervised pre-training. For this
reason, we believe the results in this experiment should be understood with this in
mind. Note that the compared methods enforce smoothness in the representation of
adjacent frames, and since the test frames are observed in context for feature extraction,
information is carried from labeled to unlabeled samples. In other words, this experiment
is better described as semi-supervised metric learning than unsupervised learning.
Here, we followed that same protocol, using 14 frames per video. Results are reported
in Table 5-2. We used encoders with 128 filters of 5x5 pixels and a decoder with 7x7
pixels. The decoder filters are shown in Fig. 5-4.
The Honda/UCSD dataset consists of 59 videos of 20 different people moving their
heads in various ways. The training set consists of 20 videos (one for each person),
∼ 300 − 1000 frames each. The test set consists of 39 videos (1-4 per person),
∼ 300 − 500 frames each. For each frame of all videos, we detected and cropped the
Table 5-2. Recognition rate (in percentage %) for object recognition in the Coil-100 dataset

Method                                                     Accuracy
DPCN no context (Chalasani & Principe, 2015)               79.45
Stacked ISA + temporal (Le et al., 2011)                   87
ConvNets + Temporal (Mobahi et al., 2009)                  92.25
DPCN + temporal + top down (Chalasani & Principe, 2015)    98.34
Proposed method                                            99.4
Table 5-3. Recognition rate (in percentage %) for face recognition in the Honda/UCSD dataset

Sequence Lengths   MDA     SANP    CDN     Proposed Method
50 Frames          74.36   84.62   92.31   100
100 Frames         94.87   92.31   100     100
Full Video         97.44   100     100     100

References: MDA (Wang & Chen, 2009), CDN (Chalasani & Principe, 2015), SANP (Hu et al., 2011).
faces using Viola-Jones face detection. Each face was then converted to grayscale,
resized to 20x20 pixels, and histogram equalized.
During training, the entire training set was fed into the network, 9 frames at a time,
with a batch size of 32. After training was complete, the training set was again fed
into the network. For each input frame in the sequence, the feature maps from the
convolutional RNN were extracted, and then (5,5) max-pooled with a stride of (3,3). In
accordance with the test procedure of Chalasani & Principe (2015), a linear SVM was
trained using these features and labels indicating the identity
of the face. Finally, each video of the test set was fed into the network, one frame
at a time, and features were extracted from the RNN in the same way as described
above. Each frame was then classified using the linear SVM. Each sequence was
assigned a class based on the maximally pooled predicted label across the frames
in the sequence. Table 5-3 summarizes the results for 50 frames, 100 frames, and
the full video, comparing with 3 other methods, including the original convolutional
implementation of DPCN Chalasani & Principe (2015). The results for the 3 other
methods were taken from Chalasani & Principe (2015). The results for our method were
perfect for all the tested cases.
5.2 ConvRNN + PUN: Combining Convolutional RNNs and Perception Updating Networks
With the knowledge about ConvRNNs acquired from the results in the previous
section and about Perception Updating Networks from the previous chapter, we set out to
combine both architectures to create a scalable, memory based, shift invariant video
prediction system. The convolutional part of the ConvRNN gives us the scalable
shift invariant properties, the RNN takes care of the dynamics, and the PUN is
responsible for the memory mechanism.
The first equations to be defined when writing a ConvRNN based PUN are the ones
for calculating the “where” variable δxy and the sprite st . In the original PUN formulation
we calculate first an LSTM hidden state ht and δxy is the outer product of two affine
transformations of ht . For st , in the decoder based memory, we use an MLP. In our
experiments, we limited ht to 100 or 1000 dimensions. On the other hand, when
using a ConvRNN, the hidden state of a Convolutional LSTM (ConvLSTM) is a set of
multidimensional feature maps Ht with the same number of rows and columns as the
input image times the number of channels in the convolutional filters. For clarity and
quicker reference, we restate the ConvLSTM equations below:
It = logistic(Whi ⋆ Ht−1 + Wxi ⋆ Xt + bi)
Ft = logistic(Whf ⋆ Ht−1 + Wxf ⋆ Xt + bf)
Ot = logistic(Who ⋆ Ht−1 + Wxo ⋆ Xt + bo)
Gt = tanh(Whg ⋆ Ht−1 + Wxg ⋆ Xt + bg)
Ct = Ft ⊙ Ct−1 + It ⊙ Gt
Ht = Ot ⊙ tanh(Ct), (5–3)
where ⋆ denotes the convolution operation.
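For reference, a single-channel Python sketch of one ConvLSTM update as in (5–3); actual implementations use multi-channel filters inside a deep learning framework, but the gating structure is the same:

import numpy as np
from scipy.signal import convolve2d

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def convlstm_step(X, H_prev, C_prev, W):
    """One ConvLSTM update as in (5-3) for single-channel 2D maps.
    W holds 5x5 kernels ('hi', 'xi', ...) and scalar biases ('bi', 'bf', 'bo', 'bg')."""
    def gate(wh, wx, b, act):
        return act(convolve2d(H_prev, W[wh], mode='same')
                   + convolve2d(X, W[wx], mode='same') + W[b])
    I = gate('hi', 'xi', 'bi', logistic)   # input gate
    F = gate('hf', 'xf', 'bf', logistic)   # forget gate
    O = gate('ho', 'xo', 'bo', logistic)   # output gate
    G = gate('hg', 'xg', 'bg', np.tanh)    # candidate values
    C = F * C_prev + I * G                 # new cell state
    H = O * np.tanh(C)                     # new hidden state map
    return H, C

rng = np.random.default_rng(0)
W = {k: rng.standard_normal((5, 5)) * 0.1 for k in
     ['hi', 'xi', 'hf', 'xf', 'ho', 'xo', 'hg', 'xg']}
W.update({b: 0.0 for b in ['bi', 'bf', 'bo', 'bg']})
H, C = convlstm_step(rng.standard_normal((16, 16)),
                     np.zeros((16, 16)), np.zeros((16, 16)), W)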
One may argue that we can simply resize the feature maps Ht to a vector shape
and use the original formulation of PUN. To observe that this is impractical, assume an
input video frame of the moving MNIST dataset. Those frames have 64x64=4096 pixels.
Now, assume a reasonably sized convolutional filter with 128 channels. The output of
such a ConvLSTM would have 4096*128=524,288 pixels. This input size is too large
and learning weights for each one of those pixel inputs is impractical.
At this point, we recall the experiments with PUNs with oversized memories. We
observed that backpropagation and enough data were sufficient to make the PUN learn
to use only the parts of the memory it needed for successfully modeling the input video.
Thus, instead of resizing the maps Ht we keep their original shape, learn a new set
of convolutional filters Wδ and Ws, and compute the "what" st and "where" δxy as
st = Ws ⋆ Ht,
δxy = softmax(Wδ ⋆ Ht),
ot = δxy ⋆ st (5–4)
where the number of output channels in Wδ is 1 and the number of output channels
in Ws is the same as that of the input videos if the PUN layer is the output layer. We can also
calculate δxy = sigmoid(Wδ ⋆ Ht) in the hidden layer if we want to repeat the st further. If
the PUN layer is a hidden layer, the number of output channels in Ws is a free parameter.
Note that the ConvRNN based PUN can also implement a finite memory where only δxy
is calculated with convolutions, and st is selected from an associative memory.
We can also implement multiple deltas and multiple sprites and combine the result
with argmax, like in the original PUN formulation:
st,0 = Ws,0 ⋆ Ht,   st,1 = Ws,1 ⋆ Ht,
δxy,0 = softmax(Wδ,0 ⋆ Ht),   δxy,1 = softmax(Wδ,1 ⋆ Ht),
ot = argmax(δxy,0 ⋆ st,0, δxy,1 ⋆ st,1), (5–5)
where argmax is implemented element wise, pixel by pixel.
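A single-channel Python sketch of the convolutional "what"/"where" computation in (5–4); the softmax is taken over all spatial positions of the δ map and the filters are random stand-ins rather than trained weights:

import numpy as np
from scipy.signal import convolve2d

def spatial_softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def convrnn_pun_layer(H, W_s, W_delta):
    """'What' and 'where' maps of equation (5-4) for a single-channel sketch.
    H is the ConvLSTM state map, W_s and W_delta are convolutional filters."""
    s = convolve2d(H, W_s, mode='same')                            # sprite-like "what" map
    delta = spatial_softmax(convolve2d(H, W_delta, mode='same'))   # "where" map
    return convolve2d(delta, s, mode='same')                       # o_t = delta ⋆ s_t

rng = np.random.default_rng(0)
H = rng.standard_normal((32, 32))        # hidden state map of the ConvLSTM
o = convrnn_pun_layer(H, rng.standard_normal((5, 5)), rng.standard_normal((5, 5)))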
Although the intuition provided by the PUN experiments in the previous chapter
helped us design this scalable version, when using the PUN as a hidden layer, we
can no longer visually inspect the learned memories. For this reason, in the following
experiments validating this new architecture we perform extensive hyperparameter tests
with both synthetic and real world videos. Our hope is that several comparisons
against ConvRNN baselines will help us understand when PUNs are a better suited
layer in neural network design.
5.3 Experiments
5.3.1 Moving MNIST
We start our experiments with the two digits Moving MNIST benchmark. Similarly to
the previous chapter, we have a 64x64 pixels canvas and two 28x28 pixels MNIST digits
moving in the scene. We set out to learn a generative model that predicts future video
frames given a history of previous frames. We performed extensive experiments against
a baseline ConvLSTM. All hyperparameters and results are presented in Table 5-4. PUN
based networks were 0.1s slower.
We observed that the proposed method requires a large visual receptive field
to work well. Such large receptive field can be achieved with several layers and
downsampling or resolution preserving dilated convolutions (Kalchbrenner et al., 2016).
Since the output needs to be at the same resolution as the input, for the downsampling
encoders the output layers of the network need to implement upsampling, just like in
Table 5-4. Experiments with hidden PUN. Average negative log-likelihoods (nats) on
video prediction experiments with the Moving MNIST benchmark.
All methods trained with the ADAM optimizer with learning rate 0.0009.
All convolutions had 64 filters of 5x5 pixels.

PUN as output layer
Two layer experiments:
    ConvLSTM - PUN: 178                    ConvLSTM - Conv: 180
Deep residual U-net (see Figure 5-5):
    two PUN output: 155.54                 Conv output: 2802

PUN as hidden layer
4 layer experiments:
    Conv - Conv - ConvLSTM - Upsample - PUN: 145.7
    Conv - Conv - ConvLSTM - Upsample - Conv: 150
6 layer experiments:
    Conv - Conv - ConvLSTM - PUN - Upsample - Conv - Conv: 139.26
    Conv - Conv - ConvLSTM - Conv - Upsample - Conv - Conv: 147.89
so called ”hourglass” models. The hourglass nickname is a reference to the shape of
the encoder that funnels the representation with downsampling and the decoder that
expands it back with upsampling.
Perception Updating Network augmented convnets outperform their equivalent
counterparts, with the improvement being larger for deeper models. We conclude that
if the extra computational cost is affordable, the user can always use PUN augmented
networks and expect improvements. The extra computational cost is one convolution
per sample per PUN layer. The computational cost of convolution is O(n⁴), but note that
in modern GPU libraries its implementation is very efficient; nevertheless, it can be
costly in naive CPU implementations. In our implementations, the PUN based networks
were less than a second slower per sample in a batch. The smaller the input to which the
PUN layer is applied, the smaller this time difference in computation. Finally,
note that the convolutional PUN model has better results than the fully connected model
presented in the previous chapter.
Figure 5-5. Deep residual U-net with Perception Updating Networks output.
See results in Table 5-4. The encoder takes the 64x64 input through 32x32, 16x16 and
8x8 resolutions with 3x3, 128-channel convolutions, a 3x3, 128-channel ConvLSTM sits
at the bottleneck, and residual blocks (2 blocks, 3x3, 128) lead back up to two PUN
output layers. Note that resnet (He et al., 2016) blocks are known to be unstable without
batch normalization layers, but here we didn't need them because the Perception
Updating Network layers in the output naturally bound the gradients that are
backpropagated. Without batch normalization the full network is faster. See the
implementation of residual blocks we used in Figure 5-6.
5.3.2 Real Videos: Kitti Dataset
In this experiment, we tested the PUN-augmented convnet on real world videos
of the Kitti Dataset (Geiger et al., 2013). The Kitti dataset is a standard benchmark for
visual odometry, SLAM, depth estimation, structure from motion, etc. The videos of
the Kitti Dataset were recorded from the windshield of a car driven in rural areas and
Figure 5-6. Definition of a single resnet block used in the experiments.
Each block applies two 3x3, 128-channel convolutions, each followed by a nonlinearity of
the form x if x > 0 and a(exp(x) − 1) otherwise, and adds the block input back to its
output through a skip connection. See Figure 5-5 for details.
highways of Karlsruhe, Germany. The input videos had 128x160x3 pixels. The pixel
values were rescaled between 0 and 1.
As learned in the previous experiments, we needed to have receptive fields at
different scales, but since the videos in this experiment were larger, we opted for using
resolution preserving networks (Kalchbrenner et al., 2016). In other words, we used
no downsampling operation; instead, to achieve multiple resolutions, we used dilated
convolutions. Convolution with dilation does not change the number of trained weights,
but increases the effective receptive field by padding zeros between the non-zero
weights (dilation). We illustrated dilated convolutions in Figure 5-7.
In this experiment we compared two deep resolution preserving convolutional neural
networks. We show the hyperparameter choices and results in Table 5-5.
With this extra set of experiments, we observed that Perception Updating Networks
consistently improve the performance of convolutional neural networks. Note that all
Figure 5-7. Dilated convolution with a filter of 3x3 pixels with dilation rate of 1x1.
Inputs are shown in light blue, weights in dark blue and outputs in green. In our
implementation we padded the inputs with zeros to make the output have the same size
as the input. The dilation rate controls the number of holes or zeros between the
trainable weights. The larger the dilation rate, the larger the effective receptive field of
the convolutional layer. This image was adapted from
https://github.com/vdumoulin/conv_arithmetic.
Figure 5-8. Qualitative results on the test set of the Kitti Dataset.
a) Predicted frames using the Perception Updating Network augmented convnet. b) Target
videos.
the experiments we performed were with generative models; we did not experiment
with PUN on convnets for classification. Nevertheless, it is interesting to note that while
we proposed PUN for memory augmented and interpretable models of 2D videos,
the operation defined in (5–4) is capable of learning mappings that are more general
Table 5-5. Hyperparameters and quantitative results on the test set of the Kitti Dataset.
We compared our Perception Updating Network augmented convnet to a conventional
convnet. Both models were trained with similar hyperparameters, with the only
difference being the PUN layer in the output. PUN based networks were on average
0.88s slower for 10 frames long videos.
All methods trained with the ADAM optimizer with learning rate 0.0009.
All convolutions had 48 channels.

Layers shared by both networks:
    Conv - 5x5 - dilation rate: 1
    Conv - 5x5 - dilation rate: 2
    Conv - 5x5 - dilation rate: 3
    Conv - 5x5 - dilation rate: 4
    ConvLSTM - 3x3

Output layer of compared methods:
    PUN: Sprite - Delta - Mask - Conv 3x3
    Baseline convnet: Conv - 3x3

Mean squared error on next frame prediction (test set):
    PUN: 0.0054    Baseline convnet: 0.0085    Previous frame as predictor: 0.0143
than those learned with conventional convnets and we showed that they can always
be used regardless of the complexity of the task and architectural choices. The cost of
this new model is the extra convolution between the δ and the sprite, in other words, extra
computations at O(n⁴). But note that those extra computations can be run in parallel with
all the other convolutions at the same depth of the architecture. We used Tensorflow with
GPU and observed no significant increase in time per epoch (less than 1s per sample in
a batch), but the memory requirements were higher when using PUN.
CHAPTER 6
CONCLUSIONS
The present thesis was about neural networks augmented with memory. We implemented these memories to consolidate relevant events in the network inputs and/or the states of the neural networks evoked by such relevant events. For this reason we named our developments “A Framework for Pattern Consolidation in Cognitive Architectures”.
We progressed the research in three main steps:
1) In the first step (reported in Chapter 3) we designed a general architecture without specific applications in mind. In that architecture we had a recurrent neural network augmented with a content addressable memory, with read and write operations inside the recurrent neural network loop.
The main goal of this step of the research was to learn to memorize, i.e., to force the neural network to rely on its memory module and use the read and write operations to store information. To do so, we applied the model to sequence memorization; in other words, we had a time-series auto-encoder working in two stages: first, sequence reading, where the entire input sequence is presented, and second, input reconstruction, where the network has to output the exact sequence. Obviously, the best way to do that is by memorizing (via the addressable memory write operations) and later traversing the memory, returning the input (with the memory read operations).
The main lesson learned here was how to use neural network weights (memories) that are generated with dynamics other than backpropagation, namely through read and write operations via content addressing.
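As a concrete illustration of what "read and write via content addressing" means in this framework, the sketch below implements a soft content-addressable memory in NumPy: reads are a similarity-weighted sum over memory rows, and writes blend new content into the most similar slots. The variable names, the cosine-similarity addressing, and the interpolation write rule are illustrative simplifications along the lines of the models studied in Chapter 3, not the exact equations used there.

# Minimal NumPy sketch of soft content-addressable read/write.
# The similarity measure and write rule are illustrative simplifications,
# not the exact update equations used in Chapter 3.
import numpy as np

def address(memory, key, beta=5.0):
    """Soft addressing: softmax over cosine similarity between the key and memory rows."""
    mem_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    key_norm = key / (np.linalg.norm(key) + 1e-8)
    scores = beta * mem_norm @ key_norm
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def read(memory, key):
    """Content-based read: similarity-weighted sum of memory rows."""
    return address(memory, key) @ memory

def write(memory, key, value):
    """Content-based write: blend the value into rows proportionally to the address weights."""
    w = address(memory, key)[:, None]          # (slots, 1)
    return (1.0 - w) * memory + w * value      # erase-then-add style interpolation

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 16))               # 8 slots of 16-dimensional vectors
pattern = rng.standard_normal(16)
M = write(M, key=pattern, value=pattern)       # store the pattern
recalled = read(M, key=pattern + 0.1 * rng.standard_normal(16))  # recall with a noisy cue
print(np.corrcoef(recalled, pattern)[0, 1])    # recalled vector correlates with the stored one

The important property for our purposes is that both functions are differentiable in the memory and in the key, so gradients can flow through the addressing even though the memory contents themselves are not updated by backpropagation.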
2) In the second step (reported in Chapter 4) we had a specific application in mind: generative models for video. In such a case it was easier to interpret what should be memorized by our memory-augmented architectures. It was easier to interpret because we worked with videos that could be decomposed into moving objects in a scene. In this
case, the “relevant event” to be memorized was the main object in the scene. Finally,
for generative modeling purposes, our network just had to learn “where” to place the
memorized events/objects. The full model also had the form of a recurrent neural network with extra modules.
Here we used the read and write memory mechanisms studied in the previous step and defined Perception Updating Networks. We also developed a statistical 2D graphics pipeline framework for validating the proposed architecture.
In that chapter we solved some of the main tasks proposed for this thesis, namely developing a cognitive architecture capable of using a content addressable memory for snapshotting relevant events and objects in a scene, as well as representing such snapshots efficiently.
3) The third and last step was an attempt to make Perception Updating Networks more practical for real-world videos (i.e., a 3D world captured by a 2D camera). Those types of videos cannot be perfectly modeled with our 2D graphics pipeline assumption. For this reason, we wrote a fully convolutional implementation of the Perception Updating Network algorithm and used it to augment Convolutional Neural Networks (both output and hidden layers were tested). The implementation relied on Convolutional Recurrent Neural Networks, which our lab had previously shown (with DPCN and RWTA) to be useful for feature extraction and unsupervised learning.
From our framework's perspective, this is simply a reimplementation of the main findings from Chapter 4, a reimplementation where the memorized events, or “what”, and the “where” are the output maps of convolutional layers.
At this point we also noticed that another possible interpretation of our model is a neural network that learns to output other, synthetic neural networks. In other words, if we interpret the “where” maps as inputs and the “what” maps as weights, their combination is essentially the implementation of a single-layer convolutional neural network. With this interpretation in mind, we plugged the Perception Updating Network layer indiscriminately into conventional convnets and were able to consistently improve results on generative modeling benchmarks. This interpretation relates Perception Updating Networks to the work on meta neural networks (Ha et al., 2016; Zoph & Le, 2016).
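To make this "network that outputs a network" reading concrete, the sketch below treats a predicted "what" map as the convolution kernel of a one-layer convnet that is applied to a predicted "where" map, which is the core of the PUN composition when implemented convolutionally. The shapes, names, and the plain convolution routine are illustrative assumptions, not the exact layer used in our experiments.

# Minimal NumPy sketch of the "where as input, what as weights" interpretation:
# a 'what' map predicted by one branch is used as the kernel of a convolution
# applied to a 'where' map predicted by another branch. Shapes and names are
# illustrative; this is not the exact PUN layer used in the experiments.
import numpy as np

def conv2d_same(image, kernel):
    """Plain 2D convolution with zero padding so the output keeps the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel[::-1, ::-1])
    return out

rng = np.random.default_rng(0)
where_map = np.zeros((16, 16)); where_map[5, 9] = 1.0   # "where": near-delta location map
what_map = rng.standard_normal((5, 5))                   # "what": memorized sprite acting as a kernel
frame = conv2d_same(where_map, what_map)                 # a one-layer convnet whose weights
                                                          # were produced by another network
print(frame.shape)  # (16, 16): the sprite is "stamped" at the location indicated by where_map

Because the "where" map is close to a delta function, the convolution simply places the memorized sprite at the indicated location, which is exactly the behavior the 2D graphics pipeline interpretation asks for.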
An alternative way to extend this research, not investigated in this thesis, is the development of a proper 3D graphics pipeline. Such a framework could be more general but also more computationally expensive. Besides that, the neural network and differentiable computer vision advances required to make such a statistical 3D graphics pipeline practical have only recently been published (late 2015 and 2016), which makes this line of research interesting for (near) future work.
Another way to take the present research further is to apply the Perception Updating Network framework to the recently published Fast Weights model (Ba et al., 2016). Fast Weights can be understood as an RNN inside an RNN, used to generate weights that update the outer RNN states. The authors also consider the entire sequence of generated states as an addressable memory and use a linear product kernel for addressing. It would be interesting to implement the PUN method as convolutional Fast Weights.
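For reference, the sketch below gives the basic Fast Weights recursion of Ba et al. (2016) in NumPy: a fast weight matrix is maintained as a decaying sum of outer products of recent hidden states and is used for a few inner "settling" steps at each time step. The dimensions, constants, and the omission of layer normalization are illustrative simplifications; a convolutional PUN variant would replace the dense products below with convolutions.

# Minimal NumPy sketch of the Fast Weights recursion of Ba et al. (2016):
#   A(t) = lam * A(t-1) + eta * h(t) h(t)^T              (fast memory of recent states)
#   h_{s+1}(t+1) = tanh(W h(t) + C x(t) + A(t) h_s(t+1)) (inner "settling" steps)
# Dimensions and constants are illustrative; layer normalization is omitted.
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input, lam, eta, inner_steps = 32, 8, 0.95, 0.5, 3
W = 0.05 * rng.standard_normal((n_hidden, n_hidden))   # slow recurrent weights
C = 0.05 * rng.standard_normal((n_hidden, n_input))    # slow input weights
A = np.zeros((n_hidden, n_hidden))                      # fast weights (outer-product memory)
h = np.zeros(n_hidden)

for x in rng.standard_normal((20, n_input)):            # a toy input sequence
    A = lam * A + eta * np.outer(h, h)                  # write the recent state into fast memory
    preliminary = W @ h + C @ x
    h_next = np.tanh(preliminary)
    for _ in range(inner_steps):                        # let fast weights attend to the recent past
        h_next = np.tanh(preliminary + A @ h_next)
    h = h_next

print(h[:5])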
Finally, now that we better understand how Cognitive Architectures can leverage memory to snapshot relevant events in space-time, future work should address how to combine that with attention mechanisms for controlling what gets snapshotted. That control mechanism should be conditioned on environment states and agent goals. In other words, it would be interesting to investigate Perception Updating Networks in the context of Focus of Attention research (Burt et al., 2016) and Reinforcement Learning (Emigh et al., 2015).
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2015). Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org, 1. 2.1.1
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(10), 1533–1545. 2.1.3
Amari, S.-I. (1988). Statistical neurodynamics of various versions of correlation associative memory. In Neural Networks, 1988., IEEE International Conference on, (pp. 633–640). IEEE. 2.7
Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., & Ionescu, C. (2016). Using fast weights to attend to the recent past. In Advances In Neural Information Processing Systems, (pp. 4331–4339). 6
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 1, 2.7, 3.3, 4.2.1
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. 2.1
Bengio, Y. (2012). Deep learning of representations for unsupervised and transfer learning. Unsupervised and Transfer Learning Challenges in Machine Learning, 7, 19. 2.1.2
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., & Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), vol. 4, (p. 3). Austin, TX. 2.1.1
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer. 2.1
Burt, R., Santana, E., Principe, J. C., Thigpen, N., & Keil, A. (2016). Predicting visual attention using gamma kernels. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, (pp. 1606–1610). IEEE. 6
Chalasani, R., & Principe, J. C. (2015). Context dependent encoding using convolutional dynamic networks. Neural Networks and Learning Systems, IEEE Transactions on, 26(9), 1992–2004. 5.1, 5-2, 5-3, 5.1
Chollet, F. (2015). Keras. GitHub repository: https://github.com/fchollet/keras. 2.1.1
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. 2.4
Dai, A., Qi, C. R., & Nießner, M. (2016). Shape completion using 3d-encoder-predictor cnns and shape synthesis. arXiv preprint arXiv:1612.00101. 5
De Vries, B., & Principe, J. C. (1992). The gamma model – a new neural model for temporal processing. Neural Networks, 5(4), 565–576. 1, 2.2, 2.2, 2.2, 3
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, (pp. 248–255). IEEE. 2.1.2
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389. 1, 2
Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. 2-3
Emigh, M., Kriminger, E., & Principe, J. C. (2015). A model based approach to exploration of continuous-state mdps using divergence-to-go. In Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on, (pp. 1–6). IEEE. 6
Eslami, S., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., & Hinton, G. E. (2016). Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575. 4.1
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1), 1–47. (document), 1, 1-1
Finn, C., Goodfellow, I., & Levine, S. (2016a). Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157. 2.6
Finn, C., Goodfellow, I., & Levine, S. (2016b). Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157. 4.1
Fitch, W. T., Hauser, M. D., & Chomsky, N. (2005). The evolution of the language faculty: clarifications and implications. Cognition, 97(2), 179–210. 2
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4), 193–202. 2.1.3
Fuster, J. M. (2003). Cortex and mind: Unifying cognition. Oxford University Press. 1
Gadelha, M., Maji, S., & Wang, R. (2016). 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872. 5
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, (p. 0278364913491297). 5.3.2
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with lstm. Neural computation, 12(10), 2451–2471. 2.4
Giles, C. L., Miller, C. B., Chen, D., Chen, H.-H., Sun, G.-Z., & Lee, Y.-C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405. 2.4
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, (pp. 249–256). 2.1.2, 3.5.1
Godard, C., Mac Aodha, O., & Brostow, G. J. (2016). Unsupervised monocular depth estimation with left-right consistency. arXiv preprint arXiv:1609.03677. 4.1
Goodale, M. A., & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in neurosciences, 15(1), 20–25. 5.1
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, (pp. 2672–2680). 4.1
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines. arXiv preprint arXiv:1410.5401. 1, 3, 3.4, 3.4, 3.4, 3.4, 3.5, 3.5.3, 3.5.3, 4.2.1, 4.3
Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. 2.7, 3.3
Ha, D., Dai, A., & Le, Q. V. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106. 6
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al. (2014). Deepspeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. 1, 2
Hasanbelliu, E., & Principe, J. C. (2008). Content addressable memories in reproducing kernel hilbert spaces. In Machine Learning for Signal Processing, 2008. MLSP 2008. IEEE Workshop on, (pp. 9–13). IEEE. 2.7, 2.7
Haykin, S. (2004). A comprehensive foundation. Neural Networks: A Comprehensive Foundation, 2(2004). 2.1.3
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 770–778). 5-5
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. 2.1.2
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2.1.2
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780. 2.2, 2.4, 3.5, 3.5.2, 3.5.2
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8), 2554–2558. 2.7
Hu, Y., Mian, A. S., & Owens, R. (2011). Sparse approximated nearest points for image set classification. In Computer vision and pattern recognition (CVPR), 2011 IEEE conference on, (pp. 121–128). IEEE. 5-3
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1), 215–243. 2.1.3
Hurri, J., & Hyvarinen, A. (2003). Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3), 663–691. 4.1
Hyvarinen, A., Karhunen, J., & Oja, E. (2004). Independent component analysis, vol. 46. John Wiley & Sons. 2.1.2
Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. arXiv preprint arXiv:1506.02025. 4.1, 4.2.1
Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. GMD-Forschungszentrum Informationstechnik. 2.4
Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. 4.2.1, 4.2.2, 4.2.2, 4.2.2
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), (pp. 2342–2350). 2.4, 3.2
Kaiser, L., & Sutskever, I. (2015). Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228. 2.6
Kalchbrenner, N., Oord, A. v. d., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Video pixel networks. arXiv preprint arXiv:1610.00527. 2.6, 4.1, 4.4.2, 5.3.1, 5.3.2
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, (pp. 1725–1732). IEEE. 1
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. 2.1.3
Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 1, 2.1.2, 3.5.2
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 4.2.1, 4.2.2, 4.2.2, 4.2.2, 4.2.3
Kohonen, T. (2012). Content-addressable memories, vol. 1. Springer Science & Business Media. 2.7
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, (pp. 1097–1105). 2.1.1, 2.1.3
Kulkarni, T. D., Whitney, W. F., Kohli, P., & Tenenbaum, J. (2015). Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, (pp. 2539–2547). 4.1
Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, (pp. 3361–3368). IEEE. 5-2
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 2.1.3
Lee, K.-C., Ho, J., Yang, M.-H., & Kriegman, D. (2005). Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image Understanding, 99(3), 303–331. 5-3
Li, K., & Principe, J. C. (2016). The kernel adaptive autoregressive-moving-average algorithm. IEEE transactions on neural networks and learning systems, 27(2), 334–346. 1
Liang, M., & Hu, X. (2015). Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3367–3375). 2.6
Lotter, W., Kreiman, G., & Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104. 2.6, 4.1
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91–110. 2.1.2
Maddison, C. J., Mnih, A., & Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. 4.2.1, 4.2.2, 4.2.2
Mahjourian, R., Wicke, M., & Angelova, A. (2016). Geometry-based next frame prediction from monocular video. arXiv preprint arXiv:1609.06377. 4.1
Makhzani, A., & Frey, B. J. (2015). Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, (pp. 2773–2781). 5.1
McCormac, J., Handa, A., Leutenegger, S., & Davison, A. J. (2016). Scenenet rgb-d: 5m photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079. 5
Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, (pp. 737–744). ACM. 5.1, 5-2
Nene, S. A., Nayar, S. K., Murase, H., et al. (1996). Columbia object image library (coil-20). Tech. rep., technical report CUCS-005-96. 5-3, 5.1
Olshausen, B. A., et al. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. 4.1
Omlin, C. W., & Giles, C. L. (1996). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM (JACM), 43(6), 937–972. 2.4
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random fields. In Advances in neural information processing systems, (pp. 1121–1128). 4.1
Pagiamtzis, K., & Sheikholeslami, A. (2006). Content-addressable memory (cam) circuits and architectures: A tutorial and survey. Solid-State Circuits, IEEE Journal of, 41(3), 712–727. 2.7
Palm, G., Schwenker, F., Sommer, F. T., & Strey, A. (1997). Neural associative memories. Associative processing and processors, (pp. 307–326). 2.7
Pascanu, R., Mikolov, T., & Bengio, Y. (2012). On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063. 1, 3.5.2
Patraucean, V., Handa, A., & Cipolla, R. (2015). Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309. 2.6
Principe, J. C., & Chalasani, R. (2014). Cognitive architectures for sensory processing. Proceedings of the IEEE, 102(4), 514–525. (document), 1, 1, 2-2, 2.3, 4
Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (1999). Neural and adaptive systems: fundamentals through simulations with CD-ROM. John Wiley & Sons, Inc. 2.4, 2.7
Principe, J. C., Xu, D., & Fisher, J. (2000). Information theoretic learning. Unsupervised adaptive filtering, 1, 265–319. 3.4
Rabiner, L. R. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. 1
Ravanbakhsh, S., Schneider, J., & Poczos, B. (2016). Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500. 5
Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, (pp. 512–519). IEEE. 2.1.2
Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016). Learning what and where to draw. arXiv preprint arXiv:1610.02454. 4.1
Rezende, D. J., Eslami, S., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016). Unsupervised learning of 3d structure from images. arXiv preprint arXiv:1607.00662. 4.1
Robinson, A., & Fallside, F. (1987). The utility driven dynamic error propagation network. University of Cambridge Department of Engineering. 2.4
Rolls, E. T. (2007). An attractor network in the hippocampus: theory and neurophysiology. Learning & Memory, 14(11), 714–731. 1
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6), 386. 2.1
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5, 3. 2.1
Sandberg, I. W., & Xu, L. (1997). Uniform approximation of multidimensional myopic maps. Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on, 44(6), 477–500. 2.2, 2.4
Santana, E., Burt, R., & Principe, J. C. (2017). Memory augmented auto-encoders. In preparation. (document), 3.3
Santana, E., Cinar, G. T., & Principe, J. C. (2015). Parallel flow in deep predictive coding networks. In 2015 International Joint Conference on Neural Networks (IJCNN), (pp. 1–5). IEEE. (document)
Santana, E., Emigh, M., & Principe, J. C. (2016a). Information theoretic-learning auto-encoder. 2016 International Joint Conference on Neural Networks (IJCNN). (document)
Santana, E., Emigh, M., Zerges, P., & Principe, J. C. (2016b). Exploiting spatio-temporal structure with recurrent winner-take-all networks. arXiv preprint arXiv:1611.00050. 2.6, 5, 5, 5.1
Santana, E., & Hotz, G. (2016). Learning a driving simulator. arXiv preprint arXiv:1608.01230. (document)
Santana, E., & Principe, J. C. (2015). Mixed generative and supervised learning modes in deep predictive coding networks. In 2015 International Joint Conference on Neural Networks (IJCNN), (pp. 1–4). IEEE. (document)
Santana, E., & Principe, J. C. (2016). Perception updating networks: On architectural constraints for interpretable video generative models. ICLR 2017 (submitted). (document)
Shirley, P., Ashikhmin, M., & Marschner, S. (2015). Fundamentals of computer graphics. CRC Press. 4.1
Siegelmann, H. T., & Sontag, E. D. (1995). On the computational power of neural nets. Journal of computer and system sciences, 50(1), 132–150. 2.4
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual review of neuroscience, 24(1), 1193–1216. 4.1
Socher, R., Huang, E. H., Pennin, J., Manning, C. D., & Ng, A. Y. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, (pp. 801–809). 2.4
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806. 2.1.3
Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using lstms. arXiv preprint arXiv:1502.04681. 1, 4.4
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), (pp. 1139–1147). 2.4, 3.5.2
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, (pp. 3104–3112). 1, 2.4
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4. 2.1.2
van Handel, R. (2014). Probability in high dimension. 3.4
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11, 3371–3408. 2.1.2
Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612. 4.1, 4.4.1
Wang, R., & Chen, X. (2009). Manifold discriminant analysis. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, (pp. 429–436). IEEE. 5-3
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356. 2.4
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. 5.1
Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916. 1, 3, 3.1.1
Williams, R. J., & Zipser, D. (1989). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1), 87–111. 2.4
Wu, J., Zhang, C., Xue, T., Freeman, B., & Tenenbaum, J. (2016). Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, (pp. 82–90). 5
Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., & Woo, W.-c. (2015). Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, (pp. 802–810). 2.6
Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044. 2.7
Yan, X., Yang, J., Yumer, E., Guo, Y., & Lee, H. (2016). Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances In Neural Information Processing Systems, (pp. 1696–1704). 5
Zaremba, W., & Sutskever, I. (2015). Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521. 3
Zhu, Q., Yeh, M.-C., Cheng, K.-T., & Avidan, S. (2006). Fast human detection using a cascade of histograms of oriented gradients. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, (pp. 1491–1498). IEEE. 2.1.2
Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. 6
BIOGRAPHICAL SKETCH
Ewaldo Eder Carvalho Santana Jr. (or just Eder Santana because no system in
America has space enough to sign his full name) was born in Brazil in 1988. Eder
graduated from the Federal University of Maranhao, where he received both Bachelor of Science (2011) and Master of Science (2012) degrees in electrical engineering.
In 2013 Eder dropped out of the PhD program at the Federal University of Maranhao to try to make a living in America. In 2017 Eder received his PhD in electrical
and computer engineering from the University of Florida, where he was advised by Dr.
Jose C. Principe.
Eder Santana is active in the Machine Learning community. He contributed to Keras, the most popular high-level neural network design framework, and published the video course Deep Learning with Python. He also worked at Comma.ai, helping to develop AI for self-driving cars, and at Paracosm.io, leveraging deep learning for 3D object recognition.