A FRAMEWORK FOR PATTERN CONSOLIDATION IN COGNITIVE ARCHITECTURES
By
EWALDO EDER CARVALHO SANTANA JR.
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2017
© 2017 Ewaldo Eder Carvalho Santana Jr.
To the most lovely of all, my mother Oxum
ACKNOWLEDGMENTS
I thank my advisor Jose C. Príncipe for giving me the opportunity to fulfill my dream
of becoming the very best, the best there ever was. I also thank the University of Florida
for the graduate scholarship.
I am also very thankful to my friends in CNEL. I especially thank my homies Ryan,
Evan, Matt, Goktug, Mihael and Austin for the friendship, support and occasional
babysitting.
Most importantly, I thank my family for the love and for allowing me to stay so long
abroad. Inez, Lucas, Livia and Zion, you are the most important people in the world. I
love you!
Lastly, I would not be able to pull off any science without the ever-constant love
of God. Thanks to my mother Oxum and my guides da Ilha, D. Maria and all the
anonymous supporters.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Deep Feedforward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Faster Computers and Big Data . . . . . . . . . . . . . . . . . . . . . 20
2.1.2 Elaborate Initialization Techniques and Learning Algorithms . . . . . 21
2.1.3 Task Specific Architectures . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Temporal Processing with Neural Networks . . . . . . . . . . . . . . . . . . 25
2.3 Deep Predictive Coding Networks . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Combining CNNs and RNNs: Convolutional Recurrent Neural Networks . . 36
2.7 Content Addressable Memories . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 A FRAMEWORK FOR DYNAMIC ADDRESSABLE MEMORIES . . . . . . . . 42
3.1 Memory Reading and Writing in Recurrent Neural Networks . . . . . . . . . 45
3.1.1 Type I: DiffRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.2 Type II: CAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Content Addressable Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Differentiable Random Access Memory . . . . . . . . . . . . . . . . . . . . 51
3.4 Hybrid Access Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.1 Initialization and Algorithmic Choices . . . . . . . . . . . . . . . . . . 57
3.5.2 Adding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.3 Copy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.4 Sequence Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 ADDRESSABLE MEMORIES AS PART OF A DIFFERENTIABLE GRAPHICS PIPELINE FOR VIDEO PREDICTION . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 On the Need of a Differentiable Computer Graphics Pipeline . . . . . . . . . 65
4.2 A 2D Statistical Graphics Pipeline . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Preliminary Considerations and Relevant Literature Review . . . . 69
4.2.2 Variational Autoencoding Bayes . . . . . . . . . . . . . . . . . . . . . 72
4.2.3 Proposed Statistical Framework . . . . . . . . . . . . . . . . . . . . . 76
4.3 Perception Updating Networks . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Bouncing Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.2 Moving MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.3 Visualizing the RNN-to-CAM Connections . . . . . . . . . . . . . . . 88
4.4.4 Snapshotting “What” Directly from Pixels . . . . . . . . . . . . . . . . 89
4.5 Rules of Thumb for Model Choice . . . . . . . . . . . . . . . . . . . . . . 92
5 SCALING UP PERCEPTION UPDATING NETWORKS . . . . . . . . . . . . . 94
5.1 Convolutional Recurrent Neural Networks for Unsupervised Learning of Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 ConvRNN + PUN: Combining Convolutional RNNs and Perception Updating Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.1 Moving MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Real Videos: Kitti Dataset . . . . . . . . . . . . . . . . . . . . . . . . 106
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
3-1 Adding Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-2 Copy Problem: Percentage of correctly copied bits. . . . . . . . . . . . . . . . . 61
3-3 Sequence generation cost function (negative log likelihood, NLL) on the test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3-4 Classification accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4-1 Comparison between Snapshot PUN and conv PUN on the single digit moving MNIST benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5-1 Hyperparameter choices per experiment . . . . . . . . . . . . . . . . . . . . . . 99
5-2 Recognition rate (in percentage %) for object recognition in Coil-100 dataset . 101
5-3 Recognition rate (in percentage %) for face recognition in Honda/UCSD dataset 101
5-4 Experiments with hidden PUN. Average negative log-likelihoods (nats) on videoprediction experiments with the Moving MNIST benchmark. . . . . . . . . . . . 105
5-5 Hyperparameters and quantitative results on the test set of Kitti Dataset. . . . . 109
LIST OF FIGURES
Figure page
1-1 Felleman & Van Essen (1991) diagram of wiring in the visual cortex. . . . . . . 14
2-1 Deep Neural Network for temporal processing. . . . . . . . . . . . . . . . . . . 26
2-2 Principe & Chalasani (2014) schematic diagram of a Deep Predictive Coding Network with two layers showing bottom-up and top-down information flow. . . 29
2-3 Example of convolutional neural network (CNN) layer computation. In this example a single channel input, filter and output are illustrated. . . . . . . . . . 34
2-4 Example of convolutional neural network (CNN) layer computation with zero padding and strides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2-5 Visual representation of a convolutional recurrent neural network. . . . . . . . . 38
2-6 Diagram of Differentiable Random Access Memory (DiffRAM). . . . . . . . . . 41
3-1 Schematic diagram of a memory augmented recurrent neural network. . . . . . 46
3-2 Adding problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-3 Neural Turing Machine operations in the Copy problem. . . . . . . . . . . . . . 62
3-4 Sample desired and generated sequences using NTM2 and LSTM. . . . . . . . 64
4-1 Steps of the 2D graphics or rendering pipeline that inspired our model. . . . . . 66
4-2 How to get similar results using convolutions with delta-functions and spatial transformers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-3 Variational autoencoder graphical model. . . . . . . . . . . . . . . . . . . . . . 73
4-4 Block diagram of a Variational Autoencoder with Gaussian prior and reparametrization trick. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-5 A schematic block diagram for a Perception Updating Network. . . . . . . . . . 80
4-6 Results on the Bouncing Shapes dataset. . . . . . . . . . . . . . . . . . . . . . 83
4-7 Results of a Convolutional Perception Updating Network. . . . . . . . . . . . . 85
4-8 Performance curves in the test task of two implementations of the proposed architecture (conv PUN and STN PUN) and an equivalent LSTM baseline. . . . 86
4-9 Sample rollouts of a 2 layer LSTM convolutional Perception Updating Network. 87
4-10 A piece of the schematic block diagram for a Perception Updating Network and t-SNE embedding of the codes sent from the RNN controller to CAM. . . . 90
4-11 Snapshot Perception Updating Network. See Figure 4-5 and compare it to the convolutional Perception Updating Network model. . . . . . . . . . . . . . . 91
5-1 Convolutional Perception Updating Network as a hidden layer of a deep convnet. 96
5-2 Schematic diagram of the Recurrent Winner-Take-All (RWTA) network. . . . . . 98
5-3 Sample videos from Coil and Honda/UCSD datasets. . . . . . . . . . . . . . . 99
5-4 128 decoder weights of 7x7 pixels learned on Coil-100 videos. . . . . . . . . . 100
5-5 Deep residual U-net with Perception Updating Networks output. . . . . . . . . . 106
5-6 Definition of a single resnet block used in the experiments. . . . . . . . . . . . 107
5-7 Dilated convolution with a filter of 3x3 pixels with dilation rate of 1x1. . . . . . . 108
5-8 Qualitative results on the test set of Kitti Dataset. . . . . . . . . . . . . . . . . . 108
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A FRAMEWORK FOR PATTERN CONSOLIDATION IN COGNITIVE ARCHITECTURES
By
Ewaldo Eder Carvalho Santana Jr.
May 2017
Chair: Jose C. Príncipe
Major: Electrical and Computer Engineering
One of the most essential functions in human sensory processing is the ability to
group a sequence of stimuli that impresses the senses into a single coherent experience
of “what” happened in the world. For example, humans interpret another person’s
actions as a whole, not as a sequence of independent poses in different scenes.
Although the interpretation of individual scenes is fundamental, understanding the
complete experience relies on appropriate temporal context.
In order to capture sensory inputs in video, recent approaches such as Deep
Predictive Coding Networks and Hierarchical Temporal Memories rely on single temporal
step prediction and optimization. Unfortunately, this approach is not generic enough for
most real world video, audio and text analysis.
The proposed work enhances the Cognitive Architecture for Sensory Processing
by organizing its internal representations in explicit “what” and “where” components,
which has not been addressed in the literature. Currently, very long vectors of internal
states of the top layers are fed to external classifiers to decide what has been presented
to the system input; however, this method may not scale up when millions of images (or
videos) have been processed. Our hypothesis is that the combination of a recurrent
neural network, which models short term working memories, and a long-term content
addressable memory, inspired by the functional connections between the neocortex
and the hippocampus, will provide a solution that scales better.
We investigate recurrent neural networks that can be trained with backpropagation
through time with an external addressable memory. With recurrent architectures and
backpropagation through time we do not have to rely on the Markov assumption for
learning models of sequential data. Also, with addressable memory extensions we
can decouple the memory capacity of the proposed architecture from the number of
adaptive parameters, thus scaling better for practical engineering applications. And most
importantly, the addressable memory can be used to consolidate the decaying dynamic
states of recurrent neural networks.
To illustrate this framework in practice, we investigate the task of video prediction.
In video prediction we have to calculate future frames using previous frames as context.
We do so by explicitly modeling the moving objects in the scene and their dynamics as
separate components. This way, we can snapshot representations of “what” is in the
scene and model its dynamics, or “where” the object is in the scene, as separate latent
factors. For this part of our work, beyond the inspiration of addressable memories from
neuroscience, we took inspiration from computer graphics pipelines while designing our
network, which is also an example of how to incorporate efficient computer science
discoveries into neural network design.
From a practitioner's perspective, our contributions to the neural network literature
are about novel architectures, architectural constraints and design guidelines. We
encourage the interested reader to find inspiration in this work beyond the direct
applications of image and video processing, as an example of how to convert
consolidated computer science and physics knowledge into neural networks that
can be optimized with backpropagation.
To summarize, the main contributions of this thesis to the literature are organized as
follows:
• In Chapter 3 we discuss a generalized framework for memory augmented recurrent neural networks (Santana et al., 2017). There we show networks that learn to read and write to content addressable memories.
• We focus on memory reading mechanisms and video prediction in Chapter 4. This is the chapter where we propose a novel variational statistical framework for videos with decoupled “what” and “where” object representations. This is also the chapter where we introduce the algorithm of Perception Updating Networks (Santana & Principe, 2016).
During the time we were working on the present thesis we published other
papers on unsupervised learning for self-driving cars data (Santana & Hotz, 2016),
image filtering (Santana & Principe, 2015), video-audio sensor fusion (Santana et al.,
2015), information theoretic learning autoencoders (Santana et al., 2016a), etc. That
work contributed to the knowledge presented here but will not be discussed in detail,
unlike the papers itemized above.
CHAPTER 1
INTRODUCTION
Modeling sequential data is a challenge in many different fields. Examples are text
in Language Modeling, speech waveforms in voice recognition and image sequences in
video analysis. Humans can interpret the context of these different signals despite noise,
scaling, rotation and some degrees of deformation. Attempts to model this cognitive
ability from a statistical perspective usually rely on the Markov assumption, which is valid
only for very simple cases (Rabiner, 1989). When that assumption is valid, we can factor
the joint distribution of the sensory input as a product of transition probabilities.
Recently (Principe & Chalasani, 2014) proposed a Cognitive Architecture for
Sensory Processing inspired by human sensory processing, called Deep Predictive
Coding Networks (DPCN). Figure 1-1a represents the anatomical diagram of the
visual cortex (Felleman & Van Essen, 1991) showing the flow of information from the
retina (bottom) until it reaches the hippocampus, the brain structure that consolidates
memories. We can see a distributed, hierarchical set of highly interconnected subsystems
that extract distinct information from the visual scene that constitutes the percept. In the
top of the architecture we have the hippocampus that consolidates these features as
long term memories that can be associatively recalled. These recollected features
act as causes or priors for the representations in the lower layers. Thus, the overall
architecture tries to predict the outer world based on its internal representations.
Whenever the model falsifies its hypothesis with a poor prediction it updates its memory
and consequently its perception of the world (Fuster, 2003). This provides the amazing
cognitive capabilities of humans and other higher order animals, but there is still the
open problem of what comes first, perception or memory?
The DPCN model attempts to preserve some of this distributed, hierarchical,
bidirectional, online and self-organizing flow of information in Figure 1-1b. DPCNs
are built from identical blocks containing a generalized state space model, which are
organized hierarchically.

Figure 1-1. Felleman & Van Essen (1991) diagram of wiring in the visual cortex. In the Cognitive Architectures for Sensory Processing framework the hippocampus generates causes that guide the feature extraction in the lower layers.

DPCN uses an empirical Bayes framework, with top-down
and bottom-up information flow, which projects the incoming video frames on an
overcomplete and sparse basis, learned from the data (similar to V1). Through inference
these features represent the input video frames at different scales and the causes and
states of the top layer successfully discriminate objects in a self organizing way, even
under shift and rotation transformations. In Principe & Chalasani (2014)’s framework,
perceptions are represented by the single step predictions of DPCNs, which are
multidimensional transient vectors in time that only exist while the image is present
at the input retina. Therefore in our previous work, the user has to capture the top states
and causes synchronously with the presentation of the images and all the published
results with DPCN utilize a classifier, following the trend in deep learning. Performance
was as good or better than other unsupervised convolutional models presented in the
literature at the time, but the user must be in the loop and the classifier framework is
weak because one must know how many classes exist, and train the classifiers. This
dissertation will continue the research in the cognitive architecture and implement
the time to space mapping involved in consolidating the DPCN transient causes and
states in permanent memory that can be organized by content, in a self organizing way,
following the spirit of the cognitive architecture.
What is missing in DPCNs is the ability to take snapshots to consolidate the
perceptions of the sensory processing networks and an associative recall mechanism
for bringing those perceptions back to the context dynamics. To give a practical example
of the limitations of a network without memory consolidation, we can think about a video
where a person appears and disappears from the camera sight. When the person is
present, Principe & Chalasani (2014) showed that the DPCN could generate online
representations of its facial features that could be used for classification. Nevertheless,
when the person leaves the scene, the DPCN continues to represent the background
without explicit knowledge that the person has left; it will simply continue to represent
the new sensory input. If the subject ever comes back, the internal representations of
its face will be already blurred with representations of the background. With a memory
consolidation mechanism, when the facial features were first represented, they could be
stored and compared with the representations of future sensory input.
The issue we still have to face is what is the best way to model the combination
of the working memory with the Content Addressable Memory (CAM). In neural
network theory there are two basic types of memory mechanisms (De Vries & Principe,
1992): the finite window memory that can be implemented by a tap delay line, and the
exponentially decaying memories that can be implemented by infinite impulse response filters, with
the gamma memory as a hybrid. The tap delay line is very constraining because one
has to know a priori the memory depth. For instance, Karpathy et al. (2014) showed that
a deep convolutional network, with filters convolving both space and time, did not have
better results than simply classifying each frame and voting to classify the entire video.
Alternatively, Recurrent Neural Networks (RNN) are a more appropriate model
to represent the past information, since they implement nonlinear Infinite Impulse
Response (IIR) feature extractors that were successful in problems such as transcribing
speech signals to text (Hannun et al., 2014), text translation (Sutskever et al., 2014),
etc. Recently, Li & Príncipe (2016) showed that RNNs can also be implemented in
Reproducing Kernel Hilbert Spaces (RKHS). RNNs trained with features extracted with
feedforward neural networks (Donahue et al., 2014) (Srivastava et al., 2015) had better
results than the purely convolutional neural network proposed in (Karpathy et al., 2014).
Unfortunately, even RNN memories are reliable only up to a certain point. Later,
we will discuss how IIR models still have uniformly decaying memories. Another problem
with RNN memories is that they do not scale well and backpropagation through time
(their main learning algorithm) is only reliable up to a certain input length (usually 100
time points) (Pascanu et al., 2012). Backpropagation Through Time (BPTT) behavior
can be bounded by limiting the length of the input training batch and using modern
stochastic gradient optimization techniques such as gradient clipping (Pascanu et al.,
2012) and Adam (Kingma & Ba, 2014). But the number of trainable parameters grows
quadratically with the size of the dynamic state of the RNN. For instance, the sequence
translation model cited above (Sutskever et al., 2014) used a 3 layer RNN, each one
with a memory of size 1000, which amounts to 3 million adaptive parameters just for the
hidden-to-hidden transition matrices.
Here we look to brain sciences for inspiration to solve the memory problem
of RNNs. Specifically, we comment on the interactions between the neocortex and
the Hippocampus. Hippocampus region III (CA3) can arguably be regarded as an
autoassociation or attractor network involved in spatial functions and memory (Rolls,
2007). Region I (CA1), on its turn, records information from CA3 and back-projects it to
the neocortex. Thus, the Hippocampus and the neocortex implement complementary
memory types, with the latter being used for rapidly changing unstructured memorization
and the former for building semantic representations of what has been stored and how
to retrieve that information (Rolls, 2007). This will be our inspiration to design the top
layer of our cognitive sensory processing system, where we can associate the outputs of
DPCNs with the sensory cortex, extracting features from the input data. These features
are fed to the hippocampus, where the CA3 plays the role of the RNN that is capable
of representing its variable length input in its state within the short term past. However
this is not sufficient as we discussed above, because we would like to consolidate this
information permanently and organize it with the past stored representations that the
system acquired previously.
The RNN research community has recently started to pay more attention to this
missing architectural feature and proposed developments along the following lines: an
RNN should not rely solely on its dynamic hidden states when computing its next state
and output, but it should be able to store and retrieve previous states using an attention
mechanism that considers the context of the current input and next desired output.
On that line of thought, Bahdanau et al. (2014) proposed RNNSearch, a recurrent
encoder-decoder where all the hidden dynamic states of the encoder are saved and
partially retrieved by the decoder. Graves et al. (2014) proposed Neural Turing Machines
that explicitly define a content and location addressable memory inspired by memory
tapes in Turing Machines. Memory Networks (Weston et al., 2014), on the other hand,
store variable length inputs and learn to rank and retrieve the relevant ones at query
time. RNNSearch, although it has been successfully applied to automatic sentence
translation, has the same downside as feedforward neural networks of representing
variable length inputs by other variable length outputs since its memory is just a non
self-organizing stack, which for instance makes its application to unsupervised clustering
harder. Memory networks provide a general framework for using memory appends,
but their approach of storing input sequences as they are is not easily scalable or
biologically plausible. Memory Network memories can also be interpreted as storing
temporal windows, similarly to FIRs. In this work we develop our contributions
reasoning more closely to NTMs, instead.
Here, we propose a general framework for reinterpreting RNNs in a way that allows
us to propose a family of architectures where addressable memories augment dynamic
neural networks. We show that the NTM can be seen as a specific case of this framework,
and we also propose alternative architectures.
Afterwards, we use our findings on memory augmented RNNs to propose a
neural 2D graphics pipeline. This pipeline will be used for modeling videos with explicit
snapshotting of “what” is in the scene independent of “where” it is. The memory in
this new system can be interpreted as a sprite (or object) database. The remaining
components of the architecture learn where to place the sprite in the scene and model
its movement for video prediction. The final result is an architecture that memorizes
perceptions and updates its representation, which motivated us to call the system
Perception Updating Networks.
In the next chapter we review the relevant literature, especially memory structures
in neural networks, Deep Learning and RNN, on top of which we propose to build our
contributions.
CHAPTER 2
BACKGROUND
Here we are interested in analyzing a vector valued sequence xt where the (usually
time) index t may have different length for different realizations. For example, in a
video xt are the pixels of a frame; it could also be a vectorial representation of words
for text analysis or pieces of speech signal. We will review the relevant literature of
Signal Processing and Deep Learning for this thesis, starting with deep feedforward
neural networks that deal with each sample xt independently of the temporal
context. Deep feedforward neural networks (DFNN) are relevant for feature extraction,
for instance, but we are mostly interested in finding structure in variable-length data
and recursive processing, which is fundamental to enable language understanding
(Fitch et al., 2005), action recognition (Donahue et al., 2014) and voice transcription
(Hannun et al., 2014) among other applications. For that objective, we will also review
the literature on temporal signal processing, focused on Time Delayed and Recurrent
Neural Networks, later in this chapter.
2.1 Deep Feedforward Neural Networks
Typically in Machine Learning a multiple step pipeline of signal processing is
required from the input data to the final task output (Bishop, 2006). For example,
removing the mean and renormalization, outlier removal, dimension reduction, and
finally the classification or regression task. The main philosophy motivating Deep
Learning is to learn all the preprocessing steps and the ultimate task directly from
data (Bengio, 2009). Since this entire process mostly uses Artificial Neural Networks
(ANN), Deep Learning can be also considered the third generation of ANNs. This is the
generation with networks several hidden layers deep, also called Deep Neural Networks
(DNN). The first generation of ANNs introduced several adaptive artificial neurons
(here called processing elements-PEs) and the Perceptron learning rule (Rosenblatt,
1958). The second generation was sparked by the backpropagation algorithm applied
to Multilayer Perceptrons (MLP) (Rumelhart et al., 1988), which is nothing but an
application of the chain rule from basic Calculus, to gradient computation in nonlinear
multilayer architectures. This generation provided the first class of Universal Learning
Machines (ULMs) that can be trained directly from data. The generation branded as
Deep Learning or DNNs, can be largely credited to three main factors being used
together:
• Faster computers and huge amounts of training data (Big Data)
• Elaborate initialization techniques and learning algorithms
• Task specific architectures
2.1.1 Faster Computers and Big Data
With faster CPUs and General Purpose Graphics Processing Units (GPGPU)
it became possible to train larger neural networks in practical time. It is well known that
ANNs with at least one hidden layer are ULMs. Thus, an ANN can theoretically learn
any function as long as it has enough PEs on its hidden layers. However, the system
still uses the same basic representation space to construct the input-output map.
Discriminative applications such as object recognition in pictures, word spotting in audio
streams, etc. benefit from a more versatile representation structure at multiple spatial
or temporal scales that better mimics the structure of the input space. This asks for
several hidden layers with many PEs and hundreds of thousands of parameters to reliably
approximate real world problems and generalize well. Such architectures also require
a huge amount of training data to construct the internal representations. Only very
recently has there been sufficient data to accomplish proper training of such large architectures.
In such cases, ANNs with hundreds of thousands or even millions (Krizhevsky et al.,
2012) of parameters can now be trained with the current version of cloud computing or
special GPU clusters, which was unthinkable a few years ago.
We should also consider novel programming frameworks that leverage this extra
computational power. Especially the open source libraries that allow fast prototyping of
different architectures with code that compiles seamlessly to CPU or GPU. The number
of scientific citations to these libraries makes it clear how important it is to provide a
simple abstraction for fast scientific experimentation for deep neural network research.
In this work, we used Theano (Bergstra et al., 2010) and Tensorflow (Abadi et al., 2015),
which are Python frameworks for Machine Learning that, among many other features,
provide GPU abstraction and automatic differentiation, which guarantees that we are
using the correct gradients even when testing new complex architectures, such as the
ones proposed here. We build our models using Keras (Chollet, 2015), an open source
deep learning library on top of Tensorflow and Theano, to which the present author is also
a voluntary contributor. All the code used in this thesis will be made open source for fast
reproducibility.
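As a rough illustration of the level of abstraction these libraries provide, the short Python sketch below defines, compiles and runs a small fully connected classifier in Keras; the layer sizes and the random placeholder data are arbitrary choices for this example only and are not models used in this thesis.

# Minimal Keras sketch (illustrative only): define and compile a small
# fully connected classifier, then run a forward pass on random data.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=784))  # hidden layer
model.add(Dense(10, activation='softmax'))                # class probabilities
model.compile(optimizer='adam', loss='categorical_crossentropy')

x = np.random.rand(32, 784).astype('float32')             # placeholder batch
predictions = model.predict(x)                             # shape (32, 10)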
2.1.2 Elaborate Initialization Techniques and Learning Algorithms
When the parameters of an ANN are initialized with random values, the function
approximated by the network may be too distant from the required mapping. In such
cases the ANN may need too much time to be trained, especially when using sigmoidal
activation functions, where the vanishing gradient problem (Hochreiter et al., 2001) slows
convergence down. The existence of very low gradient regions, because of the non-convex
nature of the problem, makes this a daunting task. One of the first solutions to overcome
this problem was unsupervised pre-training with Auto-Encoder (AE) (Vincent et al.,
2010). An AE consists of two parts: an encoding function that transforms the data into
a code space, and a decoding function that reconstructs the original input from its code.
For the simple case where the encoder is a linear function, the decoder is simply its
transpose, and the hidden codes are the principal components of the input. Thus AE
pre-training places the weights of the ANN in the directions of significant statistical
properties of the data. Note that initialization is not a problem for Kernel Machines and
RBFs because their transformation functions are always centered in the data, which is
an advantage. But for randomly initialized DNNs, starting far from the data domain is a
problem. The process of initializing a DNN using unsupervised pre-training, in its most
practical form, is a layer-wise procedure. First, starting from the original data input, an
encoder/decoder function pair is trained to represent the data as a first layer code set.
Afterwards, another encoder/decoder pair is trained, but this time using the codes from
the previous pair as input. After enough encoder/decoder pairs are trained, the encoders
are stacked up and used as an initialized DNN, which, in turn, can be fine-tuned
using backpropagation of error for task specific problems. Hinton & Salakhutdinov
(2006) used a Restricted Boltzmann Machine (RBM) to train the encoder/decoder pairs.
RBMs are undirected graphical models and can be trained using Gibbs sampling and an
approximate Markov chain Monte Carlo (MCMC) method called contrastive divergence.
Vincent et al. (2010) also showed that it is possible to train the encoder/decoder pairs
using a technique similar to non-linear component analysis and obtained initializations of
similar quality.
However, we should note that with larger datasets such sensible initializations are
not mandatory. Glorot & Bengio (2010) showed that it suffices to initialize the random
weights to cover the linear region of the sigmoidal nonlinearities (or to use piecewise
linear activations) and train the network with appropriate extensions of the Stochastic
Gradient Descent (SGD) rule and mini-batches. This is a common misconception
among practitioners and neophytes to Deep Learning. They sometimes refer to older
approaches, such as the debuting paper by Hinton & Salakhutdinov (2006), that focus
on pre-training and batch mode fine-tuning, while recent approaches have almost
completely abandoned pre-training in favor of appropriate random initializations and
mini-batches. Also, instead of the second order methods, such as Conjugate Gradient
(Hinton & Salakhutdinov, 2006), recent literature has focused on SGD approaches that
only approximate the diagonal of the Hessian of the cost function. Examples of such
algorithms are RMSprop (Tieleman & Hinton, 2012) and ADAM (Kingma & Ba, 2014),
that rely on moving average estimates of second order statistics of the gradient and
momentum to optimize through slow gradient regions.
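The sketch below shows, in plain numpy, one ADAM update as described by Kingma & Ba (2014): moving averages of the gradient and of its element-wise square, with bias correction. The function and variable names are ours, and the default constants follow the original paper.

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; m and v are the running moments, t the step count (t >= 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # second moment (per-parameter scale)
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v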
Here we reproduce the most common nonlinearities used for deep learning, to
make clear the statements above:
$$\mathrm{tanh:}\;\; \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \mathrm{logistic:}\;\; \sigma(x) = \frac{1}{1 + e^{-x}},$$
$$\mathrm{ReLU:}\;\; \mathrm{relu}(x) = \max(0, x), \qquad \mathrm{softmax:}\;\; \mathrm{softmax}_i(x) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \qquad \text{(2–1)}$$
where ReLU stands for rectified linear unit, which is the nonlinearity that provides the best
results in practice and is also less sensitive to precise initialization, since its linear region
is obviously larger than that of the sigmoids. Softmax returns a probability distribution
and is usually used as the last nonlinearity to represent class probabilities. They can
also be used as weights to average a vector, which is an application that we will discuss
later in this chapter.
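The nonlinearities in (2–1) translate directly to numpy; the only practical caveat, reflected in the sketch below, is that softmax is usually computed after subtracting the maximum input value for numerical stability.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

# tanh is available directly as np.tanh(x)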
For more complicated problems, such as video or image analysis with little
training data, transfer learning (Bengio, 2012) is the state-of-the-art approach. For
instance, Razavian et al. (2014) showed that the penultimate layer of
a Deep Neural Network trained to classify the IMAGENET dataset (Deng et al., 2009)
provides features for other image analysis tasks that surpass all the previously proposed
feature extraction techniques such as SIFT (Lowe, 2004), HOG (Zhu et al., 2006) and
unsupervised learning such as Sparse Coding and Independent Component Analysis
(ICA) (Hyvarinen et al., 2004).
In a nutshell, initialization is fundamental and the most up to date suggestion
to initialize a Deep Neural Network is to use a network that has been trained to
classify a large dataset in a similar domain or to use random initialization and train
the network with mini-batches on a large dataset. Although theoretically sound, for
practical applications unsupervised learning and auto-encoders should be only the third
option of initialization.
2.1.3 Task Specific Architectures
Even though DNNs are also ULMs, correlations in the input data make the
training process complicated. For example, for dealing with temporally correlated
data Recurrent Neural Networks (Haykin, 2004) were proposed in the second ANN
generation mentioned above. On the other hand this third-generation exploited very
effectively convolution based architectures, also called Convolutional Neural Networks
(CNN) (LeCun et al., 1998). CNNs were developed to exploit local dependencies in
data such as images, where neighboring pixels present strong correlations (LeCun
et al., 1998). This is implemented using PEs that receive only part of the input, much
like the Neocognitron (Fukushima, 1980) architecture that was inspired by the space
selective receptive fields of the simple-cells of the visual cortex (Hubel & Wiesel, 1968).
The main difference between the Neocognitron architecture and CNNs is that the
latter use shared parameters (or weights) for all the local receptive fields (LeCun et al.,
1998). Using shared parameters across local receptive fields implements a convolution
operation, thus the name CNN. Invariance to local shifts in space is achieved in CNNs
through local pooling or strided convolutions (Springenberg et al., 2014). In other words,
a pooling layer downsamples the output map of a convolutional layer by forwarding
only the maximum activation of local regions, which approximates the behavior of
complex cells in the cortex (Hubel & Wiesel, 1968). This new type of architecture was
the core technology that powered most of the large scale image processing networks,
such as the DNN that won the IMAGENET 2012 (Krizhevsky et al., 2012), and also
the applications for speech recognition (Abdel-Hamid et al., 2014) in which case the
convolutions were applied to the spectrogram of the audio data. More recently, CNNs
have also been applied to text processing (Kim, 2014), where the convolutions are
applied over vectorial representations of words or characters.
2.2 Temporal Processing with Neural Networks
When the input has a temporal dimension, and the task is temporal pattern
recognition, DFNNs are no longer an appropriate model. De Vries & Principe (1992)
proposed a unifying framework for neural networks that can solve such problems. The
basic architectures can be interpreted as either defining a fixed length window in the
input series or implementing an infinite convolution through time. For the case where
we define a fixed length window in time we have finite impulse response (FIR) filters,
moving average (MA) and time-delayed neural networks (TDNN). We can write the MA
model as
$$h_t = x_t + \sum_{i=1}^{N} w_i\, x_{t-i} + b, \qquad \text{(2–2)}$$
where N is the size of the temporal window, ht is the generated signal and wi and b are
free parameters of the model.
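A direct numpy translation of the MA model in (2–2) is given below: the output at time t is the current sample plus a weighted sum of the previous N samples; time steps for which the full window is not yet available are simply skipped in this sketch, and the names are ours.

import numpy as np

def moving_average_features(x, w, b=0.0):
    """h_t = x_t + sum_{i=1..N} w_i * x_{t-i} + b for a 1D signal x (Eq. 2-2)."""
    N = len(w)
    h = np.zeros_like(x, dtype=float)
    for t in range(N, len(x)):
        window = x[t - N:t][::-1]           # x_{t-1}, ..., x_{t-N}
        h[t] = x[t] + np.dot(w, window) + b
    return h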
Sandberg & Xu (1997) showed that the functions generated by any tap-delay
line followed by an MLP or DNN (essentially what a TDNN is) are myopic maps, in
other words, they are universal approximators on a functional space of decaying time
functions and their memory depth has to be preselected to the application. In practice
this means that TDNNs have limited memory capacity, which could be increased by a
longer tap-delay line but with the cost of a huge increase in the number of parameters.
We illustrate TDNNs and how their tap-delay line can be calculated with the Gamma
memory in Figure 2-1.
On the other hand, with an infinite impulse response (IIR) we have models that
implement an infinite convolution over the input sequence. These models can also be
rewritten in a recursive form such as auto-regressive (AR) models
$$h_t = x_t + \sum_{i=1}^{N} w_i\, h_{t-i} + b \qquad \text{(2–3)}$$
Figure 2-1. Deep Neural Network for temporal processing. (a) Time-delayed neural network. (b) Gamma memory; for µ = 0 we have the regular tap-delay line.
where N is now the depth of recursion. De Vries & Principe (1992) showed that an IIR
can also be implemented with the Gamma memory defined as a cascade arrangement
of generalized delay operators given by
$$x^{(k)}_t = \mu\, x^{(k)}_{t-1} + (1-\mu)\, x^{(k-1)}_{t-1} \qquad \text{(2–4)}$$
for the case where µ = 0, we have a regular tap-delay and the MLP using the Gamma
memory as the first layer would reduce to a nonlinear FIR. For the other values of µ the
memory has an IIR impulse response itself with time constant controlled by µ.
It is important to understand the value of the recurrent parameter µ for memory
based applications (De Vries & Principe, 1992). The recursive parameter in (2–4) acts as
a control of the time axis scale, i.e. as a compromise between memory depth (D) and
its resolution (R). For an L-th order filter built with generalized delay operators, depth and
resolution trade against each other: depending on the value of µ (with 0 < µ < 1), the
memory can have high temporal resolution but low depth, or long depth with poor
resolution. This is a fundamental
feature of linear recurrent systems when used as memory systems. The only way to go
beyond this limitation is with nonlinear memory functions which use gating such as the
Long-Short Term Memory (Hochreiter & Schmidhuber, 1997).
The combination of both approaches is the auto-regressive moving average
(ARMA). For shift-invariant linear models, the fact that the transfer function of ARMA models can
be defined as a linear combination of decaying complex exponentials reiterates the
assertion that all these transformations are based on linear combinations of complex
exponential functions with uniformly decaying memories.
Possible ways to implement nonlinear IIRs are using recursive equations such as
those proposed by DPCNs and RNNs. DPCNs are also task specific architectures that
combine the power of convolutional neural networks and the Markov assumption to
exploit temporally varying signals. RNNs can be interpreted as nonlinear IIRs and exploit
temporal structures in a longer time scale beyond the Markov assumption. We devote
the next two sections to the details of DPCNs and RNNs.
2.3 Deep Predictive Coding Networks
DFNNs do not have a temporal context and the feature extraction is implemented
in a single sweep through the architecture. This means that all the feedback provided by
the upper layers to the lower layers is that of backpropagation. During the feedforward
pass, there is no feedback at all. DPCNs on the other hand propose to combine a
bottom-up flow, similar to that of DFNNs, but driven by priors given by a top-down flow.
This bottom-up plus top-down flow is leveraged using the temporal context of the input
sequence across different time steps t.
DPCNs are hierarchical systems of equally defined layers. A layer is defined by a
set of adaptive weights A,B,C , which are respectively the transition, causes rescaling
and observation matrix, and a pair of dynamic variables xt , ut which are called states
and causes and evolve in time t. An important difference between DPCNs and DFNNs
is that here the layer outputs are the dynamic variables xt , ut and they are not calculated
from a single projection followed by a nonlinearity; they are instead optimized with
Expectation-Maximization (EM) to fit a generative model of the input data. A block
diagram of DPCNs is shown in Figure 2-2.
The parameters and outputs of the l-th layer of a DPCN are alternately optimized
to minimize the following energy function:
$$E(x_t, u_t, \theta) = \sum_{n=1}^{N} \left( \frac{1}{2}\left\| u^{(l-1,n)}_t - C^{(l)} x^{(l)}_t \right\|_2^2 + \lambda \left\| x^{(l)}_t - A^{(l)} x^{(l)}_{t-1} \right\|_1 + \sum_{k=1}^{K} \left| \gamma^{(l)}_{t,k} \cdot x^{(l)}_{t,k} \right| \right)$$
$$\qquad + \beta \left\| u^{(l)}_t \right\|_1 + \frac{1}{2} \left\| u^{(l)}_t - u^{(l+1)}_t \right\|_2^2 - \log P(\theta), \qquad \text{(2–5)}$$

where

$$\gamma^{(l)}_{t,k} = \gamma_0 \left[ \frac{1 + \exp\left(-[B^{(l)} u^{(l)}_t]_k\right)}{2} \right] \quad \text{and} \quad \theta = \{A, B, C\}, \qquad \text{(2–6)}$$
where l = 0 represents the input data. Matrix C reconstructs the layer input $u^{(l-1)}_t$ from
sparse codes $x^{(l)}_t$. These codes evolve sparsely from previous representations $A^{(l)} x^{(l)}_{t-1}$,
where A are transition matrices. The sparseness of the codes x is controlled by the
components u that are also sparse but evolve from priors coming from upper layers in a
top-down flow $u^{(l+1)}_t = C^{(l+1)} x^{(l+1)}_{t-1}$. The prior probability $\log P(\theta)$ is used as an $\ell_2$-norm
regularization.

Figure 2-2. Principe & Chalasani (2014) schematic diagram of a Deep Predictive Coding Network with two layers showing bottom-up and top-down information flow.
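To make the cost in (2–5)–(2–6) concrete, the sketch below evaluates the energy of a single DPCN layer for one sample at one time step, given states, causes and parameters; the inference and EM updates themselves are not shown, the parameter prior term is omitted, and all names are ours.

import numpy as np

def dpcn_layer_energy(u_prev, x, x_tm1, u, u_above, A, B, C,
                      lam=0.1, beta=0.1, gamma0=1.0):
    """Energy of one DPCN layer, single sample and time step (Eqs. 2-5 and 2-6).
    u_prev : input to this layer (causes of the layer below)
    x, x_tm1 : current and previous states of this layer
    u, u_above : causes of this layer and of the layer above
    A, B, C : transition, cause-rescaling and observation matrices."""
    reconstruction = 0.5 * np.sum((u_prev - C.dot(x)) ** 2)       # bottom-up reconstruction
    transition = lam * np.sum(np.abs(x - A.dot(x_tm1)))           # sparse state transition
    gamma = gamma0 * (1.0 + np.exp(-B.dot(u))) / 2.0              # cause-modulated weights (Eq. 2-6)
    state_sparsity = np.sum(np.abs(gamma * x))                    # weighted sparsity of the states
    cause_sparsity = beta * np.sum(np.abs(u))                     # sparsity of the causes
    top_down = 0.5 * np.sum((u - u_above) ** 2)                   # top-down prior from the layer above
    # the parameter prior -log P(theta) (l2 regularization) is left out of this sketch
    return reconstruction + transition + state_sparsity + cause_sparsity + top_down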
Here the EM algorithm is implemented with Stochastic Gradient Descent of the cost
function above. In (2–5) another difference between DPCNs and DFNNs is clear: to
calculate the higher layer activations ut, the latter would only use the current
context xt. On the other hand, DPCNs not only implement a temporal context through
Axt−1, but also a top-down flow via u(l+1). DPCNs provided better results on simple video
classification than DFNNs (Principe & Chalasani, 2014).
Nevertheless, the temporal context provided by the very previous time step t − 1 is
only sufficient when the data has homogeneous dynamics through time, since DPCNs
encode variable length time series as variable length causes with uniformly decaying
memories. For more complex structures, we have to go beyond these temporal
limitations. RNNs on the other hand can learn longer term dependencies. We will
talk about RNNs in the next section.
2.4 Recurrent Neural Networks
RNNs are networks with PEs forming a directed cycle. They were initially proposed
as a cognitive model by several authors such as Jeff Elman and Michael I. Jordan, see
(Principe et al., 1999). From a statistical perspective, several authors proposed the
backpropagation through time (BPTT) (Robinson & Fallside, 1987), (Werbos, 1988) and
the Real-Time Recurrent Learning (RTRL) algorithms (Williams & Zipser, 1989) to train
RNNs for sequential data prediction and temporal pattern recognition. In its simplest
form, given a vector valued input sequence xt , the dynamics of a RNN evolves as
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b), \qquad \text{(2–7)}$$
where ht is the dynamic state of the RNN, Whh is the hidden-to-hidden connection
matrix, Wxh is the input-to-hidden connection matrix, b is a bias vector and f is
a differentiable nonlinearity such as the hyperbolic tangent. RNNs are Universal
Computers (Siegelmann & Sontag, 1995); from the myopic maps perspective (Sandberg &
Xu, 1997), this means that RNNs have an infinite memory extent adapted directly from data.
This concept is more general than ULMs, nonetheless it is equally hard to fully explore in
practice with finite connections. As universal computers, RNNs can implement arbitrary
sequence to sequence mappings. In practice, RNNs have difficulty in learning long-term
dependencies due to the vanishing gradient problem, which is a consequence of the
uniformly decaying nature of myopic maps, since the derivative of a myopic map is also
myopic. Notice that in the following derivative
$$\frac{\partial h_t}{\partial W_{hh}} = f'\, h_{t-1}\, \frac{\partial h_{t-1}}{\partial W_{hh}} \qquad \text{(2–8)}$$
the derivatives f ′ are small numbers with absolute values between 0 and 1, thus the
total gradient vanishes at each time step t. To combat that, second order RNNs were
proposed. The first and most popular solution is named Long Short Term Memory
(LSTM) network (Hochreiter & Schmidhuber, 1997), where the gradients are kept using
gating connections, similar to digital logic gates, but differentiable and trainable with
BPTT. The LSTM equations in their most recent formulation (Gers et al., 2000) can be written
as
$$\begin{aligned}
i_t &= \mathrm{logistic}(W_{hi} h_{t-1} + W_{xi} x_t + b_i) \\
f_t &= \mathrm{logistic}(W_{hf} h_{t-1} + W_{xf} x_t + b_f) \\
o_t &= \mathrm{logistic}(W_{ho} h_{t-1} + W_{xo} x_t + b_o) \\
g_t &= \tanh(W_{hg} h_{t-1} + W_{xg} x_t + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \qquad \text{(2–9)}$$
where i, f, o and g are respectively the input, forget, output and add gates. They control
how much information is accepted, forgotten and exposed from the cell unit c. The ⊙ is
the element-wise multiplication. Thus, LSTMs have two dynamic states, the cell ct and
the output state ht . LSTMs can learn long term dependencies by storing information in
ct and short, rapidly changing dependencies in its output ht , hence its name. Also, due
to the multiplicative connections between states and inputs, LSTMs can be mapped to
a finite state machine both in training and representation (Giles et al., 1992), (Omlin &
Giles, 1996).
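The LSTM update in (2–9) maps to a few lines of numpy; the sketch below performs a single time step given the previous output and cell states, with weight names and shapes of our choosing.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_h, W_x, b):
    """One LSTM step (Eq. 2-9). W_h, W_x and b are 4-tuples holding the
    input, forget, output and add gate parameters, in that order."""
    (W_hi, W_hf, W_ho, W_hg), (W_xi, W_xf, W_xo, W_xg), (b_i, b_f, b_o, b_g) = W_h, W_x, b
    i = logistic(W_hi.dot(h_prev) + W_xi.dot(x_t) + b_i)   # input gate
    f = logistic(W_hf.dot(h_prev) + W_xf.dot(x_t) + b_f)   # forget gate
    o = logistic(W_ho.dot(h_prev) + W_xo.dot(x_t) + b_o)   # output gate
    g = np.tanh(W_hg.dot(h_prev) + W_xg.dot(x_t) + b_g)    # candidate (add) values
    c = f * c_prev + i * g                                  # new cell state
    h = o * np.tanh(c)                                      # new output state
    return h, c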
Other approaches to avoid the vanishing gradient problem are Echo State Network
(ESN) (Jaeger, 2002) and Gated Recurrent Units (GRU) (Chung et al., 2014). ESNs
avoid the vanishing gradient by not adapting the hidden-to-hidden connections of the
network, focusing only on the hidden-to-output connections and appropriate weight
initialization. Sutskever et al. (2013) showed that using ESN-like initialization and
gradient descent adaptation provides better results than using fixed weights. The
ESN literature was the first to propose to initialize the hidden-to-hidden connections
to orthogonal matrices with spectral radius bounded to be close to one. Orthogonal
initialization will be the default choice for the Whh throughout this work, unless explicitly
stated otherwise.
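A common recipe for such an orthogonal initialization, sketched below, is to take the Q factor of the QR decomposition of a random Gaussian matrix and fix its signs; the gain argument allows scaling the spectral radius. The function name is ours.

import numpy as np

def orthogonal_init(size, gain=1.0, rng=np.random):
    """Return a (size x size) orthogonal matrix scaled by `gain`, e.g. for Whh."""
    a = rng.normal(0.0, 1.0, (size, size))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # make the decomposition unique / well conditioned
    return gain * q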
GRUs provide a similar approach to LSTMs to keep the gradients from vanishing,
i.e. by using gating connections much like the leaky integrators in the Gamma memory
shown above. Just like the Gamma memory leaky integrators can be adapted
to focus on the appropriate time scale of the input data, GRUs implement a leaky
integrator in the hidden state, thus controlling the time scale of the dynamic representation,
similarly to LSTM cells. The GRU equations are the following:
$$\begin{aligned}
r_t &= \mathrm{logistic}(W_{hr} h_{t-1} + W_{xr} x_t + b_r) \\
z_t &= \mathrm{logistic}(W_{hz} h_{t-1} + W_{xz} x_t + b_z) \\
\tilde{h}_t &= \tanh(W_{hh}(r_t \cdot h_{t-1}) + W_{xh} x_t + b_h) \\
h_t &= (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t,
\end{aligned} \qquad \text{(2–10)}$$
where rt is the reset gate that defines how much of the previous state ht−1 will be
exposed to the proposed state h̃t, and the update gate zt interpolates between the
previous and proposed states to generate the new state ht. Thus, zt adaptively controls
the temporal scale of the hidden state, while rt works as a forgetting factor.
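For completeness, one GRU time step (2–10) in the same numpy style: the reset gate gates the previous state inside the candidate, and the update gate interpolates between the old and candidate states. Names are ours.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_hr, W_xr, b_r, W_hz, W_xz, b_z, W_hh, W_xh, b_h):
    """One GRU step (Eq. 2-10)."""
    r = logistic(W_hr.dot(h_prev) + W_xr.dot(x_t) + b_r)            # reset gate
    z = logistic(W_hz.dot(h_prev) + W_xz.dot(x_t) + b_z)            # update gate
    h_tilde = np.tanh(W_hh.dot(r * h_prev) + W_xh.dot(x_t) + b_h)   # proposed state
    return (1.0 - z) * h_prev + z * h_tilde                          # leaky-integrator update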
Although LSTMs and GRUs were originally developed as ad hoc solutions to the vanishing
gradient problem, Jozefowicz et al. (2015) implemented an extensive search for different
architectures to better solve problems with long-term temporal dependencies and the
best architectures were not much different or significantly better than those two. Thus, in
this work we will either use LSTMs or GRUs wherever an RNN is necessary.
Specific cases of RNNs are sequence to vector and sequence to sequence
mappings. In sequence to vector we have an input xt, t ∈ {1, ... ,T}, that is transformed
by the RNN to ht, t ∈ {1, ... ,T}; after that, either hT or a (weighted) average of all ht is
used as a vector representation that can be input to an MLP for classification, for example.
In sequence to sequence applications that initial vector representation is used as input
(also called key or conditioning) to another RNN that generates an output sequence.
Examples of sequence to sequence applications are recurrent encoder-decoders (Socher
et al., 2011), (Sutskever et al., 2014).
However, as we discussed above, the problem is not only training with vanishing
gradients, but also that an application may require high resolution for some components
of the history and not for others, in a given memory depth. Neither LSTM nor GRU allows
for learning these two characteristics of memories independently, so they
are not general mechanisms for storing information in time. RNNs potentially bring the
capability of configuring the memory requirements for time processing, but they still
suffer from the problem of efficient training and they are unable to control the resolution
of the memory trace independently of where it occurs in time.
RNNs can cope with dependencies in time. But, just like fully connected neural
networks, they are not efficient at learning spatially invariant transformations. A better suited
architecture for this task, as mentioned before, is the convolutional neural network. We
discuss convolutional neural networks (CNN) in more detail in the next section.
2.5 Convolutional Neural Networks
A convolution operation in neural networks can be expressed as follows. Assume an
input batch Xb,h,w,c with dimensions b images per batch, where each image in the batch
has h rows, w columns and c channels. That batch is to be convolved with a set of filters
Wi,j,c,k with i rows and j columns, where k is the number of output channels. Each one of the k
channels in the convolutional filter operates over all the c channels in the input at once. In
an equation, the result of the neural network convolution is given by
$$Y_{b,h_y,w_y,k} = (X \star W)_{b,h_y,w_y,k} = \sum_{\alpha=0}^{i-1}\sum_{\beta=0}^{j-1}\sum_{\gamma=1}^{c} X_{b,\,h_y-i/2+\alpha,\,w_y-j/2+\beta,\,\gamma}\; W_{\alpha,\beta,\gamma,k}, \qquad \text{(2–11)}$$
where ⋆ denotes the convolution operation. We can visualize the convolutional filter
W sliding over the images X as depicted in Figure 2-3. The size of the output map
depends on several choices; for example, we may only want to calculate the values where
the convolutional filter totally overlaps with the input, in which case the number of rows
and columns in the output are smaller than in the input.

Figure 2-3. Example of convolutional neural network (CNN) layer computation. In this example a single channel input, filter and output are illustrated. The inputs are represented in blue and the outputs in green. The darker shade of blue are the convolutional filter values being operated on that spatial location. This image has been adapted from (Dumoulin & Visin, 2016).

Another choice is to force the output
to be of the same size as the input, by zero padding the former before the convolution.
There are also strided convolutions, where not all the values in the double summation
in (2–11) are calculated. This is a more efficient way to do downsampling, since it avoids
the computations altogether instead of pooling over regions. Strided convolutions with zero padded
inputs are illustrated in Figure 2-4.
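The summation in (2–11) can be written as explicit loops over output positions and channels, which makes the indexing concrete. The sketch below assumes a "valid" convolution (no padding, stride one), so the output is smaller than the input, as discussed above.

import numpy as np

def conv2d_valid(X, W):
    """Naive CNN-style convolution (a correlation, as in Eq. 2-11).
    X: input batch of shape (b, h, w, c); W: filters of shape (i, j, c, k).
    Returns Y of shape (b, h - i + 1, w - j + 1, k)."""
    b, h, w, c = X.shape
    i, j, _, k = W.shape
    hy, wy = h - i + 1, w - j + 1
    Y = np.zeros((b, hy, wy, k))
    for n in range(b):                  # image in the batch
        for y in range(hy):             # output row
            for x in range(wy):         # output column
                patch = X[n, y:y + i, x:x + j, :]   # local receptive field
                for f in range(k):      # output channel
                    Y[n, y, x, f] = np.sum(patch * W[:, :, :, f])
    return Y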
It is interesting to note the complexity of the current architecture of the CNNs versus
the RNN architectures. The current implementations of CNNs in the literature are trying
to achieve universal features for static images to cope with the variability of how a given
object can appear in the scene (due to rotations, translations and scale). Likewise, we
should improve the architectures of RNNs to achieve a similar goal: to process events in
time. The result of this combination are Convolutional Recurrent Neural Networks, which
we review in the next section. Also, bear in mind that further improvements to RNNs are
the essence of this thesis.

Figure 2-4. Example of convolutional neural network (CNN) layer computation with zero padding and strides. Each frame in the sequence above represents how each value of the output is calculated. Note that the convolutional filter (gray) strides 2 pixels at a time, as opposed to one pixel at a time as depicted in Figure 2-3. In this example a single channel input, filter and output are illustrated. The input is padded with zero values (white squares). The inputs are represented in blue and the outputs in green. This image has been adapted from https://github.com/vdumoulin/conv_arithmetic.
2.6 Combining CNNs and RNNs: Convolutional Recurrent Neural Networks.
In order to generalize the shift and scale invariant properties of CNNs to temporal
series, Convolutional Recurrent Neural Networks (ConvRNN) were proposed. ConvRNNs
were initially used in the context of supervised learning for classifying static images
(Liang & Hu, 2015), in which case the same input batch is represented as input
several times instead of using a time series. ConvRNNs were also applied for weather
forecasting (Xingjian et al., 2015). After these first applications, several other papers
followed up with unsupervised learning applications of ConvRNNs. In unsupervised
learning ConvRNNs were used for video prediction (Kalchbrenner et al., 2016) (Lotter
et al., 2016) (Finn et al., 2016a), optical flow estimation (Patraucean et al., 2015),
algorithmic learning (Kaiser & Sutskever, 2015) and feature extraction from videos
(Santana et al., 2016b).
The motivation of using ConvRNNs for unsupervised learning is that they can be
interpreted as locally connected RNNs with shared parameters across the input images,
similarly to how CNNs are interpreted as locally connected, shared-parameter MLPs.
We show a visualization of a ConvRNN in Figure 2-5. Also, we can rewrite all the RNN
equations mentioned above using convolutions, which are as follows:
ConvRNN

$$H_t = f(W_{hh} \star H_{t-1} + W_{xh} \star X_t + b), \qquad (2\text{–}12)$$

ConvGRU

$$\begin{aligned}
R_t &= \mathrm{logistic}(W_{hr} \star H_{t-1} + W_{xr} \star X_t + b_r)\\
Z_t &= \mathrm{logistic}(W_{hz} \star H_{t-1} + W_{xz} \star X_t + b_z)\\
\tilde{H}_t &= \tanh(W_{hh} \star (R_t \odot H_{t-1}) + W_{xh} \star X_t + b_h)\\
H_t &= (1 - Z_t) \odot H_{t-1} + Z_t \odot \tilde{H}_t,
\end{aligned} \qquad (2\text{–}13)$$

ConvLSTM

$$\begin{aligned}
I_t &= \mathrm{logistic}(W_{hi} \star H_{t-1} + W_{xi} \star X_t + b_i)\\
F_t &= \mathrm{logistic}(W_{hf} \star H_{t-1} + W_{xf} \star X_t + b_f)\\
O_t &= \mathrm{logistic}(W_{ho} \star H_{t-1} + W_{xo} \star X_t + b_o)\\
G_t &= \tanh(W_{hg} \star H_{t-1} + W_{xg} \star X_t + b_g)\\
C_t &= F_t \odot C_{t-1} + I_t \odot G_t\\
H_t &= O_t \odot \tanh(C_t).
\end{aligned} \qquad (2\text{–}14)$$
The ConvLSTM is the most widely used ConvRNN architecture. Later in this thesis, we will use ConvRNNs to write an efficient end-to-end differentiable reinterpretation of DPCNs. We will also use them to scale our memory augmented models to large images.
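As a minimal sketch of a single ConvLSTM step following (2–14), assuming PyTorch (the channel counts and kernel size are illustrative, not the configurations used later in this thesis):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # "same" padding keeps H and C the size of the input
        # A single convolution produces all four gates (i, f, o, g) at once.
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g        # cell state update, as in (2-14)
        h_t = o * torch.tanh(c_t)       # hidden state: an image, not a vector
        return h_t, c_t

# Usage: inputs and states are (batch, channels, height, width) tensors.
cell = ConvLSTMCell(in_channels=1, hidden_channels=8)
x_t = torch.zeros(2, 1, 28, 28)
h = c = torch.zeros(2, 8, 28, 28)
h, c = cell(x_t, h, c)
```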
In the next chapter, we will augment RNNs with content addressable memories
(CAM), but first, in the next section we discuss what CAMs are in the context of neural
networks.
2.7 Content Addressable Memories
Content Addressable Memories (CAM) in the context of computer hardware are
memories used for very high speed searching applications (Pagiamtzis & Sheikholeslami,
2006). Another important special type of memory is the Random Access Memory (RAM)
that is addressed with a location array and returns the stored value. On the other hand, CAMs are addressed with a data word and return the addresses where similar values are stored.
Figure 2-5. Visual representation of a convolutional recurrent neural network. Here we show a model unfolded for 3 time steps and represent a locally connected convolutional filter being applied to a region of the input frames. This same operation, with shared parameters, is used throughout the input image. This way, for each time step, we have images as inputs, images as hidden states and images as outputs. This is contrary to conventional recurrent neural networks, where all the operations are based on dot products between vectors.
In the field of neural networks, CAMs are also called Neural Associative Memories
(Palm et al., 1997). CAMs were used as a correlation model to store data (Kohonen,
2012). In the simple form without internal dynamics, CAMs receive an input vector x and
return an output vector y calculated as
y = f (Wx), (2–15)
where f is a function such as the sign function, sign(x) = 1 if x > 0 and −1 otherwise, for binary CAMs, or the identity function for linear CAMs. When the network is required to store a sequence
of input vectors $x_1, x_2, \ldots, x_N$, the mean squared error solution for the CAM weights $W$ is the autocorrelation

$$W = \frac{\sum_{i=1}^{N} x_i x_i^T}{N}, \qquad (2\text{–}16)$$

where $x_i^T$ is a transposed row vector. Similarly, when the network is required to output a desired vector $y_i$ whenever an input $x_i$ is presented, the solution for $W$ is the cross-correlation

$$W = \frac{\sum_{i=1}^{N} y_i x_i^T}{N}. \qquad (2\text{–}17)$$
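A minimal NumPy sketch of the cross-correlation CAM in (2–17), with toy data chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16
X = rng.standard_normal((N, D))   # input patterns x_i (one per row)
Y = rng.standard_normal((N, D))   # desired associations y_i

W = Y.T @ X / N                   # cross-correlation memory, eq. (2-17)
y_hat = W @ X[0]                  # linear recall for the first stored pattern
# Recall is exact only for orthonormal inputs; crosstalk grows as patterns correlate.
```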
Unfortunately, the memory capacity of correlation-based CAMs for perfect recall is N patterns (the number of orthogonal vectors in N dimensions). Also, they can only reliably recollect about 0.14 times the number of PEs (Amari, 1988). Hasanbelliu &
Principe (2008) proposed a CAM memory implemented using the kernel trick and
Reproducing Kernel Hilbert Spaces (RKHS) that has an unconstrained memory capacity
only limited by the physical memory of the machine. Given an input vector x , the kernel
CAM (KCAM) output is

$$y = \sum_{n=1}^{N} y_n \kappa(x_n, x), \qquad (2\text{–}18)$$

where $\kappa$ is a Mercer kernel, such as the Gaussian

$$\kappa_\sigma(x_i, x_j) = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sqrt{2\pi\sigma^2}}, \qquad (2\text{–}19)$$

and $(x_n, y_n)$ are the associated input-output pairs. KCAMs were shown to have a larger memory capacity and better quality data recollection than linear CAMs (Hasanbelliu & Principe, 2008).
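A corresponding NumPy sketch of the kernel CAM recall in (2–18)–(2–19), again with illustrative stored pairs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 16))   # stored inputs x_n
Y = rng.standard_normal((4, 16))   # stored outputs y_n

def gaussian_kernel(A, b, sigma):
    d2 = np.sum((A - b) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

def kcam_recall(x, X, Y, sigma=0.5):
    weights = gaussian_kernel(X, x, sigma)   # similarity of x to every stored x_n
    return weights @ Y                       # weighted sum of stored outputs, eq. (2-18)

y_hat = kcam_recall(X[0], X, Y)              # dominated by Y[0] for a small kernel size
```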
It is important to note that the main difference between associative memories and
regressors is in the number of exemplars. Principe et al. (1999) argued that regressors
are meant to find a single optimal hyperplane describing as much information as possible about the entire dataset, while linear associative memories want to output a response that is as close as possible to each memorized input-output pair. Thus, for regressors we need more data than free parameters, whereas for associative memories we want the opposite.
A different implementation of CAMs involves internal dynamics. The most prominent example of such a dynamic CAM is the Hopfield network (Hopfield, 1982). The dynamics of a Hopfield network can be described as follows. The input signal is the initial memory state, $s_0 = x_i$. The CAM state then evolves following the recursive equation $s_t = f(W s_{t-1})$:

$$s_t = \begin{cases} +1 & \text{if } W s_{t-1} \geq \theta_i \\ -1 & \text{otherwise,} \end{cases} \qquad (2\text{–}20)$$

where $\theta_i$ is a per-unit threshold. After $t = N$ such recursions, the network converges to the recollected pattern $y_i = s_N$. Again, the parameters $W$ can be initialized to the cross-correlation between the pairs $(y_i, x_i)$.
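A small NumPy sketch of the Hopfield recursion in (2–20): start from a corrupted pattern and iterate the thresholded update (the stored patterns and the zeroed diagonal are illustrative choices, not prescribed by the text):

```python
import numpy as np

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1,  1, 1, -1, -1, -1]], dtype=float)
W = patterns.T @ patterns / len(patterns)   # auto-correlation storage, eq. (2-16)
np.fill_diagonal(W, 0.0)                    # common choice: no self-connections

s = np.array([1, -1, 1, -1, 1, 1], dtype=float)   # noisy version of the first pattern
for _ in range(6):                                # a few recursions of s_t = f(W s_{t-1})
    s = np.where(W @ s >= 0, 1.0, -1.0)           # threshold theta_i = 0
print(s)   # recovers the stored pattern [1, -1, 1, -1, 1, -1]
```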
A differentiable RAM (DiffRAM) for neural networks can be implemented using a discrete distribution vector $p$ with $A$ values. Each value indicates the weight of an address, and the expectation under $p$ is the retrieved content. In an equation, given a memory $M \in \mathbb{R}^{A \times B}$, where each of the $A$ memory slots stores a word of length $B$, we can retrieve a word as

$$m = \sum_{i \in A} M_i p_i, \qquad (2\text{–}21)$$

see Figure 2-6 for an illustration.
This mechanism of using a distribution to weight different values in a matrix (or tensor) and retrieving values using moments, as above, is referred to as differentiable attention in the Deep Learning literature (Xu et al., 2015), (Bahdanau et al., 2014), (Gregor et al., 2015).
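A NumPy sketch of the differentiable location addressing in (2–21); the memory and addressing values mirror the example of Figure 2-6:

```python
import numpy as np

M = np.array([[0.5, 0.4, 0.6, 0.0, 0.2],
              [0.5, 1.1, 5.0, 2.5, 1.5],
              [0.0, 0.0, 0.0, 1.9, 0.0]])   # A = 3 memory slots, word length B = 5
p = np.array([0.9, 0.0, 0.1])               # addressing distribution, sums to 1

m = p @ M                                    # expectation over rows, eq. (2-21)
print(m)                                     # [0.45, 0.36, 0.54, 0.19, 0.18]
```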
Figure 2-6. Diagram of a Differentiable Random Access Memory (DiffRAM). Here we show a memory containing 3 elements, each one a 5-dimensional vector (red matrix). The differentiable addressing signal (green vector) has one value for each memory element and must add up to 1. The addressing signal here retrieves the first row almost perfectly. Note that this kind of leak, where the addressing signal is not zero for all elements but one, sometimes happens in practice. The reason for such leaks is the smooth nature of the operation, which allows differentiability but admits such undesired cases.
Unfortunately, CAM, KCAM, RAM and Hopfield networks only work when the inputs
are fixed length patterns. To extend associative memories to deal with variable length
data, while mimicking the neocortex-hippocampus operation, we have to embed such
memories in a dynamic architecture. In the next chapter we propose a framework for
defining such architectures.
CHAPTER 3
A FRAMEWORK FOR DYNAMIC ADDRESSABLE MEMORIES
Let a vector valued time series xt ∈ RD be an input to a Recurrent Neural Network.
In the present framework, this input time series can be the top most dynamic causes
generated by DPCNs. It can also be any simple time-series such as text, simple videos
or any other data which regular RNNs may be successfully applied to. We are interested
in the dynamic states ht generated by the recurrent network when it is programmed to
model the time structure of xt . Also, at any given point during the presentation of the
time series to the RNN, we want to store the most interesting state features ht which
may range from a few to an unbounded number of samples. The maximum number of stored state features, N, is the capacity of the system. Here, we refer to this collection
of N state features as our time-to-space embedding. In the Cognitive Architecture
framework applied to video, these stored states may correspond, for instance, to stable
face representations. Such stored states can be used to cluster different DPCN cause series, or as a substitute for transient causes when the DPCN is in predictive mode, where we use it as a generative model to sample data.
Conventional RNNs only have access to the last ht at each point of time. On the
other hand, in this new model the N-valued list of interesting states can be read from or written to during processing.
We introduce a theoretical framework that formalizes this idea as a nonlinear
generalization of the model presented in De Vries & Principe (1992). More precisely, the
proposed model is a nonlinear time-variable gating AR model. Mathematically, we can define it as

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b + g_{t,h}(h_{t-2}, h_{t-3}, \ldots)), \qquad (3\text{–}1)$$
where gt,h is an adaptive auto-regressive time-variable gating function learned from
data. A motivation for using the gating element g is to allow mathematical expressions such as the CAM, DiffRAM, etc. to be plugged in as pieces of conventional RNNs. These RNNs can be interpreted as simple for-loops where the internal steps are solely inner products. With the inclusion of g we can represent complex operations such as nested for-loops, memory addressing, table lookup, etc. This extension is done to solve the fundamental problem with all recurrent structures, which is the lack of flexibility to control the resolution of a memory trace independently of where it occurs in time. This model can be generalized to also store interesting inputs, thus becoming a nonlinear time-variable gating ARMA model:
$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b + g_{t,x}(x_{t-1}, x_{t-2}, \ldots) + g_{t,h}(h_{t-2}, h_{t-3}, \ldots)), \qquad (3\text{–}2)$$
where gt,x is the moving-average time-variable gating function also learned from
data. We present this ARMA generalization for theoretical completeness. We will not
focus on ARMA models in the following experiments. Examples for the MA part are
Question-Answering tasks in Natural Language Processing (NLP), where the output answers are words in the input (Weston et al., 2014), and, consequently, the model benefits from having stored copies of input keywords.
At each time step these functions g bring a small number of previous stored inputs
and states to the calculation of the current ht . To illustrate this, imagine that at h10, the
value h1 is relevant in the dynamics, so a conventional AR model could implement this
with a term ht−9; but for h11, only a blurred version of h1 would be present in h2. A time-variable function gt,h could instead capture a snapshot of h1 and present it with arbitrary precision for all ht, thereby decoupling temporal resolution from time constants.
Similarly, we could interpret gt,x as arbitrarily long temporal windows in the input series
with time-varying sparse connections. Since both g functions are time varying, they
are more appropriate for real-world non-stationary signal modeling; in other words, g could keep presenting a relevant previous state hi while the input is stationary, and it could change the relevant previous state under top-down control. Back to our Cognitive
Framework, this is why we need to keep representations of different individuals in focus
when predicting their behavior in a video. As soon as a person leaves the camera’s
field of view, the system no longer has to keep its facial representation as part of the
dynamics.
Also, this formulation works as a unifying signal processing interpretation of models
such as Memory Networks Weston et al. (2014) and NTMs Graves et al. (2014). For
Memory Networks, gt,x is implemented as a theoretically infinite window (in practice it is
just as long as the number of input sequences) and the rule for choosing relevant inputs
is learned from data. For NTM gt,h is learned from data as a fixed length addressable
memory. Since we are not interested in storing unprocessed input values, here we focus
in models based on NTM. Note that NTM (Graves et al., 2014) does not implement a
MA memory, gt,x , it only has an AR memory, nor it accepts directly information from a
top down input representing the past information stored by the system in its interaction
with the world. Another missing information in Graves et al. (2014) is that they do not
report how to design an NTM to be used for feature extraction in unsupervised learning.
They only focus in cases where the memory reservoir is large enough to learn simple
copy-pasting operations, which, in our application, is not desired and means overfitting
the input.
The Reinforcement Learning NTM (Zaremba & Sutskever, 2015) has an MA memory that was trained with hybrid reinforcement and supervised learning, but its authors do not formalize their model under a signal processing framework as proposed here. The stability conditions and the space of solutions for the above mentioned models are not well defined, but the formulation in (3–2) could help us define some necessary conditions, which we plan to investigate in future work as well.
In summary, our problem has two complementary sides: the necessity of a temporal resolution that goes beyond the one provided by decaying exponentials, and a time-to-space mapping (xt, ∀t) → z to extract events in time. The temporal
resolution could be solved by appropriate snapshots of states while the space mapping
is simply the concatenated snapshots. In a generative model the input time series is
modeled as

$$P(X) = \prod_{t=1}^{T} P(x_t \mid H_{t-1}, z), \qquad (3\text{–}3)$$
where X = (x1, ... , xT ) and Ht is the sequence’s history up to point t. Note that we do
not conform to the Markov assumption, which would simplify to Ht = xt−1.
Using the associative memory consolidation of the Cognitive Architectures shown
in Figure 1-1b as guidance, in the next sections we present methods for using addressable memories to implement gt,h in a differentiable way that can be learned from data with backpropagation through time.
3.1 Memory Reading and Writing in Recurrent Neural Networks
Here we introduce a reinterpretation of the NTM model under the light of (3–1). We
want to implement gt,h as an addressable memory Mt; in order to do so, we have to define how to address (or read) and write to specific memory locations. Also,
reading and writing should be dynamic and possibly switch during the presentation of
each time point in the input series as in:
$$M_t = \mathrm{write}\left(\mathrm{read}(\Theta_{MM}, M_{t-1}),\, x_t,\, \Theta_{xM}\right). \qquad (3\text{–}4)$$
Here Mt can be a matrix Mt ∈ R^{A×B}, where the A different rows (or memory locations) are separate B-dimensional words. More generally, the memory can be a tensor Mt ∈ R^{A1,A2,...,AN,B}, where we have a spatial organization of N such addressable dimensions Ai. A schematic diagram illustrating (3–1) in such a case is shown in Figure 3-1.
The choice of the appropriate type of memory depends on the properties of the
temporal structure we are trying to capture. We do not have an extensive list of available
Figure 3-1. Schematic diagram of a memory augmented recurrent neural network. We show a representation unfolded in time for a total of 5 time steps; for 3 of these the model receives new inputs and for 3 we observe its outputs, with one time step of overlap between the input and output stages. Notice the interconnection between the recurrent neural network states h and the memory module M. The arrows between them indicate content addressable read and write operations. The intermediate state z, where the input and output stages overlap, can be used as a fixed length representation of the entire input signal, since it has to contain all the information necessary for calculating the output once the architecture stops receiving new inputs.
options and we leave the investigation about the best memory architectures for future
work. Here we define two types of memory.
3.1.1 Type I: DiffRAM
The DiffRAM Type is useful for signals where the characteristics of the events are
homogeneous in time. For example, in the simple Question-Answering task (Weston
et al., 2014) where specific vector valued inputs are the answers. Another use for
DiffRAMs is implementing dynamical systems with simple arithmetic operations on the
input or states.
Note that an issue arises when the events are not homogeneous; since the system does not know what type they are before encountering them, the user must know when it is appropriate to use this type of network. Assume, for example, that we need to store in memory a linear combination of two inputs. In this case it is sufficient to assign one memory location and add the interesting inputs to it when they are presented. The Adding problem in the experimental section illustrates such a case.
3.1.2 Type II: CAM
In the most general case of answering unknown questions after the data has been
presented, it is essential to have knowledge of what is being stored to memory. This
can be implemented using Content Addressable Memories controlled as part of an
RNN dynamics. Also content addressing can implement location addressing using
key-value mapping techniques. Thus, Type II memories can theoretically implement
Type I memories. Back to the face recognition in video example, given in the beginning
of this chapter, a representation h of each person in the video could be stored in a
memory location. To retrieve relevant statistics for classification we should address
these memories by content.
Thus, a motivation for choosing between these two types of memory would be DiffRAMs (Type I) when we are interested in where information is stored, and CAMs (Type II) when the important information itself should be used in the retrieval process. But a final word
about what is the best option for each problem can only be given experimentally. This is
why in the next sections we derive all the options so they can be tested experimentally.
Let us now present the general algorithm that represents the operation of 3–1 when
gt,h is an addressable memory. Given a time t, a multidimensional input time series xt
and the previous states of the architecture, where ht−1 is the AR dynamic state of the
model, rt−1 the previous vector from the read function and wt−1 the previous write vector,
the architecture operates as:
1. Using ht−1, update the reading vector rt = fr(rt−1, ht−1).

2. Read from memory: mt = read(rt, Mt−1).

3. Using the input and the read vector, update the semantic representation RNN: ht = RNN(xt, mt, ht−1).

4. Using ht, update the writing vector wt = fw(wt−1, ht).

5. Write to memory: Mt = write(Mt−1, ht, wt) (see the sketch below).
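A high-level Python sketch of these five steps; every function here (f_r, read, rnn, f_w, write) is a placeholder standing for the learned modules defined in the following sections, not the thesis implementation:

```python
def memory_augmented_step(x_t, h_prev, r_prev, w_prev, M_prev,
                          f_r, read, rnn, f_w, write):
    r_t = f_r(r_prev, h_prev)      # 1. update the reading vector
    m_t = read(r_t, M_prev)        # 2. read from memory
    h_t = rnn(x_t, m_t, h_prev)    # 3. update the semantic representation RNN
    w_t = f_w(w_prev, h_t)         # 4. update the writing vector
    M_t = write(M_prev, h_t, w_t)  # 5. write to memory
    return h_t, r_t, w_t, M_t
```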
Note that although (3–1) uses conventional RNN update equations, we could also use GRU or LSTM-like updates to calculate ht. In the next section, we present in more
detail how to implement the read and write operations for using CAMs, DiffRAM and the
hybrid of both as in NTM.
We would like to mention that in the next sections whenever we define an affine
transformation or a specific nonlinearity, the choices can be substituted by Multilayer
Perceptrons or other appropriate neural networks. In any case, all the parameters of the network should be learned from data using backpropagation through time, where the cost
function depends on the problem as well. This implies that what is stored, and when it is
stored to the content addressable memory is also learned to minimize the cost function.
3.2 Content Addressable Memory
Here we discuss Algorithm 1. To implement content addressing, given the context representation ht−1 as defined in the algorithm above, we calculate an addressing B-dimensional data word as

$$r_t = W_k h_{t-1} + b_k. \qquad (3\text{–}5)$$

Given rt, and using reasoning similar to that behind LSTMs and GRUs, we implement a gating mechanism that allows the network to switch focus
Data: input sequences xt, initial values h0, r0. Weights (W, b) for each neural network.
Result: memorized representations M
for t ∈ {0, 1, ...} do
    ht ← RNN(xt, ht−1, rt−1)
    rt = Wk ht−1 + bk
    gt = logistic(Wg ht−1 + bg)
    rt = gt rt + (1 − gt) rt−1
    (projection-based read)  mt = K(Mt, rt), with mt,i = Mt,i rt (linear) or mt,i = κσ(Mt,i, rt) (Gaussian), σt = σ0 · logistic(Wσ ht−1 + bσ)
    (matching-based read)    βt = relu(Wβ ht−1 + bβ),  rt = softmax(βt rt),  mt = Σi Mt,i rt,i
    wt = logistic(Ww ht + bw)
    et = logistic(We ht + be)
    at = tanh(Wa ht + ba)
    Mt = Mt−1 (1 − wt ⊗ et)
    Mt = Mt + wt ⊗ at
end
Algorithm 1: Dynamic Content Addressable Memory Network.
between long and short term memories using second order interactions. Such gating
mechanism can be implemented as
$$\begin{aligned}
g_t &= \mathrm{logistic}(W_g h_{t-1} + b_g),\\
r_t &= g_t r_t + (1 - g_t) r_{t-1}.
\end{aligned} \qquad (3\text{–}6)$$
Given rt we can use it to read from memory in two ways, with a projection or
retrieving the closest matching memory slot. Depending on the choice we have
either projection-based CAM (CAMp) or matching-based CAM (CAMm). If we use
the projection retrieved value from memory, it is
$$m_t = K(M_t, r_t), \qquad (3\text{–}7)$$

where K is an appropriate kernel, which can be, for example, the linear kernel

$$m_{t,i} = M_{t,i}\, r_t, \qquad (3\text{–}8)$$

or the Gaussian kernel

$$m_{t,i} = \kappa_\sigma(M_{t,i}, r_t), \qquad (3\text{–}9)$$
where Mt,i denotes the i -th row of the matrix Mt and mt,i the i -th value of vector mt .
An advantage of the present framework, compared to other RKHS methods, is that
the kernel size can also be easily learned from data using the same cost function that is
used to train all the other parameters and backpropagation through time. For instance,
here we can calculate the kernel size as
σt = σ0 · logistic(Wσht−1 + bσ), (3–10)
where we used the logistic function to enforce positivity and σ0 is the maximum kernel
size allowed, here fixed to σ0 = 1.
The matching access is based on a scale invariant projection, in other words
we normalize the projection between the generated key rt and the memory M. This
generates a probability distribution over the addressable dimensions:
$$\beta_t = \mathrm{relu}(W_\beta h_{t-1} + b_\beta), \qquad r_t = \mathrm{softmax}(\beta_t r_t), \qquad (3\text{–}11)$$
where βt is the inverse of the temperature of the Softmax and defines how spread the
probability distribution over addressable dimensions is. Addressing is the expectation of
Mt−1 given that distribution
$$m_t = \sum_i M_{t,i}\, r_{t,i}. \qquad (3\text{–}12)$$
Getting mt completes the read function. Now let us talk about how to write. In order to bound the number of adaptive parameters of the write operation, here we use a low-rank outer product, ⊗. Given the updated semantic state ht, we calculate three vectors wt, et and at, which represent the address, erase, and add vectors, respectively; wt is defined over the addressable dimensions A1, ..., AN, while et and at are defined over the word length B. We calculate them as follows:

$$\begin{aligned}
w_t &= \mathrm{logistic}(W_w h_t + b_w),\\
e_t &= \mathrm{logistic}(W_e h_t + b_e),\\
a_t &= \tanh(W_a h_t + b_a).
\end{aligned} \qquad (3\text{–}13)$$
Both wt and et are bounded to [0, 1] because they define respectively if an address
will be affected or erased, which are supposed to be approximately binary operations.
Finally, we write to Mt as
$$\begin{aligned}
M_t &= M_{t-1}(1 - w_t \otimes e_t),\\
M_t &= M_t + w_t \otimes a_t.
\end{aligned} \qquad (3\text{–}14)$$
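A NumPy sketch of one way to realize the matching-based read (3–11)–(3–12) and the erase/add write (3–13)–(3–14). The controller outputs are sampled randomly here for brevity; in the model they are affine functions of h, as above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
A, B = 4, 6                              # memory slots and word length
M = rng.standard_normal((A, B))

# Read: distribution over rows from the (temperature-scaled) projection on the key r.
r = rng.standard_normal(B)
beta = 2.0                               # inverse softmax temperature, eq. (3-11)
p = softmax(beta * (M @ r))
m = p @ M                                # expectation over rows, eq. (3-12)

# Write: bounded address/erase vectors and an add vector, eqs. (3-13)-(3-14).
w = 1.0 / (1.0 + np.exp(-rng.standard_normal(A)))   # logistic -> [0, 1]
e = 1.0 / (1.0 + np.exp(-rng.standard_normal(B)))
a = np.tanh(rng.standard_normal(B))
M = M * (1.0 - np.outer(w, e))           # erase
M = M + np.outer(w, a)                   # add
```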
In the same way that Hopfield networks extend CAMs with an internal dynamic mechanism, we could also extend the previously proposed networks. If we think of
RNNs being implemented as for loops, dynamic CAMs in this context are essentially
nested for loops, where the external loop runs over time t and the internal loop runs for
a fixed number of iterations or until the addressable memory converges to an attractor.
Note that although it has been argued that modern second order RNNs do not rely on attractors (Jozefowicz et al., 2015), here we hypothesize that pairing RNNs with an architecture that does converge has the power to augment the resulting model's capacity. We leave experimental validation of this type of architecture for future work.
3.3 Differentiable Random Access Memory
This section discusses Algorithm 2. To implement random addressing, as
mentioned in the previous chapter, we need to define a probability distribution over
valid address locations that is independent of their content. In a previous work, we
Data: input sequences xt, initial values h0, r0. Weights (W, b) for each neural network.
Result: memorized representations M
for t ∈ {0, 1, ...} do
    ht ← RNN(xt, ht−1, rt−1)
    (Gaussian addressing)     µt = Wµ ht−1 + bµ,  σt² = σ0² · logistic(Wσ² ht−1 + bσ²),  r^c_{t,i} = (1/√(2πσt²)) exp(−(i − µt)²/(2σt²))
    (multinomial addressing)  r^c_t = softmax(Ww ht−1 + bw)
    gt = logistic(Wg ht−1 + bg)
    rt = gt r^c_t + (1 − gt) rt−1
    mt = Σi Mt,i rt,i
    wt = logistic(Ww ht + bw)
    et = logistic(We ht + be)
    at = tanh(Wa ht + ba)
    Mt = Mt−1 (1 − wt ⊗ et)
    Mt = Mt + wt ⊗ at
end
Algorithm 2: Differentiable Random Access Memory Network.
proposed to use a Gaussian distribution (see Santana et al. (2017)), in which case,
the memory addressing resembles the reading operation of Deep Recurrent Attentive
Write (DRAW) networks Gregor et al. (2015). Resembling RNNSearch Bahdanau
et al. (2014), we could also use a multinomial distribution. Each distribution has a
number of free parameters and those parameters should be calculated from h, for the
case of an isotropic Gaussian distribution, we only need to calculate a mean µt and
a variance σ2t . To simplify the equations, we assume a single addressable dimension
A, but all the equations can be easily extended to multiple dimensions by defining
the addressing distribution over all A1, ... ,AN . We can calculate the parameters for a
Gaussian addressing as
$$\begin{aligned}
\mu_t &= W_\mu h_{t-1} + b_\mu,\\
\sigma_t^2 &= \sigma_0^2 \cdot \mathrm{logistic}(W_{\sigma^2} h_{t-1} + b_{\sigma^2}),\\
r^c_{t,i} &= \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\left(-\frac{(i - \mu_t)^2}{2\sigma_t^2}\right),
\end{aligned} \qquad (3\text{–}15)$$
where r^c_t is the addressing distribution, with i ranging over all valid addresses.
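A NumPy sketch of the Gaussian location addressing in (3–15); the number of addresses and the controller outputs µ and σ² are illustrative values rather than learned quantities:

```python
import numpy as np

A = 8                                   # number of addressable locations
mu, sigma2 = 2.3, 0.5                   # in the model these come from affine/logistic maps of h
i = np.arange(A)

r_c = np.exp(-(i - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
r_c = r_c / r_c.sum()                   # renormalize over the finite set of addresses
print(r_c.round(3))                     # mass concentrated around location 2
```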
For a multinomial distribution, there is more freedom in the distribution shape over
the attended addresses. On the other hand, the number of trainable parameters is
proportional to the number of addresses. In an equation, we have
$$r^c_t = \mathrm{softmax}(W_w h_{t-1} + b_w). \qquad (3\text{–}16)$$
Again, we propose to use gating over r to combat vanishing gradients and ease the
process of switching between long and short term dependencies:
$$\begin{aligned}
g_t &= \mathrm{logistic}(W_g h_{t-1} + b_g),\\
r_t &= g_t r^c_t + (1 - g_t) r_{t-1}.
\end{aligned} \qquad (3\text{–}17)$$
For the write operation, wt can be calculated similarly to rt, but from ht instead of ht−1. Given rt, the read is simply the expectation over the valid address values

$$m_t = \sum_{i \in A_1, \ldots, A_N} M_{t-1,i}\, r_{t,i}. \qquad (3\text{–}18)$$

Given wt, et and at, the write operations are similar to the DCAM's, as in (3–14).
3.4 Hybrid Access Memory
Here we extend Graves et al. (2014)'s approach to use multidimensional addressable memories. The model we will describe in detail is shown in Algorithm 3. Hybrid access combines matching-based addressing with differentiable random addressing
Data: input sequences xt, initial values h0, r0. Weights (W, b) for each neural network.
Result: memorized representations M
for t ∈ {0, 1, ...} do
    ht ← RNN(xt, ht−1, rt−1)
    kt = tanh(Wk ht−1 + bk)
    βt = relu(Wβ ht−1 + bβ)
    r^c_t = softmax(βt K(Mt, kt)),  with Ki(X, y) = Xi y / (‖Xi‖ · ‖y‖) or Ki,σ²(X, y) = κσ²(Xi, y) / (Σj κσ²(Xj, Xj) · Σl κσ²(yl, yl))
    gt = logistic(Wg ht−1 + bg)
    r^g_t = gt r^c_t + (1 − gt) rt−1
    s^{Ai}_t = softmax(Ws ht−1 + bs)
    r^{si}_t = r^g_t ⋆ s^{Ai}_t   (repeated for each addressable dimension Ai)
    mt = Σi Mt,i rt,i
    wt = logistic(Ww ht + bw)
    et = logistic(We ht + be)
    at = tanh(Wa ht + ba)
    Mt = Mt−1 (1 − wt ⊗ et)
    Mt = Mt + wt ⊗ at
end
Algorithm 3: Hybrid Access Memory Network.
around the location retrieved by content. They can do so by first retrieving the address
of a given content word and shifting around that address. In equations, we start with a
content key calculated as
kt = tanh(Wkht−1 + bk). (3–19)
Note that here we are bounding the content key to [−1, 1]. Again, the content
addressing is done as an expectation over valid addresses
$$\beta_t = \mathrm{relu}(W_\beta h_{t-1} + b_\beta), \qquad r^c_t = \mathrm{softmax}(\beta_t K(M_t, k_t)), \qquad (3\text{–}20)$$
also β > 0 works as the inverse temperature of the Softmax, and controls how spread
the distribution is. This time, K should be a scale invariant similarity measure. One can
use, for example, either the cosine similarity as in Graves et al. (2014)
$$K_i(X, y) = \frac{X_i\, y}{\|X_i\| \cdot \|y\|}, \qquad (3\text{–}21)$$

or the Cauchy-Schwarz divergence, which is similar to the cosine similarity but computed in an RKHS (Principe et al., 2000),

$$K_{i,\sigma^2}(X, y) = \frac{\kappa_{\sigma^2}(X_i, y)}{\sum_j \kappa_{\sigma^2}(X_j, X_j) \cdot \sum_l \kappa_{\sigma^2}(y_l, y_l)}. \qquad (3\text{–}22)$$
To allow the network to choose between this new proposed address distribution and
the one used in the previous time step, we can integrate those values as
$$\begin{aligned}
g_t &= \mathrm{logistic}(W_g h_{t-1} + b_g),\\
r^g_t &= g_t r^c_t + (1 - g_t) r_{t-1}.
\end{aligned} \qquad (3\text{–}23)$$
Once we have r gt , we compute a multidimensional shift in the address space.
$$\begin{aligned}
s^{A_i}_t &= \mathrm{softmax}(W_s h_{t-1} + b_s),\\
r^{s_i}_t &= r^g_t \star s^{A_i}_t,
\end{aligned} \qquad (3\text{–}24)$$
where ⋆ denotes circular convolution in the Ai -th addressable dimension. We sequentially
repeat (3–24) for all the addressable dimensions to get to the distribution r st . Thus, this
operation is similar to differentiable RAM centered around the content addressed
locations. Since (3–24) smooths the address weights, Graves et al. (2014) suggested
to sharpen them using element-wise power and renormalization. This last step can be
implemented as:
$$\begin{aligned}
\gamma_t &= 1 + \mathrm{relu}(W_\gamma h_{t-1} + b_\gamma),\\
w_t &= \frac{(r^s_t)^{\gamma_t}}{\sum (r^s_t)^{\gamma_t}},
\end{aligned} \qquad (3\text{–}25)$$

where the summation above is across all the address dimensions Ai. Note that (3–24)
is an extended version of the architecture implemented in Graves et al. (2014). Since
the original formulation has only one dimension to address, when storing information to
memory about complex data structures, NTMs have to rely on the content addressable
memory to jump to different locations or force larger shifts. We hypothesize that allowing M to be organized in several dimensions makes it easier to store complex data structures. In the experimental section we validate this hypothesis. For brevity, we will only use the term NTM when referring to this specific implementation. From a practical point of view, st is also a probability distribution calculated with a softmax, and probability distributions are ill defined in high dimensional spaces (van Handel, 2014). When using single precision on GPUs, st can become unstable. In other words, refining st across
several dimensions not only potentially helps to store complex data structures, but it is
also useful for numerical stability reasons.
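A NumPy sketch of the one-dimensional circular-convolution shift (3–24) followed by the sharpening step (3–25); the address weights, shift distribution and γ are illustrative:

```python
import numpy as np

r_g = np.array([0.05, 0.8, 0.1, 0.05])   # address weights after content addressing
s = np.array([0.1, 0.8, 0.1])            # P(shift = -1), P(shift = 0), P(shift = +1)
offsets = [-1, 0, 1]

# Circular convolution: each allowed offset rotates the address weights.
r_s = sum(s_k * np.roll(r_g, k) for k, s_k in zip(offsets, s))

gamma = 2.0                               # sharpening exponent, eq. (3-25)
w = r_s ** gamma
w = w / w.sum()
print(w.round(3))                         # re-concentrated around location 1
```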
The write probability, wt can be calculated similarly to rt , but from the updated state
ht , instead of ht−1. Also, the read and write functions are the same as those for DRAM,
see (3–18) and (3–14), respectively.
Note that whenever we refer to NTM1D , we mean an NTM with a single addressable
dimension as implemented in Graves et al. (2014). For two dimensions, we refer to it
as NTM2D . For simplicity we use CAM and DiffRAM instead of “NTM with only content
addressing” or “NTM with random access only”, respectively.
3.5 Experiments
In this section we compare the proposed architectures against each other and against the RNN architectures proposed in the previous chapter. We are especially interested in problem-specific cost functions during training and in generalization to unseen samples. The success of applying the studied memory-augmented RNNs to complex problems, such as consolidating and clustering DPCN causes, depends on their ability to learn simple operations such as memorization, copying, memory transformation, etc. The next
experiments were designed by this author and others to test some of these abilities as
much as possible while keeping the tasks simple and reproducible.
The preliminary results are focused on a modified Adding problem Hochreiter &
Schmidhuber (1997) to quickly validate the benefits of adding an external memory
to simple RNNs. The Copy problem (Graves et al., 2014) tests the ability of the networks to remember variable-length sequences as well as to learn simple algorithms.
Finally, we propose the Rotation problem to test the ability of the proposed architectures
to work as complex content addressable memories and generative models for time
series. The Rotation problem is focused around the MNIST dataset to also investigate
the clustering properties of the internal self-organized representations by these
architectures.
3.5.1 Initialization and Algorithmic Choices
In all the following experiments, we initialized the memory tensors M0 with zeros
for DiffRAM and CAMp only. Note that for CAMm and NTM the norm of M is part of the
denominator when calculating the cosine similarity, in that case, we initialize the memory
with a small constant with value 0.001. The biases b were initialized with zeros and the
weight matrices using Glorot initialization Glorot & Bengio (2010). The hidden-to-hidden
matrices to the RNNs were initialized with random orthonormal matrices. Another
exception is that the bias utilized to calculate the addressing word for NTM is initialized
to random values to avoid zeros in the denominator of (3–21).
For NTM1 we limited the address shifts to (−2,−1, 0, 1, 2). The shifts for NTM2 is
limited to (−1, 0, 1) for each dimension to guarantee that both methods have similar
capabilities and facilitate comparison.
We tested CAMm using the cosine similarity measure as in NTM, and CAMp with
Gaussian kernel with adaptive kernel size.
All recurrent networks had 100 hidden neurons. This value ties the sizes of all the
other layers that get the outputs of the RNNs as inputs. We trained the models with
ADAM optimization algorithm with learning rate 0.0001.
3.5.2 Adding
In this modified adding problem we force the networks to have very few hidden states and test their ability to generalize to completely unseen input lengths. Here we compared NTMs using a GRU (2–10) as the internal RNN to generate the states h. The final output of each network is yt = logistic(Wy ht + by). The other compared methods are the simple RNN, the GRU and an LSTM. All the NTM models have only 10 hidden states. The memory tensor is M ∈ R^{10,1} for NTM1, CAMm and CAMp. It is M ∈ R^{3,3,1} for DiffRAM and
NTM2. The conventional RNN models have 20 hidden states.
The input signal for the adding problem is made of two variable length sequences,
the first one with real values between -1 and 1 and the second with flags -1, 0 or 1,
where only two elements assume the value 1. These two elements mark the numbers in
the first sequence that should be added. Letting the two marked values be called X1 and
X2, the target for a given input is $0.5 + \frac{X_1 + X_2}{4}$, which ensures the target is a real value
between 0 and 1. The compared models were trained with Adam Kingma & Ba (2014)
with learning rate equal to 10−3 for 10000 gradient updates with batch size 100, which is
a total of 106 random sequences for training. The gradients for training were clipped to
maximum norm 10 Pascanu et al. (2012) to avoid exploding gradients. During training,
the minimum length of the input sequences is 50, the maximum is 70 and the error
signal is only emitted after the entire sequence is presented. During test, the length is
100 to test the ability of the compared methods to generalize to harder problems. Target values between 0 and 1 justify the choice of the output nonlinearity. The cost function for training is the negative log-likelihood $L = -d_t \log(y_t) - (1 - d_t)\log(1 - y_t)$. According
Figure 3-2. Adding problem. (a) and (b) are from a sample input sequence of the test set. (c) shows the values of Mt for the best performing method, DRAM, when that input sequence was presented. Notice the change in contrast in the third row when the second positive peak in the second row is presented. That second peak indicates the second value to be added, and the change in contrast is the result of the model accumulating its value in the memory unit to complete the addition.
Table 3-1. Adding problem. This table compares our proposed architectures with the original Neural Turing Machines and classic recurrent neural network architectures. Notice that the model with differentiable random access memory performs best, as expected in arithmetic applications.

          NTM1   NTM2   CAMm   CAMp   DiffRAM   Simple RNN   GRU    LSTM
ACC (%)   81.2   82     88.8   63.2   92.2      13.4         64.4   65.4
to Hochreiter & Schmidhuber (1997), a sequence is considered correctly added if the output-target absolute error is lower than 0.04. We show in Table 3-1 the accuracy for the test set. Accuracy is calculated as the percentage of outputs with error less than 0.04.
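For reproducibility, a NumPy sketch of how one such training sequence can be generated; the number of −1 flags is an assumption, since the text only constrains the two +1 markers:

```python
import numpy as np

def adding_sequence(length, rng):
    values = rng.uniform(-1.0, 1.0, size=length)         # first sequence: real values in [-1, 1]
    flags = np.zeros(length)                              # second sequence: -1, 0 or 1
    flags[rng.choice(length, size=length // 3, replace=False)] = -1.0   # assumed proportion of -1s
    i, j = rng.choice(length, size=2, replace=False)      # exactly two positions marked with 1
    flags[[i, j]] = 1.0
    target = 0.5 + (values[i] + values[j]) / 4.0          # lies in [0, 1] by construction
    return np.stack([values, flags], axis=1), target

rng = np.random.default_rng(0)
x, d = adding_sequence(rng.integers(50, 71), rng)          # training lengths lie in [50, 70]
```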
Before discussing the results, it is important to note that those methods could achieve better results with more training, since the ADAM algorithm performs an automatic form of learning rate annealing, or with larger hidden states. This problem can
be easily solved by LSTMs Hochreiter & Schmidhuber (1997) and even simple RNNs
when properly initialized and given enough hidden states Sutskever et al. (2013). But,
previous works have not investigated how the trained networks generalize to sequence
lengths not present in the training set. Nevertheless, it is interesting to note the speed
of convergence for such a simple setting. Here we observed that NTM2 learned faster
than NTM1, which is surprising given the problem's simplicity, but it also did not reject our hypothesis that smaller shifts across several dimensions are better than larger addressing shifts across a single dimension. We believe that this is due to too much spreading in
the shifting operation and we plan to investigate solutions for this in the future. Also,
CAMm had better results than CAMp which makes sense, since here we want to store
values in the memory and do not need the nonlinear transformation performed in RKHS
to solve this problem.
In the first and second rows of Figure 3-2 we show a sample test sequence. In
the third row we show the memory of the best method DRAM evolving through time. In
this third row we can note a sudden change in color contrast when the second element
to be added is presented (a little after t = 40). Thus, this network was the fastest to
learn to simply store values to its extra memory and retrieve the result by the end of the
sequence, just like one would add values using an ALU in digital circuits.
From this first experiment we conclude that although we may feel tempted to always
use NTMs with both content and random addressing, which seems to be the most
complete memory architecture, we should still account for the problem complexity and
the number of adaptive parameters we are allowed to use.
3.5.3 Copy Problem
Graves et al. (2014) proposed the copy problem as a sequence-to-sequence
transformation. The input sequence is a sequence of 8-bit vectors with a minimum length of 1 and a maximum length of 10 vectors, followed by an "end of sequence" marker. That
Table 3-2. Copy problem: percentage of correctly copied bits. Comparison between two of our proposed methods and the Neural Turing Machines (NTM). Copying values from memory is a purely content addressable challenge, where the value to be copied is retrieved by content; thus NTM1 performs best. But we show that our NTM2 performs better than the original NTM1 when a larger region of the memory space needs to be addressed in a single step.

          NTM1 (3 shifts)   NTM1 (7 shifts)   NTM2   DRAM
ACC (%)   99.2              50.1              80.3   64.1
pattern is concatenated with a zero-valued sequence of maximum length
20. During the presentation of the zeros, the network is expected to output the same
sequence presented before the marker in the same order as the original input. In the
test set, the length of the input sequences is 100. The cost function for training was the
negative log-likelihood. The test accuracy was measured as the average number of bits
correctly represented in the output.
This problem does not fit well our generative model (3–3) since all the samples of
the sequence are statistically independent vectors, in which case z could not be smaller
than the input sequence itself for an exact generation. Nevertheless, this problem
revealed the advantages and limitations of the models we studied.
First, as shown in Graves et al. (2014), LSTMs fail to completely learn the training set and do not generalize well to the test set. We successfully reproduced the results using NTM1 when the location shift was fixed to a maximum of 3 values (i.e. no shift, 1 to the left or 1 to the right). The model learned the training set and
generalized well to the test set. NTM2 and DiffRAM learned the training set faster, but
didn’t generalize well to the test set. We observed that NTM2 overfitted the training
set, using a small segment of the multidimensional memory to implement a simple 1D
memory. In other words, we used a square memory of size 11x11 to train NTM2 and the
model focused on its main diagonal, running out of space when larger sequences were
used as input. We could obtain better results with different 2D memory configurations
such as 5x20, which made it easier for NTM2 to generalize to longer sequences, but this
Figure 3-3. Neural Turing Machine operations in the Copy problem. Upper row: write and read distributions over memory locations. Lower row: desired and actual output of the network; the cost is only calculated over the second half of the output. The first half of the input (from rows 0 to 20) is ignored because those are the time steps when the input is being presented.
only confirms that the linear nature of this problem does not require multidimensional
memories and is not the appropriate choice to compare these architectures.
We summarize the results in the test set in Table 3-2. In Figure 3-3 we can see the
position of the write and read distributions of the NTM1 that obtained the best results
in our experiments. Note how the network first writes to a linear sequence of locations and later reads from the very same positions in order, just like one would write a simple program for copy-pasting a sequence using an intermediate memory.
Nevertheless, NTM1 also had problems when the input series was larger than the
number of memory locations. In such cases, the network didn’t readapt its memory.
Table 3-3. Sequence generation cost function (negative log-likelihood, NLL) on the test set.

      NTM1    NTM2    LSTM
NLL   0.733   0.654   0.748
Further investigation is necessary to test whether this problem is due only to the lack of structure in the input or whether the model is also overfitting.
3.5.4 Sequence Generation
In this experiment we wanted to test the ability of the networks to generate
sequences given only the first frame and how long the output sequence should be. This
is to test how well the studied architectures work as associative memories themselves.
The input frame is a 784-point vector made of a flattened 28 by 28 pixels image from
the MNIST dataset. The 785th point of the input vector denotes how long should be the
output video. This last input point was calculated as
$$\frac{l - 5}{10}, \qquad (3\text{–}26)$$

where $l \in \{1, 2, \ldots, 10\}$ for training and $l = 20$ for testing. The desired sequence is a
smooth rotation of the input digit from 0 to 180 degrees. The spacing between each
angle varies with the sequence length. In this experiment we compared NTM1, NTM2
and LSTM trained using the negative log-likelihood between output and desired time
series as cost function. The final cost for the test set is shown in Table 3-3. Sample
generated sequences are shown in Figure 3-4. To further understand the operations of
these networks, we trained a classifier using the representations of the dynamic states
right after the presentation of the input sequence z = h1. The training was done using
the first 3000 samples from the MNIST training set; testing was done on the first 3000
samples of the test set. The classification accuracy is shown in Table 3-4.
The performance of both NTMs was similar, with NTM2 slightly better. Both were
better than the compared LSTM. We noticed that the classification accuracy using the
first hidden states was poor when compared to supervised DNNs for classification.
Figure 3-4. Sample desired and generated sequences using NTM2 and LSTM (rows: desired, NTM2, LSTM). The compared methods never had to generate sequences longer than 10 frames during training. Notice, nevertheless, that the proposed NTM2 was able to continue dreaming about more images, while the LSTM overfits to the training sequence length.
Table 3-4. Classification accuracy. We used the sequence generators' dynamic states as features. In other words, we classified the z representations; see Figure 3-1 for context.

          NTM1   NTM2   LSTM
ACC (%)   85     86     83
Visualizing the feature embedding, we noticed an overlap between the representations of the digits "4" and "9". Further testing is needed to understand whether the network was not good for classification because the representations were independent of the input shape; in other words, we have to test whether the network learned a shape-invariant rotation operation. In future work, we should test this hypothesis by rotating images from an external dataset.
In the next chapter, we use the memory mechanism studied here as part of a novel
architecture for video prediction.
CHAPTER 4
ADDRESSABLE MEMORIES AS PART OF A DIFFERENTIABLE GRAPHICS PIPELINE FOR VIDEO PREDICTION
The ability to predict future frames in video has several applications, for example
we can cite video compression, planning for robotics and image enhancement. Video
prediction was one of the original goals of DPCNs (Principe & Chalasani, 2014). In
this chapter we investigate a neural network architecture and statistical framework that
models frames in videos using principles inspired by computer graphics pipelines. The
proposed model explicitly represents "sprites", or percepts inferred from the maximum likelihood of the scene, and infers their movement independently of their content. We impose architectural constraints that force the resulting architecture to behave as a recurrent
what-where prediction network. The sprites in the scene are stored in an addressable
memory similarly to those investigated in the previous chapter, thus avoiding the
necessity of explicitly recalculating the shape of objects in the scene at every time step,
as usually done when using conventional RNNs for video prediction. We snapshot the
sprites using several mechanisms and address them using the methodology explained
in the previous chapter. Thus, the model specified here can be seen as a member of the
family of models described in the previous chapter. Nevertheless, this new architecture
has modules specific for video generation which we describe later in this chapter.
Developing what-where prediction networks with snapshot perceptions was one of the main goals set for this thesis; this chapter shows how we achieved it. We call this model Perception Updating Networks.
4.1 On the Need of a Differentiable Computer Graphics Pipeline
The current computer graphics pipelines are the result of efficient implementations
required by limited hardware and high frequency output requirements. These requirements
were also achieved with the use of explicit physics and optic constraints and modeling
with constantly improving data structures (Shirley et al., 2015).
Figure 4-1. Steps of the 2D graphics or rendering pipeline that inspired our model. We start with geometric primitives such as sprites, vectors, points, etc. The first step of the pipeline is the modeling transformation, which represents the geometric primitives in world coordinates. Clipping is the process of discarding the world representations that will not appear in the final image due to the limited view angle. The viewing transformation is the act of rotating, translating and deforming the objects in the view to comply with the point of view of the camera. Scan conversion finalizes the image generation with an array of values that compose the image to be displayed.
In contrast, Convolutional Neural Networks use brute-force search and matching to obtain features that are scale, rotation and translation invariant. Also, for a long time in machine
learning, image (Olshausen et al., 1996) and video (Hurri & Hyvarinen, 2003) generative
models had been investigated with statistical approaches that model images down to
the pixel level (Simoncelli & Olshausen, 2001), sometimes assuming neighborhood
statistical dependencies (Osindero & Hinton, 2008). In video prediction, the current state
of the art uses variations of deep convolutional recurrent neural networks (Kalchbrenner
et al., 2016) (Lotter et al., 2016) (Finn et al., 2016b).
As a parallel to the classic machine learning approach to image interpretation and
prediction, there is a growing trend in the deep learning literature for modeling vision
as inverse graphics (Kulkarni et al., 2015)(Rezende et al., 2016)(Eslami et al., 2016).
These approaches can be interpreted into two groups: supervised and unsupervised
vision as inverse graphics. The supervised approach assumes that during training an
image is provided with extra information about its rotation, translation, illumination,
etc. The goal of the supervised model is to learn an auto-encoder that explicitly factors
out the content of the image and its physical properties. The supervised approach is
illustrated by Kulkarni et al. (2015).
The unsupervised approach requires extra architectural constraints, similar to
those assumed in computer graphics. For example, Reed et al. (2016) modeled the
content of a scene with a Generative Adversarial Network (Goodfellow et al., 2014)
and its location with Spatial Transformer Networks (Jaderberg et al., 2015). The full
model is adapted end-to-end to generate images whose appearance can be changed
by independently modifying the ”what” and/or ”where” variables. A similar approach
was applied to video generation with volumetric convolutional neural networks (Vondrick
et al., 2016). In two papers by Google DeepMind (Rezende et al., 2016) (Eslami et al.,
2016) they improved the ”where” representations of the unsupervised approach and
modeled the 3D geometry of the scene. This way they explicitly represented object
rotation, translation, camera pose, etc. Their approaches were also trained end-to-end
with REINFORCE-like stochastic gradients to backpropagate through non-differentiable
parts of the graphics pipeline (Rezende et al., 2016) or to count the number of objects in
the scene (Eslami et al., 2016). Those papers also used Spatial Transformer Networks
to model the position of the objects in the scene, but they extended it to 3D geometry so
it could also model rotation and translation in a volumetric space.
Other approaches in machine learning inspired by the graphics pipeline and computer vision geometry use physical constraints to estimate the depth of each pixel in the scene and camera pose movements to predict frames in video (Mahjourian et al.,
2016) (Godard et al., 2016).
The new approach we developed is closer to the unsupervised approach of vision
as inverse graphics. More precisely, here we investigate frame prediction in video.
Contrary to the work by Reed et al. (2016) here we first limit ourselves to simple
synthetic 2D datasets and learning models whose representations can be visually
interpreted. This way we can investigate exactly what the neural network is learning
and validate our statistical assumptions. Most importantly, we can verify what the
memory unit of our architecture is able to snapshot and memorize from the scene. Also,
we investigate the behavior of Spatial Transformer Networks and question it as the
default choice when limited compute resources are available and no scale invariance is
required.
First in the next section we will pose a statistical model that is appropriate for
machine learning but inspired by the graphics pipeline. This will allow us to train a
memory augmented neural network using end-to-end backpropagation, just like we
did in the last chapter. From an experiment perspective, here instead of learning to
represent variable length video streams as fixed length vectors, we want to learn to
predict future frames in video using the extra power of addressable memories to avoid
redundant computations.
4.2 A 2D Statistical Graphics Pipeline
This section starts with a high level description of the 2D graphics pipeline, followed
by a discussion of how to implement it with neural network modules, and finally we
define a formal statistical model.
Figure 4-2. How to get similar results using convolutions with delta functions and spatial transformers (panels: convolution and spatial transformer, with their respective results). The input sprite is 8 × 8 pixels and the outputs are 64 × 64 pixels. Note that in the convolution the result shape is rotated 180 degrees and its center is where the delta equals one, at pixel (x = 16, y = 16). Note also that the edges of the spatial transformer results are blurred due to bilinear interpolation. The A matrix can be read as "zoom out" 8 times and translate up and left by a quarter of the resulting size.
4.2.1 Preliminary Considerations and Relevant Literature Review
The 2D graphics pipeline starts from geometric primitives and follows with modeling
transformations, clipping, viewing transformations and finally scan conversion for
generating an image, see Figure 4-1. Here, we will deal with previously rasterized
bitmaps, i.e. sprites, and will model the translation transformations, rotation and clipping with differentiable operations. This way, the steps in the pipeline can be defined as layers
of a neural network and the free parameters can be optimized with backpropagation.
For our neural network implementation, we assume a finite set of sprites (later we
generalize it to infinite sprites) that will be part of the frames in the video. The image
generation network selects a sprite, s, from a memorized sprite database $S_{i \in \{1, \ldots, K\}}$ using an addressing signal c:

$$s = \sum_j c_j S_j, \quad \text{where} \quad \sum_j c_j = 1. \qquad (4\text{–}1)$$
(4–1)
Note that this is the same location addressing mechanism discussed in the previous
chapter. For interpretable results it would be optimal to do one-hot memory addressing
where cj = 1 for Sj = S and cj = 0 otherwise. Note that (4–1) is differentiable w.r.t
to both cj and Sj so we can learn the individual sprites from data. We can force cj add
up to 1 using the softmax nonlinearity. This approach was inspired by the recent deep
learning literature on attention modules (Bahdanau et al., 2014) (Graves et al., 2014)
and a more detailed discussion is shown in the previous chapter.
When the number of possible sprites is too large it is more efficient to do a
compressed representation. Instead of using an address value c we use a content
addressable memory where the image generator estimates a code z that is then
decoded to the desired sprite with a (possibly nonlinear) function d(z). The addressing value z can be interpreted as a latent representation and d(z) as a decoder, which is essentially a content addressable memory as discussed in previous chapters. Also, we can use the recent advances in neural networks for generative models to set up our statistical model. We will revisit this later in this section.
The translation transformation can be modeled with a convolution with a Delta
function or using spatial transformers. Note that the translation of an image I (x , y) can
be defined as
$$I(x - \tau_x, y - \tau_y) = I(x, y) \star \delta(x - \tau_x, y - \tau_y), \qquad (4\text{–}2)$$
where ⋆ denotes the image convolution operation. Clipping is naturally handled in such a
case. If the output images have finite dimensions and δ(x−τx , y−τy) is non-zero near its
border, the translated image I (x−τx , y−τy) will be clipped. Another way of implementing
the translation operation is using Spatial Transformer Networks (STN) (Jaderberg et al.,
2015). An implementation of STN can be defined in two steps: resampling and bilinear
interpolation. Resampling is defined by moving the position of the pixels $(x, y)$ in the original image using a linear transform to new positions $(\tilde{x}, \tilde{y})$ as

$$\begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} = A \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad \text{where} \quad A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \end{bmatrix}. \qquad (4\text{–}3)$$
We assume the coordinates in the original image are integers 0 ≤ x < M and
0 ≤ y < N, where M × N is the size of the image I . Once the new coordinates are
defined, we can calculate the values of the pixels in the new image $\tilde{I}$ using bilinear interpolation:

$$\tilde{I}(\tilde{x}, \tilde{y}) = w_{x_1,y_1} I(x_1, y_1) + w_{x_1,y_2} I(x_1, y_2) + w_{x_2,y_1} I(x_2, y_1) + w_{x_2,y_2} I(x_2, y_2), \qquad (4\text{–}4)$$

where $(x_1, x_2, y_1, y_2)$ are integers, $x_1 \leq \tilde{x} < x_2$, $y_1 \leq \tilde{y} < y_2$, and
$$\begin{aligned}
w_{x_1,y_1} &= (\lfloor \tilde{x} \rfloor + 1 - \tilde{x})(\lfloor \tilde{y} \rfloor + 1 - \tilde{y})\\
w_{x_1,y_2} &= (\lfloor \tilde{x} \rfloor + 1 - \tilde{x})(\tilde{y} - \lfloor \tilde{y} \rfloor)\\
w_{x_2,y_1} &= (\tilde{x} - \lfloor \tilde{x} \rfloor)(\lfloor \tilde{y} \rfloor + 1 - \tilde{y})\\
w_{x_2,y_2} &= (\tilde{x} - \lfloor \tilde{x} \rfloor)(\tilde{y} - \lfloor \tilde{y} \rfloor).
\end{aligned} \qquad (4\text{–}5)$$
To avoid sampling from outside the image we clip the values $\lfloor \tilde{x} \rfloor$ and $\lfloor \tilde{x} \rfloor + 1$ between 0 and M, and the values $\lfloor \tilde{y} \rfloor$ and $\lfloor \tilde{y} \rfloor + 1$ between 0 and N. We omitted that in (4–5) for conciseness. Note that (4–4) is piecewise differentiable w.r.t. $I$.
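A NumPy sketch of the resampling and bilinear interpolation of (4–3)–(4–5), using inverse mapping and clamping out-of-range coordinates; the function name and the example transform are illustrative:

```python
import numpy as np

def affine_warp(I, A):
    M, N = I.shape
    out = np.zeros_like(I)
    for yo in range(M):
        for xo in range(N):
            xs, ys = A @ np.array([xo, yo, 1.0])          # source coordinates, eq. (4-3)
            x1, y1 = int(np.floor(xs)), int(np.floor(ys))
            wx, wy = xs - x1, ys - y1                     # fractional offsets
            x1c, x2c = np.clip([x1, x1 + 1], 0, N - 1)    # clamp to stay inside the image
            y1c, y2c = np.clip([y1, y1 + 1], 0, M - 1)
            out[yo, xo] = ((1 - wx) * (1 - wy) * I[y1c, x1c] + (1 - wx) * wy * I[y2c, x1c]
                           + wx * (1 - wy) * I[y1c, x2c] + wx * wy * I[y2c, x2c])
    return out

I = np.zeros((64, 64)); I[28:36, 28:36] = 1.0             # an 8 x 8 sprite in a 64 x 64 frame
A = np.array([[1.0, 0.0, -16.0], [0.0, 1.0, -16.0]])      # samples I(x-16, y-16): shifts the sprite by (+16, +16)
warped = affine_warp(I, A)
```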
We can define translation with

$$A = \begin{bmatrix} 1 & 0 & \tau_x \\ 0 & 1 & \tau_y \end{bmatrix}. \qquad (4\text{–}6)$$

Also, we can rotate the image $\rho$ radians counterclockwise with

$$A = \begin{bmatrix} \cos\rho & \sin\rho & 0 \\ -\sin\rho & \cos\rho & 0 \end{bmatrix}. \qquad (4\text{–}7)$$
Image rescaling is achieved in this framework by rescaling the square submatrix $A_{1:2,1:2}$. We illustrate in Fig. 4-2 how to get similar results using convolutions
with a delta-function and spatial transformers.
Our proposed statistical framework is based on the Variational Autoencoding Bayes
framework (Kingma & Welling, 2013). In the next subsection we review the Gaussian
and Gumbel-Softmax variational autoencoders (Jang et al., 2016) (Maddison et al.,
2016).
4.2.2 Variational Autoencoding Bayes
The Variational Autoencoding Bayes framework, also known as the variational autoencoder (VAE), proposed by Kingma & Welling (2013), uses neural networks to invert an intractable generative model $p_\theta(z) p_\theta(x|z)$ with unknown parameters $\theta$ and unobserved
Figure 4-3. Variational autoencoder graphical model. An observable variable x is generated from unobserved factors z according to pθ(z)pθ(x|z), where the parameters θ are not observed either. We approximate z by learning a tractable recognition model qϕ that approximates the posterior pθ(z|x), given a known prior pθ(z); here both take the form of neural networks. Solid lines represent the generative model and dashed lines the learnt recognition (or inference) model.
latent variables z , as depicted in Figure 4-3. The neural network is used to infer qϕ(z |x)
that approximates the true posterior pθ(z |x), assuming a known prior pθ(z).
Given a set of observations x ∈ {x1, x2, ... , xM}, the recognition model qϕ(z |x) is
trained to optimize the evidence lowerbound (ELBO):
L(θ,ϕ; xi) = −DKL(qϕ(z |xi)||pθ(z)) + Eqϕ(z |xi ) [log pθ(xi |z)] , (4–8)
where DKL(q||p) = Σi q(i) log(q(i)/p(i)) is the Kullback–Leibler divergence between two
distributions and Eqϕ(z|xi)[log pθ(xi|z)] is the autoencoder reconstruction cost (e.g. mean
square error for continuous variables and binary cross-entropy for discrete variables). In
practice (4–8) assumes different forms depending on the prior distribution pθ(z). The
original formulation of VAE (Kingma & Welling, 2013) assumed a Gaussian prior. Also of
interest for the present work is the Categorical distribution (Jang et al., 2016) (Maddison
et al., 2016).
In order to make −DKL(qϕ(z|xi)||pθ(z)) differentiable with respect to the parameters
ϕ, the reparametrization trick was proposed (Kingma & Welling, 2013), which allows
us to sample and differentiate through qϕ(z|x).
Figure 4-4. Block diagram of a Variational Autoencoder with Gaussian prior and
reparametrization trick.
The trainable parameters are in the encoder and decoder networks (MLP or CNN) and in
the linear layers (conventional fully connected) that generate the mean m and standard
deviation v. The cost function is given by equations (4–8) and (4–10). Notice that in
practice, as represented, the weights of the encoder network are shared between the m
and v pathways.
For the Gaussian prior with zero mean and identity covariance, the reparametrization
trick takes the form:
Gaussian: z ∼ qϕ(z |x) = mϕ(x) + vϕ(x)ξ, (4–9)
where mϕ and vϕ are learned mean and standard deviation functions and ξ ∼ N(0, I).
For this zero mean, identity covariance Gaussian case, the Kullback–Leibler divergence
becomes:

DKL(qϕ(z|xi)||pθ(z)) = −(1/2) Σj (1 + 2 log vj(xi) − mj(xi)² − vj(xi)²), (4–10)

where the sum runs over the dimensions of the latent vector z.
For illustration purposes and to make clear how to use VAE in practice, we show a
block diagram of an autoencoder that uses shared layers to calculate m and v as usually
done in practice (Kingma & Welling, 2013), see Figure 4-4.
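As a concrete illustration of (4–9) and (4–10), the following Python sketch (our own illustration; the encoder outputs m and v are taken as given) draws a latent sample with the reparametrization trick and evaluates the Gaussian Kullback–Leibler term:

import numpy as np

def sample_gaussian_latent(m, v):
    """Reparametrization trick of equation (4-9): z = m + v * xi, with xi ~ N(0, I)."""
    xi = np.random.randn(*m.shape)
    return m + v * xi

def gaussian_kl(m, v, eps=1e-8):
    """KL(q(z|x) || N(0, I)) summed over latent dimensions, as in (4-10)."""
    return -0.5 * np.sum(1.0 + 2.0 * np.log(v + eps) - m ** 2 - v ** 2, axis=-1)

# Example with a batch of 4 inputs and a 10-dimensional latent code.
m = np.random.randn(4, 10)          # mean produced by the encoder
v = np.abs(np.random.randn(4, 10))  # standard deviation produced by the encoder
z = sample_gaussian_latent(m, v)
kl = gaussian_kl(m, v)              # one KL value per example in the batch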
Another prior distribution pθ(z) relevant for the present work is the Categorical
distribution. For example, z may be a vector of one-hot encoded variables,
in other words a sparse vector where all elements are 0 except one, which has the value
1. The Gaussian reparametrization trick is not a good fit in this case and, complicating the
issue further, sampling discrete random variables is not a differentiable operation. To
cope with this problem, an approximation using Softmax distributions and Gumbel noise
was proposed (Jang et al., 2016) (Maddison et al., 2016). This approximation, called
Gumbel-Softmax uses the following reparametrization trick (Jang et al., 2016):
Gumbel-Softmax: z ∼ qϕ(z |x) = softmax(mϕ(x) + ζ), (4–11)
where ζ is a random variable sampled from the Gumbel distribution using the inverse
CDF method
ζ = −log(−log u), u ∼ U(0, 1). (4–12)
Using the Gumbel-Softmax reparametrization, the Kullback–Leibler divergence part of
the ELBO is:
DKL(qϕ(z|x)||pθ(z)) = softmax(mϕ(x)) · (log softmax(mϕ(x)) − log(1/M)), (4–13)
where M is the dimensionality of the latent space z and the product is summed over the
M categories. A block diagram similar to the one presented in Figure 4-4 can be used to
train a Gumbel-Softmax autoencoder (Jang et al., 2016), with the main differences being
in how the reparametrization trick is defined and in this new cost function.
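The Gumbel-Softmax reparametrization of (4–11)–(4–13) can be sketched in the same way (a temperature that sharpens the softmax is commonly used in practice but is omitted here to match (4–11)):

import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_sample(logits, eps=1e-20):
    """Reparametrization of equations (4-11)-(4-12): softmax(logits + Gumbel noise)."""
    u = np.random.uniform(size=logits.shape)
    zeta = -np.log(-np.log(u + eps) + eps)   # Gumbel(0, 1) via the inverse CDF
    return softmax(logits + zeta)

def categorical_kl_to_uniform(logits, eps=1e-20):
    """KL between softmax(logits) and a uniform prior over M categories, as in (4-13)."""
    M = logits.shape[-1]
    q = softmax(logits)
    return np.sum(q * (np.log(q + eps) - np.log(1.0 / M)), axis=-1)

logits = np.random.randn(4, 20)        # e.g. unnormalized scores over 20 positions
z = gumbel_softmax_sample(logits)      # soft one-hot samples
kl = categorical_kl_to_uniform(logits)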
In the next subsection we use our preliminary considerations and the VAE
definitions to propose the statistical framework we want to optimize here.
4.2.3 Proposed Statistical Framework
This section states the main theoretical contributions we developed in this chapter.
Considering the tools defined above, we can define a statistical model of 2D images that
explicitly represents sprites and their positions in the scene. We can use the free energy
of this statistical model to optimize a neural network. Let us start with a static single
frame model and later generalize it to video.
Let an image I ∼ pθ(I) be composed of a sprite s ∼ pθ(s) centered at coordinates (x, y)
in the larger image I. Denote these coordinates as a random variable
δxy ∼ pθ(δxy), where θ are the model parameters. pθ(δxy) can be factored into two marginal
categorical distributions Cat(δx) and Cat(δy) that model the probability of each
coordinate of the sprite independently. For the finite sprite dataset, pθ(s) is also a
categorical distribution conditioned on the true sprites. For this finite case the generative
model can be factored as
pθ(I , s, δ) = pθ(s)pθ(δxy)p(I |s, δxy), (4–14)
assuming that “what”, s, and “where”, δxy , are statistically independent. Also, in such
case the posterior
pθ(s, δ|I ) = pθ(s|I )p(δxy |I ) (4–15)
is tractable. One could use for instance Expectation-Maximization or greedy
approaches like Matching Pursuit to alternate between the search for the position and
fitting the best matching shape. For the infinite number of sprites case, we assume
that there is a hidden variable z from which the sprites are generated as p(s, z) =
pθ(z)pθ(s|z). In such case our full posterior becomes
pθ(z, s, δ|I) = pθ(z, s|I) p(δxy|I) = pθ(z|I) pθ(s|I, z) p(δxy|I). (4–16)
We can simplify (4–16) assuming pθ(z |s) = pθ(z |I ) for simple images without
ambiguity and no sprite occlusion. For a scalable inference in the case of unknown θ
and z and intractable pθ(z |s) we can use the auto-encoding variational Bayes (VAE)
approach (Kingma & Welling, 2013). Using VAE we define an approximate recognition
model qϕ(z|s). In such case, the log-likelihood of the i.i.d. images I is log pθ(I1, ..., IT) = Σi log pθ(Ii) and
log pθ(Ii) = DKL(qϕ(z|si)||pθ(z|si)) + DKL(pθ(z|si)||pθ(z|Ii)) + L(θ, ϕ, δxy, Ii). (4–17)
Again, assuming the approximation pθ(z|s) = pθ(z|I), we have DKL(pθ(z|si)||pθ(z|Ii)) =
0 and the free energy (or variational lower bound) term equal to
L(θ, ϕ, δ, I) = −DKL(qϕ(z|si)||pθ(z)) + Eqϕ(z|s,δ)pθ(δ|I)[log pθ(I|z, δ)], (4–18)
where we dropped the subindices xy and i to simplify reading. Here we would like to
train our model by maximizing the lower bound (4–18), again inspired by VAE. We
can do so using the reparametrization trick assuming qϕ(z |s) and the prior pθ(z) to be
Gaussian and sampling (4–9) as:
z = mϕ(I ) + vϕ(I ) · ξ, (4–19)
where ξ ∼ N(0, σI), with I here denoting the identity matrix, and the functions m(I) and v(I) are deep neural
networks learned from data.
One can argue that given z and a good approximation to the posterior qϕ, estimating
δ is still tractable. Nevertheless, we preemptively avoid Expectation-Maximization or
other search approaches and use instead neural network layers lx and ly :
δxy = softmax(lx(I ))⊗ softmax(ly(I )), (4–20)
with ⊗ denoting the outer product of marginals. We also use a variational approximation
for qϕ(δxy|I) ≈ pθ(δxy|I). Since the position variables given by lx(I) and ly(I) are categorical
random variables, in this case we use the Gumbel-Softmax variational trick (4–11) for
sampling. With this extra reparametrization, the final form of our evidence lower bound
becomes:
L(θ, ϕ, δ, I) = −DKL(qϕ(z|si)||pθ(z)) − DKL(qϕ(δx|I)||pθ(δx)) − DKL(qϕ(δy|I)||pθ(δy))
+ Eqϕ(z|s,δ)qϕ(δx|I)qϕ(δy|I)[log pθ(I|z, δ)], (4–21)
where we show the factored, statistically independent marginals qϕ(δx|I) and qϕ(δy|I)
to make explicit what the final cost functions will look like. We can substitute (4–10) in
(4–21) for the Kullback–Leibler divergence of the Gaussian model qϕ(z|s) and (4–13)
twice for the Categorical models qϕ(δx|I) and qϕ(δy|I).
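Putting the terms together, a minimal sketch of the resulting cost is given below. The reconstruction term is written as a pixel-wise binary cross-entropy, which is one of the choices mentioned after (4–8), and the inputs m, v, logits_x, logits_y and the predicted image are assumed to come from the recognition networks and the generator:

import numpy as np

def _softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pun_negative_elbo(I_true, I_pred, m, v, logits_x, logits_y, eps=1e-8):
    """Negative of the lower bound (4-21): Gaussian KL for the sprite code z (4-10),
    two categorical KLs for the positions delta_x and delta_y (4-13), plus a
    reconstruction term (here pixel-wise binary cross-entropy)."""
    kl_z = -0.5 * np.sum(1 + 2 * np.log(v + eps) - m ** 2 - v ** 2, axis=-1)
    qx, qy = _softmax(logits_x), _softmax(logits_y)
    kl_dx = np.sum(qx * (np.log(qx + eps) - np.log(1.0 / qx.shape[-1])), axis=-1)
    kl_dy = np.sum(qy * (np.log(qy + eps) - np.log(1.0 / qy.shape[-1])), axis=-1)
    recon = -np.sum(I_true * np.log(I_pred + eps)
                    + (1 - I_true) * np.log(1 - I_pred + eps), axis=(-2, -1))
    return recon + kl_z + kl_dx + kl_dy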
Such amortized inference is also faster in training and test time than EM and will
also cover the case where I is itself a learned low dimensional or latent representation
instead of an observable image. Bear this in mind while we use this approach even in
simple experiments such as those with moving shapes in the Experiments section. This
will help us to understand what can be learned from this model. Also, this will be crucial
when we scale our model in the next chapter.
Beyond images, we extend the model above to videos, i.e. sequences of images
I (t) = {I (0), I (1), ...}, assuming that the conditional log-likelihood log pθ(It |HIt) =
log pθ(It |Hδt ,Hzt) follows (4–17), where HIt is the history of video frames prior to time
point t. Also Hδt and Hzt are the history of position coordinates and the history of
latent variables of the sprites respectively. We should observe that one can make
the assumption that the sprites don’t change for a given video I (t) and only estimate
one sprite st=0 or hidden variable zt=0. This assumption can be useful for long term
predictions, but requires that the main object moving in the scene doesn’t change.
In the next section, we propose a neural network architecture for maximizing our
approximate variational lower bound for 2D videos.
4.3 Perception Updating Networks
This section proposes a group of neural architectures for optimizing the lower
bound (4–18). This is a specific case of the more general framework presented in the
previous chapter, but with modules specifically tuned for video and image generation,
such as convolutions and spatial transformers. A schematic diagram is represented in
Fig. 4-5. The core of our method is a Recurrent Neural Network (RNN) augmented
with task specific modules, namely a sprite addressable memory and layers that model the transformations.
Figure 4-5. A schematic block diagram for a Perception Updating Network.
This configuration uses both convolutions with delta functions for translation and spatial
transformers for rotation. It also shows the optional background underlay. Here, the
sprites module is an external memory that is addressed by the RNN. Thus, Perception
Updating Networks are a specific case of the memory augmented framework presented
in the previous chapter. For an equivalent schematic diagram unfolded in time, we refer
the reader to Figure 3-1.
RNNs augmented with task specific units were popularized
by Graves et al. (2014) in the context of learning simple differentiable algorithms and
served as inspiration for us as well. Here, since we explicitly model the perceived sprites
as s or z and update them and their location and/or rotation through time, we decided to call our
method simply Perception Updating Networks.
Here an input frame at time t, It , is fed to the RNN that emits 2 signals: a memory
address that selects a relevant sprite and transformation parameters. If we are doing
the translation transformation using convolutions and delta functions this output is equal
to (4–20), see Algorithm 4 and Algorithm 5. If using STN, the translation operation
returns the matrix A used in (4–3), see Algorithm 6. Note that we could use both,
letting convolutions with δ do the translation while constraining A as in (4–7) to do rotation
transformations only. We describe the general case where both δxy and STNs are used
in Algorithm 7.
Beyond deciding between STNs vs δxy , a few other free parameters of our method
are the type of RNN (e.g. vanilla RNN, LSTM, GRU, ConvRNN, etc), the number of
neurons in the hidden state of the RNN, and the neural network architectures that infer the
correct sprite and the transformation parameters. Our hyperparameter choices are
investigated separately in each experiment in the next Section.
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers m, lx, ly, and a content addressable memory CAM as defined in (4–1)
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    ξ ∼ pθ(z)
    ct = m(ht)
    st = CAM(ct)
    It+1 = st ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 4: Convolutional Perception Updating Networks (conv PUN) with Content Addressable Memory.
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, lx, ly
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    ξ ∼ pθ(z)
    zt = mϕ(ht) + vϕ(ht) · ξ
    st = d(zt)
    It+1 = st ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 5: Convolutional Perception Updating Networks (conv PUN) with scalable sprites memory in the form of a variational decoder.
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, f
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    a = f(ht)
    A = [a11 a12 a13; a21 a22 a23]
    ξ ∼ pθ(z)
    zt = mϕ(ht) + vϕ(ht) · ξ
    st = d(zt)
    It+1 = STN(st, A)
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 6: Spatial Transformer Perception Updating Networks (STN PUN).
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, lx, ly, f
Result: video predictions It, t ∈ {1, 2, 3, ...}
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    ρ = f(ht)
    A = [cos ρ  sin ρ  0; −sin ρ  cos ρ  0]
    ξ ∼ pθ(z)
    zt = mϕ(ht) + vϕ(ht) · ξ
    st = d(zt)
    at = STN(st, A)
    It+1 = at ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 7: Convolutional Perception Updating Networks with Spatial Transformer rotations.
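To make the inner loop of Algorithms 4–7 concrete, the sketch below implements one generation step in the style of Algorithm 5. The RNN update is assumed to have already produced the hidden state ht, the decoder d and the location layers lx, ly are replaced by random linear stand-ins rather than the architectures used in the experiments, and µ is treated as a scalar although in the model it is a learned mask:

import numpy as np
from scipy.signal import convolve2d

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conv_pun_step(h, params, background, mu, rng):
    """One step of a convolutional PUN with a variational sprite decoder (Algorithm 5)."""
    W_lx, W_ly, W_m, W_v, W_d = params           # stand-in linear layers
    delta_x = softmax(W_lx @ h)                  # "where": marginal over rows
    delta_y = softmax(W_ly @ h)                  # "where": marginal over columns
    delta_xy = np.outer(delta_x, delta_y)        # equation (4-20)
    xi = rng.standard_normal(W_m.shape[0])
    z = W_m @ h + np.abs(W_v @ h) * xi           # reparametrized code (4-19); abs is a
                                                 # stand-in for a positivity constraint
    sprite = (W_d @ z).reshape(8, 8)             # decoded "what", s_t = d(z_t)
    canvas = convolve2d(delta_xy, sprite, mode='same')   # place the sprite in the frame
    return mu * canvas + (1.0 - mu) * background         # optional background underlay

rng = np.random.default_rng(0)
h = rng.standard_normal(100)                     # RNN hidden state at time t
params = (rng.standard_normal((20, 100)), rng.standard_normal((20, 100)),
          rng.standard_normal((16, 100)), rng.standard_normal((16, 100)),
          rng.standard_normal((64, 16)))
background = rng.uniform(size=(20, 20))
frame = conv_pun_step(h, params, background, mu=0.9, rng=rng)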
In the next section we present experiments with the proposed architecture on
synthetic datasets.
4.4 Experiments
In this section we experiment with several implementations of the proposed
Perception Updating Networks. We start with a simple synthetic dataset made of
videos where one of 3 shapes moves with constant speed, bouncing off the
edges of the image. This illustrates the working of the finite memory and the addressing
scheme in (4–1) and Algorithm 4. Afterwards we show results on the moving MNIST
dataset (Srivastava et al., 2015) commonly used in the literature of generative neural
network models of videos.
4.4.1 Bouncing Shapes
In this first experiment we generate videos of one of three shapes moving on a
non-zero background. The shapes are a square, triangle and cross. The image size is
20 × 20 pixels and the shapes are 8 × 8 pixels. The pixel values are between 0 and 1.
The shapes are picked with equal probability and they move at constant speed of 1 pixel
Figure 4-6. Results on the Bouncing Shapes dataset.
Three 8x8 sprites (a square, a cross and a triangle) were used to generate videos. The
shapes move in a 20x20 pixels canvas with a Toeplitz background and bounce on the
corners. a) One step ahead predictions with the compared methods (ground truth,
convolutional PUN, LSTM and spatial transformer PUN). b) The learned sprites for the
convolutional implementation of the proposed Perception Updating Networks when we
over-estimate (10x10) and under-estimate (6x6) the size of the desired sprites, together
with sample δxy maps in each case. The internal RNN for both methods had 100 neurons.
The sprite selection layer is a single layer connecting 100 inputs to 3 outputs.
per frame. The shapes start from random initial positions and their movement directions are
random as well.
We tested two implementations of the proposed architecture: one using only
convolutions, referred to as convolutional PUN (conv PUN) in the figures, and another
using spatial transformers, called Spatial Transformer PUN. For the parameters
of the convolutional PUN the RNN used was a Long Short Term Memory (LSTM) with
100 cells. The RNN in the Spatial Transformer PUN had 256 cells. In the convolutional
PUN, the location layers used to calculate δxy , lx and ly , output vectors of size 20 pixels
and we used the finite addressable memory described in (4–1). The background is also
learned from data as weights of the neural network. This background served to make the
task more difficult and force the network to avoid just exploiting any non-zero value. After
the convolutional composition It = st ⋆ δxy , we added the background to form a new
image using It = µ · It + (1 − µ)B, where µ is a differentiable mask that accounts for the
“transparency” of the image It . B is the learned 20 × 20 pixels background image. For
complex shapes this mask shape could be calculated as another module in the network,
similarly to the approach in Vondrick et al. (2016). See Algorithm 4.
In the following experiments, the training videos were 10 frames long. At test time
the network is fed the first 10 frames of a video and asked to predict the next 10. Results
for the compared methods are shown in Fig. 4-8. For the baseline method, we did a
hyperparameter search on conventional LSTMs with a single linear output layer until we
found one that had comparable results at test time. That network had 256 hidden cells.
Also, note that although the scale of the mean square error is the same, the results from
our proposed architecture look smoother than those learned by the LSTM as shown in
Fig. 4-6.
Given such a simple experiment, it is elucidating to visualize values learned by
each piece of the network. As expected, the sprite memory learned the 3 investigated
shapes in flipped orientation, since they are reversed by the convolution operation used to
compose the frame. We also experimented with choosing the size of the learned sprites
st smaller and larger than the true shapes. We observed that for larger shapes such
as 10 × 10 the sprites converge to the correct shapes but just using part of the pixels.
For smaller shapes such as 6 × 6 pixels, instead of learning a part of the correct shape,
the convolutional Perception Updating Network learned to compensate for the lack of
enough pixels with more than one non-zero value in the location operation δxy (see
Fig. 4-6). This allows us to suggest to the interested practitioner that in order to get
interpretable results it is better to use sprites larger than the expected size rather than smaller.
For the spatial transformer PUN the image is calculated as:
Figure 4-7. Results of a Convolutional Perception Updating Network.The first row show the predicted video of a bouncing triangle and the second row shows
“where” variable that encoder the position of the sprite in the scene. This decouplingbetween “what” (the triangle) and “where” is what gives Perception Updating Network
interpretability, efficiency and generalization. It is important to notice on this image thatthe quality of the first predicted frame is not as good as the others, the reason is that weinitialize the internal RNN state with a vector zeros and it takes one step for the state tobe updated to a more useful value. Future work should address training the initial state
as well.
A = f (ht),
It+1 = STN(st ,A),
(4–22)
see Algorithm 6 for context.
We noticed that the spatial transformer PUN was not able to learn the training
videos using an equivalent architecture to the convolutional PUN one. We had to use
multiple layers to define the function f (ht). In other words, in the convolution based
method δxy can be estimated by a single affine transformation of the state ht but A
cannot. We also had to use smaller learning rates to guarantee convergence: 0.0001 for
STN while the δxy -based model worked with a value 10 times larger.
If we don’t use the softmax nonlinearity to construct δxy the representations learned
by the convolutional PUN are still interpretable, but the performance of the overall model
on the training and test sets is worse. Overall, it is interesting to conclude that under this
framework the “what” and “where” can only be distinguished if we impose architectural
constraints. The reason is the commutative property of the convolution operation. In
Figure 4-7 we show a predicted video with the corresponding learned δxy that exactly
represents the center of the moving object.
Figure 4-8. Performance curves in the test task of two implementations of the proposedarchitecture (conv PUN and STN PUN) and an equivalent LSTM baseline.
Note that the spatial transformer based PUN was not able to generalize to the test set,
i.e. it did not work well for generating videos when fed its own previous outputs as
next step inputs. Final errors in the test set are conv PUN: 0.033, STN PUN: 0.227 and
LSTM: 0.035. Notice that we fixed a small 100 hidden neurons PUN and increased the
baseline LSTM until it had equivalent performance. We had to increase the total number
of trainable parameters by using a baseline with 256 hidden neurons.
As a note on rotation, we ran experiments where the sprites are rotated by a random
angle before being placed in the image. This new type of video cannot be learned
using only convolutional based Perception Updating Networks unless we increase the
number of sprites proportionally to the number of possible angles. Spatial transformer
based Perception Updating Networks can handle this new type of video naturally.
Nevertheless, if the number of rotation angles is finite or can be discretized we found
that we could learn to generate the videos faster if we combined the convolutional
approach with a mechanism to select the appropriate angle from a set of possibilities.
4.4.2 Moving MNIST
The Moving MNIST benchmark uses videos generated by moving 28 × 28 pixel
images of hand written digits in a 64 × 64 pixels canvas. Just like in the Bouncing
Shapes dataset, the digits move with different speeds in different directions and
can bounce off the walls. Unlike the Bouncing Shapes dataset, there are 60000 different
Figure 4-9. Sample rollouts of a 2 layer LSTM convolutional Perception UpdatingNetwork.
Notice that the quality of the predicted sprite doesn't change, contrary to other methods
that get blurry with time. On the other hand, our proposed network forgets the correct
movement. A possible solution could be a parametrization of the movement, in other
words, parameterizing the update of the "where" variables.
sprites for training and 10000 for test, making it impractical to use a discrete memory
module. Instead, we use the memory representation denoted by (4–19) followed by
st = d(zt) as written in Algorithm 5.
We trained a convolutional Perception Updating Network using 2 layer LSTMs each
one with 1024 cells for 200 epochs, with 10000 gradient updates per epoch. The latent
variable z had 100 dimensions and the decoder d(·) was a single hidden layer MLP
with 100 hidden neurons and softplus activation function. The output layer of this MLP
has 784 neurons, which is the size of an MNIST image, and sigmoid activation function.
In the test set we obtained a negative log-likelihood of 239 nats with the proposed
architecture, while a 2 layer LSTM baseline had 250 nats. Note that our method was
optimized using the lower bound (4–18), not only the likelihood. These
results are not as good as those obtained by the Video Pixel Networks (Kalchbrenner
et al., 2016) that obtained 87 nats on the test set. Nevertheless, both approaches are
not mutually exclusive and instead of a fully connected decoder we could use a similar
PixelCNN decoder to generate sprites with higher likelihood. In this first study we
decided instead to focus on defining the statistical framework and the interpretable "what"
and “where” decoupling.
When running the proposed method in rollout mode, feeding the outputs back as
next time step inputs, we were able to generate high likelihood frames for more time
steps than with a baseline LSTM. Also, since the sprite to be generated and its position
in the frame are decoupled, in rollout mode we can fix the sprite and only use the δxy
coming from the network. This way we can generate realistic looking frames for even
longer, but after a few frames we observed the digits stopped moving or moved in the
wrong direction (see video in the companion code repository). This means that the
LSTM RNN was not able to maintain its internal dynamics for too long, thus, there is still
room for improvement in the proposed architecture.
In Fig. 4-9 we show sample rollout videos. The network was fed with 10 frames and
asked to generate 10 more, getting its own outputs back as inputs; see the companion
code repository for an animated version of this figure.
This experiment also suggests several improvements in the proposed architecture.
For example, we assumed that the internal RNN has to calculate a sprite at every time
step, which is inefficient when the sprites don't change in the video. We should improve
the architecture with an extra memory unit that snapshots the sprites and avoids the
burden of recalculating the sprites at every step. We believe this would be a possible way
to free representation power that the internal RNN could use to model the movement
dynamics for even more time steps. We investigate that later in this chapter.
4.4.3 Visualizing the RNN-to-CAM Connections
One of the most valid criticisms of the architectural models we presented in the
last chapter is that the external memory unit needs to be at least as big as the input
source for complete sequence reconstruction. A consequence of a large content
addressable memory is repeated memorized values. In the PUN framework with CAM,
we force the model to learn a single decoder-like memory system. Here we visualize
the addressing signal from the controller RNN to the CAM to show that it learns to cluster
redundant samples into meaningful compact representations.
In this experiment we continue working with the Moving MNIST dataset, but this
time we are not interested in the video prediction benchmark directly and focus on
movies with a single digit in the scene. We train a convolutional PUN for predicting
the next frames for 200 epochs. The size of the hidden code zt is 100 dimensions
and we reduce it to 2 dimensions using t-SNE. Results are shown in Figure 4-10. For
quantitative results, we compared the performance of linear classifiers on the codes
zt vs. on the raw MNIST images. We obtained an almost 3 times smaller error
probability using the learned codes zt from the moving digits in a larger scene than
using the raw images focused around the digit (see Figure 4-10).
This allows us to conclude that Perception Updating Networks can double as a video
predictor and an unsupervised object detection/feature extraction system.
4.4.4 Snapshotting “What” Directly from Pixels
Data: input videos It, t ∈ {0, 1, 2, ...}, initial RNN state h0, neural network layers mϕ, vϕ, d, lx, ly, f, and a localization network LN
Result: video predictions It, t ∈ {1, 2, 3, ...}
A = LN(I0)
s = STN(I0, A)
for t ∈ {0, 1, ...} do
    ht ← RNN(It, ht−1)
    δxy = softmax(lx(ht)) ⊗ softmax(ly(ht))
    It+1 = s ⋆ δxy
    It+1 = µ It+1 + (1 − µ)B
end
Algorithm 8: Snapshot Perception Updating Networks.
Although powerful in our experiments, a limitation of learning a limited set of
sprites is that the model does not learn how to snapshot what is important for scene
reconstruction on the go. In this experiment we propose an alternative PUN model that
does not have a content addressable memory. Instead, this new model snapshots a
sprite proposal from the first frame of the video itself and uses it for frame prediction.
This new model, named Snapshot PUN, is represented in Figure 4-11 and detailed in
Algorithm 8.
Figure 4-10. A piece of the schematic block diagram for a Perception Updating Network
and t-SNE embedding of the 100-dimensional codes zt sent from the RNN controller to the CAM.
Note that the codes cluster by label. We represent in dashed lines the parts of the
PUN that do not contribute to the t-SNE embedding. For a quantitative depiction of the
visually separated clusters, we trained a linear classifier on the hidden codes zt, in which
case we obtained 98.1% accuracy. A linear classifier on the raw 28x28 pixels MNIST
obtains 94.5% accuracy.
Note that this new Snapshot algorithm is not mutually exclusive with the previous
PUN versions and could be used in tandem with them depending on the application. But note that
this new algorithm requires that the object of interest must be totally visible when the
snapshot is taken. The snapshot is taken by calculating a transformer matrix A using a
localization network, LN, that consists of a 5 layer convolutional neural network that takes
Figure 4-11. Snapshot Perception Updating Network. See Figure 4-5 and compare it to
the convolutional Perception Updating Network model.
Note that this model does not update the snapshot sprite. This means both efficiency,
by avoiding extra calculations, and less flexibility, since it does not adapt in case of
changes in the object in the video (e.g. rotations, deformations, new objects, etc.). Future
work should address a strategy to know when to snapshot objects in the scene and
when to update them.
as input the very first frame of the video and outputs the cropped digit. That digit is used
as a fixed sprite throughout the entire video prediction.
Since the snapshot is based solely on spatial transformer networks, it can converge
faster, but it is also more unstable and harder to use. We compare this snapshot model
with the convolutional PUN on video prediction. Results are shown in Table 4-1. Note
that the conv PUN generalizes better, which is why we propose it as our main model.
Table 4-1. Comparison between Snapshot PUN and conv PUN on the single digit moving
MNIST benchmark.
Results show negative log-likelihood (smaller is better) on the test set. Note that
Snapshot PUN captures the sprite from the scene using spatial transformers that are
based on bilinear interpolation; better results could be obtained with extra rescaling or
cleaning of the sprite before composing the scene.

Snapshot PUN    Conv PUN
91.474          85.998
4.5 Rules of Thumb for Model Choice
In this chapter, we presented 5 different PUN algorithms. In the previous chapter we
presented another 3 memory augmented neural networks. Here we summarize rules of
thumb that should help choosing which of those algorithms is more appropriate for new
problems.
The first question to be asked is what is the purpose of the model being trained.
If the objective is to learn an autoencoder, i.e. learn a fixed length representation for
a variable length video, the methods from Chapter 3 are more appropriate. When
the fixed length representation is the result of an arithmetic operation, DiffRAM is
more appropriate. If the fixed length representation should be useful for sequence
reconstruction, NTM, NTM2 or DCAM must be preferred.
Perception Updating Networks should be the model of choice if the user is
interested in either video prediction (for example, for model based controls, video
compression, filtering, etc.), object detection or interpretable representations of videos.
In the PUN family, conv PUN is the simplest and most efficient model since it converges
faster and learns clustered representations of the objects in the scene. If generalization is
important but the object of interest appears completely in the scene, Snapshot PUN is a
more robust model. A combination of Snapshot PUN and conv PUN could help in cases
where the object of interest changes during the video. For example, while the object
in the scene is constant, we can use Snapshot PUN; when it changes, we can use the
conv PUN decoder-like memory to draw the new object and the Snapshot PUN can
continue the generation from there. Notice, though, that learning how to detect changes in
the object of interest on the fly is left for future work.
In the next chapter we focus on scaling this video prediction network to real world
videos. We do so by proposing a new type of DPCN, called Recurrent Winner Take
All Networks and later combining it with the memory and graphics pipeline modules
proposed in this chapter.
CHAPTER 5
SCALING UP PERCEPTION UPDATING NETWORKS
One of the main criticisms of Perception Updating Networks (PUN)1 was regarding
its strong 2D assumptions and potential weakness to represent 3D videos. The 2D
assumptions and 2D graphics pipelines should represent well flat scenes, platform
based video games and top down views such as those of drones, satellites, robots, etc.
On the other hand, the 2D assumption would fail to represent 3D videos of roads, robot
navigation, human actions, etc, which are the vast majority of videos available.
To extend PUNs to 3D one might generalize the graphics pipeline proposed in
the previous chapter to represent the extra dimension of space. We could define
volumetric convolutions to place 3D voxels (the equivalent of the 2D sprites) in space.
With this added extra dimension one would also have to model 3D translation, rotation,
perspective transformations, and also deal with occlusion, view angle, etc. While this
might be the most complete extension of PUNs, the required neural network technology
would be beyond the scope of the analysis presented here. It would also require
innovations only recently developed in the literature (Wu et al., 2016)(Ravanbakhsh
et al., 2016)(Dai et al., 2016)(Gadelha et al., 2016)(McCormac et al., 2016)(Yan et al.,
2016).
We leave the development of 3D-PUN for future work. Here we propose an
alternative extension based on spatio-temporal embeddings with Convolutional
Recurrent Neural Networks (ConvRNN) (Santana et al., 2016b) and Perception
Updating Networks as an extension of Convolutional layers. Our hypothesis is that we
can project the input videos to a non-linear manifold where the flat spatial dimensions
assumption holds true. We implement PUN on that manifold and finally perform the
decoding transformation back to the original video space. The PUN layer implements
1 These comments were kindly provided by ICLR 2016 anonymous reviewers
different nonlinearities than a conventional convolutional layer, thus learning more
complex mappings when both are used in tandem.
The motivation for this hypothesis was our recent success in unsupervised learning
with ConvRNNs (Santana et al., 2016b). We were able to learn projections of videos
of 3D rotating shapes that were linearly separable for object recognition. Here we will
first review those results and later learn a PUN in the latent space of ConvRNNs. We
can interpret that as learning with a PUN a spatio-temporal trajectory in a hyperplane
of a convolutional auto-encoder. In this new space, a PUN memory no longer means
only a sprite or a position, but a video memory in the embedded space. We illustrate the
architecture of this proposed extension in Figure 5-1.
Both the ConvRNN results and ConvRNN+PUN presented in this chapter are
original work. The ConvRNN results in this chapter were already submitted for
publication (Santana et al., 2016b). The ConvRNN+PUN combination is unpublished
work. In the next section we review the ConvRNN architecture already discussed in
Chapter 2. We show the modifications we made to that model and show some results
in unsupervised learning of videos. Those results are the main motivation for combining
ConvRNN+PUN and for the experiments that conclude this chapter.
5.1 Convolutional Recurrent Neural Networks for Unsupervised Learning of Videos
In this section we revisit our published results with ConvRNNs (Santana et al.,
2016b) to illustrate their ability to embed multi-dimensional time series in linearly
separable spaces. Furthermore, in place of using computationally expensive EM
algorithms to compute the sparse states and causes, just like DPCNs, our method
(Santana et al., 2016b) uses convolutional recurrent autoencoders with Winner-Take-All
(Makhzani & Frey, 2015) regularization to encode the states in a feedforward manner.
Makhzani and Frey proposed Winner-Take-All (WTA) Autoencoders (Makhzani &
Frey, 2015) which use aggressive Dropout, where all the elements but the strongest of
Figure 5-1. Convolutional Perception Updating Network as a hidden layer of a deep
convnet.
Free parameters: k (convolutional channels), c (channels in the output, 1 for gray and 3
for color), H (hidden neurons of the PUN's LSTM), N (memory neural network).
(Optionally) Using a PUN in the hidden space of a convnet transforms it into a Convolutional
Recurrent Neural Network (green). If used as a hidden layer, the PUN's memory no
longer has a one-to-one relationship with pixels in the output scene, but with
spatio-temporal memories, similarly to the gists of Deep Predictive Coding Networks. To
reduce the number of computations, we implemented the convolutions in the
pre-processing sub-network (blue) with k filters of 5x5 pixels with strides 2x2 that
downsample the output. Equivalently, the post-processing sub-network (yellow) has
convolutions with k filters of 5x5 pixels with depth-to-space upsampling. The PUN layer
output is merged with conventional convolutional layers via gated addition (red),
m * p + (1 − m) * b, where the soft-binary mask m balances the contribution of each layer.
a convolutional map are zeroed out. This forces sparseness in the latent codes and the
convolutional decoder to learn robust features. In our method we extended convolutional
Winner-Take-All autoencoders through time using convolutional RNNs. WTA for a map
xf,r,c in the output of a convolutional layer can be expressed as in (5–1). The indices f, r, c
represent, respectively, the channel, the row, and the column of the map.
WTA(xf,r,c) = xf,r,c,  if xf,r,c = max over (r, c) of xf,r,c
              0,       otherwise. (5–1)
Thus, WTA(xf,r,c) has only one non-zero value for each channel f. To backpropagate
through (5–1) we use ∇WTA(xf,r,c) = WTA(∇xf,r,c). In the present work, we apply
(5–1) to the output of the convolutional maps of the ConvRNNs after they have been
calculated. In other words, the full convolutional map hidden state is used inside the
dynamics of the ConvRNN; WTA is applied only before the maps are fed as input to the
convolutional decoder.
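A minimal Python sketch of the spatial winner-take-all operation in (5–1); the channels-first array layout is our own choice for the example:

import numpy as np

def spatial_wta(x):
    """Winner-take-all of equation (5-1): for every channel f, keep only the single
    largest value over the spatial positions (r, c) and zero out the rest.
    `x` has shape (channels, rows, cols)."""
    flat = x.reshape(x.shape[0], -1)
    winners = flat.argmax(axis=1)
    mask = np.zeros_like(flat)
    mask[np.arange(flat.shape[0]), winners] = 1.0
    return (flat * mask).reshape(x.shape)

x = np.random.randn(128, 16, 16)   # e.g. a ConvRNN feature map
sparse = spatial_wta(x)            # one non-zero entry per channel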
We also proposed to learn smoothness in time with architectural constraints using
a two-stream encoder as shown in Figure 5-2. In the present work, this two stream
approach inspired the skip-network we used for the ConvRNN+PUN. Originally, this
two-stream architecture was inspired by the dorsal and ventral streams hypothesis in
the human visual cortex Goodale & Milner (1992). Roughly speaking, the dorsal stream
models ”vision for action” and movements and the ventral stream represents ”vision
for perception”. In our proposed architecture one stream is a stateless convolutional
encoder-decoder and the other stream has a convolutional RNN encoder, thus a
dynamic state. Using Siamese decoders for both streams, we force the stateless
encoder and the convolutional RNN to project into the same space—one which can be
reconstructed by the shared weights decoder. It is important to stress that from the point
of view of spatiotemporal feature extraction with the ConvRNN, the stateless stream
works as regularization. As with any other sort of regularization, its usefulness can only
be fully assessed in practice and the practitioner might optionally not use it. Nevertheless,
we opted for using the full architecture in all the experiments of this work. In Appendix
Figure 5-2. Schematic diagram of the Recurrent Winner-Take-All (RWTA) network.
This is the modified architecture we used to investigate ConvRNNs' ability to embed
videos into a separable space. The upper stream is the static encoder-decoder, which
reconstructs frame 1. The lower stream is the temporal, dynamic encoder based on a
ConvRNN, which predicts frame 2. Both streams apply WTA before Siamese decoders
with shared parameters.
A, we show how this proposed architecture enforces spatiotemporal smoothness in the
embedded space.
Given an input video stream xt , denoting the stateless encoder by E , the decoder
D, and the convolutional RNN by R, the cost function for training our architecture is the
sum of reconstruction and prediction errors:
Lt = E[(xt−1 − D(E(xt−1)))² + (xt − D(R(xt−1)))²], (5–2)
where E denotes the expectation operator. Notice that as depicted in Figure 5-2,
E and R have shared parameters. During training, we observe a few input frames
t = [1, 2, ...,T ] and adapt all the parameters using backpropagation through time
(BPTT) (Werbos, 1990). Notice that due to BPTT both streams of our architecture are
adapted while considering temporal context. Thus, the stateless encoder E will learn
richer features than it would if trained on individual frames.
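For reference, the cost (5–2) can be sketched as follows, with encoder, decoder and conv_rnn as placeholders for the actual networks E, D and R (the identity stand-ins in the usage example are only there to make the sketch executable):

import numpy as np

def rwta_loss(x_prev, x_next, encoder, decoder, conv_rnn, state):
    """Cost of equation (5-2): reconstruction of the current frame by the stateless
    stream plus prediction of the next frame by the recurrent stream, both decoded
    by the shared (Siamese) decoder."""
    recon = decoder(encoder(x_prev))       # static stream reconstructs x_{t-1}
    h, state = conv_rnn(x_prev, state)     # dynamic stream updates its state
    pred = decoder(h)                      # shared decoder predicts x_t
    loss = np.mean((x_prev - recon) ** 2) + np.mean((x_next - pred) ** 2)
    return loss, state

# Toy usage with identity stand-ins just to exercise the function.
identity = lambda x: x
dummy_rnn = lambda x, s: (x, s)
loss, state = rwta_loss(np.zeros((8, 8)), np.ones((8, 8)), identity, identity, dummy_rnn, None)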
To illustrate the capabilities of such proposed architecture we applied it to two
datasets, the Coil100 and Honda/UCSD Faces Dataset for a direct comparison with
DPCN and other unsupervised learning techniques. Sample videos of both datasets are
Figure 5-3. Sample videos from the Coil and Honda/UCSD datasets.
a) Coil-100 dataset (Nene et al., 1996) and b) Honda/UCSD face dataset (Lee et al., 2005).
Table 5-1. Hyperparameter choices per experiment

                        COIL100    Honda Faces
Channels per layer      128        256
Filter size (encoder)   5x5        5x5
Filter size (decoder)   7x7        7x7

All models were trained using the ADAM optimization rule with learning rate 0.001.
All models were 4 layers deep.
All models had 2 convolutional layers before the ConvRNN layer.
WTA was applied only right before the last layer.
shown in Figure 5-3. A list of the hyperparameters used in those experiments is shown
in Table 5-1.
The COIL-100 dataset (Nene et al., 1996) consists of 100 videos of different
objects. Each video is 72 frames long and was generated by placing the object on a
turntable and taking a picture every 5◦. The pictures are 128x128 pixels RGB. For our
experiments, we rescaled the images to 32x32 pixels and used ZCA pre-processing.
Figure 5-4. 128 decoder weights of 7x7 pixels learned on Coil-100 videos.
The classification protocol proposed for COIL-100 by Nene et al. (1996) uses 4
frames per video as labeled samples, the frames corresponding to angles 0◦, 90◦, 180◦
and 270◦. Chalasani and Principe (Chalasani & Principe, 2015) and Mobahi et al.
(Mobahi et al., 2009) used the entire dataset for unsupervised pre-training. For this
reason, we believe the results in this experiment should be understood with this in
mind. Note that the compared methods enforce smoothness in the representation of
adjacent frames, and since the test frames are observed in context for feature extraction,
information is carried from labeled to unlabeled samples. In other words, this experiment
is better described as semi-supervised metric learning than unsupervised learning.
Here, we followed that same protocol, using 14 frames per video. Results are reported
in Table 5-2. We used encoders with 128 filters of 5x5 pixels and a decoder with 7x7
pixels. The decoder filters are shown in Fig. 5-4.
The Honda/UCSD dataset consists of 59 videos of 20 different people moving their
heads in various ways. The training set consists of 20 videos (one for each person),
∼ 300 − 1000 frames each. The test set consists of 39 videos (1-4 per person),
∼ 300 − 500 frames each. For each frame of all videos, we detected and cropped the
Table 5-2. Recognition rate (in percentage %) for object recognition in the Coil-100 dataset

Method                                                     Accuracy
DPCN no context (Chalasani & Principe, 2015)               79.45
Stacked ISA + temporal (Le et al., 2011)                   87
ConvNets + Temporal (Mobahi et al., 2009)                  92.25
DPCN + temporal + top down (Chalasani & Principe, 2015)    98.34
Proposed method                                            99.4
Table 5-3. Recognition rate (in percentage %) for face recognition in the Honda/UCSD dataset

Sequence Lengths   MDA     SANP    CDN     Proposed Method
50 Frames          74.36   84.62   92.31   100
100 Frames         94.87   92.31   100     100
Full Video         97.44   100     100     100

References: MDA (Wang & Chen, 2009), CDN (Chalasani & Principe, 2015), SANP (Hu et al., 2011).
faces using Viola-Jones face detection. Each face was then converted to grayscale,
resized to 20x20 pixels, and histogram equalized.
During training, the entire training set was fed into the network, 9 frames at a time,
with a batch size of 32. After training was complete, the training set was again fed
into the network. For each input frame in the sequence, the feature maps from the
convolutional RNN were extracted, and then (5,5) max-pooled with a stride of (3,3). In
accordance with the test procedure of Chalasani & Principe (2015), a linear SVM was
trained using these features and labels indicating the identity
of the face. Finally, each video of the test set was fed into the network, one frame
at a time, and features were extracted from the RNN in the same way as described
above. Each frame was then classified using the linear SVM. Each sequence was
assigned a class based on the maximally pooled predicted label across the frames
in the sequence. Table 5-3 summarizes the results for 50 frames, 100 frames, and
the full video, comparing with 3 other methods, including the original convolutional
implementation of DPCN Chalasani & Principe (2015). The results for the 3 other
methods were taken from Chalasani & Principe (2015). The results for our method were
perfect for all the tested cases.
5.2 ConvRNN + PUN: Combining Convolutional RNNs and Perception Updating Networks
With the knowledge about ConvRNNs acquired from the results in the previous
section and about Perception Updating Networks from the previous chapter, we set out to
combine both architectures to create a scalable, memory based, shift invariant video
prediction system. The convolutional part of the ConvRNN gives us the scalable
shift invariant properties, the RNN takes care of the dynamics, and the PUN is
responsible for the memory mechanism.
The first equations to be defined when writing a ConvRNN based PUN are the ones
for calculating the “where” variable δxy and the sprite st . In the original PUN formulation
we calculate first an LSTM hidden state ht and δxy is the outer product of two affine
transformations of ht . For st , in the decoder based memory, we use an MLP. In our
experiments, we limited ht to 100 or 1000 dimensions. On the other hand, when
using a ConvRNN, the hidden state of a Convolutional LSTM (ConvLSTM) is a set of
multidimensional feature maps Ht with the same number of rows and columns as the
input image times the number of channels in the convolutional filters. For clarity and
quicker reference, we restate the ConvLSTM equations below:
It = logistic(Whi ⋆ Ht−1 + Wxi ⋆ Xt + bi)
Ft = logistic(Whf ⋆ Ht−1 + Wxf ⋆ Xt + bf)
Ot = logistic(Who ⋆ Ht−1 + Wxo ⋆ Xt + bo)
Gt = tanh(Whg ⋆ Ht−1 + Wxg ⋆ Xt + bg)
Ct = Ft ⊙ Ct−1 + It ⊙ Gt
Ht = Ot ⊙ tanh(Ct), (5–3)
where ⋆ denotes the convolution operation.
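For reference, a single-channel Python sketch of one ConvLSTM update as in (5–3); actual implementations use multi-channel filters inside a deep learning framework, but the gating structure is the same:

import numpy as np
from scipy.signal import convolve2d

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def convlstm_step(X, H_prev, C_prev, W):
    """One ConvLSTM update as in (5-3) for single-channel 2D maps.
    W holds 5x5 kernels ('hi', 'xi', ...) and scalar biases ('bi', 'bf', 'bo', 'bg')."""
    def gate(wh, wx, b, act):
        return act(convolve2d(H_prev, W[wh], mode='same')
                   + convolve2d(X, W[wx], mode='same') + W[b])
    I = gate('hi', 'xi', 'bi', logistic)   # input gate
    F = gate('hf', 'xf', 'bf', logistic)   # forget gate
    O = gate('ho', 'xo', 'bo', logistic)   # output gate
    G = gate('hg', 'xg', 'bg', np.tanh)    # candidate values
    C = F * C_prev + I * G                 # new cell state
    H = O * np.tanh(C)                     # new hidden state map
    return H, C

rng = np.random.default_rng(0)
W = {k: rng.standard_normal((5, 5)) * 0.1 for k in
     ['hi', 'xi', 'hf', 'xf', 'ho', 'xo', 'hg', 'xg']}
W.update({b: 0.0 for b in ['bi', 'bf', 'bo', 'bg']})
H, C = convlstm_step(rng.standard_normal((16, 16)),
                     np.zeros((16, 16)), np.zeros((16, 16)), W)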
One may argue that we can simply resize the feature maps Ht to a vector shape
and use the original formulation of PUN. To observe that this is impractical, assume an
input video frame of the moving MNIST dataset. Those frames have 64x64=4096 pixels.
Now, assume a reasonably sized convolutional filter with 128 channels. The output of
such a ConvLSTM would have 4096*128=524,288 pixels. This input size is too large
and learning weights for each one of those pixel inputs is impractical.
At this point, we recall the experiments with PUNs with oversized memories. We
observed that backpropagation and enough data were sufficient to make the PUN learn
to use only the parts of the memory it needed for successfully modeling the input video.
Thus, instead of resizing the maps Ht we keep their original shape, learn a new set
of convolutional filters Wδ and Ws, and compute the "what" st and "where" δxy as
st = Ws ⋆ Ht,
δxy = softmax(Wδ ⋆ Ht),
ot = δxy ⋆ st (5–4)
where the number of output channels in Wδ is 1 and the number of output channels
in Ws is the same as that of the input videos if the PUN layer is the output layer. We can also
calculate δxy = sigmoid(Wδ ⋆ Ht) in the hidden layer if we want to repeat the st further. If
the PUN layer is a hidden layer, the number of output channels in Ws is a free parameter.
Note that the ConvRNN based PUN can also implement a finite memory where only δxy
is calculated with convolutions, and st is selected from an associative memory.
We can also implement multiple deltas and multiple sprites and combine the result
with argmax, like in the original PUN formulation:
st,0 = Ws,0 ⋆ Ht,   st,1 = Ws,1 ⋆ Ht,
δxy,0 = softmax(Wδ,0 ⋆ Ht),   δxy,1 = softmax(Wδ,1 ⋆ Ht),
ot = argmax(δxy,0 ⋆ st,0, δxy,1 ⋆ st,1), (5–5)
where argmax is implemented element wise, pixel by pixel.
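A single-channel Python sketch of the convolutional "what"/"where" computation in (5–4); the softmax is taken over all spatial positions of the δ map and the filters are random stand-ins rather than trained weights:

import numpy as np
from scipy.signal import convolve2d

def spatial_softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def convrnn_pun_layer(H, W_s, W_delta):
    """'What' and 'where' maps of equation (5-4) for a single-channel sketch.
    H is the ConvLSTM state map, W_s and W_delta are convolutional filters."""
    s = convolve2d(H, W_s, mode='same')                            # sprite-like "what" map
    delta = spatial_softmax(convolve2d(H, W_delta, mode='same'))   # "where" map
    return convolve2d(delta, s, mode='same')                       # o_t = delta ⋆ s_t

rng = np.random.default_rng(0)
H = rng.standard_normal((32, 32))        # hidden state map of the ConvLSTM
o = convrnn_pun_layer(H, rng.standard_normal((5, 5)), rng.standard_normal((5, 5)))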
Although the intuition provided by the PUN experiments in the previous chapter
helped us design this scalable version, when using the PUN as a hidden layer, we
can no longer visually inspect the learned memories. For this reason, in the following
experiments validating this new architecture we perform extensive hyperparameter tests
with both synthetic and real world videos. Our hope is that several comparisons
against ConvRNN baselines will help us understand when PUNs are a better suited
layer in neural network design.
5.3 Experiments
5.3.1 Moving MNIST
We start our experiments with the two digits Moving MNIST benchmark. Similarly to
the previous chapter, we have a 64x64 pixels canvas and two 28x28 pixels MNIST digits
moving in the scene. We set out to learn a generative model that predicts future video
frames given a history of previous frames. We performed extensive experiments against
a baseline ConvLSTM. All hyperparameters and results are presented in Table 5-4. PUN
based networks were 0.1s slower.
We observed that the proposed method requires a large visual receptive field
to work well. Such large receptive field can be achieved with several layers and
downsampling or resolution preserving dilated convolutions (Kalchbrenner et al., 2016).
Since the output needs to be at the same resolution as the input, for the downsampling
encoders the output layers of the network need to implement upsampling, just like in
Table 5-4. Experiments with hidden PUN. Average negative log-likelihoods (nats) on
video prediction experiments with the Moving MNIST benchmark.
All methods trained with the ADAM optimizer with learning rate 0.0009.
All convolutions had 64 filters of 5x5 pixels.

PUN as output layer
Two layer experiments:
    ConvLSTM - PUN: 178                    ConvLSTM - Conv: 180
Deep residual U-net (see Figure 5-5):
    two PUN output: 155.54                 Conv output: 2802

PUN as hidden layer
4 layer experiments:
    Conv - Conv - ConvLSTM - Upsample - PUN: 145.7
    Conv - Conv - ConvLSTM - Upsample - Conv: 150
6 layer experiments:
    Conv - Conv - ConvLSTM - PUN - Upsample - Conv - Conv: 139.26
    Conv - Conv - ConvLSTM - Conv - Upsample - Conv - Conv: 147.89
so called ”hourglass” models. The hourglass nickname is a reference to the shape of
the encoder that funnels the representation with downsampling and the decoder that
expands it back with upsampling.
Perception Updating Network augmented convnets outperform their equivalent
counterparts, with the improvement being larger for deeper models. We conclude that
if the extra computational cost is affordable, the user can always use PUN augmented
networks and expect improvements. The extra computational cost is one convolution
per sample per PUN layer. The computational cost of convolution is O(n⁴), but note that
in modern GPU libraries its implementation is very efficient; nevertheless, it can be
costly in naive CPU implementations. In our implementations, the PUN based networks
were less than a second slower per sample in a batch. The smaller the input to which the
PUN layer is applied, the smaller this time difference in computation. Finally,
note that the convolutional PUN model has better results than the fully connected model
presented in the previous chapter.
Figure 5-5. Deep residual U-net with Perception Updating Networks output.
See results in Table 5-4. The encoder takes the 64x64 input through 32x32, 16x16 and
8x8 resolutions with 3x3, 128-channel convolutions, a 3x3, 128-channel ConvLSTM sits
at the bottleneck, and residual blocks (2 blocks, 3x3, 128) lead back up to two PUN
output layers. Note that resnet (He et al., 2016) blocks are known to be unstable without
batch normalization layers, but here we didn't need them because the Perception
Updating Network layers in the output naturally bound the gradients that are
backpropagated. Without batch normalization the full network is faster. See the
implementation of residual blocks we used in Figure 5-6.
5.3.2 Real Videos: Kitti Dataset
In this experiment, we tested the PUN-augmented convnet on real world videos
of the Kitti Dataset (Geiger et al., 2013). The Kitti dataset is a standard benchmark for
visual odometry, SLAM, depth estimation, structure from motion, etc. The videos of
the Kitti Dataset were recorded from the windshield of a car driven in rural areas and
Figure 5-6. Definition of a single resnet block used in the experiments.
Each block applies two 3x3, 128-channel convolutions, each followed by a nonlinearity of
the form x if x > 0 and a(exp(x) − 1) otherwise, and adds the block input back to its
output through a skip connection. See Figure 5-5 for details.
highways of Karlsruhe, Germany. The input videos had 128x160x3 pixels. The pixel
values were rescaled between 0 and 1.
As learned in the previous experiments, we needed to have receptive fields at
different scales, but since the videos in this experiment were larger, we opted for using
resolution preserving networks (Kalchbrenner et al., 2016). In other words, we used
no downsampling operation; instead, to achieve multiple resolutions, we used dilated
convolutions. Convolution with dilation does not change the number of trained weights,
but increases the effective receptive field by padding zeros between the non-zero
weights (dilation). We illustrated dilated convolutions in Figure 5-7.
In this experiment we compared two deep resolution preserving convolutional neural
networks. We show the hyperparameter choices and results in Table 5-5.
With this extra set of experiments, we observed that Perception Updating Networks
consistently improve the performance of convolutional neural networks. Note that all
Figure 5-7. Dilated convolution with a filter of 3x3 pixels with dilation rate of 1x1.
Inputs are shown in light blue, weights in dark blue and outputs in green. In our
implementation we padded the inputs with zeros to make the output have the same size
as the input. The dilation rate controls the number of holes or zeros between the
trainable weights. The larger the dilation rate, the larger the effective receptive field of
the convolutional layer. This image was adapted from
https://github.com/vdumoulin/conv_arithmetic.
Figure 5-8. Qualitative results on the test set of the Kitti Dataset.
a) Predicted frames using the Perception Updating Network augmented convnet. b) Target
videos.
the experiments we performed were with generative models; we did not experiment
with PUN on convnets for classification. Nevertheless, it is interesting to note that while
we proposed PUN for memory augmented and interpretable models of 2D videos,
the operation defined in (5–4) is capable of learning mappings that are more general
Table 5-5. Hyperparameters and quantitative results on the test set of the Kitti Dataset.
We compared our Perception Updating Network augmented convnet to a conventional
convnet. Both models were trained with similar hyperparameters, with the only
difference being the PUN layer in the output. PUN based networks were on average
0.88s slower for 10 frames long videos.
All methods trained with the ADAM optimizer with learning rate 0.0009.
All convolutions had 48 channels.

Layers shared by both networks:
    Conv - 5x5 - dilation rate: 1
    Conv - 5x5 - dilation rate: 2
    Conv - 5x5 - dilation rate: 3
    Conv - 5x5 - dilation rate: 4
    ConvLSTM - 3x3

Output layer of compared methods:
    PUN: Sprite - Delta - Mask - Conv 3x3
    Baseline convnet: Conv - 3x3

Mean squared error on next frame prediction (test set):
    PUN: 0.0054    Baseline convnet: 0.0085    Previous frame as predictor: 0.0143
than those learned with conventional convnets and we showed that they can always
be used regardless of the complexity of the task and architectural choices. The cost of
this new model is the extra convolution between the δ and the sprite, in other words, extra
computations at O(n⁴). But note that those extra computations can be run in parallel with
all the other convolutions at the same depth of the architecture. We used Tensorflow with
GPU and observed no significant increase in time per epoch (less than 1s per sample in
a batch), but the memory requirements were higher when using PUN.
CHAPTER 6
CONCLUSIONS
The present thesis was about neural networks augmented with memory. We implemented these memories to consolidate relevant events in the network inputs and/or the states of the neural networks evoked by such relevant events. For this reason we named our developments “A Framework for Pattern Consolidation in Cognitive Architectures”.
We progressed the research in three main steps:
1) In the first step (reported in Chapter 3) we designed a general architecture without specific applications in mind. In that architecture we had a recurrent neural network augmented with a content addressable memory, with read and write operations inside the recurrent neural network loop.
The main goal of this step of the research was to learn to memorize, i.e., to force the neural network to rely on its memory module and use the read and write operations to store information. To do so, we applied the model to sequence memorization; in other words, we had a time-series auto-encoder working in two stages: first, sequence reading, where the entire input sequence is presented, and second, input reconstruction, where the network has to output the exact sequence. Obviously, the best way to do that is by memorizing (via the addressable memory write operations) and later traversing the memory, returning the input (with the memory read operations).
The main lesson learned here was how to use neural network weights (memories) that are generated with dynamics other than backpropagation, namely through read and write operations via content addressing.
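As a concrete illustration of what "read and write via content addressing" means in this framework, the sketch below implements a soft content-addressable memory in NumPy: reads are a similarity-weighted sum over memory rows, and writes blend new content into the most similar slots. The variable names, the cosine-similarity addressing, and the interpolation write rule are illustrative simplifications along the lines of the models studied in Chapter 3, not the exact equations used there.

# Minimal NumPy sketch of soft content-addressable read/write.
# The similarity measure and write rule are illustrative simplifications,
# not the exact update equations used in Chapter 3.
import numpy as np

def address(memory, key, beta=5.0):
    """Soft addressing: softmax over cosine similarity between the key and memory rows."""
    mem_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    key_norm = key / (np.linalg.norm(key) + 1e-8)
    scores = beta * mem_norm @ key_norm
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def read(memory, key):
    """Content-based read: similarity-weighted sum of memory rows."""
    return address(memory, key) @ memory

def write(memory, key, value):
    """Content-based write: blend the value into rows proportionally to the address weights."""
    w = address(memory, key)[:, None]          # (slots, 1)
    return (1.0 - w) * memory + w * value      # erase-then-add style interpolation

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 16))               # 8 slots of 16-dimensional vectors
pattern = rng.standard_normal(16)
M = write(M, key=pattern, value=pattern)       # store the pattern
recalled = read(M, key=pattern + 0.1 * rng.standard_normal(16))  # recall with a noisy cue
print(np.corrcoef(recalled, pattern)[0, 1])    # recalled vector correlates with the stored one

The important property for our purposes is that both functions are differentiable in the memory and in the key, so gradients can flow through the addressing even though the memory contents themselves are not updated by backpropagation.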
2) In the second step (reported in Chapter 4) we had a specific application in mind: generative models for video. In such a case it was easier to interpret what should be memorized by our memory-augmented architectures. It was easier to interpret because we worked with videos that could be decomposed into moving objects in a scene. In this
case, the “relevant event” to be memorized was the main object in the scene. Finally,
for generative modeling purposes, our network just had to learn “where” to place the
memorized events/objects. The full model also had the form of a recurrent neural network with extra modules.
Here we used the read and write memory mechanisms studied in the previous step and defined Perception Updating Networks. We also developed a statistical 2D graphics pipeline framework for validating the proposed architecture.
In that chapter we solved some of the main tasks proposed for this thesis, namely developing a cognitive architecture capable of using a content addressable memory for snapshotting relevant events and objects in a scene, as well as representing such snapshots efficiently.
3) The third and last step was an attempt to make Perception Updating Networks more practical for real-world videos (i.e., a 3D world captured by a 2D camera). Those types of videos cannot be perfectly modeled with our 2D graphics pipeline assumption. For this reason, we wrote a fully convolutional implementation of the Perception Updating Network algorithm and used it to augment Convolutional Neural Networks (both output and hidden layers were tested). The implementation relied on Convolutional Recurrent Neural Networks, which our lab had previously shown (with DPCN and RWTA) to be useful for feature extraction and unsupervised learning.
From our framework's perspective, this is simply a reimplementation of the main findings from Chapter 4, a reimplementation where the memorized events, or “what”, and the “where” are the output maps of convolutional layers.
At this point we also noticed that another possible interpretation of our model is a neural network that learns to output other, synthetic neural networks. In other words, if we interpret the “where” maps as inputs and the “what” maps as weights, their combination is essentially the implementation of a single-layer convolutional neural network. With this interpretation in mind, we plugged the Perception Updating Network layer indiscriminately into conventional convnets and were able to consistently improve results on generative modeling benchmarks. This interpretation relates Perception Updating Networks to the work on meta neural networks (Ha et al., 2016; Zoph & Le, 2016).
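To make this "network that outputs a network" reading concrete, the sketch below treats a predicted "what" map as the convolution kernel of a one-layer convnet that is applied to a predicted "where" map, which is the core of the PUN composition when implemented convolutionally. The shapes, names, and the plain convolution routine are illustrative assumptions, not the exact layer used in our experiments.

# Minimal NumPy sketch of the "where as input, what as weights" interpretation:
# a 'what' map predicted by one branch is used as the kernel of a convolution
# applied to a 'where' map predicted by another branch. Shapes and names are
# illustrative; this is not the exact PUN layer used in the experiments.
import numpy as np

def conv2d_same(image, kernel):
    """Plain 2D convolution with zero padding so the output keeps the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel[::-1, ::-1])
    return out

rng = np.random.default_rng(0)
where_map = np.zeros((16, 16)); where_map[5, 9] = 1.0   # "where": near-delta location map
what_map = rng.standard_normal((5, 5))                   # "what": memorized sprite acting as a kernel
frame = conv2d_same(where_map, what_map)                 # a one-layer convnet whose weights
                                                          # were produced by another network
print(frame.shape)  # (16, 16): the sprite is "stamped" at the location indicated by where_map

Because the "where" map is close to a delta function, the convolution simply places the memorized sprite at the indicated location, which is exactly the behavior the 2D graphics pipeline interpretation asks for.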
An alternative way to extend this research, not investigated in this thesis, is the development of a proper 3D graphics pipeline. Such a framework could be more general but also more computationally expensive. Besides that, the neural network and differentiable computer vision advances required to make such a statistical 3D graphics pipeline practical have only recently been published (late 2015 and 2016), which makes this line of research interesting for (near) future work.
Another way to take the present research further is to apply the Perception Updating Network framework to the recently published Fast Weights model (Ba et al., 2016). Fast Weights can be understood as an RNN inside an RNN, used to generate weights that update the outer RNN states. The authors also consider the entire sequence of generated states as an addressable memory and use a linear product kernel for addressing. It would be interesting to implement the PUN method as convolutional Fast Weights.
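For reference, the sketch below gives the basic Fast Weights recursion of Ba et al. (2016) in NumPy: a fast weight matrix is maintained as a decaying sum of outer products of recent hidden states and is used for a few inner "settling" steps at each time step. The dimensions, constants, and the omission of layer normalization are illustrative simplifications; a convolutional PUN variant would replace the dense products below with convolutions.

# Minimal NumPy sketch of the Fast Weights recursion of Ba et al. (2016):
#   A(t) = lam * A(t-1) + eta * h(t) h(t)^T              (fast memory of recent states)
#   h_{s+1}(t+1) = tanh(W h(t) + C x(t) + A(t) h_s(t+1)) (inner "settling" steps)
# Dimensions and constants are illustrative; layer normalization is omitted.
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input, lam, eta, inner_steps = 32, 8, 0.95, 0.5, 3
W = 0.05 * rng.standard_normal((n_hidden, n_hidden))   # slow recurrent weights
C = 0.05 * rng.standard_normal((n_hidden, n_input))    # slow input weights
A = np.zeros((n_hidden, n_hidden))                      # fast weights (outer-product memory)
h = np.zeros(n_hidden)

for x in rng.standard_normal((20, n_input)):            # a toy input sequence
    A = lam * A + eta * np.outer(h, h)                  # write the recent state into fast memory
    preliminary = W @ h + C @ x
    h_next = np.tanh(preliminary)
    for _ in range(inner_steps):                        # let fast weights attend to the recent past
        h_next = np.tanh(preliminary + A @ h_next)
    h = h_next

print(h[:5])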
Finally, now that we better understand how Cognitive Architectures can leverage memory to snapshot relevant events in space-time, future work should address how to combine that with attention mechanisms for controlling what gets snapshotted. That control mechanism should be conditioned on environment states and agent goals. In other words, it would be interesting to investigate Perception Updating Networks in the context of Focus of Attention research (Burt et al., 2016) and Reinforcement Learning (Emigh et al., 2015).
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2015). Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org, 1. 2.1.1
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(10), 1533–1545. 2.1.3
Amari, S.-I. (1988). Statistical neurodynamics of various versions of correlation associative memory. In Neural Networks, 1988., IEEE International Conference on, (pp. 633–640). IEEE. 2.7
Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., & Ionescu, C. (2016). Using fast weights to attend to the recent past. In Advances In Neural Information Processing Systems, (pp. 4331–4339). 6
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 1, 2.7, 3.3, 4.2.1
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. 2.1
Bengio, Y. (2012). Deep learning of representations for unsupervised and transfer learning. Unsupervised and Transfer Learning Challenges in Machine Learning, 7, 19. 2.1.2
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., & Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), vol. 4, (p. 3). Austin, TX. 2.1.1
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer. 2.1
Burt, R., Santana, E., Principe, J. C., Thigpen, N., & Keil, A. (2016). Predicting visual attention using gamma kernels. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, (pp. 1606–1610). IEEE. 6
Chalasani, R., & Principe, J. C. (2015). Context dependent encoding using convolutional dynamic networks. Neural Networks and Learning Systems, IEEE Transactions on, 26(9), 1992–2004. 5.1, 5-2, 5-3, 5.1
Chollet, F. (2015). Keras. GitHub repository: https://github.com/fchollet/keras. 2.1.1
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. 2.4
Dai, A., Qi, C. R., & Nießner, M. (2016). Shape completion using 3d-encoder-predictor cnns and shape synthesis. arXiv preprint arXiv:1612.00101. 5
De Vries, B., & Principe, J. C. (1992). The gamma model – a new neural model for temporal processing. Neural Networks, 5(4), 565–576. 1, 2.2, 2.2, 2.2, 3
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, (pp. 248–255). IEEE. 2.1.2
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389. 1, 2
Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. 2-3
Emigh, M., Kriminger, E., & Principe, J. C. (2015). A model based approach to exploration of continuous-state mdps using divergence-to-go. In Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on, (pp. 1–6). IEEE. 6
Eslami, S., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., & Hinton, G. E. (2016). Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575. 4.1
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1), 1–47. (document), 1, 1-1
Finn, C., Goodfellow, I., & Levine, S. (2016a). Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157. 2.6
Finn, C., Goodfellow, I., & Levine, S. (2016b). Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157. 4.1
Fitch, W. T., Hauser, M. D., & Chomsky, N. (2005). The evolution of the language faculty: clarifications and implications. Cognition, 97(2), 179–210. 2
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4), 193–202. 2.1.3
Fuster, J. M. (2003). Cortex and mind: Unifying cognition. Oxford University Press. 1
Gadelha, M., Maji, S., & Wang, R. (2016). 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872. 5
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, (p. 0278364913491297). 5.3.2
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with lstm. Neural computation, 12(10), 2451–2471. 2.4
Giles, C. L., Miller, C. B., Chen, D., Chen, H.-H., Sun, G.-Z., & Lee, Y.-C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405. 2.4
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, (pp. 249–256). 2.1.2, 3.5.1
Godard, C., Mac Aodha, O., & Brostow, G. J. (2016). Unsupervised monocular depth estimation with left-right consistency. arXiv preprint arXiv:1609.03677. 4.1
Goodale, M. A., & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in neurosciences, 15(1), 20–25. 5.1
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, (pp. 2672–2680). 4.1
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines. arXiv preprint arXiv:1410.5401. 1, 3, 3.4, 3.4, 3.4, 3.4, 3.5, 3.5.3, 3.5.3, 4.2.1, 4.3
Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. 2.7, 3.3
Ha, D., Dai, A., & Le, Q. V. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106. 6
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al. (2014). Deepspeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. 1, 2
Hasanbelliu, E., & Principe, J. C. (2008). Content addressable memories in reproducing kernel hilbert spaces. In Machine Learning for Signal Processing, 2008. MLSP 2008. IEEE Workshop on, (pp. 9–13). IEEE. 2.7, 2.7
Haykin, S. (2004). A comprehensive foundation. Neural Networks: A Comprehensive Foundation, 2(2004). 2.1.3
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 770–778). 5-5
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. 2.1.2
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2.1.2
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780. 2.2, 2.4, 3.5, 3.5.2, 3.5.2
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8), 2554–2558. 2.7
Hu, Y., Mian, A. S., & Owens, R. (2011). Sparse approximated nearest points for image set classification. In Computer vision and pattern recognition (CVPR), 2011 IEEE conference on, (pp. 121–128). IEEE. 5-3
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1), 215–243. 2.1.3
Hurri, J., & Hyvarinen, A. (2003). Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3), 663–691. 4.1
Hyvarinen, A., Karhunen, J., & Oja, E. (2004). Independent component analysis, vol. 46. John Wiley & Sons. 2.1.2
Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. arXiv preprint arXiv:1506.02025. 4.1, 4.2.1
Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. GMD-Forschungszentrum Informationstechnik. 2.4
Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. 4.2.1, 4.2.2, 4.2.2, 4.2.2
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), (pp. 2342–2350). 2.4, 3.2
Kaiser, L., & Sutskever, I. (2015). Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228. 2.6
Kalchbrenner, N., Oord, A. v. d., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Video pixel networks. arXiv preprint arXiv:1610.00527. 2.6, 4.1, 4.4.2, 5.3.1, 5.3.2
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, (pp. 1725–1732). IEEE. 1
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. 2.1.3
Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 1, 2.1.2, 3.5.2
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 4.2.1, 4.2.2, 4.2.2, 4.2.2, 4.2.3
Kohonen, T. (2012). Content-addressable memories, vol. 1. Springer Science & Business Media. 2.7
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, (pp. 1097–1105). 2.1.1, 2.1.3
Kulkarni, T. D., Whitney, W. F., Kohli, P., & Tenenbaum, J. (2015). Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, (pp. 2539–2547). 4.1
Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, (pp. 3361–3368). IEEE. 5-2
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 2.1.3
Lee, K.-C., Ho, J., Yang, M.-H., & Kriegman, D. (2005). Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image Understanding, 99(3), 303–331. 5-3
Li, K., & Principe, J. C. (2016). The kernel adaptive autoregressive-moving-average algorithm. IEEE transactions on neural networks and learning systems, 27(2), 334–346. 1
Liang, M., & Hu, X. (2015). Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3367–3375). 2.6
Lotter, W., Kreiman, G., & Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104. 2.6, 4.1
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91–110. 2.1.2
Maddison, C. J., Mnih, A., & Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. 4.2.1, 4.2.2, 4.2.2
Mahjourian, R., Wicke, M., & Angelova, A. (2016). Geometry-based next frame prediction from monocular video. arXiv preprint arXiv:1609.06377. 4.1
Makhzani, A., & Frey, B. J. (2015). Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, (pp. 2773–2781). 5.1
McCormac, J., Handa, A., Leutenegger, S., & Davison, A. J. (2016). Scenenet rgb-d: 5m photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079. 5
Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, (pp. 737–744). ACM. 5.1, 5-2
Nene, S. A., Nayar, S. K., Murase, H., et al. (1996). Columbia object image library (coil-20). Tech. rep., technical report CUCS-005-96. 5-3, 5.1
Olshausen, B. A., et al. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. 4.1
Omlin, C. W., & Giles, C. L. (1996). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM (JACM), 43(6), 937–972. 2.4
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random fields. In Advances in neural information processing systems, (pp. 1121–1128). 4.1
Pagiamtzis, K., & Sheikholeslami, A. (2006). Content-addressable memory (cam) circuits and architectures: A tutorial and survey. Solid-State Circuits, IEEE Journal of, 41(3), 712–727. 2.7
Palm, G., Schwenker, F., Sommer, F. T., & Strey, A. (1997). Neural associative memories. Associative processing and processors, (pp. 307–326). 2.7
Pascanu, R., Mikolov, T., & Bengio, Y. (2012). On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063. 1, 3.5.2
Patraucean, V., Handa, A., & Cipolla, R. (2015). Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309. 2.6
Principe, J. C., & Chalasani, R. (2014). Cognitive architectures for sensory processing. Proceedings of the IEEE, 102(4), 514–525. (document), 1, 1, 2-2, 2.3, 4
Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (1999). Neural and adaptive systems: fundamentals through simulations with CD-ROM. John Wiley & Sons, Inc. 2.4, 2.7
Principe, J. C., Xu, D., & Fisher, J. (2000). Information theoretic learning. Unsupervised adaptive filtering, 1, 265–319. 3.4
Rabiner, L. R. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. 1
Ravanbakhsh, S., Schneider, J., & Poczos, B. (2016). Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500. 5
Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, (pp. 512–519). IEEE. 2.1.2
Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016). Learning what and where to draw. arXiv preprint arXiv:1610.02454. 4.1
Rezende, D. J., Eslami, S., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016). Unsupervised learning of 3d structure from images. arXiv preprint arXiv:1607.00662. 4.1
Robinson, A., & Fallside, F. (1987). The utility driven dynamic error propagation network. University of Cambridge Department of Engineering. 2.4
Rolls, E. T. (2007). An attractor network in the hippocampus: theory and neurophysiology. Learning & Memory, 14(11), 714–731. 1
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6), 386. 2.1
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5, 3. 2.1
Sandberg, I. W., & Xu, L. (1997). Uniform approximation of multidimensional myopic maps. Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on, 44(6), 477–500. 2.2, 2.4
Santana, E., Burt, R., & Principe, J. C. (2017). Memory augmented auto-encoders. In preparation. (document), 3.3
Santana, E., Cinar, G. T., & Principe, J. C. (2015). Parallel flow in deep predictive coding networks. In 2015 International Joint Conference on Neural Networks (IJCNN), (pp. 1–5). IEEE. (document)
Santana, E., Emigh, M., & Principe, J. C. (2016a). Information theoretic-learning auto-encoder. 2016 International Joint Conference on Neural Networks (IJCNN). (document)
Santana, E., Emigh, M., Zerges, P., & Principe, J. C. (2016b). Exploiting spatio-temporal structure with recurrent winner-take-all networks. arXiv preprint arXiv:1611.00050. 2.6, 5, 5, 5.1
Santana, E., & Hotz, G. (2016). Learning a driving simulator. arXiv preprint arXiv:1608.01230. (document)
Santana, E., & Principe, J. C. (2015). Mixed generative and supervised learning modes in deep predictive coding networks. In 2015 International Joint Conference on Neural Networks (IJCNN), (pp. 1–4). IEEE. (document)
Santana, E., & Principe, J. C. (2016). Perception updating networks: On architectural constraints for interpretable video generative models. ICLR 2017 (submitted). (document)
Shirley, P., Ashikhmin, M., & Marschner, S. (2015). Fundamentals of computer graphics. CRC Press. 4.1
Siegelmann, H. T., & Sontag, E. D. (1995). On the computational power of neural nets. Journal of computer and system sciences, 50(1), 132–150. 2.4
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual review of neuroscience, 24(1), 1193–1216. 4.1
Socher, R., Huang, E. H., Pennin, J., Manning, C. D., & Ng, A. Y. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, (pp. 801–809). 2.4
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806. 2.1.3
Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using lstms. arXiv preprint arXiv:1502.04681. 1, 4.4
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), (pp. 1139–1147). 2.4, 3.5.2
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, (pp. 3104–3112). 1, 2.4
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4. 2.1.2
van Handel, R. (2014). Probability in high dimension. 3.4
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11, 3371–3408. 2.1.2
Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612. 4.1, 4.4.1
Wang, R., & Chen, X. (2009). Manifold discriminant analysis. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, (pp. 429–436). IEEE. 5-3
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356. 2.4
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. 5.1
Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916. 1, 3, 3.1.1
Williams, R. J., & Zipser, D. (1989). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1), 87–111. 2.4
Wu, J., Zhang, C., Xue, T., Freeman, B., & Tenenbaum, J. (2016). Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, (pp. 82–90). 5
Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., & Woo, W.-c. (2015). Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, (pp. 802–810). 2.6
Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044. 2.7
Yan, X., Yang, J., Yumer, E., Guo, Y., & Lee, H. (2016). Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances In Neural Information Processing Systems, (pp. 1696–1704). 5
Zaremba, W., & Sutskever, I. (2015). Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521. 3
Zhu, Q., Yeh, M.-C., Cheng, K.-T., & Avidan, S. (2006). Fast human detection using a cascade of histograms of oriented gradients. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, (pp. 1491–1498). IEEE. 2.1.2
Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. 6
BIOGRAPHICAL SKETCH
Ewaldo Eder Carvalho Santana Jr. (or just Eder Santana because no system in
America has space enough to sign his full name) was born in Brazil in 1988. Eder
graduated from the Federal University of Maranhao, where he received both Bachelor of Science (2011) and Master of Science (2012) degrees in electrical engineering.
In 2013 Eder dropped out of the PhD program at the Federal University of Maranhao to try to make a living in America. In 2017 Eder received his PhD in electrical
and computer engineering from the University of Florida, where he was advised by Dr.
Jose C. Principe.
Eder Santana is active in the Machine Learning community. He contributed to Keras, the most popular high-level neural network design framework, and published the video course Deep Learning with Python. He also worked at Comma.ai, helping to develop AI for self-driving cars, and at Paracosm.io, leveraging deep learning for 3D object recognition.