Multidimensional RNN

  • View

  • Download

Embed Size (px)

Text of Multidimensional RNN

  • Multi-Dimensional RNNsGrigory Sapunov

    Moscow Computer Vision GroupMoscow, Yandex, 27.04.2016

  • Outline RNN intro


    Some architectural issues BRNN MDRNN MDMDRNN HSRNN

    Some recent results (same ideas) 2D LSTM, CLSTM, C-RNN, C-HRNN (new ideas) ReNet, PyraMiD-LSTM, Grid LSTM

    (Discussion) RNN vs CNN for Computer Vision Resources

  • A tiny intro into RNNs

  • Feedforward NN vs. Recurrent NN

    Recurrent neural networks (RNNs) allow cyclical connections.

  • Unfolding the RNN and training using BPTT

    Can do backprop on the unfolded network: Backpropagation through time (BPTT)

  • Neural Network propertiesFeedforward NN (FFNN):

    FFNN is a universal approximator: feed-forward network with a single hidden layer, which contains finite number of hidden neurons, can approximate continuous functions on compact subsets of Rn, under mild assumptions on the activation function.

    Typical FFNNs have no inherent notion of order in time. They remember only training.

    Recurrent NN (RNN):

    RNNs are Turing-complete: they can compute anything that can be computed and have the capacity to simulate arbitrary procedures.

    RNNs possess a certain type of memory. They are much better suited to dealing with sequences, context modeling and time dependencies.

  • RNN problem: Vanishing gradients

    Solution: Long short-term memory (LSTM, Hochreiter, Schmidhuber, 1997)

  • LSTM cell

  • LSTM network

  • LSTM: Fixing vanishing gradient problem

  • Comparing LSTM and Simple RNN

    More on LSTMs:

  • Another solution: Gated Recurrent Unit (GRU)

    GRU (Cho et al., 2014) is a bit simpler than LSTM (less weights)

  • Another useful thing: CTC Output LayerCTC (Connectionist Temporal Classification; Graves, Fernndez, Gomez, Schmidhuber, 2006) was specifically designed for temporal classification tasks; that is, for sequence labelling problems where the alignment between the inputs and the target labels is unknown.

    CTC models all aspects of the sequence with a single neural network, and does not require the network to be combined with a hidden Markov model. It also does not require presegmented training data, or external post-processing to extract the label sequence from the network outputs.

    The CTC network predicts only the sequence of phonemes (typically as a series of spikes, separated by blanks, or null predictions), while the framewise network attempts to align them with the manual segmentation.

  • Example: CTC vs. Framewise classification

  • End of IntroSo, further we will not make a distinction between RNN/GRU/LSTM, and will usually be using the word RNN for any kind of internal block. Typically most RNNs now are actually LSTMs.

    Significant part of the presentation is based on works of Alex Graves et al.

  • Some interesting generalizations of simple RNN architecture

  • #1 Directionality (BRNN/BLSTM)

  • Bidirectional RNN/LSTMThere are many situations when you see the whole sequence at once (OCR, speech recognition, translation, caption generation, ).

    So you can scan the [1-d] sequence in both directions, forward and backward.

    Here comes BLSTM (Graves, Schmidhuber, 2005).


  • Typical result: BRNN>RNN, LSTM>RNN, BLSTM>BRNN

  • Typical result: BRNN>RNN, LSTM>RNN, BLSTM>BRNN

  • Typical result: BRNN>RNN, LSTM>RNN, BLSTM>BRNN

  • Example: BLSTM classifying the utterance one oh five

  • #2 Dimensionality (MDRNN/MDLSTM)

  • Multidimensional RNN/LSTMStandard RNNs are inherently one dimensional, and therefore poorly suited to multidimensional data (e.g. images).

    The basic idea of MDRNNs (Graves, Fernandez, Schmidhuber, 2007) is to replace the single recurrent connection found in standard RNNs with as many recurrent connections as there are dimensions in the data.

    It assumes some ordering on the multidimensional data.


    The basic idea of MDRNNs is to replace the single recurrent connection found in standard RNNs with as many recurrent connections as there are dimensions

    in the data.

  • Uni-directionality

    MDRNN assumes some ordering on the multidimensional data. And its not the only possible one.

  • Uni-directionality

  • #3 Directionality + Dimensionality (MDMDRNN?)

  • Multidirectional multidimensional RNN (MDMDRNN?)The previously mentioned ordering is not the only possible one. It might be OK for some tasks, but it is usually preferable for the network to have access to the surrounding context in all directions. This is particularly true for tasks where precise localisation is required, such as image segmentation.

    For one dimensional RNNs, the problem of multidirectional context was solved by the introduction of bidirectional recurrent neural networks (BRNNs). BRNNs contain two separate hidden layers that process the input sequence in the forward and reverse directions.

    BRNNs can be extended to n-dimensional data by using 2n separate hidden layers, each of which processes the sequence using the ordering defined above, but with a different choice of axes.

  • Multi-directionality

  • Multi-directionality

    As before, the hidden layers are connected to a single output layer, which now has access to all surrounding context

  • MDMDRNN example: Air Freight database (2007)A ray-traced colour image sequence that comes with a ground truth segmentation into the different textures mapped onto the 3-d models. The sequence is 455 frames (160x120 px) long and contains 155 distinct textures.

  • MDMDRNN example: Air Freight database (2007)Network structure:

    Multidirectional 2D LSTM. 4 layers (not levels! just 4 directional layers on a single level) consisted of 25

    memory blocks, each containing 1 cell, 2 forget gates, 1 input gate, 1 output gate and 5 peephole weights.

    The input and output activation function of the cells was tanh, and the activation function for the gates was the logistic sigmoid.

    The input layer was size 3 (RGB) and the output layer (softmax) was size 155 (one unit for each texture).

    The network contained 43,257 trainable weights in total. The final pixel classification error rate, after 330 training epochs, was 7.1% on

    the test set.

  • MDMDRNN example: Air Freight database (2007)

  • MDMDRNN example: MNIST (2007)

    Additional evaluation on the warped dataset (not used in training at all)

  • MDMDRNN example: MNIST (2007)

  • #4 Hierarchical subsampling (HSRNN)

  • Hierarchical Subsampling Networks (HSRNN)So-called hierarchical subsampling is commonly used in fields such as computer vision where the volume of data is too great to be processed by a flat architecture. As well as reducing computational cost, it also reduces the effective dispersal of the data, since inputs that are widely separated at the bottom of the hierarchy are transformed to features that are close together at the top.

    A hierarchical subsampling recurrent neural network (HSRNN, Graves and Schmidhuber, 2009) consists of an input layer, an output layer and multiple levels of recurrently connected hidden layers. The output sequence of each level in the hierarchy is used as the input sequence for the next level up. All input sequences are subsampled using subsampling windows of predetermined width. The structure is similar to that used by convolutional networks, except with recurrent, rather than feedforward, hidden layers.


  • HSRNNFor each layer in the hierarchy, the forward pass equations are identical to those for a standard RNN, except that the sum over input units is replaced by a sum of sums over the subsampling window.

    A good rule of thumb is to choose the layer sizes so that each level consumes roughly half the processing time of the level below.


  • HSRNNCan be easily extended into multidimensional and multidirectional case.

    The problem is that each level of the hierarchy requires 2n hidden layers instead of one. To connect every layer at one level to every layer at the next therefore requires O(22n) weights.

    One way to reduce the number of weights is to separate the levels with nonlinear feedforward layers, which reduces the number of weights between the levels to O(2n)the same as standard MDRNNs.

    As a rule of thumb, giving each feedforward layer between half and one times as many units as the combined hidden layers in the level below appears to work well in practice.


  • HSRNN example: Arabic handwriting recognitionNetwork structure:

    The hierarchy contained three levels, multidirectional MDLSTM (so 4 hidden layers for 2D data).

    The three levels were separated by two feedforward layers with the tanh activation function.

    Subsampling windows were applied in three places: to the input sequence, to the output sequence of the first hidden level, and to the output sequence of the second hidden level.


  • Offline arabic handwriting recognition (2009) 32,492 black-and-white images of individual handwritten Tunisian town and

    village names, of which we used 30,000 for training, and 2,492 for validation Each image