Understanding the difficulty of training deep feedforward neural networks

Authors: Xavier Glorot & Yoshua Bengio

Presented by: Sheng Huang

Overview

Objective: Understand better why standard gradient descent from random initialization does poorly with deep neural networks.

● Observe that the logistic sigmoid activation is unsuited for deep networks with random initialization, because it can drive the top hidden layer into saturation; saturated units can, however, move out of saturation by themselves.

● Study how activations and gradients vary across layers.

● Propose a new initialization scheme that brings substantially faster convergence.

● The analysis is driven by investigative experiments that monitor activations and gradients across layers and across training iterations (see the sketch after this list).

● Evaluate how the choice of activation function and of initialization procedure affects these activations and gradients.
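
A minimal NumPy sketch (not the authors' code) of the kind of monitoring used throughout the paper: record per-layer activation statistics of a deep tanh network on a forward pass. The layer widths, batch size, and initialization range below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 1000, 1000, 1000, 1000, 10]   # assumed architecture

# Commonly used heuristic initialization: W ~ U[-1/sqrt(fan_in), 1/sqrt(fan_in)]
weights = [rng.uniform(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in), size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

h = rng.standard_normal((256, layer_sizes[0]))    # a dummy mini-batch
for i, W in enumerate(weights[:-1]):              # hidden layers only
    h = np.tanh(h @ W)
    # The kind of per-layer statistics tracked across layers and training iterations:
    print(f"layer {i + 1}: mean={h.mean():+.3f}  std={h.std():.3f}  "
          f"saturated={np.mean(np.abs(h) > 0.99):.2%}")
```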

Experimental Setting and Dataset

1 Dataset

Online Learning on an Infinite Dataset: Shapeset-3×2

Finite Datasets: MNIST digits, CIFAR-10, Small-ImageNet

2 Experimental Setting

Effect of Activation Functions and Saturation During Training

1 Experiments with the Sigmoid

● With random initialization, the top hidden layer is quickly driven into saturation and only slowly moves out of it during training.

2 Experiments with the Hyperbolic Tangent

● Hyperbolic tangent networks do not suffer from the top-hidden-layer saturation behavior observed with sigmoid networks.

● With the standard weight initialization, saturation occurs sequentially, starting at layer 1 and propagating up through the network.

3 Experiments with the Softsign

● Saturation does not occur one layer after the other as with the hyperbolic tangent.

● Saturation is faster at the beginning and then slows down, and all layers move together towards larger weights (see the sketch below).
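
For reference, a small sketch of the three activations discussed; sigmoid and tanh are standard, and softsign(x) = x/(1 + |x|) as in the paper. The saturation measure here (fraction of outputs within 0.01 of the asymptotes on a fixed input range) is an illustrative choice, not the paper's exact criterion.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # asymptotes 0 and 1; not centred at 0

def softsign(x):
    return x / (1.0 + np.abs(x))        # softsign(x) = x / (1 + |x|); asymptotes -1 and 1

x = np.linspace(-8.0, 8.0, 10001)
for name, f, lo, hi in [("sigmoid", sigmoid, 0.0, 1.0),
                        ("tanh", np.tanh, -1.0, 1.0),
                        ("softsign", softsign, -1.0, 1.0)]:
    y = f(x)
    saturated = np.mean((y < lo + 0.01) | (y > hi - 0.01))
    print(f"{name:8s} fraction within 0.01 of its asymptotes on [-8, 8]: {saturated:.2%}")
```

Softsign approaches its asymptotes polynomially rather than exponentially, which is the gentler non-linearity referred to in the conclusions.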

Studying Gradients and their Propagation

1 Effect of cost function

● The logistic regression / conditional log-likelihood cost function −log P(y|x), coupled with a softmax output, worked much better for classification problems than the quadratic cost.

● Plateaus in the training criterion are less pronounced with the log-likelihood cost function (illustrated below).
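
A small NumPy sketch contrasting the two costs on a single made-up example: the conditional log-likelihood −log P(y|x) with a softmax output versus the quadratic cost applied to the same softmax probabilities. The logits and label are invented for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])   # hypothetical network outputs for one example
target = np.array([1.0, 0.0, 0.0])    # one-hot label (true class is class 0)

p = softmax(logits)
nll = -np.log(p[0])                            # conditional log-likelihood cost -log P(y|x)
quadratic = 0.5 * np.sum((p - target) ** 2)    # quadratic cost on the same probabilities

# With softmax + negative log-likelihood, the gradient of the cost w.r.t. the
# logits is simply (p - target); the quadratic cost adds softmax-Jacobian
# factors that can shrink the gradient, which relates to the plateaus
# mentioned above.
print(f"NLL = {nll:.4f}  quadratic = {quadratic:.4f}  dNLL/dlogits = {p - target}")
```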

Theoretical Considerations and a New Normalized Initialization

● The normalization factor is important when initializing deep networks because of the multiplicative effect through layers.

● Propose a normalized initialization scheme (sketched below).
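
A sketch of the two schemes compared in the paper, assuming uniform distributions and fully connected layers: the commonly used heuristic W ~ U[−1/√n_j, 1/√n_j] and the proposed normalized initialization W ~ U[−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})].

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_init(fan_in, fan_out):
    # Commonly used heuristic: W ~ U[-1/sqrt(fan_in), 1/sqrt(fan_in)]
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def normalized_init(fan_in, fan_out):
    # Proposed normalized initialization:
    # W ~ U[-sqrt(6)/sqrt(fan_in + fan_out), sqrt(6)/sqrt(fan_in + fan_out)]
    bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W_std, W_norm = standard_init(1000, 1000), normalized_init(1000, 1000)
# Var[U(-b, b)] = b^2 / 3, so the normalized scheme gives Var[W] = 2 / (fan_in + fan_out).
print(f"standard   Var[W] = {W_std.var():.2e}")
print(f"normalized Var[W] = {W_norm.var():.2e}  (target {2.0 / (1000 + 1000):.2e})")
```

The normalized bound depends on both fan-in and fan-out, which is what lets the layer-to-layer transformations roughly preserve the variances of activations (forward) and of gradients (backward) at the same time.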

Gradient Propagation Study

● Compare activation values and back-propagated gradients at initialization with two different initialization methods.

Back-propagated Gradients During Learning

● With the standard initialization, the variance of the back-propagated gradients shrinks as it is propagated downwards through the layers. With the normalized initialization, the back-propagated gradients do not shrink.

● The variance of the weight gradients is roughly constant across layers with both the normalized and the standard initialization.

● During training, the normalized initialization keeps the variance of the weight gradients roughly the same across layers; with the standard initialization, the gradients in different layers diverge from each other.

● Softsign networks behave similarly to tanh networks with the normalized initialization (see the sketch below).
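
A rough NumPy sketch (assumed widths, random data, random error signal) of the measurement behind these observations: the per-layer variance of back-propagated gradients at initialization, for a tanh network under the standard versus the normalized initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 1000, 1000, 1000, 1000, 10]          # assumed layer widths

def init(fan_in, fan_out, normalized):
    # normalized: U[-sqrt(6)/sqrt(fan_in+fan_out), ...]; standard: U[-1/sqrt(fan_in), ...]
    b = np.sqrt(6.0) / np.sqrt(fan_in + fan_out) if normalized else 1.0 / np.sqrt(fan_in)
    return rng.uniform(-b, b, size=(fan_in, fan_out))

def backprop_grad_variances(normalized):
    Ws = [init(a, b, normalized) for a, b in zip(sizes[:-1], sizes[1:])]
    # Forward pass on random inputs, storing each layer's activations.
    acts, a = [], rng.standard_normal((256, sizes[0]))
    for W in Ws:
        a = np.tanh(a @ W)
        acts.append(a)
    # Back-propagate a random error signal from the output layer.
    g = rng.standard_normal(acts[-1].shape)          # dCost/d(output activation)
    variances = []
    for W, a_out in zip(reversed(Ws), reversed(acts)):
        g = g * (1.0 - a_out ** 2)                   # through tanh'
        variances.append(g.var())                    # variance of dCost/d(pre-activation)
        g = g @ W.T                                  # propagate to the layer below
    return variances[::-1]                           # ordered from first to last layer

for normalized in (False, True):
    label = "normalized" if normalized else "standard"
    print(label, ["%.2e" % v for v in backprop_grad_variances(normalized)])
```

With the standard initialization the variance shrinks by a roughly constant factor per layer, while with the normalized initialization it stays approximately flat.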

Error Curves and Conclusions

Final test error for all the datasets studied.

● Error curves are also shown for supervised fine-tuning from an initialization obtained by unsupervised pre-training with denoising auto-encoders.

Conclusions from error curves

● The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards poorer local minima

● The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

● The normalized initialization can be helpful for tanh networks, because the layer-to-layer transformations maintain the magnitudes of activations and gradients.

Other Conclusions from the Study

● Monitoring activations and gradients across layers and training iterations is a powerful investigation tool for understanding training difficulties in deep nets.

● Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.

● Keeping the layer-to-layer transformations such that both activations and gradients flow well appears helpful, and eliminates a good part of the discrepancy between purely supervised deep networks and those pre-trained with unsupervised learning.

● Many of the observations remain unexplained, suggesting the need for further investigation.