
Non-Gaussianity of Stochastic Gradient Noise

Abhishek Panigrahi
Microsoft Research, Bangalore, India

[email protected]

Raghav Somani
University of Washington, Seattle, WA, USA

[email protected]

Navin Goyal
Microsoft Research, Bangalore, India

[email protected]

Praneeth Netrapalli
Microsoft Research, Bangalore, India

[email protected]

Abstract

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across data-sets, architectures, and other choices.

1 Introduction

The Stochastic Gradient Descent (SGD) algorithm [17] and its variants are workhorses of modern deep learning, e.g. [2, 4]. Not only does SGD allow efficient training of Neural Networks (a non-convex optimization problem) achieving small training loss, it also achieves good generalization, often better compared to Gradient Descent (GD). While striking progress has been made in the theoretical understanding of highly over-parameterized networks in recent years, for realistic settings we lack a satisfactory understanding. In particular, what property of the distribution of the small-batch gradients arising in SGD is responsible for the efficacy of SGD?

Considerable effort has been expended on understanding SGD for Neural Networks. In particular, there have been attempts to relate SGD to a discretization of a continuous time process via Stochastic Differential Equations (SDEs), e.g., [12, 15, 23, 6, 7, 24]. The Stochastic Gradient (SG) can be written as Gradient + Stochastic Gradient Noise (SGN). [12, 6, 7, 24] assume that SGN is approximately Gaussian when the batch size b is large enough, under the Central Limit Theorem (CLT) conditions, which we discuss in the next section. [15, 23] add explicit Gaussian noise to the stochastic gradients in each iteration. One can analyze these algorithms by approximating them with a continuous time process in the form of Langevin diffusion, and one can show that the Markov process exhibited by the continuous time iterates is ergodic, with a unique invariant measure whose log-density is proportional to the negative of the objective function [18]. Therefore, SGD can be seen as a first-order Euler-Maruyama discretization of the Langevin dynamics [7, 11].
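For concreteness, the following is a minimal sketch (not from the paper; the toy objective, step size, and inverse temperature beta are illustrative assumptions) of a first-order Euler-Maruyama discretization of the overdamped Langevin dynamics dw_t = -∇f(w_t) dt + sqrt(2/β) dB_t:

```python
# Minimal Euler-Maruyama sketch of overdamped Langevin dynamics on a toy
# quadratic objective f(w) = 0.5 * ||w||^2 (illustrative, not the paper's setup).
import numpy as np

def grad_f(w):
    return w  # gradient of 0.5 * ||w||^2

rng = np.random.default_rng(0)
w, dt, beta = np.ones(5), 1e-2, 10.0  # iterate, step size, inverse temperature
for _ in range(10_000):
    noise = rng.standard_normal(w.shape)
    w = w - dt * grad_f(w) + np.sqrt(2.0 * dt / beta) * noise
# In the long run the iterates approximately sample from the invariant measure
# with density proportional to exp(-beta * f(w)); replacing the exact gradient
# by a mini-batch gradient gives an SGD-like update whose noise term is assumed
# Gaussian in the SDE view described above.
```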

Since the gradient vectors are very high-dimensional, the number of samples required in the application of the CLT, so that a multivariate Gaussian distribution provides a sufficiently accurate approximation, may be very high, raising doubts about the applicability of the CLT for SGN computed using small batch sizes. For SGD training of neural networks, the assumption of SGD noise being Gaussian-like was contested recently in [21]. They argue that the SGN instead follows a stable distribution with infinite variance, obtained by an application of the Generalized Central Limit Theorem (GCLT). They provided experimental evidence for their claim. Furthermore, they provided arguments based on the properties of the associated continuous time process, which can then be viewed as an SDE whose stochastic term is a Lévy motion.

Science meets Engineering of Deep Learning (SEDL) workshop, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

arXiv:1910.09626v2 [cs.LG] 25 Oct 2019



2 Preliminaries

The problem of training a Neural Network is essentially an unconstrained optimization problem that usually constitutes a finite-sum non-convex objective function. The function $f : \mathbb{R}^p \times \mathbb{R}^{d+1} \to \mathbb{R}$ denotes the composition of the Neural Network and the loss function. For learnable parameters $w \in \mathbb{R}^p$ and input $(x, y) \in \mathbb{R}^d \times \mathbb{R}$, the function value $f(w; (x, y))$ measures the accuracy of the underlying Neural Network on $(x, y)$. For a set of $n$ training data samples $\{x_i, y_i\}_{i=1}^n \in (\mathbb{R}^d \times \mathbb{R})^n$, define $f(w; B) := \frac{1}{|B|} \sum_{i \in B} f(w; \{x_i, y_i\})$ and $\nabla_w f(w; B) := \frac{1}{|B|} \sum_{i \in B} \nabla_w f(w; \{x_i, y_i\})$. Then the optimization problem is given by

$$w^* \in \operatorname*{argmin}_{w \in \mathbb{R}^p} f(w; [n]). \qquad (1)$$

The SGD update rule is given by $w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla_w f(w^{(t)}; B_t)$, where $w^{(t)}$ denotes the iterate at time $t \geq 0$, $\eta > 0$ denotes the learning rate, and $B_t \subseteq [n]$ denotes a batch of training examples of size $b \leq n$ picked at time $t$. A random SGN vector at time $t$ is then given by $\nabla_w f(w^{(t)}; B_t) - \nabla_w f(w^{(t)}; [n])$.
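To make this definition concrete, here is a minimal sketch on a toy least-squares model (the model, dimensions, and names such as `grad` are illustrative assumptions, not the networks studied in this paper):

```python
# Sketch: one SGN vector for a toy least-squares model f(w; (x, y)) = 0.5*(w.x - y)^2.
# SGN at the current iterate = (mini-batch gradient) - (full-batch gradient).
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 1000, 20, 32                       # examples, parameter dimension, batch size
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w = rng.standard_normal(d)                   # current iterate w^(t)

def grad(w, idx):
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)    # average gradient over the index set

full_grad = grad(w, np.arange(n))            # nabla_w f(w; [n])
batch = rng.choice(n, size=b, replace=True)  # mini-batch B_t, sampled with replacement
sgn = grad(w, batch) - full_grad             # the SGN vector at time t
```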

Gaussian random variables. Perhaps not as well known as the scalar CLT, there is a CLT for vector-valued random variables showing that appropriately scaled sums of i.i.d. random vectors with finite covariance matrix tend to a multivariate Gaussian in distribution. A useful fact about multivariate Gaussians is that if, for a random vector X, all its one-dimensional marginals are univariate Gaussians, then X must be multivariate Gaussian (see, e.g., Exercise 3.3.4 in [22]).

Stable distribution. For a real-valued random variable X, let X1 and X2 denote its i.i.d. copies. X is said to be stable if for any a, b > 0, the random variable aX1 + bX2 has the same distribution as cX + d for some constants c > 0 and d. Stable distributions are a four-parameter family S(α, β, γ, δ) of univariate probability distributions. There is no closed-form formula known for these distributions in general; instead they are defined via their characteristic function. We do not include that definition here since we will not need it; instead, we note only the properties of stable distributions that are relevant for us. See [14] for more information. The special case S(α, 0, 1, 0) is denoted SαS, and can be considered the symmetric, centered, and normalized case of a stable distribution. Here α ∈ (0, 2] is the stability parameter; we omit the discussion of the other parameters. This family includes the Gaussian distribution (α = 2) and the standard Cauchy distribution (S(1, 0, 1, 0)). Interest in stable distributions comes from the Generalized Central Limit Theorem (see [14]), which informally states that the distribution of appropriately scaled sums of i.i.d. random variables with infinite variance tends to a stable distribution. This was the basis of the suggestion in [21] that SGN vectors follow a stable distribution with infinite variance. More specifically, [21] posited that the coordinates of SGN vectors are i.i.d. stable random variables. In parallel with multivariate Gaussians, there is a notion of multivariate stable distributions (see [14]) which satisfy the property that their one-dimensional marginals are univariate stable. A more general hypothesis for SGN vectors could be that they are multivariate stable.
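For intuition about the heavy tails involved, one can sample from the SαS family directly; the sketch below assumes SciPy's scipy.stats.levy_stable is available (an implementation detail, not something the paper relies on):

```python
# Sketch: sampling S(alpha, 0, 1, 0) variables and observing the heavy tails
# for alpha < 2, where the variance is infinite.
import numpy as np
from scipy.stats import levy_stable

for alpha in [0.5, 1.0, 1.5, 2.0]:
    x = levy_stable.rvs(alpha, 0.0, size=10_000, random_state=0)
    print(f"alpha={alpha:.1f}  median|x|={np.median(np.abs(x)):.2f}  max|x|={np.max(np.abs(x)):.1f}")
# For alpha well below 2 the sample maximum dwarfs the median, which is the
# infinite-variance regime that the GCLT argument of [21] relies on.
```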

3 Method, Experiments and Observations

Gaussianity testing is a well-studied problem in statistics; we use the tests due to [19] and [1]. These were found to be the top two tests in a comparison study [16]. Both tests take the null hypothesis that the given set of scalar samples comes from a Gaussian distribution; a small p-value is evidence against this null hypothesis. We take the following steps to perform statistical tests on stochastic gradient noise (SGN).

1. We train a model using mini-batch SGD and consider SGN vectors for 1000 independent mini-batches, which are of the same size as the training batch-size and randomly sampled with replacement, at multiples of 100 iterations.

2. We project each SGN vector along 1000 random unit vectors and perform Gaussianity tests on the projections along each direction. We plot the average confidence value across these directions, as given by the Shapiro–Wilk test [19], and the fraction of directions accepted by the Anderson–Darling test [1], and compare with the case when actual Gaussian data is fed to the tests. A minimal sketch of this pipeline is given below, after Figure 1.


Figure 1: (a) Gaussianity tests on SαS variables; (b) and (c) training Resnet18 with 256 batch-size SGD at l.r. 10−2: (b) accuracy & loss curves, and (c) Gaussianity tests on projections.
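The two steps above can be summarized by the following sketch (assuming the SGN vectors for a checkpoint are already stacked into a 1000 × p array; function and variable names are illustrative, not the exact code used for our experiments):

```python
# Sketch of steps 1-2: project SGN vectors on random unit directions and apply
# the Shapiro-Wilk and Anderson-Darling normality tests to each projection.
import numpy as np
from scipy.stats import shapiro, anderson

def gaussianity_scores(sgn_vectors, n_directions=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = sgn_vectors.shape
    shapiro_pvalues, anderson_accepts = [], []
    for _ in range(n_directions):
        u = rng.standard_normal(p)
        u /= np.linalg.norm(u)             # random unit direction
        proj = sgn_vectors @ u             # one-dimensional marginal of the SGN vectors
        shapiro_pvalues.append(shapiro(proj).pvalue)
        res = anderson(proj, dist='norm')  # Anderson-Darling test for normality
        crit_5 = res.critical_values[list(res.significance_level).index(5.0)]
        anderson_accepts.append(res.statistic < crit_5)  # accept at the 5% level
    return np.mean(shapiro_pvalues), np.mean(anderson_accepts)
```

Averaging the Shapiro–Wilk p-values and the Anderson–Darling acceptance rate over random directions exploits the fact from Section 2 that a multivariate Gaussian has Gaussian one-dimensional marginals.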

3.1 Experimental Setup

We conduct experiments on different models, namely a 3-layer fully connected network, Alexnet, Resnet18, and VGG16, with and without batch-normalization (BN), and on different data-sets, namely CIFAR10 and MNIST [8, 10, 9, 5, 20]. All layers were initialized with Xavier initialization [3]. The models were trained with constant learning rate mini-batch SGD on the cross-entropy loss, with mini-batch sizes 32, 256, and 4096. The learning rates were varied over 10−1, 10−2, and 10−3.

3.2 Observations and Conclusion

We first perform sanity checks to see how well the hypothesis tests do on generated SαS samples. It can be seen in Figure 1a that for 0.2 ≲ α ≲ 1.8 the tests very confidently and correctly declare that the distribution is not Gaussian. As α → 2, we expect the distribution to become closer to Gaussian, and for very small α the tails become very heavy. Thus the behavior of the statistical tests on the SαS distribution is as expected.
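This sanity check can be reproduced in miniature as follows (a sketch under the assumption that a single Shapiro–Wilk call per α is a reasonable stand-in for the full projection pipeline sketched above):

```python
# Sketch of the Figure 1a sanity check: the Shapiro-Wilk test should reject
# Gaussianity for synthetic SalphaS samples with alpha well below 2.
import numpy as np
from scipy.stats import levy_stable, shapiro

for alpha in np.arange(0.2, 2.01, 0.2):
    x = levy_stable.rvs(alpha, 0.0, size=1000, random_state=0)
    print(f"alpha={alpha:.1f}  Shapiro-Wilk p-value = {shapiro(x).pvalue:.3g}")
# Expected pattern: p-values near zero for roughly 0.2 <= alpha <= 1.8 (clear
# rejection), rising only as alpha approaches 2.
```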

We see in Fig. 2a that SGN looks Gaussian to the statistical tests throughout the training of Resnet18 with batch-size 4096. In contrast, in Fig. 2c the behavior is not Gaussian anywhere in training for batch-size 32. For intermediate batch-sizes such as 256, the behavior is Gaussian at the beginning of training and later becomes non-Gaussian. We observe this behavior consistently across different architectures, data-sets, and larger batch-sizes. This is in contrast with the results in [21], which suggest a stable distribution right from the start for batch-size 500. When the behavior is Gaussian in 1000 random directions, the characterization of multivariate Gaussians mentioned above suggests, though does not definitively prove, that the SGN vectors are multivariate Gaussian early on in the training. For batch-size 32, is it possible that the SGN distribution is stable? This is unlikely: if the per-example gradients had infinite variance, the GCLT would lead us to expect stable (non-Gaussian) behavior at the larger batch-sizes as well, whereas we observe Gaussian behavior there.

We briefly explain possible reasons why our results do not agree with those in [21]. They use the estimator of [13] to estimate α. The application of this estimator in [21] suffers from two issues: (1) In [13], the estimator is shown to give a good estimate of α under the assumption that the random variable is indeed α-stable; it makes no claims about the estimate when this assumption is not satisfied, and [21] seems to tacitly assume stability. (2) The estimator assumes that the coordinates of the SGN vector are i.i.d.; this assumption is invalid in the typical over-parameterized setting.


Figure 2: (a) and (b) training Resnet18 with 4096 batch-size SGD at l.r. 10−2: (a) Gaussianity tests on projections, (b) accuracy & loss behavior with training; (c) and (d) training Resnet18 with 32 batch-size SGD at l.r. 10−2: (c) Gaussianity tests on projections, (d) accuracy & loss curves. For (a) and (c), the legend is the same as in Fig. 1c.

References

[1] T. W. Anderson and D. A. Darling. A test of goodness of fit. Journal of the American Statistical Association, 49(268):765–769, 1954.

[2] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[3] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, 2016.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[6] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.

[7] Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[8] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[10] Yann LeCun, L. D. Jackel, Leon Bottou, A. Brunot, Corinna Cortes, J. S. Denker, Harris Drucker, I. Guyon, U. A. Muller, Eduard Sackinger, et al. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, volume 60, pages 53–60. Perth, Australia, 1995.

[11] Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2101–2110. JMLR.org, 2017.

[12] Stephan Mandt, Matthew Hoffman, and David Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016.

[13] Mohammad Mohammadi, Adel Mohammadpour, and Hiroaki Ogata. On estimating the tail index and the spectral measure of multivariate α-stable distributions. Metrika, 78:549–561, 2015.


[14] J. P. Nolan. Stable Distributions - Models for Heavy Tailed Data. Birkhauser, Boston, 2018. In progress, Chapter 1 online at http://fs2.american.edu/jpnolan/www/stable/stable.html.

[15] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.

[16] Nornadiah Mohd Razali and Yap Bee Wah. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. J. Stat. Model. Analytics, 2, 2011.

[17] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[18] Gareth O. Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, 2002.

[19] S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965.

[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[21] Umut Simsekli, Levent Sagun, and Mert Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5827–5837, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

[22] Roman Vershynin. High-Dimensional Probability. Cambridge University Press, 2018.

[23] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In COLT, 2017.

[24] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.


A Appendix

Here we present the various plots for the remaining data-sets and models.

Figure 3: Gaussianity Test Experiments on VGG16, without BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 4096.

Figure 4: Gaussianity Test Experiments on VGG16, with BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 4096.


Figure 5: Gaussianity Test Experiments on Resnet18, without BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 4096.

Figure 6: Gaussianity Test Experiments on Resnet18, with BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 4096.

Figure 7: Gaussianity Test Experiments on VGG16, without BN and trained on CIFAR10 with 256 minibatch-size SGD; (a) learning rate is 10−1, (b) learning rate is 10−2, (c) learning rate is 10−3.


Figure 8: Gaussianity Test Experiments on VGG16, with BN and trained on CIFAR10 with 256 minibatch-size SGD; (a) learning rate is 10−1, (b) learning rate is 10−2, (c) learning rate is 10−3.

Figure 9: Gaussianity Test Experiments on Resnet18, without BN and trained on CIFAR10 with 256 minibatch-size SGD; (a) learning rate is 10−1, (b) learning rate is 10−2, (c) learning rate is 10−3.

Figure 10: Gaussianity Test Experiments on Resnet18, with BN and trained on CIFAR10 with 256 minibatch-size SGD; (a) learning rate is 10−1, (b) learning rate is 10−2, (c) learning rate is 10−3.


Figure 11: Gaussianity Test Experiments on Alexnet, without BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 1024.

Figure 12: Gaussianity Test Experiments on Alexnet, with BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 1024.

Figure 13: Gaussianity Test Experiments on a 3-layer fully connected network, with BN and trained on CIFAR10 at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 1024.


Figure 14: Gaussianity Test Experiments on a 3-layer fully connected network, without BN and trained on MNIST at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 1024.

Figure 15: Gaussianity Test Experiments on LeNet, without BN and trained on MNIST at l.r. 10−2; (a) mini batch-size is 32, (b) mini batch-size is 256, (c) mini batch-size is 1024.
