
Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

Dar Gilboa 1 2 *   Bo Chang 3 *   Minmin Chen 4   Greg Yang 5   Samuel S. Schoenholz 4   Ed H. Chi 4   Jeffrey Pennington 4

Abstract

Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and the GRU, do exhibit modest improvements over vanilla RNN cells, but they still suffer from instabilities when trained on very long sequences. In this work, we develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians. By optimizing these quantities in terms of the initialization hyperparameters, we derive a novel initialization scheme that eliminates or reduces training instabilities. We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower. We also observe a beneficial effect on generalization performance using this new initialization.

1. Introduction

A common paradigm for research and development in deep learning involves the introduction of novel network architectures followed by experimental validation on a selection of tasks. While this methodology has undoubtedly generated significant advances in the field, it is hampered by the fact that the full capabilities of a candidate model may be obscured by difficulties in the training procedure.

* Work performed while interning at Google Brain. 1 Department of Neuroscience, Columbia University. 2 Data Science Institute, Columbia University. 3 Department of Statistics, University of British Columbia. 4 Google Brain. 5 Microsoft Research. Correspondence to: Dar Gilboa <[email protected]>.

[Figure 1: test accuracy (0 to 1) as a function of sequence length (200 to 800) for critical and standard initialization.]

Figure 1. Critical initialization improves trainability of recurrent networks. Test accuracy for peephole LSTM trained to classify sequences of MNIST digits after 8000 iterations. As the sequence length increases, the network is no longer trainable with standard initialization, but still trainable using critical initialization.

It is often possible to overcome such difficulties by carefully selecting the optimizer, batch size, learning rate schedule, initialization scheme, or other hyperparameters. However, the standard strategies for searching for good values of these hyperparameters are not guaranteed to succeed, especially if the trainable configurations are constrained to a low-dimensional subspace of hyperparameter space, which can render random search, grid search, and even Bayesian hyperparameter selection methods unsuccessful.

In this work, we argue that, for long sequence tasks, the trainable configurations of initialization hyperparameters for LSTMs and GRUs lie in just such a subspace, which we characterize theoretically. In particular, we identify precise conditions on the hyperparameters governing the initial weight and bias distributions that are necessary to ensure trainability. These conditions derive from the observation that in order for a network to be trainable, (a) signals from the relevant parts of the input sequence must be able to propagate all the way to the loss function and (b) the gradients must be stable (i.e. they must not explode or vanish exponentially).

As shown in Figure 1, training of recurrent networks with standard initialization on certain tasks begins to fail as the sequence length increases and signal propagation becomes harder to achieve.


However, as we shall show, a suitably chosen initialization scheme can dramatically improve trainability on such tasks.

We study the effect of the initialization hyperparameters on signal propagation for a very broad class of recurrent architectures, which includes as special cases many state-of-the-art RNN cells, including the GRU (Cho et al., 2014), the LSTM (Hochreiter & Schmidhuber, 1997), and the peephole LSTM (Gers et al., 2002). The analysis is based on the mean field theory of signal propagation developed in a line of prior work (Schoenholz et al., 2016; Xiao et al., 2018; Chen et al., 2018; Yang et al., 2019), as well as the concept of dynamical isometry (Saxe et al., 2013; Pennington et al., 2017; 2018) that is necessary for stable gradient backpropagation and which was shown to be crucial for training simpler RNN architectures (Chen et al., 2018). We perform a number of experiments to corroborate the results of the calculations and use them to motivate initialization schemes that outperform standard initialization approaches on a number of long sequence tasks.

2. Background and related work

2.1. Mean field analysis of neural networks

Signal propagation at initialization can be controlled by varying the hyperparameters of fully-connected (Schoenholz et al., 2016; Yang & Schoenholz, 2017) and convolutional (Xiao et al., 2018) feed-forward networks, as well as for simple gated recurrent architectures (Chen et al., 2018). In all these cases, such control was used to obtain initialization schemes that outperformed standard initializations on benchmark tasks. In the feed-forward case, this enabled the training of very deep architectures without the use of batch normalization or skip connections.

By forward signal propagation, we specifically refer to the persistence of correlations between the hidden states of networks with different inputs as a function of time (or depth in the feed-forward case), as will be made precise in Section 4.2. Backward signal propagation depends not only on the norm of the gradient, but also on its stability, which is governed by the state-to-state Jacobian matrix, as discussed in (Bengio et al., 1994). In our context, the goal of the backward analysis is to enhance the conditioning of the Jacobian by controlling the first two moments of its squared singular values. Forward signal propagation and the spectral properties of the Jacobian at initialization can be studied using mean field theory and random matrix theory (Poole et al., 2016; Schoenholz et al., 2016; Yang & Schoenholz, 2017; 2018; Xiao et al., 2018; Pennington et al., 2017; 2018; Chen et al., 2018; Yang et al., 2019). More generally, mean field analysis is also emerging as a promising tool for studying the dynamics of learning in neural networks and even obtaining generalization bounds in some settings (Mei et al., 2018).

While extending an analysis at initialization to a trained network might appear hopeless due to the complexity of the training process, intriguingly, it was recently shown that in the infinite width, continuous time limit neural networks exhibit certain invariances during training (Jacot et al., 2018), further motivating the study of networks at initialization. In fact, the general strategy of proving the existence of some property beneficial for training at initialization and controlling it during the training process is the core idea behind a number of recent global convergence results for over-parametrized networks (Du et al., 2018; Allen-Zhu et al., 2018), some of which (Allen-Zhu et al., 2018) also rely explicitly on control of forward and backward signal propagation (albeit not defined in exactly the same sense as in this work).

As neural network training is a nonconvex problem, using a modified initialization scheme could lead to convergence to different points in parameter space in a way that adversely affects the generalization error. We provide some empirical evidence that this does not occur, and in fact, the use of initialization schemes satisfying these conditions has a beneficial effect on the generalization error.

2.2. The exploding/vanishing gradient problem and signal propagation in recurrent networks

The exploding/vanishing gradient problem is a well-known phenomenon that hampers training on long time sequence tasks (Bengio et al., 1994; Pascanu et al., 2013). Apart from the gating mechanism, there have been numerous proposals to alleviate the vanishing gradient problem by constraining the weight matrices to be exactly or approximately orthogonal (Pascanu et al., 2013; Wisdom et al., 2016; Vorontsov et al., 2017; Jose et al., 2017), or more recently by modifying some terms in the gradient (Arpit et al., 2018), while exploding gradients can be handled by clipping (Pascanu et al., 2013). Another recently proposed approach to ensuring signal propagation in long sequence tasks introduces auxiliary loss functions (Trinh et al., 2018). This modification of the loss can be seen as a form of regularization. Chang et al. (2019) study the connections between recurrent networks and certain ordinary differential equations and propose the AntisymmetricRNN that can capture long term dependencies in the inputs. While many of these approaches have been quite successful, they typically require modifying the training algorithm, the loss function, or the architecture, and as such exist as complementary methods to the one we investigate here. We postpone the investigation of a combination of techniques to future work.


VanillaRNN: s^t = h^t; K = {f}; p = s^t; f = Id.

MinimalRNN (Chen et al., 2018): s^t = h^t; K = {f, r}; p = s^t; f = σ(u^t_f) ◦ s^{t−1} + (1 − σ(u^t_f)) ◦ x^t.

GRU (Cho et al., 2014): s^t = h^t; K = {f, r}; g_k = σ(u_r); p = s^t; f = σ(u^t_f) ◦ s^{t−1} + (1 − σ(u^t_f)) ◦ tanh(u^t_{r2}).

peephole LSTM (Gers et al., 2002): s^t = c^t; K = {i, f, r, o}; p = σ(u^t_o) ◦ tanh(s^t); f = σ(u^t_f) ◦ s^{t−1} + σ(u^t_i) ◦ tanh(u^t_r).

LSTM (Hochreiter & Schmidhuber, 1997): s^t = h^t; K = {i, f, r, o}; p = s^t; f = σ(u^t_o) ◦ tanh( c^0 + Σ_{j=1}^t Π_{l=j}^t σ(u^l_f) ◦ σ(u^j_i) ◦ tanh(u^j_r) ).

Table 1. A number of recurrent architectures written in the form (1). The LSTM cell state is unrolled in order to emphasize that it can be written as a function of variables that are Gaussian at the large N limit.

3. Notation

We denote matrices by bold upper case Latin characters and vectors by bold lower case Latin characters. Dx denotes a standard Gaussian measure. The normalized trace of a random N × N matrix A, (1/N) E tr(A), is denoted by τ(A). ◦ is the Hadamard product, σ(·) is a sigmoid function, and both σ(·) and tanh(·) act element-wise. We denote by D_a a diagonal matrix with a on the diagonal.

4. Mean field analysis of signal propagation and dynamical isometry

4.1. Model description and important assumptions

We present a general setup in this section and subsequently specialize to the case of the GRU, LSTM and peephole LSTM. This section follows closely the development in (Chen et al., 2018). We denote the state of a recurrent network at time t by s^t ∈ R^N with s^0_i ∼ D_0, and a sequence of inputs to the network¹ by {z^1, ..., z^T}, z^t ∈ R^N. We also define sets of subscripts K and pre-activations u^t_k ∈ R^N, k ∈ K, defined by

u^t_k = W_k s^{t−1} + U_k z^t + b_k    (1a)

where W_k, U_k ∈ R^{N×N}, b_k ∈ R^N. We define additional variables given by

u^t_{k2} = W_{k2} D_{g_k(u^t_k)} s^{t−1} + U_{k2} z^t + b_{k2}    (1b)

where g_k : R^N → R^N is an element-wise function and u^t_k is defined as in eqn. (1a).² In cases where there is no need to distinguish between variables of the form (1a) and (1b) we will refer to both as u^t_k. The state evolution of the network is given by

s^t = f(s^{t−1}, {u^1_k}, ..., {u^t_k}, x^t)    (1c)

where f is an element-wise, affine function of s^{t−1}. The output of the network at every time step is given by p(s^t, {u^t_k}). These dynamics describe all the architectures studied in this paper, as well as many others, as detailed in Table 1. In the case of the peephole LSTM and GRU, eqn. (1c) will be greatly simplified.

¹ We assume z^t ∈ R^N for convenience of exposition but this does not limit the generality of the argument. We could for instance work with inputs x^t ∈ R^M for some M ≠ N and define a feature map Φ: z^t = Φ(x^t) ∈ R^N.

In order to understand the properties of a neural network at initialization, we take the weights of the network to be random variables. Specifically, we assume W_{k,ij} ∼ N(0, σ²_k/N), U_{k,ij} ∼ N(0, ν²_k/N), b_{k,i} ∼ N(µ_k, ρ²_k) i.i.d., and denote Θ = ⋃_k {σ²_k, ν²_k, ρ²_k, µ_k}. As in (Chen et al., 2018), we make the untied weights assumption W_k ⊥ s^t. Tied weights would increase the autocorrelation of states across time, but this might be dominated by the increase due to the correlations between input tokens when the latter are strong. Indeed, we provide empirical evidence that calculations performed under this assumption still have considerable predictive power in cases where it is violated.
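To make the setup concrete, the following is a minimal NumPy sketch (ours, not code from the paper) that samples weights according to Θ = {σ²_k, ν²_k, ρ²_k, µ_k} as above and iterates the peephole LSTM update from Table 1 with i.i.d. Gaussian inputs; the specific Θ values are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gate_params(N, sigma2, nu2, rho2, mu, rng):
    """Sample (W_k, U_k, b_k) for one gate k as in eqn (1a):
    W_k,ij ~ N(0, sigma2/N), U_k,ij ~ N(0, nu2/N), b_k,i ~ N(mu, rho2)."""
    W = rng.normal(0.0, np.sqrt(sigma2 / N), size=(N, N))
    U = rng.normal(0.0, np.sqrt(nu2 / N), size=(N, N))
    b = rng.normal(mu, np.sqrt(rho2), size=N)
    return W, U, b

def peephole_lstm_step(c, z, params):
    """One step of the peephole LSTM from Table 1:
    c_t = sigma(u_f) * c_{t-1} + sigma(u_i) * tanh(u_r), with u_k = W_k c_{t-1} + U_k z_t + b_k."""
    u = {k: W @ c + U @ z + b for k, (W, U, b) in params.items()}
    return sigmoid(u["f"]) * c + sigmoid(u["i"]) * np.tanh(u["r"])

rng = np.random.default_rng(0)
N, T = 256, 100
# Theta: (sigma2, nu2, rho2, mu) for each gate; these values are illustrative, not tuned.
theta = {"i": (1.0, 1.0, 0.0, 0.0), "f": (1.0, 1.0, 0.0, 1.0), "r": (1.0, 1.0, 0.0, 0.0)}
params = {k: init_gate_params(N, *v, rng) for k, v in theta.items()}
c = rng.normal(size=N)
for t in range(T):
    c = peephole_lstm_step(c, rng.normal(size=N), params)
print("state second moment after", T, "steps:", float(np.mean(c**2)))
```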

4.2. Forward signal propagation

We now consider two sequences of normalized inputs {z^t_a}, {z^t_b} with zero mean and covariance R [1, Σ_z; Σ_z, 1], fed into two copies of a network with identical weights, resulting in sequences of states {s^t_a}, {s^t_b}. We are interested in the moments and correlations

µ^t_s = E[s^t_{ia}]    (2a)
Q^t_s ≡ Q^t_{s,aa} = E[s^t_{ia} s^t_{ia}]    (2b)
Σ^{t2}_s C^t_s + (µ^t_s)² = E[s^t_{ia} s^t_{ib}]    (2c)

where we define Σ^{t2}_s = Q^t_s − (µ^t_s)² (and Σ^{t2}_k analogously).

At the large N limit we can invoke a CLT to find

(u^t_{kia}, u^t_{kib}) ∼ N( (µ_k, µ_k), Σ^{t2}_k [1, C^t_k; C^t_k, 1] )    (3)

² For the specific instances we consider in detail, variables of the form (1b) will be present only in the GRU (see Table 1).


where the second moment Q^t_k is given by

Q^t_k = E[u^t_{kia} u^t_{kia}] = σ²_k E[s^t_{ia} s^t_{ia}] + ν²_k E[z^t_{ia} z^t_{ia}] + ρ²_k + µ²_k = σ²_k Q^t_s + ν²_k R + ρ²_k + µ²_k    (4a)

and the correlations C^t_k by Σ^{t2}_k C^t_k + µ²_k = E[u^t_{kia} u^t_{kib}], and hence

C^t_k = ( σ²_k (Σ^{t2}_s C^t_s + µ²_s) + ν²_k R Σ_z + ρ²_k ) / ( Q^t_k − µ²_k ).    (4b)

The variables u^t_{k2,ia}, u^t_{k2,ib} given by (1b) are also asymptotically Gaussian, with their covariance detailed in Appendix A for brevity of exposition. We conclude that

(u^t_{kia}, u^t_{kib}) ∼ N( (µ_k, µ_k), Σ^{t2}_k [1, C^t_k; C^t_k, 1] )    (5)

and u^t_{k2,ia}, u^t_{k2,ib} are distributed analogously with respect to Q^t_{k2}, C^t_{k2}, µ_{k2}. We will subsequently drop the vector subscript i, since all elements are identically distributed and f, g_k act element-wise, as well as the input sequence index in expressions that involve only one sequence.

For any l ≤ t, the u^l_k are independent of s^l at the large N limit. Combining this with the fact that their distribution is determined completely by µ^{l−1}_s, Q^{l−1}_s, C^{l−1}_s and that f is affine, one can rewrite eqn. (2) using eqn. (1) as the following deterministic dynamical system

(µ^t_s, Q^t_s, C^t_s) = M(µ^{t−1}_s, Q^{t−1}_s, C^{t−1}_s, ..., µ^1_s, Q^1_s, C^1_s)    (6)

where the dependence on Θ and the data distribution has been suppressed. In the peephole LSTM and GRU, the form will be greatly simplified to M(µ^{t−1}_s, Q^{t−1}_s, C^{t−1}_s). In Appendix C we compare the predicted dynamics defined by eqn. (6) to simulations, showing good agreement.
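As an illustration of how the map M in eqn. (6) can be evaluated for the peephole LSTM, the following Monte Carlo sketch (ours, with illustrative Θ and input second moment R = 1) iterates the first and second moments of the state; for this architecture only the first two moments of c^{t−1} enter the update, because f is affine in it and the u^t_k are independent of it at the large N limit.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative hyperparameters Theta: per-gate (sigma2, nu2, rho2, mu); input second moment R = 1.
theta = {"i": (0.5, 1.0, 0.0, 0.0), "f": (0.5, 1.0, 0.0, 1.0), "r": (0.5, 1.0, 0.0, 0.0)}
R = 1.0

def moment_map(mu_c, Q_c, n_mc=200_000):
    """One step of the moment dynamics in eqn (6) for the peephole LSTM:
    Q_k follows eqn (4a); the Gaussian expectations are estimated by Monte Carlo."""
    u = {}
    for k, (s2, n2, r2, mu) in theta.items():
        Qk = s2 * Q_c + n2 * R + r2 + mu**2               # eqn (4a)
        u[k] = mu + np.sqrt(Qk - mu**2) * rng.standard_normal(n_mc)
    c_prev = mu_c + np.sqrt(max(Q_c - mu_c**2, 0.0)) * rng.standard_normal(n_mc)
    c_new = sigmoid(u["f"]) * c_prev + sigmoid(u["i"]) * np.tanh(u["r"])
    return float(c_new.mean()), float((c_new**2).mean())

mu_c, Q_c = 0.0, 1.0
for _ in range(50):
    mu_c, Q_c = moment_map(mu_c, Q_c)
print("approximate fixed point (mu_c*, Q_c*):", mu_c, Q_c)
```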

One can now study the fixed points of eqn. (6) and the rate of convergence to these fixed points by linearizing around them. The fixed points are pathological, in the sense that any information that distinguishes two input sequences is lost upon convergence. Therefore, delaying the convergence to the fixed point should allow signals to propagate across longer time horizons. Quantitatively, the rate of convergence to the fixed point gives an effective time scale for forward signal propagation in the network.

While the dynamical system is multidimensional, and an analysis of convergence rates should in principle be performed by linearizing the full system and studying the smallest eigenvalue of the resulting matrix, in practice, as in (Chen et al., 2018), this eigenvalue appears to always correspond to the C^t_s direction.³ Hence, if we assume convergence of Q^t_s and µ^t_s, we need only linearize

C^t_s = M_C(µ*_s, Q*_s, C^{t−1}_s)    (7)

where M_C also depends on expectations of functions of {u^1_k}, ..., {u^{t−2}_k} that do not depend on C^{t−1}_s. While this dependence is in principle on an infinite number of Gaussian variables as the dynamics approach the fixed point, M_C can still be reasonably approximated in the case of the LSTM as detailed in Section 4.5, while there is no such dependence for the peephole LSTM and GRU (see Table 1).

³ This observation is explained in the case of fully-connected feed-forward networks in (Schoenholz et al., 2016).

We study the dynamics approaching the fixed point by setting C^t_s = C*_s + ε^t and writing

C*_s + ε^{t+1} = M_C(µ*_s, Q*_s, C*_s + ε^t) = M_C(µ*_s, Q*_s, C*_s) + dM_C(µ*_s, Q*_s, C_s)/dC_s |_{C_s = C*_s} · ε^t + O((ε^t)²)

and since M_C(µ*_s, Q*_s, C*_s) = C*_s,

ε^{t+1}/ε^t = χ_{C*_s} + O(ε^t), where χ_{C*_s} ≡ [ ∂M_C(µ*_s, Q*_s, C_s)/∂C_s + Σ_k ∂M_C(µ*_s, Q*_s, C_s)/∂C_k · dC_k/dC_s ] |_{C_s = C*_s}.    (8)

We show in Appendix B that in the case of the peephole LSTM this map is convex. This can be shown for the GRU by a similar argument. It follows directly that the map has a single stable fixed point in these cases.

The time scale of convergence to the fixed point is given by

ξ_{C*_s} = −1 / log χ_{C*_s},    (9)

which diverges as χ_{C*_s} approaches 1 from below. Due to the detrimental effect of convergence to the fixed point described above, it stands to reason that a choice of Θ such that χ_{C*_s} = 1 − δ for some small δ > 0 would enable signals to propagate from the initial inputs to the final hidden state when training on long sequences.
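For a sense of the scales involved, eqn. (9) can be evaluated directly for a few values of χ_{C*_s} (a purely illustrative calculation):

```python
import numpy as np

# xi = -1 / log(chi), eqn (9): the closer chi is to 1 from below,
# the more slowly the correlation C_s^t converges to its fixed point.
for chi in [0.5, 0.9, 0.99, 0.999]:
    print(f"chi = {chi:6.3f}  ->  xi = {-1.0 / np.log(chi):8.1f} steps")
```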

4.3. Backwards signal propagation - the state-to-state Jacobian

We now turn to controlling the gradients of the network. A useful object to consider in this case is the asymptotic state-to-state transition Jacobian

J = lim_{t→∞} ∂s^{t+1}/∂s^t.

This matrix and powers of it will appear in the gradients of the output with respect to the weights as the dynamics approach the fixed point (specifically, the gradient of a network trained on a sequence of length T will depend on a matrix polynomial of order T in J), hence we desire to control the squared singular value distribution of this matrix.


The moments of the squared singular value distribution are given by the normalized traces

m_{JJ^T,n} = τ((JJ^T)^n).

Since U^{t′}, t′ < t is independent of s^t, if we index by k, k′ the variables defined by eqn. (1a) and (1b) respectively, we obtain

J = ∂f/∂s* + Σ_k (∂f/∂u*_k) W_k + Σ_{k′} (∂f/∂u*_{k′2}) W_{k′2} ( D_{g_k(u*_{k′})} + W_{k′} D_{g′_k(u*_{k′})} D_{s*} ).    (10)

Under the untied assumption, W_k, W_{k2} are independent of s^t, u^t_k at the large N limit, are also independent of each other, and their elements have mean zero. Using this and the fact that f acts element-wise, we have

τ(JJ^T) = E Σ_{k ∈ K ∪ {0}} a_k    (11)

where

a_k = D²_0                                                     if k = 0
a_k = σ²_k D²_k                                                if u_k is given by (1a)
a_k = σ²_{k2} D²_k ( g²_k(u*_k) + σ²_k s*² g′²_k(u*_k) )       if u_k is given by (1b)    (12)

and D_0 = ∂f/∂s*, D_k = ∂f/∂u*_k.

The values of D_0, D_k for the architectures considered in the paper are detailed in Appendix A. Forward and backward signal propagation are in fact intimately related, as the following lemma shows:

Lemma 1. For a recurrent neural network defined by (1), the mean squared singular value of the state-to-state Jacobian defined in (11) and the quantity χ_{C_s} that determines the time scale of forward signal propagation (given by (8)) are related by

m_{JJ^T,1} = χ_{C*_s = 1, Σ_z = 1}.    (13)

Proof. See Appendix B.

Controlling the first moment of JJ^T is not sufficient to ensure that the gradients do not explode or vanish, since the variance of the singular values may still be large. This variance is a function of the first two moments and is given by

σ_{JJ^T} = m_{JJ^T,2} − m²_{JJ^T,1}.

The second moment m_{JJ^T,2} can be calculated from (10), and is given by

m_{JJ^T,2} = E Σ_{k,l ∈ K ∪ {0}} 2 a_k a_l − a²_0    (14)

where the a_k are defined in (12). One could compute higher moments as well, either explicitly or using tools from non-Hermitian random matrix theory.
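The experiment of Section 5.1.2 can be reproduced in miniature: build a random peephole LSTM, form the state-to-state Jacobian explicitly (the expression below is our own differentiation of the Table 1 update, not code released with the paper), and compare the empirical squared singular value moments with m_{JJ^T,1} and m_{JJ^T,2}.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))
dtanh = lambda x: 1.0 - np.tanh(x) ** 2

N, burn_in = 400, 50
# Illustrative hyperparameters (sigma2, nu2, rho2, mu) per gate.
theta = {"i": (0.5, 1.0, 0.0, 0.0), "f": (0.5, 1.0, 0.0, 1.0), "r": (0.5, 1.0, 0.0, 0.0)}
W, U, b = {}, {}, {}
for k, (s2, n2, r2, mu) in theta.items():
    W[k] = rng.normal(0.0, np.sqrt(s2 / N), (N, N))
    U[k] = rng.normal(0.0, np.sqrt(n2 / N), (N, N))
    b[k] = rng.normal(mu, np.sqrt(r2), N)

# Burn in the state with i.i.d. Gaussian inputs so the Jacobian is evaluated near the fixed point.
c = rng.normal(size=N)
for _ in range(burn_in):
    c_prev = c
    u = {k: W[k] @ c_prev + U[k] @ rng.normal(size=N) + b[k] for k in theta}
    c = sigmoid(u["f"]) * c_prev + sigmoid(u["i"]) * np.tanh(u["r"])

# Explicit Jacobian dc_t / dc_{t-1} of the peephole LSTM update.
J = (np.diag(sigmoid(u["f"]))
     + np.diag(c_prev * dsigmoid(u["f"])) @ W["f"]
     + np.diag(np.tanh(u["r"]) * dsigmoid(u["i"])) @ W["i"]
     + np.diag(sigmoid(u["i"]) * dtanh(u["r"])) @ W["r"])

sq_sv = np.linalg.svd(J, compute_uv=False) ** 2
print("m_1 (mean squared singular value):", float(sq_sv.mean()))
print("m_2:", float((sq_sv**2).mean()), " variance:", float((sq_sv**2).mean() - sq_sv.mean()**2))
```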

4.4. Dynamical Isometry

The objective of the analysis is to ensure, at least approximately, that the following equations are satisfied simultaneously:

χ_{C*_s} = 1    (15a)
m_{JJ^T,1} = 1    (15b)
σ_{JJ^T} = 0    (15c)

We refer to these as dynamical isometry conditions. We only demand that these equations are satisfied approximately since, for a given architecture, there may not be a value of Θ that satisfies all the conditions. Additionally, even if such a value exists, the optimal value of χ_{C*_s} for a given task may not be 1. There is some empirical evidence that if the characteristic time scale defined by χ_{C*_s} is much larger than that required for a certain task, performance is degraded. In feed-forward networks as well, there is evidence that the optimal performance is achieved when the dynamical isometry conditions are only approximately satisfied (Pennington et al., 2017). Accounting for this observation is an open problem at present.

Convexity of the map (7) in the case of the peephole LSTM and GRU implies that for Σ_z = 1, conditions (15a) and (15b) can only be satisfied simultaneously if C*_s = 1 (since (15b) implies dM_C/dC |_{C=1} = 1).

For all the architectures considered in this paper, we find that D_0 = σ(u*_f) while the D_k remain finite as σ²_k → 0 for all k. Combining this with (11), (13), (14), we find that if Σ_z = 1 the dynamical isometry conditions are satisfied if a_0 = 1 and a_{k≠0} = 0, which can be achieved by setting σ²_k = 0 for all k and taking µ_f → ∞. This motivates the general form of the initializations used in Section 5.2,⁴ although there are many other possible choices of Θ such that the a_{k≠0} vanish.
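To make the prescription concrete, one can check numerically what it does to the a_k of eqn. (12) for the peephole LSTM, using the values listed in Appendix C.4 (µ_f = 5, σ²_k = 10⁻⁵); this is a small illustration of ours rather than a derivation from the paper:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

mu_f, sigma2 = 5.0, 1e-5   # critical peephole LSTM values from Appendix C.4

# With sigma_k^2 ~ 0 the pre-activations concentrate on their means, so u_f* ~ mu_f and
# a_0 = sigma(u_f*)^2 approaches 1, while every other a_k carries a factor sigma_k^2 and vanishes.
a0 = sigmoid(mu_f) ** 2
print(f"a_0 ~ sigma(mu_f)^2 = {a0:.4f}")
print(f"remaining a_k are O(sigma_k^2) = O({sigma2:.0e})")
```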

4.5. The LSTM cell state distribution

In the case of the peephole LSTM, since the {u^t_k} depend on the second moments of c^{t−1} and f is an affine function of c^{t−1}, one can write a closed form expression for the dynamical system (6) in terms of first and second moments. In the standard LSTM, however, the relevant state h^t depends on the cell state, which has non-trivial dynamics. The cell state differs substantially from other random variables that appear in this analysis since it cannot be expressed as a function of a finite number of variables that are Gaussian at the large N and t limit (see Table 1).

⁴ In the case of the LSTM, we also want to prevent the output gate from taking very small values, as explained in Appendix A.2.


[Figure 2: one panel per architecture showing training accuracy as a function of α (x-axis) and sequence length T (y-axis).]

Figure 2. Training accuracy on the padded MNIST classification task described in 5.1.1 at different sequence lengths T and hyperparameter values Θ_0 + αΘ_1 for networks with untied weights, with different values of Θ_0, Θ_1 chosen for each architecture. The green curves are 3ξ, 6ξ, where ξ is the theoretical signal propagation time scale. As can be seen, this time scale predicts the transition between the regions of high and low accuracy across the different architectures.

Algorithm 1 LSTM hidden state moment fixed point iteration using cell state sampling

FIXEDPOINTITERATION(µ^{t−1}_s, Q^{t−1}_s, Θ, n_s, n_iters):
  Q^{t−1}_k ← CALCULATEQK(Q^{t−1}_s, Θ)    {Using (4)}
  Initialize c ∈ R^{n_s}
  for i ← 1 to n_iters do
    u_i, u_f, u_r ← SAMPLEUS(Q^{t−1}_k, Θ)    {Using (3)}
    c ← UPDATEC(c, u_i, u_f, u_r)    {Using (16)}
  end for
  (µ^t_s, Q^t_s) ← CALCULATEM(µ^{t−1}_s, Q^{t−1}_s, Θ, c)    {Using (6)}
  Return (µ^t_s, Q^t_s)

Since at this limit the u^t_i are independent, by examining the cell state update equation

c^t = σ(u^t_f) ◦ c^{t−1} + σ(u^t_i) ◦ tanh(u^t_r)    (16)

we find that the asymptotic cell state distribution is that of a perpetuity, which is a random variable X that obeys X =_d XY + Z, where Y, Z are random variables and =_d denotes equality in distribution. The stationary distribution of a perpetuity, if it exists, is known to exhibit heavy tails (Goldie, 1991). Aside from the tails, the bulk of the distribution can take a variety of different forms and can be highly multimodal, depending on the choice of Θ, which in turn determines the distributions of Y, Z.

In practice, one can overcome this difficulty by sampling from the stationary cell state distribution, despite the fact that the density and even the likelihood have no closed form. For a given value of Q_h, the variables u^t_k appearing in (16) can be sampled, since their distribution is given by (5) at the large N limit. The update equation (16) can then be iterated, and the resulting samples approximate well the stationary cell state distribution for a range of different choices of Θ, which result in a variety of stationary distribution profiles (see Appendix C.3). The fixed points of (6) can then be calculated numerically as in the deterministic cases, yet care must be taken since the sampling introduces stochasticity into the process. An example of the fixed point iteration equation (6a) implemented using sampling is presented in Algorithm 1. The correlations between the hidden states can be calculated in a similar fashion. In practice, once the number of samples n_s and sampling iterations n_iters is of order 100, reasonably accurate values for the moment evolution and the convergence rates to the fixed point are obtained (see for instance the right panel of Figure 2). The computational cost of the sampling is linear in both n_s and n_iters (as opposed to, say, simulating a neural network directly, in which case the cost is quadratic in n_s).
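A minimal NumPy sketch of the sampling step used in Algorithm 1 (restricted to the cell state and written by us for illustration; the Θ values and Q_h are arbitrary) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_cell_state(theta, Q_h, R=1.0, n_s=100_000, n_iters=100):
    """Approximate the stationary LSTM cell state distribution by iterating eqn (16),
    drawing the pre-activations u_k from the Gaussian implied by eqns (3) and (4a)."""
    c = np.zeros(n_s)
    for _ in range(n_iters):
        u = {}
        for k, (s2, n2, r2, mu) in theta.items():
            Qk = s2 * Q_h + n2 * R + r2 + mu**2                      # eqn (4a)
            u[k] = mu + np.sqrt(Qk - mu**2) * rng.standard_normal(n_s)
        c = sigmoid(u["f"]) * c + sigmoid(u["i"]) * np.tanh(u["r"])  # eqn (16)
    return c

theta = {"i": (0.5, 1.0, 0.0, 0.0), "f": (0.5, 1.0, 0.0, 1.0), "r": (0.5, 1.0, 0.0, 0.0)}
c_samples = sample_cell_state(theta, Q_h=0.5)
print("sampled cell state mean and second moment:",
      float(c_samples.mean()), float((c_samples**2).mean()))
```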

5. Experiments

5.1. Corroboration of calculations

5.1.1. PADDED MNIST CLASSIFICATION

The calculations presented above predict a characteristic time scale ξ (defined in (9)) for forward signal propagation in a recurrent network. It follows that on a task where success depends on propagation of information from the first time step to the final T-th time step, the network will not be trainable for T ≫ ξ. In order to test this prediction, we consider a classification task where the inputs are sequences consisting of a single MNIST digit followed by T − 1 steps of i.i.d. Gaussian noise, and the targets are the digit labels. By scanning across certain directions in hyperparameter space, the predicted value of ξ changes. We plot training accuracy of a network trained with untied weights after 1000 iterations for the GRU and 2000 for the LSTM, as a function of T and the hyperparameter values, and overlay this with multiples of ξ. As seen in Figure 2, we observe good agreement between the predicted time scale of signal propagation and the success of training.


[Figure 3: histograms of squared singular values; left panel "Typical Hyperparameters" (values spread over roughly 0 to 2.5), right panel "Approximate Dynamical Isometry" (values concentrated near 1).]

Figure 3. Squared singular values of the state-to-state Jacobian in eqn. (10) for two choices of hyperparameter settings Θ. The red lines denote the empirical mean and standard deviations, while the dotted lines denote the theoretical prediction based on the calculation described in Section 4.3. Note the dramatic difference in the spectrum caused by choosing an initialization that approximately satisfies the dynamical isometry conditions.

As expected, there are some deviations when training without enforcing untied weights, and we present the corresponding plots in the supplementary materials.
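The padded MNIST inputs themselves are easy to construct; the sketch below is one plausible realization (we assume the digit is fed in as a single flattened time step and that the padding noise has unit variance, details the text does not pin down):

```python
import numpy as np

def make_padded_mnist(images, labels, T, rng):
    """Padded MNIST task of Section 5.1.1: one MNIST digit followed by T - 1 steps of
    i.i.d. Gaussian noise; the target is the digit label.
    images: (B, 784) flattened digits -> returns inputs of shape (B, T, 784)."""
    B, D = images.shape
    seq = rng.standard_normal((B, T, D))
    seq[:, 0, :] = images          # the informative signal appears only at the first step
    return seq, labels

rng = np.random.default_rng(0)
fake_images = rng.random((32, 784))          # stand-in for real MNIST digits
fake_labels = rng.integers(0, 10, size=32)
x, y = make_padded_mnist(fake_images, fake_labels, T=200, rng=rng)
print(x.shape, y.shape)  # (32, 200, 784) (32,)
```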

5.1.2. SQUARED JACOBIAN SPECTRUM HISTOGRAMS

To verify the results of the calculation of the moments of the squared singular value distribution of the state-to-state Jacobian presented in Section 4.3, we run an untied peephole LSTM for 100 iterations with i.i.d. Gaussian inputs. We then compute the state-to-state Jacobian and calculate its spectrum. This can be used to compare the first two moments of the spectrum to the result of the calculation, as well as to observe the difference between a standard initialization and one close to satisfying the dynamical isometry conditions. The results are shown in Figure 3. The validity of this experiment rests on making an ergodicity assumption, since the calculated spectral properties require taking averages over realizations of random matrices, while in the experiment we instead calculate the moments by averaging over the eigenvalues of a single realization. The good agreement between the prediction and the empirical average suggests that the assumption is valid.

5.2. Long sequence tasks

A significant outcome of the calculation is that it motivates critical initializations that dramatically improve the performance of recurrent networks on standard long sequence benchmarks, despite the fact that the calculation is performed under the untied assumption, at the large N limit, and makes some rather unrealistic assumptions about the data distribution. The details of the initializations used are presented in Appendix C.

5.2.1. UNROLLED MNIST AND CIFAR-10

We unroll an MNIST digit into a sequence of length 784 and train a critically initialized peephole LSTM with 600 hidden units. We also train a critically initialized LSTM with hard sigmoid nonlinearities on unrolled CIFAR-10 images, feeding in 3 pixels at every time step, resulting in sequences of length 1024. We also apply standard data augmentation for this task. We present accuracy on the test set in Table 2. Interestingly, in the case of CIFAR-10 the best performance is achieved by an initialization with a forward propagation time scale ξ that is much smaller than the sequence length, suggesting that information sufficient for successful classification may be obtained from a subset of the sequence.

                                      MNIST    CIFAR-10
standard LSTM                         98.6⁵    58.8⁶
h-detach LSTM (Arpit et al., 2018)    98.8     -
critical LSTM                         98.9     61.8

Table 2. Test accuracy on unrolled MNIST and CIFAR-10.

⁵ reproduced from (Arpit et al., 2018). ⁶ reproduced from (Trinh et al., 2018).

5.2.2. REPEATED PIXEL MNIST AND MULTIPLE DIGIT MNIST

In order to generate longer sequence tasks, we modify the unrolled MNIST task by repeating every pixel a certain number of times and set the input dimension to 7. To create a more challenging task, we also combine this pixel repetition with concatenation of multiple MNIST digits (either 0 or 1), and label such sequences by a product of the original labels. In this case, we set the input dimension to 112 and repeat each pixel 10 times. We train a peephole LSTM with both a critical initialization and a standard initialization on both of these tasks using SGD with momentum. In the former task, the dimension of the label space is constant (and not exponential in the number of digits like in the latter). In both tasks, we observe three distinct phases. If the sequence length is relatively short, the critical and standard initialization perform equivalently. As the sequence length is increased, training with a critical initialization is faster by orders of magnitude compared to the standard initialization. As the sequence length is increased further, training with a standard initialization fails, while training with a critical initialization still succeeds. The results are shown in Figure 4.

6. Discussion

In this work, we calculate time scales of signal propagation and moments of the state-to-state Jacobian at initialization for a number of important recurrent architectures.


[Figure 4: training accuracy vs. iteration (10² to 10⁴, log scale) for critical and standard initialization; top row panels: 1, 4 and 8 concatenated digits; bottom row panels: 1, 4 and 24 pixel repetitions.]

Figure 4. Training accuracy for unrolled, concatenated MNIST digits (top) and unrolled MNIST digits with replicated pixels (bottom) for different sequence lengths. Left: For shorter sequences the standard and critical initialization perform equivalently. Middle: As the sequence length is increased, training with a critical initialization is faster by orders of magnitude. Right: For very long sequence lengths, training with a standard initialization fails completely.

The calculation then motivates initialization schemes that dramatically improve the performance of these networks on long sequence tasks by guaranteeing long time scales for forward signal propagation and stable gradients. The subspace of initialization hyperparameters Θ that satisfy the dynamical isometry conditions is multidimensional, and there is no clear principled way to choose a preferred initialization within it. It would be of interest to study this subspace and perhaps identify preferred initializations based on additional constraints. One could also use the satisfaction of the dynamical isometry conditions as a guiding principle in simplifying these architectures. A direct consequence of the analysis is that the forget gate, for instance, is critical, while some of the other gates or weight matrices can be removed while still satisfying the conditions. A related question is the optimal choice of the forward propagation time scale ξ for a given task. As mentioned in Section 5.2.1, this scale can be much shorter than the sequence length. It would also be valuable to understand better the extent to which the untied weights assumption is violated, since it appears that the violation is non-uniform in Θ, and to relax the constant Σ_z assumption by introducing a time dependence.

A natural direction of future work is to attempt an analogous calculation for multi-layer LSTM and GRU networks, or for more complex architectures that combine LSTMs with convolutional layers.

Another compelling issue is the persistence of the dynamical isometry conditions during training and their effect on the solution, for both feed-forward and recurrent architectures. Intriguingly, it has been recently shown that in the case of certain infinitely-wide MLPs, objects that are closely related to the correlations and moments of the Jacobian studied in this work are constant during training with full-batch gradient descent in the continuous time limit, and as a result the dynamics of learning take a simple form (Jacot et al., 2018). Understanding the finite width and learning rate corrections to such calculations could help extend the analysis of signal propagation at initialization to trained networks. This has the potential to improve the understanding of neural network training dynamics, convergence, and ultimately perhaps generalization as well.

References

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

Arpit, D., Kanuparthi, B., Kerg, G., Ke, N. R., Mitliagkas, I., and Bengio, Y. h-detach: Modifying the LSTM gradient towards better optimization. arXiv preprint arXiv:1810.03023, 2018.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

Chang, B., Chen, M., Haber, E., and Chi, E. H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryxepo0cFX.

Chen, M., Pennington, J., and Schoenholz, S. S. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394, 2018.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, 2014.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug):115-143, 2002.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

Goldie, C. M. Implicit renewal theory and tails of solutions of random equations. The Annals of Applied Probability, pp. 126-166, 1991.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

Jose, C., Cisse, M., and Fleuret, F. Kronecker recurrent units. arXiv preprint arXiv:1705.10142, 2017.

Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310-1318, 2013.

Pennington, J., Schoenholz, S., and Ganguli, S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pp. 4785-4795, 2017.

Pennington, J., Schoenholz, S. S., and Ganguli, S. The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360-3368, 2016.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

Trinh, T. H., Dai, A. M., Luong, T., and Le, Q. V. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.

Vorontsov, E., Trabelsi, C., Kadoury, S., and Pal, C. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, pp. 3570-3578, 2017.

Wisdom, S., Powers, T., Hershey, J., Le Roux, J., and Atlas, L. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880-4888, 2016.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., and Pennington, J. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

Yang, G. and Schoenholz, S. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, pp. 7103-7114, 2017.

Yang, G. and Schoenholz, S. S. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion. ICLR Workshop, February 2018. URL https://openreview.net/forum?id=rJGY8GbR-.

Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and Schoenholz, S. S. A mean field theory of batch normalization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyMDXnCcF7.


Appendices

A. Details of Results

A.1. Covariances of u^t_{k2}

The variables u^t_{k2,ia}, u^t_{k2,ib} given by (1b) are asymptotically Gaussian as N → ∞, with

Q^t_{k2} = σ²_{k2} ( ∫ g²_k(u_a) D_k ) Q^t_s + ν²_{k2} R + ρ²_{k2} + µ²_{k2}    (17a)

C^t_{k2} = ( σ²_{k2} ( ∫ g_k(u_a) g_k(u_b) D_k ) (Σ^{t2}_s C^t_s + µ²_s) + ν²_{k2} R Σ_z + ρ²_{k2} ) / ( Q^t_{k2} − µ²_{k2} )    (17b)

where D_k is a Gaussian measure on (u_a, u_b) corresponding to the distribution in eqn. (3).

A.2. Dynamical isometry conditions for selected architectures

We specify the form of χ_{C*_s, Σ} and a for the architectures considered in this paper:

A.2.1. GRU

χ_{C*_s,Σ} = E[ σ(u*_{fa}) σ(u*_{fb})
  + σ²_f ( tanh(u*_{r2,a}) tanh(u*_{r2,b}) + h*_a h*_b ) σ′(u*_{fa}) σ′(u*_{fb})
  + σ²_{r2} (1 − σ(u*_{fa})) (1 − σ(u*_{fb})) tanh′(u*_{r2,a}) tanh′(u*_{r2,b}) ( σ(u*_{r1,a}) σ(u*_{r1,b}) + σ²_{r1} h*_a h*_b σ′(u*_{r1,a}) σ′(u*_{r1,b}) ) ]

a = ( σ²(u*_f),
      σ²_f ( tanh²(u*_{r2}) + Q*_h ) σ′²(u*_f),
      σ²_{r1} σ²_{r2} h*² (1 − σ(u*_f))² tanh′²(u*_{r2}) σ′²(u*_{r1}),
      σ²_{r2} (1 − σ(u*_f))² tanh′²(u*_{r2}) σ²(u*_{r1}) )

A.2.2. PEEPHOLE LSTM

χ_{C*_s,Σ} = E[ σ(u*_{fa}) σ(u*_{fb}) + σ²_i σ′(u*_{ia}) σ′(u*_{ib}) tanh(u*_{ra}) tanh(u*_{rb}) + σ²_f c*_a c*_b σ′(u*_{fa}) σ′(u*_{fb}) + σ²_r σ(u*_{ia}) σ(u*_{ib}) tanh′(u*_{ra}) tanh′(u*_{rb}) ]

a = ( σ²(u*_f),
      σ²_i σ′²(u*_i) tanh²(u*_r),
      σ²_f c*² σ′²(u*_f),
      σ²_r σ²(u*_i) tanh′²(u*_r) )

A.2.3. LSTM

χ_{C*_s,Σ_z} = E[ σ(u*_{fa}) σ(u*_{fb}) + σ²_o σ′(u*_{oa}) σ′(u*_{ob}) tanh(c*_a) tanh(c*_b)
  + σ(u*_{oa}) σ(u*_{ob}) tanh′(c*_a) tanh′(c*_b) ( σ²_f c*_a c*_b σ′(u*_{fa}) σ′(u*_{fb}) + σ²_i σ′(u*_{ia}) σ′(u*_{ib}) tanh(u*_{ra}) tanh(u*_{rb}) + σ²_r σ(u*_{ia}) σ(u*_{ib}) tanh′(u*_{ra}) tanh′(u*_{rb}) ) ]

a = ( σ²(u*_f),
      σ²_i σ²(u*_o) tanh′²(c*) σ′²(u*_i) tanh²(u*_r),
      σ²_f σ²(u*_o) tanh′²(c*) c*² σ′²(u*_f),
      σ²_r σ²(u*_o) tanh′²(c*) σ²(u*_i) tanh′²(u*_r),
      σ²_o σ′²(u*_o) tanh²(c*) )

When evaluating (8) in this case, we write the cell state as c^t = tanh^{−1}( s^t / σ(o^t) ), and assume σ(o^t)/σ(o^{t−1}) ≈ 1 for large t. The stability of the first equation and the accuracy of the second approximation are improved if o^t is not concentrated around 0.

B. Auxiliary Lemmas and Proofs

Proof of Lemma 1. Despite the fact that each u_{kai} as defined in (1) depends in principle upon the entire state vector s_a, at the large N limit, due to the isotropy of the input distribution, we find that these random variables are i.i.d. and independent of the state. Combining this with the fact that f is an element-wise function, it suffices to analyse a single entry of s^t = f(s^{t−1}, {u^1_k}, ..., {u^t_k}), which at the large t limit gives

M_C(µ*_s, Q*_s, C_s) = ( E[f(s_a, U*_a) f(s_b, U*_b)] − (µ*_s)² ) / ( Q*_s − (µ*_s)² )

where U*_a = {u*_{ka} | k ∈ K, u*_{ka} ∼ N(µ_k, Q*_k − µ²_k)} and U*_b is defined similarly (i.e. we assume the first two moments have converged but the correlations between the sequences have not, and in cases where f depends on a sequence {u^1_k}, ..., {u^t_k} we assume the constituent variables have all converged in this way). We represent u*_{ka}, u*_{kb} via a Cholesky decomposition as

u*_{ka} = Σ_k z_{ka} + µ*_k    (18a)
u*_{kb} = Σ_k ( C*_k z_{ka} + √(1 − (C*_k)²) z_{kb} ) + µ*_k    (18b)

where z_{ka}, z_{kb} ∼ N(0, 1) i.i.d. We thus have ∂u*_{kb}/∂C_k = √(Q*_k − (µ*_k)²) ( z_{ka} − (C_k / √(1 − C²_k)) z_{kb} ). Combining this with the fact that ∫ Dz g(z) z = ∫ Dz g′(z) for any g(z), integration by parts gives, for any g_1, g_2,

∂/∂C_k ∫ Dz_{ka} Dz_{kb} g_1(u*_{ka}) g_2(u*_{kb}) = ∫ Dz_{ka} Dz_{kb} g_1(u*_{ka}) (∂g_2(u*_{kb})/∂u*_{kb}) (∂u*_{kb}/∂C_k) = Σ²_k ∫ Dz_{ka} Dz_{kb} (∂g_1(u*_{ka})/∂u*_{ka}) (∂g_2(u*_{kb})/∂u*_{kb}).    (19)

Denoting Σ²_s = Q*_s − (µ*_s)² and defining Σ_k, Σ_{k2} similarly, we have

dC_{k_a}/dC_s = σ²_{k_a} Σ²_s / Σ²_{k_a}    (20)

dC_{k_b}/dC_s = (σ²_{k2} Σ²_s / Σ²_{k2}) ∫ [ g_k(u*_{ka}) g_k(u*_{kb}) + ((Σ²_s C_s + (µ*_s)²)/Σ²_s) (∂ g_k(u*_{ka}) g_k(u*_{kb}) / ∂C_k) (dC_k/dC_s) ] D_k
             = (σ²_{k2} Σ²_s / Σ²_{k2}) ∫ [ g_k(u*_{ka}) g_k(u*_{kb}) + σ²_k (Σ²_s C_s + µ*²_s) g′_k(u*_{ka}) g′_k(u*_{kb}) ] D_k

where in the last equality we used (19). Using (19) again gives

∂M_C(µ*_s, Q*_s, C_s)/∂C_k = Σ²_k ∫ Dz_{ka} Dz_{kb} (∂f(u*_{ka})/∂u*_{ka}) (∂f(u*_{kb})/∂u*_{kb}).

We now note, using (4), that if C_s = 1 and Σ_z = 1 we obtain C_k = 1, and thus

dC_{k2}/dC_s |_{C_s=1, Σ_z=1} = (σ²_{k2} Σ²_s / Σ²_{k2}) ( ∫ g²_k(u*_k) D_k + σ²_k Q*_s ∫ (g′_k(u*_k))² D_k )

∂M_C(µ*_s, Q*_s, C_s)/∂C_k |_{C_s=1, Σ_z=1} = Σ²_k ∫ Dz_k (∂f(u*_k)/∂u*_k)²

∂M_C(µ*_s, Q*_s, C_s)/∂C_s |_{C_s=1, Σ_z=1} = E[(∂f(s)/∂s)²].

Combining the above equations with (20) and comparing the result to (11) completes the proof.
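The integration-by-parts identity (19), which the proof uses repeatedly, is easy to check numerically with the Cholesky representation (18); the following sketch (ours, with arbitrary µ, Σ and g = tanh) compares a finite-difference derivative in C with Σ² E[g′(u_a) g′(u_b)]:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
za, zb = rng.standard_normal(n), rng.standard_normal(n)   # shared noise across values of C

mu, Sigma = 0.3, 1.2
g = np.tanh
dg = lambda x: 1.0 - np.tanh(x) ** 2

def u_pair(C):
    """Correlated Gaussians via the Cholesky representation of eqn (18)."""
    ua = Sigma * za + mu
    ub = Sigma * (C * za + np.sqrt(1.0 - C**2) * zb) + mu
    return ua, ub

def E_prod(C, f1, f2):
    ua, ub = u_pair(C)
    return float(np.mean(f1(ua) * f2(ub)))

C, eps = 0.5, 1e-4
lhs = (E_prod(C + eps, g, g) - E_prod(C - eps, g, g)) / (2 * eps)  # d/dC E[g(u_a) g(u_b)]
rhs = Sigma**2 * E_prod(C, dg, dg)                                  # right-hand side of eqn (19)
print(lhs, rhs)  # the two estimates agree up to Monte Carlo and finite-difference error
```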

Lemma 2. For any odd function g(x) and (x, y) ∼ N( (µ, µ), Σ [1, C; C, 1] ) with 0 ≤ C ≤ 1, we have

E[g(x) g(y)] ≥ 0.

Proof of Lemma 2. For C = 1 the proof is trivial. We now assume 0 ≤ C < 1. We split up R² into four orthants and consider a point (a, b) with a, b ≥ 0. We have g(a)g(b) = g(−a)g(−b) = −g(a)g(−b) = −g(−a)g(b) ≥ 0. We will show that p(a, b) + p(−a, −b) > p(a, −b) + p(−a, b), where p is the probability density function of (x, y), and hence the points where the integrand is positive will contribute more to the integral than the ones where it is negative. Plugging these points into p(x, y) gives

( p(a, b) + p(−a, −b) ) / ( p(a, −b) + p(−a, b) )
  = ( e^{−α[(a−µ)² − 2C(b−µ)(a−µ) + (b−µ)²]} + e^{−α[(a+µ)² − 2C(b+µ)(a+µ) + (b+µ)²]} ) / ( e^{−α[(a−µ)² + 2C(b+µ)(a−µ) + (b+µ)²]} + e^{−α[(a+µ)² + 2C(b−µ)(a+µ) + (b−µ)²]} )

where α is some positive constant that depends on the determinant of the covariance (since C < 1 the matrix is invertible and the determinant is positive). This ratio equals

( exp(2αCab) / exp(−2αCab) ) · ( cosh(2µα(1 − C)(a + b)) / cosh(2µα(1 − C)(a − b)) ) ≥ 1

where the last inequality holds for 0 ≤ C < 1. It follows that the positive contribution to the integral is larger than the negative one, and repeating this argument for every (a, b) in the positive orthant gives the desired claim (if a = 0 or b = 0 the four points in the analysis are not distinct, but the inequality still holds and the integrand vanishes in any case).

Lemma 3. The map (6) is convex in the case of the peephole LSTM.

Proof of Lemma 3. We have

M(C^t_c) = ( ∫ f^t_a f^t_b ( (Q*_c − (µ*_c)²) C^t_c + µ*²_c ) + ∫ i^t_a i^t_b ∫ r^t_a r^t_b + 2 ∫ f^t ∫ i^t ∫ r^t µ*_c − µ*²_c ) / ( Q*_c − µ*²_c ).

From the definition of u^t_{kb} and C^t_k we have

∂u^t_{kb}/∂C^t_c = √(Q*_k − (µ*_k)²) ( z_{ka} − (C^t_k/√(1 − C^{t2}_k)) z_{kb} ) ∂C^t_k/∂C^t_c = ( z_{ka} − (C^t_k/√(1 − C^{t2}_k)) z_{kb} ) σ²_k (Q*_c − µ*²_c) / √(Q*_k − µ*²_k)

and using ∫ Dx g(x) x = ∫ Dx g′(x) we then obtain, for any g(x),

∂/∂C^t_c ∫ Dz_{ka} Dz_{kb} g(u^t_{ka}) g(u^t_{kb}) = ∫ Dz_{ka} Dz_{kb} g(u^t_{ka}) g′(u^t_{kb}) ∂u^t_{kb}/∂C^t_c = σ²_k (Q*_c − µ*²_c) ∫ Dz_{ka} Dz_{kb} g′(u^t_{ka}) g′(u^t_{kb}).    (21)

Then

(Q*_c − µ*²_c) ∂²M(C_c)/∂C²_c = ∂²/∂C²_c [ ∫ f_a f_b ( (Q*_c − (µ*_c)²) C_c + µ*²_c ) + ∫ i_a i_b ∫ r_a r_b ]
  = (∂²∫ f_a f_b / ∂C²_c) ( (Q*_c − (µ*_c)²) C_c + µ*²_c ) + 2 (∂∫ f_a f_b / ∂C_c) (Q*_c − (µ*_c)²)
    + (∂²∫ i_a i_b / ∂C²_c) ∫ r_a r_b + 2 (∂∫ i_a i_b / ∂C_c) (∂∫ r_a r_b / ∂C_c) + ∫ i_a i_b (∂²∫ r_a r_b / ∂C²_c).

From (21) and the non-negativity of some of the integrands,

(Q*_c − µ*²_c) ∂²M(C_c)/∂C²_c ≥ (∂²∫ f_a f_b / ∂C²_c) ( (Q*_c − (µ*_c)²) C_c + µ*²_c ) + (∂²∫ i_a i_b / ∂C²_c) ∫ r_a r_b + ∫ i_a i_b (∂²∫ r_a r_b / ∂C²_c).

From Lemma 2 we have ∫ r_a r_b ≥ 0 and ∂²∫ g_a g_b / ∂C²_c = α ∫ g″_a g″_b ≥ 0 for g = f, i, r. We thus have

∂²M(C_c)/∂C²_c ≥ 0

for 0 ≤ C_c ≤ 1.

Convexity of this map has a number of consequences. One immediate one is that the map has at most one stable fixed point.

C. Additional Experiments and Details of Experiments

C.1. Dynamical system

We simulate the dynamics of (14b) for a GRU using inputs with Σ^t_z = 0 for t < 10 and Σ^t_z = 1 for t ≥ 10. The results show good agreement in the untied case between the calculation at the large N limit and the simulation, as shown in Figure 5.

Figure 5. Top: Dynamics of the correlation (6b) for the GRU with 3 different values of µ_f as a function of time. The dashed line is the prediction from the mean field calculation, while the red curves are from a simulation of the network with i.i.d. Gaussian inputs. Left: Network with untied weights. Right: Network with tied weights. Bottom: The predicted fixed point of (6b) as a function of different µ_f. Left: Network with untied weights. Right: Network with tied weights.

C.2. Heatmaps

In Figure 6 we present results of training on the same task shown in Figure 2 with tied weights, showing the deviations resulting from the violation of the untied weights assumption.

C.3. Sampling the LSTM cell state distribution

As described in Section 4.5, calculating the signal propagation time scale and the moments of the state-to-state Jacobian for the LSTM requires integrating with respect to the stationary cell state distribution. The method for doing this is described in Algorithm 1. As is shown in Figure 7, this distribution can take different forms based on the choice of initialization hyperparameters Θ, but in all cases we have studied, the proposed algorithm appears to provide a reasonable approximation to this distribution efficiently. The simulations are obtained by feeding a network of width N = 200 with i.i.d. Gaussian inputs.


[Figure 6: three heatmap panels (GRU, peephole LSTM, LSTM); y-axis: sequence length T on a log scale.]

Figure 6. Training accuracy on the padded MNIST classification task described in 5.1.1 at different sequence lengths T and hyperparameter values Θ for networks with tied weights. The green curves are multiples of the forward propagation time scale ξ calculated under the untied assumption. We generally observe improved performance when the predicted value of ξ is high, yet the behavior of the network with tied weights is not part of the scope of the current analysis and deviations from the prediction are indeed observed.

Figure 7. Sampling from the LSTM cell state distribution using Algorithm 1, showing good agreement with the cell state distribution obtained by simulating a network with untied weights. The two panels correspond to two different choices of Θ.


C.4. Critical initializations

Peephole LSTM:

µ_i, µ_r, µ_o, ρ²_i, ρ²_f, ρ²_r, ρ²_o, ν²_i, ν²_f, ν²_r, ν²_o = 0
µ_f = 5
σ²_i, σ²_f, σ²_r, σ²_o = 10⁻⁵

LSTM (unrolled CIFAR-10 task):

µ_i, µ_r, µ_o, ρ²_i, ρ²_f, ρ²_r, ρ²_o, ν²_f, ν²_o = 0
ν²_i, ν²_r, σ²_o, µ_f = 1
σ²_i, σ²_f, σ²_r = 10⁻⁵
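For convenience, the two settings above can be transcribed into per-gate dictionaries of Θ = {σ²_k, ν²_k, ρ²_k, µ_k} (a plain re-encoding of the listed values; gate keys follow the notation of Table 1):

```python
# Critical initialization hyperparameters from Appendix C.4, one entry per gate.
CRITICAL_PEEPHOLE_LSTM = {
    "i": dict(sigma2=1e-5, nu2=0.0, rho2=0.0, mu=0.0),
    "f": dict(sigma2=1e-5, nu2=0.0, rho2=0.0, mu=5.0),
    "r": dict(sigma2=1e-5, nu2=0.0, rho2=0.0, mu=0.0),
    "o": dict(sigma2=1e-5, nu2=0.0, rho2=0.0, mu=0.0),
}

CRITICAL_LSTM_CIFAR10 = {
    "i": dict(sigma2=1e-5, nu2=1.0, rho2=0.0, mu=0.0),
    "f": dict(sigma2=1e-5, nu2=0.0, rho2=0.0, mu=1.0),
    "r": dict(sigma2=1e-5, nu2=1.0, rho2=0.0, mu=0.0),
    "o": dict(sigma2=1.0, nu2=0.0, rho2=0.0, mu=0.0),
}
```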

C.5. Standard initialization

LSTM and peephole LSTM:

Kernel matrices (corresponding to the choice of ν²_k): Glorot uniform initialization (Glorot & Bengio, 2010).
Recurrent matrices (corresponding to the choice of σ²_k): orthogonal initialization (i.i.d. Gaussian initialization with variance 1/N was also used, giving analogous performance).
µ_i, µ_r, µ_o, ρ²_i, ρ²_f, ρ²_r, ρ²_o = 0
µ_f = 1
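A minimal NumPy rendering of this standard baseline (assuming square N × N matrices; the Glorot uniform limit and the QR-based orthogonal sampler are standard constructions and not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot uniform initialization (Glorot & Bengio, 2010):
    entries drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))   # sign fix so the sampled matrix is uniformly distributed

N = 128
U_f = glorot_uniform(N, N, rng)   # kernel (input) matrix for the forget gate
W_f = random_orthogonal(N, rng)   # recurrent matrix for the forget gate
b_f = np.ones(N)                  # forget gate bias: mu_f = 1, rho_f^2 = 0
```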

C.5.1. LONG SEQUENCE TASKS

Learning rate scan: 8 equally spaced points between 10−2 and 10−5.