

Journal of Machine Learning Research 20 (2019) 1-47 Submitted 9/17; Revised 2/19; Published 3/19

Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations

Qianxiao Li [email protected]
Institute of High Performance Computing
Agency for Science, Technology and Research
1 Fusionopolis Way, Connexis North, Singapore 138632

Cheng Tai [email protected]
Beijing Institute of Big Data Research
and Peking University
Beijing, China, 100080

Weinan E [email protected]
Princeton University
Princeton, NJ 08544, USA
Beijing Institute of Big Data Research
Beijing, China, 100080

Editor: Francis Bach

Abstract

We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter are approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as a weak approximation, which leads to a number of precise and useful results on the approximations of stochastic gradient descent (SGD), momentum SGD and stochastic Nesterov's accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting.

Keywords: stochastic gradient algorithms, modified equations, stochastic differential equations, momentum, Nesterov's accelerated gradient

1. Introduction

Stochastic gradient algorithms are often used to solve optimization problems of the form

min_{x∈Rd} f(x) := Efγ(x)    (1.1)

where {fr : r ∈ Γ} is a family of functions from Rd to R and γ is a Γ-valued random variable, with respect to which the expectation is taken (these notions will be made precise in the following sections). For empirical loss minimization in supervised learning applications, γ is usually a uniform random variable taking values in Γ = {1, 2, . . . , n}. In this case, f is the total empirical loss function and fr, r ∈ Γ, are the loss functions due to the rth training sample.

©2019 Qianxiao Li, Cheng Tai and Weinan E.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-526.html.


In this paper, we shall consider the general situation of an expectation over arbitrary index sets and distributions.

Solving (1.1) using the standard gradient descent (GD) on x gives the iteration scheme

xk+1 = xk − η∇Efγ(xk), (1.2)

for k ≥ 0, where η is a small positive step-size known as the learning rate. Note that this requires the evaluation of the gradient of an expectation, which can be costly (in the empirical risk minimization case, this happens when n is large). In its simplest form, the stochastic gradient descent (SGD) algorithm replaces the expectation of the gradient with a sampled gradient, i.e.

xk+1 = xk − η∇fγk(xk), (1.3)

where the γk are independent and identically distributed (i.i.d.) random variables with the same distribution as γ. Under mild conditions, we then have E[∇fγk(xk) | xk] = ∇Efγ(xk) ≡ ∇f(xk). In other words, (1.3) is a sampled version of (1.2).
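To make the contrast between (1.2) and (1.3) concrete, the following minimal NumPy sketch implements both updates for an empirical-risk instance with Γ = {1, . . . , n}; the quadratic sample losses, the data A, y and all parameter values are illustrative assumptions and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 100, 5, 0.05
A = rng.standard_normal((n, d))          # data defining the illustrative sample losses
y = rng.standard_normal(n)

def grad_fr(x, r):
    """Gradient of the r-th sample loss f_r(x) = 0.5*(a_r . x - y_r)^2 (assumed example)."""
    return (A[r] @ x - y[r]) * A[r]

def grad_f(x):
    """Gradient of the total empirical loss f = (1/n) * sum_r f_r."""
    return np.mean([grad_fr(x, r) for r in range(n)], axis=0)

x_gd = np.zeros(d)
x_sgd = np.zeros(d)
for k in range(200):
    x_gd = x_gd - eta * grad_f(x_gd)          # GD update (1.2): full gradient of the expectation
    r = rng.integers(n)                       # sample gamma_k uniformly from {0, ..., n-1}
    x_sgd = x_sgd - eta * grad_fr(x_sgd, r)   # SGD update (1.3): one sampled gradient

print("distance between GD and SGD iterates:", np.linalg.norm(x_gd - x_sgd))
```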

In the literature, many convergence results are available for SGD and its variants (Shamir and Zhang, 2013; Moulines and Bach, 2011; Needell et al., 2014; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2014; Bach and Moulines, 2013; Défossez and Bach, 2015). However, it is often the case that different analysis techniques must be adopted for different variants of the algorithms, and there has generally been no systematic approach for studying their precise dynamical properties. In Li et al. (2015), a general approach was introduced to address this problem, in which discrete-time stochastic gradient algorithms are approximated by continuous-time stochastic differential equations with the noise term depending on a small parameter (the learning rate). This can be viewed as a generalization of the method of modified equations (Hirt, 1968; Noh and Protter, 1960; Daly, 1963; Warming and Hyett, 1974) to the stochastic setting, and allows one to employ tools from stochastic calculus to systematically analyze the dynamics of stochastic gradient algorithms. The stochastic modified equations (SME) approach was further developed in Li et al. (2017), where a weak approximation result for the SGD was proved in a finite-sum-objective setting.

The present series of papers builds on the earlier work of Li et al. (2015, 2017) and aims to establish the framework of stochastic modified equations and their applications in greater generality and depth, and to highlight the advantages of this systematic framework for studying stochastic gradient algorithms using continuous-time methods. As the first in the series, this paper will focus on mathematical aspects, namely the main approximation theorems relating stochastic gradient algorithms to stochastic modified equations in the form of weak approximations. These generalize the approximation results in Li et al. (2017) in various aspects. In a subsequent paper in the series, we will discuss the application of this formalism to adaptive stochastic gradient algorithms and related problems.

The organization of this paper is as follows. We first discuss related work in Sec. 2, especially in the context of continuous-time approximations. Next, we motivate the SME approach and set up the precise mathematical framework in Sec. 3.2. We then prove in Sec. 4 a central result relating discrete stochastic algorithms and continuous stochastic processes, which allows us to derive SMEs for stochastic gradient descent and variants. In Sec. 5, the SME approach is used to analyze the dynamics of stochastic gradient algorithms when applied to optimize a simple yet non-trivial objective. Lastly, we conclude with some discussion of our results in Sec. 6. The longer proofs of the results used in the paper are organized in the appendix. These are essentially self-contained, but basic knowledge of stochastic calculus and probability theory is assumed. Unfamiliar readers may refer to standard introductory texts, such as Durrett (2010) and Oksendal (2013).

1.1. Notation

In this paper, we adhere wherever possible to the following notation. Dimensional indices are written as subscripts with a bracket to avoid confusion with other sequential indices (e.g. time, iteration number), which do not have brackets. When more than one index is present, we separate them with a comma, e.g. xk,(i) is the i-th coordinate of the vector xk, the kth member of a sequence. We adopt the Einstein summation convention, where repeated (spatial) indices are summed, i.e. x(i)x(i) := ∑_{i=1}^{d} x(i)x(i). For a matrix A, we denote by λ(A) = {λ1(A), λ2(A), . . .} the set of eigenvalues of A. If A is Hermitian, then the eigenvalues are ordered so that λ1(A) denotes a maximum eigenvalue. We denote the usual Euclidean norm by | · | and for higher rank tensors, we use the same notation to denote the flattened vector norms (e.g. for matrices it will be the Frobenius norm). The ∧ symbol denotes the minimum operator, i.e. a ∧ b := min(a, b).

For a probability space (or more generally, a measure space) (Ω,F,P), the symbol Lp(Ω,F,P), p ∈ [1,∞), denotes the usual Lebesgue spaces, i.e. u ∈ Lp(Ω,F,P) if

‖u‖^p_{Lp(Ω,F,P)} := ∫_Ω |u(ω)|^p dP(ω) ≡ E|u|^p < ∞.

When the underlying probability space is obvious, we use the shorthand Lp(Ω) ≡ Lp(Ω,F,P). In addition, when Ω = Rd, we also write the local Lp spaces as Lp_loc(Rd), which contain those u for which |u|^p is integrable on compact subsets of Rd.

Finally, we note that in the proofs of various results, we typically use the letter C (whose value may change across results) to denote a generic positive constant. This is usually independent of the learning rate η, but if not explicitly stated otherwise, it may depend on e.g. Lipschitz constants, ambient dimensions, etc.

2. Related work

In this section, we discuss several related works on analyzing discrete-time algorithms using continuous-time approaches. The idea of approximating discrete-time stochastic algorithms by continuous equations dates back to the large body of work known as stochastic approximation theory (Kushner and Yin, 2003; Ljung et al., 2012). These typically establish law of large numbers type results where the limiting equation is an ODE, which can then be used to prove powerful convergence results for the stochastic algorithms under consideration. A notion of convergence in distribution, similar to a central limit theorem, was also studied for the purpose of estimating the rate of convergence of the ODE methods (Kushner, 1978; Kushner and Shwartz, 1984; Kushner and Clark, 2012), where connections between leading order perturbations and Ornstein-Uhlenbeck (OU) processes are established. However, these estimates are not used to systematically study the dynamics of stochastic gradient descent and its variants, beyond convergence analysis for the discrete-time algorithms.


As far as the authors are aware, the first works using stochastic differential equations to study the precise dynamical properties of stochastic gradient descent are the independent works of Li et al. (2015) and Mandt et al. (2015). In Li et al. (2015), a systematic framework of SDE approximations of SGD and SGD with momentum is derived and applied to study dynamical properties of the stochastic algorithms as well as adaptive parameter tuning schemes. These go beyond OU process approximations, and this distinction is important since the OU process is not always the appropriate stochastic approximation in general settings (see Sec. 4.2 of this paper). In Mandt et al. (2015), a similar procedure is employed to derive an SDE approximation for the SGD, from which issues such as the choice of learning rates are studied. Although the concrete analysis in Mandt et al. (2015) is on the restricted case of constant diffusion matrices leading to OU processes, the essential ideas on the general leading order approximation are also discussed. We also mention the work of Raginsky and Bouvrie (2012) and Mertikopoulos and Staudigl (2017), where the authors considered a continuous-time variant of mirror descent in the form of a coupled SDE and studied its convergence properties. In Krichene and Bartlett (2017), an SDE approximation of a general class of accelerated mirror descent algorithms is introduced. Under appropriate choices of the mirror map, the approximations considered there can be linked with the order-1 approximations derived in this paper. In contrast with the present paper, which studies the precise sense in which these approximations hold, these works are primarily concerned with investigating the convergence of the optimization algorithms themselves using the approximate continuous-time equations.

It is important to note that the approximation arguments in the aforementioned works, including both Li et al. (2015) and Mandt et al. (2015), are heuristic from a mathematical point of view. In Li et al. (2017), the SME approximation is rigorously proved in the finite-sum-objective case under strong regularity conditions, and further asymptotic analysis and tuning algorithms are studied. The SME approach has subsequently been utilized to study variants of stochastic gradient algorithms, including those in the distributed optimization setting (An et al., 2018). The work of Mandt et al. (2015) is further developed in Mandt et al. (2016, 2017), with applications such as the development of scalable MCMC algorithms.

The present paper builds on the earlier work of Li et al. (2015, 2017), but focuses on extending and solidifying the mathematical aspects. In particular, we present an entirely rigorous and self-contained mathematical formulation of the SME framework that applies to more general algorithms (including momentum SGD and stochastic Nesterov's accelerated gradient method) and more general objectives (an expectation over random functions, instead of just a finite sum). Moreover, various regularity conditions in Li et al. (2017) have been relaxed. The main approximation procedure is inspired by the seminal works of Milstein (1986, 1975) on the numerical analysis of stochastic differential equations, but lower regularity conditions are required in our case due to the presence of the small noise parameter, which allows for better truncation of Itô-Taylor expansions. The mathematical analysis of SME-type approximations for the SGD was also performed in Feng et al. (2017); Hu et al. (2017) using semi-group approaches, although the smoothness requirements presented there are greater than those established using the current methods. Lastly, the Nesterov accelerated gradient SME we derive in Sec. 4.4 can be viewed as a generalization of the ODE approach in Su et al. (2014) to stochastic gradients, and we show that the presence of noise gives additional features to the dynamics. Finally, we note that continuous-time approximations that establish links between optimization, calculus of variations and symplectic integration have been studied in Wibisono et al. (2016); Betancourt et al. (2018).

3. Stochastic modified equations

We now introduce the stochastic modified equations framework. The starting motivation is the observation that the GD iteration (1.2) is an (Euler) discretization of the continuous-time ordinary differential equation

dx/dt = −∇f(x),    (3.1)

and studying (3.1) can give us important insights into the dynamics of the discrete-time algorithm for small enough learning rates. The natural question when extending this to SGD is: what is the right continuous-time equation to consider? Below, we begin with some heuristic considerations.

3.1. Heuristic motivations

We rewrite the SGD iteration (1.3) as

xk+1 = xk − η∇f(xk) + √η Vk(xk, γk),    (3.2)

where Vk(xk, γk) = √η(∇f(xk) − ∇fγk(xk)) is a d-dimensional random vector. A straightforward calculation shows that

E[Vk | xk] = 0,
cov[Vk, Vk | xk] = ηΣ(xk),
Σ(xk) := E[(∇fγk(xk) − ∇f(xk))(∇fγk(xk) − ∇f(xk))^T | xk],    (3.3)

i.e. conditional on xk, Vk has mean 0 and covariance ηΣ(xk). Here, Σ is simply the conditional covariance of the stochastic gradient approximation ∇fγ of ∇f.

Now, consider a time-homogeneous Itô stochastic differential equation (SDE) of the form

dXt = b(Xt)dt + √η σ(Xt)dWt,    (3.4)

where Xt ∈ Rd for t ≥ 0 and Wt is a standard d-dimensional Wiener process. The function b : Rd → Rd is known as the drift and σ : Rd → Rd×d is the diffusion matrix. The key observation is that if we apply the Euler discretization with step-size η to (3.4), approximating Xkη by X̂k, we obtain the following discrete iteration for the latter:

X̂k+1 = X̂k + ηb(X̂k) + ησ(X̂k)Zk,    (3.5)

where Zk := η^{−1/2}(W(k+1)η − Wkη) are d-dimensional i.i.d. standard normal random variables. Comparing with (3.2), if we set b = −∇f, σ(x) = Σ(x)^{1/2} and identify t with kη, we then have matching first and second conditional moments. Hence, this motivates the approximating equation

dXt = −∇f(Xt)dt + (ηΣ(Xt))^{1/2} dWt.    (3.6)


Note that, as this heuristic argument shows, the presence of the small parameter √η on the diffusion term is necessary to model the fact that when the learning rate decreases, the fluctuations of the SGA iterates must also decrease.
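As a purely illustrative check of this heuristic (it plays no role in the formal development below), the following sketch assumes a scalar quadratic sample objective fγ(x) = 0.5(x − γ)² with γ ~ N(0, 1), for which ∇f(x) = x and Σ(x) = 1, runs SGD alongside an Euler-Maruyama discretization of (3.6), and compares the first two moments; the step size, horizon and Monte Carlo size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, T, M = 0.05, 2.0, 20000          # learning rate, horizon, number of Monte Carlo paths
N = int(T / eta)
x0 = 2.0

# Assumed toy objective: f_gamma(x) = 0.5*(x - gamma)^2 with gamma ~ N(0,1),
# so grad f(x) = x and Sigma(x) = Var(grad f_gamma(x)) = 1.
x = np.full(M, x0)                    # SGD paths, iteration (1.3)/(3.2)
X = np.full(M, x0)                    # Euler-Maruyama paths for the SME (3.6)
for _ in range(N):
    gamma = rng.standard_normal(M)
    x = x - eta * (x - gamma)                              # SGD step
    dW = np.sqrt(eta) * rng.standard_normal(M)             # Brownian increment over one step of size eta
    X = X - eta * X + np.sqrt(eta) * dW                    # Euler step: drift -X, diffusion sqrt(eta*Sigma)

print("mean:          SGD", x.mean(), "   SME", X.mean())
print("second moment: SGD", (x**2).mean(), "   SME", (X**2).mean())
```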

The immediate mathematical question is then: in what sense is an SDE like (3.6) an approximation of (1.3)? Let us now establish the precise mathematical framework in which we can answer this question.

3.2. The mathematical framework

Let (Ω,F,P) be a sufficiently rich probability space and (Γ,FΓ) a measure space representing the index space for our random objectives. Let γ : Ω → Γ be a random variable and (r, x) ↦ fr(x) a measurable mapping from Γ × Rd to R. Hence, for each x, fγ(x) is a random variable. Throughout this paper, we assume the following facts about fγ(x):

Assumption 3.1 The random variable fγ(x) satisfies

(i) fγ(x) ∈ L1(Ω) for all x ∈ Rd

(ii) fγ(x) is continuously differentiable in x almost surely and for each R > 0, there exists a random variable MR,γ such that max_{|x|≤R} |∇fγ(x)| ≤ MR,γ almost surely, with E|MR,γ| < ∞

(iii) ∇fγ(x) ∈ L2(Ω) for all x ∈ Rd

Note that in the empirical risk minimization case where Γ is finite, the conditions above are often trivially satisfied. Condition (i) in Assumption 3.1 allows us to define the total objective function we would like to minimize as the expectation

f(x) := Efγ(x) ≡ ∫_Ω f_{γ(ω)}(x) dP(ω).    (3.7)

Moreover, Assumption 3.1 (ii) implies via the dominated convergence theorem that E∇fγ = ∇Efγ ≡ ∇f. Now, let {γk : k = 0, 1, . . .} be a sequence of i.i.d. Γ-valued random variables with the same distribution as γ. Let x0 ∈ Rd be fixed and define the generalized stochastic gradient iteration as the stochastic process

xk+1 = xk + ηh(xk, γk, η) (3.8)

for k ≥ 0, where h : Rd × Γ × R → Rd is a measurable function and η > 0 is the learning rate. In the simple case of SGD, we have h(x, r, η) = −∇fr(x), but we shall consider the generalized version above so that modified equations for SGD variants can also be derived from our approximation theorems.

Next, let us define the class of approximating continuous stochastic processes, which we call stochastic modified equations. Consider the time-homogeneous Itô diffusion process {Xt : t ≥ 0} represented by the following stochastic differential equation (SDE)

dXt = b(Xt, η)dt + √η σ(Xt, η)dWt,   X0 = x0,    (3.9)

where {Wt : t ≥ 0} is a standard d-dimensional Wiener process independent of {γk}, b : Rd × R → Rd is the approximating drift vector and σ : Rd × R → Rd×d is the approximating diffusion matrix. In the following, we will need to pick b, σ appropriately so that (3.8) is approximated by (3.9), in the sense we now describe.

First, notice that the stochastic process {xk} induces a probability measure on the product space Rd × Rd × · · ·, whereas {Xt} induces a probability measure on C0([0,∞),Rd). To compare them, one can form a piece-wise linear interpolation of the former. Alternatively, as we do in this work, we sample a discrete number of points from the latter. Second, the process {xk} is adapted to the filtration generated by {γk} (e.g. in the case of SGD, this corresponds to the random sampling of functions from {fr}), whereas the process {Xt} is adapted to an independent Wiener filtration. Hence, it is not appropriate to compare individual sample paths. Rather, we define below a sense of weak approximation by comparing the distributions of the two processes.

Definition 1 Let G denote the set of continuous functions Rd → R of at most polynomial growth, i.e. g ∈ G if there exist positive integers κ1, κ2 > 0 such that

|g(x)| ≤ κ1(1 + |x|^{2κ2}),

for all x ∈ Rd. Moreover, for each integer α ≥ 1 we denote by Gα the set of α-times continuously differentiable functions Rd → R which, together with their partial derivatives up to and including order α, belong to G. Note that each Gα is a subspace of Cα, the usual space of α-times continuously differentiable functions. Moreover, if g depends on additional parameters, we say g ∈ Gα if the constants κ1, κ2 are independent of these parameters, i.e. g ∈ Gα uniformly. Finally, the definition generalizes to vector-valued functions coordinate-wise in the co-domain.

Definition 2 Let T > 0, η ∈ (0, 1 ∧ T), and α ≥ 1 be an integer. Set N = ⌊T/η⌋. We say that a continuous-time stochastic process {Xt : t ∈ [0, T]} is an order α weak approximation of a discrete stochastic process {xk : k = 0, . . . , N} if for every g ∈ G^{α+1}, there exists a positive constant C, independent of η, such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ Cη^α.    (3.10)

Let us discuss briefly the notion of weak approximation introduced above. These are approximations of the distribution of sample paths, instead of the sample paths themselves. This is enforced by requiring the expectations of the two processes {Xt} and {xk} over a sufficiently large class of test functions to be close. In our definition, the test function class G^{α+1} is quite large, and in particular it includes all polynomials. Thus, Eq. (3.10) implies in particular that all moments of the two processes become close at the rate η^α, and hence so must their distributions. The notion of weak approximation should be contrasted with that of strong approximation, where one would for example require (in the case of mean-square approximations)

[E|xk − Xkη|²]^{1/2} ≤ Cη^α.

The above forces the actual sample paths of the two processes to be close, per realization of the random process, which severely limits its application. In fact, one important advantage of weak approximations is that the continuous-time process Xt can in fact approximate discrete stochastic processes whose step-wise driving noise is not Gaussian, as long as appropriate moments are matched (see e.g. the content of Theorem 3). This additional flexibility is useful as it allows the treatment of more general classes of stochastic gradient iterations.
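For a numerical reading of Definition 2, the following sketch estimates max_k |Eg(xk) − Eg(Xkη)| by Monte Carlo for the scalar quadratic example above with g(x) = x², using the exact Gaussian law of the corresponding OU process (3.6); the decay of the error as η shrinks illustrates an order-1 weak approximation. The model and Monte Carlo sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np

def weak_error(eta, T=2.0, x0=2.0, M=200000, seed=0):
    """Monte Carlo estimate of max_k |E g(x_k) - E g(X_{k*eta})| for g(x) = x^2,
    where x_k is SGD on f_gamma(x) = 0.5*(x - gamma)^2 and X_t solves dX = -X dt + sqrt(eta) dW."""
    rng = np.random.default_rng(seed)
    N = int(T / eta)
    x = np.full(M, x0)
    err = 0.0
    for k in range(1, N + 1):
        x = x - eta * (x - rng.standard_normal(M))                          # SGD step
        t = k * eta
        Eg_sde = x0**2 * np.exp(-2 * t) + 0.5 * eta * (1 - np.exp(-2 * t))  # exact E[X_t^2] for the OU SME
        err = max(err, abs((x**2).mean() - Eg_sde))
    return err

for eta in (0.1, 0.05, 0.025):
    print("eta =", eta, "  estimated weak error =", weak_error(eta))
```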

4. The approximation theorems

We now present the main approximation theorems. The derivation is based on the following two-step process:

1. We establish a connection between one-step approximation and approximation on a finite time interval.

2. We construct a one-step approximation that is of order α + 1, so that the approximation on a finite interval is of order α.

4.1. Relating one-step to N-step approximations

Let us consider generally the question of the relationship between one-step approximations and approximations on a finite interval. Let T > 0, η ∈ (0, 1 ∧ T) and N = ⌊T/η⌋, and recall the general SGA iterations

xk+1 = xk + ηh(xk, γk, η),   x0 ∈ Rd,   k = 0, . . . , N,    (4.1)

and the general candidate family of approximating SDEs

dX^{η,ε}_t = b(X^{η,ε}_t, η, ε)dt + √η σ(X^{η,ε}_t, η, ε)dWt,   X0 = x0,   t ∈ [0, T],    (4.2)

where ε ∈ (0, 1) is a mollification parameter, whose role will become apparent later. To reduce notational clutter and improve readability, unless some limiting procedure is considered, we shall not explicitly write the dependence of X^{η,ε}_t on η, ε and simply denote by Xt the solution of the above SDE. Let us also denote for convenience Xk := Xkη. Further, let {X^{x,s}_t : t ≥ s} denote the stochastic process obeying the same equation (4.2), but with the initial condition X^{x,s}_s = x. We similarly write X^{x,l}_k := X^{x,lη}_{kη} and denote by {x^{x,l}_k : k ≥ l} the stochastic process satisfying (4.1) but with xl = x.

Throughout this section, we assume the following conditions:

Assumption 4.1 The functions b : Rd × (0, 1 ∧ T )× (0, 1)→ Rd and σ : Rd × (0, 1 ∧ T )×(0, 1)→ Rd×d satisfy:

1. Uniform linear growth condition

|b(x, η, ε)|2 + |σ(x, η, ε)|2 ≤ L2(1 + |x|2)

for all x ∈ Rd, η ∈ (0, 1 ∧ T), ε ∈ (0, 1).

2. Uniform Lipschitz condition

|b(x, η, ε)− b(y, η, ε)|+ |σ(x, η, ε)− σ(y, η, ε)| ≤ L|x− y|

for all x, y ∈ Rd, η ∈ (0, 1 ∧ T ), ε ∈ (0, 1).


Note that condition 2 implies condition 1 if there is at least one x at which the supremum of b, σ over η, ε is finite. In particular, these conditions imply via Thm. 18 that there exists a unique solution to Eq. (4.2).

Now, let us denote the one-step changes

∆(x) := x^{x,0}_1 − x,   ∆̃(x) := X^{x,0}_1 − x.    (4.3)

We prove the following result which relates one-step approximations with approximations on a finite time interval.

Theorem 3 Let T > 0, η ∈ (0, 1 ∧ T), ε ∈ (0, 1) and N = ⌊T/η⌋. Let α ≥ 1 be an integer. Suppose further that the following conditions hold:

(i) There exists a function ρ : (0, 1) → R+ and K1 ∈ G independent of η, ε such that

|E ∏_{j=1}^{s} ∆(ij)(x) − E ∏_{j=1}^{s} ∆̃(ij)(x)| ≤ K1(x)(ηρ(ε) + η^{α+1}),

for s = 1, 2, . . . , α and

E ∏_{j=1}^{α+1} |∆(ij)(x)| ≤ K1(x)η^{α+1},

for all ij ∈ {1, . . . , d}.

(ii) For each m ≥ 1, the 2m-moment of x^{x,0}_k is uniformly bounded with respect to k and η, i.e. there exists a K2 ∈ G, independent of η, k, such that

E|x^{x,0}_k|^{2m} ≤ K2(x),

for all k = 0, . . . , N ≡ ⌊T/η⌋.

Then, for each g ∈ G^{α+1}, there exists a constant C > 0, independent of η, ε, such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ C(η^α + ρ(ε)).

The proof of Thm. 3 requires a number of technical results that we defer to the appendix. Below, we demonstrate the main ingredients of the proof and refer to the appendix where the proofs of the auxiliary results are fully presented.

Proof In this proof, since we repeatedly condition on initial values, to prevent nested superscripts we shall introduce the alternative notation Xt(x, s) ≡ X^{x,s}_t, and similarly for Xk and xk. Fix g ∈ G^{α+1} and 1 ≤ k ≤ N. We have

Eg(Xkη) = Eg(Xk) = Eg(Xk(X1, 1)) − Eg(Xk(x1, 1)) + Eg(Xk(x1, 1)).

If k > 1, by noting that Xk(x1, 1) = Xk(X2(x1, 1), 2), we get

Eg(Xk(x1, 1)) = Eg(Xk(X2(x1, 1), 2)) − Eg(Xk(x2, 2)) + Eg(Xk(x2, 2)).

Continuing this process, we then have

Eg(Xk) = ∑_{l=1}^{k−1} [Eg(Xk(Xl(x_{l−1}, l − 1), l)) − Eg(Xk(xl, l))] + Eg(Xk(x_{k−1}, k − 1)),

and hence by subtracting Eg(xk) ≡ Eg(xk(x_{k−1}, k − 1)) we get

Eg(Xk) − Eg(xk) = ∑_{l=1}^{k−1} [Eg(Xk(Xl(x_{l−1}, l − 1), l)) − Eg(Xk(xl, l))] + Eg(Xk(x_{k−1}, k − 1)) − Eg(xk(x_{k−1}, k − 1)),

and so

Eg(Xk) − Eg(xk) = ∑_{l=1}^{k−1} { E E[g(Xk(Xl(x_{l−1}, l − 1), l)) | Xl(x_{l−1}, l − 1)] − E E[g(Xk(xl, l)) | xl] } + Eg(Xk(x_{k−1}, k − 1)) − Eg(xk(x_{k−1}, k − 1)).

Now, let u(x, s) = Eg(Xkη(x, s)). Then, we have

|Eg(Xk) − Eg(xk)| ≤ ∑_{l=1}^{k−1} |E u(Xl(x_{l−1}, l − 1), lη) − E u(xl(x_{l−1}, l − 1), lη)| + |Eg(Xk(x_{k−1}, k − 1)) − Eg(xk(x_{k−1}, k − 1))|
≤ ∑_{l=1}^{k−1} E |E[u(Xl(x_{l−1}, l − 1), lη) | x_{l−1}] − E[u(xl(x_{l−1}, l − 1), lη) | x_{l−1}]| + E |E[g(Xk(x_{k−1}, k − 1)) | x_{k−1}] − E[g(xk(x_{k−1}, k − 1)) | x_{k−1}]|.

Using Prop. 25, u(·, s) ∈ G^{α+1} uniformly in s, t, η and ε. Thus, by Assumption (i) and Lem. 27,

|Eg(xk) − Eg(Xk)| ≤ (ηρ(ε) + η^{α+1}) ( ∑_{l=1}^{k−1} E K_{l−1}(x_{l−1}) + E K_{k−1}(x_{k−1}) )
≤ (ηρ(ε) + η^{α+1}) ∑_{l=0}^{N} κ_{l,1}(1 + E|x_l|^{2κ_{l,2}}),

where in the last line we used moment estimates from Thm. 19. Finally, using Assumption (ii) and the fact that N ≤ T/η, we have

|Eg(xk) − Eg(Xkη)| = |Eg(xk) − Eg(Xk)| ≤ C(ρ(ε) + η^α).


4.2. SME for stochastic gradient descent

Thm. 3 allows us to prove the main approximation results for the current paper. In particular, in this section we derive a second-order accurate weak approximation for the simple SGD iterations (1.3), from which a simpler, first-order accurate approximation also follows. As seen in Thm. 3, we need only verify the conditions (i)-(ii) in order to prove the weak approximation result. These conditions mostly involve moment estimates, which we now perform. To simplify presentation, we introduce the following shorthand. Whenever we write

ψ(x) = ψ0(x) + ηψ1(x) +O(r(η, ε)),

for some remainder term r(η, ε), we mean: there exists K ∈ G independent of η, ε such that

|ψ(x)− ψ0(x)− ηψ1(x)| ≤ K(x)r(η, ε).

Now, let us set in (4.2)

b(x, η, ε) = b0(x, ε) + ηb1(x, ε)

σ(x, η, ε) = σ0(x, ε),

where b0, b1, σ0 are functions to be determined. We have the following moment estimate.

Lemma 4 Let ∆̃(x) be defined as in (4.3). Suppose further that b0, b1, σ0 ∈ G3. Then we have

(i) E∆̃(i)(x) = b0(x, ε)(i)η + [(1/2) b0(x, ε)(j) ∂(j)b0(x, ε)(i) + b1(x, ε)(i)]η² + O(η³),

(ii) E∆̃(i)(x)∆̃(j)(x) = [b0(x, ε)(i) b0(x, ε)(j) + σ0(x, ε)(i,k) σ0(x, ε)(j,k)]η² + O(η³),

(iii) E ∏_{j=1}^{3} |∆̃(ij)(x)| = O(η³).

Proof To obtain (i)-(iii), we simply apply Lem. 28 with ψ(z) = ∏_{j=1}^{s} (z(ij) − x(ij)) for s = 1, 2, 3 respectively.

Next, we estimate the moments of the SGA iterations below.

Lemma 5 Let ∆(x) be defined as in (4.3) with the SGD iterations, i.e. h(x, r, η) = −∇fr(x). Suppose that for each x ∈ Rd, f ∈ G1. Then,

(i) E∆(i)(x) = −∂(i)f(x)η,

(ii) E∆(i)(x)∆(j)(x) = ∂(i)f(x)∂(j)f(x)η² + Σ(x)(i,j)η²,

(iii) E ∏_{j=1}^{3} |∆(ij)(x)| = O(η³),

where Σ(x) := E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T.


Proof We have ∆(x) = −η∇fγ0(x). Taking expectations, the results then follow.

We now prove the main approximation theorem for the simple SGD. Before presenting the statement and proof, we shall note a few technical issues that prevent the direct application of Thm. 3 with the moment estimates in Lem. 4 and 5. The latter suggest ignoring ε and setting

b0(x, ε) = −∇f(x),   b1(x, ε) = −(1/4)∇|∇f(x)|²,   σ0(x, ε) = Σ(x)^{1/2}.

Then, we would see from Lem. 4 and 5 that the SGD and the SDE have matching moments up to O(η³). The first issue with this approach is that even if Σ(x) is sufficiently smooth (which may follow from the regularity of ∇fγ), the smoothness of Σ(x)^{1/2} cannot be guaranteed unless Σ(x) is positive-definite, which is often too strong an assumption in practice and excludes interesting cases where Σ(x) is a singular diffusion matrix. However, the results in Sec. 4.1 require smoothness. Second, we would like to consider functions fγ that may not have the higher strong derivatives required by the Lemmas, beyond those required to define the modified equation itself. To fix both of these issues, we will use a simple mollifying technique. This is the reason for the inclusion of the ε parameter in the results in Sec. 4.1.

Definition 6 Let us denote by ν : Rd → R, ν ∈ C∞c(Rd), the standard mollifier

ν(x) := C exp(−1/(1 − |x|²)) if |x| < 1,   ν(x) := 0 if |x| ≥ 1,

where the constant C is chosen so that ∫_{Rd} ν(y)dy = 1. Further, define νε(x) = ε^{−d}ν(x/ε). Let ψ ∈ L1_loc(Rd) be locally integrable; then we may define its mollification by

ψε(x) := (νε ∗ ψ)(x) = ∫_{Rd} νε(x − y)ψ(y)dy = ∫_{B(0,ε)} νε(y)ψ(x − y)dy,

where B(z, ε) is the d-dimensional ball of radius ε centered at z. The mollification of vector (or matrix) valued functions is defined element-wise.

The mollifier has very useful properties. In particular, we will use the following well-known facts (see e.g. Evans (2010) for proofs):

(i) If ψ ∈ L1_loc(Rd), then ψε ∈ C∞(Rd)

(ii) ψε(x)→ ψ(x) as ε→ 0 for almost every x ∈ Rd (with respect to the Lebesgue measure)

(iii) If ψ is continuous, then ψε(x)→ ψ(x) as ε→ 0 uniformly on compact subsets of Rd

Next, we make use of the idea of weak derivatives.

Definition 7 Let Ψ ∈ L1_loc(Rd) and J be a multi-index of order |J|. Suppose that there exists a ψ ∈ L1_loc(Rd) such that

∫_{Rd} Ψ(x)∇^J φ(x)dx = (−1)^{|J|} ∫_{Rd} ψ(x)φ(x)dx

for all φ ∈ C∞c(Rd). Then, we call ψ the order-J weak derivative of Ψ and write D^J Ψ = ψ. Note that when it exists, the weak derivative is unique almost everywhere, and if Ψ is differentiable, ∇^J Ψ = D^J Ψ almost everywhere (Evans, 2010).

The introduction of weak derivatives motivates the definition of the weak versions of the function spaces Gα.

Definition 8 For α ≥ 1, we define the space Gα_w to be the subspace of L1_loc(Rd) such that if g ∈ Gα_w, then g has weak derivatives up to order α and for each multi-index J with |J| ≤ α, there exist positive integers κ1, κ2 such that

|D^J g(x)| ≤ κ1(1 + |x|^{2κ2})   for a.e. x ∈ Rd.

As in Def. 1, if g depends on additional parameters, we say that g ∈ Gα_w if the above constants do not depend on the additional parameters. Also, vector-valued g are defined as above element-wise in the co-domain. Note that Gα_w is a subspace of the Sobolev space W^{α,1}_loc.

Theorem 9 Let T > 0, η ∈ (0, 1 ∧ T) and set N = ⌊T/η⌋. Let {xk : k ≥ 0} be the SGD iterations defined in (1.3). Suppose the following conditions are met:

(i) f ≡ Efγ is twice continuously differentiable, ∇|∇f|² is Lipschitz, and f ∈ G4_w.

(ii) ∇fγ satisfies a Lipschitz condition:

|∇fγ(x) − ∇fγ(y)| ≤ Lγ|x − y|   a.s.

for all x, y ∈ Rd, where Lγ is a random variable which is positive a.s. and EL^m_γ < ∞ for each m ≥ 1.

Define {Xt : t ∈ [0, T]} as the stochastic process satisfying the SDE

dXt = −∇(f(Xt) + (1/4)η|∇f(Xt)|²)dt + √η Σ(Xt)^{1/2} dWt,   X0 = x0,    (4.4)

with Σ(x) = E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T. Then, {Xt : t ∈ [0, T]} is an order-2 weak approximation of the SGD, i.e. for each g ∈ G3, there exists a constant C > 0 independent of η such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ Cη².

Proof First, we check that Eq. (4.4) admits a unique solution, which amounts to checking the conditions in Thm. 18. Note that the Lipschitz condition (ii) implies ∇f is Lipschitz with constant ELγ. To see that Σ(x)^{1/2} is also Lipschitz, observe that u(x) := ∇fγ(x) − ∇f(x) is Lipschitz (in the sense of (ii), with constant at most Lγ + ELγ), and

|Σ(x)^{1/2} − Σ(y)^{1/2}| = | ‖[u(x)u(x)^T]^{1/2}‖_{L²(Ω)} − ‖[u(y)u(y)^T]^{1/2}‖_{L²(Ω)} | ≤ ‖[u(x)u(x)^T]^{1/2} − [u(y)u(y)^T]^{1/2}‖_{L²(Ω)}.

Moreover, observe that for vectors u ∈ Rd the mapping u ↦ (uu^T)^{1/2} = uu^T/|u| is Lipschitz, which implies

|Σ(x)^{1/2} − Σ(y)^{1/2}| ≤ L′‖u(x) − u(y)‖_{L²(Ω)} ≤ L′′|x − y|.

The Lipschitz conditions on the drift and the diffusion matrix imply uniform linear growth, so by Thm. 18, Eq. (4.4) admits a unique solution.

For each ε ∈ (0, 1), define the mollified functions

b0(x, ε) = −νε ∗ ∇f(x),   b1(x, ε) = −(1/4) νε ∗ (∇|∇f(x)|²),   σ0(x, ε) = νε ∗ Σ(x)^{1/2}.

Observe that b0 + ηb1 and σ0 satisfy a Lipschitz condition in x uniformly in η, ε. To see this, note that for any Lipschitz function ψ with constant L, we have

|νε ∗ ψ(x) − νε ∗ ψ(y)| ≤ ∫_{B(0,ε)} νε(z)|ψ(x − z) − ψ(y − z)|dz ≤ L|x − y|,

which proves b0 + ηb1 and σ0 are uniformly Lipschitz. Similarly, the linear growth condition follows. Hence, we may define a family of stochastic processes {X^ε_t : ε ∈ (0, 1)} satisfying

dX^ε_t = [b0(X^ε_t, ε) + ηb1(X^ε_t, ε)]dt + √η σ0(X^ε_t, ε)dWt,   X^ε_0 = x0,

each of which admits a unique solution by Thm. 18. Now, we claim that b0(·, ε), b1(·, ε), σ0(·, ε) ∈ G3 uniformly in ε. To see this, simply observe that mollifications are smooth, and moreover, the polynomial growth is satisfied since νε ∗ D^J ψ = ∇^J(νε ∗ ψ) and furthermore, if ψ ∈ G, then we have

|ψε(x)| ≤ ∫_{B(0,ε)} νε(y)|ψ(x − y)|dy ≤ κ1 (1 + 2^{2κ2−1}|x|^{2κ2} + 2^{2κ2−1} (1/ε^d) ∫_{B(0,ε)} |y|^{2κ2} dy).

But ∫_{B(0,ε)} |y|^{2κ2} dy ≤ Vol(B(0, ε)) = Cε^d, where C is independent of ε. This shows that ψε ∈ G uniformly in ε. This immediately implies that b0(·, ε), b1(·, ε), σ0(·, ε) ∈ G3.

Now, since b0(x, ε) → b0(x, 0) (and similarly for b1, σ0), and the limits are continuous, by Lem. 4, 5, 29, 30, all conditions of Thm. 3 are satisfied, and hence we conclude that for each g ∈ G3 we have

max_{k=0,...,N} |Eg(X^ε_{kη}) − Eg(xk)| ≤ C(η² + ρ(ε)),

where C is independent of η and ε and ρ(ε) → 0 as ε → 0. Moreover, since b0(x, ε) → b0(x, 0) (and similarly for b1, σ0) uniformly on compact sets, we may apply Thm. 20 to conclude that

sup_{t∈[0,T]} E|X^ε_t − Xt|² → 0 as ε → 0.


Thus, we have

|Eg(Xkη) − Eg(xk)| ≤ |Eg(X^ε_{kη}) − Eg(xk)| + |Eg(X^ε_{kη}) − Eg(Xkη)|
≤ C(η² + ρ(ε)) + (E|X^ε_{kη} − Xkη|²)^{1/2} × (∫_0^1 E|∇g(λX^ε_{kη} + (1 − λ)Xkη)|² dλ)^{1/2}.

Using Thm. 19 and the assumption that ∇g ∈ G, the last expectation is finite and hence taking the limit ε → 0 yields our result.

By going for a lower order approximation, we of course have the following:

Corollary 10 Assume the same conditions as in Thm. 9, except that we replace (i) with

(i)' f ≡ Efγ is continuously differentiable, and f ∈ G3_w.

Define {Xt : t ∈ [0, T]} as the stochastic process satisfying the SDE

dXt = −∇f(Xt)dt + √η Σ(Xt)^{1/2} dWt,   X0 = x0,    (4.5)

with Σ(x) = E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T. Then, {Xt : t ∈ [0, T]} is an order-1 weak approximation of the SGD, i.e. for each g ∈ G2, there exists a constant C > 0 independent of η such that

max_{k=0,...,N} |Eg(Xkη) − Eg(xk)| ≤ Cη.
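The SMEs (4.4)-(4.5) are ordinary Itô SDEs and can be simulated with standard schemes once ∇f and Σ are supplied. Below is a minimal Euler-Maruyama sketch for the order-1 SME (4.5); the symmetric square root Σ(x)^{1/2} is computed by eigendecomposition, and the quadratic test problem at the end is an assumed stand-in, not something prescribed by the corollary.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def euler_maruyama_sme(grad_f, Sigma, x0, eta, T, seed=0):
    """Simulate one path of the order-1 SME dX = -grad_f(X) dt + sqrt(eta*Sigma(X)) dW,
    using an Euler-Maruyama step of size eta (one step per SGD iteration)."""
    rng = np.random.default_rng(seed)
    X = np.array(x0, dtype=float)
    path = [X.copy()]
    for _ in range(int(T / eta)):
        dW = np.sqrt(eta) * rng.standard_normal(X.shape)          # Brownian increment over one step
        X = X - eta * grad_f(X) + np.sqrt(eta) * sqrtm_psd(Sigma(X)) @ dW
        path.append(X.copy())
    return np.array(path)

# Illustrative test problem (assumed, not from the corollary): f(x) = 0.5 x^T H x, Sigma(x) = H^2.
H = np.diag([1.0, 0.1])
path = euler_maruyama_sme(lambda x: H @ x, lambda x: H @ H, np.array([1.0, 1.0]), eta=0.05, T=5.0)
print("final SME iterate:", path[-1])
```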

Remark 11 In the above results, the most restrictive condition is probably the Lipschitz condition on ∇fγ. Such Lipschitz conditions are important to ensure that the SMEs admit unique strong solutions and that the SGA iterates have uniformly bounded moments. Note that following similar techniques in SDE analysis (e.g. Kloeden and Platen (2011)), these global conditions may be relaxed to their respective local versions if we assume in addition a uniform global linear growth condition on ∇fγ. Finally, for applications, typical loss functions have inward-pointing gradients for all sufficiently large x, meaning that the SGD iterates will be uniformly bounded almost surely. Thus, we may simply modify the loss functions for large x (without affecting the SGA iterates) to satisfy the conditions above.

Remark 12 The constant C does not depend on η, but as evidenced in the proof of the theorem, it generally depends on g, T, d and the various Lipschitz constants. For the fairly general situation we consider, we do not derive tight estimates of these dependencies.

4.3. SME for stochastic gradient descent with momentum

Let us discuss the corresponding SME for a popular variant of the SGD called the momentum SGD (MSGD). The momentum SGD augments the usual SGD iterations with a "memory" term. In the usual form, we have the iterations

vk+1 = µvk − η∇fγk(xk)

xk+1 = xk + vk+1


where µ ∈ (0, 1) (typically close to 1) is called the momentum parameter and η is the learning rate. Let us consider a rescaled version of the above that is easier to analyze via continuous-time approximations. We redefine

η := √η,   vk := vk/√η,   µ := (1 − µ)/√η    (4.6)

to obtain

vk+1 = vk − µηvk − η∇fγk(xk)
xk+1 = xk + ηvk+1.    (4.7)

In view of the rescaling, the range of momentum parameters we consider becomes µ ∈ (0, η^{−1/2}), which we may replace by (0, ∞) for simplicity.
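Since (4.6)-(4.7) are only a change of variables, the rescaled iteration generates exactly the same x-iterates as the original MSGD when driven by the same samples. The sketch below checks this numerically on a toy quadratic sample loss; the loss, H and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, mu, d, K = 0.01, 0.9, 3, 50
H = np.diag([1.0, 0.5, 0.1])
gammas = rng.standard_normal((K, d))            # shared samples gamma_k for both runs
grad = lambda x, g: H @ (x - g)                 # assumed sample gradient grad f_gamma(x) = H(x - gamma)

# Original MSGD iterations.
x, v = np.ones(d), np.zeros(d)
for g in gammas:
    v = mu * v - eta * grad(x, g)
    x = x + v

# Rescaled iterations (4.7): eta_h = sqrt(eta), v_h = v/sqrt(eta), mu_h = (1 - mu)/sqrt(eta).
eta_h, mu_h = np.sqrt(eta), (1 - mu) / np.sqrt(eta)
xr, vr = np.ones(d), np.zeros(d)
for g in gammas:
    vr = vr - mu_h * eta_h * vr - eta_h * grad(xr, g)
    xr = xr + eta_h * vr

print("max |x - x_rescaled| =", np.max(np.abs(x - xr)))    # agreement up to round-off
```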

Let us now derive the SME satisfied by the iterations (4.7). Observe that this is again a special case of (4.1) with x now replaced by (v, x) and

h(v, x, γ, η) = (−µv − ∇fγ(x), v − ηµv − η∇fγ(x)).

In view of Thm. 3 and the results in Sec. 4.2, in order to derive the SMEs we simply match moments up to order 3. As in Sec. 4.2, let us define the one-step difference

∆(v, x) := (v^{v,x,0}_1 − v, x^{v,x,0}_1 − x).    (4.8)

The following moment expansions are immediate.

Lemma 13 Let ∆(v, x) be defined as in (4.8). We have

(i) E∆(i)(v, x) = η(−µv(i) − ∂(i)f(x), v(i)) + η²(0, −µv(i) − ∂(i)f(x)),

(ii) E∆(i)(v, x)∆(j)(v, x) = η² ×
[ µ²v(i)v(j) + µv(i)∂(j)f(x) + µv(j)∂(i)f(x) + Σ(x)(i,j) + ∂(i)f(x)∂(j)f(x)    −µv(i)v(j) − v(i)∂(j)f(x) ]
[ −µv(i)v(j) − v(j)∂(i)f(x)    v(i)v(j) ]
+ O(η³),

(iii) E ∏_{j=1}^{3} |∆(ij)(v, x)| = O(η³),

where Σ(x) := E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T.

Proof The proof follows from direct calculation of the moments.

Hence, proceeding exactly as in Sec. 4.2 and using Lem. 4 and 13, we see that we may set

b0(v, x) = (−µv − ∇f(x), v)
b1(v, x) = −(1/2)(µ[µv + ∇f(x)] − ∇²f(x)v, µv + ∇f(x))
σ0(v, x) = ( Σ(x)^{1/2}  0 ; 0  0 )   (in 2 × 2 block form)

in order to match the moments. By similar mollification and limiting arguments as in Thm. 9, we arrive at the following approximation theorem, where we can see that the SME for MSGD takes the form of a Langevin equation.


Theorem 14 Assume the same conditions as in Thm. 9 and that the drift terms in the following SDE are Lipschitz. Let µ > 0 be fixed and define {Vt, Xt : t ∈ [0, T]} as the stochastic process satisfying the SDE

dVt = −[(µI + (1/2)η[µ²I − ∇²f(Xt)])Vt + (1 + (1/2)ηµ)∇f(Xt)]dt + √η Σ(Xt)^{1/2} dWt,   V0 = v0,
dXt = [(1 − (1/2)ηµ)Vt − (1/2)η∇f(Xt)]dt,   X0 = x0,    (4.9)

with Σ(x) as defined in Thm. 9. Then, {(Vt, Xt) : t ∈ [0, T]} is an order-2 weak approximation of the MSGD.

Moreover, if we relax the assumptions to those of Cor. 10, we have the order-1 weak approximation

dVt = −[µVt + ∇f(Xt)]dt + √η Σ(Xt)^{1/2} dWt,   V0 = v0,
dXt = Vt dt,   X0 = x0.    (4.10)

Note that by inverting the scaling (4.6), the order-1 SME (4.10) is the formal equation derived in Li et al. (2015). As discussed in Remark 11, using standard techniques the restrictive global Lipschitz conditions can be relaxed to a local Lipschitz condition together with a linear growth condition.

4.4. SME for a momentum variant: Nesterov accelerated gradient

It follows from the calculation above that we can also obtain the SME for the stochastic gradient version of the Nesterov accelerated gradient (NAG) method (Nesterov, 1983), which we refer to as SNAG. In the non-stochastic case, the NAG method has been analyzed using the ODE approach (Su et al., 2014). Therefore, the derivations in this section can be viewed as a stochastic parallel. The NAG method is sometimes used with stochastic gradients, and hence it is useful to analyze its properties in this setting and compare it to MSGD.

The unscaled NAG iterations are

vk+1 = µkvk − η∇fγk(xk + µkvk)

xk+1 = xk + vk+1

with v0 = 0, which differs from the momentum iterations in that the gradient is now evaluated at a "predicted" position xk + µkvk, instead of the original position xk. Moreover, the momentum parameter µk is now allowed to vary as k increases, and in fact the usual choice

µk = (k − 1)/(k + 2)    (4.11)

has important links to stability and acceleration in the deterministic case (Nesterov, 1983; Su et al., 2014). In particular, it achieves an O(1/k²) convergence rate for general convex functions. On the other hand, a constant µk is suggested for strongly convex functions (Nesterov, 2013). In the following, we shall first consider the case of a constant momentum parameter µk ≡ µ, and then the choice (4.11).


Constant momentum. Using the same rescaling in (4.6), we have

vk+1 = vk − µηvk − η∇fγk(xk + η(1− µη)vk)

xk+1 = xk + ηvk+1.    (4.12)

which is again (4.1) with

h(v, x, γ, η) = (−µv −∇fγ(x+ η(1− µη)v), v − ηµv − η∇fγ(x+ η(1− µη)v))

Hence, we have the following moment expansion.

Lemma 15 Let ∆(v, x) := (v^{v,x,0}_1 − v, x^{v,x,0}_1 − x). We have

(i) E∆(i)(v, x) = η(−µv(i) − ∂(i)f(x), v(i)) + η²(−∂(i)∂(j)f(x)v(j), −µv(i) − ∂(i)f(x)) + O(η³),

(ii) E∆(i)(v, x)∆(j)(v, x) = η² ×
[ µ²v(i)v(j) + µv(i)∂(j)f(x) + µv(j)∂(i)f(x) + Σ(x)(i,j) + ∂(i)f(x)∂(j)f(x)    −µv(i)v(j) − v(i)∂(j)f(x) ]
[ −µv(i)v(j) − v(j)∂(i)f(x)    v(i)v(j) ]
+ O(η³),

(iii) E ∏_{j=1}^{3} |∆(ij)(v, x)| = O(η³),

where Σ(x) := E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T.

Proof The proof follows from direct calculation of the moments and Taylor’s expansion.

Hence, we may match moments by setting

b0(v, x) = (−µv − ∇f(x), v)
b1(v, x) = −(1/2)(µ[µv + ∇f(x)] + ∇²f(x)v, µv + ∇f(x))
σ0(v, x) = ( Σ(x)^{1/2}  0 ; 0  0 ),

from which we obtain the following approximation theorem for SNAG.

Theorem 16 Assume the same conditions as in Thm. 14. Define {Vt, Xt : t ∈ [0, T]} as the stochastic process satisfying the SDE

dVt = −[(µI + (1/2)η[µ²I + ∇²f(Xt)])Vt + (1 + (1/2)ηµ)∇f(Xt)]dt + √η Σ(Xt)^{1/2} dWt,   V0 = v0,
dXt = [(1 − (1/2)ηµ)Vt − (1/2)η∇f(Xt)]dt,   X0 = x0,    (4.13)

with Σ as defined in Thm. 14. Then, {(Vt, Xt) : t ∈ [0, T]} is an order-2 weak approximation of SNAG. Moreover, the same order-1 weak approximation of MSGD in (4.10) holds for the SNAG.

The result above shows that for constant momentum parameters, the modified equations for MSGD and the SNAG are equivalent at leading order, but differ when we consider the second-order modified equation. Let us now discuss the case where the momentum parameter is allowed to vary.


Varying momentum. Now let us take µk as in (4.11). Then, using the same rescaling arguments, we arrive at

vk+1 = vk − µkηvk − η∇fγk(xk + η(1 − µkη)vk)
xk+1 = xk + ηvk+1,    (4.14)

with µk = 3/(2η + kη). Now, in order to apply our theoretical results to deduce the SME, simply notice that we may introduce an auxiliary scalar variable

zk+1 = zk + η, z0 = 0.

Then, µk = 3/(2η + zk), and hence no term depends explicitly on k; thus we may proceed formally as in the previous sections to arrive at the order-1 SME for SNAG with varying momentum

dVt = −[(3/t)Vt + ∇f(Xt)]dt + √η Σ(Xt)^{1/2} dWt,   V0 = 0,
dXt = Vt dt,   X0 = x0.    (4.15)

This result is formal because the term 3/t does not satisfy our global Lipschitz conditions, unless we restrict our interval to some [t0, T] with t0 > 0, in which case the above result becomes rigorous. Alternatively, some limiting arguments have to be used to establish well-posedness of the equation on [0, T] individually. We shall omit these analyses in the current paper, and only consider (4.15) on some interval [t0, T], where initial conditions are then replaced by (vt0, xt0). As a point of comparison, (4.15) reduces to the ODE derived in Su et al. (2014) if Σ(x) ≡ 0 (i.e. the gradients are non-stochastic).
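For concreteness, the sketch below runs the unscaled SNAG iteration with µk = (k − 1)/(k + 2) and the rescaled form (4.14) with µk = 3/(2η + kη) on a toy quadratic sample loss, and verifies that they produce identical iterates, as the change of variables (4.6) implies; the loss and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
eta, d, K = 0.01, 3, 100
H = np.diag([1.0, 0.5, 0.1])
gammas = rng.standard_normal((K, d))
grad = lambda x, g: H @ (x - g)          # assumed sample gradient for the quadratic loss

# Unscaled SNAG with mu_k = (k-1)/(k+2).
x, v = np.ones(d), np.zeros(d)
for k, g in enumerate(gammas):
    mu_k = (k - 1) / (k + 2)
    v = mu_k * v - eta * grad(x + mu_k * v, g)
    x = x + v

# Rescaled form (4.14): eta_h = sqrt(eta), v_h = v/sqrt(eta), mu_h_k = 3/(2*eta_h + k*eta_h).
eta_h = np.sqrt(eta)
xr, vr = np.ones(d), np.zeros(d)
for k, g in enumerate(gammas):
    mu_h = 3.0 / (2 * eta_h + k * eta_h)
    vr = vr - mu_h * eta_h * vr - eta_h * grad(xr + eta_h * (1 - mu_h * eta_h) * vr, g)
    xr = xr + eta_h * vr

print("max |x - x_rescaled| =", np.max(np.abs(x - xr)))   # agreement up to round-off
```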

5. Applications of the SMEs to the analysis of SGA

In this section, we apply the SME framework developed above to analyze the dynamics of the three stochastic gradient algorithm variants discussed above, namely SGD, MSGD and SNAG. We shall focus on simple but non-trivial models where, to a large extent, analytical computations using SME are tractable, giving us key insights into the algorithms that are otherwise difficult to obtain without appealing to the continuous formalism presented in this paper. We consider primarily the following model:

Model: Let H ∈ Rd×d be a symmetric, positive definite matrix. Define the sample objective

fγ(x) := (1/2)(x − γ)^T H(x − γ) − (1/2)Tr(H),   γ ∼ N(0, I),    (5.1)

which gives the total objective f(x) ≡ Efγ(x) = (1/2)x^T Hx.
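It is easy to check by direct sampling that for the model (5.1) the gradient noise covariance is Σ(x) = H², independently of x (as used below). The following short sketch does this for a small random H; the dimension, spectrum and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, M = 4, 200000
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag([2.0, 1.0, 0.5, 0.1]) @ Q.T            # assumed symmetric positive definite H

x = rng.standard_normal(d)
gammas = rng.standard_normal((M, d))
grads = (x - gammas) @ H                                # rows: grad f_gamma(x) = H(x - gamma)
noise = grads - x @ H                                   # grad f_gamma(x) - grad f(x) = -H gamma
Sigma_hat = noise.T @ noise / M                         # empirical covariance of the gradient noise

print("||Sigma_hat - H^2||_F =", np.linalg.norm(Sigma_hat - H @ H))   # ~ O(1/sqrt(M))
```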

5.1. SME analysis of SGD

We first derive the SME associated with (5.1). For simplicity, we will only consider the order-1 SME (4.5). A direct computation shows that Σ(x) = H², and so the SME for SGD applied to model (5.1) is

dXt = −HXt dt + √η H dWt.


This is a multi-dimensional Ornstein-Uhlenbeck (OU) process and admits the explicit solution

Xt = e^{−tH}(x0 + √η ∫_0^t e^{sH} H dWs).

Observe that for each t ≥ 0, the distribution of Xt is Gaussian. Using Itô's isometry, we then deduce the dynamics of the objective function

Ef(Xt) = (1/2)x0^T H e^{−2tH} x0 + (1/2)η ∫_0^t Tr(H³ e^{−2(t−s)H}) ds
       = (1/2)x0^T H e^{−2tH} x0 + (1/4)η ∑_{i=1}^{d} λ_i²(H)(1 − e^{−2tλ_i(H)}).    (5.2)

The first term decays linearly with asymptotic rate 2λd(H), and the second term is induced by noise, and its asymptotic value is proportional to the learning rate η. This is the well-known two-phase behavior of SGD under constant learning rates: an initial descent phase induced by the deterministic gradient flow and an eventual fluctuation phase dominated by the variance of the stochastic gradients. In this sense, the SME makes the same predictions, and in fact we can see that it approximates the SGD iterations well as η decreases (Fig. 5.1(a)), according to the rates we derived in Thm. 9 and Cor. 10.
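The two-phase prediction (5.2) can also be compared directly against averaged SGD runs. The sketch below evaluates the closed form (5.2) and a Monte Carlo average of SGD on model (5.1) in the eigenbasis of H; the spectrum, horizon and number of runs are illustrative assumptions (far fewer runs than the 1e6 used for Fig. 5.1).

```python
import numpy as np

rng = np.random.default_rng(5)
d, eta, T, runs = 3, 0.05, 2.0, 2000
lam = np.array([2.0, 1.0, 0.1])        # assumed eigenvalues of H; we work in its eigenbasis
H = np.diag(lam)
x0 = np.ones(d)
N = int(T / eta)

def Ef_sme(t):
    """Closed-form prediction (5.2) for E f(X_t) under the order-1 SME."""
    return 0.5 * x0 @ (H @ (np.exp(-2 * t * lam) * x0)) \
        + 0.25 * eta * np.sum(lam**2 * (1 - np.exp(-2 * t * lam)))

x = np.tile(x0, (runs, 1))
for k in range(N):
    gammas = rng.standard_normal((runs, d))
    x = x - eta * (x - gammas) * lam            # SGD step: grad f_gamma(x) = H(x - gamma), H diagonal
    t = (k + 1) * eta
    if (k + 1) % 10 == 0:
        Ef_sgd = 0.5 * np.mean(np.sum(lam * x**2, axis=1))
        print(f"t={t:.2f}  SGD E[f]={Ef_sgd:.4f}  SME prediction={Ef_sme(t):.4f}")
```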

[Figure 5.1 here. Panel (a): weak error |Ef(X_T) − Ef(x_{T/η})| versus learning rate η on log-log axes, with reference lines of slope 1 (order-1 SME) and slope 2 (order-2 SME). Panel (b): initial descent rate of SGD versus condition number κ(H), with a reference line of slope −1.]

Figure 5.1: SME prediction vs SGD dynamics. (a) SME as a weak approximation of the SGD. We compute the weak error with test function g equal to f (see Thm. 9). As predicted by our analysis, the order-2 SME (4.4) (resp. the order-1 SME (4.5)) should give a slope-2 (resp. slope-1) decrease in error as η decreases (note that the x-axis is flipped). The SME solution is computed using an exact formula derived by the application of Itô isometry and the SGD expectation is averaged over 1e6 runs. We took T = 2.0. We see that the predictions of Thm. 9 and Cor. 10 hold. (b) Descent rate vs condition number. H is generated with different condition numbers, and the resulting descent rate of SGD is approximately ∝ κ(H)^{−1}, as predicted by the SME.

Moreover, notice that by the identification t = kη (k is the SGD iteration number), the SME analysis tells us that the asymptotic linear convergence rate (in k, i.e. rate ∼ −log[Ef(xk)]/k) in the descent phase of the SGD is 2λd(H)η. For numerical stability (even in the non-stochastic case), we usually require η ∝ 1/λ1(H); thus the maximal descent rate is inversely proportional to the condition number κ(H) = λ1(H)/λd(H). We validate this observation by generating a collection of H's with varying condition numbers and applying SGD with η ∝ 1/λ1(H). In Fig. 5.1(b), we plot the initial descent rates versus the condition number of H and we observe that we indeed have rate ∝ κ(H)^{−1}.

Alternate model. Now, we consider a slight variation of the model (5.1). The goal is to show that the dynamics of SGD (and the corresponding SME) is not always Gaussian-like, and thus using the OU process to model the SGD is not always valid. Given the same positive-definite matrix H, we diagonalize it in the form H = QDQ^T, where Q is an orthogonal matrix and D is a diagonal matrix of eigenvalues. We then define the sample objective

fγ(x) := (1/2)(Q^T x)^T [D + diag(γ)](Q^T x),   γ ∼ N(0, I),    (5.3)

which gives the same total objective f(x) ≡ Efγ(x) = (1/2)x^T Hx. However, we have a different expression for Σ(x):

Σ(x) = Q diag(Q^T x)² Q^T,

which gives the SME

dXt = −HXt dt + √η Q |diag(Q^T Xt)| Q^T dWt,

which is equal in distribution to

dXt = −HXt dt + √η Q diag(Q^T Xt) Q^T dWt.

We can rewrite the above as

dXt = −HXt dt + √η ∑_{l=1}^{d} Q^{(l)} Xt dW_{(l),t},

where Q^{(l)} = Q diag(Q_{(l,·)}) Q^T and Q_{(l,·)} denotes the lth row of Q. By observing that every pair of H, Q^{(1)}, . . . , Q^{(d)} commutes, we have the explicit solution

Xt = e^{−(1/2)ηt + √η ∑_{l=1}^{d} Q^{(l)} W_{(l),t}} e^{−Ht} x0,

which is a multi-dimensional Black-Scholes (Black and Scholes, 1973) type of stochastic process. In particular, the distribution is not Gaussian for any t > 0. Nevertheless, we may take expectations to obtain

Ef(Xt) = (1/2)e^{ηt} x0^T H e^{−2Ht} x0.

This immediately implies the following interesting behavior: if η < 2λd(H), then 2H − ηI is positive definite and so Ef(Xt) → 0 exponentially at constant, non-zero η; otherwise, depending on the initial condition x0, the objective may not converge to 0. In particular, if η > 2λd(H) (which happens quite often if the condition number of H is large) and x0 is in general position, then we have asymptotic exponential divergence. This is a variance-induced divergence typically observed in Black-Scholes and geometric Brownian motion type stochastic processes. The term "variance-induced" is important here since the deterministic part of the evolution equation is mean-reverting and in fact is identical to the stable OU process studied earlier. In Fig. 5.2(a), (b), we show the correspondence of the SME findings with the actual dynamics of the SGD iterations. In particular, we see in Fig. 5.2(c) that for small η, we have exponential convergence of the SGD at constant learning rates, whereas for η > 2λd(H), the SGD iterates start to oscillate wildly and their mean value is dominated by a few large values and diverges approximately at the rate predicted by the SME. Note that this divergence is predicted to occur at a finite η, and from the theory developed so far we cannot conclude that the SME approximation always holds accurately in this regime (but the approximation is guaranteed for η sufficiently small). Nevertheless, we observe at least in this model that the variance-induced divergence of the SGD happens as predicted by the SME.
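The threshold η ≈ 2λd(H) can be probed directly. The sketch below runs SGD on model (5.3) (in the eigenbasis of H) for one learning rate below and one above 2λd(H), and compares the sample mean of f at a moderate time with the prediction Ef(Xt) = (1/2)e^{ηt}x0^T He^{−2Ht}x0; for the unstable learning rate the sample mean is heavy-tailed, so the Monte Carlo estimate is noisy, exactly as remarked above. The eigenvalues, horizon and run count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
lam = np.array([1.0, 0.1, 0.01])       # assumed eigenvalues of H, so 2*lambda_d(H) = 0.02
x0 = np.ones(3)

def Ef_sme(eta, t):
    """Prediction Ef(X_t) = 0.5 * exp(eta*t) * x0^T H exp(-2Ht) x0, in the eigenbasis of H."""
    return 0.5 * np.exp(eta * t) * np.sum(lam * np.exp(-2 * lam * t) * x0**2)

def sgd_mean_f(eta, T, runs=4000):
    """Monte Carlo estimate of E f(x_k) at k = T/eta for SGD on model (5.3), in the eigenbasis."""
    x = np.tile(x0, (runs, 1))
    for _ in range(int(T / eta)):
        gammas = rng.standard_normal((runs, 3))
        x = x - eta * (lam + gammas) * x         # grad f_gamma(x) = (D + diag(gamma)) x
    return 0.5 * np.mean(np.sum(lam * x**2, axis=1))

T = 10.0
for eta in (0.001, 0.1):                         # below and above 2*lambda_d(H)
    print(f"eta={eta}:  SGD mean f at T={T}: {sgd_mean_f(eta, T):.4f}   SME prediction: {Ef_sme(eta, T):.4f}")
# The closed form alone shows the long-time dichotomy (decay vs variance-induced growth):
for t in (10, 50, 100):
    print(f"t={t}:  prediction at eta=0.001: {Ef_sme(0.001, t):.2e}   at eta=0.1: {Ef_sme(0.1, t):.2e}")
```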

5.2. SME analysis of MSGD

Let us now use the SME to analyze MSGD applied to model (5.1). We have shown earlier that Σ(x) = H. Thus, according to Thm. 14, the order-1 SME for MSGD is
$$ \begin{aligned} dV_t &= -[\mu V_t + H X_t]\,dt + \sqrt{\eta}\, H\, dW_t, \\ dX_t &= V_t\, dt, \end{aligned} \tag{5.4} $$
with X_0 = x_0 and V_0 = 0. If we set Y_t := (V_t, X_t) ∈ R^{2d}, let U_t be a 2d-dimensional Brownian motion whose first d coordinates equal W_t, and define the block matrices
$$ A := \begin{pmatrix} \mu I & H \\ -I & 0 \end{pmatrix}, \qquad B := \begin{pmatrix} H & 0 \\ 0 & 0 \end{pmatrix}, \tag{5.5} $$
we can then write (5.4) as
$$ dY_t = -A Y_t\, dt + \sqrt{\eta}\, B\, dU_t, \qquad Y_0 = (0, x_0), $$
which admits the explicit solution
$$ Y_t = e^{-At}\left( Y_0 + \sqrt{\eta} \int_0^t e^{As} B\, dU_s \right). $$
By Itô's isometry, we have
$$ \mathbb{E} f(X_t) = \tfrac{1}{2}\left[ \big|\mathrm{diag}(0,H)^{1/2} e^{-At} Y_0\big|^2 + \eta \int_0^t \big|\mathrm{diag}(0,H)^{1/2} e^{-(t-s)A} B\big|^2 ds \right]. \tag{5.6} $$

One can see immediately that a similar two-phase behavior is present, but the property of the descent phase now hinges on the spectral properties of the matrix A (instead of H). Before proceeding, we first observe that the eigenvalues of A can be written as
$$ \lambda(A) := \{\Lambda_{+}, \Lambda_{-}\}, \qquad \Lambda_{\pm,i} = \tfrac{1}{2}\left( \mu \pm \sqrt{\mu^2 - 4\lambda_i(H)} \right), \quad i = 1, 2, \dots, d. \tag{5.7} $$
In particular, Re λ_i(A) > 0 for all i as long as µ > 0. We also need the following simple result concerning the decay of the norm of e^{-tA} when all eigenvalues of A have positive real part.


Figure 5.2: SME prediction vs SGD dynamics for the model variant (5.3). (a) Order of convergence of the SME to the SGD. We use the same setup as in Fig. 5.1(a). Observe that our analysis again predicts the correct rate of weak error decay as η decreases. (b) SGD paths vs order-1 SME prediction. Solid lines are SME exact solutions and dotted lines are means of SGD paths over 500 runs, with the 25–75 percentiles shaded. We observe convergence of Ef at constant η, and that the sample mean is dominated by a few large values, as seen from the deviation of the percentiles from the mean. (c) Variance-induced explosion. As predicted by the SME analysis, if η > 2λ_d(H) (here λ_d(H) = 0.01), variance-induced instability sets in.

Lemma 17 Let A be a real square matrix such that all eigenvalues have positive real part. Then,

(i) For each ε > 0, there exists a constant C_ε > 0, independent of t but depending on ε, such that
$$ |e^{-tA}| \le C_\varepsilon\, e^{-t\left(\min_i \mathrm{Re}\,\lambda_i(A) - \varepsilon\right)}. $$


(ii) If in addition A is diagonalizable, then there exists a constant C > 0, independent of t, such that
$$ |e^{-tA}| \le C\, e^{-t \min_i \mathrm{Re}\,\lambda_i(A)}. $$

Proof See Appendix E.

With the above results, we can now characterize the decay of the objective under momentum SGD. From expression (5.7), we see that as long as µ² ≠ 4λ_i(H) for any i = 1, . . . , d, A has 2d distinct eigenvalues and is hence diagonalizable. We shall hereafter assume that µ is in general position such that this is the case. Using Lem. 17 and expression (5.6), we arrive at the estimate
$$ \mathbb{E} f(X_t) \le \tfrac{1}{2} C^2 |x_0|^2 \lambda_1(H)\, e^{-2t \min_i \mathrm{Re}\,\lambda_i(A)} + \frac{\eta\, C^2 \lambda_1(H)^3}{2 \min_i \mathrm{Re}\,\lambda_i(A)} \left( 1 - e^{-2t \min_i \mathrm{Re}\,\lambda_i(A)} \right). \tag{5.8} $$
This result tells us that the convergence rate of the descent phase is now controlled by the minimum real part of the eigenvalues of A, instead of the minimum eigenvalue of H. In particular, we achieve the best linear convergence rate by maximizing the smallest real part of the eigenvalues of A. This leads to the following optimization problem for the optimal convergence rate:
$$ \sup_{\mu \in (0,\infty)} \; \min_{i=1,\dots,d} \; \min_{s \in \{+1,-1\}} \; \mathrm{Re}\left[ \mu + s \sqrt{\mu^2 - 4\lambda_i(H)} \right]. $$
Since H is positive definite, the supremum is attained at µ* = 2√λ_d(H), with the rate also equal to 2√λ_d(H). However, note that if we take µ = µ* exactly, one can check that A is no longer diagonalizable and, by Lem. 17, the rate is slightly diminished; thus technically we can take µ as close to µ* as we like (i.e. the rate is as close to 2√λ_d(H) as we like), but exact equality is not deducible from the current results. In Fig. 5.3(c), we demonstrate the optimal choice of µ and its effect on the convergence rate. Moreover, observe that as µ increases, the number of complex eigenvalues starts to decrease, and the magnitudes of the imaginary parts of the complex eigenvalues also decrease. This signifies that increasing µ causes the oscillations to decrease in magnitude and frequency. This is again corroborated by numerical experiments (Fig. 5.3(c)).
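The claim that µ* = 2√λ_d(H) maximizes the descent rate is easy to check numerically; the sketch below (ours, not from the paper; NumPy assumed, names illustrative) scans µ and compares against the prediction.

```python
# Sketch (ours): verify that mu* = 2*sqrt(lambda_d(H)) maximizes the SME
# descent rate 2*min_i Re(lambda_i(A)) for A defined in (5.5).
import numpy as np

def descent_rate(H, mu):
    """2 * min_i Re(lambda_i(A)) for the block matrix A of (5.5)."""
    d = H.shape[0]
    A = np.block([[mu * np.eye(d), H], [-np.eye(d), np.zeros((d, d))]])
    return 2.0 * np.linalg.eigvals(A).real.min()

H = np.diag([1.0, 0.25, 0.01])                 # lambda_d(H) = 0.01
mus = np.linspace(0.05, 2.0, 400)
rates = [descent_rate(H, mu) for mu in mus]
mu_best = mus[int(np.argmax(rates))]
print(f"numerically best mu ~ {mu_best:.3f}, predicted mu* = {2*np.sqrt(0.01):.3f}")
print(f"best rate ~ {max(rates):.3f}, predicted 2*sqrt(lambda_d) = {2*np.sqrt(0.01):.3f}")
```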

Another interesting observation is that, by the identification t = ηk, the descent rate (in terms of k) is 2√λ_d(H)η. As before, if we choose the maximal stable learning rate we would have η ∝ 1/λ_1(H) (η = η2 according to the scaling introduced in (4.6)). Thus, the MSGD iterates have descent rate ∝ κ(H)^{-1/2}, which is a huge improvement over SGD, whose rate is ∝ κ(H)^{-1}, especially for badly conditioned matrices where κ(H) ≫ 1. In Fig. 5.3(d), we plot the MSGD initial descent rates for varying condition numbers of H. Again, we observe that the SME analysis gives the correct characterization of the precise dynamics and recovers the square-root relationship with the condition number.

Finally, let us discuss the effect of adding momentum on the asymptotic fluctuations due to noisy gradients. Note that it is not correct to conclude, using Eq. (5.8), that taking µ ≈ µ* also gives the lowest fluctuations. This is because the constant C depends on µ as well, as is evidenced in the proof of Lem. 17, which shows that C depends on the conditioning of the


Figure 5.3: SME prediction vs MSGD dynamics. (a) and (b) SME vs MSGD dynamics at µ = 0.1 for different learning rates η. As before, the SME prediction gets better as η decreases, according to the predicted order. Notice also the presence of oscillations, due to the complex eigenvalues of A. (c) The optimal descent rate of MSGD is achieved by the SME prediction µ = µ*, which is 0.95 in this case. Notice that, exactly as predicted by the SME, increasing µ decreases the oscillation frequency and magnitude (due to having fewer complex eigenvalues and smaller imaginary parts), as well as the asymptotic fluctuations (due to formula (5.9)). (d) Descent rate vs condition number. H is generated with different condition numbers, and the descent rate of MSGD is ∝ κ(H)^{-1/2}, as predicted by the SME, which for badly conditioned H gives a large improvement.

eigenvector matrix of A. To proceed, we do not use the bound (5.8). Instead, we explicitly diagonalize A and, after some computations, arrive at the exact expression for Ef(X_t):
$$ \mathbb{E} f(X_t) = \tfrac{1}{2}\big|\mathrm{diag}(0,H)^{1/2} e^{-At} Y_0\big|^2 \tag{5.9} $$
$$ \quad + \tfrac{1}{2}\eta \sum_{i=1}^{d} \frac{\lambda_i(H)^3}{|\mu^2 - 4\lambda_i(H)|} \left[ \frac{1 - e^{-2t\,\mathrm{Re}\,\Lambda_{+,i}}}{2\,\mathrm{Re}\,\Lambda_{+,i}} + \frac{1 - e^{-2t\,\mathrm{Re}\,\Lambda_{-,i}}}{2\,\mathrm{Re}\,\Lambda_{-,i}} - 2R(t, \mu, \lambda_i(H)) \right], \tag{5.10} $$


where
$$ R(t,\mu,\lambda) = \begin{cases} \dfrac{1 - e^{-t\mu}}{\mu}, & \mu \ge 2\sqrt{\lambda}, \\[1ex] \dfrac{\mu + \sqrt{4\lambda - \mu^2}\, e^{-\mu t} \sin\!\big(t\sqrt{4\lambda - \mu^2}\big) - \mu\, e^{-\mu t} \cos\!\big(t\sqrt{4\lambda - \mu^2}\big)}{4\lambda}, & \mu < 2\sqrt{\lambda}. \end{cases} \tag{5.11} $$

In particular, the asymptotic loss value induced by the noise is
$$ \lim_{t\to\infty} \mathbb{E} f(X_t) = \tfrac{1}{2}\eta \sum_{i=1}^{d} \frac{\lambda_i(H)^3}{|\mu^2 - 4\lambda_i(H)|} \left[ \frac{1}{2\,\mathrm{Re}\,\Lambda_{+,i}} + \frac{1}{2\,\mathrm{Re}\,\Lambda_{-,i}} - 2\min\left\{ \frac{\mu}{4\lambda_i(H)}, \frac{1}{\mu} \right\} \right]. \tag{5.12} $$
Observe that this function (in fact, each term in the sum) is monotone-decreasing in µ; for µ ≪ 1 it scales like µ^{-1}, and for µ ≫ 1 it scales like µ^{-3}. Thus, increasing the momentum parameter decreases the asymptotic noise in the iterates, i.e. decreases the asymptotic value of Ef, which should be 0 in the absence of noise. This again agrees with the actual MSGD dynamics (Fig. 5.3(b)). Consequently, to obtain an “optimal” tradeoff between descent and noise, we would like a momentum schedule that equals µ* in the descent phase and increases to infinity (in the original scaling this corresponds to µ → 0) as we approach the optimum. Finding this optimal schedule can be cast as an optimal control problem (Li et al., 2017), and a rigorous investigation of these approaches will be considered in subsequent work.
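The asymptotic level (5.12) can also be tabulated directly; the sketch below (ours, not from the paper; NumPy assumed, the function name is illustrative) evaluates it on a grid of µ and shows the monotone decrease.

```python
# Sketch (ours): evaluate the asymptotic noise level (5.12) on a grid of mu to
# confirm that it decreases monotonically as the momentum damping mu grows.
import numpy as np

def asymptotic_loss(eigs, eta, mu):
    """Right-hand side of (5.12) for H with eigenvalues `eigs`."""
    lam = np.asarray(eigs, dtype=complex)
    lam_p = 0.5 * (mu + np.sqrt(mu**2 - 4*lam))     # Lambda_{+,i}
    lam_m = 0.5 * (mu - np.sqrt(mu**2 - 4*lam))     # Lambda_{-,i}
    bracket = 1/(2*lam_p.real) + 1/(2*lam_m.real) - 2*np.minimum(mu/(4*lam.real), 1/mu)
    return 0.5 * eta * np.sum(lam.real**3 / np.abs(mu**2 - 4*lam.real) * bracket)

eigs = [0.9, 0.2, 0.02]
for mu in [0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"mu = {mu:>4}: asymptotic E f = {asymptotic_loss(eigs, eta=0.1, mu=mu):.4e}")
```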

5.3. SME analysis of SNAG

Finally, let us see what we can say, using the SME approach, about the difference between MSGD and SNAG in this stochastic setting. Let us first consider the case of constant momentum. From Thm. 16, we know that the order-1 SMEs are identical, so we must consider higher-order SMEs. A straightforward computation yields the following order-2 SMEs for MSGD and SNAG (again we let Y_t = (V_t, X_t)):
$$ \begin{aligned} \text{MSGD:}\quad & dY_t = -A_1 Y_t\, dt + \sqrt{\eta}\, B\, dU_t, \qquad Y_0 = (0, x_0), \\ \text{SNAG:}\quad & dY_t = -A_2 Y_t\, dt + \sqrt{\eta}\, B\, dU_t, \qquad Y_0 = (0, x_0), \end{aligned} $$
where A_i = A + ½ηE_i, with A, B as defined in (5.5), and
$$ E_1 := \begin{pmatrix} \mu^2 I - H & \mu H \\ \mu I & H \end{pmatrix}, \qquad E_2 := \begin{pmatrix} \mu^2 I + H & \mu H \\ \mu I & H \end{pmatrix}. $$
From the analysis in Sec. 4.3, the descent rate is dominated by the minimal real parts of the eigenvalues of A_i, which are respectively
$$ \lambda(A_1) = \tfrac{1}{4}\left( \mu(\eta\mu + 2) \pm \sqrt{\mu^2(\eta\mu+2)^2 + 4\eta^2\lambda_i(H)^2 - 8\lambda_i(H)(\eta\mu+2)} \right), \quad i = 1, \dots, d, $$
$$ \lambda(A_2) = \tfrac{1}{4}\left( \mu(\eta\mu + 2) + 2\eta\lambda_i(H) \pm \sqrt{\eta\mu + 2}\,\sqrt{\mu^2(\eta\mu+2) + 4\lambda_i(H)(\eta\mu - 2)} \right), \quad i = 1, \dots, d. $$
We observe that for small µ (i.e. µ ≈ 1 in the usual MSGD scaling), the terms inside the square roots are negative, and hence for the same small µ, the convergence rate of SNAG (in terms of the initial descent rate) is ½ηλ_d(H) larger than that of MSGD. This says in particular that


for H with larger λ_d(H), the acceleration is more pronounced. Moreover, recall that the asymptotic fluctuations are given by
$$ \eta \lim_{t\to\infty} \int_0^t \big| \mathrm{diag}(0,H)^{1/2}\, e^{-(t-s)(A + \frac{1}{2}\eta E_i)}\, B \big|^2\, ds. $$
Without performing tedious computations, we can see that since E_2 − E_1 is positive semi-definite, the exponential in the SNAG case decays more rapidly, and hence the eventual fluctuations are expected to be lower. These observations from the SME are again consistent with the behavior of their SGA counterparts, as shown in Fig. 5.4(a). On the other hand, if we pick µ for each case by separately maximizing the smallest real part of the eigenvalues (as in Sec. 4.3), we obtain similar convergence rates up to η². In other words, if we tune µ well, there would be no difference between MSGD and SNAG in terms of the descent rate (Fig. 5.4(b)).
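The eigenvalue comparison can be checked numerically as well; the sketch below (ours, not from the paper; NumPy assumed, names illustrative) forms A + ½ηE_i for both choices of E_i above and compares the smallest real parts.

```python
# Sketch (ours): compare the order-2 SME descent rates of MSGD and SNAG via
# min_i Re(lambda_i(A_j)) with A_j = A + 0.5*eta*E_j as defined above.
import numpy as np

def min_real_part(H, mu, eta, nesterov):
    d = H.shape[0]
    I, Z = np.eye(d), np.zeros((d, d))
    A = np.block([[mu * I, H], [-I, Z]])
    E = np.block([[mu**2 * I + (H if nesterov else -H), mu * H], [mu * I, H]])
    return np.linalg.eigvals(A + 0.5 * eta * E).real.min()

H = np.diag([1.0, 0.5])          # lambda_d(H) = 0.5
mu, eta = 0.2, 0.1
gap = min_real_part(H, mu, eta, True) - min_real_part(H, mu, eta, False)
print(f"SNAG - MSGD min real part: {gap:.4f}; predicted 0.5*eta*lambda_d = {0.5*eta*0.5:.4f}")
```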


Figure 5.4: MSGD vs SNAG (with constant momentum) dynamics for η = 0.1 and different λ_d(H). (a) Dynamics at fixed µ = 0.2. We observe that, as predicted by the SME analysis, SNAG enjoys a faster linear convergence rate in the descent phase, as well as lower asymptotic fluctuations. The acceleration is indeed more pronounced for larger λ_d(H). (b) When µ for each case is chosen optimally for the descent (by maximizing the minimal real part of the eigenvalues of A_1, A_2 respectively), the dynamics become similar.

Now, let us discuss the varying momentum case. According to (4.15), for some small t_0 > 0 we have the order-1 SME for t ∈ [t_0, T]
$$ dY_t = -A_t Y_t\, dt + \sqrt{\eta}\, B\, dU_t, \qquad Y_{t_0} = (v_{t_0}, x_{t_0}), \qquad A_t := \begin{pmatrix} \frac{3}{t} I & H \\ -I & 0 \end{pmatrix}, $$


and B is defined as in (5.5). This admits the explicit solution
$$ Y_t = e^{-(t-t_0)\bar{A}_t}\left( Y_{t_0} + \sqrt{\eta}\int_{t_0}^{t} e^{s \bar{A}_s} B\, dU_s \right), \qquad \bar{A}_t := \begin{pmatrix} \frac{3\log(t/t_0)}{t - t_0} I & H \\ -I & 0 \end{pmatrix}. $$
The eigenvalues of \bar{A}_t are
$$ \lambda(\bar{A}_t) = \tfrac{1}{2}\left( \frac{3\log(t/t_0)}{t - t_0} \pm \sqrt{9\left[\frac{\log(t/t_0)}{t - t_0}\right]^2 - 4\lambda_i(H)} \right), \quad i = 1, \dots, d. $$
Since the minimal real part has no positive lower bound, the descent rate before the onset of fluctuations is sub-linear. This is expected because the O(1/t) momentum schedule is suited for non-strongly-convex functions, whereas constant momentum is more appropriate for strongly convex functions (Nesterov, 2013). Furthermore, since the real parts of all eigenvalues of \bar{A}_t converge to 0 as t → ∞, according to the analysis in Sec. 4.3 the asymptotic fluctuations due to noise should be large. Fig. 5.5 confirms these observations and further suggests that in the case of stochastic gradient methods, more careful momentum schedules must be derived in order to balance descent and fluctuations, e.g. using the optimal control framework presented in Li et al. (2017).
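The decay of the spectral gap is easy to visualize numerically; the short sketch below (ours, not from the paper; NumPy assumed) evaluates the eigenvalue formula above on a few time points.

```python
# Sketch (ours): real part of the eigenvalues of the averaged matrix with the
# 3/t (Nesterov-type) damping; it decays to 0, so there is no uniform linear
# descent rate, consistent with the sub-linear behavior discussed above.
import numpy as np

t0 = 1.0
lam = np.array([1.0, 0.1, 0.01])
for t in [2.0, 10.0, 100.0, 1000.0]:
    r = np.log(t / t0) / (t - t0)
    eig = 0.5 * (3*r + np.array([1, -1])[:, None] * np.sqrt((9*r**2 - 4*lam).astype(complex)))
    print(f"t = {t:>6}: min real part = {eig.real.min():.4f}")
```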


Figure 5.5: MSGD vs SNAG (with dynamic momentum according to Nesterov's choice (4.11)) dynamics for η = 0.1 and different λ_d(H). We see that the convergence is indeed sub-linear and, moreover, the asymptotic fluctuations are large compared with MSGD, for which µ is picked to achieve the optimal descent rate.

6. Conclusion

In this paper, we developed the general mathematical foundation of the stochastic modified equations framework for analyzing stochastic gradient algorithms. In particular, we demonstrated that this approach is (1) rigorous, (2) flexible and (3) useful. Indeed, the technique of weak approximations provides a precise mathematical framework for analyzing the relationship between stochastic gradient algorithms and stochastic differential equations. This should be contrasted with strong approximations in the numerical analysis of SDEs, where approximations are required to hold path-wise, say in the mean-square sense (Kloeden and Platen, 2011). The weak formulation greatly increases the flexibility of modelling different


types of stochastic gradient algorithms, as we have demonstrated in Sec. 4. In fact, the main result relating discrete-time algorithms and continuous-time SDEs (Thm. 3) is proved in a fairly general setting that allows one to derive a variety of SMEs for different variations of the SGAs (Sec. 4). Finally, in Sec. 5, we demonstrated the usefulness of the SME approach through explicit calculations. This is enabled by the precise approximation nature of the SMEs and the application of stochastic calculus tools. In particular, we uncovered interesting behaviors of SGAs when applied to a simple yet non-trivial setting, including the tradeoff between descent and fluctuations, the relationship with condition numbers, and the subtle differences between MSGD and SNAG in the stochastic setting. In subsequent work in the series, we will focus on applications, where we extend the SME formalism to study adaptive algorithms and related topics.

Acknowledgements

QL is supported by the Agency for Science, Technology and Research (A*STAR), Singapore. WE is supported by the Office of Naval Research, USA, ONR N00014-13-1-0338.


Appendix A. General existence, uniqueness and moment estimates for SDEs

In this section, we establish general existence, uniqueness and moment estimates for the stochastic differential equations that we encounter in this paper. The results here will be used throughout the subsequent proofs. We note that although similar well-posedness results are well known, here we require a slightly more general version (where the drift and diffusion terms are themselves random functions) in order to deal with the analysis in Appendix B. Moreover, we need uniform estimates with respect to the parameters (η, ε), which warrants the following standard but necessary analysis.

Let T > 0 and Q be a subset of a Euclidean space. For (x, t, q) ∈ R^d × [0, T] × Q, let B(x, t, q) be a d-dimensional random vector and S(x, t, q) be a d × d-dimensional random matrix. Throughout this section we assume:

Assumption A.1 The random functions B,S satisfy the following:

(i) B,S are Wt-adapted and continuous in (x, t) ∈ Rd × [0, T ] almost surely

(ii) B, S satisfy a uniform linear growth condition, i.e. there exists a non-random constant L > 0 such that
$$ |B(x,t,q)|^2 + |S(x,t,q)|^2 \le L^2(1 + |x|^2) \quad \text{a.s.} $$
for all x ∈ R^d, t ∈ [s, T], q ∈ Q.

(iii) B, S satisfy a uniform Lipschitz condition in x, i.e.
$$ |B(x,t,q) - B(y,t,q)| + |S(x,t,q) - S(y,t,q)| \le L|x - y| \quad \text{a.s.} $$
for all x, y ∈ R^d, t ∈ [s, T], q ∈ Q.

Theorem 18 Let s ∈ [0, T) and, for each q ∈ Q, let {φ^q_t : t ∈ [s, T]} be an R^d-valued, W_t-adapted random process that is continuous in t ∈ [s, T] almost surely, with
$$ \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^2 < \infty. \tag{A.1} $$
Then, for each q ∈ Q, the stochastic differential equation
$$ \xi^q_t = \phi^q_t + \int_s^t B(\xi^q_v, v, q)\, dv + \int_s^t S(\xi^q_v, v, q)\, dW_v \tag{A.2} $$
admits a unique solution {ξ^q_t : t ∈ [s, T]} which is continuous for t ∈ [s, T] a.s. and satisfies
$$ \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\xi^q_t|^2 \le C\left( 1 + \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^2 \right) \tag{A.3} $$
for some constant C > 0 that depends only on L and T.

30

Page 31: StochasticModifiedEquationsandDynamicsofStochastic ...jmlr.csail.mit.edu/papers/volume20/17-526/17-526.pdfStochastic Modified Equations I: Mathematical Foundations imations that establish

Stochastic Modified Equations I: Mathematical Foundations

Proof For each q ∈ Q, let us define the recursion
$$ \xi^{q,0}_t = \phi^q_t, \qquad \xi^{q,m+1}_t = \phi^q_t + \int_s^t B(\xi^{q,m}_v, v, q)\, dv + \int_s^t S(\xi^{q,m}_v, v, q)\, dW_v, \quad m \ge 0. $$

Note that Assumption A.1 implies that each ξ^{q,m}_t is well-defined. Now, let m ≥ 1. By Itô's isometry, we have
$$ \begin{aligned} |\xi^{q,m+1}_t - \xi^{q,m}_t|^2 \le{}& 2\left| \int_s^t B(\xi^{q,m}_v, v, q) - B(\xi^{q,m-1}_v, v, q)\, dv \right|^2 && \text{(A.4)} \\ &+ 2\left| \int_s^t S(\xi^{q,m}_v, v, q) - S(\xi^{q,m-1}_v, v, q)\, dW_v \right|^2 && \text{(A.5)} \\ \le{}& 2T \int_s^t \big|B(\xi^{q,m}_v, v, q) - B(\xi^{q,m-1}_v, v, q)\big|^2\, dv && \text{(A.6)} \\ &+ 2 \int_s^t \big|S(\xi^{q,m}_v, v, q) - S(\xi^{q,m-1}_v, v, q)\big|^2\, dv. && \text{(A.7)} \end{aligned} $$
Thus, applying the Lipschitz assumption A.1 (iii) and taking expectations, we get
$$ \mathbb{E}|\xi^{q,m+1}_t - \xi^{q,m}_t|^2 \le 2L^2(1 + T) \int_s^t \mathbb{E}|\xi^{q,m}_v - \xi^{q,m-1}_v|^2\, dv. \tag{A.8} $$

Now, for m = 0, Assumption A.1 (ii) together with (A.1) gives
$$ \mathbb{E}|\xi^{q,1}_t - \xi^{q,0}_t|^2 \le C \int_s^t \left( 1 + \sup_{q \in Q} \mathbb{E}|\phi^q_v|^2 \right) dv \le C'(t - s). \tag{A.9} $$
Combining (A.8) and (A.9), we have
$$ \mathbb{E}|\xi^{q,m+1}_t - \xi^{q,m}_t|^2 \le \frac{[C(t-s)]^{m+1}}{(m+1)!}, \quad m \ge 0, \tag{A.10} $$
for some C > 0 that depends only on T, L and C_φ := sup_{q∈Q} E sup_{t∈[s,T]} |φ^q_t|². Moreover,

Eq. (A.4) implies
$$ \mathbb{E}\sup_{t\in[s,T]} |\xi^{q,m+1}_t - \xi^{q,m}_t|^2 \le 2L^2 T \int_s^T \mathbb{E}|\xi^{q,m}_t - \xi^{q,m-1}_t|^2\,dt + 2\,\mathbb{E}\sup_{t\in[s,T]} \left| \int_s^t S(\xi^{q,m}_v,v,q) - S(\xi^{q,m-1}_v,v,q)\,dW_v \right|^2. $$
Estimate (A.10) implies that the last stochastic integral is a martingale, and hence, using Doob's maximal inequality and Itô's isometry, we have
$$ \mathbb{E}\sup_{t\in[s,T]} |\xi^{q,m+1}_t - \xi^{q,m}_t|^2 \le 2L^2 T \int_s^T \mathbb{E}|\xi^{q,m}_t - \xi^{q,m-1}_t|^2\,dt + 8L^2 \int_s^T \mathbb{E}|\xi^{q,m}_t - \xi^{q,m-1}_t|^2\,dt \le 2L^2(T+4)\, \frac{C^m T^{m+1}}{(m+1)!}. $$


Applying Markov's inequality,
$$ \sum_{m\ge 0} \mathbb{P}\left[ \sup_{t\in[s,T]} |\xi^{q,m+1}_t - \xi^{q,m}_t| > 2^{-m} \right] \le 2L^2(T+4) \sum_{m\ge 0} \frac{2^{2m} C^m T^{m+1}}{(m+1)!} < \infty. $$
Thus, by the Borel–Cantelli lemma,
$$ \mathbb{P}\left[ \sup_{t\in[s,T]} |\xi^{q,m+1}_t - \xi^{q,m}_t| > 2^{-m} \text{ infinitely often} \right] = 0, $$
which immediately implies
$$ \xi^{q,k}_t = \xi^{q,0}_t + \sum_{m=0}^{k-1} (\xi^{q,m+1}_t - \xi^{q,m}_t) \to \xi^q_t \quad \text{a.s.} $$
uniformly in t ∈ [s, T], for some limiting process ξ^q_t which is necessarily continuous almost surely and W_t-adapted. Moreover, we also have convergence in L²(Ω) uniformly in t. To see this, for each k > l we observe that
$$ \sup_{t\in[s,T]} \big( \mathbb{E}|\xi^{q,k}_t - \xi^{q,l}_t|^2 \big)^{1/2} \le \sup_{t\in[s,T]} \sum_{m=l}^{k-1} \big( \mathbb{E}|\xi^{q,m+1}_t - \xi^{q,m}_t|^2 \big)^{1/2} \le \sum_{m=l}^{\infty} \sqrt{\, 2L^2(T+4)\, \frac{C^m T^{m+1}}{(m+1)!} } \;\longrightarrow\; 0 \quad (l \to \infty). $$

Hence ξ^{q,k}_t converges uniformly in L²(Ω) to ξ^q_t as k → ∞ (the limit is the same as the a.s. limit, since a sub-sequence of it must converge a.s.). This immediately implies, via the Lipschitz condition and Itô's isometry, that
$$ \mathbb{E}\left| \int_s^T B(\xi^{q,k}_t, t, q) - B(\xi^q_t, t, q)\, dt \right|^2 \le T^2 L^2 \sup_{t \in [s,T]} \mathbb{E}\big|\xi^{q,k}_t - \xi^q_t\big|^2 \to 0, $$
$$ \mathbb{E}\left| \int_s^T S(\xi^{q,k}_t, t, q) - S(\xi^q_t, t, q)\, dW_t \right|^2 \le T L^2 \sup_{t \in [s,T]} \mathbb{E}\big|\xi^{q,k}_t - \xi^q_t\big|^2 \to 0. $$

Thus, ξ^q_t satisfies (A.2). We now show the estimate (A.3). From Eq. (A.2), we have by Itô's isometry,
$$ \begin{aligned} \mathbb{E}|\xi^q_t|^2 &\le 3\,\mathbb{E}|\phi^q_t|^2 + 3\,\mathbb{E}\left| \int_s^T B(\xi^q_v,v,q)\,dv \right|^2 + 3\,\mathbb{E}\left| \int_s^t S(\xi^q_v,v,q)\,dW_v \right|^2 \\ &\le 3C_\phi + 3T^2 L^2 \int_s^T \mathbb{E}(1+|\xi^q_v|^2)\,dv + 3TL^2 \int_s^T \mathbb{E}(1+|\xi^q_v|^2)\,dv. \end{aligned} $$
Thus, by Gronwall's lemma, we have
$$ \mathbb{E}|\xi^q_t|^2 \le C(1 + C_\phi) \tag{A.11} $$


for some C > 0 depending only on T, L. Consequently, we have
$$ \begin{aligned} \mathbb{E}\sup_{t\in[s,T]} |\xi^q_t|^2 \le{}& 3C_\phi + 3T^2 L^2\, \mathbb{E}\int_s^T (1+|\xi^q_t|^2)\,dt && \text{(A.12)} \\ &+ 3\,\mathbb{E}\sup_{t\in[s,T]} \left| \int_s^t S(\xi^q_v,v,q)\,dW_v \right|^2. && \text{(A.13)} \end{aligned} $$
Assumption A.1 (ii) and (A.11) imply that the last stochastic integral is a martingale, and so by Doob's maximal inequality,
$$ \mathbb{E}\sup_{t\in[s,T]} \left| \int_s^t S(\xi^q_v,v,q)\,dW_v \right|^2 \le \int_s^T 4L^2\big(1 + \mathbb{E}|\xi^q_t|^2\big)\,dt. \tag{A.14} $$

Combining (A.12) and (A.14), we arrive at (A.3). Finally, we show uniqueness. Suppose that ξ_t, ξ'_t are two solutions of (A.2). The same calculation as before shows that
$$ \mathbb{E}|\xi_t - \xi'_t|^2 \le 2L^2 T(1 + T) \int_s^t \mathbb{E}|\xi_v - \xi'_v|^2\, dv, $$
and Gronwall's lemma implies
$$ \mathbb{E}|\xi_t - \xi'_t|^2 \le e^{2L^2 T(1+T)}\, \mathbb{E}|\xi_s - \xi'_s|^2 = 0. $$
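The Picard iteration underlying the proof can also be observed numerically. The following minimal sketch (ours, not part of the proof; it assumes NumPy is available and the function name picard_step is illustrative) runs the iteration on a time grid with one fixed Brownian path for a simple scalar SDE with Lipschitz coefficients, and prints the successive sup-differences, which shrink rapidly in the spirit of the bound (A.10).

```python
# Numerical illustration (ours): Picard iteration for the scalar SDE
# xi_t = x0 + int_0^t b(xi_v) dv + int_0^t sigma(xi_v) dW_v,
# with b(x) = -x and sigma(x) = 0.5*x (Lipschitz with linear growth).
import numpy as np

rng = np.random.default_rng(1)
n, T, x0 = 2000, 1.0, 1.0
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)

b = lambda x: -x
sigma = lambda x: 0.5 * x

def picard_step(xi):
    """One iteration xi^{m+1} = x0 + int b(xi^m) dv + int sigma(xi^m) dW (left-point sums)."""
    incr = b(xi[:-1]) * dt + sigma(xi[:-1]) * dW
    return x0 + np.concatenate([[0.0], np.cumsum(incr)])

xi = np.full(n + 1, x0)                      # xi^{0}_t = phi_t = x0
for m in range(8):
    xi_next = picard_step(xi)
    print(f"iter {m}: sup_t |xi^(m+1) - xi^m| = {np.max(np.abs(xi_next - xi)):.2e}")
    xi = xi_next
```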

Theorem 19 Let us assume the same conditions as in Thm. 18 and, for each q ∈ Q, let ξ^q_t be the unique solution of (A.2). Let m ≥ 1 and suppose sup_{q∈Q} E sup_{t∈[s,T]} |φ^q_t|^{2m} < ∞. Then, there exists a constant C > 0 depending only on L, T, m such that
$$ \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\xi^q_t|^{2m} \le C\left( 1 + \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^{2m} \right). $$

Proof We have
$$ |\xi^q_t|^{2m} \le 3^{2m-1} |\phi^q_t|^{2m} + (3(t-s))^{2m-1} \int_s^t |B(\xi^q_v, v, q)|^{2m}\, dv + 3^{2m-1} \left| \int_s^t S(\xi^q_v, v, q)\, dW_v \right|^{2m}. $$
Taking expectations, using Itô's isometry (inequality version) and Gronwall's inequality, we obtain
$$ \begin{aligned} \mathbb{E}|\xi^q_t|^{2m} \le{}& 3^{2m-1} \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^{2m} + (3(t-s))^{2m-1} L^{2m} \int_s^t \big(1 + \mathbb{E}|\xi^q_v|^{2m}\big)\, dv \\ &+ 3^{2m-1} \big(m(2m-1)\big)^m (t-s)^{m-1} L^{2m} \int_s^t \big(1 + \mathbb{E}|\xi^q_v|^{2m}\big)\, dv \\ \le{}& C\left( 1 + \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^{2m} \right) < \infty, \end{aligned} $$


with C depending only on L, T, m. Next,
$$ \begin{aligned} \mathbb{E} \sup_{t \in [s,T]} |\xi^q_t|^{2m} \le{}& 3^{2m-1} \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^{2m} + (3T)^{2m-1}\, \mathbb{E} \int_s^T |B(\xi^q_v, v, q)|^{2m}\, dv \\ &+ 3^{2m-1}\, \mathbb{E} \sup_{t \in [s,T]} \left| \int_s^t S(\xi^q_v, v, q)\, dW_v \right|^{2m}. \end{aligned} $$
Now, in the last term, the stochastic integral is a local martingale and so its absolute value is a submartingale; hence the last term is bounded by

$$ \mathbb{E} \sup_{t \in [s,T]} \left| \int_s^t S(\xi^q_v, v, q)\, dW_v \right|^{2m} \le C\, \mathbb{E} \left| \int_s^T S(\xi^q_v, v, q)\, dW_v \right|^{2m} \le C' \int_s^T \mathbb{E}|S(\xi^q_v, v, q)|^{2m}\, dv. $$

Thus, using the pointwise moment bound on E|ξ^q_t|^{2m} obtained above and the linear growth condition, we conclude that
$$ \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\xi^q_t|^{2m} \le C\left( 1 + \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\phi^q_t|^{2m} \right). $$

Finally, we examine some limiting behavior of solutions ξqt as q → q∗ for some q∗ ∈ Q.

Theorem 20 Let us assume the same conditions as in Thm. 18 and let q* ∈ Q be fixed. Suppose further that the following holds for any t ∈ [s, T], R > 0 and ε > 0:

(i) lim_{q→q*} P[ sup_{|x|≤R} |B(x, t, q) − B(x, t, q*)| > ε ] = 0;

(ii) lim_{q→q*} P[ sup_{|x|≤R} |S(x, t, q) − S(x, t, q*)| > ε ] = 0;

(iii) lim_{q→q*} sup_{t∈[s,T]} E|φ^q_t − φ^{q*}_t|² = 0.

Then, the solutions ξ^q_t of (A.2) satisfy
$$ \lim_{q \to q^*} \sup_{t \in [s,T]} \mathbb{E}|\xi^q_t - \xi^{q^*}_t|^2 = 0, $$
i.e. ξ^q_t → ξ^{q*}_t in L²(Ω) uniformly in t ∈ [s, T].

Proof We have
$$ \xi^q_t - \xi^{q^*}_t = \zeta^q_t + \int_s^t B(\xi^q_v, v, q) - B(\xi^{q^*}_v, v, q)\, dv + \int_s^t S(\xi^q_v, v, q) - S(\xi^{q^*}_v, v, q)\, dW_v, $$


where
$$ \zeta^q_t := \phi^q_t - \phi^{q^*}_t + \int_s^t B(\xi^{q^*}_v, v, q) - B(\xi^{q^*}_v, v, q^*)\, dv + \int_s^t S(\xi^{q^*}_v, v, q) - S(\xi^{q^*}_v, v, q^*)\, dW_v. $$
Using the Lipschitz conditions,
$$ \mathbb{E}|\xi^q_t - \xi^{q^*}_t|^2 \le 3\,\mathbb{E}|\zeta^q_t|^2 + 6L^2 \int_s^t \mathbb{E}|\xi^q_v - \xi^{q^*}_v|^2\, dv, $$
which by Gronwall's lemma implies
$$ \sup_{t \in [s,T]} \mathbb{E}|\xi^q_t - \xi^{q^*}_t|^2 \le 3\, e^{6L^2 T} \sup_{t \in [s,T]} \mathbb{E}|\zeta^q_t|^2. $$

Thus, it remains to show that sup_{t∈[s,T]} E|ζ^q_t|² → 0 as q → q*. Now,
$$ \sup_{t \in [s,T]} \mathbb{E}|\zeta^q_t|^2 \le 3 \sup_{t \in [s,T]} \mathbb{E}|\phi^q_t - \phi^{q^*}_t|^2 + 3T \int_s^T \mathbb{E}\big|B(\xi^{q^*}_v, v, q) - B(\xi^{q^*}_v, v, q^*)\big|^2\, dv + 3 \int_s^T \mathbb{E}\big|S(\xi^{q^*}_v, v, q) - S(\xi^{q^*}_v, v, q^*)\big|^2\, dv. $$
For each v ∈ [s, T], assumption (i) together with the a.s. continuity of B implies that B(ξ^{q*}_v, v, q) → B(ξ^{q*}_v, v, q*) in probability. Moreover, by Assumption A.1 (ii) the integrand is bounded by 2L²(1 + sup_{v∈[s,T]} |ξ^{q*}_v|²), which is integrable. By the dominated convergence theorem, the integral vanishes in the limit q → q*. A similar calculation shows that the last integral also vanishes in the same limit. Together with (iii), we arrive at our assertion.

Appendix B. Derivatives with respect to initial condition

Let us denote by {X^{x,s,q}_t : t ≥ 0} the stochastic process defined by the SDE
$$ dX^{x,s,q}_t = b(X^{x,s,q}_t, q)\, dt + \sigma(X^{x,s,q}_t, q)\, dW_t, \quad t \in [s, T], \qquad X^{x,s,q}_s = x. \tag{B.1} $$
As in the previous section, q ∈ Q, where Q is a subset of a Euclidean space. Throughout this section, we assume the following:

Assumption B.1 The (non-random) functions b, σ satisfy

1. Uniform linear growth condition:
$$ |b(x, q)|^2 + |\sigma(x, q)|^2 \le L^2(1 + |x|^2) $$
for all x ∈ R^d, q ∈ Q.

2. Uniform Lipschitz condition:
$$ |b(x, q) - b(y, q)| + |\sigma(x, q) - \sigma(y, q)| \le L|x - y| $$
for all x, y ∈ R^d, q ∈ Q.

With the above assumptions, by Thm. 18, the SDE (B.1) admits a unique solution. The focus of this section is to derive the SDEs that characterize the derivatives of X^{x,s,q}_t with respect to x, the initial condition. In doing so, we will make use of the results proved in Appendix A.

Definition 21 Let Ψ : R^d → R and ψ : R^d → R^d be random functions and suppose that for each i = 1, . . . , d,
$$ \lim_{h \to 0} \mathbb{E}\Big| \tfrac{1}{h}\big[ \Psi(x_{(1)}, \dots, x_{(i-1)}, x_{(i)} + h, x_{(i+1)}, \dots, x_{(d)}) - \Psi(x_{(1)}, \dots, x_{(d)}) \big] - \psi_{(i)}(x) \Big|^2 = 0. $$
Then, we call ψ the derivative (in the L²(Ω) sense) of Ψ and write ∂_{(i)}Ψ = ψ_{(i)}, or ∇Ψ = ψ. For multidimensional Ψ, we similarly define the derivative element-wise. Note that the derivative is almost surely unique, if it exists.

Lemma 22 Let s ∈ [0, T), q ∈ Q and suppose that b and σ are continuously differentiable with respect to x. Then, ∇X^{x,s,q}_t exists and, if we write ξ^{x,s,q}_{(i,j),t} := ∂_{(j)} X^{x,s,q}_{(i),t}, it satisfies the linear random-coefficient stochastic differential equation
$$ \xi^{x,s,q}_{(i,j),t} = \delta_{(i,j)} + \int_s^t \xi^{x,s,q}_{(k,j),v}\, \partial_{(k)} b(X^{x,s,q}_v, q)_{(i)}\, dv + \int_s^t \xi^{x,s,q}_{(k,j),v}\, \partial_{(k)} \sigma(X^{x,s,q}_v, q)_{(i,l)}\, dW_{(l),v}, \tag{B.2} $$
where δ is the usual Kronecker delta. Moreover, we have
$$ \sup_{q \in Q} \mathbb{E} \sup_{t \in [s,T]} |\xi^{x,s,q}_t|^{2m} < \infty $$
for all m ≥ 1.

Proof Let j be fixed and let h_j be the d-dimensional vector whose coordinates are all 0 except the j-th, which equals h_{j,(j)} = h ≠ 0. Then, we have
$$ \tfrac{1}{h}\big(X^{x+h_j,s,q}_{(i),t} - X^{x,s,q}_{(i),t}\big) = \delta_{(i,j)} + \tfrac{1}{h}\int_s^t b(X^{x+h_j,s,q}_v, q)_{(i)} - b(X^{x,s,q}_v, q)_{(i)}\, dv + \tfrac{1}{h}\int_s^t \sigma(X^{x+h_j,s,q}_v, q)_{(i,l)} - \sigma(X^{x,s,q}_v, q)_{(i,l)}\, dW_{(l),v}. $$
But,
$$ \tfrac{1}{h}\int_s^t b(X^{x+h_j,s,q}_v, q)_{(i)} - b(X^{x,s,q}_v, q)_{(i)}\, dv = \int_0^1 \int_s^t \tfrac{1}{h}\big(X^{x+h_j,s,q}_{(k),v} - X^{x,s,q}_{(k),v}\big)\, \partial_{(k)} b\big(\lambda X^{x+h_j,s,q}_v + (1-\lambda) X^{x,s,q}_v, q\big)_{(i)}\, dv\, d\lambda, $$


and similarly,
$$ \tfrac{1}{h}\int_s^t \sigma(X^{x+h_j,s,q}_v, q)_{(i,l)} - \sigma(X^{x,s,q}_v, q)_{(i,l)}\, dW_{(l),v} = \int_0^1 \int_s^t \tfrac{1}{h}\big(X^{x+h_j,s,q}_{(k),v} - X^{x,s,q}_{(k),v}\big)\, \partial_{(k)} \sigma\big(\lambda X^{x+h_j,s,q}_v + (1-\lambda) X^{x,s,q}_v, q\big)_{(i,l)}\, dW_{(l),v}\, d\lambda. $$

Therefore, ξ^{x,s,q,h}_t := \tfrac{1}{h}(X^{x+h_j,s,q}_t − X^{x,s,q}_t) satisfies (A.2) with Q × [0, 1] in place of Q, with φ^{q,h}_{t,(i,j)} = δ_{(i,j)} and
$$ B(z, t, q, h)_{(i)} = z_{(k)} \int_0^1 \partial_{(k)} b\big(\lambda X^{x+h_j,s,q}_t + (1-\lambda) X^{x,s,q}_t, q\big)_{(i)}\, d\lambda, $$
$$ S(z, t, q, h)_{(i,l)} = z_{(k)} \int_0^1 \partial_{(k)} \sigma\big(\lambda X^{x+h_j,s,q}_t + (1-\lambda) X^{x,s,q}_t, q\big)_{(i,l)}\, d\lambda, $$
if h > 0. If h < 0, we simply consider −h on the left-hand side instead and the proof is identical. Furthermore, the uniform Lipschitz conditions on b, σ imply bounded derivatives, and so we may apply Thm. 18 to conclude that there is a process ξ^{q,0}_t satisfying (A.2) with h = 0, i.e. satisfying (B.2).

It remains to show that ξ^{q,h}_t → ξ^{q,0}_t in L²(Ω) uniformly in t ∈ [s, T], which amounts to checking conditions (i)–(iii) in Thm. 20, with q* = (q, 0). The last condition (iii) is trivially satisfied. As for the first two, it is enough to show that X^{x+h_j,s,q}_t → X^{x,s,q}_t in L²(Ω) as h → 0, uniformly in x, which follows from the straightforward estimate
$$ \mathbb{E}|X^{x,s,q}_t - X^{x+h_j,s,q}_t|^2 \le 3h^2 + C \int_s^t \mathbb{E}|X^{x,s,q}_v - X^{x+h_j,s,q}_v|^2\, dv \le C' h^2. $$
Now, we may apply Thm. 20 to deduce that the SDE is satisfied. Finally, the last moment estimate follows from Thm. 19.

Let us now extend the above result to higher-order derivatives. As before, we denote the order-α partial derivative of Ψ in the L²(Ω) sense by
$$ \partial^\alpha_{(J)} \Psi \equiv \partial^\alpha_{(j_1,\dots,j_\alpha)} \Psi, $$
where J is an order-α multi-index.

Lemma 23 Suppose that b, σ ∈ G². Then, for each i, j_1, j_2 ∈ {1, . . . , d}, the derivative ξ^{2,x,s,q}_{(i,j_1,j_2),t} := ∂²_{(j_1,j_2)} X^{x,s,q}_{(i),t} exists and is the unique solution of the linear random-coefficient


stochastic differential equation
$$ \begin{aligned} \xi^{2,x,s,q}_{(i,j_1,j_2),t} ={}& \int_s^t \partial^2_{(k_1,k_2)} b(X^{x,s,q}_v, q)_{(i)}\, \xi^{1,x,s,q}_{(k_1,j_1),v}\, \xi^{1,x,s,q}_{(k_2,j_2),v}\, dv && \text{(B.3)} \\ &+ \int_s^t \partial^2_{(k_1,k_2)} \sigma(X^{x,s,q}_v, q)_{(i,l)}\, \xi^{1,x,s,q}_{(k_1,j_1),v}\, \xi^{1,x,s,q}_{(k_2,j_2),v}\, dW_{(l),v} && \text{(B.4)} \\ &+ \int_s^t \partial_{(k)} b(X^{x,s,q}_v, q)_{(i)}\, \xi^{2,x,s,q}_{(k,j_1,j_2),v}\, dv && \text{(B.5)} \\ &+ \int_s^t \partial_{(k)} \sigma(X^{x,s,q}_v, q)_{(i,l)}\, \xi^{2,x,s,q}_{(k,j_1,j_2),v}\, dW_{(l),v}, && \text{(B.6)} \end{aligned} $$
where ξ^{1,x,s,q}_{(i,j),t} := ∂_{(j)} X^{x,s,q}_{(i),t} is the first derivative. Moreover, for each m ≥ 1, we have E sup_{t∈[s,T]} |ξ^{2,x,s,q}_t|^{2m} ∈ G, i.e.
$$ \sup_{q \in Q,\, s \in [0,T]} \mathbb{E} \sup_{t \in [s,T]} |\xi^{2,x,s,q}_t|^{2m} \le \kappa_1(1 + |x|^{2\kappa_2}). \tag{B.7} $$

Proof Let us denote
$$ \phi^{x,s,q}_{(i,j_1,j_2),t} = \int_s^t \partial^2_{(k_1,k_2)} b(X^{x,s,q}_v, q)_{(i)}\, \xi^{1,x,s,q}_{(k_1,j_1),v}\, \xi^{1,x,s,q}_{(k_2,j_2),v}\, dv + \int_s^t \partial^2_{(k_1,k_2)} \sigma(X^{x,s,q}_v, q)_{(i,l)}\, \xi^{1,x,s,q}_{(k_1,j_1),v}\, \xi^{1,x,s,q}_{(k_2,j_2),v}\, dW_{(l),v}. $$

Note that by Lem. 22, E sup_{t∈[s,T]} |ξ^{1,x,s,q}_t|^{2m} is finite for any m ≥ 1. Then, proceeding as in the proof of Lem. 22, we have
$$ \begin{aligned} \mathbb{E} \sup_{t \in [s,T]} |\phi^{x,s,q}_t|^2 &\le C\, \mathbb{E} \sup_{t \in [s,T]} \big( |\nabla^2 b(X^{x,s,q}_t, q)|^2 + |\nabla^2 \sigma(X^{x,s,q}_t, q)|^2 \big)\, |\xi^{1,x,s,q}_t|^4 \\ &\le C \left[ \mathbb{E} \sup_{t \in [s,T]} \big( |\nabla^2 b(X^{x,s,q}_t, q)|^2 + |\nabla^2 \sigma(X^{x,s,q}_t, q)|^2 \big)^2 \right]^{1/2} \left[ \mathbb{E} \sup_{t \in [s,T]} |\xi^{1,x,s,q}_t|^8 \right]^{1/2}. \end{aligned} $$
Here, C is independent of q and s. From the above, using the assumption that b, σ ∈ G², and the moment estimate in Thm. 19 on X^{x,s,q}_t, we conclude that
$$ \sup_{q \in Q,\, s \in [0,T]} \mathbb{E} \sup_{t \in [s,T]} |\phi^{x,s,q}_t|^2 \le \kappa_1(1 + |x|^{2\kappa_2}); $$
thus (B.3) admits a unique solution by Thm. 18, and the solution ξ^{2,x,s,q}_t satisfies the same estimate. Moreover, the estimate above holds for any power 2m with m ≥ 1 by a similar calculation, which shows that E sup_{t∈[s,T]} |ξ^{2,x,s,q}_t|^{2m} ∈ G.

Finally, to show that ξ^{2,x,s,q}_t is the second derivative of X^{x,s,q}_t with respect to x, we proceed analogously to the proof of Lem. 22: thanks to the estimate (B.7) and the polynomial growth


conditions, all the estimates required for interchanging the derivative and the integral signs are satisfied, so the equation for ξ^{2,x,s,q}_t is obtained by formally differentiating under the integral sign with respect to x, which is precisely (B.3).

Lemma 24 For each α ≥ 1, suppose that b, σ ∈ G^{α+1}. Then the derivative ∇^{α+1} X^{x,s,q}_t exists and is the unique a.s. continuous solution of the linear random-coefficient SDE
$$ \begin{aligned} \xi^{\alpha+1,x,s,q}_{(i,J),t} ={}& \phi^{x,s,q}_{(i,J),t} + \int_s^t \partial_{(k)} b(X^{x,s,q}_v, q)_{(i)}\, \xi^{\alpha+1,x,s,q}_{(k,J),v}\, dv && \text{(B.8)} \\ &+ \int_s^t \partial_{(k)} \sigma(X^{x,s,q}_v, q)_{(i,l)}\, \xi^{\alpha+1,x,s,q}_{(k,J),v}\, dW_{(l),v}, && \text{(B.9)} \end{aligned} $$
where J is a multi-index of order α+1 and φ^{x,s,q}_t is an a.s. continuous stochastic process satisfying E sup_{t∈[s,T]} |φ^{x,s,q}_t|^{2m} ∈ G for all m ≥ 1. In fact, (B.8) is obtained by formally differentiating (B.2) under the integral sign α times. Moreover, we have E sup_{t∈[s,T]} |ξ^{α+1,x,s,q}_t|^{2m} ∈ G for all m ≥ 1.

Proof The proof is identical to the α = 1 case in Lem. 23; we omit writing out the whole proof here.

We now prove the following useful result, which imparts polynomial growth conditions onto expectation functionals.

Proposition 25 Let s ∈ [0, T] and g ∈ G^{α+1} for some α ≥ 1. For t ∈ [s, T], define
$$ u(x, s, q, t) := \mathbb{E}\, g(X^{x,s,q}_t). $$
Then, u(·, s, q, t) ∈ G^{α+1} uniformly in s, q, t.

Proof Consider first the case α = 1. We shall use the results in Lem. 22–24 to show that
$$ \partial_{(i)} u(x, s, q, t) = \mathbb{E}\, \partial_{(k)} g(X^{x,s,q}_t)\, \partial_{(i)} X^{x,s,q}_{(k),t} $$
and that ∂_{(i)} u(x, s, q, t) ∈ G. Let h_j be defined as in the proof of Lem. 22; we have
$$ \frac{u(x + h_j, s, q, t) - u(x, s, q, t)}{h} = \mathbb{E} \int_0^1 \tfrac{1}{h} \tfrac{d}{d\lambda}\, g\big(\lambda X^{x+h_j,s,q}_t + (1-\lambda) X^{x,s,q}_t\big)\, d\lambda = \mathbb{E} \int_0^1 \partial_{(k)} g\big(\lambda X^{x+h_j,s,q}_t + (1-\lambda) X^{x,s,q}_t\big)\, d\lambda \cdot \frac{X^{x+h_j,s,q}_{(k),t} - X^{x,s,q}_{(k),t}}{h}. $$
Now, \tfrac{1}{h}(X^{x+h_j,s,q}_t − X^{x,s,q}_t) → ∂_{(j)} X^{x,s,q}_t in L²(Ω). Moreover, set
$$ I_h := \int_0^1 \partial_{(k)} g\big(\lambda X^{x+h_j,s,q}_t + (1-\lambda) X^{x,s,q}_t\big)\, d\lambda. $$


Since ∇g is continuous, |I_h − ∂_{(k)} g(X^{x,s,q}_t)|² → 0 in probability. Moreover,
$$ \mathbb{E}\big| I_h - \partial_{(k)} g(X^{x,s,q}_t) \big|^4 < \infty $$
by the assumption that g ∈ G¹. Thus, {|I_h − ∂_{(k)} g(X^{x,s,q}_t)|² : h ∈ [0, 1]} is uniformly integrable, and so I_h → ∂_{(k)} g(X^{x,s,q}_t) in L²(Ω). We have thus arrived at
$$ \partial_{(i)} u(x, s, q, t) = \mathbb{E}\, \partial_{(k)} g(X^{x,s,q}_t)\, \partial_{(i)} X^{x,s,q}_{(k),t}, $$
and in particular,
$$ |\nabla u(x, s, q, t)|^2 \le \mathbb{E}|\nabla g(X^{x,s,q}_t)|^2\; \mathbb{E}|\nabla X^{x,s,q}_t|^2 \in G, $$
where we have used Thm. 19 and Lem. 22. The proof for higher-order derivatives follows accordingly by the above procedure, using Lem. 24.

Appendix C. Auxiliary results for the proof of Thm. 3

Lemma 26 Let α ≥ 1 and suppose b, σ satisfy Assumption B.1. Then, there exists a K ∈ G, independent of η and ε, such that
$$ \mathbb{E} \prod_{j=1}^{\alpha+1} \big| \Delta_{(i_j)}(x) \big| \le K(x)\, \eta^{\alpha+1}, $$
where i_j ∈ {1, . . . , d}.

Proof We have
$$ \begin{aligned} \mathbb{E}|\Delta(x)|^{\alpha+1} &\le 2^\alpha\, \mathbb{E}\left| \int_0^\eta b(X^{x,0}_s, \eta, \varepsilon)\, ds \right|^{\alpha+1} + 2^\alpha \eta^{\frac{\alpha+1}{2}}\, \mathbb{E}\left| \int_0^\eta \sigma(X^{x,0}_s, \eta, \varepsilon)\, dW_s \right|^{\alpha+1} \\ &\le 2^\alpha \eta^\alpha \int_0^\eta \mathbb{E}|b(X^{x,0}_s, \eta, \varepsilon)|^{\alpha+1}\, ds + 2^\alpha \eta^{\frac{\alpha+1}{2}}\, \mathbb{E}\left| \int_0^\eta \sigma(X^{x,0}_s, \eta, \varepsilon)\, dW_s \right|^{\alpha+1}. \end{aligned} $$
Using the Cauchy–Schwarz inequality and Itô's isometry, we get
$$ \mathbb{E}\left| \int_0^\eta \sigma(X^{x,0}_s, \eta, \varepsilon)\, dW_s \right|^{\alpha+1} \le \left( \mathbb{E}\left| \int_0^\eta \sigma(X^{x,0}_s, \eta, \varepsilon)\, dW_s \right|^{2\alpha+2} \right)^{1/2} \le C \eta^{\alpha/2} \left( \int_0^\eta \mathbb{E}|\sigma(X^{x,0}_s, \eta, \varepsilon)|^{2\alpha+2}\, ds \right)^{1/2}, $$
where C depends only on α. Now, using the linear growth condition (Assumption B.1, item 1) and the moment estimates in Thm. 19, we obtain the result.


Lemma 27 Suppose u ∈ G^{α+1} for some α ≥ 1, and let assumption (i) in Thm. 3 hold. Then, there exists some K ∈ G, independent of η, ε, such that
$$ \big| \mathbb{E} u(x^{x,0}_1) - \mathbb{E} u(X^{x,0}_1) \big| \le K(x)\big( \eta\rho(\varepsilon) + \eta^{\alpha+1} \big). $$
Proof Using Taylor's theorem with the Lagrange form of the remainder, we have
$$ \begin{aligned} u(x^{x,0}_1) - u(X^{x,0}_1) ={}& \sum_{s=1}^{\alpha} \frac{1}{s!} \sum_{i_1,\dots,i_s=1}^{d} \prod_{j=1}^{s} \big[ \Delta_{(i_j)}(x) - \hat\Delta_{(i_j)}(x) \big]\, \frac{\partial^s u}{\partial x_{(i_1)} \cdots \partial x_{(i_s)}}(x) \\ &+ \frac{1}{(\alpha+1)!} \sum_{i_1,\dots,i_{\alpha+1}=1}^{d} \prod_{j=1}^{\alpha+1} \big[ \Delta_{(i_j)}(x) - \hat\Delta_{(i_j)}(x) \big] \\ &\qquad \times \left[ \frac{\partial^{\alpha+1} u}{\partial x_{(i_1)} \cdots \partial x_{(i_{\alpha+1})}}\big(x + a\Delta(x)\big) - \frac{\partial^{\alpha+1} u}{\partial x_{(i_1)} \cdots \partial x_{(i_{\alpha+1})}}\big(x + \hat{a}\hat\Delta(x)\big) \right], \end{aligned} $$
where a, \hat{a} ∈ [0, 1]. Taking expectations and using assumption (i) of Thm. 3 and Lem. 26, we get
$$ |\mathbb{E} u(x^{x,0}_1) - \mathbb{E} u(X^{x,0}_1)| \le K(x)\big( \eta\rho(\varepsilon) + \eta^{\alpha+1} \big). $$

Appendix D. Auxiliary results for the proof of Thm. 9

Set in (4.2)
$$ b(x, \eta, \varepsilon) = b_0(x, \varepsilon) + \eta\, b_1(x, \varepsilon), \qquad \sigma(x, \eta, \varepsilon) = \sigma_0(x, \varepsilon). $$
We prove the following Itô–Taylor expansion.

Lemma 28 Let ψ : R^d → R be a sufficiently smooth function and define the operators
$$ \begin{aligned} A_{\varepsilon,0}\psi(x) &:= b_0(x,\varepsilon)_{(i)}\, \partial_{(i)} \psi(x), \\ A_{\varepsilon,1}\psi(x) &:= b_1(x,\varepsilon)_{(i)}\, \partial_{(i)} \psi(x) + \tfrac{1}{2}\sigma_0(x,\varepsilon)_{(i,k)}\, \sigma_0(x,\varepsilon)_{(j,k)}\, \partial^2_{(i,j)} \psi(x), \\ [\Lambda_{\varepsilon,0}\psi(x)]_{(l)} &:= \sigma_0(x,\varepsilon)_{(i,l)}\, \partial_{(i)} \psi(x), \qquad l = 1, \dots, d. \end{aligned} $$
Suppose further that b_0, b_1, σ_0 ∈ G³. Then, we have
$$ \mathbb{E}\psi(X^{x,0}_\eta) = \psi(x) + \eta\, A_{\varepsilon,0}\psi(x) + \eta^2\big( \tfrac{1}{2} A_{\varepsilon,0}^2 + A_{\varepsilon,1} \big)\psi(x) + O(\eta^3). $$

Proof Using Itô's formula, we have
$$ \psi(X^{x,0}_\eta) = \psi(x) + \int_0^\eta A_{\varepsilon,0}\psi(X^{x,0}_s)\, ds + \eta \int_0^\eta A_{\varepsilon,1}\psi(X^{x,0}_s)\, ds + \sqrt{\eta} \int_0^\eta \Lambda_{\varepsilon,0}\psi(X^{x,0}_s)\, dW_s. $$


By further application of the above formula to A_{ε,0}ψ and A_{ε,1}ψ, we have
$$ \begin{aligned} \psi(X^{x,0}_\eta) ={}& \psi(x) + \eta\, A_{\varepsilon,0}\psi(x) + \eta^2\big( \tfrac{1}{2} A_{\varepsilon,0}^2 + A_{\varepsilon,1} \big)\psi(x) \\ &+ \eta \int_0^\eta \int_0^s (A_{\varepsilon,1} A_{\varepsilon,0} + A_{\varepsilon,0} A_{\varepsilon,1}) \psi(X^{x,0}_v)\, dv\, ds + \int_0^\eta \int_0^s \int_0^v A_{\varepsilon,0}^3 \psi(X^{x,0}_r)\, dr\, dv\, ds \\ &+ \eta^2 \int_0^\eta \int_0^s A_{\varepsilon,1}^2 \psi(X^{x,0}_v)\, dv\, ds + \eta \int_0^\eta \int_0^s \int_0^v A_{\varepsilon,1} A_{\varepsilon,0}^2 \psi(X^{x,0}_r)\, dr\, dv\, ds \\ &+ \sqrt{\eta} \int_0^\eta \Lambda_{\varepsilon,0}\psi(X^{x,0}_s)\, dW_s + \sqrt{\eta} \int_0^\eta \int_0^s \Lambda_{\varepsilon,0} A_{\varepsilon,0} \psi(X^{x,0}_v)\, dW_v\, ds \\ &+ \sqrt{\eta} \int_0^\eta \int_0^s \int_0^v \Lambda_{\varepsilon,0} A_{\varepsilon,0}^2 \psi(X^{x,0}_r)\, dW_r\, dv\, ds + \eta^{3/2} \int_0^\eta \int_0^s \Lambda_{\varepsilon,0} A_{\varepsilon,1} \psi(X^{x,0}_v)\, dW_v\, ds. \end{aligned} $$

Taking expectations of the above, it remains to show that each of the integral terms either vanishes or is O(η³). This follows immediately from the assumption that b_0, b_1, σ_0 ∈ G³ and ψ ∈ G⁴. Indeed, observe that all the integrands involve at most 3 derivatives of b_0, b_1, σ_0 and 4 derivatives of ψ, which by our assumptions all belong to G. Thus, the expectation of each integrand is bounded by κ_1(1 + sup_{t∈[0,η]} E|X^{x,0}_t|^{2κ_2}) for some κ_1, κ_2, which by Thm. 19 must be finite. Thus, the stochastic integrals have vanishing expectation, and the expectations of the other integrals are O(η³) by the polynomial growth assumption and the moment estimates in Thm. 19.

We also prove a general moment estimate for the generalized SGA iterations (4.1).

Lemma 29 Let {x_k : k ≥ 0} be the generalized SGA iterations defined in (4.1). Suppose
$$ |h(x, \gamma, \eta)| \le L_\gamma (1 + |x|) $$
for some random variable L_γ > 0 a.s. with E L_γ^m < ∞ for all m ≥ 1. Then, for fixed T > 0 and any m ≥ 1, E|x_k|^m exists and is uniformly bounded in η and k = 0, . . . , N ≡ ⌊T/η⌋.

Proof For each k ≥ 0, we have
$$ |x_{k+1}|^m \le |x_k|^m + \sum_{l=1}^{m} \binom{m}{l} |x_k|^{m-l}\, \eta^l\, |h(x_k, \gamma_k, \eta)|^{l}. $$
Now, for 1 ≤ l ≤ m,
$$ \mathbb{E}\big[ |x_k|^{m-l} |h(x_k, \gamma_k, \eta)|^l \big] = \mathbb{E}\big[ |x_k|^{m-l}\, \mathbb{E}\big( |h(x_k, \gamma_k, \eta)|^l \,\big|\, x_k \big) \big] \le 2^{l-1}\, \mathbb{E}(L_\gamma^l)\, \mathbb{E}\big[ |x_k|^{m-l}(1 + |x_k|^l) \big] \le 2^{l}\, \mathbb{E}(L_\gamma^l)\big( 1 + \mathbb{E}|x_k|^m \big). $$


Hence, if we let a_k := E|x_k|^m, we have
$$ a_{k+1} \le (1 + C\eta)\, a_k + C'\eta, $$
where C, C' > 0 are independent of η and k, which immediately implies
$$ a_k \le (a_0 + C'/C)(1 + C\eta)^k - C'/C \le (|x_0|^m + C'/C)\, e^{(T/\eta)\log(1 + C\eta)} - C'/C \le (|x_0|^m + C'/C)\, e^{CT} - C'/C. $$

We also need the following result concerning mollified functions.

Lemma 30 Let ε ∈ (0, 1) and let ψ be continuous with its weak derivative Dψ belonging to G_w. Denote by ψ_ε = ν_ε ∗ ψ the mollification of ψ. Then, there exists a K ∈ G, independent of ε, such that
$$ |\psi_\varepsilon(x) - \psi(x)| \le \varepsilon K(x). $$

Proof We have, for almost every x,
$$ \begin{aligned} |\psi_\varepsilon(x) - \psi(x)| &\le \int_{B(0,\varepsilon)} \nu_\varepsilon(y)\, |\psi(x - y) - \psi(x)|\, dy = \int_{B(0,\varepsilon)} \nu_\varepsilon(y) \left| \int_0^1 D\psi(x - \lambda y) \cdot y\, d\lambda \right| dy \\ &\le \varepsilon \int_{B(0,\varepsilon)} \int_0^1 \nu_\varepsilon(y)\, |D\psi(x - \lambda y)|\, d\lambda\, dy \le \varepsilon \int_{B(0,\varepsilon)} \nu_\varepsilon(y)\, \kappa_1\big[1 + \kappa_2(|x| + |y|)\big]\, dy \le \varepsilon K(x). \end{aligned} $$
Since ψ is continuous, the above inequality holds for all x ∈ R^d.

Appendix E. Auxiliary results for computations in Sec. 5

Lemma 31 Let A be a real square matrix such that all eigenvalues have positive real part. Then,

(i) For each ε > 0, there exists a constant C_ε > 0, independent of t but depending on ε, such that
$$ |e^{-tA}| \le C_\varepsilon\, e^{-t\left( \min_i \mathrm{Re}\,\lambda_i(A) - \varepsilon \right)}. $$


(ii) If in addition A is diagonalizable, then there exists a constant C > 0, independent of t, such that
$$ |e^{-tA}| \le C\, e^{-t \min_i \mathrm{Re}\,\lambda_i(A)}. $$
Proof (i) We know that A is similar to a Jordan matrix J, so that e^{-At} = P e^{-Jt} P^{-1}. Hence, |e^{-At}| ≤ |P||P^{-1}||e^{-Jt}|. For each Jordan block J_k, we have J_k = λ_k I + N_k, where N_k is nilpotent (N_k^k = 0). Hence,
$$ e^{-J_k t} = e^{-\lambda_k I t}\, e^{-N_k t} = e^{-\lambda_k t} \sum_{m=0}^{k-1} \frac{N_k^m}{m!} (-t)^m = e^{-(\lambda_k - \varepsilon)t} \left[ \sum_{m=0}^{k-1} \frac{N_k^m}{m!} (-t)^m e^{-\varepsilon t} \right]. $$
For each ε > 0, the norm of the bracketed term is uniformly bounded in t, and hence we obtain the result.

(ii) We denote the similarity transformation A = PDP^{-1}, where D is the diagonal matrix of eigenvalues of A. Defining Q := P^†P († denotes conjugate transpose), we have
$$ |e^{-tA}|^2 = \mathrm{Tr}\big( e^{-tA^T} e^{-tA} \big) = \mathrm{Tr}\big( Q^{-1} e^{-tD^\dagger} Q\, e^{-tD} \big) \le d\, |Q^{-1}|\, |Q|\, e^{-2t \min_i \mathrm{Re}\,\lambda_i(A)}, $$
since every entry of the diagonal matrices e^{-tD} and e^{-tD^†} is bounded in modulus by e^{-t min_i Re λ_i(A)}. Taking square roots gives the claim with C = (d |Q| |Q^{-1}|)^{1/2}.

References

Jing An, Jianfeng Lu, and Lexing Ying. Stochastic modified equations for the asynchronous stochastic gradient descent. arXiv preprint arXiv:1805.08244, 2018.

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.

Michael Betancourt, Michael I Jordan, and Ashia C Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.

Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973.

Bart J Daly. The stability properties of a coupled pair of non-linear partial difference equations. Mathematics of Computation, 17(84):346–360, 1963.

Alexandre Défossez and Francis Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, pages 205–213, 2015.

Rick Durrett. Probability: theory and examples. Cambridge University Press, 2010.


Lawrence C Evans. Partial differential equations. 2010.

Yuanyuan Feng, Lei Li, and Jian-Guo Liu. A note on semi-groups of stochastic gradient descent and online principal component analysis. arXiv preprint arXiv:1712.06509, 2017.

CW Hirt. Heuristic stability theory for finite-difference equations. Journal of Computational Physics, 2(4):339–355, 1968.

Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.

Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer, New York, corrected edition, June 2011.

Walid Krichene and Peter L Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796–6806, 2017.

Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

Harold J Kushner. Rates of convergence for sequential Monte Carlo optimization methods. SIAM Journal on Control and Optimization, 16(1):150–168, 1978.

Harold J Kushner and Adam Shwartz. An invariant measure approach to the convergence of stochastic approximations with state dependent noise. SIAM Journal on Control and Optimization, 22(1):13–27, 1984.

Harold Joseph Kushner and Dean S Clark. Stochastic approximation methods for constrained and unconstrained systems, volume 26. Springer Science & Business Media, 2012.

Qianxiao Li, Cheng Tai, and Weinan E. Dynamics of stochastic gradient algorithms. arXiv preprint arXiv:1511.06251v1, 2015.

Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110, 2017.

Lennart Ljung, Georg Ch Pflug, and Harro Walk. Stochastic approximation and optimization of random systems, volume 17. Birkhäuser, 2012.

Stephan Mandt, Matthew D Hoffman, and David M Blei. Continuous-time limit of stochastic gradient descent revisited. In OPT workshop, NIPS, 2015.

Stephan Mandt, Matthew D Hoffman, and David M Blei. A variational analysis of stochastic gradient algorithms. arXiv preprint arXiv:1602.02666, 2016.

Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.


Panayotis Mertikopoulos and Mathias Staudigl. Convergence to Nash equilibrium in continuous games with noisy first-order feedback. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 5609–5614. IEEE, 2017.

Grigori N Milstein. Approximate integration of stochastic differential equations. Theory of Probability & Its Applications, 19(3):557–562, 1975.

Grigori N Milstein. Weak approximation of solutions of systems of stochastic differential equations. Theory of Probability & Its Applications, 30(4):750–766, 1986.

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.

WF Noh and MH Protter. Difference methods and the equations of hydrodynamics. Technical report, California Univ., Livermore, Lawrence Radiation Lab., 1960.

Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.

Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6793–6800. IEEE, 2012.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1–41, 2014.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.

Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

RF Warming and BJ Hyett. The modified equation approach to the stability and accuracy analysis of finite-difference methods. Journal of Computational Physics, 14(2):159–179, 1974.


Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
