Exactly Sparse Gaussian Variational Inference with Application to Derivative-Free Batch Nonlinear State Estimation

© The Author(s) 2020

Timothy D. Barfoot1 and James R. Forbes2 and David J. Yoon1

Abstract
We present a Gaussian Variational Inference (GVI) technique that can be applied to large-scale nonlinear batch state estimation problems. The main contribution is to show how to fit both the mean and (inverse) covariance of a Gaussian to the posterior efficiently, by exploiting factorization of the joint likelihood of the state and data, as is common in practical problems. This is different than Maximum A Posteriori (MAP) estimation, which seeks the point estimate for the state that maximizes the posterior (i.e., the mode). The proposed Exactly Sparse Gaussian Variational Inference (ESGVI) technique stores the inverse covariance matrix, which is typically very sparse (e.g., block-tridiagonal for classic state estimation). We show that the only blocks of the (dense) covariance matrix that are required during the calculations correspond to the non-zero blocks of the inverse covariance matrix, and further show how to calculate these blocks efficiently in the general GVI problem. ESGVI operates iteratively, and while we can use analytical derivatives at each iteration, Gaussian cubature can be substituted, thereby producing an efficient derivative-free batch formulation. ESGVI simplifies to precisely the Rauch-Tung-Striebel (RTS) smoother in the batch linear estimation case, but goes beyond the 'extended' RTS smoother in the nonlinear case since it finds the best-fit Gaussian (mean and covariance), not the MAP point estimate. We demonstrate the technique on controlled simulation problems and a batch nonlinear Simultaneous Localization and Mapping (SLAM) problem with an experimental dataset.

Keywords
Gaussian variational inference, exact sparsity, derivative-free state estimation

1 Introduction

Gauss pioneered the method of least squares out of necessity to predict the position of the dwarf planet Ceres after passing behind the Sun. In his initial treatment of the subject (Gauss 1809), he presented what we would consider a 'likelihood' function, which was expressed as an exponential function of quadratic terms,

$$ L(x) = \exp\left(-\frac{1}{2}(x - z)^T W^{-1} (x - z)\right), \tag{1} $$

where x is the state to be estimated, z are measurements, and W^{-1} is a weighting matrix. Gauss recognized that L(x) is maximized when (x - z)^T W^{-1} (x - z) is minimized, leading to the weighted least-squares solution. He later proved that the least-squares estimate is optimal without any assumptions regarding the distribution of the errors (Gauss 1821, 1823), and his more general result was rediscovered by Markoff (1912), leading to the more commonly known Gauss-Markov theorem (Bjorck 1996).

If we adopt a Bayesian perspective (Bayes 1764), our goal is to compute the full posterior, p(x|z), not just a point estimate, by refining a prior, p(x), based on some measurements, z:

$$ p(x|z) = \frac{p(z|x)\,p(x)}{p(z)} = \frac{p(x, z)}{p(z)}. \tag{2} $$

The full posterior is not a Gaussian Probability Density Function (PDF) for nonlinear measurement models, p(z|x). We are therefore often satisfied with finding the maximum of the Bayesian posterior, which is called the Maximum A Posteriori (MAP) approach.

1 Institute for Aerospace Studies, University of Toronto
2 Department of Mechanical Engineering, McGill University

Corresponding author:
Timothy D. Barfoot, Institute for Aerospace Studies, University of Toronto, Toronto, Ontario, M3H 5T6, Canada.
Email: [email protected]


arXiv:1911.08333v2 [cs.RO] 9 Apr 2020


The connection to least squares (for Gaussian noise) is seen by taking the negative logarithm of the likelihood function (and dropping constant terms), resulting in a nonlinear quadratic loss function that is minimized:

$$ V(x) = -\ln p(x, z) = \frac{1}{2}\, e(x, z)^T W^{-1}\, e(x, z), \qquad x^{\star} = \arg\min_x V(x), \tag{3} $$

where e(·, ·) is the error and is a nonlinear function of the state, x, and measurements, z. The result is a point solution, the most likely x given z. A Bayesian prior can easily be included in the loss function and thus we refer to this problem as MAP rather than Maximum Likelihood (ML) (no prior). Although there are various methods for minimizing the above loss function, perhaps the most well known dates back, again, to Gauss, who described how nonlinear least-squares problems can be linearized and refined in an iterative process (Abdulle and Wanner 2002), a method that is now known as Gauss-Newton (GN), or the method of differential corrections (Ortega and Rheinboldt 1970). To this day, MAP is the dominant approach employed for batch nonlinear estimation problems.

Rather than finding the maximum of the Bayesian posterior, our approach in this paper will be to find the Gaussian approximation, in terms of the mean and (inverse) covariance, that is 'closest' to the full posterior in terms of the Kullback-Leibler (KL) divergence between the two (Kullback and Leibler 1951). This approach is referred to as variational inference or variational Bayes (Bishop 2006). As we will restrict ourselves to Gaussian approximations of the posterior, we will refer to this as Gaussian variational inference (GVI). While GVI is not new, it is not commonly used in batch estimation problems, where the state size, N, can be very large. Our main contribution in this paper is to show how to make GVI tractable for large-scale estimation problems. Specifically, we will show how to exploit a joint likelihood for the state and measurements that can be factored,

$$ p(x, z) = \prod_{k=1}^{K} p(x_k, z_k), \tag{4} $$

where x_k is a subset of the variables in x. This type of factorization is very common in real-world robotics problems, for example, since each measurement typically only involves a small subset of the state variables, and this is already exploited in the MAP approach (Brown 1958; Thrun et al. 2004; Walter et al. 2007) for efficient solutions. We extend this exploitation to the GVI approach by identifying that the inverse covariance matrix is exactly sparse when the likelihood factors and, most importantly, that we never actually need to compute the entire covariance matrix, which is typically dense and of size N × N.

As a by-product of our approach, we also show how to use cubature points (e.g., sigmapoints) for some of the required calculations, resulting in an efficient derivative-free implementation for large-scale batch estimation.

The paper is organized as follows. Section 2 reviews some related work. Section 3 sets up our GVI approach in terms of the KL functional that we seek to minimize. It then derives a Newton-style iterative optimizer to calculate the mean and (inverse) covariance of the Gaussian approximation. Section 4 shows how we can exploit a factored likelihood, not only by showing that the inverse covariance is exactly sparse (as it is in the MAP formulation) but also by showing that we only ever require the blocks of the covariance matrix corresponding to the non-zero blocks of the inverse covariance. It also summarizes an existing method for calculating these required blocks of the covariance and shows how we can make use of sample-based methods to avoid the need to calculate derivatives of our models. Section 5 presents an alternate formulation of the variational approach that is more approximate but also more efficient, and also shows how we can fold parameter estimation into the framework while still exploiting sparsity. Section 6 provides some toy problems and a real-data robotics demonstration of the method. Finally, Section 7 provides our conclusion and suggestions for future work.

2 Related Work

Gaussian estimation has been a key tool employed in fields such as robotics, computer vision, aerospace, and more. The famous Kalman Filter (KF) (Kalman 1960), for example, provides a recursive formula to propagate a Gaussian state estimate. While the KF only goes forward in time, the Rauch-Tung-Striebel (RTS) smoother (Rauch et al. 1965) carries out forward and backward passes to efficiently estimate the state and can be shown to be carrying out full Bayesian inference for linear models (Barfoot 2017). Sarkka (2013) provides a wonderful presentation of recursive Bayesian inference methods, for both linear and nonlinear models. In computer vision and robotics, the important Bundle Adjustment (BA) (Brown 1958) / Simultaneous Localization and Mapping (SLAM) (Durrant-Whyte and Bailey 2006) problem is often cast as a batch Gaussian estimation problem (Triggs et al. 2000; Lu and Milios 1997; Thrun and Montemerlo 2005), with more advanced solution methods required than simple forward/backward passes (Kaess et al. 2008, 2011).

While the recursive methods are fundamentally important, here we concern ourselves with problems that require batch Gaussian inference. In robotics, some canonical problems are batch trajectory estimation, pose-graph relaxation (Bourmaud 2016), and BA/SLAM. However, we can also pose control/planning (Dong et al. 2016; Mukadam et al. 2018), calibration (Pradeep et al. 2014), and three-dimensional modelling problems (Li et al. 2011) as Gaussian inference, such that the number of commonplace applications is quite large.


Despite the widespread need for this tool, almost without exception we rely on MAP estimation to 'fit' a Gaussian, which is to say we find the most likely state in the Bayesian posterior and call this the 'mean', then fit a Gaussian centered at the most likely state, which is referred to as the Laplace approximation (Bishop 2006, p. 315). For linear models, this in fact does produce the exact Gaussian posterior. For nonlinear models, however, the posterior is not Gaussian and then the Laplace approximation is a convenient approach that can be computed efficiently for large-scale problems. The primary goal of this paper is to revisit the batch Gaussian inference problem in search of improvements over this popular method.

Within recursive estimation, attempts have been made to go beyond MAP, in order to perform better on nonlinear problems. The Bayes filter (Jazwinski 1970) is a general method that can be approximated in many different ways, including through the use of Monte Carlo integration (Thrun et al. 2006) or the use of cubature rules (e.g., sigmapoints) (Julier and Uhlmann 1996; Sarkka 2013). These sample-based extensions also bring the convenience of not requiring analytical derivatives of nonlinear models to be calculated. Thus, a secondary goal of the paper is to find principled ways of incorporating sample-based techniques within batch Gaussian inference.

As we will see, the starting point for our paper will be a variational Bayes setup (Bishop 2006). We aim to find the Gaussian approximation that is closest to the full Bayesian posterior in terms of the KL divergence between the two (Kullback and Leibler 1951). This is a paradigm shift from the MAP approach, where the only parameter to be optimized is the 'mean', while the Laplace-style covariance is computed post hoc. In GVI, we seek to find the best mean and covariance from the outset. The challenge is how to do this efficiently for problems with a large state size; if the mean is size N, then the covariance will be N × N, which for real-world problems could be prohibitively expensive. However, as we will show, we can carry out full GVI by exploiting the same problem structures we usually do in the MAP approach. This will come at the expense of some increased computational cost, but the computational complexity as a function of N does not increase. Ranganathan et al. (2007) and recently Davison and Ortiz (2019) discuss the use of loopy belief propagation to carry out large-scale Gaussian inference for robotics problems; our motivation is somewhat different in that we seek to improve on MAP, whereas these works investigate parallelization of the computations and further approximate the Gaussian variational estimate.

While this result is new in robotics, Opper and Archambeau (2009) discuss a similar GVI approach in machine learning. They begin with the same KL divergence and show how to calculate the derivatives of this functional with respect to the Gaussian parameters. They go on to apply the method to Gaussian process regression problems (Rasmussen and Williams 2006), of which batch trajectory estimation can be viewed as a special case (Barfoot et al. 2014; Anderson et al. 2015). Our paper extends this work in several significant ways, including (i) generalizing to any GVI problem where the likelihood can be factored, (ii) devising a Newton-style iterative solver for both mean and inverse covariance, (iii) explicitly showing how to exploit problem-specific structure in the case of a factored likelihood to make the technique efficient, (iv) applying Gaussian cubature to avoid the need to calculate derivatives, and (v) demonstrating the approach on problems of interest in robotics.

Kokkala et al. (2014, 2016), Ala-Luhtala et al. (2015), García-Fernández et al. (2015), Gasperin and Juricic (2011), and Schon et al. (2011) discuss a very similar approach to our GVI scheme in the context of nonlinear smoothers and filters; some of these works also carry out parameter estimation of the motion and observation models, which we also discuss as it fits neatly into the variational approach (Neal and Hinton 1998; Ghahramani and Roweis 1999). These works start from the same KL divergence, show how to exploit factorization of the joint likelihood, and discuss how to apply sigmapoints (Kokkala et al. 2014, 2016; Gasperin and Juricic 2011) or particles (Schon et al. 2011) to avoid the need to compute derivatives. García-Fernández et al. (2015) is a filtering paper that follows a similar philosophy to the current paper by statistically linearizing about an iteratively improved posterior. Our paper extends these works by (i) generalizing to any large-scale batch GVI problem where the likelihood can be factored (not restricted to smoothers with block-tridiagonal inverse covariance), (ii) devising a Newton-style iterative solver for both mean and inverse covariance, (iii) explicitly showing how to exploit problem-specific structure in the case of a factored likelihood to make the technique efficient, and (iv) demonstrating the approach on problems of interest in robotics.

There have been a few additional approaches to applying sample-based techniques to batch estimation; however, they are quite different from ours. Park et al. (2009) and Roh et al. (2007) present a batch estimator that uses the sigmapoint Kalman filter framework as the optimization method. For each major iteration, they compute sigmapoints for the estimated mean and propagate them through the motion model over all time steps. Then, the measurements from all of these propagated sigmapoints are stacked in a large column vector and the standard sigmapoint measurement update is applied.


Although this method has been reported to work well (Park et al. 2009; Roh et al. 2007), it is expensive because it requires constructing the full covariance matrix to obtain the Kalman gain in the measurement update step. In this paper, we work with the inverse covariance matrix and show how to avoid ever constructing the full covariance matrix, which opens the door to use on large-scale estimation problems.

3 Gaussian Variational Inference

This section poses the problem we are going to solve and proposes a general solution. Exploiting application-specific structure is discussed later, in Section 4. We first define the loss functional that we seek to minimize, then derive an optimization scheme in order to minimize it with respect to the parameters of a Gaussian. As an aside, we show that our optimization scheme is equivalent to so-called Natural Gradient Descent (NGD). Following this, we work our optimization scheme into a different form in preparation for exploiting application-specific structure and finally show that we can recover the classic RTS smoother in the linear case.

3.1 Loss Functional

As is common in variational inference (Bishop 2006), we seek to minimize the KL divergence (Kullback and Leibler 1951) between the true Bayesian posterior, p(x|z), and an approximation of the posterior, q(x), which in our case will be a multivariate Gaussian PDF,

$$ q(x) = \mathcal{N}(\mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \tag{5} $$

where | · | is the determinant. For practical robotics and computer vision problems, the dimension of the state, N, can become very large, and so the main point of our paper is to show how to carry out GVI in an efficient manner for large-scale problems*.

As KL divergence is not symmetrical, we have a choice of using KL(p||q) or KL(q||p). Bishop (2006, p. 467) provides a good discussion of the differences between these two functionals. The former expression is given by

$$ \mathrm{KL}(p\,\|\,q) = -\int_{-\infty}^{\infty} p(x|z)\,\ln\!\left(\frac{q(x)}{p(x|z)}\right) dx = \mathbb{E}_p[\ln p(x|z) - \ln q(x)], \tag{6} $$

while the latter is

$$ \mathrm{KL}(q\,\|\,p) = -\int_{-\infty}^{\infty} q(x)\,\ln\!\left(\frac{p(x|z)}{q(x)}\right) dx = \mathbb{E}_q[\ln q(x) - \ln p(x|z)], \tag{7} $$

where x ∈ R^N is the latent state that we seek to infer from data, z ∈ R^D, and E[·] is the expectation operator. The key practical difference that leads us to choose KL(q||p) is that the expectation is over our Gaussian estimate, q(x), rather than the true posterior, p(x|z). We will show that we can use this fact to devise an efficient iterative scheme for q(x) that best approximates the posterior. Moreover, our choice of KL(q||p) leads naturally to also estimating parameters of the system (Neal and Hinton 1998), which we discuss in Section 5.2. Ranganathan et al. (2007) and Davison and Ortiz (2019) discuss approximate Gaussian inference for robotics problems using loopy belief propagation, which is based on KL(p||q) (Bishop 2006, p. 505); their emphasis is on parallelizing the computations whereas we are focused on improving the estimate over MAP.

We observe that our chosen KL divergence can be written as

$$ \mathrm{KL}(q\,\|\,p) = \mathbb{E}_q[-\ln p(x, z)] \;-\; \underbrace{\frac{1}{2}\ln\!\left((2\pi e)^N |\Sigma|\right)}_{\text{entropy}} \;+\; \underbrace{\ln p(z)}_{\text{constant}}, \tag{8} $$

where we have used the expression for the entropy, -∫ q(x) ln q(x) dx, of a Gaussian. Noticing that the final term is a constant (i.e., it does not depend on q(x)), we define the following loss functional that we seek to minimize with respect to q(x):

$$ V(q) = \mathbb{E}_q[\phi(x)] + \frac{1}{2}\ln\!\left(|\Sigma^{-1}|\right), \tag{9} $$

with φ(x) = -ln p(x, z). We deliberately switch from Σ (covariance matrix) to Σ^{-1} (inverse covariance matrix, also known as the information matrix or precision matrix) in (9), as the latter enjoys sparsity that the former does not; we will carry this forward and use µ and Σ^{-1} as a complete description of q(x). The first term in V(q) encourages the solution to match the data while the second penalizes it for being too certain; although we did not experiment with this, a relative weighting (i.e., a metaparameter) between these two terms could be used to tune performance on other metrics of interest. It is also worth mentioning that V(q) is the negative of the so-called Evidence Lower Bound (ELBO), which we will consequently minimize.
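As a concrete illustration of this loss functional, the following minimal Python sketch (our own, not part of the original formulation) approximates V(q) for a user-supplied factor φ(x) by drawing Monte Carlo samples from q(x); the function name `loss_functional` and the quadratic toy factor are hypothetical choices for illustration only.

```python
import numpy as np

def loss_functional(phi, mu, Sigma_inv, num_samples=10000, seed=0):
    """Monte Carlo approximation of V(q) = E_q[phi(x)] + 0.5*ln|Sigma^{-1}|.

    phi       : callable mapping an N-vector to a scalar (negative log-likelihood).
    mu        : (N,) mean of the Gaussian q(x).
    Sigma_inv : (N, N) inverse covariance of q(x).
    """
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(Sigma_inv)          # fine for a small illustrative problem
    samples = rng.multivariate_normal(mu, Sigma, size=num_samples)
    expected_phi = np.mean([phi(x) for x in samples])
    # ln|Sigma^{-1}| via a Cholesky factor (positive definite here)
    log_det_Sinv = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(Sigma_inv))))
    return expected_phi + 0.5 * log_det_Sinv

# toy usage: a single quadratic factor phi(x) = 0.5*(x - z)^T W^{-1} (x - z)
z = np.array([1.0, -2.0])
W_inv = np.diag([4.0, 1.0])
phi = lambda x: 0.5 * (x - z) @ W_inv @ (x - z)
print(loss_functional(phi, mu=np.zeros(2), Sigma_inv=np.eye(2)))
```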

3.2 Optimization Scheme

Our next task is to define an optimization scheme to minimize the loss functional with respect to the mean, µ, and inverse covariance, Σ^{-1}. Our approach will be similar to a Newton-style optimizer.

*Note, we choose not to make a mean-field approximation, instead allowing all variables to be correlated.


After a bit of calculus, the derivatives of our loss functional, V(q), with respect to our Gaussian parameters, µ and Σ^{-1}, are given by (Opper and Archambeau 2009)

$$ \frac{\partial V(q)}{\partial \mu^T} = \Sigma^{-1}\,\mathbb{E}_q[(x - \mu)\,\phi(x)], \tag{10a} $$
$$ \frac{\partial^2 V(q)}{\partial \mu^T \partial \mu} = \Sigma^{-1}\,\mathbb{E}_q[(x - \mu)(x - \mu)^T \phi(x)]\,\Sigma^{-1} - \Sigma^{-1}\,\mathbb{E}_q[\phi(x)], \tag{10b} $$
$$ \frac{\partial V(q)}{\partial \Sigma^{-1}} = -\frac{1}{2}\,\mathbb{E}_q[(x - \mu)(x - \mu)^T \phi(x)] + \frac{1}{2}\,\Sigma\,\mathbb{E}_q[\phi(x)] + \frac{1}{2}\,\Sigma, \tag{10c} $$

where, comparing (10b) and (10c), we notice that

$$ \frac{\partial^2 V(q)}{\partial \mu^T \partial \mu} = \Sigma^{-1} - 2\,\Sigma^{-1}\,\frac{\partial V(q)}{\partial \Sigma^{-1}}\,\Sigma^{-1}. \tag{11} $$

This relationship is critical to defining our optimization scheme, which we do next.

To find extrema, we could attempt to set the first derivatives to zero, but it is not (in general) possible to isolate for µ and Σ^{-1} in closed form. Hence, we will define an iterative update scheme. We begin by writing out a Taylor series expansion of V(q) that is second order in δµ but only first order in δΣ^{-1} (second order would be difficult to calculate and the covariance quantities are already quadratic in x):

$$ V\!\left(q^{(i+1)}\right) \approx V\!\left(q^{(i)}\right) + \left(\left.\frac{\partial V(q)}{\partial \mu^T}\right|_{q^{(i)}}\right)^{\!T} \delta\mu + \frac{1}{2}\,\delta\mu^T \left(\left.\frac{\partial^2 V(q)}{\partial \mu^T \partial \mu}\right|_{q^{(i)}}\right) \delta\mu + \mathrm{tr}\!\left(\left.\frac{\partial V(q)}{\partial \Sigma^{-1}}\right|_{q^{(i)}} \delta\Sigma^{-1}\right), \tag{12} $$

where δµ = µ^{(i+1)} - µ^{(i)} and δΣ^{-1} = (Σ^{-1})^{(i+1)} - (Σ^{-1})^{(i)}, with i the iteration index of our scheme. We now want to choose δµ and δΣ^{-1} to force V(q) to get smaller. For the inverse covariance, Σ^{-1}, if we set the derivative, ∂V(q)/∂Σ^{-1}, to zero (for an extremum) in (11) we immediately have

$$ \left(\Sigma^{-1}\right)^{(i+1)} = \left.\frac{\partial^2 V(q)}{\partial \mu^T \partial \mu}\right|_{q^{(i)}}, \tag{13} $$

where we place an index of (i + 1) on the left and (i) on the right in order to define an iterative update. Inserting (11) again on the right, we see that the change to the inverse covariance by using this update can also be written as

$$ \delta\Sigma^{-1} = -2\left(\Sigma^{-1}\right)^{(i)} \left.\frac{\partial V(q)}{\partial \Sigma^{-1}}\right|_{q^{(i)}} \left(\Sigma^{-1}\right)^{(i)}. \tag{14} $$

Convergence of this scheme will be discussed below†. For the mean, µ, we will take inspiration from the MAP approach to Gaussian nonlinear batch estimation and employ a Newton-style update (Nocedal and Wright 2006). Since our loss approximation (12) is locally quadratic in δµ, we take the derivative with respect to δµ and set this to zero (to find the minimum). This results in a linear system of equations for δµ:

$$ \underbrace{\left(\left.\frac{\partial^2 V(q)}{\partial \mu^T \partial \mu}\right|_{q^{(i)}}\right)}_{\left(\Sigma^{-1}\right)^{(i+1)}} \delta\mu = -\left(\left.\frac{\partial V(q)}{\partial \mu^T}\right|_{q^{(i)}}\right), \tag{15} $$

where we note the convenient reappearance of Σ^{-1} as the left-hand side.

Inserting our chosen scheme for δµ and δΣ^{-1} into the loss approximation (12), we have

$$ V\!\left(q^{(i+1)}\right) - V\!\left(q^{(i)}\right) \approx -\frac{1}{2}\,\underbrace{\delta\mu^T \left(\Sigma^{-1}\right)^{(i+1)} \delta\mu}_{\geq 0,\ \text{with equality iff } \delta\mu = 0} \;-\; \frac{1}{2}\,\underbrace{\mathrm{tr}\!\left(\Sigma^{(i)}\,\delta\Sigma^{-1}\,\Sigma^{(i)}\,\delta\Sigma^{-1}\right)}_{\substack{\geq 0,\ \text{with equality iff } \delta\Sigma^{-1} = 0 \\ \text{(see Appendix A)}}} \;\leq\; 0, \tag{16} $$

which shows that we will reduce our loss, V(q), so long as δµ and δΣ^{-1} are not both zero; this is true when the derivatives with respect to µ and Σ^{-1} are not both zero, which occurs only at a local minimum of V(q). This is a local convergence guarantee only, as the expression is based on our Taylor series expansion in (12).

3.3 Natural Gradient Descent Interpretation

As an aside, we can interpret our update for δµ and δΣ^{-1} as carrying out so-called Natural Gradient Descent (NGD) (Amari 1998; Hoffman et al. 2013; Barfoot 2020), which exploits the information geometry to make the update more efficient than regular gradient descent.

†It is worth mentioning that (10c) ignores the fact that Σ^{-1} is actually a symmetric matrix. Magnus and Neudecker (2019) discuss how to calculate the derivative of a function with respect to a matrix while accounting for its symmetry. They show that if A is the derivative of a function (with respect to a symmetric matrix, B) that ignores symmetry, then A + A^T - A ◦ 1 is the derivative accounting for the symmetry of B, where ◦ is the Hadamard (element-wise) product and 1 is the identity matrix. It is not too difficult to see that A = 0 if and only if A + A^T - A ◦ 1 = 0 in the case of a symmetric A. Therefore, if we want to find an extremum by setting the derivative to zero, we can simply set A = 0 as long as A is symmetric and this will account for the symmetry of B correctly. Barfoot (2020) investigates this issue more thoroughly.


To see this, we stack our variational parameters into a single column, α, using the vec(·) operator, which converts a matrix to a vector by stacking its columns:

$$ \alpha = \begin{bmatrix} \mu \\ \mathrm{vec}\!\left(\Sigma^{-1}\right) \end{bmatrix}, \qquad \delta\alpha = \begin{bmatrix} \delta\mu \\ \mathrm{vec}\!\left(\delta\Sigma^{-1}\right) \end{bmatrix}, \qquad \frac{\partial V(q)}{\partial \alpha^T} = \begin{bmatrix} \frac{\partial V(q)}{\partial \mu^T} \\ \mathrm{vec}\!\left(\frac{\partial V(q)}{\partial \Sigma^{-1}}\right) \end{bmatrix}. \tag{17} $$

The last expression is the gradient of the loss functional with respect to α.

The NGD update scheme can then be defined as

$$ \delta\alpha = -\mathcal{I}_\alpha^{-1}\,\frac{\partial V(q)}{\partial \alpha^T}, \tag{18} $$

where I_α is the Fisher Information Matrix (FIM) (Fisher 1922) for the variational parameter, α, and its calculation can be found in Appendix B. Inserting the details of the components of the above, we have

$$ \begin{bmatrix} \delta\mu \\ \mathrm{vec}\!\left(\delta\Sigma^{-1}\right) \end{bmatrix} = -\begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & \frac{1}{2}\left(\Sigma \otimes \Sigma\right) \end{bmatrix}^{-1} \begin{bmatrix} \frac{\partial V(q)}{\partial \mu^T} \\ \mathrm{vec}\!\left(\frac{\partial V(q)}{\partial \Sigma^{-1}}\right) \end{bmatrix}, \tag{19} $$

where ⊗ is the Kronecker product. Extracting the individual updates, we see

$$ \delta\mu = -\Sigma\,\frac{\partial V(q)}{\partial \mu^T}, \tag{20a} $$
$$ \mathrm{vec}\!\left(\delta\Sigma^{-1}\right) = -2\left(\Sigma^{-1} \otimes \Sigma^{-1}\right) \mathrm{vec}\!\left(\frac{\partial V(q)}{\partial \Sigma^{-1}}\right). \tag{20b} $$

Finally, using that vec(ABC) ≡ (C^T ⊗ A) vec(B), we have

$$ \Sigma^{-1}\,\delta\mu = -\frac{\partial V(q)}{\partial \mu^T}, \tag{21a} $$
$$ \delta\Sigma^{-1} = -2\,\Sigma^{-1}\,\frac{\partial V(q)}{\partial \Sigma^{-1}}\,\Sigma^{-1}, \tag{21b} $$

which is the same set of updates as in the previous subsection.
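The Kronecker-product step used to arrive at (21) can be checked numerically. The short NumPy snippet below (our own illustration) verifies the identity vec(ABC) ≡ (C^T ⊗ A) vec(B) for random matrices; note the column-major (Fortran) ordering, which is what the vec(·) operator implies.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 5))

vec = lambda M: M.flatten(order='F')  # vec() stacks columns, hence Fortran order

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))          # True: vec(ABC) == (C^T kron A) vec(B)
```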

3.4 Stein's Lemma

While our iterative scheme could be implemented as is, it will be expensive (i.e., O(N^3) per iteration) for large problems. The next section will show how to exploit sparsity to make the scheme efficient and, in preparation for that, we will manipulate our update equations into a slightly different form using Stein's lemma (Stein 1981). In our notation, the lemma says

$$ \mathbb{E}_q[(x - \mu)\,f(x)] \equiv \Sigma\,\mathbb{E}_q\!\left[\frac{\partial f(x)}{\partial x^T}\right], \tag{22} $$

where q(x) = N(µ, Σ) is a Gaussian random variable and f(·) is any nonlinear differentiable function. A double application of Stein's lemma also reveals

$$ \mathbb{E}_q[(x - \mu)(x - \mu)^T f(x)] \equiv \Sigma\,\mathbb{E}_q\!\left[\frac{\partial^2 f(x)}{\partial x^T \partial x}\right]\Sigma + \Sigma\,\mathbb{E}_q[f(x)], \tag{23} $$

assuming f(·) is twice differentiable. Combining Stein's lemma with our loss derivatives in (10a), (10b), and (10c), we have the useful identities

$$ \frac{\partial}{\partial \mu^T}\,\mathbb{E}_q[f(x)] \equiv \mathbb{E}_q\!\left[\frac{\partial f(x)}{\partial x^T}\right], \tag{24a} $$
$$ \frac{\partial^2}{\partial \mu^T \partial \mu}\,\mathbb{E}_q[f(x)] \equiv \mathbb{E}_q\!\left[\frac{\partial^2 f(x)}{\partial x^T \partial x}\right] \equiv -2\,\Sigma^{-1}\left(\frac{\partial}{\partial \Sigma^{-1}}\,\mathbb{E}_q[f(x)]\right)\Sigma^{-1}, \tag{24b} $$

which we will have occasion to use later on; Appendix D provides the derivations.

We can apply Stein's lemma from (22) and (23) to our optimization scheme in (13) and (15) to write the iterative updates compactly as

$$ \left(\Sigma^{-1}\right)^{(i+1)} = \mathbb{E}_{q^{(i)}}\!\left[\frac{\partial^2}{\partial x^T \partial x}\,\phi(x)\right], \tag{25a} $$
$$ \left(\Sigma^{-1}\right)^{(i+1)} \delta\mu = -\mathbb{E}_{q^{(i)}}\!\left[\frac{\partial}{\partial x^T}\,\phi(x)\right], \tag{25b} $$
$$ \mu^{(i+1)} = \mu^{(i)} + \delta\mu. \tag{25c} $$

Ala-Luhtala et al. (2015, App. C) also make use of Stein's lemma in this way in the context of Gaussian variational smoothers. In general, this iterative scheme will still be expensive for large problems and so we will look to exploit structure to make GVI more efficient. As only the first and second derivatives of φ(x) are required, we can drop any constant terms (i.e., the normalization constant of p(x, z)).

Notably, our optimization scheme in (25) is identical to the MAP approach (with the Laplace covariance approximation) if we approximate the expectations using only the mean of q(x). Thus, MAP with Laplace can be viewed as an approximation of the more general approach we discuss in this paper.
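To make the iteration in (25) concrete, here is a minimal one-dimensional sketch (our own, not drawn from the paper's experiments) that applies (25a)-(25c) with Gauss-Hermite quadrature for the expectations. The convex toy factor, φ(x) = 0.5(x - x0)²/P + exp(x) - y·x, its analytical derivatives, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Hypothetical scalar example: Gaussian prior factor plus a convex "count"
# measurement factor with log-intensity x (Poisson-style, constants dropped):
#   phi(x) = 0.5*(x - x0)^2 / P + exp(x) - y*x
x0, P, y = 0.0, 1.0, 3.0
dphi  = lambda x: (x - x0)/P + np.exp(x) - y      # first derivative of phi
d2phi = lambda x: 1.0/P + np.exp(x)               # second derivative of phi

t, w = hermgauss(10)                              # nodes/weights for weight exp(-t^2)
def expect(f, mu, var):
    """Gauss-Hermite approximation of E_{N(mu,var)}[f(x)]."""
    return np.sum(w * f(mu + np.sqrt(2.0*var)*t)) / np.sqrt(np.pi)

mu, Sinv = x0, 1.0/P                              # initialize q(x) from the prior
for _ in range(10):
    var  = 1.0/Sinv                               # covariance of the current q^(i)
    Sinv = expect(d2phi, mu, var)                 # (25a)
    mu  += -expect(dphi, mu, var)/Sinv            # (25b)-(25c)

print(mu, 1.0/Sinv)  # variational mean/variance; mean differs from the MAP point
```

Because the expected Hessian E_q[exp(x)] exceeds exp(µ), the resulting variational mean sits slightly below the MAP estimate, illustrating how the full GVI solution differs from MAP-with-Laplace even in this tiny problem.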

3.5 Recovery of the RTS Smoother

Before moving on, we briefly show that our GVI formulation produces the discrete-time RTS smoother result in the linear case.


As is shown by Barfoot (2017, §3, p. 44), the batch linear state estimation problem can be written in lifted form (i.e., at the trajectory level):

$$ x = A(Bu + w), \tag{26a} $$
$$ y = Cx + n, \tag{26b} $$

where x is the entire trajectory (states over time), u are the control inputs, y are the sensor outputs, w ~ N(0, Q) is process noise, n ~ N(0, R) is measurement noise, A is the lifted transition matrix, B is the lifted control matrix, and C is the lifted observation matrix. We then have

$$ \phi(x) = \frac{1}{2}\left(Bu - A^{-1}x\right)^T Q^{-1} \left(Bu - A^{-1}x\right) + \frac{1}{2}\left(y - Cx\right)^T R^{-1} \left(y - Cx\right). \tag{27} $$

The expected derivatives can be calculated analytically for this linear problem:

$$ \mathbb{E}_q\!\left[\frac{\partial^2}{\partial x^T \partial x}\,\phi(x)\right] = A^{-T} Q^{-1} A^{-1} + C^T R^{-1} C, \tag{28a} $$
$$ \mathbb{E}_q\!\left[\frac{\partial}{\partial x^T}\,\phi(x)\right] = -A^{-T} Q^{-1}\left(Bu - A^{-1}\mu\right) - C^T R^{-1}\left(y - C\mu\right). \tag{28b} $$

At convergence, (28b) must be zero, so we have

$$ \Sigma^{-1} = \underbrace{A^{-T} Q^{-1} A^{-1} + C^T R^{-1} C}_{\text{block-tridiagonal}}, \tag{29a} $$
$$ \Sigma^{-1}\mu = A^{-T} Q^{-1} B u + C^T R^{-1} y, \tag{29b} $$

which can be solved efficiently for µ due to the block-tridiagonal nature of Σ^{-1}; from here, Barfoot (2017, §3, p. 55) shows the algebraic equivalence of this form to the canonical RTS smoother. Thus, our GVI approach still reproduces the classic linear result. However, we can also now address nonlinear problems more completely than the MAP case.
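A small sketch of this linear case, assuming a scalar random-walk motion model with zero inputs and direct measurements of each state (our simplification of (26), not a construction taken from the paper): it assembles the block-tridiagonal Σ^{-1} of (29a) with SciPy sparse matrices and solves (29b) for the smoothed mean.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Minimal sketch of (29a)-(29b) for a scalar random-walk trajectory:
#   x_k = x_{k-1} + w_k,  y_k = x_k + n_k,  k = 1..K (x_1 carries the prior).
K, q, r = 100, 0.1, 0.5                         # trajectory length, noise variances
rng = np.random.default_rng(0)
x_true = np.cumsum(np.sqrt(q)*rng.standard_normal(K))
y = x_true + np.sqrt(r)*rng.standard_normal(K)

A_inv = sp.eye(K, format='csc') - sp.eye(K, k=-1, format='csc')   # lifted A^{-1}
Q_inv = sp.eye(K, format='csc')/q
R_inv = sp.eye(K, format='csc')/r                                  # C = identity here

Sigma_inv = (A_inv.T @ Q_inv @ A_inv + R_inv).tocsc()   # (29a): tridiagonal
mu = spla.spsolve(Sigma_inv, R_inv @ y)                 # (29b) with u = 0
print(np.max(np.abs(mu - x_true)))                      # smoothed estimate vs. truth
```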

4 Exact Sparsity

This section shows how to exploit application-specific structure to make the optimization scheme of the previous section efficient for large-scale problems. We first show that when the joint likelihood of the state and data can be factored, the calculation of the required expectations in our optimization scheme is exactly sparse, meaning we only need the marginals of the covariance associated with each factor. We then discuss how we can calculate these marginals efficiently from the inverse covariance, for any GVI problem. Finally, we show how to use sigmapoints drawn from these marginals to implement the full optimization scheme.

4.1 Factored Joint Likelihood

We have seen in the previous section that the iterative update scheme relies on calculating three expectations:

$$ \underbrace{\mathbb{E}_q[\phi(x)]}_{\text{scalar}}, \qquad \underbrace{\mathbb{E}_q\!\left[\frac{\partial}{\partial x^T}\,\phi(x)\right]}_{\text{column}}, \qquad \underbrace{\mathbb{E}_q\!\left[\frac{\partial^2}{\partial x^T \partial x}\,\phi(x)\right]}_{\text{matrix}}, \tag{30} $$

where we drop the iteration index for now. Let us now assume that the joint state/data likelihood can be factored such that we can write its negative log-likelihood as

$$ \phi(x) = \sum_{k=1}^{K} \phi_k(x_k), \tag{31} $$

where φ_k(x_k) = -ln p(x_k, z_k) is the kth (negative log) factor expression, x_k is a subset of the variables in x associated with the kth factor, and z_k is a subset of the data in z associated with the kth factor.

Let us consider the first (scalar) expectation in (30). We can insert the factored likelihood and see what happens:

$$ \mathbb{E}_q[\phi(x)] = \mathbb{E}_q\!\left[\sum_{k=1}^{K} \phi_k(x_k)\right] = \sum_{k=1}^{K} \mathbb{E}_q[\phi_k(x_k)] = \sum_{k=1}^{K} \mathbb{E}_{q_k}[\phi_k(x_k)], \tag{32} $$

where the last step is subtle but paramount: the expectation simplifies from being over q = q(x), the full Gaussian estimate, to being over q_k = q_k(x_k), the marginal of the estimate for just the variables in each factor. This is not an approximation and the implications are many.

The other two expectations (column and matrix) in (30) enjoy similar simplifications and more, but require a bit more explanation. Let P_k be a projection matrix such that it extracts x_k from x:

$$ x_k = P_k\, x. \tag{33} $$

Then, inserting the factored expression into the second (column) expectation, we have

$$ \mathbb{E}_q\!\left[\frac{\partial}{\partial x^T}\,\phi(x)\right] = \mathbb{E}_q\!\left[\frac{\partial}{\partial x^T}\sum_{k=1}^{K}\phi_k(x_k)\right] = \sum_{k=1}^{K} \mathbb{E}_q\!\left[\frac{\partial}{\partial x^T}\,\phi_k(x_k)\right] = \sum_{k=1}^{K} P_k^T\, \mathbb{E}_q\!\left[\frac{\partial}{\partial x_k^T}\,\phi_k(x_k)\right] = \sum_{k=1}^{K} P_k^T\, \mathbb{E}_{q_k}\!\left[\frac{\partial}{\partial x_k^T}\,\phi_k(x_k)\right]. \tag{34} $$


For factor k, we are able to simplify the derivative from being with respect to x to being with respect to x_k, since there is no dependence on the variables not in x_k and hence the derivative with respect to those variables is zero; we use the projection matrix (as a dilation matrix) to map the derivative back into the appropriate rows of the overall result. After this, the expectation again simplifies to being with respect to q_k = q_k(x_k), the marginal of the estimate for just the variables in factor k. For the last (matrix) expectation we have a similar result:

$$ \mathbb{E}_q\!\left[\frac{\partial^2}{\partial x^T \partial x}\,\phi(x)\right] = \mathbb{E}_q\!\left[\frac{\partial^2}{\partial x^T \partial x}\sum_{k=1}^{K}\phi_k(x_k)\right] = \sum_{k=1}^{K}\mathbb{E}_q\!\left[\frac{\partial^2}{\partial x^T \partial x}\,\phi_k(x_k)\right] = \sum_{k=1}^{K} P_k^T\,\mathbb{E}_q\!\left[\frac{\partial^2}{\partial x_k^T \partial x_k}\,\phi_k(x_k)\right] P_k = \sum_{k=1}^{K} P_k^T\,\mathbb{E}_{q_k}\!\left[\frac{\partial^2}{\partial x_k^T \partial x_k}\,\phi_k(x_k)\right] P_k. \tag{35} $$

The simplified expectations in (32), (34), and (35) are the key tools that enable our ESGVI approach and we now make several remarks about them:

1. We do not require the full Gaussian estimate, q(x), to evaluate the three expectations involved in our iterative scheme, but rather we only require the marginals associated with each factor, q_k(x_k). This can represent a huge computational and storage savings in practical problems because it means that we never need to fully construct and store the (usually dense) covariance matrix, Σ. Schon et al. (2011), Gasperin and Juricic (2011), and Kokkala et al. (2016) also show how the required expectations are simplified to being over the marginals, specifically for the smoother problem, but here we have generalized that result to any factorization of the joint likelihood.

2. Looking to the covariance update in (25a) and now the simplification in (35), we know that Σ^{-1} will be exactly sparse (with the pattern depending on the nature of the factors) and that the sparsity pattern will remain constant as we iterate. A fixed sparsity pattern ensures that we can build a custom sparse solver for the mean (25b) and use it safely at each iteration; for example, in the batch state estimation problem, Σ^{-1} is block-tridiagonal (under a chronological variable ordering).

3. As a reminder, marginalization of a Gaussian amounts to projection, such that

$$ q_k(x_k) = \mathcal{N}(\mu_k, \Sigma_{kk}) = \mathcal{N}\!\left(P_k\mu,\; P_k \Sigma P_k^T\right), \tag{36} $$

so that it is just specific sub-blocks of the full covariance matrix that are ever required.

4. The only sub-blocks of Σ that we require are precisely the ones corresponding to the non-zero sub-blocks of Σ^{-1} (which is typically highly sparse). We can see this more plainly by writing

$$ \Sigma^{-1} = \sum_{k=1}^{K} P_k^T\,\mathbb{E}_{q_k}\!\left[\frac{\partial^2}{\partial x_k^T \partial x_k}\,\phi_k(x_k)\right] P_k, \tag{37} $$

where we can see that each factor uses some sub-blocks, Σ_{kk} = P_k Σ P_k^T, to evaluate the expectation, and then the results are inserted back into the same elements of Σ^{-1}.

5. It turns out that we can extract the required sub-blocks of Σ very efficiently. For example, for batch state estimation, with a block-tridiagonal Σ^{-1}, we can piggyback the calculation of the required blocks (i.e., the three main block diagonals of Σ) onto the solution for the mean in (25b) (Meurant 1992; Barfoot 2017) while keeping the complexity of the solver the same. However, we can also compute the required blocks of Σ efficiently in the general case (Takahashi et al. 1973), and the next section is devoted to discussion of this topic.

Some of these remarks may seem familiar to those used to working with a MAP approach to batch state estimation (e.g., the sparsity pattern of Σ^{-1} exists and is constant across iterations). But now we are performing GVI that iterates over a full Gaussian PDF (i.e., mean and covariance), not just a point estimate (i.e., mean only).

At this point, the only approximation that we have made is that our estimate of the posterior is Gaussian. However, to implement the scheme in practice, we need to choose a method to actually compute the (marginal) expectations in (32), (34), and (35). There are many choices, including linearization, Monte Carlo sampling, and also deterministic sampling. We will show how to use sampling methods in a later section.
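The assembly implied by (37) can be sketched as follows. The helper `assemble_precision` and the toy chain of binary factors are our own illustrative assumptions; the sketch simply shows how each factor's expected-Hessian block is scattered into the rows and columns of a sparse Σ^{-1} given its variable indices, with overlapping contributions summed.

```python
import numpy as np
import scipy.sparse as sp

def assemble_precision(N, factor_blocks):
    """Assemble Sigma^{-1} = sum_k P_k^T E_{q_k}[d^2 phi_k] P_k, as in (37).

    factor_blocks: list of (idx, H_k) pairs, where idx are the state indices
    touched by factor k and H_k is its (len(idx) x len(idx)) expected Hessian.
    """
    rows, cols, vals = [], [], []
    for idx, H_k in factor_blocks:
        idx = np.asarray(idx)
        r, c = np.meshgrid(idx, idx, indexing='ij')   # scatter pattern for this factor
        rows.append(r.ravel()); cols.append(c.ravel()); vals.append(H_k.ravel())
    return sp.coo_matrix((np.concatenate(vals),
                          (np.concatenate(rows), np.concatenate(cols))),
                         shape=(N, N)).tocsc()        # duplicate entries are summed

# toy usage: three chained binary factors over a 4-variable state -> tridiagonal
H = np.array([[2.0, -1.0], [-1.0, 2.0]])
Sigma_inv = assemble_precision(4, [([0, 1], H), ([1, 2], H), ([2, 3], H)])
print(Sigma_inv.toarray())
```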

4.2 Partial Computation of the Covariance

For completeness, we briefly summarize how it is possible to compute the blocks of Σ (typically dense) corresponding to the non-zero sub-blocks of Σ^{-1} (typically very sparse) in an efficient manner. This idea was first proposed by Takahashi et al. (1973) in the context of circuit theory and was later used by Broussolle (1978) in a state estimation context where the matrix of interest was a covariance matrix like ours.



Erisman and Tinney (1975) provide a proof of the closure of the Takahashi et al. procedure and also discuss algorithmic complexity. More recently, Triggs et al. (2000, App. B.4) and Kaess and Dellaert (2009) discuss methods to calculate specific blocks of the covariance matrix efficiently from the inverse covariance for computer vision and robotics applications, but do not discuss doing so for the complete set of covariance blocks corresponding to the non-zero blocks of the inverse covariance matrix.

At each iteration of our GVI approach, we are required to solve a system of linear equations for the change in the mean:

$$ \Sigma^{-1}\,\delta\mu = r, \tag{38} $$

where r is the right-hand side in (25b). We start by carrying out a sparse lower-diagonal-upper decomposition,

$$ \Sigma^{-1} = L D L^T, \tag{39} $$

where D is diagonal and L is lower-triangular with ones on the main diagonal (and sparse). The cost of this decomposition will depend on the nature of the prior and measurement factors. The key thing is that the sparsity pattern of L is a direct function of the factors' variable dependencies and can be determined in advance; more on this below. We can then solve the following two systems of equations for the change in the mean:

$$ (LD)\, v = r, \quad \text{(sparse forward substitution)} \tag{40} $$
$$ L^T \delta\mu = v. \quad \text{(sparse backward substitution)} \tag{41} $$

To solve for the required blocks of Σ, we notice that

$$ L D L^T \Sigma = \mathbf{1}, \tag{42} $$

where 1 is the identity matrix. We can premultiply by the inverse of LD to arrive at

$$ L^T \Sigma = D^{-1} L^{-1}, \tag{43} $$

where L^{-1} will in general no longer be sparse. Taking the transpose and adding Σ - ΣL to both sides, we have (Takahashi et al. 1973)

$$ \Sigma = L^{-T} D^{-1} + \Sigma\,(\mathbf{1} - L). \tag{44} $$

Since Σ is symmetric, we only require (at most) calculation of the main diagonal and the lower-half blocks and, as it turns out, this can also be done through a backward substitution pass. To see this, we expand the lower-half blocks as in (45), where we only show the blocks necessary for the calculation of the lower half of Σ; critically, L^{-T} is unnecessary since it only affects the upper-half blocks of Σ and is therefore dropped:

$$ \begin{bmatrix} \ddots & & & \\ \cdots & \Sigma_{K-2,K-2} & & \\ \cdots & \Sigma_{K-1,K-2} & \Sigma_{K-1,K-1} & \\ \cdots & \Sigma_{K,K-2} & \Sigma_{K,K-1} & \Sigma_{K,K} \end{bmatrix} = \begin{bmatrix} \ddots & & & \\ \cdots & D^{-1}_{K-2,K-2} & & \\ \cdots & 0 & D^{-1}_{K-1,K-1} & \\ \cdots & 0 & 0 & D^{-1}_{K,K} \end{bmatrix} - \begin{bmatrix} \ddots & \vdots & \vdots & \vdots \\ \cdots & \Sigma_{K-2,K-2} & \Sigma_{K-2,K-1} & \Sigma_{K-2,K} \\ \cdots & \Sigma_{K-1,K-2} & \Sigma_{K-1,K-1} & \Sigma_{K-1,K} \\ \cdots & \Sigma_{K,K-2} & \Sigma_{K,K-1} & \Sigma_{K,K} \end{bmatrix} \begin{bmatrix} \ddots & & & \\ \cdots & 0 & & \\ \cdots & L_{K-1,K-2} & 0 & \\ \cdots & L_{K,K-2} & L_{K,K-1} & 0 \end{bmatrix} \tag{45} $$

Temporarily ignoring the need to exploit sparsity, we see that we can calculate the lower-half blocks of Σ through backward substitution:

$$ \Sigma_{K,K} = D^{-1}_{K,K}, \tag{46a} $$
$$ \Sigma_{K,K-1} = -\Sigma_{K,K}\, L_{K,K-1}, \tag{46b} $$
$$ \Sigma_{K-1,K-1} = D^{-1}_{K-1,K-1} - \Sigma_{K-1,K}\, L_{K,K-1}, \tag{46c} $$
$$ \vdots $$
$$ \Sigma_{j,k} = \delta(j,k)\, D^{-1}_{j,k} - \sum_{\ell=k+1}^{K} \Sigma_{j,\ell}\, L_{\ell,k}, \quad (j \geq k) \tag{46d} $$

where δ(·, ·) is the Kronecker delta function.In general, blocks that are zero in L will also be zero

in Σ−1, but not the other way around. Therefore, it issufficient (but not necessary) to calculate the blocks of Σthat are non-zero in L and it turns out this can alwaysbe done. Table 1 shows some example sparsity patternsfor Σ−1 and the corresponding sparsity pattern of L. Thesparsity of the lower-half of L is the same as the sparsity ofthe lower-half of Σ−1 except that L can have a few morenon-zero entries to ensure that when multiplied together thesparsity of Σ−1 is produced. Specifically, if Lk,i 6= 0 andLj,i 6= 0 then we must have Lj,k 6= 0 (Erisman and Tinney1975); this can be visualized as completing the ‘four cornersof a box’, as shown in the example in the first column ofTable 1.

Table 1 also shows some typical robotics examples. In batch trajectory estimation, Σ^{-1} is block-tridiagonal and in this case the L matrix requires no extra non-zero entries. In SLAM, Σ^{-1} is an 'arrowhead' matrix with the upper-left partition (corresponding to the robot's trajectory) being block-tridiagonal and the lower-right partition (corresponding to the landmarks) being block-diagonal. Using an LDL^T decomposition, we can exploit the sparsity of the upper-left partition, as shown in the example. If we wanted to exploit the sparsity of the lower-right, we could reverse the order of the variables or do an L^T DL decomposition instead. In this SLAM example, each of the three landmarks is observed from each of the three poses, so the upper-right and lower-left partitions are dense and this causes some extra entries of L to be non-zero.


Table 1. Example sparsity patterns of Σ^{-1} and the corresponding sparsity patterns of the factor L, where ∗ indicates non-zero. The set of zero entries of the lower half of L is a subset of the zero entries of the lower half of Σ^{-1}. There are some extra non-zero entries of L, shown as +, that arise from completing the 'four corners of a box'; the box rule is illustrated for the leftmost example, where the bottom-right corner of the box is zero in Σ^{-1} but non-zero in L. The three columns are: a basic sparsity constraint (note the fill-in at (5, 3) in L), a trajectory example (6 robot poses), and a SLAM example (3 poses, 3 landmarks). (The sparsity-pattern matrices themselves are not reproduced here.)


Finally, to understand why we do not need to calculate all of the blocks of Σ, we follow the explanation of Erisman and Tinney (1975). We aim to compute all the blocks of the lower half of Σ corresponding to the non-zero blocks of L. Looking to equation (46d), we see that if L_{p,k} is non-zero, then we require Σ_{j,p} for the calculation of non-zero block Σ_{j,k}. But if Σ_{j,k} is non-zero, so must be L_{j,k}, and then using our 'four corners of a box' rule, this implies L_{j,p} must be non-zero and so we will have Σ_{j,p} and Σ_{p,j} = Σ^T_{j,p} on our list of blocks to compute already. This shows the calculation of the desired blocks is closed under the scheme defined by (46d), which in turn implies there will always exist an efficient algorithm to calculate the blocks of Σ corresponding to the non-zero blocks of Σ^{-1}, plus a few more according to the 'four corners of a box' rule.

It is worth noting that variable reordering and other schemes such as Givens rotations (Golub and Van Loan 1996) can be combined with the Takahashi et al. approach to maximize the benefit of sparsity in Σ^{-1} (Kaess et al. 2008). In this section, we have simply shown that, in general, the calculation of the required blocks of Σ (corresponding to the non-zero blocks of Σ^{-1}) can be piggybacked efficiently onto the solution of (25b), with the details depending on the specific problem. In fact, the bottleneck in terms of computational complexity is the original lower-diagonal-upper decomposition, which is typically required even for MAP approaches. We therefore claim that our ESGVI approach has the same order of computational cost (as a function of the state size, N) as MAP for a given problem, but will have a higher coefficient due to the extra burden of using the marginals to compute expectations.

4.3 Marginal Sampling

We have seen in the previous section that we actually only need to calculate the marginal expectations (for each factor),

$$ \underbrace{\mathbb{E}_{q_k}[\phi_k(x_k)]}_{\text{scalar}}, \qquad \underbrace{\mathbb{E}_{q_k}\!\left[\frac{\partial}{\partial x_k^T}\,\phi_k(x_k)\right]}_{\text{column}}, \qquad \underbrace{\mathbb{E}_{q_k}\!\left[\frac{\partial^2}{\partial x_k^T \partial x_k}\,\phi_k(x_k)\right]}_{\text{matrix}}, \tag{47} $$

which can then be reassembled back into the larger expectations of (30).

As a quick aside, an additional use for the 'scalar' expression above is to evaluate the loss functional,

$$ V(q) = \sum_{k=1}^{K} \underbrace{\mathbb{E}_{q_k}[\phi_k(x_k)]}_{V_k(q_k)} + \underbrace{\frac{1}{2}\ln\!\left(|\Sigma^{-1}|\right)}_{V_0}, \tag{48} $$

which is used to test for convergence and to perform backtracking during optimization. The V_0 term can be evaluated by noting that



$$ \ln\!\left(|\Sigma^{-1}|\right) = \ln\!\left(|L D L^T|\right) = \ln\left(|D|\right) = \sum_{i=1}^{N} \ln(d_{ii}), \tag{49} $$

where d_{ii} are the diagonal elements of D; this exploits the lower-diagonal-upper decomposition from (39) that we already compute during the optimization.

The computation of each expectation in (47) looks, on the surface, rather intimidating. The first and second derivatives suggest each factor must be twice differentiable, and somehow the expectation over q_k(x_k) must be computed. So far we have made no assumptions on the specific form of the factors φ_k, and we would like to keep it that way, avoiding the imposition of differentiability requirements. Additionally, recalling how sampling-based filters, such as the unscented Kalman filter (Julier and Uhlmann 1996), the cubature Kalman filter (Arasaratnam and Haykin 2009), and the Gauss-Hermite Kalman filter (Ito and Xiong 2000; Wu et al. 2006), approximate terms involving expectations, a cubature approximation of the associated expectations in (47) appears appropriate. This section considers the use of Stein's lemma and cubature methods to derive an alternative means to compute the terms in (47) that is derivative-free.

To avoid the need to compute derivatives of φ_k, we can once again apply Stein's lemma, but in the opposite direction from our previous use. Using (22) we have

$$ \mathbb{E}_{q_k}\!\left[\frac{\partial}{\partial x_k^T}\,\phi_k(x_k)\right] = \Sigma_{kk}^{-1}\,\mathbb{E}_{q_k}[(x_k - \mu_k)\,\phi_k(x_k)], \tag{50} $$

and using (23) we have

$$ \mathbb{E}_{q_k}\!\left[\frac{\partial^2}{\partial x_k^T \partial x_k}\,\phi_k(x_k)\right] = \Sigma_{kk}^{-1}\,\mathbb{E}_{q_k}[(x_k - \mu_k)(x_k - \mu_k)^T \phi_k(x_k)]\,\Sigma_{kk}^{-1} - \Sigma_{kk}^{-1}\,\mathbb{E}_{q_k}[\phi_k(x_k)]. \tag{51} $$

Thus, an alternative means of computing the three expectations in (47), without explicit computation of derivatives, involves first computing

$$ \underbrace{\mathbb{E}_{q_k}[\phi_k(x_k)]}_{\text{scalar}}, \qquad \underbrace{\mathbb{E}_{q_k}[(x_k - \mu_k)\,\phi_k(x_k)]}_{\text{column}}, \qquad \underbrace{\mathbb{E}_{q_k}\!\left[(x_k - \mu_k)(x_k - \mu_k)^T \phi_k(x_k)\right]}_{\text{matrix}}, \tag{52} $$

then computing (50) and (51) using the results of (52). The reverse application of Stein's lemma has not destroyed the sparsity that we unveiled earlier because we have now applied it at the marginal level, not the global level.

Of interest next is how to actually compute the three expectations given in (52) in an efficient yet accurate way. As integrals, the expectations in (52) are

$$ \mathbb{E}_{q_k}[\phi_k(x_k)] = \int_{-\infty}^{\infty} \phi_k(x_k)\, q_k(x_k)\, dx_k, \tag{53a} $$
$$ \mathbb{E}_{q_k}[(x_k - \mu_k)\,\phi_k(x_k)] = \int_{-\infty}^{\infty} (x_k - \mu_k)\,\phi_k(x_k)\, q_k(x_k)\, dx_k, \tag{53b} $$
$$ \mathbb{E}_{q_k}\!\left[(x_k - \mu_k)(x_k - \mu_k)^T \phi_k(x_k)\right] = \int_{-\infty}^{\infty} (x_k - \mu_k)(x_k - \mu_k)^T \phi_k(x_k)\, q_k(x_k)\, dx_k, \tag{53c} $$

where q_k(x_k) = N(µ_k, Σ_{kk}). Computing these integrals analytically is generally not possible, and as such, a numerical approximation is sought. There are many ways of approximating the integrals in (53), the most popular type being multi-dimensional Gaussian quadrature, commonly referred to as Gaussian cubature or simply cubature (Cools 1997; Sarmavuori and Sarkka 2012; Kokkala et al. 2016; Sarkka et al. 2016; Sarkka 2013, §6, p. 100). Using cubature, each of the integrals in (53) is approximated as (Kokkala et al. 2016; Sarkka et al. 2016; Sarkka 2013, §6, pp. 99-106)

$$ \mathbb{E}_{q_k}[\phi_k(x_k)] \approx \sum_{\ell=1}^{L} w_{k,\ell}\, \phi_k(x_{k,\ell}), \tag{54a} $$
$$ \mathbb{E}_{q_k}[(x_k - \mu_k)\,\phi_k(x_k)] \approx \sum_{\ell=1}^{L} w_{k,\ell}\, (x_{k,\ell} - \mu_k)\,\phi_k(x_{k,\ell}), \tag{54b} $$
$$ \mathbb{E}_{q_k}[(x_k - \mu_k)(x_k - \mu_k)^T \phi_k(x_k)] \approx \sum_{\ell=1}^{L} w_{k,\ell}\, (x_{k,\ell} - \mu_k)(x_{k,\ell} - \mu_k)^T \phi_k(x_{k,\ell}), \tag{54c} $$

where w_{k,ℓ} are weights, x_{k,ℓ} = µ_k + √Σ_{kk} ξ_{k,ℓ} are sigmapoints, and ξ_{k,ℓ} are unit sigmapoints. Both the weights and unit sigmapoints are specific to the cubature method. For example, the popular unscented transformation (Julier and Uhlmann 1996; Sarkka et al. 2016; Sarkka 2013, §6, pp. 109-110) uses weights

$$ w_{k,0} = \frac{\kappa}{N_k + \kappa}, \qquad w_{k,\ell} = \frac{1}{2(N_k + \kappa)}, \quad \ell = 1, \ldots, 2N_k, \tag{55} $$

and sigmapoints

$$ \xi_{k,\ell} = \begin{cases} 0 & \ell = 0 \\ \sqrt{N_k + \kappa}\, \mathbf{1}_\ell & \ell = 1, \ldots, N_k \\ -\sqrt{N_k + \kappa}\, \mathbf{1}_{\ell - N_k} & \ell = N_k + 1, \ldots, 2N_k \end{cases}, \tag{56} $$


where N_k is the dimension of x_k. On the other hand, the spherical-cubature rule (Arasaratnam and Haykin 2009; Kokkala et al. 2016; Sarkka 2013, §6, pp. 106-109) uses weights

$$ w_{k,\ell} = \frac{1}{2N_k}, \quad \ell = 1, \ldots, 2N_k, \tag{57} $$

and sigmapoints

$$ \xi_{k,\ell} = \begin{cases} \sqrt{N_k}\, \mathbf{1}_\ell & \ell = 1, \ldots, N_k \\ -\sqrt{N_k}\, \mathbf{1}_{\ell - N_k} & \ell = N_k + 1, \ldots, 2N_k \end{cases}, \tag{58} $$

where 1_i is an N_k × 1 column matrix with a 1 at row i and zeros everywhere else. Gauss-Hermite cubature is yet another method that can be used to compute the approximations in (54) (Ito and Xiong 2000; Wu et al. 2006; Sarkka 2013, §6, pp. 99-106). As discussed in Sarkka (2013, §6, p. 103), given an integrand composed of a linear combination of monomials of the form x_1^{d_1}, x_2^{d_2}, ..., x_{N_k}^{d_{N_k}}, the Mth-order Gauss-Hermite cubature rule is exact when d_i ≤ 2M - 1. However, for an Mth-order Gauss-Hermite cubature approximation, M^{N_k} sigmapoints are needed, which could be infeasible in practice when N_k is large (Sarkka 2013, §6, p. 103). Fortunately, the approximations of (53) given in (54) are at the factor level (i.e., at the level of x_k, not x), and at the factor level N_k is often a manageable size in most robotics problems. For this reason, Gauss-Hermite cubature is used in our numerical work presented in Sections 6.1, 6.2, and 6.3, yielding accurate yet reasonably efficient approximations of (53).
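The following sketch (our own, using the unscented transformation (55)-(56) rather than the Gauss-Hermite rule used in the paper's experiments) generates sigmapoints for a marginal q_k, evaluates the three weighted sums in (54), and maps the results back through (50)-(51); the two-dimensional range factor, its parameters, and the choice of κ are hypothetical illustrations.

```python
import numpy as np

def unscented_expectations(phi_k, mu_k, Sigma_kk, kappa=2.0):
    """Approximate the three expectations in (54) with the unscented
    transformation (55)-(56), then map them through (50)-(51)."""
    Nk = mu_k.shape[0]
    S = np.linalg.cholesky(Sigma_kk)                 # one choice of matrix square root
    # sigmapoints: mean point plus/minus scaled columns of the square root
    X = np.vstack([mu_k,
                   mu_k + np.sqrt(Nk + kappa)*S.T,
                   mu_k - np.sqrt(Nk + kappa)*S.T])
    w = np.full(2*Nk + 1, 1.0/(2.0*(Nk + kappa)))
    w[0] = kappa/(Nk + kappa)

    f = np.array([phi_k(x) for x in X])              # factor evaluated at sigmapoints
    dx = X - mu_k
    scalar = np.sum(w*f)                                      # (54a)
    column = dx.T @ (w*f)                                     # (54b)
    matrix = (dx*(w*f)[:, None]).T @ dx                       # (54c)

    Sinv_kk = np.linalg.inv(Sigma_kk)
    grad = Sinv_kk @ column                                   # (50)
    hess = Sinv_kk @ matrix @ Sinv_kk - Sinv_kk*scalar        # (51)
    return scalar, grad, hess

# hypothetical 2-D range factor: phi_k = 0.5*(r - ||x_k||)^2 / R
r_meas, R = 2.0, 0.01
phi_k = lambda x: 0.5*(r_meas - np.linalg.norm(x))**2/R
print(unscented_expectations(phi_k, np.array([1.5, 1.0]), 0.05*np.eye(2)))
```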

Some additional remarks are as follows:

1. The accuracy of the cubature approximations in (54) will depend on the specific method and the severity of the nonlinearity in φ_k. Alternative means to approximate (54), such as cubature methods that are exact for specific algebraic and trigonometric polynomials (Cools 1997; Kokkala et al. 2016), Gaussian-process cubature (O'Hagan 1991; Sarkka et al. 2016), or even adaptive cubature methods (Press et al. 2007, §4, p. 194), can be employed. In the case where computational complexity is of concern, a high-degree cubature rule that is an efficient alternative to Gauss-Hermite cubature is presented in Jia et al. (2013).

2. We are proposing quite a different way of using a cubature method (or any sampling method) than is typical in the state estimation literature; we consider the entire factor expression, φ_k, to be the nonlinearity, not just the observation or motion models, as is common. This means, for example, that if there is a robust cost function incorporated in our factor expression (Barfoot 2017, §5, p. 163; MacTavish and Barfoot 2015), it is handled automatically and does not need to be implemented as iteratively reweighted least squares (Holland and Welsch 1977).

3. Because we have 'undone' Stein's lemma at this point (it was a temporary step to exploit the sparsity only), it may not even be necessary to have φ_k differentiable anymore. Appendix C shows how to get to the derivative-free version of ESGVI directly, without the double application of Stein's lemma. This opens the door to some interesting possibilities, including the use of the H∞ (worst case) norm, hard constraints on some or all of the states, or the aforementioned use of a robust cost function, within the factor φ_k. An appropriate sampling method would be required.

4. We see in (54) that the scalars, φ_k, serve to reweight each sample, but that otherwise the expressions are simply those for the first three moments of a distribution.

5. We also use cubature to evaluate V(q) according to (48). This is required to test for convergence of the optimization scheme.

The approach that we have presented up to this point is extremely general and can benefit any GVI problem where p(x, z) can be factored. In computer vision and robotics, some examples include BA (Brown 1958) and SLAM (Durrant-Whyte and Bailey 2006). In Section 6, we will demonstrate the technique first on controlled toy problems, then on a batch SLAM problem.

5 Extensions

Before moving on to our experiments, we pause to elaborate on two extensions of the main paper. First, we discuss an alternate loss functional that leads to a modified optimization scheme similar to a Gauss-Newton solver and offers some computational savings. Second, we discuss how to extend our approach beyond estimation of the latent state to include estimation of unknown parameters in our models.

5.1 Alternate Loss Functional

We can consider an alternate variational problem that may offer computational advantages over the main ESGVI approach of this paper. We consider the special case where the negative log-likelihood takes the form

$$ \phi(x) = \frac{1}{2}\, e(x)^T W^{-1} e(x). \tag{59} $$

Substituting this into the loss functional, we have

$$ V(q) = \frac{1}{2}\,\mathbb{E}_q\!\left[e(x)^T W^{-1} e(x)\right] + \frac{1}{2}\ln(|\Sigma^{-1}|). \tag{60} $$

Owing to the convexity of the quadratic expression, e^T W^{-1} e, we can apply Jensen's inequality (Jensen 1906) directly to write


directly to write

Eq[e(x)]TW−1Eq[e(x)] ≤ Eq[e(x)TW−1e(x)

]. (61)

The Jensen gap is the (positive) difference between the right and left sides of this inequality and will generally tend to be larger the more nonlinear e(x) is and the less concentrated q(x) is. Motivated by this relationship, we can define a new loss functional as

V'(q) = (1/2) E_q[e(x)]^T W^{-1} E_q[e(x)] + (1/2) ln(|Σ^{-1}|), (62)

which may be thought of as a (conservative) approximation of V(q) that is appropriate for mild nonlinearities and/or concentrated posteriors; the conservative aspect will be discussed a bit later on. We will now show that we can minimize V'(q) by iteratively updating q(x) and continue to exploit the problem sparsity arising from a factored likelihood.

We begin by noting that we can directly approximate the expected error as

E_{q^{(i+1)}}[e(x)] ≈ E_{q^{(i)}}[e(x)] + (∂/∂μ) E_{q^{(i)}}[e(x)] (μ^{(i+1)} − μ^{(i)}) = e^{(i)} + E^{(i)} δμ, (63)

where we have employed the derivative identity in (24a), and where δμ = μ^{(i+1)} − μ^{(i)}, e^{(i)} = E_{q^{(i)}}[e(x)], and E^{(i)} = E_{q^{(i)}}[∂e(x)/∂x]. We can then approximate the loss functional as

V'(q) ≈ (1/2) (e^{(i)} + E^{(i)} δμ)^T W^{-1} (e^{(i)} + E^{(i)} δμ) + (1/2) ln(|Σ^{-1}|), (64)

which is now exactly quadratic in δμ. This specific approximation leads directly to a Gauss-Newton estimator, bypassing Newton's method, as we have implicitly approximated the Hessian (Barfoot 2017, p. 131). Taking the first and second derivatives with respect to δμ, we have

∂V'(q)/∂δμ^T = E^{(i)T} W^{-1} (e^{(i)} + E^{(i)} δμ), (65a)
∂²V'(q)/(∂δμ^T ∂δμ) = E^{(i)T} W^{-1} E^{(i)}. (65b)

For the derivative with respect to Σ−1, we have

∂V'(q)/∂Σ^{-1} ≈ −(1/2) Σ E^{(i)T} W^{-1} E^{(i)} Σ + (1/2) Σ, (66)

where the approximation enforces the relationship in (24b), which does not hold exactly anymore due to the altered nature of V'(q). Setting this to zero for a critical point, we have

(Σ^{-1})^{(i+1)} = E^{(i)T} W^{-1} E^{(i)}, (67)

where we have created an iterative update analogous to that in the main ESGVI approach.

For the mean, we set (65a) to zero, and then for the optimal update we have

E^{(i)T} W^{-1} E^{(i)} δμ = −E^{(i)T} W^{-1} e^{(i)}, (68)

where the coefficient on the left-hand side is exactly (Σ^{-1})^{(i+1)} from (67).

Solving for δμ provides a Gauss-Newton update, which we will refer to as ESGVI Gauss-Newton (ESGVI-GN). This is identical to how Gauss-Newton is normally carried out, except that we calculate e and E not at a single point but rather as expectations over our Gaussian posterior estimate. We again make a number of remarks about the approach (a code sketch of one ESGVI-GN iteration follows these remarks):

1. The sparsity of the inverse covariance matrix, Σ^{-1}, will be identical to the full ESGVI approach. This can be seen by noting that

φ(x) = Σ_{k=1}^{K} φ_k(x_k) = (1/2) Σ_{k=1}^{K} e_k(x_k)^T W_k^{-1} e_k(x_k) = (1/2) e(x)^T W^{-1} e(x), (69)

where

e(x) = [e_1(x_1); ... ; e_K(x_K)],  W = diag(W_1, ..., W_K). (70)

Then we have

Σ^{-1} = E_q[∂e(x)/∂x]^T W^{-1} E_q[∂e(x)/∂x] = Σ_{k=1}^{K} P_k^T E_{q_k}[∂e_k(x_k)/∂x_k]^T W_k^{-1} E_{q_k}[∂e_k(x_k)/∂x_k] P_k, (71)

which will have zeros wherever an error term does not depend on the variables. We also see, just as before, that the expectations can be reduced to being over the marginal, q_k(x_k), meaning we still only require the blocks of Σ corresponding to the non-zero blocks of Σ^{-1}.


2. We can still use Stein's lemma to avoid the need to compute any derivatives:

E_{q_k}[∂e_k(x_k)/∂x_k] = E_{q_k}[e_k(x_k)(x_k − μ_k)^T] Σ_{kk}^{-1}. (72)

This is sometimes referred to as a statistical Jacobian, and this usage is very similar to the filtering and smoothing approaches described by Sarkka (2013), amongst others, as cubature can be applied at the measurement-model level rather than the factor level. Because we are iteratively recomputing the statistical Jacobian about our posterior estimate, this is most similar to Sibley et al. (2006) and García-Fernández et al. (2015), although some details are different, as well as the fact that we started from our loss functional, V'(q).

3. The number of cubature points required to calculate E_{q_k}[e_k(x_k)(x_k − μ_k)^T] will be lower than in our full ESGVI approach described earlier, as the order of the expression in the integrand is half that of E_{q_k}[(x_k − μ_k)(x_k − μ_k)^T φ_k(x_k)]. Since the number of cubature points goes up as M^{N_k}, cutting M in half is significant and could be the difference between tractable and not for some problems. This was the main motivation for exploring this alternate approach.

4. It is known that minimizing KL(q||p), which our V(q) is effectively doing, can result in a Gaussian that is too confident (i.e., the inverse covariance is too large) (Bishop 2006; Ala-Luhtala et al. 2015). A side benefit of switching from V(q) to V'(q) is that the resulting inverse covariance will be more conservative. This follows from Jensen's inequality once again. For an arbitrary non-zero vector, a, we have

0 < a^T E_q[∂e(x)/∂x]^T W^{-1} E_q[∂e(x)/∂x] a   (the Σ^{-1} from V'(q))
  ≤ a^T E_q[(∂e(x)/∂x)^T W^{-1} (∂e(x)/∂x)] a   (Jensen)
  ≈ a^T E_q[∂²φ(x)/(∂x^T ∂x)] a   (Gauss-Newton; the Σ^{-1} from V(q)), (73)

which ensures that not only do we have a positive-definite inverse covariance, but also that it is conservative compared to the full ESGVI approach.

Due to the extra approximations made in ESGVI-GN compared to ESGVI, it remains to be seen whether it offers an improvement over MAP approaches. However, as ESGVI-GN provides a batch option that does not require any derivatives, it can be used as a less expensive preprocessor for the derivative-free version of full ESGVI.
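To summarize how the pieces of this section fit together, here is a sketch of one ESGVI-GN iteration in Python (our own illustration under stated assumptions, not the authors' implementation). Each factor supplies an error function e_k, a weight W_k^{-1}, and the indices of the state variables it touches; gh_sigma_points is the hypothetical cubature helper from the earlier sketch; solve_sparse stands in for a sparse solver; and, for readability, the information matrix is accumulated densely rather than in the exactly sparse form the paper advocates. Backtracking and the recovery of the required marginal covariance blocks are omitted.

```python
import numpy as np

def esgvi_gn_iteration(factors, mu, Sigma_blocks, solve_sparse, M=3):
    """One ESGVI-GN step per (63)-(68), with statistical Jacobians from Stein's lemma (72)."""
    N = mu.size
    A = np.zeros((N, N))   # accumulates (Sigma^{-1})^{(i+1)} = sum_k P_k^T E_k^T W_k^{-1} E_k P_k, cf. (67), (71)
    b = np.zeros(N)        # accumulates the right-hand side -E^T W^{-1} e of (68)
    for k, (e_k, W_k_inv, idx_k) in enumerate(factors):
        mu_k, Sigma_kk = mu[idx_k], Sigma_blocks[k]     # only a non-zero block of Sigma is needed
        pts, wts = gh_sigma_points(mu_k, Sigma_kk, M)
        errs = np.array([e_k(x) for x in pts])          # vector-valued error at each sigmapoint
        e_bar = wts @ errs                              # E_{q_k}[e_k]
        # statistical Jacobian, eq. (72): E[de/dx] = E[e (x - mu)^T] Sigma_kk^{-1}
        E_stat = (errs * wts[:, None]).T @ (pts - mu_k) @ np.linalg.inv(Sigma_kk)
        A[np.ix_(idx_k, idx_k)] += E_stat.T @ W_k_inv @ E_stat
        b[idx_k] += -E_stat.T @ W_k_inv @ e_bar
    dmu = solve_sparse(A, b)   # in practice: a sparse Cholesky solve on the block-sparse pattern
    return mu + dmu, A         # updated mean and (Sigma^{-1})^{(i+1)}
```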

5.2 Parameter Estimation

Although it is not the main focus of our paper, we use this section to provide a sketch of how parameters may also be estimated using our ESGVI framework. We introduce some unknown parameters, θ, to our loss functional,

V(q|θ) = E_q[φ(x|θ)] + (1/2) ln(|Σ^{-1}|), (74)

and recall that V(q|θ) is the negative of the so-called Evidence Lower Bound (ELBO), which can be used in an Expectation Maximization (EM) framework to estimate parameters when there is a latent state (Neal and Hinton 1998; Ghahramani and Roweis 1999). The expectation, or E-step, is already accomplished by ESGVI; we simply hold θ fixed and run the inference to convergence to solve for q(x), our Gaussian approximation to the posterior. In the M-step, which is actually a minimization in our case, we hold q(x) fixed and find the value of θ that minimizes the loss functional. By alternating between the E- and M-steps, we can solve for the best value of the parameters to minimize − ln p(z|θ), the negative log-likelihood of the measurements given the parameters.

As we have done in the main part of the paper, we assume the joint likelihood of the state and measurements (given the parameters) factors so that

φ(x|θ) = Σ_{k=1}^{K} φ_k(x_k|θ), (75)

where for generality we have each factor being affected by the entire parameter set, θ, but in practice it could be a subset. Taking the derivative of the loss functional with respect to θ, we have

∂V(q|θ)/∂θ = (∂/∂θ) E_q[φ(x|θ)] = (∂/∂θ) E_q[Σ_{k=1}^{K} φ_k(x_k|θ)] = Σ_{k=1}^{K} E_{q_k}[∂φ_k(x_k|θ)/∂θ], (76)

where in the last expression the expectation simplifies to being over the marginal, q_k(x_k), rather than the full Gaussian, q(x). As with the main ESGVI approach, this means that we only need the blocks of the covariance, Σ, corresponding to the non-zero blocks of Σ^{-1}, which we are already calculating as part of the E-step. Furthermore,


Table 2. Descriptions of variants of our ESGVI algorithm tested in our experiments. The first two algorithms in the table are MAP comparisons, which can be recovered in our framework by evaluating all the expectations using a single quadrature point at the mean of the current estimate. The only extra feature added to all the methods beyond the 'plain vanilla' implementation described in the theory section is that all algorithms are allowed to backtrack when updating the mean and inverse covariance in order to ensure that the loss functional actually decreases at each iteration.

algorithm label            method to evaluate expectations in (47)        M, number of quadrature points (per dimension)
MAP Newton                 analytical Jacobian and Hessian                1
MAP GN                     analytical Jacobian and approximate Hessian    1
ESGVI deriv M=2            analytical Jacobian and Hessian + quadrature   2
ESGVI deriv M=3            analytical Jacobian and Hessian + quadrature   3
ESGVI deriv-free M=3       Stein's lemma + quadrature                     3
ESGVI-GN deriv-free M=3    Stein's lemma + quadrature                     3
ESGVI deriv-free M=4       Stein's lemma + quadrature                     4
ESGVI deriv-free M=10      Stein's lemma + quadrature                     10

we can easily evaluate the marginal expectations using cubature.

To make this more tangible, consider the example of

φ(x|W) = (1/2) Σ_{k=1}^{K} (e_k(x_k)^T W^{-1} e_k(x_k) − ln(|W^{-1}|)), (77)

where the unknown parameter is W, the measurement covariance matrix. Then, taking the derivative with respect to W^{-1}, we have

∂V(q|W)/∂W^{-1} = (1/2) Σ_{k=1}^{K} E_{q_k}[e_k(x_k) e_k(x_k)^T] − (K/2) W. (78)

Setting this to zero for a minimum, we have

W = (1/K) Σ_{k=1}^{K} E_{q_k}[e_k(x_k) e_k(x_k)^T], (79)

where we can use cubature to evaluate the marginal expectations. Reiterating, we never require the full covariance matrix, Σ, implying that our exactly sparse framework extends to parameter estimation.
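As a concrete, hedged illustration of this M-step, the following Python sketch evaluates (79) with cubature over each marginal; the factor description (e_k, idx_k), the list of marginal covariance blocks, and the reuse of the hypothetical gh_sigma_points helper from the earlier sketch are assumptions of the sketch, not the authors' code.

```python
import numpy as np

def m_step_measurement_covariance(factors, mu, Sigma_blocks, M=3):
    """M-step of (79): W = (1/K) sum_k E_{q_k}[ e_k(x_k) e_k(x_k)^T ]."""
    W_sum, K = 0.0, len(factors)
    for k, (e_k, idx_k) in enumerate(factors):
        pts, wts = gh_sigma_points(mu[idx_k], Sigma_blocks[k], M)
        errs = np.array([e_k(x) for x in pts])            # error at each sigmapoint of the marginal
        W_sum = W_sum + (errs * wts[:, None]).T @ errs    # cubature estimate of E_{q_k}[e_k e_k^T]
    return W_sum / K                                      # q is held fixed; the E-step is then re-run
```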

6 Evaluation

We compared several variants of our algorithm on three different test problems: (i) a scalar toy problem motivated by a stereo camera to show that ESGVI achieves a better estimate than MAP, (ii) a multi-dimensional SLAM problem that shows we can carry out ESGVI in a tractable way, and (iii) a SLAM problem using a real robotics dataset that shows the new methods work well in practice.

Table 2 lists the variants of ESGVI that we studied as well as two MAP variants for comparison. The methods differ in the way the expectations of (47) are evaluated. 'Analytical' means that we calculate derivatives in closed form, whereas 'derivative-free' means we used Stein's lemma to avoid derivatives. All the methods attempt to minimize V(q) (i.e., they are running our Newton-style optimizer) unless they have the 'GN' (Gauss-Newton) designation, which indicates that they are using the alternate loss functional, V'(q), from Section 5.1. All the methods use cubature to calculate the expectations, but the number of points per dimension is varied. The MAP methods are simply special cases of the ESGVI approach in which we use analytical derivatives and a single quadrature point located at the mean of the current estimate.

All the methods use the 'plain vanilla' optimization scheme exactly as described in the paper, with the exception of one extra feature. When updating the mean and inverse covariance, we allow backtracking if the loss functional is not reduced. In other words, we attempt to update according to

μ^{(i+1)} ← μ^{(i)} + α^B δμ, (80a)
(Σ^{-1})^{(i+1)} ← (Σ^{-1})^{(i)} + α^B δΣ^{-1}, (80b)

where α = 0.95 and with B = 0, 1, 2, ... increasing until the loss functional goes down from the previous iteration. All the methods tested were allowed to do this in the same way and used (48) to calculate the loss functional.
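A minimal Python sketch of this backtracking rule (our own illustration; the cap on the number of backtracking steps, B_max, and the loss callback V are assumptions not specified in the text):

```python
def backtracking_update(V, mu, Sigma_inv, dmu, dSigma_inv, alpha=0.95, B_max=50):
    """Try step sizes alpha**B, B = 0, 1, 2, ..., per (80), until the loss functional decreases."""
    V_old = V(mu, Sigma_inv)
    for B in range(B_max):
        step = alpha ** B
        mu_new = mu + step * dmu
        Sigma_inv_new = Sigma_inv + step * dSigma_inv
        if V(mu_new, Sigma_inv_new) < V_old:   # accept the first step that reduces the loss
            return mu_new, Sigma_inv_new
    return mu, Sigma_inv                       # no improving step found; keep the previous iterate
```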

6.1 Experiment 1: Stereo One-Dimensional Simulation

Our first simulation is a simple one-dimensional, nonlinear estimation problem motivated by the type of inverse-distance nonlinearity found in a stereo camera model. As this problem is only one-dimensional, we cannot demonstrate the ability to exploit sparsity in the problem, leaving this to the next two subsections. Here our aim is to show that indeed our proposed iterative scheme converges to the minimum of our loss functional and also that we offer an improvement over the usual MAP approach.


This same experiment (with the same parameter settings) was used as a running example by Barfoot (2017, §4). We assume that our true state is drawn from a Gaussian prior:

x ~ N(μ_p, σ_p^2). (81)

We then generate a measurement according to

y = fb/x + n,  n ~ N(0, σ_r^2), (82)

where n is measurement noise. The numerical values of the parameters used in our trials were

μ_p = 20 [m],  σ_p^2 = 9 [m^2],  f = 400 [pixel],  b = 0.1 [m],  σ_r^2 = 0.09 [pixel^2]. (83)

The two factors are defined as

φ = (1/2) (x − μ_p)^2/σ_p^2,  ψ = (1/2) (y − fb/x)^2/σ_r^2, (84)

so that − ln p(x, z) = φ + ψ + constant. Our loss functional is therefore

V(q) = E_q[φ] + E_q[ψ] + (1/2) ln(σ^{-2}), (85)

where q = N(μ, σ^2) is our estimate of the posterior. We seek to find the q that minimizes V(q). This problem can also be viewed as the correction step of the Bayes filter (Jazwinski 1970): start from a prior and correct it based on the latest (nonlinear) measurement.
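To make the toy problem concrete, the following Python sketch (our own illustration, not the authors' code) evaluates the loss functional (85) for a candidate posterior q = N(μ, σ^2) using one-dimensional Gauss-Hermite quadrature; scanning V over a grid of (μ, σ^{-2}) values is one way to reproduce contour plots like those in Figure 2, or V can be handed to a generic optimizer.

```python
import numpy as np

# parameter values from (83); y is one simulated disparity measurement
mu_p, sig2_p = 20.0, 9.0             # prior mean [m] and variance [m^2]
f, b, sig2_r = 400.0, 0.1, 0.09      # focal length [pixel], baseline [m], noise variance [pixel^2]

def phi(x):                          # prior factor, eq. (84)
    return 0.5 * (x - mu_p) ** 2 / sig2_p

def psi(x, y):                       # stereo (inverse-distance) factor, eq. (84)
    return 0.5 * (y - f * b / x) ** 2 / sig2_r

def V(mu, sig2, y, M=20):
    """Loss functional (85) under q = N(mu, sig2), via M-point Gauss-Hermite quadrature."""
    xi, w = np.polynomial.hermite.hermgauss(M)
    x = mu + np.sqrt(2.0 * sig2) * xi           # quadrature nodes mapped to the estimate q
    E = lambda g: np.sum(w * g(x)) / np.sqrt(np.pi)
    return E(phi) + E(lambda x: psi(x, y)) + 0.5 * np.log(1.0 / sig2)
```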

To conduct a proper Bayesian experiment, we ran 100,000 trials where each one consisted of drawing the latent state from the prior, then producing a noisy measurement given that state. To stay clear of edge cases (e.g., negative distance), we only accepted a draw of the latent state if it was within 4 standard deviations of the mean, resulting in 6 out of the 100,000 draws not being accepted and the state being redrawn. We then ran several variants of our algorithm, summarized in Table 2. Everything else to do with the experiment was the same for all algorithms, allowing a fair comparison. Figure 1 shows the statistical results of our 100,000 trials as boxplots. The columns correspond to the different versions of our algorithm while the rows are different performance metrics. The first column (analytical Hessian and Jacobian with a single quadrature point at the mean) is equivalent to a standard MAP approach. We can see that our new algorithms do require a few more iterations (first row) to converge than MAP, which is to say that it takes more computation to arrive at a better approximation to the posterior. We also see that the new algorithms do find a lower final value of the loss functional, V(q), which is what we asked them to minimize (second row).

We also wanted to see if the new algorithms were less biased and more consistent than MAP, and so calculated the bias as the sum of errors (third row), the squared error (fourth row), and the squared Mahalanobis / Normalized Estimation Error Squared (NEES) (fifth row). We cannot ask the estimator to minimize these quantities (because they are based on knowledge of the groundtruth values of the latent state), but our hypothesis has been that by minimizing V(q), we should also do better on these metrics. Looking at the third row, all the GVI variants are less biased than MAP by an order of magnitude or more. Our MAP error of −30.6 cm is consistent with the result reported by Barfoot (2017, §4). The best algorithm reported there, the Iterated Sigmapoint Kalman Filter (ISPKF) (Sibley et al. 2006), had a bias of −3.84 cm. Our best algorithm had a bias of 0.3 cm. The squared error (fourth row) is also slightly improved compared to MAP.

The squared Mahalanobis / NEES error should be close to 1 for a one-dimensional problem; here the results are mixed, with some of our approaches doing better than MAP and some not. It seems that our choice of KL(q||p) rather than KL(p||q) results in a slightly overconfident covariance. Bishop (2006, fig. 10.1) shows a similar situation for the same choice of KL(q||p), as does Ala-Luhtala et al. (2015). As discussed earlier, it may be possible to overcome this by changing the relative weighting between the two main terms in V(q) through the use of a metaparameter that is optimized for a particular situation.

Figure 2 shows the details of a single trial of the 100,000 that we ran. We show only a subset of the algorithms (rows) in the interest of space. The left column provides a contour plot of V(q) (μ on the horizontal and σ^{-2} on the vertical) and the path the optimizer actually took to arrive at its minimum (red dot) starting from the prior (green dot). The right column shows the value of V(q) at each iteration. It is worth noting that we show the true value of the loss (calculated using (48) with a large number of quadrature points) as well as the approximation of the loss that the algorithm had access to during its iterations (each algorithm used a different number of quadrature points, M). We see that the MAP approach clearly does not terminate at the minimum of V(q); its approximation of the required expectations is too severe to converge to the minimum. The other algorithms end up very close to the true minimum, in a similar number of iterations.

In the next section, we introduce time and allow our simulated robot to move along the x-axis, with the same nonlinear stereo camera model. Our aim is to show that we can exploit the sparse structure of the problem in higher dimensions. We deliberately chose a SLAM problem to showcase that our ESGVI approach can work even when the probabilistic graphical model has loops.


Figure 1. (Experiment 1) Statistical results of 100,000 trials of the one-dimensional stereo camera simulation shown as standard boxplots. The different rows show different performance metrics for the different variants of our algorithm (columns). Table 2 provides details of the different algorithms tested. The number above an algorithm label is its mean performance on that metric. Further details are discussed in the text.


Figure 2. (Experiment 1) One trial of the one-dimensional stereo camera simulation showing the convergence history for four different algorithms ('MAP Newton', 'ESGVI deriv M=2', 'ESGVI deriv M=3', and 'ESGVI deriv-free M=4'), one per row. The left column shows a contour map of the loss functional, V(q), with the steps the optimizer took starting from the prior (green dot) to its converged value (red dot). The right column shows the loss at each iteration; as each algorithm makes different approximations to the loss during execution, we show the loss that each algorithm used to make decisions and the actual loss at each step.


Figure 3. (Experiment 2) Factor graph for the stereo K-dimensional simulation. White circles represent variables to be estimated (both robot positions and landmark positions). Small black dots represent factors in the joint likelihood of the state and data.

6.2 Experiment 2: Stereo K-Dimensional Simulation

This simulation was designed to show that we can scale up to a more realistic problem size, while still deriving benefit from our variational approach. We extend the stereo camera problem from the previous section to the time domain by allowing a robot to move along the x-axis. In order to continue to carry out a proper Bayesian comparison of algorithms, we introduce a prior both on the robot motion and on the landmark positions in this SLAM problem. The factor graph for the problem can be seen in Figure 3.

The state to be estimated is

x = [x_0; x_1; ... ; x_K; m_1; ... ; m_K],  x_k = [p_k; v_k], (86)

where p_k is a robot position, v_k a robot speed, and m_k a landmark position. The problem is highly structured as each landmark is seen exactly twice, from two consecutive robot positions.

For the (linear) prior factors we have

φ_k = (1/2) (x_0 − x̌_0)^T P^{-1} (x_0 − x̌_0) for k = 0, and
φ_k = (1/2) (x_k − A x_{k−1})^T Q^{-1} (x_k − A x_{k−1}) for k > 0, (87a)

ϕ_k = (1/2) (m_k − μ_{m,k})^2/σ_m^2, (87b)

with

P = diag(σ_p^2, σ_v^2),  A = [1 T; 0 1],  Q = [(1/3)T^3 Q_C  (1/2)T^2 Q_C; (1/2)T^2 Q_C  T Q_C], (88)

where T is the discrete-time sampling period, Q_C is a power spectral density, and σ_p^2, σ_v^2, σ_m^2 are variances. The robot state prior encourages constant velocity (Barfoot 2017, §3, p. 85). The landmark prior is simply a Gaussian centered at the true landmark location, μ_{m,k}.

For the (nonlinear) measurement factors we have

ψ_{ℓ,k} = (1/2) (y_{ℓ,k} − fb/(m_ℓ − p_k))^2/σ_r^2, (89)

where f and b are the camera parameters (same as the previous experiment), y_{ℓ,k} is the disparity measurement of the ℓth landmark from the kth position, and σ_r^2 is the measurement noise variance.

The negative log-likelihood of the state and data is then

− ln p(x, z) = Σ_{k=0}^{K} φ_k + Σ_{k=1}^{K} ϕ_k + Σ_{k=1}^{K} (ψ_{k,k−1} + ψ_{k,k}) + constant. (90)

We set the maximum number of timesteps to be K = 99 for this problem, resulting in an overall state dimension of 299. Figure 4 shows the sparsity patterns of Σ^{-1}, L, and the blocks of Σ that get computed by the method of Takahashi et al. (1973). This can very likely be improved further using modern sparsity techniques, but the point is that we have a proof-of-concept scheme that can compute the subset of blocks of Σ required to carry out GVI.
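For reference, here is a small dense Python sketch of the Takahashi et al. (1973) recursion as we understand it (our own illustration, not the implementation used in the paper). Starting from A = LDL^T, it fills in the upper triangle of Σ = A^{-1} from the bottom-right corner; in a sparse implementation, only the entries of Σ matching the fill-in pattern of L need to be computed and stored (Erisman and Tinney 1975), which is what allows the exactly sparse scheme to recover just the blocks of Σ matching the non-zero blocks of Σ^{-1}.

```python
import numpy as np

def takahashi_upper(A):
    """Dense proof-of-concept of the Takahashi recursion: upper triangle of Sigma = A^{-1}."""
    n = A.shape[0]
    R = np.linalg.cholesky(A)            # A = R R^T, with R lower triangular
    d = np.diag(R) ** 2                  # D of the equivalent L D L^T factorization
    L = R / np.diag(R)                   # unit lower-triangular factor (columns rescaled)
    Sigma = np.zeros((n, n))
    for j in range(n - 1, -1, -1):       # sweep from the bottom-right corner
        for i in range(j, -1, -1):
            s = 1.0 / d[i] if i == j else 0.0
            for k in range(i + 1, n):    # only k with L[k, i] != 0 contribute
                if L[k, i] != 0.0:
                    s -= L[k, i] * (Sigma[k, j] if k <= j else Sigma[j, k])
            Sigma[i, j] = s
    return Sigma                          # compare against np.linalg.inv(A) on a small test matrix
```

A sparse implementation would run the same recursion restricted to the fill-in pattern of L, skipping all other entries.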

We ran 10,000 trials of this simulation. In each trial, we drew the latent robot trajectory and landmark states from the Bayesian prior, then simulated the nonlinear measurements with a random draw of the noise. We estimated the full state using four different algorithms from Table 2: 'MAP Newton', 'ESGVI deriv M=2', 'ESGVI deriv M=3', and 'ESGVI deriv-free M=4'.

Figure 5 shows the statistical results of the 10,000 trials. All the algorithms converge well in a small number of iterations (usually 4). Increasing the number of cubature points for the derivative-based methods does result in reducing the overall value of the loss functional, V(q); the 'ESGVI deriv-free M=4' method does about as well as


Figure 4. (Experiment 2) Sparsity patterns for the stereo K-dimensional simulation. The red partition lines separate the robot state variables from the landmark variables. The inverse covariance, Σ^{-1}, is highly sparse owing to the factor graph pattern in Figure 3; only 1,687/89,401 = 1.9% of entries are non-zero. After performing a lower-diagonal-upper decomposition, the lower factor, L, becomes more filled in owing to the 'four corners of a box' rule; 15,445/89,401 = 17.3% of entries are non-zero. Finally, we see that only a fraction of the entries of Σ are required despite the fact that this matrix is actually dense; since Σ is symmetric, we only need to calculate 17.3% of it as well.

Table 3. (Experiment 2) Wall-clock time per iteration for tested algorithms.

algorithm label          wall-clock time per iteration [s]
MAP Newton               0.0262
ESGVI deriv M=2          0.0745
ESGVI deriv M=3          0.1836
ESGVI deriv-free M=4     0.2653

the 'ESGVI deriv M=3' method, but requires no analytical derivatives of the factors.

As in the one-dimensional simulation, the bias in the estimate is significantly reduced in the ESGVI approaches compared to the MAP approach (row 3 of Figure 5). This is important since this result can be achieved in a tractable way for large-scale problems and even without analytical derivatives. Note that we chose to show this metric separately for (left to right) robot position, robot velocity, and landmark position to avoid combining quantities with different units. We can see that the improvements offered by ESGVI are mostly due to the robot position variables, but robot velocity and landmark positions are also improved.

The ESGVI methods also do slightly better than MAP on squared error (row 4 of Figure 5). Again, we chose to show this metric separately for (left to right) robot position, robot velocity, and landmark position to avoid combining quantities with different units. We can see that the improvements offered by ESGVI are mostly on the robot position variables. For squared Mahalanobis distance / NEES (row 5 of Figure 5), all algorithms perform well.

Table 3 shows the wall-clock time per iteration for the different algorithms. Naturally, as we use more cubature points, the computational cost increases. The derivative-free version of ESGVI is about an order of magnitude slower than the MAP approach. We believe that there may be applications where this increased computational cost is worthwhile in terms of increased performance or the convenience of avoiding the calculation of derivatives.

6.3 Experiment 3: Robot Dataset

Finally, we consider a batch SLAM problem with a robot driving around and building a map of landmarks, as depicted in Figure 6. The robot is equipped with a laser rangefinder and wheel odometers and must estimate its own trajectory and the locations of a number of tubular landmarks. This dataset has been used previously by Barfoot et al. (2014) to test SLAM algorithms. Groundtruth for both the robot trajectory and landmark positions (this is a unique aspect of this dataset) is provided by a Vicon motion capture system. The whole dataset is 12,000 timesteps long, which we broke into six subsequences of 2000 timestamps; statistical performance reported below is an average over these six subsequences. We assume that the data association (i.e., which measurement corresponds to which landmark) is known in this experiment to restrict testing to the state estimation part of the problem.


Figure 5. (Experiment 2) Statistical results of 10,000 trials of the K-dimensional stereo camera simulation shown as standard boxplots. The different rows show different performance metrics for the different variants of our algorithm (columns). Table 2 provides details of the different algorithms tested. For the bias (row 3) and squared error (row 4) metrics, we show the results separately for (left to right) robot position, robot velocity, and landmark position to avoid combining quantities with different units. The number above an algorithm label is its mean performance on that metric. Further details are discussed in the text.


Figure 6. (Experiment 3) Setup: (left) a mobile robot navigates amongst a map of landmarks; it receives bearing measurements to some landmarks as well as wheel odometry. (right) The ground-truth path of the robot and landmark map as measured by an overhead camera system.

Figure 7. (Experiment 3) Factor graph for our robot dataset. White circles indicate variables and small black circles indicate factors involving variables.

The state to be estimated is

x = [x_0; x_1; ... ; x_K; m_1; ... ; m_L],  x_k = [x_k; y_k; θ_k; ẋ_k; ẏ_k; θ̇_k],  m_ℓ = [x_ℓ; y_ℓ], (91)

where x_k is a robot state and m_ℓ a landmark position. For each of our six subsequences we have K = 2000 and L = 17.

Figure 7 shows the factor graph for this experiment and Figure 8 shows the corresponding sparsity patterns. For the (linear) prior factor on the robot states we have

φ_k = (1/2) (x_0 − x̌_0)^T P^{-1} (x_0 − x̌_0) for k = 0, and
φ_k = (1/2) (x_k − A x_{k−1})^T Q^{-1} (x_k − A x_{k−1}) for k > 0, (92)

with

P = diag(σ_x^2, σ_y^2, σ_θ^2, σ_ẋ^2, σ_ẏ^2, σ_θ̇^2),  A = [1  T1; 0  1],
Q = [(1/3)T^3 Q_C  (1/2)T^2 Q_C; (1/2)T^2 Q_C  T Q_C],  Q_C = diag(Q_{C,1}, Q_{C,2}, Q_{C,3}), (93)

where T is the discrete-time sampling period, the Q_{C,i} are power spectral densities, and σ_x^2, σ_y^2, σ_θ^2, σ_ẋ^2, σ_ẏ^2, σ_θ̇^2 are variances. The robot state prior encourages constant velocity (Barfoot 2017, §3, p. 85). Unlike the previous experiment, we do not have a prior on the landmark positions; this was necessary when conducting a proper Bayesian evaluation in the previous experiment, but here we simply have a standard SLAM problem.

The (nonlinear) odometry factors, derived from wheel encoder measurements, are

ψ_k = (1/2) (v_k − C_k x_k)^T S^{-1} (v_k − C_k x_k), (94)


Figure 8. (Experiment 3) Sparsity patterns for the first 100 of 2000 timestamps of the robot dataset. The red partition linesseparate the robot state variables from the landmark variables. The inverse covariance, Σ−1, is highly sparse owing to the factorgraph pattern in Figure 7; only 11,636/401,956 = 2.9% of entries are nonzero. After performing a lower-diagional-upperdecomposition, the lower factor, L, becomes more filled in owing to the ‘four corners of a box’ rule; 20,590/401,956 = 5.1% ofentries are non-zero. Finally, we see that only a fraction of the entries of Σ are required despite the fact that this matrix is actuallydense; since Σ is symmetric, we only need to calculate 5.1% of it as well. For the full 2000-timestamp dataset the sparsity iseven more favourable for ESGVI, but the landmark part of the pattern becomes difficult to visualize due to its small size relativeto the trajectory part.

where

v_k = [u_k; v_k; ω_k],  C_k = [0 0 0 cos θ_k  sin θ_k 0; 0 0 0 −sin θ_k  cos θ_k 0; 0 0 0 0 0 1],  S = diag(σ_u^2, σ_v^2, σ_ω^2). (95)

The vector v_k consists of measured forward, lateral, and rotational speeds in the robot frame, derived from wheel encoders; we set the lateral speed measurement, v_k, to zero, which enforces the nonholonomy of the wheels as a soft constraint. The σ_u^2, σ_v^2, and σ_ω^2 are measurement noise variances.

The (nonlinear) bearing measurement factors, derived from a laser rangefinder, are

ψ_{ℓ,k} = (1/2) (β_{ℓ,k} − g(m_ℓ, x_k))^2/σ_r^2, (96)

with

g(m_ℓ, x_k) = atan2(y_ℓ − y_k − d sin θ_k, x_ℓ − x_k − d cos θ_k) − θ_k, (97)

where β_{ℓ,k} is a bearing measurement from the kth robot pose to the ℓth landmark, d is the offset of the laser rangefinder from the robot center in the longitudinal direction, and σ_r^2 is the measurement noise variance. Although the dataset provides range to the landmarks as well, we chose to neglect these measurements to accentuate the differences between the various algorithms. Our setup is similar to a monocular camera situation, which is known to be a challenging SLAM problem.
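A small Python sketch of this bearing factor (our own illustration; wrapping the angular residual to (−π, π] is an implementation detail we have assumed rather than one stated in the text):

```python
import numpy as np

def bearing_model(m, x, d):
    """g(m_l, x_k) of (97); x holds (x_k, y_k, theta_k, ...) and m holds (x_l, y_l)."""
    xk, yk, th = x[0], x[1], x[2]
    return np.arctan2(m[1] - yk - d * np.sin(th),
                      m[0] - xk - d * np.cos(th)) - th

def bearing_factor(beta, m, x, d, sig2_r):
    """Negative-log-likelihood factor psi_{l,k} of (96)."""
    r = beta - bearing_model(m, x, d)
    r = np.arctan2(np.sin(r), np.cos(r))   # wrap the residual to (-pi, pi] for robustness
    return 0.5 * r ** 2 / sig2_r
```

This scalar factor is exactly what would be handed to the derivative-free machinery: either reweighting sigmapoints of the joint (pose, landmark) marginal as in (54), or, for ESGVI-GN, supplying the error β_{ℓ,k} − g(m_ℓ, x_k) whose statistical Jacobian is computed via (72).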

Putting these together, our joint state/data likelihood in this case is of the form

− ln p(x, z) = Σ_{k=0}^{K} φ_k + Σ_{k=0}^{K} ψ_k + Σ_{k=1}^{K} Σ_{ℓ=1}^{L} ψ_{ℓ,k} + constant, (98)

where it is understood that not all L = 17 landmarks are actually seen at each timestep and thus we must remove the factors for unseen landmarks.

By using only bearing measurements, this proved to be a challenging dataset. We initialized our landmark locations using the bearing-only Random Sample And Consensus (RANSAC) (Fischler and Bolles 1981) strategy described by McGarey et al. (2017). We attempted to initialize our robot states using only the wheel odometry information, but this proved too difficult for methods making use of the full Hessian (i.e., Newton's method). To remedy this problem, we used wheel odometry to initialize Gauss-Newton and then used this to initialize Newton's method. Specifically, we used MAP Gauss-Newton to initialize MAP Newton and ESGVI-GN to initialize ESGVI. To compare our results to groundtruth, we aligned the resulting landmark map to the groundtruth map, since it is well established that SLAM


Figure 9. (Experiment 3) Statistical results of the robot dataset shown as standard boxplots. The different rows show different performance metrics for the different variants of our algorithm (columns). Table 2 provides details of the different algorithms tested. For the bias (row 3) and squared error (row 4) metrics, we show the results separately for (left to right) robot position, robot orientation, and landmark position to avoid combining quantities with different units. The number above an algorithm label is its mean performance on that metric, averaged over all 2000 timestamps and six subsequences. Further details are discussed in the text.


Table 4. (Experiment 3) Wall-clock time per iteration for tested algorithms (averages over the six sequences).

algorithm label            wall-clock time per iteration [s]
MAP Newton                 5.8079
MAP GN                     1.2035
ESGVI deriv M=3            80.2644
ESGVI-GN deriv-free M=3    41.4672
ESGVI deriv-free M=4       164.0719

produces a relative solution; reported errors are calculated after this alignment.

Figure 9 provides the statistical results of several variants of our ESGVI algorithms. We see that the number of iterations required to converge is higher than in the previous experiments, with the ESGVI variants requiring a few more than the corresponding MAP algorithms. Again, we see the ESGVI variants reducing the loss functional, V(q), further than the MAP methods.

The bias is further away from zero for the derivative-free ESGVI method than for MAP, which could simply be related to the relatively small number of trials compared to the previous two experiments. Note, as in the previous experiment, we show the bias separately for (left to right) robot position, robot orientation, and landmark position to avoid combining quantities with different units. We do not have groundtruth for velocity, so we do not report errors for this part of the state.

The squared error and squared Mahalanobis distance metrics are drastically improved for the full ESGVI methods compared to the MAP method and even the ESGVI-GN method. Again, we show the squared error separately for (left to right) robot position, robot orientation, and landmark position to avoid combining quantities with different units. We can see that the improvements offered by ESGVI on squared error are due to the robot and landmark position variables rather than robot orientation.

Figure 10 shows the detailed error plots for one of the six subsequences for the 'MAP GN' and 'ESGVI deriv-free M=4' algorithms. The ESGVI path is visibly better than MAP in most sections. MAP seems to have underestimated the scale of the whole solution, resulting in much worse performance on all translational variables, while performing similarly on heading error. Both algorithms are fairly consistent, with ESGVI being both more confident and more consistent. The other five subsequences have similar results.

Figure 8 shows the sparsity patterns of Σ^{-1} and L, as well as the blocks of Σ that are computed (the matrix is actually dense); only the patterns for the first 100 timestamps are shown for clarity, but each subsequence is actually 2000 timestamps long. In terms of computational complexity, all of the algorithms for this SLAM problem are O(L^3 + L^2 K) per iteration, where L is the number of landmarks and K is the number of timesteps. However, the wall-clock time required by the different algorithms varies significantly due to different numbers of iterations and the accuracy with which the required expectations in (47) are computed. Table 4 reports how long each algorithm took per iteration; the ESGVI methods come at a cost, but this may be acceptable for batch (i.e., offline) applications. It is also worth noting that we have made little attempt to optimize our implementation. We used a brute-force cubature method requiring M^{N_k} sigmapoints, where N_k is the number of state variables involved in a factor. More efficient options could be swapped in to speed up the evaluation of each factor. Additionally, parallelization could be employed at the factor level quite easily in our approach by evaluating the expectations in (47) in parallel across several compute elements.

7 Conclusion and Future Work

We presented our Exactly Sparse Gaussian Variational Inference (ESGVI) approach and demonstrated that it is possible to compute a Gaussian that is 'best' in terms of KL divergence from the full Bayesian posterior, even for large-scale problems. We exploited the fact that the joint likelihood of the state and data factors, a property of most common robotics problems, to show that the full (dense) covariance matrix is not required, only the blocks corresponding to the non-zero blocks of the (sparse) inverse covariance matrix. We further showed how to apply cubature methods (e.g., sigmapoints) within our framework, resulting in a batch inference scheme that does not require analytical derivatives, yet is applicable to large-scale problems. The methods offer performance improvements (over MAP) that increase as the problem becomes more nonlinear and/or the posterior less concentrated.

There are several avenues for further exploration beyond this work. First, sample-efficient cubature methods could bring the cost of our scheme down further. While we showed that we only need to apply cubature at the factor/marginal level, this can still be expensive for marginals involving several state variables. We used a brute-force approach requiring M^{N_k} samples for a marginal of dimension N_k, but there may be other alternatives that could be applied to bring the cost down. Parallel evaluation of the factor expectations could also be worth investigating.

We used the method of Takahashi et al. (1973) to compute the blocks of Σ corresponding to the non-zero blocks of Σ^{-1}, but this basic method is not always optimally efficient, requiring additional (unnecessary) blocks of Σ to be computed for some GVI problems. It should be possible to combine this with additional modern sparsity methods such as variable reordering and Givens


Figure 10. (Experiment 3) A comparison of 'MAP GN' and 'ESGVI deriv-free M=4' on one of the six subsequences of 2000 timestamps. Above, we see the individual error plots (with 3σ covariance envelopes) for the x, y, and θ components of the robot state as compared to groundtruth. Below, we have an overhead view of the robot path and landmark map for the two algorithms as well as groundtruth. The total landmark squared error for 'MAP GN' was 0.2214 [m^2] while for 'ESGVI deriv-free M=4' it was 0.0168 [m^2].


rotations (Golub and Van Loan 1996; Kaess et al. 2008) to improve the efficiency of this step.

Our SLAM experiments showed that we could carry out GVI in a tractable manner. However, we have not yet shown that our approach is robust to outliers. It would certainly be worth attempting to wrap each factor expression in a robust cost function to enable a variational extension of M-estimation (Barfoot 2017, §5, p. 163; MacTavish and Barfoot 2015). This is typically implemented as iteratively reweighted least squares (Holland and Welsch 1977), but ESGVI might handle robust cost functions with no modification since we compute expectations at the factor level.

There are many other practical applications of ESGVI worth exploring beyond the simple cases presented here. We are particularly interested in how to apply our inference approach to joint estimation-control problems and have begun an investigation along this line.

We restricted our variational estimate to be a single multivariate Gaussian, but the ideas here will likely extend to mixtures of Gaussians and possibly other approximations of the posterior as well. We have not explored this possibility yet, but the variational approach seems to offer a logical avenue along which to do so.

Finally, we showed the possibility of extending ESGVI to include parameter estimation through the use of an EM setup, which typically employs the same loss functional, V(q|θ). In particular, we would like to represent our factor models as Deep Neural Networks (DNNs) whose weights are unknown. We believe that ESGVI offers a good option for the expectation step, as we may be able to use our derivative-free version to avoid the need to take the derivative of a DNN with respect to the state being estimated, instead just carrying out hardware-accelerated feed-forward evaluations.

Acknowledgements

We are grateful to Filip Tronarp (Aalto University) as well as Paul Furgale, Colin McManus, and Chi Hay Tong (while they were at the University of Toronto) for some early discussions on this work. This work was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Abdulle A and Wanner G (2002) 200 years of least squares method. Elemente der Mathematik 57: 45–60.

Ala-Luhtala J, Sarkka S and Piche R (2015) Gaussian filtering and variational approximations for bayesian smoothing in continuous-discrete stochastic dynamic systems. Signal Processing 111: 124–136.

Amari SI (1998) Natural gradient works efficiently in learning. Neural computation 10(2): 251–276.

Anderson S, Barfoot TD, Tong CH and Sarkka S (2015) Batch nonlinear continuous-time trajectory estimation as exactly sparse gaussian process regression. Autonomous Robots 39(3): 221–238.

Arasaratnam I and Haykin S (2009) Cubature kalman filters. IEEE Transactions on Automatic Control 54(6): 1254–1269.

Barfoot TD (2017) State Estimation for Robotics. Cambridge University Press.

Barfoot TD (2020) Multivariate gaussian variational inference by natural gradient descent. Technical report, Autonomous Space Robotics Lab, University of Toronto. arXiv:2001.10025 [stat.ML].

Barfoot TD, Tong CH and Sarkka S (2014) Batch continuous-time trajectory estimation as exactly sparse gaussian process regression. In: Proceedings of Robotics: Science and Systems (RSS). Berkeley, USA.

Bayes T (1764) Essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London.

Bishop CM (2006) Pattern Recognition and Machine Learning. Springer.

Bjorck A (1996) Numerical Methods for Least Squares Problems. Society for Industrial and Applied Mathematics (SIAM).

Bourmaud G (2016) Online variational bayesian motion averaging. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 126–142.

Broussolle F (1978) State estimation in power systems: detecting bad data through the sparse inverse matrix method. IEEE Transactions on Power Apparatus and Systems (3): 678–682.

Brown DC (1958) A solution to the general problem of multiple station analytical stereotriangulation. RCA-MTP Data Reduction Technical Report No. 43 (or AFMTC TR 58-8), Patrick Airforce Base, Florida.

Cools R (1997) Constructing cubature formulae: The science behind the art. Acta Numerica 6: 1–54.

Davison AJ and Ortiz J (2019) Futuremapping 2: Gaussian belief propagation for spatial ai.

Dong J, Mukadam M, Dellaert F and Boots B (2016) Motion planning as probabilistic inference using gaussian processes and factor graphs. In: Robotics: Science and Systems, volume 12. p. 4.

Durrant-Whyte H and Bailey T (2006) Simultaneous localisation and mapping (SLAM): Part I the essential algorithms. IEEE Robotics and Automation Magazine 11(3): 99–110.

Erisman AM and Tinney WF (1975) On computing certain elements of the inverse of a sparse matrix. Communications of the ACM 18(3): 177–179.

Fischler M and Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of ACM 24(6): 381–395.


Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222(594-604): 309–368.

García-Fernández AF, Svensson L, Morelande MR and Sarkka S (2015) Posterior linearization filter: Principles and implementation using sigma points. IEEE transactions on signal processing 63(20): 5561–5573.

Gasperin M and Juricic D (2011) Application of unscented transformation in nonlinear system identification. IFAC Proceedings Volumes 44(1): 4428–4433.

Gauss C (1809) Theoria motus corporum coelestium, perthes et besser, hamburgi 7.

Gauss C (1821) Theoria combinationis observationum erroribus minimis obnoxiae, pars prior. Werke, IV, Koniglichen Gesellschaft der Wissenschaften zu Gottingen: 1–26.

Gauss C (1823) Theoria combinationis observationum erroribus minimis obnoxiae, pars posterior. Werke, IV, Koniglichen Gesellschaft der Wissenschaften zu Gottingen: 27–53.

Ghahramani Z and Roweis ST (1999) Learning nonlinear dynamical systems using an em algorithm. In: Advances in neural information processing systems. pp. 431–437.

Golub H and Van Loan CF (1996) Matrix computations, johns hopkins uni. Press, London.

Hoffman MD, Blei DM, Wang C and Paisley J (2013) Stochastic variational inference. The Journal of Machine Learning Research 14(1): 1303–1347.

Holland PW and Welsch RE (1977) Robust regression using iteratively reweighted least-squares. Communications in Statistics – Theory and Methods 6(9): 813–827.

Ito K and Xiong K (2000) Gaussian filters for nonlinear filtering problems. IEEE Transactions on Automatic Control 45(5): 910–927.

Jazwinski AH (1970) Stochastic Processes and Filtering Theory. Academic, New York.

Jensen JLWV (1906) Sur les fonctions convexes et les inegalites entre les valeurs moyennes. Acta mathematica 30: 175–193.

Jia B, Xin M and Cheng Y (2013) High-degree cubature kalman filter. Automatica 49: 510–518.

Julier S and Uhlmann J (1996) A general method for approximating nonlinear transformations of probability distributions. Technical report, Robotics Research Group, University of Oxford.

Kaess M and Dellaert F (2009) Covariance recovery from a square root information matrix for data association. Robotics and autonomous systems 57(12): 1198–1210.

Kaess M, Johannsson H, Roberts R, Ila V, Leonard J and Dellaert F (2011) isam2: Incremental smoothing and mapping with fluid relinearization and incremental variable reordering. In: 2011 IEEE International Conference on Robotics and Automation. pp. 3281–3288.

Kaess M, Ranganathan A and Dellaert F (2008) isam: Incremental smoothing and mapping. IEEE Transactions on Robotics 24(6): 1365–1378.

Kalman RE (1960) A new approach to linear filtering and prediction problems. Trans. ASME, Journal of Basic Engineering 82: 35–45.

Kokkala J, Solin A and Sarkka S (2014) Expectation maximization based parameter estimation by sigma-point and particle smoothing. In: 17th International Conference on Information Fusion (FUSION). IEEE, pp. 1–8.

Kokkala J, Solin A and Sarkka S (2016) Sigma-point filtering and smoothing based parameter estimation in nonlinear dynamic systems. Journal of Advances in Information Fusion 11(1): 15–30.

Kullback S and Leibler RA (1951) On information and sufficiency. The annals of mathematical statistics 22(1): 79–86.

Li R, Hwangbo J, Chen Y and Di K (2011) Rigorous photogrammetric processing of HiRISE stereo imagery for mars topographic mapping. IEEE Transactions on Geoscience and Remote Sensing 49(7): 2558–2572.

Lu F and Milios E (1997) Globally consistent range scan alignment for environment mapping. Autonomous robots 4(4): 333–349.

MacTavish KA and Barfoot TD (2015) At all costs: A comparison of robust cost functions for camera correspondence outliers. In: Proceedings of the 12th Conference on Computer and Robot Vision (CRV). Halifax, Canada, pp. 62–69. DOI: 10.1109/CRV.2015.52.

Magnus JR and Neudecker H (2019) Matrix differential calculus with applications in statistics and econometrics. John Wiley & Sons.

Markoff A (1912) Wahrscheinlichheitsrechnung. Trans., Leipzig.

McGarey P, MacTavish KA, Pomerleau F and Barfoot TD (2017) Tslam: Tethered simultaneous localization and mapping for mobile robots. International Journal of Robotics Research (IJRR) 36(12): 1363–1386. DOI: 10.1177/0278364917732639.

Meurant G (1992) A review on the inverse of symmetric tridiagonal and block tridiagonal matrices. SIAM Journal of Matrix Analysis and Applications 13(3): 707–728.

Mukadam M, Dong J, Yan X, Dellaert F and Boots B (2018) Continuous-time gaussian process motion planning via probabilistic inference. The International Journal of Robotics Research 37(11): 1319–1340.

Neal RM and Hinton GE (1998) A view of the em algorithm that justifies incremental, sparse, and other variants. In: Learning in graphical models. Springer, pp. 355–368.

Nocedal J and Wright SJ (2006) Numerical Optimization. 2nd edition. Springer.

O'Hagan A (1991) Bayes–hermite quadrature. Journal of statistical planning and inference 29(3): 245–260.


Opper M and Archambeau C (2009) The variational Gaussian approximation revisited. Neural Computation 21(3): 786–792.

Ortega J and Rheinboldt W (1970) Iterative Solution of Nonlinear Equations in Several Variables. Academic Press.

Park E, Park SY and Choi KH (2009) Performance comparison of the batch filter based on the unscented transformation and other batch filters for satellite orbit determination. Journal of Astronomy and Space Sciences 26(1): 75–88.

Pradeep V, Konolige K and Berger E (2014) Calibrating a multi-arm multi-sensor robot: A bundle adjustment approach. In: Experimental Robotics. Springer, pp. 211–225.

Press WH, Teukolsky SA, Vetterling WT and Flannery BP (2007) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.

Ranganathan A, Kaess M and Dellaert F (2007) Loopy SAM. In: Proceedings of the Intl. Joint Conf. on Artificial Intelligence (IJCAI). pp. 2191–2196.

Rasmussen CE and Williams CKI (2006) Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

Rauch HE, Tung F and Striebel CT (1965) Maximum likelihood estimates of linear dynamic systems. AIAA Journal 3(8): 1445–1450.

Roh KM, Park SY, Park ES and Choi KH (2007) A batch filter based on the unscented transformation and its applications to attitude estimation. Advances in the Astronautical Sciences 127(1): 19–40.

Sarkka S (2013) Bayesian Filtering and Smoothing. Cambridge University Press.

Sarkka S, Hartikainen J, Svensson L and Sandblom F (2016) On the relation between Gaussian process quadratures and sigma-point methods. Journal of Advances in Information Fusion 11(1): 31–46.

Sarmavuori J and Sarkka S (2012) Fourier–Hermite Kalman filter. IEEE Transactions on Automatic Control 57(6): 1511–1515.

Schon TB, Wills A and Ninness B (2011) System identification of nonlinear state-space models. Automatica 47(1): 39–49.

Sibley G, Sukhatme G and Matthies L (2006) The iterated sigma point Kalman filter with applications to long-range stereo. In: Proceedings of Robotics: Science and Systems. Philadelphia, USA.

Stein CM (1981) Estimation of the mean of a multivariate normal distribution. Annals of Statistics 9(6): 1135–1151.

Takahashi K, Fagan J and Chen MS (1973) A sparse bus impedance matrix and its application to short circuit study. In: Proceedings of the PICA Conference.

Thrun S, Burgard W and Fox D (2006) Probabilistic Robotics. MIT Press.

Thrun S, Liu Y, Koller D, Ng AY, Ghahramani Z and Durrant-Whyte H (2004) Simultaneous localization and mapping with sparse extended information filters. The International Journal of Robotics Research 23(7-8): 693–716.

Thrun S and Montemerlo M (2005) The GraphSLAM algorithm with applications to large-scale mapping of urban structures. International Journal on Robotics Research 25(5/6): 403–430.

Triggs W, McLauchlan P, Hartley R and Fitzgibbon A (2000) Bundle adjustment: A modern synthesis. In: Triggs W, Zisserman A and Szeliski R (eds.) Vision Algorithms: Theory and Practice, LNCS. Springer Verlag, pp. 298–375.

Walter MR, Eustice RM and Leonard JJ (2007) Exactly sparse extended information filters for feature-based SLAM. The International Journal of Robotics Research 26(4): 335–359.

Wu Y, Hu D, Wu M and Hu X (2006) A numerical-integration perspective on Gaussian filters. IEEE Transactions on Signal Processing 54(8): 2910–2921.

A Definiteness of tr(ABAB)

We used the fact that

\[
\mathrm{tr}\bigl(\boldsymbol{\Sigma}^{(i)}\,\delta\boldsymbol{\Sigma}^{-1}\,\boldsymbol{\Sigma}^{(i)}\,\delta\boldsymbol{\Sigma}^{-1}\bigr) \geq 0, \qquad (99)
\]

with equality if and only if \(\delta\boldsymbol{\Sigma}^{-1} = \mathbf{0}\), in our local convergence guarantee in (16). To show this, it is sufficient to show, for \(\mathbf{A}\) real symmetric positive definite and \(\mathbf{B}\) real symmetric, that

\[
\mathrm{tr}(\mathbf{A}\mathbf{B}\mathbf{A}\mathbf{B}) \geq 0, \qquad (100)
\]

with equality if and only if \(\mathbf{B} = \mathbf{0}\). We can write

\[
\mathrm{tr}(\mathbf{A}\mathbf{B}\mathbf{A}\mathbf{B}) = \mathrm{vec}(\mathbf{A}\mathbf{B}\mathbf{A})^T \mathrm{vec}(\mathbf{B}) = \mathrm{vec}(\mathbf{B})^T \underbrace{(\mathbf{A} \otimes \mathbf{A})}_{>\,0} \mathrm{vec}(\mathbf{B}) \geq 0, \qquad (101)
\]

using basic properties of \(\mathrm{vec}(\cdot)\) and the Kronecker product, \(\otimes\). The matrix in the middle is symmetric positive definite owing to our assumptions on \(\mathbf{A}\). Therefore the quadratic form is positive semi-definite, with equality if and only if \(\mathrm{vec}(\mathbf{B}) = \mathbf{0}\), which holds if and only if \(\mathbf{B} = \mathbf{0}\).
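Since the argument reduces to a standard vec/Kronecker manipulation, it can be spot-checked numerically. The short NumPy sketch below (not part of the paper; the matrix size and seed are arbitrary choices) draws a random symmetric positive-definite A and a random symmetric B and confirms that tr(ABAB) matches the quadratic form in (101) and is non-negative.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 5

L = rng.standard_normal((n, n))
A = L @ L.T + n * np.eye(n)          # real symmetric positive definite
B = rng.standard_normal((n, n))
B = 0.5 * (B + B.T)                  # real symmetric (possibly indefinite)

lhs = np.trace(A @ B @ A @ B)
vecB = B.flatten(order='F')          # column-stacking vec(.)
rhs = vecB @ np.kron(A, A) @ vecB    # vec(B)^T (A kron A) vec(B), as in (101)

assert np.isclose(lhs, rhs) and lhs >= 0.0
print(lhs, rhs)
\end{verbatim}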

B Fisher Information Matrix for a Multivariate Gaussian

This section provides a brief derivation of the Fisher Information Matrix (FIM) associated with our Gaussian, \(q = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). Magnus and Neudecker (2019) or Barfoot (2020) provide more detail. If we stack our variational parameters into a vector as

\[
\boldsymbol{\alpha} = \begin{bmatrix} \boldsymbol{\mu} \\ \mathrm{vec}\bigl(\boldsymbol{\Sigma}^{-1}\bigr) \end{bmatrix}, \qquad (102)
\]

then we seek to show that the FIM is

\[
\boldsymbol{\mathcal{I}}_{\boldsymbol{\alpha}} = \begin{bmatrix} \boldsymbol{\mathcal{I}}_{\boldsymbol{\mu}} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\mathcal{I}}_{\boldsymbol{\Sigma}^{-1}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}^{-1} & \mathbf{0} \\ \mathbf{0} & \frac{1}{2}\,(\boldsymbol{\Sigma} \otimes \boldsymbol{\Sigma}) \end{bmatrix}. \qquad (103)
\]


For a Gaussian, we can use the following FIM definition (Fisher 1922):

\[
\boldsymbol{\mathcal{I}}_{\boldsymbol{\alpha}} = -\mathbb{E}_q\!\left[\frac{\partial^2 \ln q}{\partial \boldsymbol{\alpha}^T \partial \boldsymbol{\alpha}}\right]. \qquad (104)
\]

The negative log-likelihood of a Gaussian is

\[
-\ln q = \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) - \frac{1}{2}\ln\bigl|\boldsymbol{\Sigma}^{-1}\bigr| + \text{constant}. \qquad (105)
\]

The first differential is

\[
-\mathrm{d}\ln q = -\mathrm{d}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) - \frac{1}{2}\mathrm{tr}\bigl(\boldsymbol{\Sigma}\,\mathrm{d}\boldsymbol{\Sigma}^{-1}\bigr) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathrm{d}\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}). \qquad (106)
\]

The second differential is

\[
-\mathrm{d}^2\ln q = \mathrm{d}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1}\,\mathrm{d}\boldsymbol{\mu} - 2\,\mathrm{d}\boldsymbol{\mu}^T\,\mathrm{d}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) + \frac{1}{2}\mathrm{tr}\Bigl(\bigl((\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T - \boldsymbol{\Sigma}\bigr)\,\mathrm{d}^2\boldsymbol{\Sigma}^{-1}\Bigr) + \frac{1}{2}\mathrm{tr}\bigl(\boldsymbol{\Sigma}\,\mathrm{d}\boldsymbol{\Sigma}^{-1}\,\boldsymbol{\Sigma}\,\mathrm{d}\boldsymbol{\Sigma}^{-1}\bigr). \qquad (107)
\]

The expected value of the second differential over \(q\) is

\[
-\mathbb{E}_q\bigl[\mathrm{d}^2\ln q\bigr] = \mathrm{d}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1}\,\mathrm{d}\boldsymbol{\mu} + \frac{1}{2}\mathrm{tr}\bigl(\boldsymbol{\Sigma}\,\mathrm{d}\boldsymbol{\Sigma}^{-1}\,\boldsymbol{\Sigma}\,\mathrm{d}\boldsymbol{\Sigma}^{-1}\bigr) = \mathrm{d}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1}\,\mathrm{d}\boldsymbol{\mu} + \frac{1}{2}\mathrm{vec}\bigl(\mathrm{d}\boldsymbol{\Sigma}^{-1}\bigr)^T (\boldsymbol{\Sigma} \otimes \boldsymbol{\Sigma})\,\mathrm{vec}\bigl(\mathrm{d}\boldsymbol{\Sigma}^{-1}\bigr), \qquad (108)
\]

where the cross term and the \(\mathrm{d}^2\boldsymbol{\Sigma}^{-1}\) term vanish because \(\mathbb{E}_q[\mathbf{x} - \boldsymbol{\mu}] = \mathbf{0}\) and \(\mathbb{E}_q[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \boldsymbol{\Sigma}\), and the last expression uses the same vec/Kronecker identity as in Appendix A. In matrix form this is

\[
-\mathbb{E}_q\bigl[\mathrm{d}^2\ln q\bigr] = \mathrm{d}\boldsymbol{\alpha}^T \begin{bmatrix} \boldsymbol{\Sigma}^{-1} & \mathbf{0} \\ \mathbf{0} & \frac{1}{2}(\boldsymbol{\Sigma} \otimes \boldsymbol{\Sigma}) \end{bmatrix} \mathrm{d}\boldsymbol{\alpha}, \qquad (109)
\]

where

\[
\mathrm{d}\boldsymbol{\alpha} = \begin{bmatrix} \mathrm{d}\boldsymbol{\mu} \\ \mathrm{vec}\bigl(\mathrm{d}\boldsymbol{\Sigma}^{-1}\bigr) \end{bmatrix}. \qquad (110)
\]

Turning the differentials into partial derivatives, we have

\[
\boldsymbol{\mathcal{I}}_{\boldsymbol{\alpha}} = -\mathbb{E}_q\!\left[\frac{\partial^2 \ln q}{\partial \boldsymbol{\alpha}^T \partial \boldsymbol{\alpha}}\right] = \begin{bmatrix} \boldsymbol{\Sigma}^{-1} & \mathbf{0} \\ \mathbf{0} & \frac{1}{2}(\boldsymbol{\Sigma} \otimes \boldsymbol{\Sigma}) \end{bmatrix}. \qquad (111)
\]

The inverse FIM is then

\[
\boldsymbol{\mathcal{I}}_{\boldsymbol{\alpha}}^{-1} = \begin{bmatrix} \boldsymbol{\mathcal{I}}_{\boldsymbol{\mu}}^{-1} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\mathcal{I}}_{\boldsymbol{\Sigma}^{-1}}^{-1} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma} & \mathbf{0} \\ \mathbf{0} & 2\bigl(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{\Sigma}^{-1}\bigr) \end{bmatrix}, \qquad (112)
\]

using properties of linear and Kronecker algebra.
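Because every block of (103) and (112) is available in closed form, the result is easy to verify numerically. The NumPy sketch below is an illustration only (dimensions, seed, and step size are arbitrary choices, not from the paper): it checks the trace/Kronecker identity used in (108), the block inverses in (112), and the \(\boldsymbol{\mathcal{I}}_{\boldsymbol{\mu}}\) block by finite differences of the negative log-likelihood.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 3

L = rng.standard_normal((n, n))
Sigma = L @ L.T + n * np.eye(n)           # covariance (symmetric positive definite)
Sigma_inv = np.linalg.inv(Sigma)
mu = rng.standard_normal(n)

# (i) Trace/Kronecker identity used in (108).
dSi = rng.standard_normal((n, n)); dSi = 0.5 * (dSi + dSi.T)
v = dSi.flatten(order='F')                # column-stacking vec(.)
assert np.isclose(np.trace(Sigma @ dSi @ Sigma @ dSi),
                  v @ np.kron(Sigma, Sigma) @ v)

# (ii) Block inverses in (112): inv(0.5 (Sigma kron Sigma)) = 2 (Sigma^-1 kron Sigma^-1).
assert np.allclose(np.linalg.inv(0.5 * np.kron(Sigma, Sigma)),
                   2.0 * np.kron(Sigma_inv, Sigma_inv))

# (iii) I_mu block of (103): the Hessian of -ln q with respect to mu is
#       Sigma^{-1} (independent of x), checked here by finite differences.
def neg_ln_q(m, x):
    e = x - m
    return 0.5 * e @ Sigma_inv @ e - 0.5 * np.log(np.linalg.det(Sigma_inv))

x = rng.standard_normal(n)
h = 1e-4
H = np.zeros((n, n))
I = np.eye(n)
for i in range(n):
    for j in range(n):
        H[i, j] = (neg_ln_q(mu + h*I[i] + h*I[j], x) - neg_ln_q(mu + h*I[i], x)
                   - neg_ln_q(mu + h*I[j], x) + neg_ln_q(mu, x)) / h**2
assert np.allclose(H, Sigma_inv, atol=1e-4)
print("FIM block checks passed")
\end{verbatim}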

[Figure 11 near here: a commutative diagram with nodes V, V_0 + \(\sum_k\) V_k, GVI, ESGVI, GVI (derivative-free), and ESGVI (derivative-free); the edges are labelled 'exploit factorization', 'minimize', and "Stein's lemma".]

Figure 11. Commutative diagram showing two paths to get to the derivative-free version of ESGVI starting from the loss functional, V.

C Direct Derivation of Derivative-Free ESGVI

In our main ESGVI derivation, we chose to (i) define a scheme to minimize \(V(q)\) with respect to \(q\) and then (ii) exploit \(\phi(\mathbf{x}) = \sum_{k=1}^K \phi_k(\mathbf{x}_k)\) to make the scheme efficient. Here we show that we can carry out (i) and (ii) in the opposite order to streamline the derivation of the derivative-free version of ESGVI, avoiding the need for Stein's lemma.

Figure 11 shows two paths to get to the derivative-free version of ESGVI starting from our loss functional, \(V\). The main body of the paper took the counter-clockwise path that goes down, across to the right, and then up. We will now demonstrate the clockwise path that goes right and then down.

Inserting \(\phi(\mathbf{x}) = \sum_{k=1}^K \phi_k(\mathbf{x}_k)\), the loss functional becomes

\[
V(q) = \sum_{k=1}^K \underbrace{\mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)]}_{V_k(q_k)} + \underbrace{\frac{1}{2}\ln\bigl(|\boldsymbol{\Sigma}^{-1}|\bigr)}_{V_0}, \qquad (113)
\]

where the expectations are now over the marginals, \(q_k(\mathbf{x}_k) = \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_{kk})\), with

\[
\boldsymbol{\mu}_k = \mathbf{P}_k \boldsymbol{\mu}, \qquad \boldsymbol{\Sigma}_{kk} = \mathbf{P}_k \boldsymbol{\Sigma} \mathbf{P}_k^T, \qquad (114)
\]

and where \(\mathbf{P}_k\) is a projection matrix. We then take the first derivative with respect to \(\boldsymbol{\mu}\):

\[
\begin{aligned}
\frac{\partial V(q)}{\partial \boldsymbol{\mu}^T} &= \frac{\partial}{\partial \boldsymbol{\mu}^T}\left(\sum_{k=1}^K \mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)] + \frac{1}{2}\ln\bigl(|\boldsymbol{\Sigma}^{-1}|\bigr)\right) \\
&= \sum_{k=1}^K \frac{\partial}{\partial \boldsymbol{\mu}^T}\,\mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)] \\
&= \sum_{k=1}^K \mathbf{P}_k^T \frac{\partial}{\partial \boldsymbol{\mu}_k^T}\,\mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)] \\
&= \sum_{k=1}^K \mathbf{P}_k^T \boldsymbol{\Sigma}_{kk}^{-1}\underbrace{\mathbb{E}_{q_k}[(\mathbf{x}_k - \boldsymbol{\mu}_k)\,\phi_k(\mathbf{x}_k)]}_{\text{cubature}}. \qquad (115)
\end{aligned}
\]


The last step comes from (10a) applied at the marginal level. Similarly, for the second derivative with respect to \(\boldsymbol{\mu}\), we have

\[
\begin{aligned}
\frac{\partial^2 V(q)}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}} &= \frac{\partial^2}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}}\left(\sum_{k=1}^K \mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)] + \frac{1}{2}\ln\bigl(|\boldsymbol{\Sigma}^{-1}|\bigr)\right) \\
&= \sum_{k=1}^K \frac{\partial^2}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}}\,\mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)] \\
&= \sum_{k=1}^K \mathbf{P}_k^T \left(\frac{\partial^2}{\partial \boldsymbol{\mu}_k^T \partial \boldsymbol{\mu}_k}\,\mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)]\right)\mathbf{P}_k \\
&= \sum_{k=1}^K \mathbf{P}_k^T \left(\boldsymbol{\Sigma}_{kk}^{-1}\underbrace{\mathbb{E}_{q_k}\bigl[(\mathbf{x}_k - \boldsymbol{\mu}_k)(\mathbf{x}_k - \boldsymbol{\mu}_k)^T \phi_k(\mathbf{x}_k)\bigr]}_{\text{cubature}}\,\boldsymbol{\Sigma}_{kk}^{-1} - \boldsymbol{\Sigma}_{kk}^{-1}\underbrace{\mathbb{E}_{q_k}[\phi_k(\mathbf{x}_k)]}_{\text{cubature}}\right)\mathbf{P}_k, \qquad (116)
\end{aligned}
\]

where we made use of (10b) at the marginal level in the last step.

The update scheme is then

\[
\begin{aligned}
\bigl(\boldsymbol{\Sigma}^{-1}\bigr)^{(i+1)} &= \left.\frac{\partial^2 V(q)}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}}\right|_{q^{(i)}}, && (117a)\\
\bigl(\boldsymbol{\Sigma}^{-1}\bigr)^{(i+1)} \delta\boldsymbol{\mu} &= -\left.\frac{\partial V(q)}{\partial \boldsymbol{\mu}^T}\right|_{q^{(i)}}, && (117b)\\
\boldsymbol{\mu}^{(i+1)} &= \boldsymbol{\mu}^{(i)} + \delta\boldsymbol{\mu}, && (117c)
\end{aligned}
\]

which is identical to the derivative-free ESGVI approach from the main body of the paper. The sparsity of the left-hand side comes from the use of the projection matrices. We did not require the use of Stein's lemma (Stein 1981) in this derivation, which also means that we have made no assumptions about the differentiability of \(\phi_k(\mathbf{x}_k)\) with respect to \(\mathbf{x}_k\). A similar streamlined derivation can be worked out for the Gauss-Newton variant.
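To make the clockwise path concrete, the following Python sketch implements one iteration of the update scheme (117), approximating the marginal expectations in (115) and (116) with a third-degree spherical-cubature (sigma-point) rule. It is an illustrative sketch only: the function and variable names (cubature_points, esgvi_iteration, factors) are not from the paper, the dense matrix inverses stand in for the sparse solvers discussed in the main body, and no Gauss-Newton-style safeguard on the Hessian is included.

\begin{verbatim}
import numpy as np

def cubature_points(mu_k, Sigma_kk):
    """2*n_k sigma points and equal weights for N(mu_k, Sigma_kk)
    (third-degree spherical-cubature rule)."""
    n_k = mu_k.size
    L = np.linalg.cholesky(Sigma_kk)
    offsets = np.sqrt(n_k) * np.hstack([L, -L])          # n_k x 2n_k
    return mu_k[:, None] + offsets, np.full(2 * n_k, 1.0 / (2 * n_k))

def esgvi_iteration(mu, Sigma_inv, factors):
    """One derivative-free update as in (115)-(117).  'factors' is a list of
    (P_k, phi_k): a projection matrix and a (possibly non-smooth) factor."""
    # In practice only the marginal blocks of Sigma matching the non-zero
    # blocks of Sigma_inv are needed (sparse solve); we invert densely here.
    Sigma = np.linalg.inv(Sigma_inv)
    N = mu.size
    grad = np.zeros(N)
    Hess = np.zeros((N, N))
    for P_k, phi_k in factors:
        mu_k = P_k @ mu
        Sigma_kk = P_k @ Sigma @ P_k.T
        Sigma_kk_inv = np.linalg.inv(Sigma_kk)
        X, w = cubature_points(mu_k, Sigma_kk)
        phis = np.array([phi_k(X[:, j]) for j in range(X.shape[1])])
        dX = X - mu_k[:, None]
        E_phi = np.sum(w * phis)                         # E[phi_k]
        E_dx_phi = dX @ (w * phis)                       # E[(x_k - mu_k) phi_k]
        E_dxdx_phi = (dX * (w * phis)) @ dX.T            # E[(x_k - mu_k)(.)^T phi_k]
        grad += P_k.T @ (Sigma_kk_inv @ E_dx_phi)                       # (115)
        Hess += P_k.T @ (Sigma_kk_inv @ E_dxdx_phi @ Sigma_kk_inv
                         - Sigma_kk_inv * E_phi) @ P_k                  # (116)
    Sigma_inv_new = Hess                                 # (117a)
    delta_mu = np.linalg.solve(Sigma_inv_new, -grad)     # (117b)
    return mu + delta_mu, Sigma_inv_new                  # (117c)
\end{verbatim}

For a well-posed batch problem, each \(\mathbf{P}_k\) selects only a few state (and landmark) variables, so the assembled Hessian inherits the sparsity of the true inverse covariance; the dense inverse above would be replaced by a sparse solve for just those marginal blocks.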

D Derivation of Derivative Identities (24)

To show (24a), we can use (10a) to write

\[
\frac{\partial}{\partial \boldsymbol{\mu}^T}\,\mathbb{E}_q[f(\mathbf{x})] = \boldsymbol{\Sigma}^{-1}\,\mathbb{E}_q[(\mathbf{x} - \boldsymbol{\mu})f(\mathbf{x})], \qquad (118)
\]

which can also be found in Opper and Archambeau (2009). Applying Stein's lemma from (22), we immediately have

\[
\frac{\partial}{\partial \boldsymbol{\mu}^T}\,\mathbb{E}_q[f(\mathbf{x})] = \mathbb{E}_q\!\left[\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}^T}\right], \qquad (119)
\]

the desired result.

To show (24b), we can use (10b) to write

\[
\frac{\partial^2}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}}\,\mathbb{E}_q[f(\mathbf{x})] = \boldsymbol{\Sigma}^{-1}\,\mathbb{E}_q\bigl[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T f(\mathbf{x})\bigr]\,\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\,\mathbb{E}_q[f(\mathbf{x})], \qquad (120)
\]

again due to Opper and Archambeau (2009). Applying Stein's lemma from (23), we have

\[
\frac{\partial^2}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}}\,\mathbb{E}_q[f(\mathbf{x})] = \mathbb{E}_q\!\left[\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}}\right], \qquad (121)
\]

the desired result. Similarly to (10c), we have

\[
\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\,\mathbb{E}_q[f(\mathbf{x})] = -\frac{1}{2}\,\mathbb{E}_q\bigl[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T f(\mathbf{x})\bigr] + \frac{1}{2}\,\boldsymbol{\Sigma}\,\mathbb{E}_q[f(\mathbf{x})], \qquad (122)
\]

which is again confirmed by Opper and Archambeau (2009). Comparing the right-hand side of this to the right-hand side of (120), we have

\[
-2\,\boldsymbol{\Sigma}^{-1}\left(\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\,\mathbb{E}_q[f(\mathbf{x})]\right)\boldsymbol{\Sigma}^{-1} = \frac{\partial^2}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}}\,\mathbb{E}_q[f(\mathbf{x})] = \mathbb{E}_q\!\left[\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}}\right], \qquad (123)
\]

which shows the second part of (24b).
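The identity formed by (118) and (119) can also be checked by simple Monte Carlo sampling. The sketch below is illustrative only: the test function \(f(\mathbf{x}) = \sin(\mathbf{a}^T\mathbf{x})\), the dimensions, and the sample size are arbitrary choices, and the two sides agree only to within sampling error.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n = 4
L = rng.standard_normal((n, n))
Sigma = L @ L.T + n * np.eye(n)
Sigma_inv = np.linalg.inv(Sigma)
mu = rng.standard_normal(n)
a = rng.standard_normal(n)               # defines the test function f(x) = sin(a^T x)

S = 1_000_000
x = rng.multivariate_normal(mu, Sigma, size=S)   # samples from q = N(mu, Sigma)
fx = np.sin(x @ a)

lhs = Sigma_inv @ ((x - mu).T @ fx / S)          # Sigma^{-1} E_q[(x - mu) f(x)], as in (118)
rhs = (np.cos(x @ a)[:, None] * a).mean(axis=0)  # E_q[df/dx], df/dx = a cos(a^T x), as in (119)
print(np.max(np.abs(lhs - rhs)))                 # small, and shrinking as S grows
\end{verbatim}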
