

Exactly Sparse Delayed-State Filters for View-Based SLAM

Ryan M. Eustice, Member, IEEE, Hanumant Singh, Member, IEEE, and John J. Leonard, Member, IEEE

Abstract— This paper presents the novel insight that the simultaneous localization and mapping (SLAM) information matrix is exactly sparse in a delayed-state framework. Such a framework is used in view-based representations of the environment that rely upon scan-matching raw sensor data to obtain virtual observations of robot motion with respect to a place it has previously been. The exact sparseness of the delayed-state information matrix is in contrast to other recent feature-based SLAM information algorithms, such as sparse extended information filters or thin junction-tree filters, since these methods have to make approximations in order to force the feature-based SLAM information matrix to be sparse. The benefit of the exact sparsity of the delayed-state framework is that it allows one to take advantage of the information space parameterization without incurring any sparse approximation error. Therefore, it can produce equivalent results to the full-covariance solution. The approach is validated experimentally using monocular imagery for two datasets: a test tank experiment with ground-truth, and a remotely operated vehicle survey of the RMS Titanic.

Index Terms— SLAM, mobile robotics, Kalman filters, information filters, computer vision, and underwater vehicles.

I. INTRODUCTION

GOOD navigation is often a prerequisite for many of the tasks assigned to mobile robotics. This is especially true in the underwater realm, where unmanned underwater vehicles (UUVs) have increasingly become part of the standard toolkit of deep-water science. The scientists who use these vehicles have come to demand that they be capable of co-locating data both spatially and temporally across a range of varying applications; examples include studies of biodiversity [1], coral-reef health [2], plume tracking [3]–[5], micro-bathymetry mapping [6]–[8], and deep-sea archeology [9]–[11]. Since global positioning system (GPS) signals do not penetrate the ocean surface, engineers most often resort to acoustic-beacon networks [12], [13] to meet the large-area, bounded-error, precision navigation requirements of scientists. The disadvantage of this method, however, is that it requires the deployment, calibration, and eventual recovery of the transponder net. While this is often an acceptable trade-off for long-term deployments, it frequently is the bane of short-term surveys.

In more recent years UUVs have seen significant advances in their dead-reckoning capabilities. The advent of sensors such as the acoustic Doppler velocity log (DVL) [14] and north-seeking fiber optic gyro (FOG) [15] has enabled underwater vehicles to navigate with reported error bounds of less than 1% of distance traveled [16]. For shorter-duration missions this level of precision can often be quite satisfactory, but for longer-duration, large-area missions the unbounded accumulation of error is typically intolerable.

R. Eustice was with the Joint Program in Oceanographic Engineering of the Massachusetts Institute of Technology, Cambridge, MA, and the Woods Hole Oceanographic Institution, Woods Hole, MA during the tenure of this work; presently he is with the Department of Mechanical Engineering at The Johns Hopkins University, Latrobe 223, 3400 N. Charles St., Baltimore, MD 21218 USA. Email: [email protected] Fax: 410.516.7254

H. Singh is with the Department of Applied Ocean Physics and Engineering at the Woods Hole Oceanographic Institution, WHOI MS 7, Woods Hole, MA 02543 USA. Email: [email protected] Fax: 508.457.2191

J. Leonard is with the Department of Mechanical Engineering at the Massachusetts Institute of Technology, Room 5-214, 77 Massachusetts Ave., Cambridge, MA 02139 USA. Email: [email protected] Fax: 617.253.8125

A. Visually Augmented Navigation

In an effort to overcome current underwater navigation limitations, Eustice, Pizarro, and Singh [17] presented a SLAM technique for near seafloor navigation called visually augmented navigation (VAN). Their technique incorporates pairwise camera constraints from low-overlap imagery to constrain the vehicle position estimate and “reset” the accumulated navigation drift error. In this framework, the camera provides measurements of the six degree of freedom (DOF) relative coordinate transformation between poses modulo scale. The method recursively incorporates these relative-pose constraints by estimating the global poses that are consistent with the camera measurements and navigation prior. These global poses correspond to samples from the robot’s trajectory acquired at image acquisition and, therefore, unlike the typical feature-based SLAM estimation problem, which keeps track of the current robot pose and an associated landmark map, the VAN state vector consists entirely of historical vehicle states corresponding to the vehicle poses at the times the images were captured. This delayed-state approach corresponds to a view-based representation of the environment (Fig. 1), which can be traced back to a batch scan-matching method by Lu and Milios [18] using laser data, a delayed decision making framework by Leonard and Rikoski [19] for feature initialization with sonar data, and the hybrid batch/recursive formulations by Fleischer [20] and McLauchlan [21] using camera images. In this context, scan-matching raw images results in virtual observations of robot motion with respect to a place it has previously visited.

The VAN technique proposed the use of an extended Kalman filter (EKF) as the fusion framework for merging the navigation and camera sensor measurements. This is a well-known approach whose application to SLAM was originally developed by Smith, Self, and Cheeseman [22], [23] and Moutarlier and Chatila [24]. The EKF maintains the joint correlations over all elements in the state vector and, therefore, can update estimates of all the elements involved in key events like loop closure. Maintaining these joint correlations, however, represents a significant computational burden since each measurement update requires quadratic complexity in the size of the state vector. This limits the online use of an EKF to relatively small maps (e.g., for the VAN approach this equates to an upper bound of approximately 100 six-vector poses).

The EKF’s quadratic computational complexity has long been a recognized issue within the SLAM community and has led to a great deal of research being directed towards scalable large-area SLAM algorithms. Notable large-area approaches include submaps [25]–[27], postponement [28]–[30], Rao-Blackwellized particle filtering techniques [31], [32], and covariance intersection [33]. In addition to this body of work, promising new approaches for scalable SLAM have appeared in the recent literature and are based upon exploiting sparsity in the Gaussian “information form” [34]–[38].

B. A Scalable Framework

To our knowledge, the earliest related work that exploited the efficiency of the measurement update in the inverse covariance form was published by McLauchlan and Murray [39], in the context of recursive structure-from-motion (SFM). This work was subsequently extended to realize a hybrid batch/recursive visual SLAM implementation that unified recursive SLAM and bundle adjustment [21]. McLauchlan recognized the potential increase in efficiency that can be gained via approximations to maintain sparsity of the information matrix:

It has long been known in the photogrammetry community, in the form of the equivalent normal formulation, that the [information] matrix . . . takes a special sparse form in the context of reconstruction . . . [However, in a recursive formulation] . . . eliminating motion fills in the structure blocks. This has to be avoided to maintain update times proportional to n. So our partial elimination adjustment method is to ignore corrections that fill-in zero blocks, while applying the correction to the blocks which are already non-zero.

While the consistency implications of this approximation are unknown, in practice the method achieved results approaching those of a full batch solution for moderate duration image sequences.

Fig. 1. The system diagram for a view-based representation. The model is comprised of a graph where the nodes correspond to historical robot poses and edges represent either Markov (navigation) or non-Markov (relative-pose) constraints.

Recently, the SLAM community has also turned its attention to exploring the information parameterization for increased efficiency. In particular, published approaches include the sparse extended information filter (SEIF) [34], the thin junction-tree filter (TJTF) [35], and Treemap filters [37]. The authors of these algorithms make the important empirical observation, first noted in [34] and later proved in [40], that when the feature-based SLAM posterior is cast in the form of the extended information filter (EIF) (i.e., the dual of the EKF), many of the off-diagonal elements in the information matrix are near zero when properly normalized. These new feature-based SLAM information algorithms approximate the posterior with a sparse representation and thereby prevent weak inter-landmark links from forming. This approach (effectively) bounds the density of the information matrix and, as each author shows, allows for constant-time updates. The delicate and nontrivial issue that must be dealt with, however, is “how to sparsify the information matrix?”, since this approximation can lead to global map inconsistency [41], [42].

Interestingly, it is the same phenomenon that plagues both the information formulations of McLauchlan and Murray [21], [39], as well as the feature-based SLAM algorithms of Thrun et al. [34], Paskin [35], and Frese [37]: “eliminating motion fills in the structure blocks.” Eliminating the robot’s trajectory causes the SLAM landmark posterior to densify, destroying any sparsity [38], [43] and, hence, any efficiency associated with a sparse representation. This is the reason why all feature-based SLAM information algorithms are founded upon some type of pruning strategy that removes weak constraints.

In the following, we illustrate why the feature-based SLAM information matrix is naturally dense and, therefore, why SEIFs and TJTFs have to approximate the SLAM posterior with a sparse representation. We then continue by introducing the novel insight that the information form is exactly sparse for a delayed-state representation. This inherent sparsity allows us to cast the delayed-state framework in an efficient representation, but without any sparse approximation error. We call this result “exactly sparse delayed-state filters” (ESDFs). Benchmark results quantifying the ESDF’s efficiency with respect to the standard EKF formulation are shown for a controlled laboratory dataset. In addition, real-world results for a recent ROV survey of the wreck of the RMS Titanic are presented.

II. THE INFORMATION FORM

A. An Alternative Parameterization of the Gaussian

The information form is often called the canonical or natural representation of the Gaussian distribution. This notion of a natural representation stems from expanding the quadratic in the exponential of the Gaussian distribution as

$$
\begin{aligned}
p(\xi_t) &= \mathcal{N}\left(\xi_t; \mu_t, \Sigma_t\right)\\
&= \frac{1}{\sqrt{\left|2\pi\Sigma_t\right|}}\exp\left\{-\tfrac{1}{2}\left(\xi_t-\mu_t\right)^{\top}\Sigma_t^{-1}\left(\xi_t-\mu_t\right)\right\}\\
&= \frac{1}{\sqrt{\left|2\pi\Sigma_t\right|}}\exp\left\{-\tfrac{1}{2}\left(\xi_t^{\top}\Sigma_t^{-1}\xi_t - 2\mu_t^{\top}\Sigma_t^{-1}\xi_t + \mu_t^{\top}\Sigma_t^{-1}\mu_t\right)\right\}\\
&= \frac{e^{-\frac{1}{2}\mu_t^{\top}\Sigma_t^{-1}\mu_t}}{\sqrt{\left|2\pi\Sigma_t\right|}}\exp\left\{-\tfrac{1}{2}\xi_t^{\top}\Sigma_t^{-1}\xi_t + \mu_t^{\top}\Sigma_t^{-1}\xi_t\right\}\\
&= \frac{e^{-\frac{1}{2}\eta_t^{\top}\Lambda_t^{-1}\eta_t}}{\sqrt{\left|2\pi\Lambda_t^{-1}\right|}}\exp\left\{-\tfrac{1}{2}\xi_t^{\top}\Lambda_t\xi_t + \eta_t^{\top}\xi_t\right\}\\
&= \mathcal{N}^{-1}\left(\xi_t; \eta_t, \Lambda_t\right)
\end{aligned}
$$

where

$$\Lambda_t = \Sigma_t^{-1} \qquad \text{and} \qquad \eta_t = \Lambda_t \mu_t. \tag{1}$$

The result is that rather than parameterizing the normal distribution in terms of its mean and covariance, $\mathcal{N}(\xi_t; \mu_t, \Sigma_t)$, it is instead parameterized in terms of its information vector and information matrix, $\mathcal{N}^{-1}(\xi_t; \eta_t, \Lambda_t)$ [44]. Here, “natural” refers to the fact that the exponential is parameterized directly in terms of the information vector and matrix without the need for completing the matrix square.

TABLE I
Summary of marginalization and conditioning operations on a Gaussian distribution expressed in covariance and information form.

$$p(\alpha, \beta) = \mathcal{N}\left(\begin{bmatrix}\mu_\alpha\\ \mu_\beta\end{bmatrix}, \begin{bmatrix}\Sigma_{\alpha\alpha} & \Sigma_{\alpha\beta}\\ \Sigma_{\beta\alpha} & \Sigma_{\beta\beta}\end{bmatrix}\right) = \mathcal{N}^{-1}\left(\begin{bmatrix}\eta_\alpha\\ \eta_\beta\end{bmatrix}, \begin{bmatrix}\Lambda_{\alpha\alpha} & \Lambda_{\alpha\beta}\\ \Lambda_{\beta\alpha} & \Lambda_{\beta\beta}\end{bmatrix}\right)$$

Marginalization, $p(\alpha) = \int p(\alpha, \beta)\,d\beta$:
  Cov. form:  $\mu = \mu_\alpha$,  $\Sigma = \Sigma_{\alpha\alpha}$
  Info. form: $\eta = \eta_\alpha - \Lambda_{\alpha\beta}\Lambda_{\beta\beta}^{-1}\eta_\beta$,  $\Lambda = \Lambda_{\alpha\alpha} - \Lambda_{\alpha\beta}\Lambda_{\beta\beta}^{-1}\Lambda_{\beta\alpha}$

Conditioning, $p(\alpha \mid \beta) = p(\alpha, \beta)/p(\beta)$:
  Cov. form:  $\mu' = \mu_\alpha + \Sigma_{\alpha\beta}\Sigma_{\beta\beta}^{-1}(\beta - \mu_\beta)$,  $\Sigma' = \Sigma_{\alpha\alpha} - \Sigma_{\alpha\beta}\Sigma_{\beta\beta}^{-1}\Sigma_{\beta\alpha}$
  Info. form: $\eta' = \eta_\alpha - \Lambda_{\alpha\beta}\beta$,  $\Lambda' = \Lambda_{\alpha\alpha}$
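The equivalence established by (1) is easy to check numerically. The following sketch (a NumPy toy with an arbitrary two-dimensional Gaussian; the numbers and the test point are made up for illustration) evaluates both the moment and canonical parameterizations of the density at the same point:

```python
import numpy as np

# An arbitrary 2-D Gaussian in covariance (moment) form.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Canonical (information) form per (1): Lambda = Sigma^-1, eta = Lambda mu.
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu

def gauss_cov(x, mu, Sigma):
    """N(x; mu, Sigma) -- moment parameterization."""
    d = x - mu
    norm = np.sqrt(np.linalg.det(2 * np.pi * Sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm

def gauss_info(x, eta, Lam):
    """N^-1(x; eta, Lambda) -- canonical parameterization, per the expansion above."""
    norm = np.exp(-0.5 * eta @ np.linalg.inv(Lam) @ eta) / \
           np.sqrt(np.linalg.det(2 * np.pi * np.linalg.inv(Lam)))
    return norm * np.exp(-0.5 * x @ Lam @ x + eta @ x)

x = np.array([0.5, 0.0])
# Both parameterizations describe the same density.
assert np.isclose(gauss_cov(x, mu, Sigma), gauss_info(x, eta, Lam))
```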

B. Marginalization and Conditioning

The covariance and information representations lead to very different computational characteristics with respect to the fundamental probabilistic operations of marginalization and conditioning. This is important because these two operations appear at the core of any SLAM algorithm, for example motion prediction and measurement updates. Table I summarizes these operations on a Gaussian distribution, where we see that the covariance and information representations exhibit a dual relationship with respect to marginalization and conditioning. For example, marginalization is easy in the covariance form since it corresponds to extracting the appropriate sub-block from the covariance matrix, while in the information form it is hard because it involves calculating the Schur complement over the variables we wish to keep. Note that the opposite relation holds true for conditioning, which is easy in the information form and hard in the covariance form.
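The duality summarized in Table I can be verified numerically. A minimal NumPy sketch, using a random symmetric positive-definite joint covariance (values arbitrary), checks that the Schur-complement route in one form agrees with sub-block extraction in the other:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)      # a random SPD joint covariance
Lam = np.linalg.inv(Sigma)           # the joint information matrix

a, b = slice(0, 2), slice(2, 4)      # partition: alpha (keep), beta (remove)

# Covariance form: marginalization is sub-block extraction (easy).
Sigma_marg = Sigma[a, a]
# Information form: marginalization is a Schur complement (hard).
Lam_marg = Lam[a, a] - Lam[a, b] @ np.linalg.inv(Lam[b, b]) @ Lam[b, a]
# The two routes agree.
assert np.allclose(np.linalg.inv(Lam_marg), Sigma_marg)

# Dually, conditioning is sub-block extraction in the information form ...
Lam_cond = Lam[a, a]
# ... and a Schur complement in the covariance form.
Sigma_cond = Sigma[a, a] - Sigma[a, b] @ np.linalg.inv(Sigma[b, b]) @ Sigma[b, a]
assert np.allclose(np.linalg.inv(Lam_cond), Sigma_cond)
```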

III. FEATURE-BASED SLAM INFORMATION FILTERS

Most SLAM approaches are feature-based, which assumes that the robot can extract an abstract representation of features in the environment from its sensor data and then use re-observation of these features for localization [22]. In this approach a landmark map is explicitly built and maintained.

The process of concurrently performing localization and feature map building is inherently coupled, thereby implying that the robot must then represent a joint distribution over landmarks and current pose. Using the EKF to represent these coupled errors requires maintaining the cross-correlations in the covariance matrix, of which there are quadratically many. Updating the joint correlations over map and robot leads to an O(n²) complexity per update, with n being the number of landmarks in the map.

A. Sparsity Yields Efficiency

As stated earlier, some substantial papers have recently appeared in the literature in which the authors explore re-parameterizing the feature-based SLAM posterior in the information form [34]–[37]. For example, Thrun et al. [34] make the observation that when the EIF is used for inference, measurement updates are additive and efficient. The downside of the EIF is that motion prediction is generally O(n³); however, if the information matrix obeys a certain sparse structure, the EIF motion prediction can be performed in constant time. To obtain the requisite sparse structure, Thrun et al. make an important empirical observation regarding the architecture of the feature-based SLAM information matrix. They show that when properly normalized, many of the inter-landmark constraints in the information matrix are redundant and weak. Based upon this insight, the methods presented in [34] and [35] try to approximate the information matrix with a sparse representation in which these weak inter-landmark constraints are eliminated, allowing for efficient inference.

The delicate issue that must be dealt with in these approaches, though, is how to perform the necessary approximation step to keep the information matrix sparse. In fact, the sparsification step is an important issue not to be glossed over, because the feature-based SLAM information matrix associated with the joint posterior over robot pose, $x_t$, and landmark map, $M$, given sensor measurements, $z^t$, and control inputs, $u^t$ (i.e., $p(x_t, M \mid z^t, u^t)$), is naturally fully dense. As we show next, this density arises from marginalizing out past robot poses.

B. Filtering Causes Fill-in

To see that marginalization results in fill-in, consider the diagram shown in Fig. 2. We begin with the schematic shown to the upper left, which represents the robot, $x_t$, at time $t$ connected to three landmarks $L_1$, $L_2$ and $L_3$ in the context of a Markov random field (MRF) [45], [46] (a.k.a. Markov network). The shown Markov network depicts a graphical representation of the conditional independencies in the distribution $p(x_t, L_{1:3} \mid z^t, u^t)$ and indicates that the only constraints that exist are between the robot and landmarks (i.e., no inter-landmark constraints appear). This lack of inter-landmark constraints should be correctly interpreted to mean that each landmark is conditionally independent given the robot pose, as described in [31], [47]. The intuition behind this comes from viewing the noise of each sensor reading as being independent, and therefore, determining each landmark position is an independent estimation problem given the known location of the sensor.

Directly below each Markov network in Fig. 2 is an illustration of the corresponding information matrix. Here we see that the nonzero off-diagonal elements encode the robot/landmark constraints while the zeros in the information matrix encode the lack of direct inter-landmark constraints [35]. Shown in the middle of Fig. 2 is the intermediate distribution $p(x_{t+1}, x_t, L_{1:4} \mid z^t, u^t)$. This distribution represents a time-propagation of the previous distribution by augmenting the state vector to include the term $x_{t+1}$ (i.e., the new robot pose at time $t+1$), re-observation of feature $L_3$, and observation of a new landmark $L_4$. Because the robot state evolves according to a first-order Markov process, we see that the new robot state, $x_{t+1}$, is only linked to the previous robot state, $x_t$, and that observation of the landmarks $L_3$ and $L_4$ adds two additional constraints to $x_{t+1}$. In the typical feature-based SLAM approach only the current robot pose is estimated and not the complete trajectory. Therefore, we always marginalize out the previous robot pose, $x_t$, during our time-projection step to give the distribution over current pose and map, $p(x_{t+1}, L_{1:4} \mid z^t, u^t) = \int p(x_{t+1}, x_t, L_{1:4} \mid z^t, u^t)\,dx_t$. Recalling the formula for marginalization applied to a Gaussian in the information form (see Table I), we note that it is the matrix outer product $\Lambda_{\alpha\beta}\Lambda_{\beta\beta}^{-1}\Lambda_{\alpha\beta}^{\top}$ (where $\alpha = \{x_{t+1}, L_{1:4}\}$ and $\beta = x_t$) that causes the information matrix to fill in and become dense. This result is shown in the rightmost graph of Fig. 2.

Intuitively, the landmarks $L_1$, $L_2$, $L_3$, which used to be indirectly connected via a direct relationship with $x_t$, must now represent that indirect relationship directly by creating new links between each other. Therefore, the penalty for a feature-based SLAM representation that always marginalizes out the robot trajectory is that the landmark Markov network becomes fully connected and the associated information matrix becomes fully dense (though, as previously mentioned, [34] makes the empirical observation that many of the off-diagonal elements are relatively small).
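The fill-in mechanism can be reproduced in a few lines. A toy NumPy sketch (scalar states and hypothetical information values, not from the paper) links a robot pose to three landmarks with no inter-landmark constraints, then marginalizes the pose out via the Schur complement of Table I:

```python
import numpy as np

# A sparse information matrix over [x_t, L1, L2, L3] (scalar states for
# illustration): the robot is linked to each landmark, but there are no
# direct inter-landmark links (zero off-diagonal entries).
Lam = np.array([[4.0, -1.0, -1.0, -1.0],   # x_t
                [-1.0, 2.0,  0.0,  0.0],   # L1
                [-1.0, 0.0,  2.0,  0.0],   # L2
                [-1.0, 0.0,  0.0,  2.0]])  # L3

# Marginalize out the robot pose x_t (Schur complement, per Table I).
b = np.array([0])            # index of x_t (beta, removed)
a = np.array([1, 2, 3])      # landmark indices (alpha, kept)
Lam_map = Lam[np.ix_(a, a)] - \
          Lam[np.ix_(a, b)] @ np.linalg.inv(Lam[np.ix_(b, b)]) @ Lam[np.ix_(b, a)]

# Every inter-landmark entry is now nonzero: marginalization caused fill-in.
off_diag = Lam_map[~np.eye(3, dtype=bool)]
assert np.all(off_diag != 0)
```

Here each formerly zero entry becomes $0 - (-1)(1/4)(-1) = -0.25$: the indirect coupling through $x_t$ is now encoded directly between the landmarks.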

IV. EXACTLY SPARSE DELAYED-STATE FILTERS

An alternative formulation of the SLAM problem is to use a view-based representation rather than a feature-based approach [17], [18], [48]. View-based representations do not explicitly model landmark features in the environment; instead, the estimation problem consists of tracking the current robot pose in conjunction with a collection of historical poses sampled from the robot’s trajectory. The associated posterior is then defined over a collection of delayed-states [17]–[20]. In the view-based representation, raw sensor data is registered to provide virtual observations of pose displacements. For example, in [18] and [48] these virtual observations come from scan-matching raw laser range data, while in our application [17] these virtual observations come from registering overlapping optical imagery (via the Essential matrix). Algorithm 1 provides an outline of the overall ESDF algorithmic procedure whose details we discuss next.

Fig. 2. A graphical explanation of why the feature-based SLAM information matrix is naturally fully dense. Refer to the text of §III for a detailed discussion. (left) The posterior over robot pose, $x_t$, and landmarks, $L_{1:3}$, given sensor measurements, $z^t$, and control inputs, $u^t$, is represented as a Markov network; the corresponding information matrix is shown directly below and encodes the graphical link structure within the non-zero off-diagonal elements. (middle) The time-propagation of the posterior is now shown, where the state vector has been augmented to include the robot pose at time $t+1$ (i.e., $x_{t+1}$), re-observation of landmark $L_3$, and observation of a new landmark $L_4$. Appropriate sub-blocks of the information matrix have been outlined in bold to differentiate the relevant portions involved in marginalizing out the past robot pose, $x_t$. Referring to Table I, $\Lambda_{\alpha\alpha}$ is the lower-right block, $\Lambda_{\beta\beta}$ is the upper-left block, and $\Lambda_{\alpha\beta} = \Lambda_{\beta\alpha}^{\top}$ are the two rectangular blocks. (right) This posterior depicts the effect of marginalizing out the past robot state, $x_t$, with its consequent “fill-in” of the information matrix.

A. State Augmentation

We begin by describing the method of state augmentation, which is how we “grow” the state vector to contain a new delayed-state. This operation occurs whenever we have a new view that we wish to store. For example, in our VAN framework we add a delayed-state for each acquired image of the environment that we wish to be able to revisit at a later time.

1) Adding a Delayed-State: Assume for the moment that our estimate at time $t$ is described by the following distribution expressed in both covariance and information form:

$$p(x_t, M \mid z^t, u^t) = \mathcal{N}\left(\begin{bmatrix}\mu_{x_t}\\ \mu_M\end{bmatrix}, \begin{bmatrix}\Sigma_{x_t x_t} & \Sigma_{x_t M}\\ \Sigma_{M x_t} & \Sigma_{MM}\end{bmatrix}\right) = \mathcal{N}^{-1}\left(\begin{bmatrix}\eta_{x_t}\\ \eta_M\end{bmatrix}, \begin{bmatrix}\Lambda_{x_t x_t} & \Lambda_{x_t M}\\ \Lambda_{M x_t} & \Lambda_{MM}\end{bmatrix}\right).$$

This distribution represents a map, $M$, and current robot state, $x_t$, given all measurements, $z^t$, and control inputs, $u^t$. Here, the map variable, $M$, is used in a general sense; for example, it could represent a collection of delayed-states or a set of landmark features in the environment. For now we do not care, because we want to show what happens when we augment our representation to include the time-propagated robot state, $x_{t+1}$, obtaining the distribution $p(x_{t+1}, x_t, M \mid z^t, u^{t+1})$, which can be factored as

$$
\begin{aligned}
p(x_{t+1}, x_t, M \mid z^t, u^{t+1}) &= p(x_{t+1} \mid x_t, M, z^t, u^{t+1})\, p(x_t, M \mid z^t, u^{t+1})\\
&\overset{\text{Markov}}{=} p(x_{t+1} \mid x_t, u_{t+1})\, p(x_t, M \mid z^t, u^t). \qquad (2)
\end{aligned}
$$


In (2) we factored the posterior into the product of a probabilistic state-transition term multiplied by our prior, using the common assumption that the robot state evolves according to a first-order Markov process. Equation (3) describes the general nonlinear discrete-time Markov robot motion model we assume, and (4) its first-order linearized form, where $F$ is the Jacobian evaluated at $\mu_{x_t}$ and $w_t \sim \mathcal{N}(0, Q)$ is the white process noise:

$$
\begin{aligned}
x_{t+1} &= f(x_t, u_{t+1}) + w_t & (3)\\
&\approx f(\mu_{x_t}, u_{t+1}) + F(x_t - \mu_{x_t}) + w_t. & (4)
\end{aligned}
$$

Note that in general our robot state description, $x_t$, consists of both pose (i.e., position and orientation) and kinematic components (e.g., body-frame velocities, angular rates).
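As a concrete illustration of (3)–(4), the sketch below uses a hypothetical planar unicycle model (for illustration only; this is not the vehicle model of the paper) and checks that the analytic Jacobian $F$ gives a good first-order approximation of $f$ near the mean:

```python
import numpy as np

# A hypothetical planar motion model: state x = [px, py, theta],
# control u = [v, w] over a unit time step.
def f(x, u):
    px, py, th = x
    v, w = u
    return np.array([px + v * np.cos(th),
                     py + v * np.sin(th),
                     th + w])

def jacobian_F(x, u):
    """Analytic Jacobian of f with respect to the state, evaluated at x."""
    _, _, th = x
    v, _ = u
    return np.array([[1.0, 0.0, -v * np.sin(th)],
                     [0.0, 1.0,  v * np.cos(th)],
                     [0.0, 0.0,  1.0]])

mu = np.array([0.0, 0.0, 0.3])
u = np.array([1.0, 0.1])
F = jacobian_F(mu, u)

# First-order check as in (4): f(x, u) ~ f(mu, u) + F (x - mu) near the mean.
dx = np.array([1e-4, -2e-4, 3e-4])
lin = f(mu, u) + F @ dx
assert np.allclose(f(mu + dx, u), lin, atol=1e-6)
```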

2) Augmentation in the Covariance Form: Under the linearized approximation (4), the augmented state distribution (2) is also Gaussian, and in covariance form its result is given by [22]:

$$p(x_{t+1}, x_t, M \mid z^t, u^{t+1}) = \mathcal{N}\left(\mu'_{t+1}, \Sigma'_{t+1}\right)$$

$$
\mu'_{t+1} = \begin{bmatrix} f(\mu_{x_t}, u_{t+1})\\ \mu_{x_t}\\ \mu_M \end{bmatrix}
\qquad
\Sigma'_{t+1} = \begin{bmatrix}
F\Sigma_{x_t x_t}F^{\top} + Q & F\Sigma_{x_t x_t} & F\Sigma_{x_t M}\\
\Sigma_{x_t x_t}F^{\top} & \Sigma_{x_t x_t} & \Sigma_{x_t M}\\
\Sigma_{M x_t}F^{\top} & \Sigma_{M x_t} & \Sigma_{MM}
\end{bmatrix}. \qquad (5)
$$

The lower-right $2 \times 2$ sub-block of $\Sigma'_{t+1}$ corresponds to the covariance between the delayed-state element, $x_t$, and the map, $M$, and has remained unchanged from the prior. Meanwhile, the first row and column contain the cross-covariances associated with the time-propagated robot state, $x_{t+1}$, which includes the effect of the process model.
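A minimal sketch of the covariance-form augmentation (5), with toy dimensions and made-up numbers; `augment_covariance` is a hypothetical helper written for illustration, not part of the VAN implementation:

```python
import numpy as np

def augment_covariance(mu, Sigma, F, Q, f_mu):
    """Covariance-form state augmentation per (5).

    mu, Sigma describe the prior over [x_t, M]; the first nx entries are
    the robot state. f_mu = f(mu_x, u_{t+1}) is the predicted mean.
    Returns (mu', Sigma') over the augmented state [x_{t+1}, x_t, M].
    """
    nx = F.shape[0]
    Sxx = Sigma[:nx, :nx]
    # First row of (5): cross-covariances of the propagated state x_{t+1}.
    top = np.hstack([F @ Sxx @ F.T + Q, F @ Sigma[:nx, :]])
    # Remaining rows: the prior covariance is untouched.
    Sigma_aug = np.vstack([top, np.hstack([Sigma[:, :nx] @ F.T, Sigma])])
    mu_aug = np.concatenate([f_mu, mu])
    return mu_aug, Sigma_aug

# Hypothetical toy numbers: 2-D robot state, scalar "map" entry.
F = np.array([[1.0, 0.5], [0.0, 1.0]])
Q = 0.1 * np.eye(2)
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 2.0]])
mu = np.array([0.0, 1.0, 5.0])
mu_aug, Sigma_aug = augment_covariance(mu, Sigma, F, Q, f_mu=np.array([0.5, 1.0]))

# The lower-right block (over [x_t, M]) is unchanged from the prior.
assert np.allclose(Sigma_aug[2:, 2:], Sigma)
```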

3) Augmentation in the Information Form: Having obtained the delayed-state distribution in covariance form, we can now transform (5) to its information form (6). This requires inversion of the $3 \times 3$ block covariance matrix $\Sigma'_{t+1}$, whose tedious derivation we omit here; note, though, that (6) can be verified by the fact that $\Lambda'_{t+1}\Sigma'_{t+1} = I$ and $\eta'_{t+1} = \Lambda'_{t+1}\mu'_{t+1}$:

$$p(x_{t+1}, x_t, M \mid z^t, u^{t+1}) = \mathcal{N}^{-1}\left(\eta'_{t+1}, \Lambda'_{t+1}\right)$$

$$
\eta'_{t+1} = \begin{bmatrix}
Q^{-1}\left(f(\mu_{x_t}, u_{t+1}) - F\mu_{x_t}\right)\\
\eta_{x_t} - F^{\top}Q^{-1}\left(f(\mu_{x_t}, u_{t+1}) - F\mu_{x_t}\right)\\
\eta_M
\end{bmatrix}
\qquad
\Lambda'_{t+1} = \begin{bmatrix}
Q^{-1} & -Q^{-1}F & 0\\
-F^{\top}Q^{-1} & \Lambda_{x_t x_t} + F^{\top}Q^{-1}F & \Lambda_{x_t M}\\
0 & \Lambda_{M x_t} & \Lambda_{MM}
\end{bmatrix} \qquad (6)
$$

(the zero blocks in $\Lambda'_{t+1}$ are the key result).
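Equation (6) can be spot-checked numerically: build $\Sigma'_{t+1}$ per (5) from arbitrary values, invert it, and confirm that the blocks coupling $x_{t+1}$ to the map are (numerically) zero. The dimensions and values below are made up for illustration:

```python
import numpy as np

# Toy dimensions and arbitrary values.
rng = np.random.default_rng(1)
nx, m = 2, 2                                   # robot-state and map sizes
A = rng.normal(size=(nx + m, nx + m))
Sigma = A @ A.T + 3 * np.eye(nx + m)           # prior covariance over [x_t, M]
F = rng.normal(size=(nx, nx))                  # motion Jacobian
Q = 0.5 * np.eye(nx)                           # process-noise covariance

# Augmented covariance Sigma'_{t+1} per (5).
Sxx = Sigma[:nx, :nx]
top = np.hstack([F @ Sxx @ F.T + Q, F @ Sigma[:nx, :]])
Sigma_aug = np.vstack([top, np.hstack([Sigma[:, :nx] @ F.T, Sigma])])

# Invert to obtain the augmented information matrix of (6).
Lam_aug = np.linalg.inv(Sigma_aug)

# Key result: the blocks linking x_{t+1} and the map M are zero.
assert np.allclose(Lam_aug[:nx, 2 * nx:], 0.0, atol=1e-9)
# The remaining x_{t+1} blocks match (6): Q^-1 and -Q^-1 F.
assert np.allclose(Lam_aug[:nx, :nx], np.linalg.inv(Q))
assert np.allclose(Lam_aug[:nx, nx:2 * nx], -np.linalg.inv(Q) @ F)
```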

4) Markovity Yields Exact Sparseness: Equation (6) provides a key insight into the structure of the information matrix regarding delayed-states. We see that augmenting our state vector to include the time-propagated robot state, $x_{t+1}$, introduces shared information only between it and the previous robot state, $x_t$. Moreover, the shared information between $x_{t+1}$ and the map, $M$, is always zero irrespective of what $M$ abstractly represents (i.e., regardless of whether $M$ represents a set of landmarks or a collection of delayed-states, the result will always be zero). This sparsity in the augmented state information matrix is a direct consequence of the Markov property associated with the state transition probability $p(x_{t+1} \mid x_t, u_{t+1})$, which states that $x_{t+1}$ is only conditionally dependent upon its previous state, $x_t$. In terms of a graphical Markov network, we can trivially arrive at the same sparsity pattern as (6) by recognizing that the time-propagated state, $x_{t+1}$, is only linked to its parent node, $x_t$, via the state transition probability, $p(x_{t+1} \mid x_t, u_{t+1})$, as per (2), and, therefore, is conditionally independent of $M$ (i.e., shares no links).

By induction, a key property of state augmentation in the information form is that if we continue to augment our state vector with additional delayed-states, the information matrix will exhibit a block-tridiagonal structure linking each delayed-state with only the subsequent and previous states:

[ Λxt+1xt+1    Λxt+1xt
  Λ⊤xt+1xt     Λxtxt         Λxtxt−1
               Λ⊤xtxt−1      Λxt−1xt−1    Λxt−1xt−2
                             ⋱            ⋱           ⋱ ].    (7)

Hence, the view-based SLAM delayed-state information matrix is naturally sparse without having to make any approximations.
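The induction behind (7) can be checked numerically. Below is a minimal sketch (hypothetical dimensions, a generic linearized model F, and unit Q purely for illustration) that repeatedly applies the augmentation step of (6) and verifies that every block more than one block-row off the diagonal stays exactly zero:

```python
import numpy as np

def augment(Lam, F, Qinv):
    """One state-augmentation step in information form, cf. (6).

    Prepends the time-propagated state x_{t+1} to an information matrix
    whose first n rows/cols hold the current robot state x_t. Only the
    x_{t+1}/x_t blocks are populated; the coupling between x_{t+1} and
    the rest of the map stays exactly zero.
    """
    n = F.shape[0]
    N = Lam.shape[0]
    out = np.zeros((N + n, N + n))
    out[n:, n:] = Lam                        # prior information, untouched
    out[:n, :n] = Qinv                       # Q^-1
    out[:n, n:2*n] = -Qinv @ F               # -Q^-1 F
    out[n:2*n, :n] = -F.T @ Qinv             # -F' Q^-1
    out[n:2*n, n:2*n] += F.T @ Qinv @ F      # Lam_xtxt + F' Q^-1 F
    return out

rng = np.random.default_rng(0)
n = 2
Lam = np.eye(n)                              # prior information on x_0
for _ in range(5):                           # five augmentations
    F = rng.standard_normal((n, n))
    Lam = augment(Lam, F, np.eye(n))

# Every block more than one block-row off the diagonal is exactly zero:
# the matrix is block tridiagonal, as in (7).
k = Lam.shape[0] // n
for a in range(k):
    for b in range(k):
        if abs(a - b) > 1:
            assert np.allclose(Lam[a*n:(a+1)*n, b*n:(b+1)*n], 0)
```

Note that the zeros here are structural, not numerical round-off: the augmentation step simply never writes to the far off-diagonal blocks.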

B. Measurement Updates

One of the very attractive properties of the information form is that measurement updates are constant-time and additive in an EIF [34] — this is in contrast to an EKF's quadratic complexity per update. Assume the following general nonlinear measurement function (8) and its first-order linearized form (9):

zt = h(ξt) + vt    (8)
   ≈ h(µ̄t) + H(ξt − µ̄t) + vt    (9)

where ξt is the predicted state vector distributed according to ξt ∼ N(µ̄t, Σ̄t) ≡ N⁻¹(η̄t, Λ̄t), vt is the white measurement noise vt ∼ N(0, R), and H is the Jacobian evaluated at µ̄t. The EKF covariance update requires computing the Kalman gain and updating µ̄t and Σ̄t via [44]:

K = Σ̄tH⊤(HΣ̄tH⊤ + R)⁻¹
µt = µ̄t + K(zt − h(µ̄t))
Σt = (I − KH)Σ̄t(I − KH)⊤ + KRK⊤    (10)

This calculation non-trivially modifies all elements in the covariance matrix, resulting in quadratic computational complexity per update [22]. In contrast, the corresponding EIF update is given by [34]:

ηt = η̄t + H⊤R⁻¹(zt − h(µ̄t) + Hµ̄t)
Λt = Λ̄t + H⊤R⁻¹H.    (11)
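On a toy linear-Gaussian example, the additive EIF update (11) can be checked against the EKF update (10); both describe the same posterior. The sketch below assumes a purely linear observation (so h(µ̄t) = Hµ̄t and the innovation term reduces to zt) and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2

# A random Gaussian prior in both covariance and information form.
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
mu = rng.standard_normal(n)
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu

H = rng.standard_normal((m, n))      # linear(ized) observation model
R = np.eye(m)                        # measurement noise covariance
Rinv = np.linalg.inv(R)
z = rng.standard_normal(m)

# EKF update (10): gain, mean, Joseph-form covariance.
K = Sigma @ H.T @ np.linalg.inv(H @ Sigma @ H.T + R)
mu_ekf = mu + K @ (z - H @ mu)
I_KH = np.eye(n) - K @ H
Sig_ekf = I_KH @ Sigma @ I_KH.T + K @ R @ K.T

# EIF update (11): purely additive (here h(mu) = H @ mu, so
# z - h(mu) + H @ mu reduces to z).
eta_new = eta + H.T @ Rinv @ (z - H @ mu + H @ mu)
Lam_new = Lam + H.T @ Rinv @ H

# Both parameterizations yield the same posterior.
assert np.allclose(np.linalg.inv(Lam_new), Sig_ekf)
assert np.allclose(np.linalg.solve(Lam_new, eta_new), mu_ekf)
```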


Fig. 3. Topology of the view-based SLAM information matrix (plot annotations mark the tridiagonal Markov constraints, the off-diagonal camera constraints, and a large loop-closing event). This figure highlights the exact sparsity of view-based SLAM using the information matrix from the RMS Titanic survey of §VI-B. In all there are 867 delayed-states, ξt = [x⊤1, …, x⊤866, x⊤r]⊤, where each state is a 12-vector consisting of 6 pose and 6 kinematic components. The resulting information matrix is a 10,404 × 10,404 matrix with only 0.52% nonzero elements.

1) ESDF Updates are Constant-Time: Equation (11) shows that the information matrix is additively updated by the outer-product term H⊤R⁻¹H. In general, this outer product modifies all elements of the predicted information matrix, Λ̄t; however, a key observation is that the SLAM Jacobian, H, is always sparse [34]. For example, in the VAN framework pairwise registration of images Ii and Ij provides a relative-pose measurement (modulo scale) between states xi and xj, resulting in a sparse Jacobian of the form:

H = [ 0 · · · ∂h/∂xi · · · 0 · · · ∂h/∂xj · · · 0 ].

As a result, only the four block-elements corresponding to xi and xj of the information matrix need to be modified. In particular, the information in the diagonal blocks Λxixi and Λxjxj is increased, while new information appears at Λxixj and its symmetric counterpart Λxjxi. This new off-diagonal information reflects the addition of a new edge (i.e., constraint) into the corresponding Markov network linking the nodes xi and xj.
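This four-block footprint is easy to verify numerically. The following sketch builds a hypothetical relative-pose Jacobian (identity blocks standing in for ∂h/∂xi and ∂h/∂xj, purely for illustration) and confirms that the additive term of (11) is zero everywhere except the four blocks indexed by xi and xj:

```python
import numpy as np

n, N = 3, 7 * 3                        # 7 delayed-states of dimension 3
i, j = 1, 5                            # states linked by the registration

# Hypothetical relative-pose Jacobian: nonzero only at x_i and x_j.
H = np.zeros((n, N))
H[:, i*n:(i+1)*n] = np.eye(n)          # stands in for dh/dx_i
H[:, j*n:(j+1)*n] = -np.eye(n)         # stands in for dh/dx_j

Rinv = np.eye(n)                       # R^-1
delta = H.T @ Rinv @ H                 # additive term of update (11)

# Only the (i,i), (i,j), (j,i), (j,j) blocks are touched.
for a in range(N // n):
    for b in range(N // n):
        block = delta[a*n:(a+1)*n, b*n:(b+1)*n]
        if a in (i, j) and b in (i, j):
            assert np.any(block != 0)  # information added here
        else:
            assert np.allclose(block, 0)
```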

2) ESDF Updates use Linear Storage: Putting (7) together with (11), we see that an important consequence of the delayed-state framework is that the total number of nonzero off-diagonal elements in the information matrix is linear in the number of delayed-states and relative-pose constraints for a bounded graph structure (Fig. 3). Hence, without any approximation, a view-based representation is exactly sparse and, furthermore, requires only linear storage. In our application, we control the degree of sparsity by bounding the number of image registrations that the robot may attempt per state augmentation. In other words, the robot is only allowed to hypothesize k possible candidate images (where k = 5 in our application) for attempted registration with the current view; this leads to at most 2nk non-Markov off-diagonal constraints in the resulting information matrix.

3) ESDF Update and Measurement Correlation: As a side note, it is worth pointing out that (8) assumes that the measurements are corrupted by time-independent noise. Since scan-matching methods rely upon registering raw data, this criterion may be violated if data is reused. In our VAN framework, relative-pose measurements are generated by pairwise registration of images with common overlap. As typical underwater optical survey trajectories consist of a boustrophedon pattern and low frame rates (to reduce the amount of power expended on illumination), overall spatial image overlap tends to be low. Therefore, we assume that most pairwise camera measurements in our application are weakly (if at all) correlated, as derived in [49], and, hence, we do not directly enforce the exclusion of data reuse.¹ For the general case, however, measurement independence should be ensured by using a set of raw data correspondences only once, so that scan-matching measurements remain statistically independent.

C. Motion Prediction

Motion prediction corresponds to a time propagation of the robot's state from time t to time t + 1. In (6) we derived an expression in the information form for the augmented distribution containing the time-predicted robot state, xt+1, and its previous state, xt — in other words, p(xt+1, xt, M | zt, ut+1). To derive the time-propagated distribution p(xt+1, M | zt, ut+1), note that all that is required is to simply marginalize out the previous state, xt, from (6). Referring to Table I for marginalization of a Gaussian in the information form, we have²

p(xt+1, M | zt, ut+1) = ∫ p(xt+1, xt, M | zt, ut+1) dxt = N⁻¹(η̄t+1, Λ̄t+1)

η̄t+1 = [ Q⁻¹(f(µxt, ut+1) − Fµxt) ]  −  [ −Q⁻¹F ] Ω⁻¹ η⋆xt
        [ ηM                        ]     [ ΛMxt  ]

      = [ Q⁻¹FΩ⁻¹ηxt + Ψ(f(µxt, ut+1) − Fµxt) ]
        [ ηM − ΛMxtΩ⁻¹η⋆xt                     ]

Λ̄t+1 = [ Q⁻¹    0   ]  −  [ −Q⁻¹F ] Ω⁻¹ [ −F⊤Q⁻¹    ΛxtM ]
        [ 0      ΛMM ]     [ ΛMxt  ]

      = [ Ψ                 Q⁻¹FΩ⁻¹ΛxtM       ]
        [ ΛMxtΩ⁻¹F⊤Q⁻¹      ΛMM − ΛMxtΩ⁻¹ΛxtM ]    (12)

where

η⋆xt = ηxt − F⊤Q⁻¹(f(µxt, ut+1) − Fµxt),
Ω = Λxtxt + F⊤Q⁻¹F,

and

Ψ = Q⁻¹ − Q⁻¹FΩ⁻¹F⊤Q⁻¹
  = Q⁻¹ − Q⁻¹F(F⊤Q⁻¹F + Λxtxt)⁻¹F⊤Q⁻¹
  = (Q + FΛxtxt⁻¹F⊤)⁻¹.

¹[49] provides a derivation showing that the correlation between two pairwise image registrations sharing a common image is null (low) if the shared feature set is empty (small).

²The simplification of Ψ employs the matrix inversion lemma:
(A + BCB⊤)⁻¹ = A⁻¹ − A⁻¹B(B⊤A⁻¹B + C⁻¹)⁻¹B⊤A⁻¹.


1) ESDF Prediction is Constant-Time: An important consequence of the delayed-state framework is that (12) can be implemented in constant-time. To see this we refer to Fig. 4, which illustrates the effect of motion prediction for a collection of delayed-states. We begin with the Markov network of Fig. 4(a) showing a segregated collection of delayed-states. Our view-based "map" corresponds to the set of states M = {xt−4, xt−3, xt−2, xt−1}, which have an interconnected dependence due to camera measurements, while the states xt and xt+1 are only serially connected and correspond to the previous and predicted robot states, respectively. Referring back to Table I, we see that Fig. 4(b) illustrates the effect of marginalization on the information matrix. We note that since xt is only serially connected to xt+1 and xt−1, marginalizing it out only requires modifying the information blocks associated with these elements (i.e., Λ′xt+1xt+1 and Λ′xt−1xt−1, denoted with cross-hairs, and the symmetric blocks Λ′xt+1xt−1 = Λ′⊤xt−1xt+1, denoted with black dots). Therefore, since only a fixed portion of the information matrix is ever involved in the calculation of (12), motion prediction can be performed in constant-time. This is an important result since in practice the fusion of asynchronous navigation sensor measurements (e.g., odometry, compass) implies that prediction is typically a high-bandwidth operation (e.g., O(10 Hz) or more).
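The constant-time claim can be illustrated with a small numerical sketch: marginalizing a serially connected block out of a chain-structured information matrix (via the Schur complement of Table I) changes only the blocks of its two neighbours and introduces the expected fill-in link between them. The dimensions and coupling values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 2, 6                            # 6 block-states of dimension 2

def spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

# Chain-structured information matrix: block 1 plays the role of x_t and
# is linked only to its neighbours, block 0 (x_{t+1}) and block 2 (x_{t-1}).
Lam = np.zeros((k * n, k * n))
for b in range(k):
    Lam[b*n:(b+1)*n, b*n:(b+1)*n] = spd(n)
for b in range(k - 1):
    C = 0.1 * rng.standard_normal((n, n))
    Lam[b*n:(b+1)*n, (b+1)*n:(b+2)*n] = C
    Lam[(b+1)*n:(b+2)*n, b*n:(b+1)*n] = C.T

# Marginalize out block 1 via the Schur complement (cf. Table I).
alpha = np.concatenate([np.arange(0, n), np.arange(2*n, k*n)])
beta = np.arange(n, 2*n)
Laa = Lam[np.ix_(alpha, alpha)]
Lab = Lam[np.ix_(alpha, beta)]
Lbb = Lam[np.ix_(beta, beta)]
Lam_marg = Laa - Lab @ np.linalg.inv(Lbb) @ Lab.T

# Only the blocks of the two neighbours change; all blocks not adjacent
# to the marginalized state are untouched -- hence constant-time.
change = Lam_marg - Laa
m = k - 1                              # remaining block-states
for a in range(m):
    for b in range(m):
        if a <= 1 and b <= 1:
            continue                   # neighbour blocks may change
        assert np.allclose(change[a*n:(a+1)*n, b*n:(b+1)*n], 0)

# The neighbours also become linked: fill-in between x_{t+1} and x_{t-1}.
assert not np.allclose(change[0:n, n:2*n], 0)
```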

D. State Recovery

The information form of the Gaussian is parameterized by its information vector and information matrix, ηt and Λt, respectively. However, the expressions for motion prediction (12) and measurement update (11) additionally require sub-elements from the state mean vector, µt, so that the nonlinear models (3) and (8) can be linearized. Therefore, in order for the information form to be a computationally efficient parameterization for delayed-states, we also need to be able to easily recover portions of the state mean vector. Fortunately, this is the case due to the sparse structure of the information matrix, Λt.

1) Full State Recovery: Naïve recovery of the state estimate through matrix inversion results in cubic complexity and destroys any efficiency gained over the EKF. Fortunately, closer inspection reveals that the recovery of the state mean, µt, can be posed more efficiently as solving a sparse, symmetric, positive-definite, linear system of equations:

Λtµt = ηt. (13)

Such systems can be solved via the classic iterative method of conjugate gradients (CG) [50]. In general, CG can solve this system in n iterations with O(n) cost per iteration, where n is the size of the state vector (i.e., O(n²) total cost), and often in many fewer iterations if the initialization is good [51]. In addition, since the state mean, µt, typically does not change significantly with each measurement update (excluding key events like loop closure), this relaxation can take place over multiple time steps using a fixed number of iterations per update, as pioneered by Duckett et al. [52] and Thrun et al. [34]. The caveat is that a fixed number of iterations does not necessarily guarantee convergence and, hence, optimal state recovery within an individual time step [53].
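As an illustration of such iterative recovery, the sketch below implements plain CG on a toy SPD system standing in for (13), warm-started from a slightly stale mean; the matrix and perturbation are hypothetical stand-ins, not VAN data:

```python
import numpy as np

def conjugate_gradient(Lam, eta, mu0, iters):
    """A few CG iterations on Lam @ mu = eta (cf. (13)), warm-started
    at mu0. Each iteration costs O(nonzeros(Lam)); with a good warm
    start (the previous mean), a fixed small number of iterations per
    update is often enough, as in the relaxation schemes cited above.
    """
    mu = mu0.copy()
    r = eta - Lam @ mu                 # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Lp = Lam @ p
        alpha = rs / (p @ Lp)
        mu += alpha * p
        r -= alpha * Lp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return mu

# Toy SPD "information matrix" and its exact mean for comparison.
rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n))
Lam = A @ A.T + n * np.eye(n)
mu_true = rng.standard_normal(n)
eta = Lam @ mu_true

# Warm-started from a slightly stale mean, few iterations suffice.
mu0 = mu_true + 0.01 * rng.standard_normal(n)
mu = conjugate_gradient(Lam, eta, mu0, iters=20)
assert np.linalg.norm(mu - mu_true) < 1e-6
```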

Additionally, a couple of recently developed multilevel relaxation SLAM algorithms have appeared in the literature that propose linear asymptotic complexity. These new techniques, by Konolige [51] and Frese et al. [54], achieve the computational reduction by subsampling poses and performing the relaxation over multiple spatial resolutions. Borrowing multigrid relaxation techniques pioneered in the early 1970s for solving discretized partial differential equations (PDEs) [55], the key idea is that spatial subsampling improves relaxation convergence rates. Frese et al. further take advantage of this by applying their multilevel method incrementally over time.

2) Partial State Recovery: An important observation regarding the expressions for motion prediction (12) and measurement updates (11) is that they only require knowing subsets of the state mean µt. In light of this we note that rather than always solving for the complete state mean vector, µt, we can partition (13) into two sets of coupled equations as

[ Λℓℓ    Λℓb ] [ µℓ ]  =  [ ηℓ ]
[ Λbℓ    Λbb ] [ µb ]     [ ηb ].    (14)

This partitioning of µt into what we call the "local portion" of the map, µℓ, and the "benign portion", µb, allows us to sub-optimally solve for local portions of the map in constant-time. By holding our current estimate for µb fixed, we can solve (14) for an estimate of µℓ as

µℓ = Λℓℓ⁻¹(ηℓ − Λℓbµb).    (15)

Equation (15) provides us with a method for recovering an estimate of the local map, µℓ, provided that our estimate for the benign portion, µb, is a decent approximation to the actual mean. Furthermore, note that only a subset of µb is actually required in the calculation of µℓ, corresponding to the nonzero elements in the sparse matrix Λℓb. In terms of Thrun et al.'s notation [34], this active subset, denoted µb⁺, represents the Markov blanket of µℓ and corresponds to elements that are directly connected to µℓ in the associated Markov network. Therefore, calculation of the local map, µℓ, only requires an estimate of the locally connected delayed-state network, µb⁺, and does not depend upon passive elements in the benign portion of the map.

In particular, we use (15) to provide an accurate and constant-time approximation for recovering the robot mean during motion prediction (12), and during incorporation of high-bandwidth navigation sensor measurements (11). Since the robot state is only serially connected to the map, Λℓb has only one nonzero block-element (§IV-C). Therefore, solving for the robot mean is constant-time. Note, though, that (15) will only provide a good approximation so long as the active mean estimate, µb⁺, is accurate. In the case that it is not (e.g., as a result of loop closure), the true full mean, µt, should be recovered via (13).
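The partial recovery step (15) is a one-line solve once the partition is in hand. The sketch below uses a hypothetical dense system (in practice Λℓb is sparse, with a single nonzero block for the robot) and shows that, with the benign portion held at its exact value, (15) recovers the local mean exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
nl, nb = 3, 30                       # "local" robot block, "benign" map

# Hypothetical SPD information matrix and consistent (eta, mu) pair.
A = rng.standard_normal((nl + nb, nl + nb))
Lam = A @ A.T + (nl + nb) * np.eye(nl + nb)
mu_true = rng.standard_normal(nl + nb)
eta = Lam @ mu_true

# Partition (14): local block, coupling block, local information vector.
Lll = Lam[:nl, :nl]
Llb = Lam[:nl, nl:]
eta_l = eta[:nl]

# Hold the benign estimate fixed; here it equals the true value, so (15)
# recovers the local mean exactly. In practice mu_b is the (slightly
# stale) current estimate, making this a constant-time approximation.
mu_b = mu_true[nl:]
mu_l = np.linalg.solve(Lll, eta_l - Llb @ mu_b)
assert np.allclose(mu_l, mu_true[:nl])
```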

E. Data Association

Traditionally, the problem of data association is addressed by evaluating the likelihood of a measurement for different


Fig. 4. ESDF motion prediction is constant-time. Shown above is a graphical illustration of the effect of motion prediction within a delayed-state framework. (a) State augmentation of xt+1: the Markov network for a segregated collection of delayed-states. The view-based "map", M, is composed of the set M = {xt−4, xt−3, xt−2, xt−1}, which is a collection of delayed-states that are interlinked by camera constraints. The previous and predicted robot states, xt and xt+1, respectively, are serially linked to the map. Below the Markov network is a schematic showing the nonzero structure (colored in gray) of the associated information matrix. (b) Marginalization over xt: recalling from Table I the expression for marginalization of a Gaussian in information form, the rightmost schematic illustrates this operation graphically. The end result is that only the states that were linked to xt (i.e., xt−1 and xt+1) are affected by the marginalization operation, as indicated by the cross-hairs and black dots superimposed on Λαα.

correspondence hypotheses [56]. However, obtaining the requisite prior requires marginalizing out all elements in the state estimate except for the subset of variables of interest (i.e., cubic complexity in the information form; see Table I). To sidestep this difficulty, Thrun et al. [34], [57] instead proposed using conditional likelihoods based upon extracting elements within the appropriate Markov blanket. Their method inverted this sub-matrix to obtain a conditional covariance sub-block that they then used for data association. While [34] and [57] reported success using this conditional covariance, it can be shown to yield overconfident likelihood estimates [58].

Alternatively, Eustice et al. [58] derived a method for obtaining conservative estimates for the marginal covariances. Their technique stems from posing the relationship ΛtΣt = I as a sparse system of linear equations, ΛtΣ∗i = ei, where Σ∗i and ei denote the ith columns of the covariance and identity matrices, respectively. They show that this relationship allows for efficient determination of the robot's covariance column and, based upon this, offer a novel algorithm for inferring conservative marginal covariances useful for data association.

For our VAN application, we employ the technique of [58] to infer the map marginal covariances and associated robot cross-correlation. This allows us to compute relative Euclidean distances and first-order uncertainty estimates between the current robot pose and all other stored poses (i.e., linear time complexity). Based upon this we infer the probability of overlap and select the k most likely views (k = 5 in our application) as candidates for registration [49].
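The column-wise recovery underlying [58] amounts to solving one sparse linear system per column of interest rather than inverting Λt. A minimal dense sketch (a random SPD matrix standing in for the information matrix; in practice a sparse solver is used):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
A = rng.standard_normal((n, n))
Lam = A @ A.T + n * np.eye(n)        # stand-in information matrix
Sigma = np.linalg.inv(Lam)           # full inverse, for comparison only

# Recover one covariance column without a full inversion:
# solve Lam @ col = e_i for the column(s) spanning the robot block.
i = 7
e_i = np.zeros(n)
e_i[i] = 1.0
col = np.linalg.solve(Lam, e_i)
assert np.allclose(col, Sigma[:, i])
```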

V. DISCUSSION

A. Connection to Lu–Milios

The concept of a view-based map representation has strong roots going back to a seminal paper by Lu and Milios [18]. Their approach sidestepped difficulties associated with feature segmentation and representation by doing away with an explicit feature-based parameterization of the environment. Rather, their technique indirectly represented a physical map via a collection of global robot poses and raw scan data. To determine the global poses, they formulated the nonlinear optimization problem as one of estimating a set of global robot poses consistent with the relative-pose constraints obtained by scan matching and odometry. They then solved this sparse nonlinear optimization problem in a batch-iterative fashion. Our ESDF framework essentially attempts to recursively solve the same problem. Note, though, that in the ESDF framework the nonlinear relative-pose constraints are only linearized once, about the current state when the measurement is incorporated via (11), while in the non-causal Lu–Milios batch formulation they are re-linearized around the current best estimate of the state at each iteration of the nonlinear optimization. This implies that while the ESDF solution can be performed recursively, it will be more prone to linearization error.

B. Connection to Feature-Based SLAM

Another interesting theoretical connection involves relating the delayed-state SLAM framework to feature-based SLAM. In §III we saw that the feature-based SLAM information matrix is naturally dense as a result of marginalizing out the robot's trajectory. On a similar train of thought, conceptually we can view the off-diagonal elements appearing in the delayed-state SLAM information matrix as being a result of marginalizing out the landmarks (Fig. 5). Since landmarks are only ever locally observed, they only create links to spatially close robot states. Therefore, each time we eliminate a landmark, it introduces a new off-diagonal entry into the information matrix that links all robot states that observed that landmark.

Interestingly, this same type of constraint phenomenon also appears in photogrammetry and, in particular, in large-scale, batch, bundle-adjustment techniques [59]. These techniques are based upon a partitioned Levenberg–Marquardt algorithm that takes advantage of the inherent sparsity between camera and 3D feature constraints to efficiently solve the batch re-


Define: ξt0 = xt0  ▷ initialize state vector to be the robot
Require: ηt0, Λt0, µt0  ▷ a priori robot estimate

1: AUGMENT FLAG ← 1
2: loop  ▷ perform SLAM
3:   ut+1 ← current control input
4:   if AUGMENT FLAG then
5:     Augment state vector: ξt+1 = [x⊤t+1, ξ⊤t]⊤;  η̄t+1, Λ̄t+1 ← (6)|ηt, Λt, µt, ut+1
6:     AUGMENT FLAG ← 0
7:   else
8:     Predict state to next asynchronous measurement: η̄t+1, Λ̄t+1 ← (12)|ηt, Λt, µt, ut+1
9:   end if
10:  if navigation sensor measurement zt+1 then
11:    Perform partial-state recovery for robot: µ̄t+1 ← (15)|η̄t+1, Λ̄t+1, µt
12:    Apply measurement update: ηt+1, Λt+1 ← (11)|zt+1, η̄t+1, Λ̄t+1, µ̄t+1
13:    Perform partial-state recovery for robot: µt+1 ← (15)|ηt+1, Λt+1, µ̄t+1
14:  else if new image frame It+1 then
15:    Extract and encode image interest points (§VI).
16:    Perform full-state recovery: µ̄t+1 ← (13)|η̄t+1, Λ̄t+1, µt
17:    Propose the k most likely candidate views for attempted registration with It+1 (§IV-E).
18:    zt+1 ← ∅  ▷ initialize measurement vector
19:    for i = 1 to k do
20:      Attempt image registration between Ii and It+1.
21:      if registration success then
22:        Add pose-constraint to measurement vector: zt+1 ← [z⊤i, z⊤t+1]⊤
23:      end if
24:    end for
25:    Apply measurement update: ηt+1, Λt+1 ← (11)|zt+1, η̄t+1, Λ̄t+1, µ̄t+1
26:    Update data association covariance bounds (§IV-E).
27:    Perform full-state recovery: µt+1 ← (13)|ηt+1, Λt+1, µ̄t+1
28:    AUGMENT FLAG ← 1
29:  end if
30: end loop

Algorithm 1: Outline of the ESDF algorithm as used for VAN.

construction problem. Their central component is based upon eliminating 3D-structure equations to yield a coupled set of equations over camera poses that can be solved and then back-substituted to recover the associated 3D-structure. This strategy of 3D-structure elimination, yielding a coupled set of equations over cameras, is reminiscent of the conceptual notion of marginalizing out the landmarks in SLAM to yield a graph over poses. Therefore, loosely speaking, the VAN ESDF framework represents an "online", linearized formulation of this same camera recovery problem.

Fig. 5. A depiction of the ESDF's connection to feature-based SLAM. (a) The SLAM posterior over landmarks and trajectory as represented by a Markov network. (b) The corresponding delayed-state Markov network after marginalizing out the landmarks. Conceptually, view-based SLAM can be idealized as marginalizing out the landmarks (i.e., L1, L2, L3), which in turn causes edges to appear between spatially proximal samples from the robot's trajectory (i.e., xt−4, . . . , xt).

C. Connection to FastSLAM

Our approach has an interesting relationship with other recent SLAM algorithms that are based upon Rao-Blackwellized particle filters [31], [47]. Montemerlo et al.'s Factored Solution to SLAM (FastSLAM) exploits the property that individual feature estimates are conditionally independent given perfect knowledge of the vehicle trajectory. Different possible instantiations of the trajectory are represented as particles in a particle filter, where each trajectory has its own set of estimated feature locations. If each particle in FastSLAM did not represent the complete vehicle trajectory, then the conditional independence assumptions they exploit would no longer apply. Our approach too exploits this same conditional independence property, and must keep a history of vehicle poses in the state vector to maintain sparsity. Our method, however, does not estimate feature locations explicitly, but rather applies constraints derived from measurements of the same features at multiple poses to compute updates to the entire vehicle trajectory. The extension of FastSLAM to accommodate such pose-constraints presents an interesting question and warrants future research, especially in the context of dealing with the issue of particle depletion [60].

D. Connection to High Data Rates

Finally, note that a view-based representation is still applicable even with much higher perceptual data rates (e.g., laser scans, video). While the ESDF framework is general and supports trajectory sampling at any rate, for practical reasons it may be prudent to decimate the trajectory in order to control the rate of growth of the resulting state vector. The key idea is that we are not required to sample the trajectory at our perceptual update rate; rather, we can sample it at an appropriate spatial decimation that is sufficient for re-localization.

For example, in our underwater VAN application a digital-still image is collected every few seconds from a down-looking monocular camera. Since this typically results in sequential frame overlap of the order of 15–35%, we include


Fig. 6. An extension of view-based SLAM to video frame rates. (a) The temporary frame set T provides additional constraints: our collection of anchor images IA0, . . . , IAj represents a subsampling of the available video image sequence and serves as our view-based spatial map M. Given higher frame rates, we can exploit the additional views between temporally consecutive anchor images IAj and IAk to get an improved estimate of incremental motion. The improved motion estimate comes from a local bundle adjustment that includes the temporary frame set T = {IT0, . . . , ITm}. (b) The result is an improved visual-odometry constraint: a serial constraint between IAj and IAk that is more rigid than a single pairwise measurement between the pair.

all frames into our view-based map representation. However, in the general case where video frame rates are available, we can selectively sample key frames from the video sequence to serve as spatial "anchor points" in a view-based map. Re-observation of these key frames (coupled with successful image registration) provides a zero-drift spatial measurement of robot motion allowing for loop closure. Furthermore, we can exploit the higher frame rates to get an improved estimate of visual odometry by performing a local bundle adjustment over all frames occurring between temporally consecutive anchor images (Fig. 6). This would provide a more rigid constraint between sampled poses than a simple pairwise registration would [61]. While this may not make optimal use of the inter-sample data, it represents a practical online compromise.

VI. RESULTS

This section presents experimental and real-world results demonstrating both the scalability and efficiency of the ESDF information framework. Note that for each dataset all processing was done using MATLAB R13 running on an Intel 3.4 GHz Pentium-4 desktop with 2048 MB of RAM. For the purposes of benchmark comparison, we employed the full-state recovery technique of (13) after every camera measurement and otherwise used the constant-time partial-state recovery method of (15) to recover the robot state.

Camera constraints were generated using a state-of-the-art feature-based image registration approach [62] founded upon:
• Extract a combination of both Harris [63] and SIFT [64] interest points from each image. For the Harris points, we exploit our navigation prior to apply an orientation normalization to the interest regions by warping via the infinite homography [62], and then compactly encode them using Zernike moments [65].
• Establish putative correspondences between overlapping candidate image pairs based upon similarity and a pose-constrained correspondence search [17].
• Employ a statistically robust least-median-of-squares (LMedS) [66] registration methodology with regularized sampling [67] to extract a consistent inlier correspondence set. For this task we use a 6-point Essential matrix algorithm [68] as the motion-model constraint.
• Solve for a relative-pose estimate using the inlier correspondence set and Horn's relative orientation algorithm [69] initialized with samples from our orientation prior.
• Carry out a two-view maximum likelihood estimate (MLE) refinement based upon minimizing the reprojection error over all inliers [62]. This returns the optimal 5-DOF relative-pose constraint (i.e., azimuth, elevation, Euler roll, Euler pitch, Euler yaw) and first-order parameter covariance (using the standard assumption of 1-pixel, isotropic, i.i.d. noise for extracted interest points).

For further details on VAN's systems-level image processing, see [49].

A. Laboratory Validation: EKF vs. ESDF

In this section we demonstrate the efficiency of the ESDF information framework as compared to the standard EKF-based formulation.

1) Experimental Setup: The experimental setup consisted of a downward-looking digital-still camera mounted to an underwater, moving, pose-instrumented remotely operated vehicle (ROV) at the Johns Hopkins University (JHU) Hydrodynamic Test Facility [70]. Their vehicle [71] is instrumented with a typical suite of oceanographic dead-reckoning navigation sensors capable of measuring heading, attitude, XYZ bottom-referenced Doppler velocities, and depth via a pressure sensor. The vehicle and test facility are also equipped with a high-frequency acoustic long-baseline (LBL) system that provides centimeter-level bounded-error XY vehicle positions, used for validation purposes only. A simulated seafloor environment was created by placing textured carpet, riverbed rocks, and landscaping boulders on the tank floor, appropriately scaled to match a rugged seafloor environment with considerable 3D scene relief. See [49] for further experimental details.

2) Experimental Results: Fig. 7 shows the result of estimating the ROV delayed-states associated with a 101-image sequence using a full-covariance EKF and the sparse ESDF. For this experiment, the vehicle started near the top-left corner of the plot at (−2.5, 2.75) and then drove a course consisting of two grid-based surveys, one oriented SW to NE and the other W to E. Fig. 7(a) shows the spatial XY pose topology, 3σ confidence bounds (unviewable at this scale), and link network of camera constraints; links correspond to image pairs that were successfully registered. Fig. 7(b) and (c) compare the densities associated with the EKF covariance matrix versus the ESDF information matrix. Note that while the EKF correlation matrix is dense, the information matrix exhibits a sparse tridiagonal structure with the number of off-diagonal elements being linear in the number of camera constraints. In all there are 307 camera measurements (81 temporal / 226 spatial) and each delayed-state is a 12-vector consisting of 6 pose and 6 kinematic components. Therefore, 102 delayed-states (101 images plus the robot) results in a 1224 × 1224 information


matrix containing 12²(102 + 2 · 101) + 6²(2 · 226) = 60,048 nonzero elements, as shown. We found the EKF and ESDF solutions to be numerically equivalent and, furthermore, that the ESDF only required 4% of the storage of the EKF for this experiment.

Turning our attention now to filter efficiency, in Fig. 8 we compare the prediction and update times of the EKF to those of the ESDF. In particular, we see that prediction is essentially a constant-time operation for both filters. However, Fig. 8(b) shows that ESDF updates are orders of magnitude more efficient than corresponding EKF updates and, moreover, that they become more efficient relative to the EKF as the number of delayed-states increases. This increase in relative efficiency with increasing state size results from a decreasing density in the information matrix. Also, note that this impressive computational reduction is achieved despite the fact that we are using MATLAB's "left-divide" capability to solve (13) (essentially a form of LU decomposition with forward and backward substitution). Hence, the ESDF's results could be even better if we implemented the iterative multilevel state recovery techniques of [51], [54]. In summary, for this 101-image sequence, data collection took a total of 17 minutes, EKF processing required 29 minutes, and ESDF estimation was just over a minute (i.e., 17× faster than real-time).³

B. Real-World Results: ScalabilityIn this section we present experimental results validating

the large-area scalability of our ESDF framework.1) Experimental Setup: The wreck of the RMS Titanic

was surveyed during the summer of 2004 by the deep-seaROV Hercules [72] operated by the Institute for Exploration ofthe Mystic Aquarium. The ROV was equipped with a standardsuite of oceanographic dead-reckon navigation sensors compa-rable to the JHU vehicle suite. In addition, Hercules also hadonboard a calibrated stereo rig consisting of two downward-looking 12-bit digital-still cameras that collected imagery at arate of 1 frame every 8 seconds. Note, however, that the resultsbeing presented here were produced using imagery from onecamera only — the purpose of this self-imposed restriction to amonocular sequence is to demonstrate the general applicabilityof our VAN methodology.

2) Experimental Results: In Fig. 9 we see a time progression of the camera constraints and vehicle trajectory estimate, with Fig. 10(a) showing the final 3D pose-constraint network. In particular, Fig. 9(c) depicts a large loop-closing event whereby the vehicle successfully re-localized by correctly registering 4 image pairs out of 64 hypothesized candidates. This was after having lost bottom-lock Doppler velocity measurements for an extended period of time. In all, the vehicle traversed a (3D) path length of 3.4 km over the course of a 344 minute survey with a (faster than real-time) total ESDF estimation time of less than 39 minutes (excluding image processing time). The convex hull of the final mapped region encompasses an area over 3100 m2, with a total of 866 images used to provide 3494 camera-generated relative-pose constraints.

3These numbers are for the estimation time only and exclude any image processing time.

While there is no ground-truth for this dataset, the resulting pose-network qualitatively appears to be consistent in that the recovered vehicle trajectory forms the outline of a ship's hull. To quantitatively corroborate the recovered pose-network accuracy, we pairwise triangulated scene structure using only the saved pairwise image correspondences and the final VAN estimated vehicle poses. The results are shown in Fig. 10. Note that the histograms of Fig. 10(d) and Fig. 10(e) contain two error measures and that the y-axis has been clipped to show fine detail. The first measure (white) is the triangulation error based upon the relative-pose camera measurements used by the ESDF filter. This should serve as a baseline for the best possible pairwise triangulation error, since each pose measure is the result of a two-view bundle adjustment. The second measure (black) is the triangulation error based upon the final VAN estimated poses. Scale for both measures has been set by the VAN estimate.
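The pairwise triangulation used for this check takes, for each saved correspondence, the midpoint of the shortest segment between the two camera rays; the segment's length is the triangulation error reported in the histograms. A small sketch of that geometric construction (NumPy, with illustrative variable names; not the authors' code):

```python
import numpy as np

def midpoint_triangulate(c1, d1, c2, d2):
    """Triangulate a 3D point as the midpoint of the shortest segment
    between two camera rays r_i(t) = c_i + t * d_i.

    Returns the midpoint and the perpendicular gap between the rays
    (the triangulation error).
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    b = c2 - c1
    # Closest-point conditions: (p1 - p2) is perpendicular to both rays,
    # giving a 2x2 linear system in the ray parameters s and t.
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    s, t = np.linalg.solve(A, np.array([b @ d1, b @ d2]))
    p1 = c1 + s * d1
    p2 = c2 + t * d2
    return 0.5 * (p1 + p2), np.linalg.norm(p1 - p2)

# Two rays that intersect exactly at (1, 1, 1): the gap should be zero.
point, gap = midpoint_triangulate(
    np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0]),
    np.array([2.0, 0.0, 0.0]), np.array([-1.0, 1.0, 1.0]))
assert np.allclose(point, [1.0, 1.0, 1.0]) and gap < 1e-9
```

For real image pairs the rays rarely intersect, so the gap is nonzero; aggregating it over all correspondences yields error histograms of the kind shown in Fig. 10(d) and Fig. 10(e).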

Note that the VAN triangulated errors are more widely distributed than those of the pairwise bundle-adjusted poses. This is, however, to be expected since VAN's global estimate takes into account all measured camera constraints. The "outliers" are due to poor triangulation resulting from residual error in the global VAN estimate. Again, this error is to be expected since VAN enforces only pose consistency, not structure consistency. In fact, precisely because VAN does not explicitly enforce structure consistency, the overall coherence of the point clouds in Fig. 10(b) and Fig. 10(c) (less than 7.5 cm of triangulation error) corroborates the global consistency of VAN's pose estimates and adds further evidence that they are near ideal. As an aside, note that the quality of VAN's results suggests that it can serve as a recursive, scalable solution to large-area structure-from-motion, since the estimated poses and triangulated structure should provide a good initialization point for an optimal batch bundle-adjustment step.

VII. CONCLUSION

In conclusion, this article presented the insight that the delayed-state view-based SLAM information matrix is exactly sparse and, furthermore, that this sparsity is a direct consequence of retaining historical trajectory samples. Moreover, while the EKF covariance formulation requires quadratic storage, the number of nonzero off-diagonal elements in the ESDF information matrix is linear in the number of measured relative-pose constraints. This sparse matrix structure allows for efficient full state recovery via recently proposed multilevel relaxation methods, while approximate partial state recovery allows motion prediction and navigation updates to be performed in constant time. Finally, we demonstrated the efficiency and large-area applicability of the ESDF framework by presenting vision-based 6-DOF SLAM results for both laboratory and real-world experiments.

Page 12: Exactly Sparse Delayed-State Filters for View-Based SLAM · 2008. 11. 3. · covariance form was published by McLauchlan and Murray [39], in the context of recursive structure-from-motion


[Fig. 7 graphic: (a) VAN recovered trajectory for 101 images of the JHU dataset (North [m] vs. East [m], start to end; 307 camera constraints: 81 temporal along-track links, 226 spatial across-track links); (b) covariance matrix; (c) information matrix.]

Fig. 7. This figure contrasts the exact sparsity of the ESDF information matrix versus the density of the full-covariance matrix. (a) The spatial topology of a 101 image sequence of underwater images collected from the JHU ROV; in all there are 307 camera constraints. The ground-truth trajectory is shown in gray with circles depicting ESDF trajectory samples; the recovered VAN pose-constraint network is depicted in black. The 3σ VAN covariance ellipses are unviewable at this scale, with a standard deviation of less than 2 cm. Note that the recovered trajectory exhibits a slight systematic discrepancy from the ground-truth; we believe this is due to an unaccounted-for bias in the ground-truth experimental setup. (b) The nonzero elements of the covariance matrix; all elements above a normalized correlation score of 10% are shown. (c) The nonzero elements of the information matrix. Note that the covariance matrix has 1224² = 1,498,176 nonzero elements, while the information matrix contains only 60,048. The covariance matrix and information matrix are numerically equivalent; however, the information matrix is exactly sparse.

[Fig. 8 graphic: (a) EKF vs. ESDF prediction times; (b) EKF vs. ESDF update times. Each panel plots time [s] versus number of delayed-states, with the pointwise EKF/ESDF ratio shown below.]

Fig. 8. Time comparison of EKF vs. ESDF filtering operations using the JHU dataset. (a) The top figure shows both the EKF and ESDF prediction times in seconds versus the number of delayed-state entries, while the bottom figure shows their pointwise ratio. From the plots we can gather that prediction is a constant-time operation for both filters. (b) The same plot layout as before, but now we show the update times for each filter. The y-axis of the upper graph has been clipped from its nominal range of [0, 15] to show detail. For benchmark comparison we employed the full-state recovery technique of (13) after every camera measurement (using MATLAB's "left divide" capability). Note that even despite this, the ESDF becomes more efficient relative to the EKF with increasing state size due to the decreasing density of the information matrix.


[Fig. 9 graphic: five North [m] vs. East [m] trajectory plots, panels (a) images 1–200 (marked START), (b) 1–400, (c) 1–600, (d) all, (e) 1–800.]

Fig. 9. Time-evolution of the RMS Titanic pose-constraint network. Successfully registered camera links are shown in gray and 3σ covariance bounds are depicted in black. Time progression begins with the upper left plot and proceeds clockwise: images 1–200, 1–400, 1–600, 1–800, all. Note the large loop-closing event that occurs in (c) when the vehicle returns to the bow of the ship (depicted by the black arrow) after having traveled from the stern with the camera turned off.


[Fig. 10 graphic: (a) final 3D pose-constraint network (North [m], East [m], Depth [m]); (b) raw triangulated point cloud; (c) clipped point cloud (scene relief 0–20 m); (d) raw triangulation error histogram; (e) clipped triangulation error histogram (triangulation error [cm]; pairwise vs. VAN).]

Fig. 10. The triangulated structure for the RMS Titanic as computed from the final VAN pose estimate and saved pairwise correspondences. Triangulated 3D points are defined as the midpoint of the minimum perpendicular distance between two corresponding camera rays. Since structure is triangulated on a pairwise basis, redundant 3D points may occur. (a) Oblique view of the final 3D VAN pose-constraint network associated with using 866 images to provide 3494 camera constraints; 3σ bounds are unviewable at this scale. Green links represent temporally consecutive registered image pairs, while red links represent spatially registered image pairs. (b) The raw VAN triangulated points rendered in 3D (467,512 points in total). (c) A reduced set of triangulated data (363,799 points) for which we have thrown away all points having a triangulation error greater than 7.5 cm. (d)–(e) Histograms of the triangulation error (i.e., the minimum perpendicular distance) for all points across all established camera pairs.


ACKNOWLEDGMENTS

The authors wish to thank our colleagues, Dr. Louis Whitcomb and Mr. James Kinsey, for their collaboration in collecting the JHU tank dataset. We are also grateful to Dr. Robert Ballard for providing us with the RMS Titanic dataset. Finally, we would like to thank Dr. Seth Teller for his discussions regarding large-area, scalable SLAM. This work was funded in part by the CenSSIS ERC of the National Science Foundation under grant EEC-9986821, in part by the Woods Hole Oceanographic Institution through a grant from the Penzance Foundation, and in part by an NDSEG Fellowship awarded through the Department of Defense.

REFERENCES

[1] J. Reynolds, R. Highsmith, B. Konar, C. Wheat, and D. Doudna, "Fisheries and fisheries habitat investigations using undersea technology," in Proc. IEEE/MTS OCEANS Conf. Exhib., vol. 2, Honolulu, HI, USA, Nov. 2001, pp. 812–820.

[2] H. Singh, R. Armstrong, F. Gilbes, R. Eustice, C. Roman, O. Pizarro, and J. Torres, "Imaging coral I: imaging coral habitats with the SeaBED AUV," J. Subsurface Sensing Tech. Apps., vol. 5, no. 1, pp. 25–42, Jan. 2004.

[3] B. Fletcher, "Chemical plume mapping with an autonomous underwater vehicle," in Proc. IEEE/MTS OCEANS Conf. Exhib., vol. 1, Honolulu, HI, USA, Nov. 2001, pp. 508–512.

[4] C. German, D. Connelly, R. Prien, D. Yoerger, M. Jakuba, A. Bradley, T. Shank, H. Edmonds, and C. Langmuir, "New techniques for hydrothermal exploration: in situ chemical sensors on AUVs — preliminary results from the Lau Basin," in EOS: Trans. Amer. Geophysical Union Fall Meeting Supplement, Dec. 2004, pp. A190+, Abstract.

[5] M. Jakuba, D. Yoerger, A. Bradley, C. German, C. Langmuir, and T. Shank, "Multiscale, multimodal AUV surveys for hydrothermal vent localization," in Proc. Intl. Symp. Unmanned Unteth. Subm. Tech., Durham, NH, 2005, to appear.

[6] D. Yoerger, A. Bradley, B. Walden, H. Singh, and R. Bachmayer, "Surveying a subsea lava flow using the Autonomous Benthic Explorer (ABE)," Intl. J. Systems Science, vol. 29, no. 10, pp. 1031–1044, Oct. 1998.

[7] D. Yoerger, A. Bradley, M. Cormier, W. Ryan, and B. Walden, "High resolution mapping of a fast spreading mid-ocean ridge with the Autonomous Benthic Explorer," in Proc. Intl. Symp. Unmanned Unteth. Subm. Tech., Durham, NH, Aug. 1999.

[8] D. Yoerger, D. Kelley, and J. Delaney, "Fine-scale three-dimensional mapping of a deep-sea hydrothermal vent site using the Jason ROV system," Intl. J. Robotics Research, vol. 19, no. 11, pp. 1000–1014, Nov. 2000.

[9] D. Yoerger, H. Singh, L. Whitcomb, J. Catteau, J. Adams, B. Foley, and D. Mindell, "High resolution mapping for deep water archeology," in Annual Meeting of the Society of Historical Archaeology, 1998.

[10] H. Singh, J. Adams, D. Mindell, and B. Foley, "Imaging underwater for archeology," Amer. J. Field Archaeology, vol. 27, no. 3, pp. 319–328, 2000.

[11] R. Ballard, L. Stager, D. Master, D. Yoerger, D. Mindell, L. Whitcomb, H. Singh, and D. Piechota, "Iron age shipwrecks in deep water off Ashkelon, Israel," Amer. J. Archaeology, vol. 106, no. 2, pp. 151–168, Apr. 2002.

[12] M. Hunt, W. Marquet, D. Moller, K. Peal, W. Smith, and R. Spindel, "An acoustic navigation system," Woods Hole Oceanographic Institution, Tech. Rep. WHOI-74-6, Dec. 1974.

[13] P. Milne, Underwater acoustic positioning systems. Houston: Gulf Publishing Company, 1983.

[14] Acoustic Doppler current profiler: principles of operation, a practical primer, 2nd ed., RD Instruments, San Diego, CA, USA, Jan. 1996, P/N 951-6069-00.

[15] T. Gaiffe, "U-Phins: a FOG-based inertial navigation system developed specifically for AUV navigation and control," in Intl. Conf. Underwater Intervention, New Orleans, LA, Feb. 2002.

[16] L. Whitcomb, D. Yoerger, and H. Singh, "Advances in Doppler-based navigation of underwater robotic vehicles," in Proc. IEEE Intl. Conf. Robot. Auto., vol. 1, 1999, pp. 399–406.

[17] R. Eustice, O. Pizarro, and H. Singh, "Visually augmented navigation in an unstructured environment using a delayed state history," in Proc. IEEE Intl. Conf. Robot. Auto., vol. 1, New Orleans, USA, Apr. 2004, pp. 25–32.

[18] F. Lu and E. Milios, "Globally consistent range scan alignment for environment mapping," Autonomous Robots, vol. 4, pp. 333–349, Apr. 1997.

[19] J. Leonard and R. Rikoski, "Incorporation of delayed decision making into stochastic mapping," in Experimental Robotics VII, ser. Lecture Notes in Control and Information Sciences, vol. 271. Springer-Verlag, 2001, pp. 533–542.

[20] S. Fleischer, "Bounded-error vision-based navigation of autonomous underwater vehicles," Ph.D. dissertation, Stanford University, May 2000.

[21] P. McLauchlan, "A batch/recursive algorithm for 3D scene reconstruction," in Proc. IEEE Conf. Comput. Vision Pattern Recog., vol. 2, Hilton Head, SC, USA, 2000, pp. 738–743.

[22] R. Smith, M. Self, and P. Cheeseman, "A stochastic map for uncertain spatial relationships," in Proc. Intl. Symp. Robotics Research. MIT Press, 1988, pp. 467–474.

[23] ——, "Estimating uncertain spatial relationships in robotics," in Autonomous Robot Vehicles, I. Cox and G. Wilfong, Eds. Springer-Verlag, 1990, pp. 167–193.

[24] P. Moutarlier and R. Chatila, "An experimental system for incremental environment modeling by an autonomous mobile robot," in Proc. Intl. Symp. Experimental Robotics, Montreal, Canada, June 1989.

[25] J. Leonard and H. Feder, "Decoupled stochastic mapping," IEEE J. Oceanic Eng., vol. 26, no. 4, pp. 561–571, 2001.

[26] M. Bosse, P. Newman, J. Leonard, and S. Teller, "An Atlas framework for scalable mapping," Intl. J. Robotics Research, vol. 23, pp. 1113–1139, Dec. 2004.

[27] J. Leonard and P. Newman, "Consistent, convergent, and constant-time SLAM," in Proc. Intl. Joint Conf. Art. Intell., Acapulco, Mexico, Aug. 2003.

[28] A. Davison, "Mobile robot navigation using active vision," Ph.D. dissertation, University of Oxford, 1999.

[29] S. Williams, "Efficient solutions to autonomous mapping and navigation problems," Ph.D. dissertation, University of Sydney, 2001.

[30] J. Guivant and E. Nebot, "Optimization of the simultaneous localization and map building algorithm for real time implementation," IEEE Trans. Robot. Automat., vol. 17, no. 3, pp. 242–257, June 2001.

[31] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: a factored solution to the simultaneous localization and mapping problem," in Proc. AAAI Natl. Conf. Art. Intell. Edmonton, Canada: AAAI, 2002.

[32] M. Montemerlo, "FastSLAM: a factored solution to the simultaneous localization and mapping problem with unknown data association," Ph.D. dissertation, Carnegie Mellon University, June 2003.

[33] S. Julier and J. Uhlmann, "Building a million beacon map," in Sensor Fusion and Decentralized Control in Robotic Systems IV, G. McKee and P. Schenker, Eds. Proc. SPIE, Oct. 2001, vol. 4571, pp. 10–21.

[34] S. Thrun, Y. Liu, D. Koller, A. Ng, Z. Ghahramani, and H. Durrant-Whyte, "Simultaneous localization and mapping with sparse extended information filters," Intl. J. Robotics Research, vol. 23, no. 7–8, pp. 693–716, July–Aug. 2004.

[35] M. Paskin, "Thin junction tree filters for simultaneous localization and mapping," in Proc. Intl. Joint Conf. Art. Intell., San Francisco, CA, 2003, pp. 1157–1164.

[36] U. Frese and G. Hirzinger, "Simultaneous localization and mapping — a discussion," in Proc. IJCAI Workshop: Reasoning with Uncertainty in Robotics, Seattle, WA, 2001, pp. 17–26.

[37] U. Frese, "Treemap: an O(log N) algorithm for simultaneous localization and mapping," in Spatial Cognition IV, C. Freksa, Ed. Springer-Verlag, 2004.

[38] R. Eustice, H. Singh, and J. Leonard, "Exactly sparse delayed-state filters," in Proc. IEEE Intl. Conf. Robot. Auto., Barcelona, Spain, 2005, pp. 2428–2435.

[39] P. McLauchlan and D. Murray, "A unifying framework for structure and motion recovery from image sequences," in Proc. IEEE Intl. Conf. Computer Vision, Boston, MA, 1995, pp. 314–320.

[40] U. Frese, "A proof for the approximate sparsity of SLAM information matrices," in Proc. IEEE Intl. Conf. Robot. Auto., Barcelona, Spain, 2005, pp. 331–337.

[41] R. Eustice, M. Walter, and J. Leonard, "Sparse extended information filters: insights into sparsification," in Proc. IEEE/RSJ Intl. Conf. Intell. Robots Systems, 2005, pp. 641–648.

[42] M. Walter, R. Eustice, and J. Leonard, "A provably consistent method for imposing sparsity in feature-based SLAM information filters," in Proc. Intl. Symp. Robotics Research, San Francisco, CA, Oct. 2005, in press.


[43] F. Dellaert, "Square root SAM," in Proc. Robotics: Science & Systems. Cambridge, MA: MIT Press, June 2005, pp. 177–184.

[44] Y. Bar-Shalom, X. Rong Li, and T. Kirubarajan, Estimation with applications to tracking and navigation. New York: John Wiley & Sons, Inc., 2001.

[45] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. San Mateo, CA: Morgan Kaufmann Publishers, 1988.

[46] Y. Weiss and W. Freeman, "Correctness of belief propagation in Gaussian graphical models of arbitrary topology," Neural Computation, vol. 13, no. 10, pp. 2173–2200, 2001.

[47] K. Murphy, "Bayesian map learning in dynamic environments," in Advances in Neural Information Processing Systems. MIT Press, 1999.

[48] J. Gutmann and K. Konolige, "Incremental mapping of large cyclic environments," in Proc. IEEE Intl. Symp. Comp. Intell. Robot. Auto., Monterey, CA, Nov. 1999, pp. 318–325.

[49] R. Eustice, "Large-area visually augmented navigation for autonomous underwater vehicles," Ph.D. dissertation, Massachusetts Institute of Technology / Woods Hole Oceanographic Institution Joint Program, June 2005.

[50] J. Shewchuk, "An introduction to the conjugate gradient method without the agonizing pain," Carnegie Mellon University, Technical Report CMU-CS-94-125, Aug. 1994.

[51] K. Konolige, "Large-scale map-making," in Proc. AAAI Natl. Conf. Art. Intell., San Jose, CA, 2004, pp. 457–463.

[52] T. Duckett, S. Marsland, and J. Shapiro, "Learning globally consistent maps by relaxation," in Proc. IEEE Intl. Conf. Robot. Auto., San Francisco, CA, Apr. 2000, pp. 3841–3846.

[53] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical recipes in C: the art of scientific computing, 2nd ed. Cambridge University Press, 1992.

[54] U. Frese, P. Larsson, and T. Duckett, "A multilevel relaxation algorithm for simultaneous localisation and mapping," IEEE Trans. Robot., vol. 21, no. 2, pp. 1–12, 2005.

[55] A. Brandt, "Multi-level adaptive solutions to boundary-value problems," Math. Comp., vol. 31, pp. 333–390, 1977.

[56] M. Dissanayake, P. Newman, S. Clark, H. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Trans. Robot. Automat., vol. 17, no. 3, pp. 229–241, June 2001.

[57] Y. Liu and S. Thrun, "Results for outdoor-SLAM using sparse extended information filters," in Proc. IEEE Intl. Conf. Robot. Auto., vol. 1, Sept. 2003, pp. 1227–1233.

[58] R. Eustice, H. Singh, J. Leonard, M. Walter, and R. Ballard, "Visually navigating the RMS Titanic with SLAM information filters," in Proc. Robotics: Science & Systems. Cambridge, MA: MIT Press, June 2005, pp. 57–64.

[59] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, "Bundle adjustment – a modern synthesis," in Vision Algorithms: Theory and Practice, ser. LNCS, W. Triggs, A. Zisserman, and R. Szeliski, Eds. Springer-Verlag, 2000, pp. 298–375.

[60] C. Stachniss, G. Grisetti, and W. Burgard, "Recovering particle diversity in a Rao-Blackwellized particle filter for SLAM after actively closing loops," in Proc. IEEE Intl. Conf. Robot. Auto., Barcelona, Spain, Apr. 2005, pp. 655–660.

[61] Z. Zhang and Y. Shan, "Incremental motion estimation through local bundle adjustment," Microsoft Research, Technical Report MSR-TR-01-54, May 2001.

[62] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge University Press, 2000.

[63] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. 4th Alvey Vision Conf., Manchester, U.K., 1988, pp. 147–151.

[64] D. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. J. Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[65] O. Pizarro, "Large scale structure from motion for autonomous underwater vehicle surveys," Ph.D. dissertation, Massachusetts Institute of Technology / Woods Hole Oceanographic Institution Joint Program, September 2004.

[66] P. Rousseeuw and A. Leroy, Robust regression and outlier detection. New York: John Wiley and Sons, 1987.

[67] Z. Zhang, "Determining the epipolar geometry and its uncertainty: a review," Intl. J. Computer Vision, vol. 27, no. 2, pp. 161–198, 1998.

[68] O. Pizarro, R. Eustice, and H. Singh, "Relative pose estimation for instrumented, calibrated imaging platforms," in Proc. Digital Image Computing Apps., Sydney, Australia, Dec. 2003, pp. 601–612.

[69] B. Horn, "Relative orientation," Intl. J. Computer Vision, vol. 4, no. 1, pp. 59–78, Jan. 1990.

[70] J. Kinsey, D. Smallwood, and L. Whitcomb, "A new hydrodynamics test facility for UUV dynamics and control research," in Proc. IEEE/MTS OCEANS Conf. Exhib., vol. 1, Sept. 2003, pp. 356–361.

[71] D. Smallwood, R. Bachmayer, and L. Whitcomb, "A new remotely operated underwater vehicle for dynamics and control research," in Proc. Intl. Symp. Unmanned Unteth. Subm. Tech., Durham, NH, USA, 1999, pp. 370–377.

[72] D. Coleman, R. Ballard, and T. Gregory, "Marine archaeological exploration of the Black Sea," in Proc. IEEE/MTS OCEANS Conf. Exhib., vol. 3, Sept. 2003, pp. 1287–1291.

Ryan Eustice is a postdoctoral fellow with the Department of Mechanical Engineering at the Johns Hopkins University. His research interests are in the areas of navigation and mapping, underwater computer vision, and autonomous underwater vehicles. He holds a B.S. degree in Mechanical Engineering from Michigan State University (1998) and a Ph.D. in Ocean Engineering from the MIT/WHOI Joint Program (2005).

Hanumant Singh received the B.S. degree as a distinguished graduate in computer science and electrical engineering from George Mason University, Fairfax, VA, in 1989 and the Ph.D. degree from the Massachusetts Institute of Technology, Cambridge, MA / Woods Hole Oceanographic Institution (WHOI), Woods Hole, MA, joint program in 1995. He has been a member of the staff at WHOI since 1995, where his research interests include high-resolution imaging underwater and issues associated with docking, navigation, and the architecture of underwater vehicles.

John J. Leonard is Associate Professor of Mechanical and Ocean Engineering in the MIT Department of Mechanical Engineering. He is also a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research addresses the problems of navigation, mapping, and persistent autonomy for autonomous mobile robots operating in unstructured environments. He holds the degrees of B.S.E. in Electrical Engineering and Science from the University of Pennsylvania (1987) and D.Phil. in Engineering Science from the University of Oxford (1994).