
Monocular SLAM with Conditionally Independent Split Mapping

Steven A. Holmes, Member, IEEE, and David W. Murray, Member, IEEE

Abstract—The recovery of structure from motion in real time over extended areas demands methods that mitigate the effects of computational complexity and arithmetical inconsistency. In this paper, we develop SCISM, an algorithm based on relative frame bundle adjustment, which splits the recovered map of 3D landmarks and keyframe poses so that the camera can continue to grow and explore a local map in real time while, at the same time, a bulk map is optimized in the background. By temporarily excluding certain measurements, it ensures that both maps are consistent, and by using the relative frame representation, new results from the bulk process can update the local process without disturbance. The paper first shows how to apply this representation to the parallel tracking and mapping (PTAM) method, a real-time bundle adjuster, and compares results obtained using global and relative frames. It then explains the relative representation's use in SCISM and describes an implementation using PTAM. The paper provides evidence of the algorithm's real-time operation in outdoor scenes, and includes comparison with a more conventional submapping approach.

Index Terms—Monocular SLAM, relative bundle adjustment, parallel tracking and mapping, split-mapping, submapping


1 INTRODUCTION

Two difficulties faced in the real-time recovery of monocular structure from motion (SfM) over extended areas are, first, that the computational cost of finding a full optimum solution—one that involves all 3D landmarks, camera poses, and measurements—scales poorly with problem size, and second that algorithms that attempt to cheat complexity by neglecting parameters or by neglecting small dependencies in the problem are usually inconsistent. They are overconfident, placing error estimates that are too small on their suboptimal solutions. The impact of complexity and consistency has long been of concern in robotic simultaneous localization and mapping (SLAM), where dense laser sensing delivers a torrent of data over large distances (e.g., [1], [2], [3]), but the challenge exists too for comparatively sparse visual sensing, where processing limits are reached over surprisingly short time scales.

Among algorithms which seek a full solution to SfM and SLAM, that with the lowest complexity is FastSLAM [4], in which a factorization exploits the independence of landmark locations when conditioned on the trajectory of the sensor. Its complexity is $O(P \log L)$ in the number of landmarks $L$ using a particle filter with $P$ particles to represent the trajectory. A set of extended Kalman filters (EKF), each conditioned on one of the particles, is used to estimate the landmark positions. FastSLAM, however, is consistent only for limited areas [5], beyond which the number of particles required to represent the density function adequately becomes prohibitive.

Moving up in complexity are Kalman-based solutions [6], [7], [8] that rely on landmarks remaining fixed to reduce the complexity of the general extended Kalman filter from $O(L^3)$ to $O(L^2)$. These filters always produce inconsistent estimates [9], [10], [11] because they linearize about incorrect operating points. However, their principal drawback is that even quadratic complexity in $L$ puts a tight constraint on the number of landmarks that can be mapped—a couple of hundred on current processors. Yet another step up in complexity brings one to full optimizations, all variants of bundle adjustment [12], [13], [14]. Although linear in the number of landmarks, the method is ultimately limited by $O(K^3)$ complexity in the number of camera poses $K$.

Some methods mitigate the effects of complexity by optimizing only a subset of the parameters [15] or by pausing only infrequently to calculate the full optimization [16]. More common though is to build submaps of limited size. In [17], the previous map is marginalized out when a new map is started. The full solution is found by back-projecting information to the old maps. In [3], [18], [19], an EKF is used for each submap and the transformations between them are optimized.

An emphasis in recent work is the use of keyframe-based bundle adjustment to build ever larger maps, evident in the stereo or bearing and range systems described in [20], [21], and [22], and in the monocular, bearings only, work of [23]. In [22], Lim et al. propose a conventional submapping approach, but maintain both topological and metric maps. A local adjustment is carried out on a limited number of keyframes around the current camera position, and those linked to them via measurements on shared landmarks are included as static entities. Submapping is used in the global adjustment, with the submaps themselves determined in part from the topological map. Each is locally optimized before being treated as a rigid map in the optimization of the global map. The topological map helps avoid gross inconsistency during loop closure, but the results of the local and global optimizations are unlikely to be strictly consistent.


The authors are with the Department of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, United Kingdom. E-mail: {sah, dwm}@robots.ox.ac.uk.

Manuscript received 18 July 2011; revised 4 Aug. 2012; accepted 1 Oct. 2012; published online 24 Oct. 2012. Recommended for acceptance by A. Fitzgibbon. For information on obtaining reprints of this article, please send e-mail to [email protected], and reference IEEECS Log Number TPAMI-2011-07-0478. Digital Object Identifier no. 10.1109/TPAMI.2012.234.



While this and other size-based methods of defining submaps have the practical advantage of simplicity, more mathematically satisfying is to create submaps that are most optimally independent. Nested dissection, used by Ni et al. [24] and reworked in Ni and Dellaert [21], is one such approach. Their recent algorithm recursively partitions a graphical representation, finding small separator groups of landmark positions and sensor poses which, when temporarily frozen, create conditionally independent submaps. One potential disadvantage is that the map becomes overpartitioned, involving additional costs when the greater number of separator regions need folding into the optimization. A greater problem is that the method in [21] appears best suited to batch operation, whereas here we are concerned with local, exploratory, and incremental mapping.

Indeed, although submapping is a satisfactory solution (and, given that maps must be bounded, one that ultimately must be adopted), it makes little distinction between what has to be achieved urgently in terms of localization and mapping during exploration, and what might be handled later. Furthermore, the need to stitch submaps together retrospectively feels unsubtle.

This paper develops a method that addresses these two shortcomings for monocular SLAM. The mathematical vehicle proposed is the relative frame representation for bundle adjustment (RBA), introduced by Sibley [25] and developed in several directions by Sibley et al. [20], [26], [27], [28], and the implementational engine is the parallel tracking and mapping (PTAM) algorithm of Klein and Murray [29]. The approach is a monocular relative of RSLAM, although, as indicated below, there is a difference in detail.

In the relative frame representation, the location of each landmark is specified in the camera keyframe from which it was first observed, and the pose of each keyframe is itself given relative to an existing keyframe. In [20], [26], [27], [28], constant-time operation is achieved using RBA by exploiting the progressively fading influence of the current camera pose and measurements on those in ever more distant areas of the map. Here, instead, the boundary between the local and remaining bulk of the map is more sharply delineated. By enforcing a conditional independence, we split the estimation problem in two. One optimization is local to the current camera position. The number of keyframes in this optimization can be chosen at will but is here adapted to the local structure and bounded to ensure an on-average constant time computation during exploration. The other optimization is a non-time-critical optimization of the remainder of the keyframes, which is performed in the background. In effect, this permits a limited-size submap to move around with the camera. Exploiting the conditional independence ensures that both local and full optimizations provide conservative estimates of the structure, estimates that are statistically consistent and can be combined without contention. An early airing of the idea was given in [30].

The paper falls into two parts. Sections 2 and 3 describe relative frame bundle adjustment and its implementation within a real-time monocular SLAM algorithm. Section 4 presents experimental results, giving some insight into the optimization landscapes occupied by the relative and global frame methods. The second part of the paper presents SCISM, SLAM with Conditionally Independent Split Mapping. Sections 5 and 6 justify the approach and describe its implementation. Section 7 provides experimental findings, including a comparison with a direct submapping approach. Overall conclusions are drawn in Section 8.

2 THE RELATIVE FRAME REPRESENTATION

The common way to represent camera pose and scene structure in SLAM is to refer both to a single global coordinate frame fixed in the scene. A second way again uses a single reference frame, but now the frame moves with the camera. In methods that retain earlier camera poses these two choices produce optimizations with identical structure, but adopting the latter robo-centric [31] (or, strictly, sensor-centric) approach is found to improve the consistency of EKF-based SLAM [31], [32] by reducing the degree of systemic nonlinearity.

A third way, more recently introduced, is the relative frame representation [25], [26], [30], applicable only to algorithms that estimate at least some of the camera's previous trajectory—so, good for bundle adjustment (e.g., [12]) and sliding window filters (e.g., [33]), but not for the EKF, where the past camera trajectory is marginalized out. In this representation, both keyframe poses and 3D landmark positions are specified relative to other nearby keyframes.

Using Fig. 1 to establish notation, a landmark $X^i_k$ is indexed by a unique identifier $i$ and by the keyframe $k = \pi(i)$ relative to which its position is specified. (Often, but not uniquely, this is the earlier keyframe in the pair that first triangulated the landmark.) A measurement of landmark $i$ made in keyframe $j$ is specified by $z_{ij}$. The pose of keyframe $k$ is represented in two ways. In the parameter set for optimization, it is denoted as the 6-vector twist or screw $K^k_a$, indexed by its unique keyframe identifier $k$ and its parent frame $a = \pi(k)$ to which it is referenced. Often though, it is the euclidean transformation transferring points from child to parent keyframe that is of interest. For example, to obtain the position of $X^i$ in keyframe $j$ one uses the chain of parent-child transformations $[X^i_j\;1]^\top = \mathrm{E}_{ja}\mathrm{E}_{ab}\mathrm{E}_{bk}[X^i_k\;1]^\top$.
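To make the chained notation concrete, the following sketch composes 4x4 homogeneous child-to-parent transforms along the parent chain to express a landmark stored in keyframe k in the coordinates of an ancestor keyframe j, as in the chain above. The dictionary-based tree and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def transform_chain(parent, E_child_to_parent, k, j):
    """Compose E_{j<-k} by walking from keyframe k up the parent tree to keyframe j.

    parent[c] gives the parent keyframe of c; E_child_to_parent[c] is the 4x4
    transform taking homogeneous points from frame c to frame parent[c].
    Assumes j is an ancestor of k, as in the chain E_ja E_ab E_bk of the text.
    """
    E = np.eye(4)
    node = k
    while node != j:
        E = E_child_to_parent[node] @ E   # prepend each child-to-parent step
        node = parent[node]
    return E

def landmark_in_frame(X_k, parent, E_child_to_parent, k, j):
    """Return landmark position X^i_k (stored in frame k) expressed in frame j."""
    X_h = np.append(X_k, 1.0)             # homogeneous 4-vector [X 1]^T
    return (transform_chain(parent, E_child_to_parent, k, j) @ X_h)[:3]
```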

2.1 Parents, Children and Loops

Assigning parent-child relationships forms a tree of keyframes whose management is central to the success of the relative frame approach. To give the sparsest measurement Jacobians, the chain of frames involved between that in which an image measurement occurs and that to which the corresponding landmark is referred must be as short as possible.

Fig. 1. Examples of keyframe subtrees in the relative frame representation. The arrow directions are from parent to child keyframe. Keyframes j and b act as the local roots in the two cases, respectively.

One contribution to this is the routine production of shallow and broad tree structures typified by that in Fig. 2a. Here, this involves search with one or two stages. Given a new keyframe, its parent is chosen as the keyframe that already has at least one child and has the most covisible landmarks, provided this number exceeds a threshold. If this test fails, the parent is chosen as the keyframe with the most covisible landmarks, irrespective of whether it is already a parent.
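One plausible reading of this two-stage rule as code is sketched below; the covisibility threshold and the data layout are assumptions.

```python
def choose_parent(new_kf_landmarks, keyframes, min_covisible=20):
    """Pick a parent for a new keyframe following the two-stage rule.

    keyframes: dict mapping keyframe id -> dict with keys
        'landmarks' (set of landmark ids seen) and 'has_child' (bool).
    min_covisible: assumed threshold on shared landmarks.
    Stage 1 prefers keyframes that are already parents, keeping the tree
    broad and shallow; stage 2 falls back to the best covisibility overall.
    """
    def covis(kf):
        return len(new_kf_landmarks & kf['landmarks'])

    # Stage 1: existing parents with enough covisible landmarks.
    parents = {k: covis(v) for k, v in keyframes.items() if v['has_child']}
    if parents:
        best = max(parents, key=parents.get)
        if parents[best] >= min_covisible:
            return best

    # Stage 2: any keyframe with the most covisible landmarks.
    return max(keyframes, key=lambda k: covis(keyframes[k]))
```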

The second contribution, used in the stereo work of [20], [26], [27], is loop closure. Consider the camera completing a loop, as illustrated in Fig. 2b. An extra parent-child link can be inserted between the loop's end and start, introducing a new freedom to readjust the tree. With stereo, closure is straightforward: The true forward and backward transformations across the link are always inverses (of course), and because scale is observable so should be their estimates. However, after completing a loop with a single camera (Fig. 2c), scale drift may require the estimates of the same transformation to differ between the outset and completion of the loop. Loops are therefore not closed in this work, leading to a somewhat suboptimal speed of operation. (Note that loop closure in the monocular case does not eliminate drift, but makes it scale with distance from the origin rather than with distance traveled [34].)

2.2 Relative Frame Bundle Adjustment

Bundle adjustment [12], [13], [14] seeks the optimal set of 3D landmark positions and camera poses $\{X, K\}$ by minimizing a robust cost function $\rho$ of the scale-normalized image error $e_{ij}/\sigma_{ij}$,

$$\{X, K\} = \arg\min_{\{X,K\}} \sum_i \sum_j v_{ij}\,\rho(e_{ij}/\sigma_{ij}), \qquad (1)$$

where the visibility flag

$$v_{ij} = \begin{cases} 1 & \text{if point } i \text{ is visible and matched in } j \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

As usual, function $\rho$ is an M-estimator and the minimization is performed using the Levenberg-Marquardt algorithm (LM) [35], [36].

In global frame adjustment, all parameters $\{X^i_0, K^j_0\}$, $i = 1 \ldots L$, $j = 1 \ldots K$, are referred to a single global coordinate frame $K_0$, which is itself not involved in the minimization. A measurement $z_{ij} = [x, y]^\top = [u/w, v/w]^\top$ of landmark $i$ in keyframe $j$ is assumed to arise from perspective projection:

$$[z_{ij}\;1]^\top \propto [u\;v\;w]^\top_{ij} = \mathrm{C}\,[\mathrm{I}|\mathbf{0}]\,\mathrm{E}_{j0}\,[X^i_0\;1]^\top, \qquad (3)$$

where $\mathrm{C}$ is the known intrinsic camera calibration. In practice, a correction for radial distortion is also applied. Writing the predicted projection as $z(X^i_0, K^j_0)$, the image error is then $e_{ij} = |z_{ij} - z(X^i_0, K^j_0)|$. The measurement Jacobian $\mathrm{J}$, the matrix of derivatives of measurements with respect to parameters, has just two blocks per measurement, one $2 \times 3$ block $\partial z_{ij}/\partial X^i_0$ and another $2 \times 6$ block $\partial z_{ij}/\partial K^j_0$ for derivatives with respect to the landmark's 3D position and keyframe pose referred to the global frame.

In contrast, in RBA each landmark is referred to a nearby keyframe, and each keyframe is referred to its parent so that the parameter set is $\{X^i_{\pi(i)}, K^j_{\pi(j)}\}$. An image measurement in keyframe $j$ of landmark $i$ which is referred to keyframe $k$ depends on the relative poses between it and the reference frame:

$$[z_{ij}\;1]^\top \propto \mathrm{C}\,[\mathrm{I}|\mathbf{0}]\,\mathrm{E}_{ja}\mathrm{E}_{a\cdot}\cdots\mathrm{E}_{\cdot k}\,[X^i_k\;1]^\top, \qquad (4)$$

as does the image error $e_{ij} = \|z_{ij} - z(X^i_k, \{K\})\|$. As noted earlier, because a particular landmark tends to be viewed from a subset of the keyframes that are near to each other, the length of these chains is small (an upper limit should be the observation track length). Thus, while RBA's Jacobian is somewhat denser than that in the global frame case, it is still highly sparse in absolute terms.

The Jacobian has a similar occupancy structure to the global version for derivatives $\partial z_{ij}/\partial X^i_k$, but that for derivatives with respect to each keyframe is more complicated in general. If there are $n$ keyframes involved traversing the local tree between landmark and observation, there will be $(n-1)$ blocks in the Jacobian—the "missing" block arises from the lack of dependency on the keyframe which acts as the local root. Note this keyframe can, in different examples, be anywhere in the chain between landmark and observation. In both examples in Fig. 1, there are four keyframes, and therefore three blocks. In Fig. 1a there is no dependency on $K^j$, whereas in Fig. 1b there is none on $K^b$. Returning to Fig. 1a as an example, the derivatives with respect to $K^b$ are evaluated by finding the change in landmark position in the observation frame. The small change in the pose of "b" must be expressed in the parent frame "a" so that

$$\begin{bmatrix} X^i_j + \delta X^i_j \\ 1 \end{bmatrix} = \mathrm{E}_{ja}\,\exp\!\big(\delta K^b_a\big)\,\mathrm{E}_{ab}\mathrm{E}_{bk} \begin{bmatrix} X^i_k \\ 1 \end{bmatrix}. \qquad (5)$$

To first order

$$\exp\!\big(\delta K^b_a\big) \approx \mathrm{I} + \sum_{g=1}^{6} \big(\delta K^b_a\big)_g \mathrm{G}_g,$$

where $(\delta K^b_a)_g$ is the $g$th component of $\delta K^b_a$ and the six $4 \times 4$ matrices $\mathrm{G}_{1,\ldots,6}$ are the generators of SE(3) [37], so

$$\begin{bmatrix} \delta X^i_j \\ 0 \end{bmatrix} = \mathrm{E}_{ja} \left[ \sum_{g=1}^{6} \big(\delta K^b_a\big)_g \mathrm{G}_g \right] \mathrm{E}_{ab}\mathrm{E}_{bk} \begin{bmatrix} X^i_k \\ 1 \end{bmatrix}. \qquad (6)$$

Writing $\mathbf{u}_{ij} = [u\;v\;w]^\top_{ij}$ and $\delta\mathbf{u}_{ij} = \mathrm{C}\,\delta X^i_j$, the 3-vector

$$\frac{\partial \mathbf{u}_{ij}}{\partial (K^b_a)_g} = \mathrm{C}\,[\mathrm{I}|\mathbf{0}]\,\mathrm{E}_{ja}\mathrm{G}_g\mathrm{E}_{ab}\mathrm{E}_{bk} \begin{bmatrix} X^i_k \\ 1 \end{bmatrix}, \qquad (7)$$

from which the two components of $\partial z_{ij}/\partial (K^b_a)_g$ are found as

$$\frac{\partial x_{ij}}{\partial (K^b_a)_g} = \frac{1}{w_{ij}}\left(\frac{\partial u_{ij}}{\partial (K^b_a)_g}\right) - \frac{x_{ij}}{w_{ij}}\left(\frac{\partial w_{ij}}{\partial (K^b_a)_g}\right), \qquad (8)$$

and similarly for $y_{ij}$. As a second example, were the frame parentage arranged as in Fig. 1b one would find

$$\begin{bmatrix} X^i_j + \delta X^i_j \\ 1 \end{bmatrix} = \mathrm{E}_{ja}\mathrm{E}_{ab}\mathrm{E}_{bk}\,\exp\!\big(-\delta K^b_k\big) \begin{bmatrix} X^i_k \\ 1 \end{bmatrix}, \qquad (9)$$

and then

$$\frac{\partial \mathbf{u}_{ij}}{\partial (K^b_k)_g} = \mathrm{C}\,[\mathrm{I}|\mathbf{0}]\,\mathrm{E}_{ja}\mathrm{E}_{ab}\mathrm{E}_{bk}\big(-\mathrm{G}_g\big) \begin{bmatrix} X^i_k \\ 1 \end{bmatrix}. \qquad (10)$$

Fig. 2. (a) The method of parent-child assignment aims to produce broad and shallow trees. (b) Loop closure is made straightforward in stereo because scale is observable. (c) Scale is not observable in monocular vision and, on return to the same area, estimates of transformations are no longer valid.
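The derivative blocks in (7) and (8) are mechanical to evaluate once the SE(3) generators are written out. The sketch below follows the Fig. 1a arrangement; the helper names and the dense 4x4 representation are our own choices for illustration, not the paper's code.

```python
import numpy as np

def se3_generators():
    """The six 4x4 generators of SE(3): three translations, then three rotations."""
    G = []
    for axis in range(3):                      # translation generators
        g = np.zeros((4, 4)); g[axis, 3] = 1.0
        G.append(g)
    for (i, j) in [(2, 1), (0, 2), (1, 0)]:    # rotation generators about x, y, z
        g = np.zeros((4, 4)); g[i, j] = 1.0; g[j, i] = -1.0
        G.append(g)
    return G

def pose_jacobian_block(C, E_ja, E_ab, E_bk, X_k):
    """2x6 block dz_ij/dK^b_a for the Fig. 1a chain (equations (7) and (8)).

    C is the 3x3 intrinsic matrix, E_* are 4x4 child-to-parent transforms, and
    X_k is the landmark position expressed in frame k.
    """
    P = C @ np.hstack([np.eye(3), np.zeros((3, 1))])   # C [I | 0]
    Xh = np.append(X_k, 1.0)
    u = P @ (E_ja @ E_ab @ E_bk @ Xh)                  # current [u v w]
    J = np.zeros((2, 6))
    for g, G in enumerate(se3_generators()):
        du = P @ (E_ja @ G @ E_ab @ E_bk @ Xh)         # equation (7)
        J[0, g] = du[0] / u[2] - (u[0] / u[2]) * du[2] / u[2]   # equation (8), x component
        J[1, g] = du[1] / u[2] - (u[1] / u[2]) * du[2] / u[2]   # and similarly for y
    return J
```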

3 RBA APPLIED TO PTAM

PTAM is a monocular SLAM algorithm that runs thread-parallel processes, one to track the camera at every frame on the assumption that the 3D map is fixed, and another to grow and optimize the map using observations from selected frames in a bundle adjustment [29], [38], [39]. We first apply relative frame bundle adjustment in the mapping thread of PTAM as a replacement for its usual global frame adjustment, but without exploiting it fully. There are consequential but trifling changes in the camera tracking thread concerning projection of map points.

3.1 Map Initialization

The root keyframe $K_0$ is selected by hand and its pose set as $[\mathrm{R}_0|\mathbf{t}_0] = [\mathrm{I}|\mathbf{0}]$. The camera is (predominantly) translated laterally by a distance $D \approx 10$ cm, an image captured for the second keyframe $K_1$, distinctive features found and matched, and robust homography computation and decomposition [40], [41] applied to derive its pose $[\mathrm{R}_1|\mathbf{t}_1]$. The features used in the homography computation are triangulated, and a bundle adjustment (see footnote 1) carried out to refine the map and the second keyframe's pose. Pairs of matched FAST corners [42], [43] from the two images are then triangulated to further populate the initial map. The scale of the map cannot be recovered using a monocular camera, but a working value is set using $|\mathbf{t}_1 - \mathbf{t}_0| = D$.
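A two-view bootstrap along these lines can be sketched with OpenCV's homography routines. This is a minimal illustration under assumptions: the matched point arrays, the intrinsic matrix, and the baseline D are supplied by the caller, and disambiguation of the four decomposition candidates (by requiring positive depths in both views) is only indicated by a comment.

```python
import numpy as np
import cv2

def initialise_map(pts0, pts1, K, D=0.10):
    """Two-view bootstrap: homography, decomposition, triangulation, scale fix.

    pts0, pts1: Nx2 float32 arrays of matched image points in keyframes K0, K1.
    K: 3x3 intrinsic matrix.  D: assumed camera translation in metres.
    """
    H, inliers = cv2.findHomography(pts0, pts1, cv2.RANSAC, 3.0)
    _, Rs, ts, _ = cv2.decomposeHomographyMat(H, K)
    # In practice the four (R, t) candidates are disambiguated by requiring
    # triangulated points to lie in front of both cameras; take the first here.
    R, t = Rs[0], ts[0]
    t = t * (D / np.linalg.norm(t))             # fix the unobservable scale: |t1 - t0| = D

    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t.reshape(3, 1)])
    X_h = cv2.triangulatePoints(P0, P1, pts0[inliers.ravel() == 1].T,
                                pts1[inliers.ravel() == 1].T)
    X = (X_h[:3] / X_h[3]).T                    # Nx3 landmarks in the K0 frame
    return R, t, X
```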

3.2 Camera Tracking Frame-by-Frame

As soon as keyframe $K_1$ is set, the camera's pose $P_0$ is computed at every successive frame using the current map as a fixed resource. In both global and relative frame versions this pose is found relative to $K_0$. On frame capture, a simple motion model predicts the camera's pose relative to the most recent keyframe, and all landmarks that are deemed potentially visible are projected into the image. Using the relative representation, this involves the transformation through the chain of keyframes from the landmark's coordinate system to the camera's coordinate system. It would be hugely wasteful to repeat this calculation for every point. Instead, the transformations between every keyframe and the most recent keyframe are precalculated and stored, then multiplied by the additional transformation into the current camera frame. When a point is projected, it is placed into a set according to the scale at which it is expected to appear in the image.

FAST corners are computed in the image at multiple scales. The first stage of pose calculation attempts to use 60 coarse-scale features to minimize

$$P_0 = \arg\min_{P_0} \sum_i \rho(e_i/\sigma_i), \qquad (11)$$

where $\rho$ is a Huber M-estimator on the scaled reprojection error and $e_i = |z_i - z(P_0, X^i)|$. A second stage pulls in up to 1,000 points at fine scale, spread throughout the image to improve stability. Several measures are taken to monitor tracking performance and to relocalize the camera if tracking fails [38]. These are unchanged from the global frame version of PTAM, and require no further comment.
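The saving from precomputing keyframe-to-latest-keyframe transforms is illustrated below for the projection step; the data layout is an assumption, and radial distortion is omitted.

```python
import numpy as np

def project_map_points(landmarks, E_latest_from_kf, E_cam_from_latest, C):
    """Project map points into the current image (a sketch of Section 3.2).

    landmarks: list of (ref_kf, X) pairs, X being the 3-vector landmark position
        stored in keyframe ref_kf's coordinates.
    E_latest_from_kf: dict of 4x4 transforms from each keyframe to the most
        recent keyframe, precomputed once per frame rather than per point.
    E_cam_from_latest: predicted 4x4 transform from the latest keyframe to the
        current camera (from the motion model).
    C: 3x3 intrinsic matrix.  Returns a list of (index, [x, y]) for points in
    front of the camera.
    """
    P = C @ np.hstack([np.eye(3), np.zeros((3, 1))])
    out = []
    for i, (ref_kf, X) in enumerate(landmarks):
        E = E_cam_from_latest @ E_latest_from_kf[ref_kf]   # one chained transform
        u = P @ (E @ np.append(X, 1.0))
        if u[2] > 0:                                       # keep points in front
            out.append((i, u[:2] / u[2]))
    return out
```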

3.3 Selecting Frames as Keyframes

As the camera moves around the scene, the current frame is selected as a keyframe whenever

1. tracking is deemed to be good,
2. the camera has translated more than a minimum distance from any other keyframe,
3. the current queue of keyframes is less than three long, and
4. at least 20 frames have elapsed since the last keyframe was selected.

The minimum distance in condition 2 depends on both the euclidean distance between the frames and the mean depth of the points in the scene, and was investigated in [44]. Again, these measures are unchanged from the global method.
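Collected into a single predicate, the selection test might look as follows; the depth-scaled distance threshold is an assumed form, with the constants from conditions 3 and 4 retained.

```python
import numpy as np

def should_add_keyframe(tracking_good, cam_position, keyframe_positions,
                        mean_scene_depth, queue_length, frames_since_last,
                        min_dist_frac=0.1, min_frame_gap=20, max_queue=3):
    """Return True when the current frame should become a keyframe.

    min_dist_frac scales the minimum translation by the mean scene depth
    (an assumed form of the depth-dependent test of condition 2);
    min_frame_gap = 20 and max_queue = 3 follow conditions 4 and 3.
    """
    if not tracking_good:                              # condition 1
        return False
    if frames_since_last < min_frame_gap:              # condition 4
        return False
    if queue_length >= max_queue:                      # condition 3
        return False
    min_dist = min_dist_frac * mean_scene_depth        # condition 2
    dists = [np.linalg.norm(np.asarray(cam_position) - np.asarray(p))
             for p in keyframe_positions]
    return min(dists) > min_dist
```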

3.4 Updating the Map

Given a new keyframe, new 3D landmarks are added to the map by selecting visually salient and unmatched image features that are distant from existing matched measurements. Matches are sought in the spatially nearest keyframe and the 3D positions of successful matches are found in the new keyframe's coordinates by triangulation. In the relative frame representation, it is this 3D position that is stored in the map.

The mapping process integrates the new keyframe into the map in two stages. A local bundle adjustment involving only the new keyframe, its four spatially closest neighbors, and landmarks visible from them is performed first, and the output used as starting values for a full adjustment including all keyframes, landmarks, and measurements.


Footnote 1: With just two keyframes, relative and global frame bundle adjustments are identical.


4 OPTIMIZATIONS COMPARED

Aspects of the optimizations in relative and global frame bundle adjustments are now contrasted, using as metrics the difference in reprojection error, the difference in keyframe positions, and the number of iteration steps and time to convergence. The comparison must be treated with caution. First, the relative representation should not be regarded merely as a like-for-like replacement for global frame adjustment. Second, its performance depends on the typical depth of the keyframe tree, which, to exercise the optimization, is here made deeper than usual by restricting the number of children each frame may have, in addition to not closing loops.

Two sequences are used. The first was taken in the street, where the camera spent most of its time exploring new areas and only occasionally returned to a previously explored location. Typical keyframes and maps are shown later in Figs. 9 and 10. In the second sequence, the camera moves over a cluttered desk, repeatedly traversing the same area and viewing the scene from many positions. Keyframe views are in Fig. 3 and the map is shown in Fig. 4a, with the keyframe axes shown more clearly in the enlargement in Fig. 4b. In this sequence, there are many features that are visible in a large fraction of the keyframes, which will produce a more densely populated Jacobian than the outdoor example. To remove any differences arising from nondeterministic elements in camera tracking between keyframes, both sequences were analyzed first using PTAM's global frame bundle adjustment, and the same matches input into RBA.

Fig. 3. Typical keyframes culled from an indoor office sequence used for comparative experiments.

Fig. 4. (a) Indoor map used in the comparisons between global and relative frame bundle adjustments. (b) The same enlarged to show the keyframe coordinate systems.

Fig. 5. (a) The upper panel shows the total reprojection error (in units of $10^2$ pixels) for global (red) and relative frame (blue) adjustment, for the outdoor experiment. The curves are usually indistinguishable and have been purposely offset, but at keyframe 225 it is evident that the two methods can converge to different minima. The lower panel plots the difference in reprojection error, which is small and unbiased. (b) The same for the indoor sequence, but the error is in units of $10^3$ pixels. The difference is again small, though a bias grows through the sequence. Panes (c) and (d) show the averaged normalized euclidean distance between corresponding keyframes after optimization with relative and global bundle adjustment for the outdoor and indoor sequences, respectively.

The main panel of Fig. 5a shows the total reprojection error for global and relative frame adjustment for the outdoor scene as more keyframes (and hence points) are included. The curves are for the most part indistinguishable, and they have been purposely offset on the graph. The difference between the reprojection errors (Global minus RBA) is shown in the lower panel, has zero mean, and is usually less than 0.1 percent of the total reprojection error. However, there is a short period close to the end where the relative representation converges to a nonlocal minimum, indicating 1) that the cost function landscapes differ and 2) that, at least with somewhat longer chains than usual, RBA's can be more challenging. Fig. 5b shows the same for the indoor sequence. Although the difference is again indiscernible in the upper graph, the lower shows that RBA yields an increasingly large total reprojection error. The optimization for the indoor scene is made more difficult by the loopy nature of the camera trajectory—with longer chains of dependencies and a correspondingly denser Jacobian. Errors accumulate in the chain of transformations, limiting the reduction in reprojection error during optimization. Although the error is rising with the number of keyframes, it is important here not to confuse the number of keyframes with time. Were the camera to continue to move round the same scene, few new keyframes would be inserted. The plot of error against time would flatten off.

The second metric used for comparison is the average euclidean distance between corresponding keyframe positions estimated by global and relative bundle adjustment. Some care is needed to make a fair comparison as either the relative frame positions need to be transferred into the global frame or vice versa. Using the first method is unsatisfactory because errors arising from estimation or numerical imprecision in the pose of one keyframe would be applied to all the following keyframes. The comparisons are therefore carried out by transforming the global measurements into relative coordinates. The translation vector between each child-parent keyframe pair, $k$ and $\pi(k)$, is found directly as $\Delta\mathbf{t}^{\pi(k)k}_R$ in the parent frame using the relative representation, and as $\Delta\mathbf{t}^{\pi(k)k}_G$ using global frame values transformed into the parent frame. The average over $k$ of $|\Delta\mathbf{t}^{\pi(k)k}_R - \Delta\mathbf{t}^{\pi(k)k}_G| / |\Delta\mathbf{t}^{\pi(k)k}_R|$ is used as the normalized euclidean distance.

Figs. 5c and 5d show this distance for the outdoor and indoor sequences. In the former, it settles to a fixed value as further keyframes are added, reflecting the decoupled nature of this problem. Each keyframe contributes a similar error to the total, and thus the average is a constant. In the indoor case, the value is still small, but it remains unclear whether there is a bound on the size of the error.
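The normalized distance itself is a one-line average once the per-pair translation vectors have been extracted from both solutions; a small sketch follows, with the extraction assumed already done.

```python
import numpy as np

def normalised_keyframe_distance(dt_relative, dt_global):
    """Average of |dt_R - dt_G| / |dt_R| over all child-parent keyframe pairs.

    dt_relative, dt_global: dicts mapping keyframe id k to the 3-vector
    translation from k to its parent, expressed in the parent frame, taken
    from the relative and the (transformed) global solutions respectively.
    """
    ratios = [np.linalg.norm(dt_relative[k] - dt_global[k]) / np.linalg.norm(dt_relative[k])
              for k in dt_relative]
    return float(np.mean(ratios))
```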

The times and number of steps taken for the adjustments to converge are compared in Fig. 6. Although the time needed to complete an adjustment increases linearly with the number of landmarks and, eventually, as the cube of the number of keyframes $K$, in the regime used here the complexity of both methods is still $O(K^2)$, dominated by the calculation of the measurement Jacobian in LM. The mark-up in time for relative bundle adjustment is here due merely to the long chain lengths imposed on this experiment. While the timings for the global adjustment increase quite smoothly, there can be significant jumps in the relative frame timings when insertion of a new keyframe involving the reobservation of some neglected landmarks causes a substantial change to the Jacobian by creating long kinematic chains. The total numbers of steps to convergence given in Figs. 6c and 6d are well correlated with the times, providing reassurance that the jumps are not artifacts of computer loading or memory utilization. The results emphasize the importance of using short chains.

4.1 Keyframe Limit in Real-Time RBA

To explore the performance of the real-time version of RBA, live runs were performed both outdoors and indoors using the wearable AR system described in [45] that has a 2.2 GHz dual core processor. Statistics were accumulated on the time periods between keyframe addition and the point at which processing became sub-frame-rate.

As before, the outdoor sequence was dominated by steady exploratory camera motion, and the addition of keyframes occurred quite uniformly throughout the sequence. The median time between addition of keyframes was 1.9 s, close to the mean of 2.8 s. Some 40 keyframes could be adjusted in the local map thread during the interkeyframe period. In the indoor experiment, however, the camera was at first moved quickly around the environment, rapidly adding keyframes and building the 3D map. The camera was then moved more steadily within that same restricted environment, and the existing map proved largely sufficient for tracking for the remaining time. The motions were deliberately toward and away from the scene so that the extent of the map was not greatly increased. For this sequence the median time of 0.7 s between keyframe addition was substantially lower than the mean of 4.3 s. During the early phase of this experiment, the processor struggled to adjust 20 keyframes in the local map, but as the rate of exploration diminished that number increased to 40 keyframes. The values derived are of course scene and processor dependent, but they provide evidence for our introductory remarks on the need to mitigate the effects of complexity.

Fig. 6. (a, b) The times taken for bundle adjustment with increasing number of keyframes for the global frame (lower, red) and relative frame (upper, blue) for the outdoor and indoor sequences. The jumps in time are correlated with the number of steps to reach convergence, as apparent from (c) and (d), where again the points with higher values (blue) are for relative frame adjustment.

5 CONDITIONALLY INDEPENDENT SPLIT MAPPING

Implementations of bundle adjustment commonly integrate a keyframe into the existing map by preceding the full adjustment with one that involves only the new keyframe, a fixed number of nearest neighbors, and their associated scene points. The preadjustment runs in constant time, but the values and their errors are not guaranteed to be consistent with their finally adjusted values.

In [20], [26], [27], on-average constant time updating is achieved using RBA by performing breadth-first search around a newly added keyframe to discover keyframes for which the average reprojection error is affected by an above-threshold amount. Remarkably, that threshold can be set as small as $10^{-6}$ while still keeping the number of keyframes and map points to be adjusted manageable at frame rate. The remainder of the map need not necessarily be adjusted: By monitoring "fading influence" RBA maintains consistency.

Here, instead, we report on a variant of the approach, perhaps more suited to smaller scale maps where the bulk of the map is still in flux. A more explicit and arbitrary cut is made between the most recently acquired keyframes and the bulk of the map, and both are continually optimized. Because no check is made on fading influence, conditional independence of the optimizations can only be ensured by removing certain measurements, but temporarily.

Fig. 7. The Bayes' net for a simple bundle adjustment containing three keyframes, one set of landmarks observed by keyframes 1 and 3, and a set of landmarks observed by all the keyframes.

Consider the Bayes' net for the toy bundle adjustment problem shown in Fig. 7. Assume that keyframes are labeled in order of creation, so the most recent are to the right, and that the parent-child links are as shown. Measurements $z_{ij}$ are conditioned on a landmark $X^i$ and the pose $K^j$ of the camera observing that landmark. Two types of conditional independences are apparent, viz,

$$P(K^1, K^2, K^3 \mid X^1) = P(K^1 \mid X^1)\,P(K^2 \mid X^1)\,P(K^3 \mid X^1), \qquad (12)$$

and similarly for $X^2$, and

$$P(X^1, X^2 \mid K^1) = P(X^1 \mid K^1)\,P(X^2 \mid K^1), \qquad (13)$$

and similarly for $K^{2,3}$. The first set allows the camera pose to be calculated if the landmarks are known (used in the camera tracking stage of PTAM, for example), and the second indicates that landmark estimations are independent given the camera trajectory (the basis of FastSLAM [4]).

These hold whether or not measurement $z_{13}$ is present. However, temporarily ignoring $z_{13}$ exposes another conditional independence. Keyframe 1 does not then observe any of the landmarks that keyframe 3 does, and vice versa, making the keyframes and landmarks left and right of keyframe 2 conditionally independent so that

$$P(X^1, K^1, X^2, K^3 \mid K^2) = P(X^1, K^1 \mid K^2)\,P(X^2, K^3 \mid K^2). \qquad (14)$$

Provided measurements that "cross" the partition keyframe are excluded, the optimization can be split and both sides optimized simultaneously with a guarantee that the results of each will be consistent with any full optimization of the whole. The partition keyframe appears to enter both optimizations, but while it is fully involved in the optimization to the left, it makes only a tacit appearance in the optimization to the right as the root frame relative to which other keyframes to the right are adjusted.
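Operationally, the split is a filter over the measurement list against the partition: anything coupling a bulk keyframe to a local landmark, or vice versa, is set aside until the partition moves. The sketch below is our reading of that rule rather than the authors' code.

```python
def split_measurements(measurements, local_keyframes, bulk_keyframes,
                       local_landmarks, bulk_landmarks):
    """Partition measurements so the two adjustments are conditionally independent.

    measurements: iterable of (landmark_id, keyframe_id) pairs.
    Returns (local, bulk, excluded): cross-partition measurements are excluded
    only temporarily and are reinstated when the bulk map absorbs the local one.
    """
    local, bulk, excluded = [], [], []
    for lm, kf in measurements:
        if kf in local_keyframes and lm in local_landmarks:
            local.append((lm, kf))
        elif kf in bulk_keyframes and lm in bulk_landmarks:
            bulk.append((lm, kf))
        else:
            excluded.append((lm, kf))        # crosses the partition keyframe
    return local, bulk, excluded
```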

Now imagine many more keyframes and landmarks lie to the left of the partition, making up the bulk of the existing map, and a few more to the right making up the exploratory map local to the camera. The two parts have different characteristics and requirements. The bulk adjustment will complete comparatively slowly, but the local adjustment should complete within the typical time between addition of keyframes. The local adjustment therefore needs to be able to run multiple times as new keyframes are added, preserving its conditional independence from the bulk adjustment. When the bulk adjustment completes, the local adjustment should incorporate the new results seamlessly and be able to hand on the older parts of the local map to the bulk adjuster—the partition needs to be easily moved on.

The proposed method responds to these requirements as follows. First, each time the local adjustment completes, additional keyframes can be added at will to the right of the partition and, provided only appropriate measurements are used, conditional independence is maintained. The rules already established for choice of parent will almost ensure that a new keyframe is attached to the tree on the leaf side of the partition, but care is taken to handle pathological cases where the preferred parent is in the bulk map. Next, when the bulk adjustment completes, the local map does not require updating because 1) its results are relative to the partition and 2) it does not involve landmarks from the bulk. The updates actually take effect when the bulk adjustment takes over the next chunk of the local map. Third, although the choice of partition is held constant during the period of one update of the bulk map (and hence the multiple updates of the local map during that period), at the moment when both have completed, the local map can be regrown from the leaf—the newest keyframe—toward the root, as will be explained in Section 6.2. The new local map is determined by the current camera position, and not by the past partition keyframe. Last, we note that measurements that are shared are only temporarily excluded, and not dropped entirely. As soon as the local map moves on they are reinstated in the computation of the bulk map.

6 IMPLEMENTING SCISM WITH RBA-PTAM

SCISM is built on three threaded parallel processes outlined in Fig. 8. One performs camera tracking or resectioning, the second and most interesting thread determines the extent of the local map, adds new keyframes to it, and adjusts it, and the third adjusts the remaining bulk of the map. A set of interfaces mediates the provision and exchange of data via the map and none of the processes needs to interact directly with another during execution.

6.1 Camera Tracking

The camera tracking stage is unchanged from that described in Section 3.2. Note, though, that the camera continues to retain access to the entire map, both to those landmarks that have undergone full adjustment and to those that have been only locally adjusted. Indeed, it is the latter that are the more important as they are more likely to make up the currently visible scene. The tracker retains responsibility for offering new keyframes to the map expansion thread.

6.2 Local Map Definition

The integration of a new keyframe and discovery of new 3D landmarks is carried out as a precursor to local bundle adjustment. When a new keyframe $K_n$ is offered by the camera tracker, a parent $K_{\pi(n)}$ is assigned to it, and its pose transformed to be relative to the parent. This is followed by a search in the image for matches to all existing landmarks, and an attempt to match any unmatched image features with similarly unmatched features in the spatially nearest keyframe, as described in Section 3.4.

To populate a local map ab initio, before any adjustment is carried out, the keyframe membership list for the local map is initialized as $K_L = \{n\}$, then the parent of the head of the list and that parent's other children and their descendents are prepended. Hence, after the first cycle, $K_L = \{\pi(n), P(n), n\}$, where $P(n)$ denotes the descendents of $\pi(n)$, excluding $n$ and $n$'s descendents, and, after the second, $K_L = \{\pi(\pi(n)), P(\pi(n)), P(n), n\}$, and so on until a minimum size has been exceeded. We use $|K_L|_{\min} = 14$.

The selection of landmarks for the local map is made straightforward by the references that each landmark holds to the frame to which it is referred and to those keyframes from which it is observed. The landmarks considered are those that are both referred to one of the keyframes selected for local adjustment and that are observed in at least one other local keyframe. That is, the local landmark set becomes $X_L = \{X^i_j : j \in K_L,\ \exists z_{ik} : k \in K_L, k \neq j\}$. The set of measurements can be chosen in a similarly straightforward fashion and without reference to the bulk map as $Z_L = \{z_{ik} : X^i \in X_L,\ k \in K_L\}$.
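Growing the local keyframe list from the leaf toward the root and then gathering the landmark and measurement sets follows directly from these definitions; the tree and observation structures below are assumptions for illustration.

```python
def build_local_map(n, parent, children, observations, min_size=14):
    """Construct K_L, X_L and Z_L around a new keyframe n (Section 6.2).

    parent[k] is k's parent, children[k] its children, and observations is a
    dict mapping landmark id -> (ref_keyframe, set of observing keyframes).
    """
    def descendents(k, stop=None):
        out = []
        for c in children.get(k, []):
            if c == stop:
                continue
            out.append(c)
            out.extend(descendents(c))
        return out

    K_L, head = [n], n
    while len(K_L) < min_size and head in parent:
        p = parent[head]
        # Prepend the parent and its other subtrees, excluding head's subtree.
        K_L = [p] + descendents(p, stop=head) + K_L
        head = p

    kl = set(K_L)
    X_L = {i for i, (ref, obs) in observations.items()
           if ref in kl and any(k in kl and k != ref for k in obs)}
    Z_L = {(i, k) for i in X_L for k in observations[i][1] if k in kl}
    return K_L, X_L, Z_L
```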

6.3 Bulk Map Definition

The set of keyframes $K_B$ for bulk adjustment comprises the remainder of the tree of keyframes—i.e., the complete tree, including the partition keyframe $\mathrm{head}(K_L)$, but lopping off all descendents below it. The bulk landmark set could be written as $X_B = \{X^i_j : j \in K_B,\ \exists z_{ik} : k \in K_B, k \neq j\}$, but because these landmarks must have been instantiated before the local keyframes were even created, the second measurement is certain to exist. Thus, $X_B = \{X^i_j : j \in K_B\}$ is a sufficient definition, and the set of measurements is $Z_B = \{z_{ik} : X^i \in X_B,\ k \in K_B\}$. Note that at the early stages of a run, the local map is likely to consume all the keyframes, and $K_B = \emptyset$. The bulk adjustment is not run, but this is of no special consequence.

6.4 Local and Bulk Adjustments and Interaction

With both maps defined and measurements selected, the local RBA is started. The head of $K_L$ defines the partition frame, and is hence the local adjustment's root keyframe. Provided $|K_B| > 1$, the bulk adjustment is also started.

Whenever a local adjustment completes, a check is made on the status of the bulk adjustment. There are four possibilities.

1. The bulk adjustment is still running, and the local map size is within its maximum size. The next local adjustment will keep the partition keyframe the same. The next new keyframe, $n'$ say, is attached at an appropriate place in the local tree, and the keyframe list $K_L$ is simply updated rather than being built from scratch, as in Section 6.2. New scene structure is sought, and the local RBA is rerun.

2. The bulk adjustment is still running, but the local map has reached its maximum size. On average, a local adjustment should finish within the mean time between addition of keyframes, placing an upper limit on the size of the local map of about $|K_L|_{\max} = 40$, say. If the partition was only allowed to be moved when the bulk adjustment completed, the size of the bulk map would be limited. (An estimate can be found by supposing local and bulk complexities are both quadratic in the keyframes and that, on moving the partition, the local map routinely drops in size by $d$ keyframes. The maximum bulk size is then $|K_B|_{\max} \approx |K_L|_{\max}\sqrt{d}$, or around 250 keyframes.) Instead, the partition is moved by recalculating the local map list and placing some frames, landmarks, and measurements in limbo between both optimizations. During these times there are effectively three conditionally independent maps. When the bulk map results arrive, they update both the limbo map and, either directly or via limbo keyframes, the local map. The limbo map is immediately absorbed into the next optimization of the bulk map.

3. The bulk process has completed. The local process copies all results back into the overall map. It then rebuilds the local keyframe list $K_L$ starting from $\{n'\}$ and thence the bulk keyframe list, as described in Sections 6.2 and 6.3. Both local and bulk RBAs are restarted.

4. The bulk process has never run. There are no bulk results to handle. The local process again rebuilds the local map from scratch. It will restart local adjustment and, if $|K_B| > 1$, will start the bulk adjustment for the first time.

Fig. 8. Skeleton code for SCISM's three threads managing (a) frame-by-frame tracking, (b) local adjustment around the camera, and (c) bulk adjustment of the remainder of the map.
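The local mapping thread's decision logic then reduces to a small dispatch over the four cases. The sketch below is a schematic rendering under assumed helper names on a hypothetical state object, not a transcription of Fig. 8b.

```python
MAX_LOCAL_KEYFRAMES = 40   # |K_L|_max, as suggested in the text

def on_local_adjustment_complete(state):
    """Dispatch on the status of the bulk adjustment (the four cases of Section 6.4).

    `state` is assumed to expose the bulk-adjustment status, the current local
    keyframe list K_L, and helpers for the actions described in the text.
    """
    if state.bulk_running() and len(state.K_L) < MAX_LOCAL_KEYFRAMES:
        # Case 1: keep the partition; attach the next keyframe and rerun local RBA.
        state.attach_new_keyframe_to_local_tree()
        state.run_local_rba()
    elif state.bulk_running():
        # Case 2: local map full; move the partition, placing some keyframes,
        # landmarks, and measurements in limbo between the two optimizations.
        state.move_partition_with_limbo()
        state.run_local_rba()
    elif state.bulk_has_completed():
        # Case 3: fold bulk results into the map, rebuild K_L from the newest
        # keyframe and K_B from the remainder, then restart both adjustments.
        state.absorb_bulk_results()
        state.rebuild_local_and_bulk_maps()
        state.run_local_rba()
        state.run_bulk_rba()
    else:
        # Case 4: the bulk adjustment has never run; rebuild the local map and
        # start the bulk adjustment if there is more than one bulk keyframe.
        state.rebuild_local_and_bulk_maps()
        state.run_local_rba()
        if state.bulk_map_size() > 1:
            state.run_bulk_rba()
```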

7 EXPERIMENTAL RESULTS

The ability of SCISM to build and explore substantial maps in real time is illustrated in Fig. 9. Mapping was undertaken using the wearable camera system [45] mentioned earlier. The camera was moved around the scene at a walking pace, looking at the buildings on the outside of the path shown in (a), at a distance of between 5 and 20 meters from them. The map derived using SCISM is shown in (b) from a 45 degree viewpoint (the axes on the left are aligned with the vertical). Several large scale features can be matched between view and map—the building frontage moving away from the camera at the middle right and the continuation of the roadway at the top right are the most obvious. A lateral view through the map is shown in Fig. 9c.

A second shorter but more loopy run was made walking the camera around the same area three times. The second pass was sufficiently different from the first for additional keyframes to be required for the map, but in the third the camera was close enough to the existing keyframes not to require any more. Selected keyframes and aerial views of the evolution of the map created using SCISM from this sequence are shown in Fig. 10. Fig. 11a shows a segment of loop, and Fig. 11b shows the tree structure around this loop enlarged. The emphasis in this work on broad shallow trees is evident.

Fig. 9. A large outdoor map created using the SCISM algorithm. (a) The actual path taken overlaid on an aerial view. (b) The recovered map shown from a 45 degree view. The continuous trail shows the keyframes computed. (c) A lateral orthographic view through the map showing a college frontage. The taller structure to the left is a church tower and spire.

Fig. 10. Various keyframe images and the evolution of the map when using SCISM.

The raw images from the sequence used in Fig. 10 were stored to permit offline investigation of the timing and accuracy of SCISM.

7.1 Timing

Fig. 12a shows the time taken by the local bundle adjustment as a function of the number of keyframes in the entire map. As the local map increases in size, the expected $\approx O(|K_L|^2)$ dependence is evident, including the occasional blip of the sort seen earlier in Fig. 6. However, when $|K_L|$ just exceeds 40, the rebuilding of the local keyframe list moves the partition keyframe to a younger generation. As parents typically have many children, $|K_L|$ drops sharply, and the computation time follows. As mapping continues, this process repeats, slowly increasing the number of keyframes in the optimization and then removing a significant number when the leaf-to-node breadth-first search terminates at a different parent.

7.2 Accuracy

Ground truth is not available for the image sequences, and the metric used to quantify the performance of relative frame bundle adjustment is the total reprojection error. At intervals during offline analysis of the sequence using SCISM, all regular processing was suspended and the total reprojection error for the separated bulk and local maps was calculated, but including, for the sake of fair comparison, the error for the excluded "cross-partition" measurements. Then, a complete bundle adjustment was carried out, including those excluded measurements, and using as a starting point the current solution obtained by SCISM, and the total reprojection error derived. Fig. 12b shows that the differences between the two errors are slight. Where it is discernible, the value for the SCISM solution is, of course, higher than that for the complete optimal solution. While the solution is suboptimal, it is nonetheless consistent.

7.3 Scale Drift

Fig. 12c reveals a characteristic of SCISM that is not evident in the reprojection error. It shows two maps and keyframe trails generated using an adjustment of the complete map (on the left) and using SCISM (on the right), where the camera path undergoes a sharp 90 degree turn. Analysis shows that the keyframes and landmarks become more widely spaced after the turn. Removing measurements as the camera turns encourages the scale in the local estimate to converge to a different value to that in the rest of the map. The problem spreads into later bulk adjustments: The additional scale drift can cause a significant fraction of the cross-partition measurements to be marked as outliers when an attempt is made to restore them to the optimization of the bulk map. Though rotational motion is inherently difficult for monocular SfM, we find that forcing the local map to retain more measurements (and abandon consistency) constrains and preserves the scale better.


Fig. 11. A map enlargement showing keyframe positions and the resulting tree structure. (Magenta/green codes the parent/child end of a link.)

Fig. 12. (a) Time taken for the local bundle adjustment to complete as a function of the total number of keyframes in both maps. The sharp falls correspond to sudden shrinkages of the size of the local map. (b) The total reprojection error as a function of the total number of keyframes in both maps as computed by SCISM and a complete adjustment. (c) Enlargement of the map and keyframe trail in a run involving a sharp 90 degree turn. That on the left is produced using a complete adjustment. On the right it is evident that the exclusion of cross-partition measurements in SCISM, though temporary, causes increased scale drift, with increased keyframe and landmark spacing.


7.4 Comparison with PTAM-Based Submapping

Submapping has been applied to visual monocular SLAM by Clemente et al. [46] using a global coordinate frame and by Eade and Drummond [18] in a method that has something of the relative frame representation to it. They build submaps around nodes with coordinate frames chosen to minimize linearization effects in the EKF, and, whenever the camera moves too far from any existing node, a new node is created and the transformation between it and the previous node is added to the set of transformations to be optimized. However, to our knowledge, submapping has not been applied to a keyframe-based monocular system, and we report briefly on this first. The algorithm is summarized in Fig. 13.

Given the overhead cost associated with splitting and merging maps, submaps in this work are not of fixed size, but are created only when the size and rate of expansion of the current map threaten real-time performance. The map building process monitors both the rate of addition of keyframes and the rate of adjustment completion to predict when the former will outstrip the latter. At that point a new submap is spawned.

The new submap must from the outset provide sufficient structural content for the camera tracking process to continue, and is initialized with the eight keyframes that are nearest to the camera's current location. The submap has the same coordinate frame as the original map to ease the transition of tracking between submaps. The keyframes, and map points visible from them, are copied to the new map, and a transition table is constructed to record the keyframes and map points shared by both maps at their join. The join needs to be represented carefully. It allows both the tracker to transfer between adjacent submaps if the camera is retracing its trajectory, and the submaps to be merged together later.

When map making and tracking move away from a locality, a background process looks for records of camera transitions between two currently unused maps and joins them by computing the optimal similarity transformation using landmarks common to both, and then moving keyframes and points into the unified map.
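The optimal similarity transformation between the shared landmarks can be obtained in closed form with the Horn/Umeyama alignment; a sketch under assumed data structures follows.

```python
import numpy as np

def similarity_from_common_landmarks(P, Q):
    """Closed-form scale, rotation and translation mapping points P onto Q.

    P, Q: Nx3 arrays of the same landmarks expressed in the two submaps'
    frames (Umeyama alignment).  Returns (s, R, t) with Q ~ s * R @ P + t.
    """
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mp, Q - mq
    U, S, Vt = np.linalg.svd(Qc.T @ Pc)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:           # guard against reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (Pc ** 2).sum()
    t = mq - s * R @ mp
    return s, R, t
```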

Figs. 14a, 14b, and 14c show example submaps generatedusing the stored sequence from the earlier experiment, andFig. 14d shows the combined map. As seen in Table 1, themaps have an encouragingly similar number of keyframes,but submapping generates some 30 percent more land-marks. This does not indicate that submapping is better atproducing a map—indeed, quite the opposite. If a newkeyframe is added in a submap that observes an imagefeature that corresponds to a point in a different submap, anew map point will be created instead of an old point beingupdated. This leads to the map created by merging the twomaps having two points where one would suffice. Table 1also shows the mean and maximum computation times for areestimation of the local map (for SCISM) or submap. The


Fig. 13. Skeleton code for submapping.

Fig. 14. (a)-(c) Example submaps and (d) the combined map.

TABLE 1. Comparative Statistics for SCISM and Submapping in the Experiment of Fig. 14

Fig. 15. A comparison of total reprojection errors for the 3D structure calculated by aligning the submaps into a single map with that computed by a global bundle adjustment.


The mean for SCISM is some two times that taken by submapping, but since the local map size is bounded, the maximum time recorded is the lower. Submapping has the opportunity to include many more keyframes in the local estimation.

A more informative comparison is made from total reprojection error summed over all measurements—Fig. 12b for SCISM and Fig. 15 for submapping. Because the two methods may involve different landmarks, keyframes, and measurements, no direct comparison between overall values should be made. Instead, it is their relative performance vis-a-vis the error derived from their respective full adjustments (using all of each method's keyframes, map points, and measurements) that is relevant. It is evident from Fig. 15 that submapping creates an overall map which is sometimes very suboptimal: The reprojection error is reduced by up to 4,000 pixels by the subsequent bundle adjustment. This is significantly worse than the error produced by SCISM, and is caused by the branches in the camera trajectory leaving small submaps that then need reintegration into the main map.
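A hedged sketch of a total-reprojection-error computation of this kind is given below; it is not the authors' code, and the projection model and the choice of per-measurement norm (L2 distance rather than its square) are assumptions of the illustration.

import numpy as np

def total_reprojection_error(measurements, project):
    """measurements: iterable of (keyframe_pose, landmark_xyz, observed_uv);
    project: assumed camera model mapping (pose, xyz) to predicted pixel coordinates."""
    total = 0.0
    for pose, xyz, uv_obs in measurements:
        uv_pred = project(pose, xyz)
        total += np.linalg.norm(np.asarray(uv_pred) - np.asarray(uv_obs))
    return total   # summed over every measurement, in pixels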

8 CONCLUSIONS

The first part of this paper described the application of the relative frame representation within parallel tracking and mapping, a real-time monocular SLAM algorithm. The area requiring most detailed change from the more usual global frame adjustment was in the computation of the measurement Jacobian, which, while remaining sparse, is less so than in the global frame.

Experimental comparison of the relative and global frame methods solving the same full problem (something, we stress, that the relative frame method allows one to avoid) showed that they provide all but identical results in most cases, though with three caveats. First, the paths taken by the two methods to the optimum solution through parameter hyperspace are different and, in some cases, this can lead to convergence to different minima. Second, the variance of the error introduced by machine precision is proportional to the number of transformations in a chain. This reduces the accuracy with which the relative frame representation can specify the global solution. In turn, this increases the average minimum reprojection error that can be achieved. The third caveat is that there can be configurations that make the optimization landscape more difficult in the relative frame. More steps are needed to reach the optimum and each takes longer because of the greater degree of coupling between parameters. (This is reinforced by the observation that the reconvergence times on adding just one new keyframe are similar in both representations—the new error that needs minimizing away is then mostly due to that one pose at the leaf of the tree, and hence at the end of a kinematic chain.)
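As a rough first-order illustration of the second caveat (our own sketch, not a derivation from the experiments): if each relative transform in a chain contributes an independent rounding error of variance $\sigma^2$, then a pose recovered by composing $n$ transforms accumulates

$\operatorname{Var}(e_n) \approx n\,\sigma^2, \qquad \operatorname{std}(e_n) \approx \sigma\sqrt{n},$

so the precision with which a long chain can specify the implied global pose degrades with chain length.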

The second part described and demonstrated a variation on Sibley's approach to using RBA to define a local, exploratory map around the camera which can be optimized in constant time. In SCISM, the division between the local and bulk maps is defined sharply but arbitrarily, rather than using the notion of fading influence. The method selectively removes measurements from the optimization to create two conditionally independent submaps. The first corresponds to the structure in the immediate vicinity of the camera, structure that needs to be updated in constant time. The second corresponds to the remaining structure that is further away and does not need to be computed urgently. The two processes run in separate threads, allowing exploration to continue unhindered while a bulk optimal map is produced in the background. Experimentation has shown that SCISM can produce large maps of everyday scenes in real time. Although before recombination the maps are each suboptimal, they are consistent, and the degree of suboptimality has been shown to usually be negligible and always small compared with a full solution run on the same data.

The approach has been shown to produce results with significantly lower reprojection error than conventional submapping.

Our results reinforce the need to keep transformation chains within keyframe subtrees short, as occurs naturally when RBA is deployed with influence checking [20], [26], [27] and, albeit somewhat less elegantly, with the sharper cut of SCISM. We would also suggest that achieving loop closure with a single camera (whether adopting influence checking or SCISM) is essential if monocular RBA is to compete with its stereo cousin, notwithstanding the latter's greater data bandwidth and computational demands.

ACKNOWLEDGMENTS

This work was partially supported by grants GR/S97774 and EP/D037077 from the UK Engineering and Physical Sciences Research Council, and through an EPSRC Research Studentship to Steven A. Holmes. The authors are grateful to Dr. Georg Klein and Dr. Gabe Sibley for their contributions to the early stages of this work, and particularly to the latter for more recent discussions.

REFERENCES

[1] M.W.M.G. Dissanayake, P.M. Newman, S. Clark, H.F. Durrant-Whyte, and M. Csorba, "A Solution to the Simultaneous Localisation and Map Building (SLAM) Problem," IEEE Trans. Robotics and Automation, vol. 17, no. 3, pp. 229-241, June 2001.
[2] P.M. Newman and J.J. Leonard, "Consistent, Convergent and Constant-Time SLAM," Proc. Int'l Joint Conf. Artificial Intelligence, 2003.
[3] M. Bosse, P. Newman, J. Leonard, and S. Teller, "Simultaneous Localization and Map Building in Large-Scale Cyclic Environments Using the Atlas Framework," Int'l J. Robotics Research, vol. 23, no. 12, pp. 1113-1139, 2004.
[4] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem," Proc. AAAI Nat'l Conf. Artificial Intelligence, 2002.
[5] T. Bailey, J. Nieto, and E. Nebot, "Consistency of the FastSLAM Algorithm," Proc. IEEE Int'l Conf. Robotics and Automation, pp. 424-429, 2006.
[6] R.C. Smith and P. Cheeseman, "On the Representation and Estimation of Spatial Uncertainty," Int'l J. Robotics Research, vol. 5, no. 4, pp. 56-68, 1986.
[7] J.J. Leonard and H.F. Durrant-Whyte, Directed Sonar Sensing for Mobile Robot Navigation. Kluwer Academic, 1992.
[8] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, and A. Calway, "Real-Time and Robust Monocular SLAM Using Predictive Multi-Resolution Descriptors," Proc. Second Int'l Symp. Visual Computing, pp. 276-285, 2006.
[9] T. Bailey, J. Nieto, J. Guivant, M. Stevens, and E. Nebot, "Consistency of the EKF-SLAM Algorithm," Proc. IEEE/RSJ Conf. Intelligent Robots and Systems, pp. 3562-3568, Oct. 2006.
[10] S. Julier and J. Uhlmann, "A Counter Example to the Theory of Simultaneous Localization and Map Building," Proc. IEEE Int'l Conf. Robotics and Automation, pp. 4238-4243, 2001.
[11] S. Holmes, G. Klein, and D.W. Murray, "A Square Root Unscented Kalman Filter for Visual MonoSLAM," Proc. IEEE Int'l Conf. Robotics and Automation, 2008.
[12] C. McGlone, E. Mikhail, and J. Bethel, Manual of Photogrammetry, fifth ed. Am. Soc. of Photogrammetry and Remote Sensing, 2004.
[13] W. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, "Bundle Adjustment—A Modern Synthesis," Proc. Int'l Workshop Vision Algorithms: Theory and Practice, B. Triggs, A. Zisserman, and R. Szeliski, eds., pp. 298-372, 2000.
[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, second ed. Cambridge Univ. Press, 2004.
[15] R.M. Eustice, H. Singh, and J.J. Leonard, "Exactly Sparse Delayed-State Filters for View-Based SLAM," IEEE Trans. Robotics, vol. 22, no. 6, pp. 1100-1114, Dec. 2006.
[16] M. Kaess, A. Ranganathan, and F. Dellaert, "iSAM: Fast Incremental Smoothing and Mapping with Efficient Data Association," Proc. IEEE Int'l Conf. Robotics and Automation, 2007.
[17] P. Pinies and J. Tardos, "Scalable SLAM Building Conditionally Independent Local Maps," Proc. IEEE/RSJ Conf. Intelligent Robots and Systems, pp. 3466-3471, Oct./Nov. 2007.
[18] E. Eade and T. Drummond, "Unified Loop Closing and Recovery for Real Time Monocular SLAM," Proc. 18th British Machine Vision Conf., 2008.
[19] A.J. Davison, I.D. Reid, N.D. Molton, and O. Stasse, "MonoSLAM: Real-Time Single Camera SLAM," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 1052-1067, June 2007.
[20] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid, "RSLAM: A System for Large-Scale Mapping in Constant Time Using Stereo," Int'l J. Computer Vision, vol. 94, no. 2, pp. 198-214, 2011.
[21] K. Ni and F. Dellaert, "Multi-Level Submap Based SLAM Using Nested Dissection," Proc. IEEE/RSJ Conf. Intelligent Robots and Systems, pp. 2558-2565, Oct. 2010.
[22] J.-M. Lim, J.-M. Frahm, and M. Pollefeys, "Online Environment Mapping," Proc. 24th IEEE Conf. Computer Vision and Pattern Recognition, pp. 3489-3496, 2011.
[23] H. Strasdat, J.M. Montiel, and A.J. Davison, "Drift Aware Large Scale Monocular SLAM," Proc. Robotics: Science and Systems Conf., 2010.
[24] K. Ni, D. Steedly, and F. Dellaert, "Out-of-Core Bundle Adjustment for Large-Scale 3D Reconstruction," Proc. 11th IEEE Int'l Conf. Computer Vision, pp. 1-8, 2007.
[25] G. Sibley, "Relative Bundle Adjustment," Technical Report 2307/09, Dept. of Eng. Science, Univ. of Oxford, Jan. 2009.
[26] G. Sibley, C. Mei, I. Reid, and P. Newman, "Adaptive Relative Bundle Adjustment," Proc. Robotics: Science and Systems Conf., 2009.
[27] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid, "A Constant Time Efficient Stereo SLAM System," Proc. 19th British Machine Vision Conf., 2009.
[28] G. Sibley, C. Mei, I. Reid, and P. Newman, "Vast Scale Outdoor Navigation Using Adaptive Relative Bundle Adjustment," Int'l J. Robotics Research, vol. 29, no. 8, pp. 958-980, 2010.
[29] G. Klein and D.W. Murray, "Parallel Tracking and Mapping for Small AR Workspaces," Proc. IEEE/ACM Sixth Int'l Symp. Mixed and Augmented Reality, 2007.
[30] S.A. Holmes, G. Sibley, G. Klein, and D.W. Murray, "Using a Relative Representation in Parallel Tracking and Mapping," Proc. IEEE Int'l Conf. Robotics and Automation, 2009.
[31] J.A. Castellanos, R. Martinez-Cantin, J.D. Tardos, and J. Neira, "Robocentric Map Joining: Improving the Consistency of EKF-SLAM," Robotics and Autonomous Systems, vol. 55, no. 1, pp. 21-29, Jan. 2007.
[32] B. Williams, "Simultaneous Localisation and Mapping Using a Single Camera," DPhil dissertation, Dept. of Eng. Science, Univ. of Oxford, 2009.
[33] G. Sibley, G.S. Sukhatme, and L. Matthies, "Constant Time Sliding Window Filter SLAM as a Basis for Metric Visual Perception," Proc. IEEE Int'l Conf. Robotics and Automation, 2007.
[34] A.J. Davison and D.W. Murray, "Sequential Localisation and Map-Building Using Active Vision," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 865-880, July 2002.
[35] K. Levenberg, "A Method for the Solution of Certain Non-Linear Problems in Least Squares," Quarterly Applied Math., vol. 2, pp. 164-168, 1944.
[36] D.W. Marquardt, "An Algorithm for the Least-Squares Estimation of Non-Linear Parameters," J. SIAM, vol. 11, no. 2, pp. 431-441, 1963.
[37] R.M. Murray, Z. Li, and S.S. Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.
[38] G. Klein and D.W. Murray, "Improving the Agility of Keyframe-Based SLAM," Proc. 10th European Conf. Computer Vision, 2008.
[39] R.O. Castle, G. Klein, and D.W. Murray, "Wide-Area Augmented Reality Using Camera Tracking and Mapping in Multiple Regions," Computer Vision and Image Understanding, vol. 115, pp. 854-867, 2011.
[40] O. Faugeras and F. Lustman, "Motion and Structure from Motion in a Piecewise Planar Environment," Int'l J. Pattern Recognition in Artificial Intelligence, vol. 2, pp. 485-508, 1988.
[41] M.A. Fischler and R.C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Comm. ACM, vol. 24, no. 6, pp. 381-395, 1981.
[42] E. Rosten and T. Drummond, "Machine Learning for High-Speed Corner Detection," Proc. Ninth European Conf. Computer Vision, 2006.
[43] E. Rosten, R. Porter, and T. Drummond, "Faster and Better: A Machine Learning Approach to Corner Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 105-119, Jan. 2010.
[44] S.A. Holmes, "Challenges in Real-Time SLAM: Curbing Complexity, Cultivating Consistency," DPhil dissertation, Dept. of Eng. Science, Univ. of Oxford, 2010.
[45] R.O. Castle, G. Klein, and D.W. Murray, "Combining monoSLAM with Object Recognition for Scene Augmentation Using a Wearable Camera," Image and Vision Computing, vol. 28, no. 12, pp. 1548-1556, 2010.
[46] L.A. Clemente, A.J. Davison, I.D. Reid, J. Neira, and J.D. Tardos, "Increasing the Size of Maps Mapping Large Loops with a Single Hand-Held Camera," Proc. Robotics: Science and Systems Conf., 2007.

Steven A. Holmes received the master's degree in engineering science from New College, University of Oxford, in 2006 with first class honors, and continued there as a graduate student in the Active Vision Laboratory, working on issues of complexity in monocular visual SLAM. He completed his doctorate in 2010. He now works on motion capture for Vicon in Oxford, part of the Oxford Metrics Group plc. He is a member of the IEEE and the IEEE Computer Society.

David W. Murray graduated with first class honors in physics from the University of Oxford in 1977 and received the doctorate degree there in low-energy nuclear physics in 1980. He was a research fellow, again in physics, at the California Institute of Technology before joining GEC's research laboratories in London, where his primary research interests were in motion computation, structure from motion, and object recognition. He moved to the University of Oxford in 1989 as a fellow of St. Anne's College and as a University lecturer in the Department of Engineering Science, where he founded the Active Vision Laboratory. He was made a Professor of Engineering Science in 1997. His research interests center on active and ego-centric approaches to visual sensing and perception. He is a member of the IEEE and the IEEE Computer Society.

