[IEEE 2010 5th International Conference on Information and Automation for Sustainability (ICIAfS) - Colombo (2010.12.17-2010.12.19)] 2010 Fifth International Conference on Information

Efficient Monocular SLAM using sparse informationfilters

Zhan WangARC Centre of Excellence for

Autonomous Systems (CAS)Faculty of Engineering and IT

University of Technology, Sydney, AustraliaEmail: [email protected]

Gamini DissanayakeARC Centre of Excellence for

Autonomous Systems (CAS)Faculty of Engineering and IT

University of Technology, Sydney, AustraliaEmail: [email protected]

Abstract—A new method for efficiently mapping three dimen-sional environments from a platform carrying a single calibratedcamera, and simultaneously localizing the platform within thismap is presented in this paper. This is the Monocular SLAMproblem in robotics, which is equivalent to the problem ofextracting Structure from Motion (SFM) in computer vision. Anovel formulation of Monocular SLAM which exploits recentresults from multi-view geometry to partition the feature locationmeasurements extracted from images into providing estimates ofenvironment representation and platform motion is developed.Proposed formulation allows rich geometric information from alarge set of features extracted from images to be maximally incor-porated during the estimation process, without a correspondingincrease in the computational cost, resulting in more accurateestimates. A sparse Extended Information Filter (EIF) which fullyexploits the sparse structure of the problem is used to generatecamera pose and feature location estimates. Experimental resultsare provided to verify the algorithm.

I. INTRODUCTION

Efficiently acquiring a geometric representation of the en-vironment is crucial to the success of autonomous vehiclenavigation. In the last decade, the simultaneous localizationand mapping (SLAM) problem has been one of the mostresearched problems in the robotics community. The body ofliterature in SLAM is large. For an extensive review of therecent work in SLAM the reader is referred to Durrant-Whyteand Bailey [1] [2].

With vision becoming more and more popular as a sensingtechnique due to advances in computer vision, the MonocularSLAM problem has received significant attention. MonocularSLAM is a special SLAM problem in that it requires a verysimple device, a single camera. On the other hand, it is achallenging problem as only bearing observations are available,and thus the state of the environment is not observable fromone location.

Structure from Motion (SFM) is the problem of recon-structing the geometry of a scene as well as the cameraposes from a stream of images. In computer vision, the SFMproblem has been studied for many years. There is a closerelationship between SFM and Monocular SLAM. They bothuse a sequence of images to recover the scene structure (termedas ’map’ in SLAM) and platform poses.

Despite all the similarities, the focus of the two problemsis quite different. In Monocular SLAM, only the camera posewhen the last image was taken is of interest as this is sufficientto make navigation decisions. Therefore, all previous poses areoften marginalized out reducing the number of states to beestimated, although this process makes the information matrixof the resulting estimation problem dense. The marginalizationstep is analogous to the reduction procedure in bundle adjust-ment, however, it is performed for a very different purpose.Revisiting previously traversed regions of the environment,termed as loop closure, is very important in SLAM whereasthis is not of much concern in SFM. SFM focuses more onobtaining an off-line optimal solution, whereas in SLAM real-time solutions and adequate estimates of the reliability of theresult (uncertainty) are essential. Despite these differences,clearly it is possible to exploit some of the strategies usedin solving the SFM problem to develop new and more ef-fective algorithms for SLAM. The close relationship betweenMonocular SLAM and SFM has been realized by the roboticscommunity and some research has been directed in exploitingideas from both fields [3]. However, there is much potentialfor further research in this area and many opportunities forsignificant advances exist.

While some solutions using optimisation strategies haveemerged recently, most successful Monocular SLAM algo-rithms use constant velocity model [5] [6] to constrain con-secutive camera poses, when direct motion measurements donot exist. In order for the constant velocity model to be valid,the camera motion must be smooth which significantly limitsthe application scope. Alternatively, the frequency of acquiringimages needs to be high, which can be achieved by using thehigh frequency camera [7]. However, this is at the expense ofsignificantly more computational resources.

Although the sparseness of the problem structure has beenwidely acknowledged and used in SFM methods [8] [9] andSLAM methods [10] [11] to develop efficient algorithms, incurrent development of Monocular SLAM, the sparseness hasnot been adequately exploited. Most current Monocular SLAMmethods use Extended Kalman Filter (EKF) based techniquesand tend to limit the number of features used for representing

978-1-4244-8551-2/10/$26.00 c©2010 IEEE 311 ICIAfS10

the environment due to computational reasons. In [5], onlyaround one hundred features are estimated. The Hierarchicalmap approach is introduced into Monocular SLAM [6], butthe total number of estimated features is still limited.

The main contributions of this paper are as follows. SectionII describes a new formulation of Monocular SLAM thatmakes it possible to incorporate platform motion estimatescomputed using ideas from multi-view geometry into thetraditional SLAM observation equations. A novel efficientMonocular SLAM algorithm, that fully exploits the sparsestructure of the new formulation is presented in Section III.Experimental results to demonstrate the effectiveness of theproposed techniques are provided in Section IV.

II. NEW FORMULATION OF MONOCULAR SLAM

Popular image feature extraction methods such as SURF[12] allow hundreds or thousands of features to be extractedfrom a single image and tracked across images. However,incorporating the information present in all these observationsin to the SLAM solution by maintaining these features in thestate vector leads to an unacceptable computational cost.

The new formulation presented below tackles this problemfrom a novel perspective. A significant proportion of informa-tion gathered from the extracted features are transformed intogeometrical information relating consecutive camera poses,and a structure that allows these information to be incorporatedin the SLAM solution is provided. The remaining features aremaintained in a map to provide a comprehensive representa-tion of the environment. The new formulation preserves thesparseness of the problem structure, thereby allowing efficientestimation of the states using sparse information filters. By thismeans, the information extracted from the images is maximallyexploited, while maintaining the computational efficiency andthus providing a more accurate and detailed map of theenvironment.

Suppose two images are taken at camera pose Pi andthen Pj , and a set of features Λ are extracted from theseimages. SURF [12] feature extractor is a popular choice forthis purpose. It is proposed to divide the feature set Λ in totwo parts as Λ = [Λp,Λm]. The set of measurements to thefeatures in Λ are accordingly divided as Ω = [Ωp,Ωm]. Eachmeasurement in Ω is in the form of (u, v), which representsthe feature position in the image. While Ωm is to be used forestimating feature locations in the map as usual, the featuresin the subset Ωp are used for computing relative pose betweenPi and Pj , providing an alternative method for capturinginformation contained in the observations.

Features in Ωp that are common to the two images madefrom Pi and Pj are used to estimate the essential matrixby the five-point method [4] using known camera intrinsicparameters. Outliers are handled using RANSAC [14] and theepipolar constraint. From the resulting essential matrix, therotation matrix, R, and translation vector, T = [xT , yT , zT ]

T ,describing the relative pose of Pj relative to Pi are computedusing the method provided in [13]. Consideration should begiven to select poses with a large baseline to ensure an accurate

estimate of the essential matrix. Due to the unobservablescale in the essential matrix, T only represents the translationdirection of Pj with respect to Pi. Then R is transformedinto the quaternion q, which is denoted as ZR. The translationvector, T , is transformed into ZT = [ζ, η]T , where ζ = xT /zTand η = yT /zT .

We resolve the two key issues that have so far madeit difficult to exploit ZR and ZT in an estimation theo-retic framework. Uncertainty of the relative pose estimateis computed using the Unscented transform [15], making itstraightforward to obtain. Relative pose between two pairs ofposes computed this way may become statistically dependentdue to the ’reuse’ of raw data. This is eliminated by makingsure that measurements to a feature in the environment fromdifferent poses are allocated to only one of the subsets (Ωp

or Ωm). , i.e. these can not be at one time be allocated inΩp, while in Ωm at other times. The Laplacian sign associatedwith features extracted by provides a convenient mechanismfor this purpose.

III. SPARSE EIF FORMULATION FOR MONOCULAR SLAM

A. Coordinate system

The world frame, denoted as W , is the reference frame. Thecamera frame, C, is attached to the camera, with the originlocated at the lens optical center with the Z axis along theoptical axis. The Euclidean coordinates of a feature in theworld frame is denoted as Xf = [xf , yf , zf ]

T . The camerapose, Xc, contains the Euclidean coordinates of the pose inthe world frame, XW

cORG, and the quaternion describing theorientation of the camera, qWC . The state vector contains allcamera poses and feature locations:

X = (XTc1, X

Tc2, . . . , X

Tf1, X

Tf2, . . .)

T . (1)

B. Observation equations

To exploit information present in the observations, rela-tionships between the measurements and the state vector isrequired. In the proposed formulation, measurements take threedifferent forms; Zm that correspond to the features that belongto the set Ωm used for representing the environment, and ZT

and ZR that represent the relative pose measurements extractedfrom feature observations that belong to the set Ωp. Thus threeobservation equations are needed.

Suppose a measurement Zm in Ωm, is made from a camerapose Xc to a feature Xf . The coordinates of the feature in thecamera frame is given by

XCf = RCW (Xf −XW

cORG), (2)

where the rotation matrix RCW is obtained from qWC , andXC

f = [xCf , y

Cf , z

Cf ]

T . Then the observation equation for Zm

is

Zm =

[uv

]= hm(X) + wm (3)

978-1-4244-8551-2/10/$26.00 c©2010 IEEE 312 ICIAfS10

where hm(X) = [xCf

zCf

foc + u0,yCf

zCf

foc+ v0]T , (u0, v0) is the

coordinate of the principle point in the image coordinate frame(in pixels), foc is the focal length (in pixels), and wm is themeasurement noise assumed to be white Gaussian.

The observation equation for ZR is

ZR = qCiCj = hR(X) + wR (4)

where hR is obtained from qWCj = qCiCj⊗

qWCi (⊗

indicates the quaternion product), and wR is the measurementnoise assumed to be white Gaussian.

The observation ZT is the position (up to a scale) of posePj relative to pose Pi. First transform XcjORG into the cameraframe Ci by

XCicj = RCiW (XW

cjORG −XWciORG), (5)

where the rotation matrix RCiW is obtained from qWCi , andXCi

cj = [xCicj , y

Cicj , z

Cicj ]

T . The observation equation for ZT is

ZT =

[ζη

]= hT (X) + wT (6)

where hT (X) = [xCicj

zCicj

,yCicj

zCicj

]T and wT is the measurement noise

assumed to be white Gaussian.

C. Map initialization

For the estimation process to start, initial estimates of allthe elements of the state vector are required.

The first camera pose frame is selected as the worldframe. To initialize the map. To obtain accurate relative posemeasurements between two poses, the two images need to havesufficient number of common features and the two poses needto have a reasonably large baseline. Starting from the secondimage, all the images in the selected set are tested against theimages made from known poses (at the first step, only the firstpose) with respect to the above two conditions. If any imagesatisfies those conditions, then the relative pose measurementsZT and ZR are computed, and the corresponding pose aswell as new features are initialized by triangulation. Wheninitializing a pose, a feature measurement from the pose toa known feature and a relative pose measurement with aknown pose are used. When initializing a feature, two featuremeasurements from the two related poses each are used. Theabove process is repeated until all poses in the selected set areinitialized. Normally this can be achieved except in the extremecase where there are very few features in the environment.

After all poses and features are initialized, all measurementsfrom the selected images are fused using EIF to obtain animproved estimate as follows.

D. Sequential fusion

The steps for fusing the measurements from a new imageare summarized in Algorithm 1.

The steps of computing initial guesses for the new poseand locations of new features in the new image are the same

Algorithm 1 Sequential fusion of the image taken from posePk

Require: Correspondence between features in current imageand one or more previous images, as well as features inthe current map.

1: Select a known pose and compute the relative pose mea-surements with Pk.

2: Compute the initial guess for Xck by triangulation fromthe selected pose and an old feature.

3: Compute initial guesses for new features by triangulationfrom Pk and old poses.

4: Add zeros to information vector and information matrixfor the new pose and new features.

5: Fuse all measurements incurred by the current image usingthe EIF.

as those in the process described in Section III-C. Theseinitial guesses are then added into the state vector, and zeroscorresponding to the new states are added into the informationmatrix and information vector.

The information in the feature measurements and relativepose measurements are fused using the observation equationsin (3), (4) and (6). A general form of these observationequations is

Zg = hg(X) + wg (7)

where Zg can be any of Zm, ZR and ZT , with hg and wg

being the corresponding expressions and measurement noisesin (3), (4) and (6).

The information matrix, I , and information vector, i, areupdated by

I(k) = I(k − 1) +∇HTg Q

−1g ∇Hg

i(k) = i(k − 1) +∇HTg Q

−1g [zg(k)−Hg(X̂(k − 1))

+∇HgX̂(k − 1)](8)

where ∇Hg is the Jacobian of the function hg with respect toall the states evaluated on the current state estimate X̂(k− 1),and Qg is the covariance matrix of the measurement noise. Itshould be noted that ZR and ZT from the same relative posemeasurement should be fused together as they are correlated.

E. Smoothing

The above filtering steps produce accurate estimates offeature locations and camera poses. A further smoothing stepusing all stored measurements can be used to further improvethe estimates. During smoothing, the latest state estimate isused to compute the Jacobians in equation (8). Equation (8) canthen be used to fuse all stored measurements in a batch processwhile setting the initial information matrix and informationvector being all zeros, to prevent any artificial incorporation ofinformation. This process can be repeated to refine the result inthe sequence of: recomputing Jacobians in equation (8), fusingall measurements to obtain the new information matrix and

978-1-4244-8551-2/10/$26.00 c©2010 IEEE 313 ICIAfS10

information vector, and recovering the state vector from thenew information matrix and information vector. It was foundthat just one iteration of this process is enough to producevery good results, due to accurate estimates from the filteringprocess.

F. Feature matching

The process described above relies on a robust strategyfor identifying the correspondence between features in theset of images acquired. Matching features across two imagesare required for computing relative pose estimations whilematching features present in the map to those in the currentimage is required for data association. A hypothesize-and-testStrategy is used for this purpose. First the SURF featuresextracted from two images are matched using the SURFdescriptors. This provides correct matching for most features,but outliers do exist. Therefore, the epipolar constraint thatmust be satisfied by corresponding points in two images areused to validate the matches. We use the method in [13] tosolve for the epipolar constraint. The above two steps areorganized in a RANSAC scheme [14], and produce the bestgroup of matching features. When the correspondences in anold image are found for the current image, then the featuresin the map, which are labeled as being observed in this oldimage, are also associated with the corresponding features inthe current image.

G. State vector recovery

Since the estimation precess only produces an informationvector and an information matrix, the state estimate X̂(k) isnot directly available. This can be recovered by solving thesparse linear equation

I(k)X̂(k) = i(k) (9)

This can be solved efficiently by first solving L(k)Y = i(k)and L(k)T X̂(k) = Y , where L(k) is the Cholesky factoriza-tion of the information matrix, I(k). The Note that L(k), canbe obtained efficiently from L(k − 1), using the incrementalmethod provided in the previous work [16].

H. Exactly sparse information matrix

In the matrix ∇HTg Q−1

g ∇Hg in (8), all the elements relat-ing to states that are not present in the observation equationare exactly zero. This can be easily seen by the fact that theJacobain

∇Hg =

[0, · · · , 0, ∂Hg

∂Xa, 0, · · · , 0, ∂Hg

∂Xb, 0, · · · , 0

](10)

is very sparse and only the items corresponding to the statesinvolved in the observation equation are non-zeros. It can beseen from (4) and (6), that only the states for the two relatedposes are present in these two observation equations. Similarly,only the states for the related pose and feature are presentin (3). As the information matrix update involves a simpleaddition, only the off-diagonal elements relating to the statesthat are simultaneously involved in an observation equation

will be non-zeros, while all others will remain exactly zero.Therefore, a significant portion of the information matrix willbe exactly zero, resulting in an exactly sparse informationmatrix.

In general, each column (row) of the information matrix willcontain at most a constant number of non-zero elements (thisconstant depends on the density of features and the camerafield of view), independent of the size of the environmentand/or the total number of features.

IV. EXPERIMENT RESULTS

Two sets of experimental results are presented in thissection. In the first experiment, a Dragonfly-II camera is used.The field of view of this camera is 56 degrees and the imageresolution used is 1024 × 768. The experiment environment,which is shown in Figure 1(a), consists of three rectangularboxes of regular shape on an oval table. A trolley is used tocarry the camera to move round the table and take images. Fiftyfour images are obtained and processed by an implementationof the above algorithm in Matlab.

The sparse information matrix is shown in Figure 1(b).Figures 2(a) and 2(b) show the resulting map in 3D andthe projection in 2D. Ground truth is not available for thisexperiment. However the result can be evaluated by observingthe shape of the estimated boxes in the results especially inFigure 2(b). The ratio between width and length of the box onthe right in Figure 1(a) is measured to be 0.80:1, while in theestimate it is approximately 0.83:1.

(a)

(b)

Fig. 1. (a) Experiment environment. The three boxes are put on a horizontalsurface, but not exactly aligned. (b) Sparse information matrix.

978-1-4244-8551-2/10/$26.00 c©2010 IEEE 314 ICIAfS10

−2 −1.5 −1 −0.5 0 0.5 1 1.5

−0.4−0.200.2

1

2

3

4

Y

Z

X

(a)

−2 −1.5 −1 −0.5 0 0.5 1 1.5

0.5

1

1.5

2

2.5

3

3.5

4

YX

Z

(b)

Fig. 2. Experiment results. The asterisks represent camera poses; the solidlines indicate orientation of the camera forward direction; the dots representfeatures. (a) Camera pose and feature location estimates in 3D. (b) Camerapose and feature location estimates projected to 2D.

The second experiment is based on a public domain datasetincorporating an image sequence taken in New College, Ox-ford [17]. The implementation of our algorithm in Matlab isapplied to the image sequence taken from the left camera of thePoint Grey Bumbleebee, with resolution of 512× 384. Figure3(a) shows a sample image from the dataset. The final 3Dmap, which is projected to the horizontal plane, is shown inFigure 3(b). As a comparison, we perform bundle adjustmentmaintaining all features in the map by using the SBA library[18]. It can be seen that the trajectory obtained from ourmethod is very close to that obtained from bundle adjustment.It is worth noting that our method only maintains about 30percent of the extracted features in the map.

V. CONCLUSIONS AND DISCUSSIONS

A novel algorithm to maximally fuse feature informationfrom image sequences in a very efficient way, to achieve quickand quality environment and pose estimates is demonstratedin this paper. This new formulation incorporates informationfrom all extracted features by uniquely distributing them intoobservations about features and relative poses. The innovationis that all information present in all the feature extracted isexploited without increasing the number of parameters to be

(a)

0 5 10 15 20 25 30 35 40

−5

0

5

10

15

20

25

X

Z

(b)

0 200 400 600 800 1000 1200 1400

0

200

400

600

800

1000

1200

1400

nz = 134643

(c)

Fig. 3. Experimental results. (a) One sample image. (b) Camera pose andfeature location estimates projected to 2D. The asterisks and dots representcamera poses and feature locations from our algorithm; the circles representcamera poses from SBA. (c) Exactly sparse information matrix.

estimated. The use of EIF fully captures the sparseness of theproblem structure, to achieve efficiency.

Due to the nature of the observations from images, theresulting estimate of the environment and camera poses is up toan unobservable scale, which can be recovered by any absolutemeasurement of the states, such as the distance between twofeatures. Due to the sparseness of the information matrix,this can be achieved very efficiently by using the Sherman-Morrison Formula [20].

The proposed algorithm can work in the filtering mode,which can provide an accurate immediate map for roboticnavigation; it can also work in a filtering-relinearisation-smoothing mode, which provides a refined map at some extracomputational cost. This additional smoothing step is alsolikely to improve the consistency of the estimate [19]. Further

978-1-4244-8551-2/10/$26.00 c©2010 IEEE 315 ICIAfS10

work is required to verify this. It is also worth noting that thesmoothing mode has a very close relationship to the non-linearoptimization schemes used in bundle adjustment [8].

As analyzed in Section III-H and shown in Section IV,the information matrix in the proposed algorithm is exactlysparse. This leads to significant computational savings in theestimation process. The update step is constant time, and thestate recovery step is linear with the state dimension until thereis a loop closure. The data association involving matchingfeatures across images is made efficient by using the shortSURF descriptors, and the processing time almost remainconstant for each image. Feature extraction consumes mostof the total time in each step, but its processing time almostremains constant for all images and does not increase with thestate dimension. A full scale implementation to exploit theseproperties is currently underway.

ACKNOWLEDGMENT

This work is supported by the ARC Centre of Excellenceprogramme, funded by the Australian Research Council (ARC)and the New South Wales State Government.

REFERENCES

[1] Durrant-Whyte, H., and Bailey, T. 2006. Simultaneous localization andmapping: part I. Robotics and Automation Magazine, 13(2):99-110.

[2] Bailey, T., and Durrant-Whyte, H. 2006. Simultaneous localization andmapping (SLAM): part II. Robotics and Automation Magazine, 13(3):108-117.

[3] Nister, D., Naroditsky, O. and Bergen, J. 2006. Visual Odometry forGround Vehicle Applications, Journal of Field Robotics, 23(1): 3-20.

[4] Nister, D. 2004. An Efficient Solution to the Five-Point Relative PoseProblem, IEEE Transactions on Pattern Analysis and Machine Intelli-gence, 26(6): 756-770.

[5] Davison, A. J., Reid, I. D., Molton, N. D. and Stasse, O. 2007.MonoSLAM: Real-Time Single Camera SLAM, IEEE transactions onPattern Analysis and Machine Intelligence, 29(6): 1052-1067.

[6] Clemente, L. A., Davison, A. J., Reid, I., Neira, J. and Tardos, J. D.2007. Mapping Large Loops with a Single Hand-Held Camera. Robotics:Science and Systems.

[7] Gemeiner, P, Davison, A and Vincze, M. 2008. Improving LocalizationRobustness in Monocular SLAM Using a High-Speed Camera, Robotics:Science and Systems.

[8] Triggs, W., McLauchlan, P. F., Hartley, R. I. and Fitzgibbon, A. 2000.BA survey Adjustment – A Modern Synthesis, Triggs, B., Zisserman, A.,and Szeliski, R. (Eds.): Vision Algorithms, pp. 298-372.

[9] Mouragnon, E., Lhuillier, M., Dhome, M., Dekeyser, F. and Sayd, P.2006. 3D reconstruction of complex structures with bundle adjustment:an incremental approach, IEEE International Conference on Robotics andAutomation, pp:3055-3061.

[10] Thrun, S., Burgard, W., and Fox. D. 2005. Probabilistic Robotics. TheMIT Press.

[11] Dellaert, F. and Kaess, M. 2006. Square Root SAM: simultaneous lo-cation and mapping via square root information smoothing. InternationalJournal of Robotics Research, 25(12):1181-1203.

[12] Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. 2008. SURF: SpeededUp Robust Features, Computer Vision and Image Understanding, 110(3):346-359.

[13] Hartley, R. I. and Zisserman, A. 2004. Multiple view geometry incomputer vision, 2nd edition, Cambridge University Press.

[14] Martin, A. F. and Bolles, R. C. 1981. Random Sample Consensus: AParadigm for Model Fitting with Applications to Image Analysis andAutomated Cartography, Comm. of the ACM 24: 381-395.

[15] Julier, S. J. and Uhlmann, J. K. 2004. Unscented filtering and nonlinearestimation, Proceedings of the IEEE, 92(3):401-422.

[16] Wang, Z., Huang, S. and Dissanayake, G. 2007. D-SLAM: a decoupledsolution to simultaneous localization and mapping. International Journalof Robotics Research, 26(2): 187-204.

[17] Smith, M., Baldwin, I., Churchill, W., Paul, R. and Newman, P. 2009.The New College Vision and Laser Data Set, International Journal forRobotics Research, 28(5): 595-599.

[18] Lourakis, M. I. A. and Argyros, A. A. 2009. SBA: A software packagefor generic sparse bundle adjustment. ACM Transactions on MathematicalSoftware, 36(1):1-30.

[19] Huang, S. and Dissanayake, G. 2007. Convergence and consistencyanalysis for Extended Kalman Filter based SLAM. IEEE Transactionson Robotics, 23(5):1036-1049.

[20] William H. P., Brian, P. F., Saul, A. T. and William T. V., 1992.Numerical Recipes in C: The Art of Scientific Computing, CambridgeUniversity Press.

978-1-4244-8551-2/10/$26.00 c©2010 IEEE 316 ICIAfS10

Documents

[IEEE 2010 5th International Conference on Information and Automation for Sustainability (ICIAfS) - Colombo (2010.12.17-2010.12.19)] 2010 Fifth International Conference on Information