[IEEE 2010 5th International Conference on Information and Automation for Sustainability (ICIAfS) - Colombo (2010.12.17-2010.12.19)] 2010 Fifth International Conference on Information and Automation for Sustainability - Efficient Monocular SLAM using sparse information filters


    Efficient Monocular SLAM using sparse information filters

    Zhan Wang
    ARC Centre of Excellence for Autonomous Systems (CAS)
    Faculty of Engineering and IT
    University of Technology, Sydney, Australia
    Email: zhwang@eng.uts.edu.au

    Gamini Dissanayake
    ARC Centre of Excellence for Autonomous Systems (CAS)
    Faculty of Engineering and IT
    University of Technology, Sydney, Australia
    Email: gdissa@eng.uts.edu.au

    Abstract: A new method for efficiently mapping three-dimensional environments from a platform carrying a single calibrated camera, and simultaneously localizing the platform within this map, is presented in this paper. This is the Monocular SLAM problem in robotics, which is equivalent to the problem of extracting Structure from Motion (SFM) in computer vision. A novel formulation of Monocular SLAM is developed which exploits recent results from multi-view geometry to partition the feature location measurements extracted from images into those providing estimates of the environment representation and those providing estimates of platform motion. The proposed formulation allows rich geometric information from a large set of features extracted from images to be maximally incorporated during the estimation process, without a corresponding increase in the computational cost, resulting in more accurate estimates. A sparse Extended Information Filter (EIF) which fully exploits the sparse structure of the problem is used to generate camera pose and feature location estimates. Experimental results are provided to verify the algorithm.


    Efficiently acquiring a geometric representation of the environment is crucial to the success of autonomous vehicle navigation. In the last decade, the simultaneous localization and mapping (SLAM) problem has been one of the most researched problems in the robotics community. The body of literature in SLAM is large. For an extensive review of the recent work in SLAM the reader is referred to Durrant-Whyte and Bailey [1] [2].

    With vision becoming more and more popular as a sensing technique due to advances in computer vision, the Monocular SLAM problem has received significant attention. Monocular SLAM is a special SLAM problem in that it requires a very simple device, a single camera. On the other hand, it is a challenging problem as only bearing observations are available, and thus the state of the environment is not observable from one location.

    Structure from Motion (SFM) is the problem of reconstructing the geometry of a scene as well as the camera poses from a stream of images. In computer vision, the SFM problem has been studied for many years. There is a close relationship between SFM and Monocular SLAM. They both use a sequence of images to recover the scene structure (termed the map in SLAM) and platform poses.

    Despite all the similarities, the focus of the two problems is quite different. In Monocular SLAM, only the camera pose when the last image was taken is of interest, as this is sufficient to make navigation decisions. Therefore, all previous poses are often marginalized out, reducing the number of states to be estimated, although this process makes the information matrix of the resulting estimation problem dense. The marginalization step is analogous to the reduction procedure in bundle adjustment; however, it is performed for a very different purpose. Revisiting previously traversed regions of the environment, termed loop closure, is very important in SLAM, whereas this is not of much concern in SFM. SFM focuses more on obtaining an off-line optimal solution, whereas in SLAM real-time solutions and adequate estimates of the reliability of the result (uncertainty) are essential. Despite these differences, it is clearly possible to exploit some of the strategies used in solving the SFM problem to develop new and more effective algorithms for SLAM. The close relationship between Monocular SLAM and SFM has been realized by the robotics community and some research has been directed at exploiting ideas from both fields [3]. However, there is much potential for further research in this area and many opportunities for significant advances exist.
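
The fill-in caused by marginalizing out past poses can be illustrated with a small numerical sketch. The toy information matrix below is an assumption made purely for illustration (one scalar pose observing three scalar features, with made-up values); it is not taken from the paper. Marginalization in information form is the Schur complement, and the helper name `marginalize` is hypothetical.

```python
import numpy as np

def marginalize(info, keep, drop):
    """Marginalize the `drop` states out of an information matrix.

    In information form this is the Schur complement of the dropped block.
    """
    A = info[np.ix_(keep, keep)]   # block of the states that remain
    B = info[np.ix_(keep, drop)]   # cross terms between kept and dropped states
    C = info[np.ix_(drop, drop)]   # block of the states being removed
    return A - B @ np.linalg.solve(C, B.T)

# Toy state ordering: [pose, f1, f2, f3], each a scalar (illustrative values).
# The pose is linked to every feature; the features are not linked to each other.
info = np.array([
    [ 4.0, -1.0, -1.0, -1.0],
    [-1.0,  2.0,  0.0,  0.0],
    [-1.0,  0.0,  2.0,  0.0],
    [-1.0,  0.0,  0.0,  2.0],
])

reduced = marginalize(info, keep=[1, 2, 3], drop=[0])
print(reduced)  # every feature-feature entry is now non-zero
```

Before marginalization the feature block is diagonal; afterwards every pair of features that was observed from the removed pose is coupled, which is the loss of sparsity the paragraph above refers to.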

    While some solutions using optimisation strategies have emerged recently, most successful Monocular SLAM algorithms use a constant velocity model [5] [6] to constrain consecutive camera poses when direct motion measurements do not exist. For the constant velocity model to be valid, the camera motion must be smooth, which significantly limits the application scope. Alternatively, the frequency of acquiring images needs to be high, which can be achieved by using a high-frequency camera [7]. However, this comes at the expense of significantly more computational resources.

    Although the sparseness of the problem structure has been widely acknowledged and used in SFM methods [8] [9] and SLAM methods [10] [11] to develop efficient algorithms, in the current development of Monocular SLAM the sparseness has not been adequately exploited. Most current Monocular SLAM methods use Extended Kalman Filter (EKF) based techniques and tend to limit the number of features used for representing

    978-1-4244-8551-2/10/$26.00 © 2010 IEEE 311 ICIAfS10

    the environment due to computational reasons. In [5], only around one hundred features are estimated. The hierarchical map approach is introduced into Monocular SLAM in [6], but the total number of estimated features is still limited.

    The main contributions of this paper are as follows. Section II describes a new formulation of Monocular SLAM that makes it possible to incorporate platform motion estimates computed using ideas from multi-view geometry into the traditional SLAM observation equations. A novel, efficient Monocular SLAM algorithm that fully exploits the sparse structure of the new formulation is presented in Section III. Experimental results demonstrating the effectiveness of the proposed techniques are provided in Section IV.


    Popular image feature extraction methods such as SURF [12] allow hundreds or thousands of features to be extracted from a single image and tracked across images. However, incorporating the information present in all these observations into the SLAM solution by maintaining these features in the state vector leads to an unacceptable computational cost.

    The new formulation presented below tackles this problem from a novel perspective. A significant proportion of the information gathered from the extracted features is transformed into geometrical information relating consecutive camera poses, and a structure that allows this information to be incorporated in the SLAM solution is provided. The remaining features are maintained in a map to provide a comprehensive representation of the environment. The new formulation preserves the sparseness of the problem structure, thereby allowing efficient estimation of the states using sparse information filters. By this means, the information extracted from the images is maximally exploited while computational efficiency is maintained, thus providing a more accurate and detailed map of the environment.

    Suppose two images are taken at camera pose $P_i$ and then $P_j$, and a set of features is extracted from these images. The SURF [12] feature extractor is a popular choice for this purpose. It is proposed to divide the feature set $F$ into two parts as $F = [F_p, F_m]$. The set of measurements $Z$ to the features in $F$ is accordingly divided as $Z = [Z_p, Z_m]$. Each measurement in $Z$ is in the form $(u, v)$, which represents the feature position in the image. While $Z_m$ is to be used for estimating feature locations in the map as usual, the features in the subset $F_p$ are used for computing the relative pose between $P_i$ and $P_j$, providing an alternative method for capturing the information contained in the observations.
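
The division of measurements into a pose-estimation subset and a map subset can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Observation` record, the `split_observations` helper, and the rule "sign of the Laplacian equals +1 goes to the pose subset" are all hypothetical. The only requirement the sketch relies on is that the splitting criterion is a fixed attribute of the feature, so that the same feature always falls in the same subset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    feature_id: int      # identity of the tracked feature
    pose_id: int         # index of the camera pose the image was taken from
    u: float             # image coordinates of the feature
    v: float
    laplacian_sign: int  # +1 or -1, a per-feature attribute (assumed stable)

def split_observations(observations, pose_sign=+1):
    """Split measurements into (pose-estimation subset, map subset).

    The rule used here (an assumption): features whose Laplacian sign equals
    `pose_sign` feed relative pose estimation; all others go to the map.
    """
    z_p, z_m = [], []
    for obs in observations:
        (z_p if obs.laplacian_sign == pose_sign else z_m).append(obs)
    return z_p, z_m

# Two features seen from two poses (made-up pixel values).
observations = [
    Observation(feature_id=1, pose_id=0, u=120.0, v=80.0, laplacian_sign=+1),
    Observation(feature_id=1, pose_id=1, u=131.5, v=79.0, laplacian_sign=+1),
    Observation(feature_id=2, pose_id=0, u=300.2, v=210.4, laplacian_sign=-1),
    Observation(feature_id=2, pose_id=1, u=310.9, v=212.0, laplacian_sign=-1),
]
z_p, z_m = split_observations(observations)
```

Because the criterion depends only on the feature and not on the pose, no feature can appear in both subsets, whichever images it is observed in.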

    Features in $F_p$ that are common to the two images taken from $P_i$ and $P_j$ are used to estimate the essential matrix by the five-point method [4] using known camera intrinsic parameters. Outliers are handled using RANSAC [14] and the epipolar constraint. From the resulting essential matrix, the rotation matrix, $R$, and translation vector, $T = [x_T, y_T, z_T]^T$, describing the pose of $P_j$ relative to $P_i$ are computed using the method provided in [13]. Consideration should be given to selecting poses with a large baseline to ensure an accurate estimate of the essential matrix. Due to the unobservable scale in the essential matrix, $T$ only represents the translation direction of $P_j$ with respect to $P_i$. Then $R$ is transformed into the quaternion $q$, which is denoted as $Z_R$. The translation vector, $T$, is transformed into $Z_T = [\alpha, \beta]^T$, where $\alpha = x_T / z_T$ and $\beta = y_T / z_T$.
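
The conversion of a relative pose into the two measurements ZR (a quaternion) and ZT (a scale-free translation direction) can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the quaternion ordering [w, x, y, z] is an assumption, and the pose convention X2 = R X1 + t (with essential matrix E = [t]x R) is one common choice that may differ from the convention used in the paper. Estimating E itself (five-point method with RANSAC) is not shown.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ a == np.cross(t, a)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, t):
    """Essential matrix for the assumed convention X2 = R @ X1 + t."""
    return skew(t) @ R

def rotation_to_quaternion(R):
    """Unit quaternion [w, x, y, z] of a rotation matrix (branching on the trace)."""
    q = np.zeros(4)
    tr = np.trace(R)
    if tr > 0.0:
        s = 2.0 * np.sqrt(1.0 + tr)
        q[0] = 0.25 * s
        q[1] = (R[2, 1] - R[1, 2]) / s
        q[2] = (R[0, 2] - R[2, 0]) / s
        q[3] = (R[1, 0] - R[0, 1]) / s
    else:
        i = int(np.argmax(np.diag(R)))          # largest diagonal element
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    return q / np.linalg.norm(q)

def translation_direction(t, eps=1e-9):
    """Scale-free direction [x/z, y/z]; undefined when z is (near) zero."""
    if abs(t[2]) < eps:
        raise ValueError("z component too small for the x/z, y/z parameterization")
    return np.array([t[0] / t[2], t[1] / t[2]])

# Synthetic relative pose: 90 degree rotation about Z plus a translation.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.2, -0.1, 1.0])
z_R = rotation_to_quaternion(R)
z_T = translation_direction(t)
```

Note that `translation_direction(t)` and `translation_direction(3 * t)` return the same value, which reflects the fact that only the direction of the translation, not its magnitude, is recoverable from the essential matrix.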

    We resolve the two key issues that have so far made it difficult to exploit $Z_R$ and $Z_T$ in an estimation-theoretic framework. First, the uncertainty of the relative pose estimate is computed using the unscented transform [15], which makes it straightforward to obtain. Second, relative poses between two pairs of poses computed this way may become statistically dependent due to the reuse of raw data. This dependence is eliminated by making sure that measurements to a feature in the environment from different poses are allocated to only one of the subsets ($Z_p$ or $Z_m$), i.e. they cannot be allocated to $Z_p$ at one time and to $Z_m$ at other times. The sign of the Laplacian associated with features extracted by SURF provides a convenient mechanism for this purpose.
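
The unscented transform mentioned above propagates a mean and covariance through a nonlinear function by evaluating it at a small set of sigma points. The sketch below shows the basic symmetric form with a single scaling parameter `kappa`; it is a generic illustration, not the authors' pipeline. In particular, the input covariance on the translation vector and the use of the x/z, y/z direction function as the nonlinearity are assumptions chosen only to keep the example small.

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    """Propagate (mean, cov) through f using 2n + 1 symmetric sigma points."""
    n = mean.size
    # Columns of L are the sigma point offsets: L @ L.T == (n + kappa) * cov.
    L = np.linalg.cholesky((n + kappa) * cov)
    sigma = [mean]
    sigma += [mean + L[:, i] for i in range(n)]
    sigma += [mean - L[:, i] for i in range(n)]
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    Y = np.array([f(s) for s in sigma])        # transformed sigma points
    y_mean = weights @ Y
    d = Y - y_mean
    y_cov = (weights[:, None] * d).T @ d       # weighted outer-product sum
    return y_mean, y_cov

def direction(t):
    """Scale-free translation direction [x/z, y/z]."""
    return np.array([t[0] / t[2], t[1] / t[2]])

# Illustrative numbers only: a translation estimate with an assumed covariance.
t_mean = np.array([0.2, -0.1, 1.0])
t_cov = np.diag([1e-4, 1e-4, 4e-4])
zT_mean, zT_cov = unscented_transform(t_mean, t_cov, direction)
```

For a linear function the transform reproduces the exact mean and covariance, which is a convenient sanity check; for a nonlinear function such as `direction` it gives an approximation without requiring Jacobians.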


    A. Coordinate system

    The world frame, denoted as $W$, is the reference frame. The camera frame, $C$, is attached to the camera, with the origin located at the lens optical center and the $Z$ axis along the optical axis. The Euclidean coordinates of a feature in the world frame are denoted as $X_f = [x_f, y_f, z_f]^T$. The camera pose, $X_c$, contains the Euclidean coordinates of the pose in the world frame, $X^W_{cORG}$, and the quaternion describing the orientation of the camera, $q^{WC}$. The state vector contains all camera poses and featu