[IEEE 2010 5th International Conference on Information and Automation for Sustainability (ICIAfS) - Colombo (2010.12.17-2010.12.19)] 2010 Fifth International Conference on Information and Automation for Sustainability - Efficient Monocular SLAM using sparse information filters

  • Published on

  • View

  • Download


<ul><li><p>Efficient Monocular SLAM using sparse informationfilters</p><p>Zhan WangARC Centre of Excellence for</p><p>Autonomous Systems (CAS)Faculty of Engineering and IT</p><p>University of Technology, Sydney, AustraliaEmail: zhwang@eng.uts.edu.au</p><p>Gamini DissanayakeARC Centre of Excellence for</p><p>Autonomous Systems (CAS)Faculty of Engineering and IT</p><p>University of Technology, Sydney, AustraliaEmail: gdissa@eng.uts.edu.au</p><p>AbstractA new method for efficiently mapping three dimen-sional environments from a platform carrying a single calibratedcamera, and simultaneously localizing the platform within thismap is presented in this paper. This is the Monocular SLAMproblem in robotics, which is equivalent to the problem ofextracting Structure from Motion (SFM) in computer vision. Anovel formulation of Monocular SLAM which exploits recentresults from multi-view geometry to partition the feature locationmeasurements extracted from images into providing estimates ofenvironment representation and platform motion is developed.Proposed formulation allows rich geometric information from alarge set of features extracted from images to be maximally incor-porated during the estimation process, without a correspondingincrease in the computational cost, resulting in more accurateestimates. A sparse Extended Information Filter (EIF) which fullyexploits the sparse structure of the problem is used to generatecamera pose and feature location estimates. Experimental resultsare provided to verify the algorithm.</p><p>I. INTRODUCTION</p><p>Efficiently acquiring a geometric representation of the en-vironment is crucial to the success of autonomous vehiclenavigation. In the last decade, the simultaneous localizationand mapping (SLAM) problem has been one of the mostresearched problems in the robotics community. The body ofliterature in SLAM is large. For an extensive review of therecent work in SLAM the reader is referred to Durrant-Whyteand Bailey [1] [2].</p><p>With vision becoming more and more popular as a sensingtechnique due to advances in computer vision, the MonocularSLAM problem has received significant attention. MonocularSLAM is a special SLAM problem in that it requires a verysimple device, a single camera. On the other hand, it is achallenging problem as only bearing observations are available,and thus the state of the environment is not observable fromone location.</p><p>Structure from Motion (SFM) is the problem of recon-structing the geometry of a scene as well as the cameraposes from a stream of images. In computer vision, the SFMproblem has been studied for many years. There is a closerelationship between SFM and Monocular SLAM. They bothuse a sequence of images to recover the scene structure (termedas map in SLAM) and platform poses.</p><p>Despite all the similarities, the focus of the two problemsis quite different. In Monocular SLAM, only the camera posewhen the last image was taken is of interest as this is sufficientto make navigation decisions. Therefore, all previous poses areoften marginalized out reducing the number of states to beestimated, although this process makes the information matrixof the resulting estimation problem dense. The marginalizationstep is analogous to the reduction procedure in bundle adjust-ment, however, it is performed for a very different purpose.Revisiting previously traversed regions of the environment,termed as loop closure, is very important in SLAM whereasthis is not of much concern in SFM. SFM focuses more onobtaining an off-line optimal solution, whereas in SLAM real-time solutions and adequate estimates of the reliability of theresult (uncertainty) are essential. Despite these differences,clearly it is possible to exploit some of the strategies usedin solving the SFM problem to develop new and more ef-fective algorithms for SLAM. The close relationship betweenMonocular SLAM and SFM has been realized by the roboticscommunity and some research has been directed in exploitingideas from both fields [3]. However, there is much potentialfor further research in this area and many opportunities forsignificant advances exist.</p><p>While some solutions using optimisation strategies haveemerged recently, most successful Monocular SLAM algo-rithms use constant velocity model [5] [6] to constrain con-secutive camera poses, when direct motion measurements donot exist. In order for the constant velocity model to be valid,the camera motion must be smooth which significantly limitsthe application scope. Alternatively, the frequency of acquiringimages needs to be high, which can be achieved by using thehigh frequency camera [7]. However, this is at the expense ofsignificantly more computational resources.</p><p>Although the sparseness of the problem structure has beenwidely acknowledged and used in SFM methods [8] [9] andSLAM methods [10] [11] to develop efficient algorithms, incurrent development of Monocular SLAM, the sparseness hasnot been adequately exploited. Most current Monocular SLAMmethods use Extended Kalman Filter (EKF) based techniquesand tend to limit the number of features used for representing</p><p>978-1-4244-8551-2/10/$26.00 c2010 IEEE 311 ICIAfS10</p></li><li><p>the environment due to computational reasons. In [5], onlyaround one hundred features are estimated. The Hierarchicalmap approach is introduced into Monocular SLAM [6], butthe total number of estimated features is still limited.</p><p>The main contributions of this paper are as follows. SectionII describes a new formulation of Monocular SLAM thatmakes it possible to incorporate platform motion estimatescomputed using ideas from multi-view geometry into thetraditional SLAM observation equations. A novel efficientMonocular SLAM algorithm, that fully exploits the sparsestructure of the new formulation is presented in Section III.Experimental results to demonstrate the effectiveness of theproposed techniques are provided in Section IV.</p><p>II. NEW FORMULATION OF MONOCULAR SLAM</p><p>Popular image feature extraction methods such as SURF[12] allow hundreds or thousands of features to be extractedfrom a single image and tracked across images. However,incorporating the information present in all these observationsin to the SLAM solution by maintaining these features in thestate vector leads to an unacceptable computational cost.</p><p>The new formulation presented below tackles this problemfrom a novel perspective. A significant proportion of informa-tion gathered from the extracted features are transformed intogeometrical information relating consecutive camera poses,and a structure that allows these information to be incorporatedin the SLAM solution is provided. The remaining features aremaintained in a map to provide a comprehensive representa-tion of the environment. The new formulation preserves thesparseness of the problem structure, thereby allowing efficientestimation of the states using sparse information filters. By thismeans, the information extracted from the images is maximallyexploited, while maintaining the computational efficiency andthus providing a more accurate and detailed map of theenvironment.</p><p>Suppose two images are taken at camera pose Pi andthen Pj , and a set of features are extracted from theseimages. SURF [12] feature extractor is a popular choice forthis purpose. It is proposed to divide the feature set in totwo parts as = [p,m]. The set of measurements to thefeatures in are accordingly divided as = [p,m]. Eachmeasurement in is in the form of (u, v), which representsthe feature position in the image. While m is to be used forestimating feature locations in the map as usual, the featuresin the subset p are used for computing relative pose betweenPi and Pj , providing an alternative method for capturinginformation contained in the observations.</p><p>Features in p that are common to the two images madefrom Pi and Pj are used to estimate the essential matrixby the five-point method [4] using known camera intrinsicparameters. Outliers are handled using RANSAC [14] and theepipolar constraint. From the resulting essential matrix, therotation matrix, R, and translation vector, T = [xT , yT , zT ]T ,describing the relative pose of Pj relative to Pi are computedusing the method provided in [13]. Consideration should begiven to select poses with a large baseline to ensure an accurate</p><p>estimate of the essential matrix. Due to the unobservablescale in the essential matrix, T only represents the translationdirection of Pj with respect to Pi. Then R is transformedinto the quaternion q, which is denoted as ZR. The translationvector, T , is transformed into ZT = [, ]T , where = xT /zTand = yT /zT .</p><p>We resolve the two key issues that have so far madeit difficult to exploit ZR and ZT in an estimation theo-retic framework. Uncertainty of the relative pose estimateis computed using the Unscented transform [15], making itstraightforward to obtain. Relative pose between two pairs ofposes computed this way may become statistically dependentdue to the reuse of raw data. This is eliminated by makingsure that measurements to a feature in the environment fromdifferent poses are allocated to only one of the subsets (por m). , i.e. these can not be at one time be allocated inp, while in m at other times. The Laplacian sign associatedwith features extracted by provides a convenient mechanismfor this purpose.</p><p>III. SPARSE EIF FORMULATION FOR MONOCULAR SLAM</p><p>A. Coordinate system</p><p>The world frame, denoted as W , is the reference frame. Thecamera frame, C, is attached to the camera, with the originlocated at the lens optical center with the Z axis along theoptical axis. The Euclidean coordinates of a feature in theworld frame is denoted as Xf = [xf , yf , zf ]T . The camerapose, Xc, contains the Euclidean coordinates of the pose inthe world frame, XWcORG, and the quaternion describing theorientation of the camera, qWC . The state vector contains allcamera poses and feature locations:</p><p>X = (XTc1, XTc2, . . . , X</p><p>Tf1, X</p><p>Tf2, . . .)</p><p>T . (1)</p><p>B. Observation equations</p><p>To exploit information present in the observations, rela-tionships between the measurements and the state vector isrequired. In the proposed formulation, measurements take threedifferent forms; Zm that correspond to the features that belongto the set m used for representing the environment, and ZTand ZR that represent the relative pose measurements extractedfrom feature observations that belong to the set p. Thus threeobservation equations are needed.</p><p>Suppose a measurement Zm in m, is made from a camerapose Xc to a feature Xf . The coordinates of the feature in thecamera frame is given by</p><p>XCf = RCW (Xf XWcORG), (2)</p><p>where the rotation matrix RCW is obtained from qWC , andXCf = [x</p><p>Cf , y</p><p>Cf , z</p><p>Cf ]</p><p>T . Then the observation equation for Zmis</p><p>Zm =</p><p>[uv</p><p>]= hm(X) + wm (3)</p><p>978-1-4244-8551-2/10/$26.00 c2010 IEEE 312 ICIAfS10</p></li><li><p>where hm(X) = [xCfzCf</p><p>foc + u0,yCfzCf</p><p>foc+ v0]T , (u0, v0) is the</p><p>coordinate of the principle point in the image coordinate frame(in pixels), foc is the focal length (in pixels), and wm is themeasurement noise assumed to be white Gaussian.</p><p>The observation equation for ZR is</p><p>ZR = qCiCj = hR(X) + wR (4)</p><p>where hR is obtained from qWCj = qCiCj</p><p>qWCi (</p><p>indicates the quaternion product), and wR is the measurementnoise assumed to be white Gaussian.</p><p>The observation ZT is the position (up to a scale) of posePj relative to pose Pi. First transform XcjORG into the cameraframe Ci by</p><p>XCicj = RCiW (XWcjORG XWciORG), (5)</p><p>where the rotation matrix RCiW is obtained from qWCi , andXCicj = [x</p><p>Cicj , y</p><p>Cicj , z</p><p>Cicj ]</p><p>T . The observation equation for ZT is</p><p>ZT =</p><p>[</p><p>]= hT (X) + wT (6)</p><p>where hT (X) = [xCicj</p><p>zCicj</p><p>,yCicj</p><p>zCicj</p><p>]T and wT is the measurement noise</p><p>assumed to be white Gaussian.</p><p>C. Map initialization</p><p>For the estimation process to start, initial estimates of allthe elements of the state vector are required.</p><p>The first camera pose frame is selected as the worldframe. To initialize the map. To obtain accurate relative posemeasurements between two poses, the two images need to havesufficient number of common features and the two poses needto have a reasonably large baseline. Starting from the secondimage, all the images in the selected set are tested against theimages made from known poses (at the first step, only the firstpose) with respect to the above two conditions. If any imagesatisfies those conditions, then the relative pose measurementsZT and ZR are computed, and the corresponding pose aswell as new features are initialized by triangulation. Wheninitializing a pose, a feature measurement from the pose toa known feature and a relative pose measurement with aknown pose are used. When initializing a feature, two featuremeasurements from the two related poses each are used. Theabove process is repeated until all poses in the selected set areinitialized. Normally this can be achieved except in the extremecase where there are very few features in the environment.</p><p>After all poses and features are initialized, all measurementsfrom the selected images are fused using EIF to obtain animproved estimate as follows.</p><p>D. Sequential fusion</p><p>The steps for fusing the measurements from a new imageare summarized in Algorithm 1.</p><p>The steps of computing initial guesses for the new poseand locations of new features in the new image are the same</p><p>Algorithm 1 Sequential fusion of the image taken from posePkRequire: Correspondence between features in current image</p><p>and one or more previous images, as well as features inthe current map.</p><p>1: Select a known pose and compute the relative pose mea-surements with Pk.</p><p>2: Compute the initial guess for Xck by triangulation fromthe selected pose and an old feature.</p><p>3: Compute initial guesses for new features by triangulationfrom Pk and old poses.</p><p>4: Add zeros to information vector and information matrixfor the new pose and new features.</p><p>5: Fuse all measurements incurred by the current image usingthe EIF.</p><p>as those in the process described in Section III-C. Theseinitial guesses are then added into the state vector, and zeroscorresponding to the new states are added into the informationmatrix and information vector.</p><p>The information in the feature measurements and relativepose measurements are fused using the observation equationsin (3), (4) and (6). A general form of these observationequations is</p><p>Zg = hg(X) + wg (7)</p><p>where Zg can be any of Zm, ZR and ZT , with hg and wgbeing the corresponding expressions and measurement noisesin (3), (4) and (6).</p><p>The information matrix, I , and information vector, i, areupdated by</p><p>I(k) = I(k 1) +HTg Q1g Hgi(k) = i(k 1) +HTg Q1g [zg(k)Hg(X(k 1))</p><p>+HgX(k 1)](8)</p><p>where Hg is the Jacobian of the function hg with respect toall the states evaluated on the current state estimate X(k 1),and Qg is the covariance matrix of the measurement noise. Itshould be noted that ZR and ZT from the same relative posemeasurement should be fused together as they are correlated.</p><p>E. Smoothing</p><p>The above filtering steps produce accurate estimates offeature locations and camera poses. A further smoothing stepusing all stored measurements can be used to further improvethe estimate...</p></li></ul>


View more >