


Map-aided 6-DOF Relative Pose Estimation for Monocular SLAM using Sparse Information Filters

Zhan Wang
ARC Centre of Excellence for Autonomous Systems (CAS)
Faculty of Engineering, University of Technology, Sydney
Sydney, Australia
[email protected]

Gamini Dissanayake
ARC Centre of Excellence for Autonomous Systems (CAS)
Faculty of Engineering, University of Technology, Sydney
Sydney, Australia
[email protected]

Abstract—This paper addresses the problem of mapping three-dimensional environments from a sequence of images taken by a calibrated camera, while simultaneously generating the camera motion trajectory. This is the Monocular SLAM problem in robotics, and is akin to the Structure from Motion (SFM) problem in computer vision. We present a novel map-aided 6-DOF relative pose estimation method based on a new formulation of Monocular SLAM that is able to provide better initial estimates of new camera poses than the simple triangulation traditionally used in this context. Here '6-DOF' is relative to the map, which is itself defined only up to an unobservable scale. The proposed pose estimator also allows more effective outlier rejection when matching features present in the map against features extracted from two consecutive images. Our Monocular SLAM algorithm is able to deal with arbitrary camera motion, making the smooth-motion assumption required by the commonly used constant velocity model unnecessary. In the new Monocular SLAM formulation, the measurements of features extracted from images are partitioned into those used for estimating the environment and those used for estimating the camera motion. The new formulation enables the current map estimate to aid full 6-DOF relative pose estimation up to the mapping scale while maximally exploiting the geometric information in the images. Experimental results are provided to verify the proposed algorithm.

Index Terms—Monocular SLAM, sparse information filter.

I. INTRODUCTION

Autonomous vehicles, such as mobile robots, are required to navigate intelligently in scenarios such as indoor guidance, rescue, firefighting, and underwater exploration. Efficiently acquiring a geometric representation of the environment is crucial to the success of these tasks. The simultaneous localization and mapping (SLAM) problem has been one of the most important problems in the robotics community over the last decade [1] [2]. Monocular SLAM is a special SLAM problem in that it requires only a very simple device, a single camera. Only bearing observations are available and there is no odometry information, making it a challenging problem. In the computer vision literature, the structure from motion (SFM) problem, which deals with reconstructing the geometry of a scene as well as the camera poses from a stream of images, has been studied for many years. It has been recognized that there is an inherent link between Monocular SLAM and SFM.

As direct motion measurements do not exist, many successful Monocular SLAM algorithms use a constant velocity model (for example [3] and [4]) to constrain consecutive camera poses. This requires the camera motion to be smooth, which significantly limits the scope of application. High-speed cameras [5] can be used to minimize the effect of this limitation, at the expense of more computational resources.

In [6], Monocular SLAM is treated as a graph optimization problem. Each node is a local submap, which is constructed by combining observations of features in the submap. The nodes are related by scaled Euclidean transformations containing estimates of common features. With the number of features bounded in each node, the method achieves real-time performance for a total of hundreds of features. In [7], particle filters are used and mapping of hundreds of features is also reported. In the more recent work [8], recovery and loop closing are solved by using the bag-of-words model, and thousands of features are mapped in real time with an implementation on a dual-core computer.

In solving the SFM problem using bundle adjustment [9] [10], it has been widely acknowledged that by exploiting the sparseness of the problem structure, efficiency can be improved through sparse optimization and sparse matrix techniques [11]. In [12], adjustment is only applied to the most recent poses and the relevant features. The method is shown to be able to deal with an outdoor image sequence covering a distance of over one kilometer in close to real time, but with a small drift. In [13], adjustment is performed within submaps, followed by a global alignment step, but the method is not real-time. The sparse problem structure has also been used to develop efficient algorithms for range-bearing SLAM problems (such as [14] and [15]). However, most current Monocular SLAM methods do not make use of this sparseness and therefore cannot maintain many features due to the computational cost.

Relative pose estimation is a crucial step in both Monocular SLAM and SFM methods. The initial pose guess is the linearization point for the observation model. Accurate relative pose estimates are important because a linearization point close to the true value is the fundamental requirement for non-linear filters such as the Extended Information Filter (EIF) to provide consistent estimates [16].



By using the epipolar constraint and other properties of the fundamental matrix/essential matrix, different methods have been developed to solve the relative pose problem. In the eight-point algorithm, eight image points are used to compute a unique fundamental matrix; it is quite robust when the image points are normalized first [17]. The current state-of-the-art five-point algorithm [18] efficiently computes the possible solutions of the essential matrix using the minimal five image points from two calibrated views. After the essential matrix is verified through a test procedure, the rotation matrix and translation vector between the two views can be recovered, up to a scale. In [19] the cheirality constraint, which requires that reconstructed point correspondences lie in front of the cameras, is integrated into RANSAC relative pose estimators. It provides more consistent relative pose estimates and prevents inliers from being falsely identified as cheirality outliers.

We solve the relative pose problem within a Monocular SLAM framework using an EIF. Monocular SLAM is reformulated such that the current map estimates are able to aid the full 6-DOF relative pose estimation, where the scale is consistent with that assigned to the map. With this method, our Monocular SLAM algorithm does not rely on the constant velocity model [20] or any vehicle motion model [21]. The consistency of the estimates is improved due to the more accurate linearization point. The relative pose estimation method also helps detect wrong data associations (outliers in matching), which may cause the filter to fail. Whereas existing Monocular SLAM methods make use of only a small portion of the image features, our algorithm provides a mechanism to use the information present in image features that are not incorporated in the map, thereby maximally exploiting the image information.

The rest of the paper is organized as follows. Section II describes a strategy to partition the extracted image features into two groups, making it possible for the map under construction to aid the relative pose estimation and for image information to be transformed into platform motion observations that improve the estimates. A novel map-aided 6-DOF relative pose algorithm is presented in Section III. The map construction procedure is briefly described in Section IV. Experimental results are provided in Section V to verify the proposed algorithms.

II. PARTITIONING IMAGE FEATURE INFORMATION INTO MAPS AND RELATIVE POSE OBSERVATIONS

Image feature extraction approaches such as SURF [22] extract hundreds of features from each single image. How to efficiently use these features is crucial for Monocular SLAM algorithms and is clearly a challenge. Using all of these features to represent the environment within a filter leads to overwhelming computational effort to maintain the huge number of states. On the other hand, using all of these features to estimate the relative pose between successive image frames, as in visual odometry, does not provide a direct representation of the environment, making it impossible to exploit the additional information available when a previously traversed region is revisited. We partition the extracted features into two groups: one is used as point features to represent the environment; the other is used to estimate the relative pose. With this formulation, the map under construction is able to aid full 6-DOF relative pose estimation (up to the arbitrary scale used in the initial construction of the map).

Suppose one image is taken at camera pose $P_j$, and a set of features $\Lambda$ is extracted from the image (using SURF [22] in this work). We use the sign of the Laplacian to divide the extracted SURF feature set $\Lambda$ into two groups, $\Lambda = [\Lambda_p, \Lambda_m]$. The sign of the Laplacian distinguishes bright blobs on dark backgrounds from the reverse situation; therefore the features in $\Lambda_p$ and $\Lambda_m$ will never be confused with each other when observed from different poses. The set of measurements of the features in $\Lambda$ is divided accordingly as $\Omega = [\Omega_p, \Omega_m]$. Each measurement in $\Omega$ has the form $(u, v)$, which represents the feature position in the image. The subset $\Omega_p$ from $P_j$ is matched with the subset $\Omega_p$ from other poses to compute the relative pose estimate, and $\Omega_m$ is used for estimating feature locations in the map. The Monocular SLAM problem is reformulated as shown in Figure 1. Many feature observations that were thrown away in the old formulation are transformed into relative pose observations and are thereby fused into the system in the new formulation. This is a very efficient way of incorporating feature information from images: many feature observations involving many feature states are transformed into a relative pose observation that involves only two camera pose states.
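As an illustration, the following minimal Python sketch partitions the features of one image by the sign of the Laplacian. The Feature container, the function name, and the particular sign convention are assumptions made for the example; the paper's own implementation is in Matlab.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Feature:
    u: float                 # measurement (u, v): feature position in the image
    v: float
    descriptor: tuple        # SURF descriptor vector (placeholder)
    laplacian_sign: int      # sign of the Laplacian reported by SURF (+1 or -1)

def partition_features(features: List[Feature]) -> Tuple[List[Feature], List[Feature]]:
    """Split the SURF features of one image into Lambda_p (used for relative
    pose estimation) and Lambda_m (used as map features) by Laplacian sign.
    Which sign goes to which group is an arbitrary but fixed convention."""
    lambda_p = [f for f in features if f.laplacian_sign > 0]
    lambda_m = [f for f in features if f.laplacian_sign <= 0]
    return lambda_p, lambda_m
```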

Fig. 1. Reformulation of Monocular SLAM. (a) Old formulation. (b) New formulation.

Monocular SLAM can be viewed as a constrained optimization problem: find the environment configuration and camera poses that best fit the feature-pose observations. In the old formulation (Figure 1(a)), which is used by existing algorithms, only a small number of feature-pose constraints are used, while the others are thrown away. These unused constraints, however, can be transformed into relative pose constraints and used in the new formulation (Figure 1(b)); thus better optimization results are expected. Alternatively, Monocular SLAM can be viewed as an information collection process, in which the information in the images is continuously incorporated into the estimates of feature locations and camera poses. The new formulation clearly provides a mechanism to collect the maximal amount of geometric information from the images. Importantly, the new formulation makes it possible to produce the full 6-DOF relative pose estimate (up to an overall scale) with the aid of the current map estimates when a new image becomes available.

We use the Extended Information Filter (EIF) for estimating the map and camera poses. It will be shown that the sparse problem structure of Monocular SLAM is still maintained by the new formulation. The task now is to estimate the locations of the camera poses and features so as to fit the feature-pose observations and relative pose observations. The world frame, denoted as $W$, is the reference frame. The camera frame, denoted as $C$, is attached to the camera, with the origin located at the optical center of the lens and the $Z$ axis coinciding with the optical axis, pointing forward.

Let the Euclidean 3D coordinates of a feature in the world frame be denoted as $X_f = [x_f, y_f, z_f]^T$. The camera pose is denoted as $X_c$, which contains the Euclidean 3D coordinates of the pose in the world frame, $X^W_{cORG}$, and the quaternion describing the orientation relative to the world frame, $q^{WC}$. The whole state vector contains all camera poses and feature locations:

$$X = (X_{c1}^T, X_{c2}^T, \ldots, X_{f1}^T, X_{f2}^T, \ldots)^T. \quad (1)$$
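As a concrete illustration of this state layout, the following Python sketch stacks two camera poses (each a 3D position plus a unit quaternion) and two features into a single state vector; the numeric values and the (w, x, y, z) quaternion ordering are assumptions made for the example.

```python
import numpy as np

# two camera poses: position X_cORG^W plus orientation quaternion q^WC (w, x, y, z)
camera_poses = [
    np.concatenate([np.zeros(3), [1.0, 0.0, 0.0, 0.0]]),             # X_c1
    np.concatenate([[0.10, 0.00, 0.02], [0.999, 0.0, 0.044, 0.0]]),  # X_c2
]
# two point features X_f = [x_f, y_f, z_f]^T in the world frame
features = [np.array([1.0, -0.5, 6.0]), np.array([2.3, 0.4, 8.1])]

# state vector of eq. (1): all camera poses followed by all feature locations
X = np.concatenate(camera_poses + features)
print(X.shape)   # (2*7 + 2*3,) = (20,)
```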

III. MAP-AIDED 6-DOF RELATIVE POSE ESTIMATION

Most existing Monocular SLAM algorithms rely on the constant velocity model for estimating the next camera pose simply because there is no odometry information. This obviously assumes that the camera motion is smooth, which significantly limits the application. By exploiting the corresponding features in two successive images, a 5-DOF relative pose can be estimated using multiple-view geometry; however, the magnitude of the translation is not observable. We present a novel method which produces a full 6-DOF relative pose estimate that is still up to a scale but is sufficient to provide accurate pose initialization. This method makes our Monocular SLAM algorithm capable of handling arbitrary camera motion without introducing any other sensors, such as odometry or an IMU.

The overall algorithm of the 6-DOF relative pose estimation method is summarized in Algorithm 1 and the details are provided in the following sections.

A. Computing essential matrix and 5-DOF relative pose

Epipolar geometry, probably one of the most studied problems in computer vision, describes the intrinsic projective geometry between two images. The fundamental matrix, denoted as $F$, captures this geometry and satisfies the relation $x_2^T F x_1 = 0$, where $x_1$ and $x_2$ are corresponding points in the two images. It defines the epipolar constraint: for an image point $x_1$ in the first image, its corresponding point in the second image must lie on the epipolar line defined by $F x_1$ [17]. In our case, where the camera is calibrated, the epipolar geometry depends only on the relative pose between the two views. The fundamental matrix reduces to the essential matrix, denoted as $E$, and the epipolar constraint simplifies to $x_2^T E x_1 = 0$, in which the image points $x_1$ and $x_2$ have been premultiplied by the inverse of the corresponding calibration matrix.
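The following self-contained Python/NumPy sketch illustrates the calibrated epipolar constraint: it builds an essential matrix $E = [t]_\times R$ from an assumed relative motion, projects a synthetic 3D point into both (normalized) views, and checks that $x_2^T E x_1$ vanishes. All numeric values are assumptions for the example.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# assumed relative motion of the second camera: X_2 = R @ X_1 + t
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.05])

E = skew(t) @ R                       # essential matrix for x_2^T E x_1 = 0

# a synthetic 3D point expressed in the first camera frame, seen in both views
X1 = np.array([0.5, -0.3, 4.0])
X2 = R @ X1 + t
x1 = np.append(X1[:2] / X1[2], 1.0)   # normalized (calibrated) image coordinates
x2 = np.append(X2[:2] / X2[2], 1.0)

print(x2 @ E @ x1)                    # approximately 0, up to floating-point error
```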

Algorithm 1 Map-aided 6-DOF relative pose estimation of $P_j$ relative to $P_i$
Require: Estimates of $P_i$ and of the locations of features in the environment
1: With the calibration parameters, use $\Omega_p$ of $P_i$ and $P_j$ to estimate the essential matrix.
2: Compute the 5-DOF relative pose estimate from the essential matrix.
3: Compute the initial guess of $P_j$. The location of $P_j$ along the translation direction is decided by assuming the same speed as in the last step.
4: Match $\Lambda_m$ of $P_j$ with those of previous poses and obtain the matching set between $\Lambda_m$ of $P_j$ and the features in the current map.
5: Use the matching set from step 4 to estimate the location of $P_j$ along the translation direction from $P_i$. At the same time, detect outliers in the matching set from step 4.

We choose the features in $\Omega_p$ that are observed from the two camera poses $P_i$ and $P_j$ to compute the essential matrix using the five-point method [18]. After that, the relative pose of $P_j$ with respect to $P_i$ is recovered from the essential matrix by the method in [17]. The rotation matrix is denoted as $R$, and the translation vector as $T = [x_T, y_T, z_T]^T$. It should be noted that $T$ only represents the translation direction; the relative pose description is therefore 5-DOF. $R$ and $T$ are used to compute the initial guess for $P_j$ and are also used as relative pose observations between $P_i$ and $P_j$ in the EIF update step. The uncertainty of the relative pose observations is computed using the unscented transform [23]. A large baseline should be ensured between the two poses to obtain an accurate estimate of the essential matrix. Features in $\Omega_p$ must not be reused across multiple relative pose estimations, in order to avoid reusing raw data.
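The sketch below shows one way steps 1-2 of Algorithm 1 could be realized in Python, using OpenCV's five-point RANSAC estimator as a stand-in for the implementation of [18]; the helper name and the calibration matrix K are assumptions, and pts_i, pts_j are the pixel coordinates of the $\Omega_p$ matches between $P_i$ and $P_j$.

```python
import numpy as np
import cv2

def relative_pose_5dof(pts_i: np.ndarray, pts_j: np.ndarray, K: np.ndarray):
    """Estimate the essential matrix from Omega_p matches (Nx2 pixel arrays)
    and recover the 5-DOF relative pose (rotation R, unit translation t)."""
    E, inlier_mask = cv2.findEssentialMat(pts_i, pts_j, K,
                                          method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # recoverPose resolves the four-fold decomposition ambiguity via the
    # cheirality check; t is returned only up to scale (a unit direction)
    _, R, t, inlier_mask = cv2.recoverPose(E, pts_i, pts_j, K, mask=inlier_mask)
    return R, t, inlier_mask
```

The translation magnitude, i.e. the sixth degree of freedom, is then resolved with the aid of the map as described in the next subsection.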

B. Resolving 6th dimension of relative pose up to mapping scale

From the results of the last section, we have a 5-DOF description of the pose $P_j$ relative to $P_i$, namely the rotation and the translation direction. Since the estimated value of $P_i$ is available, we can compute the orientation of $P_j$, and $P_j$ is located somewhere on the line determined by $P_i$ and the translation direction. This is illustrated in Figure 2(a). In Figure 2(a), the position of $P_j^G$ on the line $L$ is decided by assuming the distance between $P_j$ and $P_i$ is the same as that between $P_i$ and the previous pose. Since the value of $P_i$ and the 5-DOF relative pose estimate are reasonably accurate, we assume $P_j^T$ is on the line $L$. Therefore, the task becomes searching for $P_j^T$ on the line $L$ based on $P_j^G$.

Suppose the map feature observations made at $P_j$ match a number of features in the map, among which there are outliers. As illustrated in Figure 2(a), the solid lines from $P_j^T$ to features indicate inliers, while the dashed line indicates an outlier. The predicted values of these observations made at $P_j$ can be computed from the estimates of the observed features and the value of $P_j$, which at the moment is $P_j^G$. Obviously the real observations do not equal the predicted values, due to the difference between $P_j^G$ and $P_j^T$. However, the inlier observations should be in accordance with an observation model involving $P_j^G$, the estimates of the relevant features, and the difference between $P_j^G$ and $P_j^T$, while the outlier observations are not. Because SURF descriptors are used, the matching process produces a much larger number of inliers than outliers. Thus, the problem becomes a model fitting problem with outliers, by solving which we can obtain an improved value of $P_j$ and detect the outliers.

Fig. 2. (a) The idea of resolving the 6th dimension of the relative pose up to the map scale ($P_j^G$ is the initial guess; $P_j^T$ is the true value). (b) Offsetting observation model.

C. Offsetting observation model

Using the pinhole camera model, the offsetting observation model can be derived as illustrated in Figure 2(b). The pose in red (gray) is the true pose, while the one in black is the initial guess. A feature $F_s$ is observed. Figure 2(b) shows a top-down view of the scenario, and $f_u$ is the focal length along the $u$ axis in image space.

The predicted observation $u_2$ can be computed as

$$u_2 = \frac{x_s}{z_s} f_u + u_0, \quad (2)$$

where $u_0$ is the coordinate of the principal point along the $u$ axis in the image coordinate frame (in pixels). However, the real image observation, $u$, is not equal to $u_2$; ideally, it should be equal to $u_1$. As $P_j^T$ is not known, $u_1$ is computed as

$$u_1 = \frac{x_s - n}{z_s - m} f_u + u_0. \quad (3)$$

In this model, the unknown variables are $m$ and $n$. Similar equations exist for the observation $v$.
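The two predictions of eqs. (2) and (3) are simple enough to state directly as code; the following sketch, with made-up numbers, shows how an offset (n, m) of the camera along the line shifts the predicted pixel coordinate.

```python
def predict_u_guess(x_s, z_s, f_u, u_0):
    """Eq. (2): predicted u assuming the initial guess P_j^G is correct."""
    return x_s / z_s * f_u + u_0

def predict_u_offset(x_s, z_s, f_u, u_0, m, n):
    """Eq. (3): predicted u when the true pose is offset by n (in x) and m (in z)."""
    return (x_s - n) / (z_s - m) * f_u + u_0

# assumed values: feature 5 m ahead and 0.5 m to the side, f_u = 500 px, u_0 = 320 px
print(predict_u_guess(0.5, 5.0, 500.0, 320.0))               # 370.0
print(predict_u_offset(0.5, 5.0, 500.0, 320.0, 0.2, 0.1))    # (0.4/4.8)*500 + 320 = 361.67
```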

D. One-dimensional curve fitting

As shown in Figure 2(b), the initial guess and the true value of $P_j$ are both on the line $L$. Suppose the current estimate of $P_i$ is $X^W_{ci}$ and the initial guess of $P_j$ is $X^W_{cj}$. First transform $X_{ciORG}$ into the camera frame $C_j$ by

$$X^{C_j}_{ci} = R^{C_j W} (X^W_{ciORG} - X^W_{cjORG}), \quad (4)$$

where the rotation matrix $R^{C_j W}$ is obtained from $q^{W C_j}$, and $X^{C_j}_{ci} = [x^{C_j}_{ci}, y^{C_j}_{ci}, z^{C_j}_{ci}]^T$. It should be noted that $X^{C_j}_{ci}$ indicates the direction of the line $L$ in the camera frame $C_j$. The variables $m$ and $n$ in equation (3) can be written as $m = d\, z^{C_j}_{ci}$ and $n = d\, x^{C_j}_{ci}$, where $d$ is an unknown variable. Therefore, there is only one parameter in the offsetting observation model. The values $x_s$ and $z_s$ in equation (3) can be computed from $X^W_{cj}$ and the estimate of feature $F_s$, $X^W_{fs}$. The coordinates of the feature in the world frame are transformed into the camera frame $C_j$ by

$$X^{C_j}_{fs} = R^{C_j W} (X^W_{fs} - X^W_{cjORG}), \quad (5)$$

where the rotation matrix $R^{C_j W}$ is obtained from $q^{W C_j}$, and $X^{C_j}_{fs} = [x^{C_j}_{fs}, y^{C_j}_{fs}, z^{C_j}_{fs}]^T$. Then $x_s = x^{C_j}_{fs}$ and $z_s = z^{C_j}_{fs}$.
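A minimal Python sketch of the frame transformations in eqs. (4)-(5) is given below, using scipy's Rotation to turn the quaternion into a rotation matrix. The quaternion convention (scipy uses (x, y, z, w) ordering and is taken here to rotate from $C_j$ to $W$, hence the transpose) and all numeric values are assumptions made for the example.

```python
import numpy as np
from scipy.spatial.transform import Rotation

q_WCj = np.array([0.0, 0.1, 0.0, 0.995])     # assumed orientation of P_j^G, (x, y, z, w)
X_cjORG_W = np.array([1.0, 0.2, 3.0])        # assumed position of P_j^G in the world frame
X_ciORG_W = np.array([0.8, 0.2, 2.5])        # estimated position of P_i
X_fs_W = np.array([2.0, -0.4, 9.0])          # estimated location of feature F_s

R_CjW = Rotation.from_quat(q_WCj).as_matrix().T   # world -> C_j rotation

X_ci_Cj = R_CjW @ (X_ciORG_W - X_cjORG_W)    # eq. (4): direction of line L in frame C_j
X_fs_Cj = R_CjW @ (X_fs_W - X_cjORG_W)       # eq. (5): feature expressed in frame C_j

x_s, z_s = X_fs_Cj[0], X_fs_Cj[2]            # values used in eq. (3)
x_ci, z_ci = X_ci_Cj[0], X_ci_Cj[2]          # so that n = d * x_ci and m = d * z_ci
```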

Normal least-squares fitting methods are vulnerable to outliers. We therefore use least-squares fitting with bisquare weights [25] to fit the data to the offsetting observation model. It follows an iteratively reweighted least-squares procedure: bisquare weights are assigned to data points depending on their distance to the model, according to the bisquare function [26]. Points near the curve get full weight, while points far away get reduced weight. By this means, the effect of outliers is minimized. The model parameter $d$ is obtained and used to improve the value of $P_j$. Outliers are also detected and removed from the matching list.
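The following self-contained sketch shows one way such an iteratively reweighted fit of the single parameter $d$ could look, combining a Gauss-Newton step with Tukey bisquare weights; the tuning constant, the MAD-based scale estimate, and the function names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def predict_u(d, feat_cam, line_dir, f_u, u_0):
    """Offsetting model of eq. (3) with n = d*x_ci and m = d*z_ci."""
    x_s, z_s = feat_cam
    x_ci, z_ci = line_dir
    return (x_s - d * x_ci) / (z_s - d * z_ci) * f_u + u_0

def fit_d_bisquare(u_obs, feats_cam, line_dir, f_u, u_0, c=4.685, iters=20):
    """Iteratively reweighted least squares for the scalar d; near-zero final
    weights flag the corresponding map matches as outliers."""
    u_obs = np.asarray(u_obs, dtype=float)
    d, w = 0.0, np.ones_like(u_obs)
    for _ in range(iters):
        pred = np.array([predict_u(d, f, line_dir, f_u, u_0) for f in feats_cam])
        r = u_obs - pred                              # residuals
        s = 1.4826 * np.median(np.abs(r)) + 1e-9      # robust scale (MAD)
        a = r / (c * s)
        w = np.where(np.abs(a) < 1.0, (1.0 - a**2) ** 2, 0.0)   # bisquare weights
        eps = 1e-6                                    # numerical derivative d(pred)/dd
        J = np.array([(predict_u(d + eps, f, line_dir, f_u, u_0)
                       - predict_u(d, f, line_dir, f_u, u_0)) / eps
                      for f in feats_cam])
        denom = float(np.sum(w * J * J))
        if denom < 1e-12:
            break
        d += float(np.sum(w * J * r)) / denom         # weighted Gauss-Newton step
    return d, w
```

In a full implementation the residuals of the $v$ coordinates would be stacked with those of the $u$ coordinates in the same way.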

Note that so far the observations made at pose $P_j$ have not been used in the EIF update step. The improved value of $P_j$ is used as the linearization point for the observation model, with which the estimate of pose $P_j$ is updated. The following section describes the filtering process.

IV. MAP CONSTRUCTION

We use the EIF to build the map. The first camera pose is used to define the world frame. To start the mapping process, we select a set of images in sequence from the beginning. Starting from the second image, we select images that have a large baseline and observe enough common features with the images from known poses (at the first step, only the first pose). The relative pose measurements $R$ and $T$ are computed for the selected images, and the corresponding pose as well as the new features are initialized by triangulation. This process is repeated until all poses in the selected set are initialized.
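One way the triangulation step could be written is sketched below, using OpenCV; the helper name, the world-to-camera convention of R and t, and the inputs are assumptions for illustration.

```python
import numpy as np
import cv2

def triangulate(K, R1, t1, R2, t2, pts1, pts2):
    """Triangulate features observed from two initialized poses.
    R*, t* map world coordinates into each camera frame (x_cam = R X_w + t);
    pts1, pts2 are 2xN arrays of matched pixel coordinates."""
    P1 = K @ np.hstack([R1, t1.reshape(3, 1)])        # 3x4 projection matrices
    P2 = K @ np.hstack([R2, t2.reshape(3, 1)])
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4xN homogeneous points
    return (X_h[:3] / X_h[3]).T                       # Nx3 Euclidean feature locations
```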

When a new image becomes available, the feature observations from it are fused into the map by the steps in Algorithm 2.


Algorithm 2 Fusing a new image at $P_j$ into the map using the EIF
Require: Estimates of previous poses and of the locations of features in the environment.
1: Initialize the pose $P_j$ using the relative pose estimation method described in Section III.
2: Match features in the current image with those in selected previous images to find correspondences of both old features in the map and new features that are to be initialized.
3: Initialize features that are observed from two poses by triangulation.
4: Update the pose and feature location estimates using the pose-feature observations and relative pose observations made at $P_j$ with the EIF.

A general form of the observation equation is

$$Z_g = h_g(X) + w_g, \quad (6)$$

where $Z_g$ can be any of the pose-feature observations, as well as the relative pose orientation and translation direction observations, with $h_g$ and $w_g$ being the corresponding expression and measurement noise.

The information matrix, $I$, and information vector, $i$, are updated by

$$I(k) = I(k-1) + \nabla H_g^T Q_g^{-1} \nabla H_g$$
$$i(k) = i(k-1) + \nabla H_g^T Q_g^{-1} \left[ z_g(k) - h_g(\hat{X}(k-1)) + \nabla H_g \hat{X}(k-1) \right] \quad (7)$$

where $\nabla H_g$ is the Jacobian of the function $h_g$ with respect to all the states, evaluated at the current state estimate $\hat{X}(k-1)$, and $Q_g$ is the covariance matrix of the measurement noise. It should be noted that $R$ and $T$ from the same relative pose observation should be fused together, as they are correlated.
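A minimal sketch of the update in eq. (7) with sparse matrices is given below; the function signature and the use of scipy.sparse are illustrative assumptions, but they reflect why the formulation stays efficient: each observation involves only one or two poses and a few features, so the Jacobian, and therefore the added information, is sparse.

```python
import numpy as np
from scipy import sparse

def eif_update(I, i_vec, x_hat, z, h_g, H_g, Q_g):
    """One EIF update of eq. (7) for a single observation z = h_g(X) + w_g.

    I, i_vec : sparse information matrix (n x n) and information vector (n,)
    x_hat    : current state estimate (linearization point), shape (n,)
    h_g      : observation function; H_g its (m x n) Jacobian evaluated at x_hat
    Q_g      : (m x m) covariance of the measurement noise
    """
    H = sparse.csr_matrix(H_g)                     # sparse: few non-zero columns
    Q_inv = sparse.csr_matrix(np.linalg.inv(Q_g))
    I_new = I + H.T @ Q_inv @ H                    # information matrix stays sparse
    innovation = z - h_g(x_hat) + H @ x_hat
    i_new = i_vec + H.T @ (Q_inv @ innovation)
    return I_new, i_new
```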

V. EXPERIMENTAL RESULTS

A real image sequence taken in New College, Oxford [27] was used to verify the algorithm. The image resolution is 512 × 384. Our Matlab implementation of the algorithm was applied to the collected images. The image sequence taken from the left camera of the Point Grey Bumblebee is used in this experiment. Figure 3(a) shows a sample image from the dataset. The final 3D map, projected onto the horizontal plane, is shown in Figure 3(b). In this experiment, features on the ground and on the building that have large parallax are chosen by the algorithm, and their locations are estimated.

We also evaluated the algorithm using an outdoor dataset collected on a street in Leuven [24] from cameras onboard a vehicle. The image resolution is 360 × 228. We applied our Matlab implementation to a large part of the image sequence taken from the left camera, in which the vehicle is moving along a street with moving objects such as cars and bicycle riders. An image from the dataset is shown in Figure 4(a).

Fig. 3. Experiment results. (a) One sample image. (b) Camera pose and feature location estimates projected to 2D (X-Z plane). The asterisks represent camera poses; the solid lines indicate the orientation of the camera forward direction; the dots represent features.

The experimental environment and the vehicle trajectory are shown in Figure 4(b). In the 2D projection of the resulting map, shown in Figure 4(c), it can be seen that the camera trajectory is quite similar to the trajectory in Figure 4(b). The results also show that the algorithm works well even in the presence of moving objects.

VI. DISCUSSIONS AND CONCLUSIONS

In this paper we presented a full 6-DOF relative pose estimation method within a new formulation of the Monocular SLAM algorithm. Multiple-view geometry is applied to compute a 5-DOF constraint on the relative pose, and the estimated map is used to resolve the remaining dimension up to the scale assigned to the map. The resulting estimates of the map are subject to an unobservable scale, which can be resolved by any measurement of absolute states, such as the location of a camera pose or the distance between two features. The Monocular SLAM algorithm we proposed does not rely on the constant velocity model or any vehicle motion model, and thus is able to deal with arbitrary camera motion. Differing from other filter-based methods, the prediction step in our method is improved by using a nonlinear optimizer, which greatly improves the accuracy of the predicted states, as in the iterated Extended Kalman Filter.

The resulting information matrix of the algorithm is exactly sparse.

Fig. 4. Experiment results. (a) One sample image. (b) Experimental environment and vehicle trajectory. (c) Camera pose and feature location estimates projected to 2D (X-Z plane). The asterisks represent camera poses; the solid lines indicate the orientation of the camera forward direction; the dots represent features.

The sparse structures in SFM and Monocular SLAM have an inherent relationship. In both problems, sparseness occurs in different forms but arises from the same fundamental characteristic: the measurements are made among local objects (poses or features). This sparse structure is captured by our EIF-based Monocular SLAM algorithm, and leads to significant computational savings.

REFERENCES

[1] Bailey, T., and Durrant-Whyte, H. 2006. Simultaneous localization and mapping (SLAM): part II. Robotics and Automation Magazine, 13(3): 108-117.

[2] Durrant-Whyte, H., and Bailey, T. 2006. Simultaneous localization and mapping: part I. Robotics and Automation Magazine, 13(2): 99-110.

[3] Davison, A. J., Reid, I. D., Molton, N. D. and Stasse, O. 2007. MonoSLAM: Real-Time Single Camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6): 1052-1067.

[4] Clemente, L. A., Davison, A. J., Reid, I., Neira, J. and Tardos, J. D. 2007. Mapping Large Loops with a Single Hand-Held Camera. Robotics: Science and Systems.

[5] Gemeiner, P., Davison, A. and Vincze, M. 2008. Improving Localization Robustness in Monocular SLAM Using a High-Speed Camera, Robotics: Science and Systems.

[6] Eade, E. and Drummond, T. 2007. Monocular SLAM as a Graph of Coalesced Observations, IEEE International Conference on Computer Vision, pp. 1-8.

[7] Eade, E. and Drummond, T. 2006. Scalable Monocular SLAM, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 469-476.

[8] Eade, E. and Drummond, T. 2008. Unified Loop Closing and Recovery for Real Time Monocular SLAM, British Machine Vision Conference.

[9] Triggs, B., McLauchlan, P. F., Hartley, R. I. and Fitzgibbon, A. 2000. Bundle Adjustment - A Modern Synthesis, in Triggs, B., Zisserman, A., and Szeliski, R. (Eds.): Vision Algorithms, pp. 298-372.

[10] Konolige, K. and Agrawal, M. 2008. FrameSLAM: From Bundle Adjustment to Real-Time Visual Mapping, IEEE Transactions on Robotics, 24(5): 1066-1077.

[11] Lourakis, M. I. A. and Argyros, A. A. 2004. The Design and Implementation of a Generic Sparse Bundle Adjustment Software Package Based on the Levenberg-Marquardt Algorithm, ICS/FORTH Technical Report, No. 340.

[12] Mouragnon, E., Lhuillier, M., Dhome, M., Dekeyser, F. and Sayd, P. 2006. 3D reconstruction of complex structures with bundle adjustment: an incremental approach, IEEE International Conference on Robotics and Automation, pp. 3055-3061.

[13] Ni, K., Steedly, D. and Dellaert, F. 2007. Out-of-Core Bundle Adjustment for Large-Scale 3D Reconstruction, IEEE International Conference on Computer Vision (ICCV), pp. 1-8.

[14] Thrun, S., Burgard, W., and Fox, D. 2005. Probabilistic Robotics. The MIT Press.

[15] Dellaert, F. and Kaess, M. 2006. Square Root SAM: simultaneous localization and mapping via square root information smoothing. International Journal of Robotics Research, 25(12): 1181-1203.

[16] Huang, S., and Dissanayake, G. 2007. Convergence and consistency analysis for Extended Kalman Filter based SLAM. IEEE Transactions on Robotics, 23(5): 1036-1049.

[17] Hartley, R. I. and Zisserman, A. 2004. Multiple View Geometry in Computer Vision, 2nd edition, Cambridge University Press.

[18] Nister, D. 2004. An Efficient Solution to the Five-Point Relative Pose Problem, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6): 756-770.

[19] Xu, W. and Mulligan, J. 2008. Robust relative pose estimation with integrated cheirality constraint. International Conference on Pattern Recognition, pp. 1-4.

[20] Civera, J., Grasa, O. G., Davison, A. J. and Montiel, J. M. M. 2009. 1-point RANSAC for EKF-based Structure from Motion, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3498-3504.

[21] Scaramuzza, D., Fraundorfer, F. and Siegwart, R. 2009. Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC, IEEE International Conference on Robotics and Automation, pp. 4293-4299.

[22] Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. 2008. SURF: Speeded Up Robust Features, Computer Vision and Image Understanding, 110(3): 346-359.

[23] Julier, S. J. and Uhlmann, J. K. 2004. Unscented filtering and nonlinear estimation, Proceedings of the IEEE, 92(3): 401-422.

[24] Leibe, B., Cornelis, N., Cornelis, K. and Gool, L. V. 2007. Dynamic 3D Scene Analysis from a Moving Vehicle, IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8.

[25] Martinez, W. L. and Martinez, A. R. 2001. Computational Statistics Handbook with MATLAB, Chapman & Hall/CRC.

[26] Mosteller, F. and Tukey, J. W. 1977. Data Analysis and Regression, Addison-Wesley.

[27] Smith, M., Baldwin, I., Churchill, W., Paul, R. and Newman, P. 2009. The New College Vision and Laser Data Set, International Journal of Robotics Research, 28(5): 595-599.
