Monocular SLAM with undelayed initialization for an indoor robot

Robotics and Autonomous Systems 60 (2012) 841–851

Contents lists available at SciVerse ScienceDirect

Robotics and Autonomous Systems

journal homepage: www.elsevier.com/locate/robot

Monocular SLAM with undelayed initialization for an indoor robotKiwan Choi, Jiyoung Park, Yeon-Ho Kim, Hyoung-Ki Lee ∗

Micro Systems Lab., Samsung Advanced Institute of Technology, Samsung Electronics Inc., Yongin-Si, Gyeonggi-Do, 449-712, South Korea

a r t i c l e i n f o

Article history:Received 13 May 2010Received in revised form3 February 2012Accepted 6 February 2012Available online 13 February 2012

Keywords:Feature initializationLinearization errorsMono visionSLAMMobile robots

a b s t r a c t

This paper presents a new feature initializationmethod formonocular EKF SLAM (Extended Kalman FilterSimultaneous Localization and Mapping) which utilizes a 3D measurement model in the camera framerather than 2D pixel coordinates in the image plane. The key idea is to regard a camera as a range andbearing sensor, of which the range information contains numerous uncertainties. 2D pixel coordinates ofmeasurement are converted to 3D points in the camera framewith an assumed depth. The element of themeasurement noise covariance corresponding to the depth of the feature is set to a very high value. Andit is shown that the proposedmeasurement model has very little linearization error, which can be criticalfor the EKF performance. Furthermore, this paper proposes an EKF SLAM system that combines odometry,a low-cost gyro, and low frame rate (1–2 Hz) monocular vision. Low frame rate is crucial for reducing theprice of the processor. This system combination is cost-effective enough to be commercialized for a realvacuum cleaning application. Simulations and experimental results show the efficacy of the proposedmethod with computational efficiency in indoor environments.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Samsung Electronics has been developing vacuum cleaningrobots for the home environment that generate a cleaning pathwith minimal overlap of the traveled region. To do this, the robotsrequire the capability of localization and map building. For thelocalization capability, our robots are equipped with a camerafor external position measurements. They execute a visual SLAM(Simultaneous Localization and Mapping) algorithm based on anEKF (Extended Kalman Filter). In home environments, the needfor the visual features for a SLAM algorithm does not permitrobots to travel under a bed, for example. Thus, our robot hasa dead-reckoning system of two wheel encoders and a low-costgyro for relative position calculations when the visual features arenot available. A calibrated MEMS gyro gives a good orientationestimate for a few minutes and very accurate relative positioninformation in the frame of Kalman filter [1,2]. Thanks to thisdead-reckoning system, the required frame rate of images forvisual feature tracking can be reduced to 1 or 2 Hz while previousmonocular SLAM with only one camera needs the high frame rateof more than 20 Hz. The sensor combination of a low-cost MEMSgyro and a low frame rate camera leads to a practical system forvacuum cleaning application.

Davison [3] presented the feasibility of real-time monocularSLAM with a standard low-cost camera using an EKF framework.

∗ Corresponding author. Tel.: +82 31 280 1761; fax: +82 31 280 9257.E-mail addresses: [email protected], [email protected]

(H.-K. Lee).

0921-8890/$ – see front matter© 2012 Elsevier B.V. All rights reserved.doi:10.1016/j.robot.2012.02.002

In his work, the camera obtains only bearing information, meaningthat multiple observations from different points of view mustbe processed to achieve accurate depth information. Davisonused an extra particle filter to estimate the feature depths.However, this method causes a delay in feature initialization.Bailey [4] also proposed a delayed initialization method forbearing-only SLAM. This method used a Gaussian approximationapproach similar to EKF. Costa et al. [5] presented another delayedinitialization method which calculated the initial uncertainty offeature locations. Their method projected measurements fromthe sensor space onto the Cartesian world frame and assumedthat their depth directional uncertainties had large values. Theirmethod was also a delayed method. Such delayed initializationschemes have the drawback that new features, held outsidethe main probabilistic state, are not able to contribute to theestimation of the camera position until finally included in themap.Further, features that retain low parallax over many frames (thosevery far from the camera or close to themotion epipole) are usuallyrejected completely because they never pass the test for inclusion.

Kwok and Dissanayake [6] presented a multiple hypothesisapproach that outlined their undelayed feature initializationscheme. Each landmark is initialized in the form of multiplehypotheses with different depths distributed along the directionof the bearing measurement. Using subsequent measurements,the depth of the landmark is estimated in an EKF framework andhypothesized landmarks with different depths are tested by asequential probability ratio test. Invalid hypotheses are removedfrom the system state. Sola et al. [7] improved computationalefficiency of this approach by approximating a Gaussian sum filter.

http://dx.doi.org/10.1016/j.robot.2012.02.002

http://www.elsevier.com/locate/robot

http://www.elsevier.com/locate/robot

mailto:[email protected]

mailto:[email protected]

http://dx.doi.org/10.1016/j.robot.2012.02.002

842 K. Choi et al. / Robotics and Autonomous Systems 60 (2012) 841–851

Eade andDrummond [8] estimate the inverse depths of featuresrather than their depths. The inverse depth concept has anadvantage in terms of linearity, allowing linear Kalman filteringto operate successfully. While this approach utilizes a FastSLAM-type particle filter, Montiel et al. [9,10] applied this concept toEKF SLAM. Using an inverse depth entry in the state vector anda modified measurement model allows unified modeling andprocessing within the standard EKF without a special initializationprocess. Measurement equations have little linearization errorfor distant features, and the estimation uncertainty is accuratelymodeled as the Gaussian in inverse depth. A feature is representedby the state vector of six dimensions at the first observation—threefor the camera position, two for the direction of the ray and one forthe reciprocal depth. This is an over-parameterization comparedto three dimensions for Euclidean XYZ coding and hence increasesthe computational cost.

This paper presents a new method for initializing landmarksdirectly from the first measurement without any delay. Ourapproach updates themeasurement in the 3D camera frame. This isin contrast to previous methods that utilize 2D image coordinatesin the image plane. The proposed method was inspired by themeasurement model of 3D cameras (such as the SR4000 by MesaImaging) and by 3D range sensors that measure the range andbearing simultaneously. A camera is considered as a 3D rangesensor butwith a considerable amount of depth uncertainty.Whena feature is initially measured, the depth value of the feature in the3D coordinates is set to an arbitrary value with a sufficiently largecovariance. It should be noted that the arbitrary value has littleeffect on the EKF performance due to its large covariance. Anotheradvantage of the proposed scheme is its high degree of linearity.A more linear measurement equation leads to better performancefrom the EKF. Our approach is much simpler and more efficientthan earlier approaches that usemultiple hypotheses or an inversedepth technique.

This paper is organized as follows. Section 2 presents our EKFframework with the new measurement update model. Section 3explains the undelayed initialization scheme using our EKFframework and introduces the methods adopted to implementSLAM. Section 4 shows the simulation and experimental results.Section 5 closes the paper with a conclusion.

2. EKF framework

2.1. Coordinate frames

For a clear explanation, we will start with a definition of theCartesian coordinate frames. See Fig. 1. The letters ‘w’, ‘b’ and ‘c ’denote the world frame, the body frame and the camera frame,respectively. The origin of the camera frame is the focal point of thecamera and that of the body frame is the center of the two drivingwheels. The body frame and the camera frame are in the same rigidbody.

2.2. State vector definition

A robot moving on the irregular floor of a home environmentinvolves attitude changes. Thus, a 6-DOFmodel is used despite thefact that the robot moves with 3-DOF most of the time.

The state vector to represent the position and orientation of therobot in the world frame is given by

xv =

xvpψ

=

x y z θ ϕ ψ

T (1)

where θ, ϕ and ψ are the roll, pitch, and yaw of the Euler anglesrespectively; xvp is the position vector of the origin of the cameraframe and ψ is the orientation vector of the camera. It should be

Fig. 1. Coordinate frames.

noticed that the robot position and orientation are represented bythe camera frame rather than the body frame, which can reducethe level of complexity. Because the pitch angle of our robotis always between −π/2 and π/2, the Euler angles are morecomputationally efficient than the quaternion which is commonlyadopted for visual SLAM algorithms to use a hand-held camera.Because a dead-reckoning system can provide good predictioninformation, a velocity term doesn’t appear in the state vector. Thestate vector includes the 3D position vector y of the landmark inthe world frame, as do the conventional SLAM frameworks. Thus,the state vector x and its error covariance matrix P for our EKFframework are given as

x =

xvy1...yn

, P =

Px Pxy1 · · · Pxyn

Py1 Py1yn...

. . ....

Pynx · · · Pyn

(2)

where n is the number of registered landmarks.

2.3. Prediction model

Our dead-reckoning system consists of a yaw gyroscope andtwo wheel encoders. It can calculate a translation in the XY -planeof the body frame and the rotation around the z-axis of the bodyframe. Given that it cannot provide any information about thetranslation in the Z direction and the roll and pitch rotations,they are assumed to remain unchanged. Instead, the uncertainty inthese values will increase as the robotmoves. Feature locations arealso assumed to be stationary. Thus the prediction model at step kcan be given as

xvp,k|k−1 = xvp,k−1|k−1 + Cwc 1xc (3)

ψk|k−1 = ψ,k−1|k−1 +

1 sin θ tanϕ cos θ tanϕ0 cos θ − sin θ0 sin θ secϕ cos θ secϕ

1ψc (4)

yi,k|k−1 = yi,k−1|k−1 (5)

where Cwc is the transformation matrix from the camera frame tothe world frame [11].1xc and1ψc are the incremental changes inthe position vector and the orientation vector in the camera framerespectively. They are simply calculated from the incrementalchanges in the body frame given by the dead-reckoning system.Eq. (4) is a simple forward difference representation of the partialdifferential equation which relates the Euler angle to the bodyrates. It is acceptable if1ψc is small enough and if the roll and pitchangles change within small values. EKF predicts the state vector xand its covariance matrix P, as follows:

xv,k|k−1 = f(xv,k−1|k−1,uk) (6)

K. Choi et al. / Robotics and Autonomous Systems 60 (2012) 841–851 843

Fig. 2. Measurement model.

Px,k|k−1 =∂f∂xv

Px,k−1|k−1∂f∂xv

T

+∂f∂u

Qk∂f∂u

T

(7)

Pxy,k|k−1 =∂f∂xv

Pxy,k−1|k−1 (8)

where

uk =

1xc1ψc

k

(9)

and Q is the covariance matrix of u.

2.4. Measurement model

In the conventional monocular SLAM, observations are trackedfeature points in the image plane. Naturally, all in the case of theprevious monocular SLAM update the measurements in the imageplane. This means that the measurement vector z is expressed inthe image plane. In contrast, all components of the state vector xare in theworld frame. To perform EKF, the feature positions of thestate vectormust inevitably be transformed into those in the imageplane. The upper block in Fig. 2 shows this process. Particularly,the transformation from the camera frame to the image plane cancause a critical linearization error due to the large uncertainty ofthe depth.

To overcome this, we propose a different approach inwhich themeasurement is represented in the camera frame rather than in theimage plane. In Fig. 2, yc is the 3D position vector of the landmarkin the camera frame. As shown in the lower block in Fig. 2, thistechnique fundamentally removes the linearization error inducedby transformation from the camera frame to the image plane. Thisis described in greater detail in the linearization error analysis inthe next section.

Our measurement and its covariance matrix are

z =

vc1...vcm

, R =

Rv1 0. . .

0 Rvm

(10)

where m is the number of observed features, vc is the featureposition in the camera frame, and Rv is the error covariancematrixof vc . The observation vector z requires the 3D coordinates of thefeature, referring to Fig. 3. This involves the problem of how toconvert the 2D information of the camera measurement to 3Dinformation.

Here, we explain how to obtain vc from the observed value of v.The 3 × 4 camera projection matrix which describes the mapping

Fig. 3. Landmark initialization.

of a camera from 3D points in the camera frame to 2D points in theimage plane is written as

C = KC [M | t] (11)

whereM is the 3 × 3 rotation matrix, KC is the camera calibrationmatrix and t is the translation vector which is actually the zerovector because the origin of the camera frame is the focal point.The directional vector ct in Fig. 3 can be obtained by the followingequation:

ct =

ct1ct2ct3

= (KCM)−1

v1

. (12)

In addition, vc can be obtained by

vc = dct|ct |

(13)

where d is an unknown depth. Here, ct is in the camera frame. Ifprior information about the depth is available, d can be estimatedwith it. Otherwise, d is set to an arbitrary value. For example, whenwe observe it for the first or second time, there is no information.Therefore, it can be assumed to be the maximum distance of theworkspace or another proper value. Instead, the covariance matrixRv will be used through careful consideration.

To determine Rv , an error covariance of the feature in the(ct , cn1, cn2) frame is introduced as

Rv0 =

ε2ct 0 00 ε2cn1 00 0 ε2cn2

(14)

where εct , εcn1 and εcn2 are the standard deviations of the error inthe ct , cn1 and cn2 directions, respectively (Fig. 3). ct , cn1 and cn2


are orthogonal to each other and cn1 and cn2 can be chosen in anydirection.

The most important point is that the value of εct must be setas large enough to reflect the uncertainty of the depth. The idealvalue of εct is an infinite value, but this is not feasible for numericalcomputation. In our system, a value over 105 cm occasionallycaused problems during the process of inverting a matrix, andSLAM resulted in failure. It is recommended that

3000 cm ≤ εct ≤ 105 cm. (15)

The values above were obtained empirically. They serve simply toprovide a guideline when selecting a proper value of εct .

Because the value of εct cannot actually be infinite, there isone problem that must be addressed. The same observed imagescan lead to erroneous results. For example, when a robot doesnot move, it observes nearly identical images and tracks the samepoints that have the same errors. Given that the Kalman filter wasinduced with the assumption that there is no correlation betweenthe measurements and predicted states, these measurements areagainst this assumption. This can lead to an underestimation ofthe covariance, which means the situation when the estimatedcovariance is smaller than the true one. This problem also appliesto conventional methods, but it is more serious in our approachas it has the additional effect of decreasing the depth uncertaintyincorrectly. To avoid this, a measurement should not be updatedwhen the robot does not move.

The values of εcn1 and εcn2 depend approximately on thetracking error, calibration error, and the depth such that

εcn ≈

ε2trk + ε2cal

dfp

(16)

where εtrk is the standard deviation of the image feature trackingerror, εcal is the standard deviation of the camera calibration error,and fp is the focal length in pixel units.

After determining Rv0,Rv is determined by solving the follow-ing equation,

Rv = MctRv0MTct , (17)

where Mct is the rotation matrix that makes the x-axis parallel tothe ct-direction.Mct can be calculated from ct as follows:

Mct =

cosψct − sinψct 0sinψct cosψct 0

0 0 1

cosϕct 0 sinϕct0 1 0

− sinϕct 0 cosϕct

, (18)

where ψct and ϕct are defined as follows

cosψct =ct1

c2t1 + c2t2, sinψct =

ct2c2t1 + c2t2

(19)

cosϕct =

c2t1 + c2t2

c2t1 + c2t2 + c2t3, sinϕct =

−ct3c2t1 + c2t2 + c2t3

. (20)

In the next step, it is necessary to transform the features of thestate vector to the camera frame. The measurement model of thej-th feature is given as

hj(xν, yj) = Mcw(yj − xvp) (21)

where Mcw is the rotation matrix from the world frame to the

camera frame. The Jacobian matrix Hj of function hj is given as

Hj = [−Mcw|δMc

w,j|Mcw] (22)

δMcw,j =

∂Mc

w

∂θ∆j

∂Mcw

∂ϕ∆j

∂Mcw

∂ψ∆j

(23)

where

∆j = yj − xvp. (24)

At this point, we can obtain h and H by arranging hj and Hj. Themeasurement update equations are well known, as follows:

xk|k = xk|k−1 + K(zk − h(xk|k−1)) (25)

Pk|k = (I − KH)Pk|k−1 (26)

K = Pk|k−1HT (HPk|k−1HT+ Rk)

−1. (27)

2.5. Linearization error analysis

EKF SLAM has a drawback in that linearization errors can makea systemunstable. The trigonometric functions and the perspectiveprojection are the main factors that cause this. Eqs. (3)–(4)in the prediction process and Eq. (21) in the measurement processhave nonlinearities due to the trigonometric functions of the robotorientation. When the orientation error is small, the trigonometriclinearization error does not cause much trouble. A precise dead-reckoning system or a high update rate can be helpful to keep theorientation error small. Our dead-reckoning system is designedwell enough to give about 1.5 degree of yaw angle error after10 rotations [1]. Furthermore, multi-map approaches [12,13]can reduce these trigonometric nonlinearities, as they maintainsmall orientation errors in local maps. The iterated Kalman filterframework [14] can also decrease the overall nonlinearities byapplying an optimization theory to EKF, but this increases thecomputation time in keeping with the increased number ofiterations.

Especially with regard to feature initialization, the nonlinearityof the perspective projection is more critical because the depthdirectional uncertainty is very large. Therefore, we focus on this inthis paper. Our method is compared with two previous methods:one is the conventional method to use measurement modelexpressed in the image plane, which includes most of delayedfeature initialization SLAM and multiple hypothesis method [6,7]for undelayed feature initialization. The other is the inverse depthmethod [8–10].

Fig. 4 shows measurement models and their Jacobians. Tosimplify this analysis, functionh in Eq. (25) is decomposed into twofunctions:

h(xv, y) = hz(yc) = hz(hy(xv, y)) = hz ◦ hy(xv, y). (28)

hz is introduced to transform yc into the measurement z and hyis a function to satisfy yc = hy(xv, y).

C in (11) is assumed to be a simple pinhole cameramodel basedon our coordinate system defined in Fig. 1:

C =

0 fu 0 00 0 fv 01 0 0 0

(29)

where fu and fv are the focal length in pixel units with respectto each values of u and v in the image plane. For the inversedepth method in Fig. 4, xρ, θρ, ϕρ and ρ are the position of thecamera frame, the feature’s azimuth, the elevation and the inversedepth w.r.t. the world frame at the first observation of a feature,respectively. m is the unit depth directional vector. (Refer to [10]for more details.)

To analyze nonlinearity for only the perspective projection, thefollowing assumption is made:

Assumption 1. The robot’s orientation has no error.


Fig. 4. EKF measurement model comparison.

This implies that ψ and Mcw are assumed to be constant; thus,

only xvp and y in the state are variables. Using the chain rule, theJacobian matrix of h can be derived as

Jh(xvp, y) = Jhz(yc)Jhy(xvp, y). (30)

If all elements in the Jacobian matrix are constant, it is aperfectly linear function.

First, we analyze our method. The Jacobian of h based on theproposed method is

Jph(xvp, y) =−Mc

w | Mcw

. (31)

All its elements are obviously constant. Consequently, it is clearthat our measurement model has no perspective linearizationerror under Assumption 1.

Second, the Jacobian of h based on the conventional method is

Jch(xvp, y) =

−fuycx2c

fuxc

0

−fvzcx2c

0fvxc

−Mc

w | Mcw

. (32)

Its elements include highly nonlinear functions of xc . Whenfeatures are first observed and xc-axis is parallel to the depthdirection, xc has significantly high uncertainty. This will lead tothe divergence of the EKF SLAM. To reduce the depth directionaluncertainty, the multiple hypothesis approach divides one featurewith large uncertainty into several features with much lessuncertainty.

Lastly, we will obtain the Jacobian of h of the inverse depthmethod. Among the six parameters of the inverse depth method,the others except ρ have much smaller uncertainties than ρ. Forthis reason, ∂h/∂θρ and ∂h/∂ϕρ are neglected here. Then, theJacobian of h based on the inverse depth method is

Jρh (xvp, xρ, ρ)

=

−fuycx2c

fuxc

0

−fvzcx2c

0fvxc

−Mc

w | Mcw | −

1ρ2

Mcwm

. (33)

This appears more nonlinear than Eq. (32). To examine thelinearization error of the Jacobian with respect to ρ, we neglect∂h/∂xvp and ∂h/∂xρ by introducing the following assumption.

Assumption 2. ρ is the only variable in the Jacobian.

For simple explanation, the following definitions are introduced.

mc =

mcxmcymcz

= Mc

wm, x0c =

x0cy0cz0c

= Mc

w(xρ − xvp). (34)

Using (34), yc can be derived as

yc =

xcyczc

= Mc

w

xρ +

1ρm − xvp

= x0c +

1ρmc . (35)

With Assumption 2, the Jacobian can be determined by

Jρh (ρ) =

−fuycx2c

fuxc

0

−fvzcx2c

0fvxc

−

mc

ρ2

=

fuycx2cρ2

mcx −fu

xcρ2mcy

fvzcx2cρ2

mcx −fv

xcρ2mcz

. (36)

Finally, inserting (35) into (36), the Jacobian with respect to ρdbecomes

Jρh (ρ) =

fumcxy0c − mcyx0c(x0cρ + mcx)2

fvmcxz0c − mczx0c(x0cρ + mcx)2

. (37)

This still appears to be nonlinear. However, it should be noted thatif the dimensionless term x0c ρ is much smaller than mcx, Eq. (37)becomes a linear function, as follows.

Assumption 3. x0cρ ≪ mcx.

This assumption leads to

Jρh (ρ) ≈

fumcxy0c − mcyx0c

m2cx

fvmcxz0c − mczx0c

m2cx

. (38)

This shows that Jρh (ρ) is approximately independent of ρ andconstant under Assumptions 1–3. The validity of the Assumption 3needs to be checked. The minimum value of mcx is 0.5 with 120°field of view, for example. ρ becomes very small when a feature


is very distant. Thus, the inverse depth method will perform wellwith distant features. In addition, x0c is zero if the camera ismovingonly in the directions of the yc-and zc-axes. In contrast, moving inthe direction of the xc-axiswill increase x0c . Thus, the inverse depthmethod will perform better in case of lateral movement comparedto forward and backward movements.

In conclusion, the conventional method has too much non-linearity to initialize a feature without a delay. The multiplehypothesis method divides the nonlinearity by the number ofhypotheses, making an undelayed initialization possible. The in-verse depth method can be considered as nearly linear underAssumptions 1–3. The proposedmethod is perfectly linear only un-der Assumption 1. This explains why the proposed EKF SLAM doesnot diverge in spite of such a large error covariance. These analyseswill be verified with the simulation in Section 4.3.

3. Implementation of SLAM

In the following,we briefly introduce the implementation of theproposed EKF-based SLAM.

3.1. Undelayed feature initialization

In the previous section, we proposed a new approach to updatethe measurement in EKF. Our feature initialization concept worksalong the same lines but with one additional step. The otherprocesses are identical to how the measurement update operates.The additional step is transformation from the camera frame tothe world frame. Overall, the newly observed feature v will betransformed to vc and then to y for the initialization. vc can beobtained from v using Eqs. (12)–(13). Its covariance matrix Rv canalso be determined by Eqs. (14)–(20). Subsequently, vc and Rv aretransformed to y and Py, as follows:

y = xvp + Mwc v

c (39)

Py = Twc PxTwcT

+ Mwc RvM

wcT (40)

where Mwc is the rotation matrix from the camera frame to the

world frame, and

Twc =I3×3|δMw

c

(41)

δMwc =

∂Mw

c

∂θvc

∂Mwc

∂ϕvc

∂Mwc

∂ψvc

. (42)

After obtaining y and Py, they are added to the state vector x. Itscovariance matrix P is updated as follows:

Pyj = Py (43)

Pyjx = Twc Px, Pxyj = PTyjx (44)

Pyjyi = Twc Pxyi , Pyiyj = PTyjyi (45)

where j is the index of the feature to add, and i is one of the otherfeature indices except j.

This initialization method is much simpler than any othermethod. For example, the inverse depth method [9,10] needs anadditional step inwhich a feature’s state vector (of six dimensions)is reduced to one simple position vector (of three dimensions)after the feature’s uncertainty has become small enough. But theproposed method does not require such a post-processing step.We think this simplicity is the most important advantage of theproposed method. It is the measurement update in the cameraframe that enables such a simple initialization.

From a different point of view, the proposed method does notappear to express the error distribution accurately. Its ideal shapeis conic as shown in Fig. 5. However, the shape expressed by a

Fig. 5. Covariance analysis.

covariance matrix in the Cartesian coordinate is essentially anellipse. Multiple hypothesis approaches use several hypothesesto make the shape as conical shape as possible. The inversedepth approach also initializes a feature with conically shapeduncertainties using six parameters. Ours, in contrast, has an ovalshape instead of a conical shape. This misrepresentation of theerror distribution can induce problems such as underestimation oroverestimation. These underestimated and overestimated regionsare shown in Fig. 5. Theunderestimation region becomes smaller asd increaseswhile the opposite is true for the overestimation region.The underestimation may be worse than the overestimationbecause the former can fail SLAM somewhat, whereas the lattermainly increases the convergence time and causes wider searchregion. Thus, we recommend that d is given themaximumdistanceof theworkspace if there is noprior information. On the other hand,this implies that our method does not work well for features atinfinity. Therefore, our method is recommended for use in indoorenvironments.

3.2. Feature extraction

Image features should be persistent and reliable to performlong-term tracking. There are many methods that can be used toselect salient image patches from an image. Although the SIFT-based algorithm (Scale-Invariant Feature Transform) is one of themost commonly used methods at present, it could not be chosenhere as it requires more computational power than our targetsystem (ARM processor) can supply. Therefore, we adopted theHarris corner detector [15] due to its high computation efficiency.The number of features in one image is maintained between 5and 10.

3.3. Search region

To track the features efficiently, we set the search region onthe image. By reducing the search region, we can reduce the timenecessary for the tracking procedure. However, if the region is toosmall, feature tracking failure occurs more frequently. The searchregion is expressed as the estimated position of the feature in theimage plane and its error covariance matrix.

3.4. Feature tracking

As a tracking method, we basically utilize the Lucas–Kanademethod [16] with only 2-DOF translation using patches of 15 ×

15 pixels. The initial points are selected as Harris corners insidea search region. The Lucas–Kanade method finds the point thathas the minimum SSD (Sum of Squared Difference) value. SSDis not robust against illumination changes. Thus, we modify theLucas–Kanade method to find the minimum point of the zero-mean SSD and validate it with NCC (Normalized Cross Correlation),which makes data association much more robust.


(a) k = 0 (feature initialization). (b) k = 1 (measurement update).

Fig. 6. Simulation of one step update of EKF (diamond: the true position of the landmark, circle: the position that SLAM estimates, triangle: robot position).

Cleaning robots move along a cleaning path in the manner of aseed sowing machine. This consists of moving forward followedby 90° rotation. Moving forward involves feature scale changes,which lead to long-term tracking failure. Therefore, we adopted afeature-scale-prediction scheme. It is similar to predictive multi-resolution descriptors [17] which utilize the descriptors of theimages of different resolutions. However, we directly utilize ascale-transformed patch image with an estimated scale withoutusing descriptors.

3.5. Removing outliers

The above tracking method gives robust matching pairs.However, because moving objects or repeated patterns can exist,matching errors, also known as outliers, are inevitable and inducecritical errors in EKF-SLAM. To remove these, there are twopopular methods: RANSAC (Random Sample Consensus) [18] andJCBB (Joint Compatibility Branch and Bound) [19,20]. In homeenvironments, the number of features is occasionally small. JCBBcan better deal with a smaller number of matching pairs using theerror covariance of the features compared to RANSAC. Moreover,as our tracking method gives only one candidate or nothing foreach feature, we use the simplified version of JCBB [20]. However,if there are less than three inliers, JCBB does not work well either.In this case, EKF should skip updating with the current frame andgo to the next frame.

4. Simulation and experiment

We performed a number of simulations and experiments tovalidate our approach. The simulations focused on looking intohow exactly the positions of the landmarks are calculated incomparison with the previous methods. Additionally, the overallperformance of SLAM was validated through actual experiments.

4.1. Simulation in ideal situation

This simulation is intended to showhow accurately ourmethodcan estimate the location of landmarks in an ideal situation.Thus, the camera is assumed to be ideal with infinite resolutions,no noise, and no lens distortion. In addition, feature tracking isassumed to be perfect. Referring to Fig. 6(a), the robot is locatedin the initial position and a landmark is observed initially. Theobserved landmark is initialized to yj,0 with an assumed depth.Then, the robot moves forward 15 cm and the landmark position isupdated to yj,1 (Fig. 6(b)), which is compared to the true position

Fig. 7. Simulation result in an ideal situation. The true landmark positions, theestimated positions and the robot location are shown (εct = 1e4 cm, assumeddepth = 2000 cm).

yj,true. Fig. 7 shows the position difference between yj,1 and yj,true.Ten landmarks are located approximately 10 m away in front ofthe robot. yj,true is marked with blue diamonds and red circlesrepresent yj,1. The red ellipses are the 1σ bounds of the errorcovariances. Fig. 7 shows very small errors between the true andthe estimated even though EKF updates only one time.

In Fig. 8, the RMS (root mean square) error of (yj,true − yj,1)for ten landmarks is shown varying εct from 200 to 100,000. Also,the difference between the assumed depth and the true depth ischanged from 0 to 30 m. As εct is increased, the smaller error offeature location canbe obtained. The result shows that a large valueof εct can lead to nearly zero error with only one step calculation,even though the assumed value of the depth differs greatly fromthe true one.

4.2. Simulation with noise

To simulate the more realistic situation, we adopted twonoise sources of feature tracking error and yaw orientation error.Normally distributed random noise was added into the trackedfeature point v to include feature tracking error. The standarddeviation of the noise was set to 0.2 pixels. To include the yaworientation error, Gaussian noise is added to1ψc of the predictionmodel of Eq. (4) at every 15 cmmovement of the robot. Its standard


Fig. 8. Simulation result in an ideal situation using only two imageswhile the robotmoves forward 15 cm.

Fig. 9. Simulation result with noise using eleven frames while the robot movesforward 150 cm.

deviationwas set to 0.1°, whichwas chosen from the experimentaldata of our yaw gyro sensor. Contrary to the fact that consecutivetwo frames give a very small error in ideal situation, eleven framesare used in noisy situation while the robot moves forward 150 cm.The other simulation parameters were identical to the previousones of ideal situation. As shown in Fig. 9, εct with a value of morethan 3000 cmdoes not improve the accuracy any further. From thissimulation result, we recommend a value ofmore than 3000 cm forεct . This, of course, depends on the accuracy of the vision systemand the size of the environment. On a more accurate vision systemin a wider workspace, a larger value than 3000 cm can be chosen.

4.3. Comparison with previous methods

The proposed method was analytically compared to the earliermethods in Section 2.5, and this is validated through simulationsin this section. A feature point is placed from 500 to 1500 cm infront of the initial position of the robot. Its azimuth and elevationare 22° and 15°, respectively. The assumed depth is fixed at1000 cm. The assumed inverse depth is naturally 1/1000 (1/cm).There is no noise. To confirm the performance difference of theinverse depth method between lateral and forward (or backward)

movements, the robot moves forward 30 cm in Fig. 10(a) andmoves to the left 30 cm laterally without changing the yaw anglein Fig. 10(b). As for the simulation parameters of the proposedmethod, εct and εcn are set to 105 cm and 0.5 cm, respectively. Inthe inverse depth method, the standard deviation of the inversedepth (ρ) is set to 0.005 (1/cm) as recommended in the earlierstudy [10], and the standard deviation of the measurement errorin the image plane is set to 0.1 pixels. In the conventional method,a feature is initialized as our method and then updated in theimage plane. Fig. 10(a) shows that our method has almost noerror even when the true depth of the feature is 1500 cm andthe assumed depth is 1000 cm. The conventional method has verylarge error of feature locationwhen the error between the assumeddepth and the true depth becomes large. This explains why themultiple hypothesis method was introduced to cover the largedistance uncertainty. The multiple hypothesis method can dividethe error of the conventionalmethod by the number of hypotheses.According to the graph, the multiple hypothesis method requiresapproximately five hypotheses until it reaches a similar error asthe inverse depth method within the horizontal axis range of thegraph. Fig. 10(b) shows an interesting result, in which the inversedepth method has no error, similar to the proposed method. Thisprecisely coincides with our analysis in Section 2.5, which statedthat if there is no movement in the direction of the xc-axis, theinverse depthmethod is virtually linear. Consequently, consideringthe linearization error, the proposed method is the best, whereasthe inverse depth method appears to be better than the multiplehypothesis method.

Fig. 11 shows the comparison between the inverse depthmethod and the proposed method using experimental data. Therobot moves forward 130 cm and the Kalman filter updates 11times. 8 landmarks are located on front wall as shown in Fig. 13.The true positions of landmarks were not measured during theexperiment. The converged locations of the landmarks of the twoalgorithms are the same, so we assume that the converged pointsare true positions. In the proposed method, εct and εcn are setto 105 cm and 10 cm, respectively and the assumed depth d is1000 cm. In the inverse depth method, the standard deviation ofthe inverse depth (ρ) is set to 0.005 (1/cm) and the standarddeviation of the measurement error in the image plane is set to2 pixels. The assumed inverse depth ρ is 1/1000 (1/cm) at theinitial time. The average value of feature location error is plottedin Fig. 11. Two methods have almost the same error after a fewupdates of Kalman filters. It should be noticed that the proposedmethod has advantages on reduced computation time due toreduced number of states (three). In the inverse depth approach,a feature is represented by the state vector of six dimensions atthe first observation—three for the camera position and two forthe direction of the ray, and one for the reciprocal depth. This is anover-parameterization compared to dimension 3 for EuclideanXYZcoding and hence increases the computational cost, although theyused three variables after convergence.We just use three variablesto represent a feature from the start, which simplifies the overallcomputation.

4.4. Experimental results

We conducted experiments in an indoor environment. TheSamsung cleaning robot VC-RE70V, which is commercially avail-able in the market, is used as a robot platform [1]. The robot hasone camera of 320 × 240 resolution that faces upward in order tocapture ceiling images.We adjusted the camera orientation to cap-ture front view images as shown in Fig. 12. The dead-reckoning sys-tem includes two incremental encoders that measure the rotationof each motor, and an IMU (Inertial Measurement Unit) that pro-vides measures of the robot’s linear accelerations and yaw angular


(a) 30 cm forward movement. (b) 30 cm left movement.

Fig. 10. One step estimation for landmark location (assumed depth = 1000 cm).

Fig. 11. Comparison between the inverse depth method and the proposedmethodusing experimental data.

rate. 3-axis linear accelerations were used to improve the dead-reckoning performance during the wheel slip [2]. XV-3500 gyro-scope from EPSON was used to measure the yaw angle.

Ground truth data are obtained by a Hawk digital motiontracker system produced by Motion Analysis Inc., which hasaccuracy of less than 5 mm. A monochrome image sequence isacquired when the robot moves along the cleaning path at thespeed of 20 cm/s. At the same time, the data from the dead-reckoning system and the motion tracker system are saved. Thesampling rates of the encoder and gyroscope were 20 and 100 Hz,respectively. The frame rate of the camera is only 1–2 frames persecond. Our system does not require a high frame rate due to thedead-reckoning system. The SLAM algorithm is run off-line usingMatlab.

The results are shown in Figs. 13 and 14 and the accompanyingvideo clip entitled slam_result_01.avi is also available. The top leftimage in Fig. 13 shows the seven initialized features in the secondframe. The top right image shows the converged features afterseven frames of measurement. The red triangle in (0, 0) positionmeans the start position of the robot and the red circle representsthe current position of the robot. The bottom left image showsthe features after twelve frames. These figures show how thefeatures converge while the robot moves forward. The ellipses inblue are the 1σ bounds of the error covariances. The bottom rightimage shows features in the final frame. Full frames can be seenin the accompanying video, in which the green boxes representthe inliers and red boxes denote the outliers. These boxes showhowwell JCBB removes outliers. The accuracy of the robot positionestimate is compared to that of the dead-reckoning system in

Fig. 12. Robot system used in the experiments.

Fig. 14(a). As shown in the figure, SLAM is more accurate than thedead-reckoning in terms of position estimation. The final positionerror of SLAM is only 1.4 cm, whereas that of the dead-reckoningis 20.5 cm. Fig. 14(b) shows the z-axis robot position estimate. Thefluctuation between−1.5 and 1 cm is quite reasonable consideringthe probable slope of the floor.

5. Conclusions and future work

In this paper, we presented a 3D measurement model inthe camera frame and an undelayed initialization method formonocular EKF SLAM. The 2D pixel coordinates from camerameasurements are converted to 3D points in the camera framewith an assumed arbitrary depth. To account for depth uncertainty,the element of the measurement noise covariance correspondingto the depth of the feature is set to a very large value. This doesnot cause the divergence of the EKF because the 3D measurementmodel has very little linearization error. Compared with otherundelayed initialization methods, the proposed scheme is verysimple to implement and computationally efficient. Furthermore,we presented the efficacy of an EKF SLAM systemwith a low framerate camera plus a dead-reckoning system based on odometryand a low-cost MEMS gyro. This system is so practical as to becommercialized for a vacuum cleaning robot.

Although the proposed method cannot deal with features atinfinity in the framework of the numerical Kalman filter algorithm,


Fig. 13. Upper rows: images captured by a camera, lower rows: estimated robot and landmark positions in the XY -plane (cm).

(a) X–Y plane movement. (b) X–Z plane movement.

Fig. 14. The accuracy of the robot’s position estimation.


it can be applied for most of indoor robot applications. In futureworks, the authors plan to study the feasibility of extending thisalgorithm to tackle features at infinity.

Appendix. Supplementary data

Supplementary material related to this article can be foundonline at doi:10.1016/j.robot.2012.02.002.

References

[1] H. Myung, H.K. Lee, K. Choi, S. Bang, Mobile robot localization with gyroscopeand constrained Kalman filter, International Journal of Control, Automation,and Systems 8 (3) (2010) 667–676.

[2] H. Lee, K. Choi, J. Park, Y. Kim, S. Bang, Improvement of dead reckoning accuracyof a mobile robot by slip detection and compensation using multiple modelapproach, in: Proc. of IROS, 2008, pp. 1140–1147.

[3] A.J. Davison, Real-time simultaneous localization and mapping with a singlecamera, in: Proc. International Conference on Computer Vision, 2003.

[4] T. Bailey, Constrained initialization of bearing-only SLAM, in: Proc. of the IEEEInternational Conference on Robotics and Automation, 2003.

[5] A. Costa, G. Kantor, H. Choset, Bearing-only landmark initialization withunknown data association, in: Proc. of the IEEE International Conference onRobotics & Automation, 2004.

[6] N.M. Kwok, G. Dissanayake, An efficientmultiple-hypothesis filter for bearing-only SLAM, in: IEEE/SRJ International Conference on Intelligent Robotics andSystems, Sendai, Japan, 2004.

[7] J. Sola, A. Monin, M. Devy, T. Lemaire, Undelayed initialization in bearing onlySLAM, in: IEEE/RSJ International Conference on Intelligent Robots and Systems,2005.

[8] E. Eade, T. Drummond, Scalable monocular SLAM, in: Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, 2006.

[9] J.M.M. Montiel, J. Civera, A.J. Davison, Unified inverse depth parameterizationfor monocular SLAM, in: Proc. Robotics Science and Systems, Philadelphia,USA, 2006.

[10] J. Civera, A.J. Davison, J.M.M. Montiel, Inverse depth parameterization formonocular SLAM, IEEE Transactions on Robotics 24 (4) (2008) 932–945.

[11] David H. Titterton, John L. Weston, Strapdown Inertial Navigation Technology,second ed., The Institution of Electrical Engineers, 2004.

[12] J.D. Tardós, J. Neira, P.M. Newman, J.J. Leonard, Robust mapping andlocalization in indoor environments using Sonar data, International Journal ofRobotics Research 21 (3) (2002) 311–330.

[13] L.M. Paz, J.D. Tardós, J. Neira, Divide and conquer: EKF SLAM in O(n), IEEETransactions on Robotics 24 (4) (2008) 1107–1120.

[14] S. Tully, H. Moon, G. Kantor, H. Choset, Iterated filters for bearing-only SLAM,in: IEEE International Conference on Robotics and Automation, 2008.

[15] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedingsof the 4th Alvey Vision Conference, 1988, pp. 147–151.

[16] S. Baker, I. Matthews, Lucas–Kanade 20 years on: a unifying framework Part 1,International Journal of Computer Vision 56 (2) (2004) 221–225.

[17] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, A. Calway, Real-time and robustmonocular SLAM using predictive multi-resolution descriptors, in: 2ndInternational Symposium on Visual Computing, 2006.

[18] D. Nister, O. Naroditsky, J. Bergen, Visual odometry, in: IEEE Conference onComputer Vision and Pattern Recognition, 2004.

[19] J. Neira, J.D. Tardos, Data association in stochastic mapping using the jointcompatibility test, IEEE Transactions onRobotics andAutomation17 (5) (2001)890–897.

[20] L.A. Clemente, A.J. Davision, I.D. Reid, J. Neira, J.D. Tardos, Mapping largeloops with a single hand-held Camera, in: The Robotics: Science and SystemsConference, US, 2007.

Kiwan Choiwas born in Seoul, Korea, in 1975. He receivedthe M.S. degree in Mechanical Design and ProductionEngineering from Seoul National University in 1999. Heis currently a Senior Researcher with the Sensor Group,Advanced Institute of Technology, Samsung ElectronicsInc.

His current research interests include simultaneouslocalization and mapping (SLAM), inertial navigationsystems, and mobile robotics.

Jiyoung Park was born in Kyoungsan, Korea, in 1981. Shereceived the M.S. degree in Electronics & Electrical En-gineering from KAIST in 2006. She is currently a SeniorResearcher with the Sensor Group, Advanced Institute ofTechnology, Samsung Electronics Inc.

Her current research interests include inertial naviga-tion systems and mobile robotics.

Yeon-Ho Kim (M’09) was born in Donghae, Korea, in1966. He received the B.S. degree from Seoul NationalUniversity in Electrical Engineering, the M.S. degree fromthe Korea Advanced Institute of Science and Technology(KAIST) in Electrical and Electronic engineering, both inSeoul, Korea, and the Ph.D. degree from Purdue Universityin Electrical and Computer Engineering, inWest Lafayette,Indiana, USA, in 1989, 1991, and 2005, respectively.Since 2005, he has worked in the Sensor Group atAdvanced Institute of Technology, Samsung ElectronicsInc.

Currently, his research interests include robot vision, especially 3D spatialperception, 3D map building for mobile robots, and human–machine interaction.

Hyoung-Ki Lee was born in Seoul, South Korea, in 1966.He received the Ph.D. degree in Robotics from the KoreaAdvanced Institute of Science and Technology, in 1998. In1999, he worked as a Postdoctoral Fellow in the Mechani-cal Engineering Lab., AIST, in Japan. In 2000, he joined theSamsungAdvanced Institute of Technology and is develop-ing a vacuum cleaning robotwith a localization function asa part of the home service robot project.

His research interests include SLAM (Simultaneous Lo-calization And Mapping) for home robots fusing differentkinds of sensors (inertial sensors, cameras, range finders,

etc.) and developing new low-cost sensor modules such as the MEMS gyro sensormodule and structured light range finder.

http://dx.doi.org/doi:10.1016/j.robot.2012.02.002

Documents

Monocular SLAM with undelayed initialization for an indoor robot