Monocular SLAM with undelayed initialization for an indoor robot

  • Published on

  • View

  • Download


  • ez



    tMobile robots method with computational efficiency in indoor environments. 2012 Elsevier B.V. All rights reserved.

    1. Introduction

    Samsung Electronics has been developing vacuum cleaningrobots for the home environment that generate a cleaning pathwith minimal overlap of the traveled region. To do this, the robotsrequire the capability of localization and map building. For thelocalization capability, our robots are equipped with a camerafor external position measurements. They execute a visual SLAM(Simultaneous Localization and Mapping) algorithm based on anEKF (Extended Kalman Filter). In home environments, the needfor the visual features for a SLAM algorithm does not permitrobots to travel under a bed, for example. Thus, our robot hasa dead-reckoning system of two wheel encoders and a low-costgyro for relative position calculations when the visual features arenot available. A calibrated MEMS gyro gives a good orientationestimate for a few minutes and very accurate relative positioninformation in the frame of Kalman filter [1,2]. Thanks to thisdead-reckoning system, the required frame rate of images forvisual feature tracking can be reduced to 1 or 2 Hz while previousmonocular SLAM with only one camera needs the high frame rateof more than 20 Hz. The sensor combination of a low-cost MEMSgyro and a low frame rate camera leads to a practical system forvacuum cleaning application.

    Davison [3] presented the feasibility of real-time monocularSLAM with a standard low-cost camera using an EKF framework.

    Corresponding author. Tel.: +82 31 280 1761; fax: +82 31 280 9257.E-mail addresses:,

    (H.-K. Lee).

    In his work, the camera obtains only bearing information, meaningthat multiple observations from different points of view mustbe processed to achieve accurate depth information. Davisonused an extra particle filter to estimate the feature depths.However, this method causes a delay in feature initialization.Bailey [4] also proposed a delayed initialization method forbearing-only SLAM. This method used a Gaussian approximationapproach similar to EKF. Costa et al. [5] presented another delayedinitialization method which calculated the initial uncertainty offeature locations. Their method projected measurements fromthe sensor space onto the Cartesian world frame and assumedthat their depth directional uncertainties had large values. Theirmethod was also a delayed method. Such delayed initializationschemes have the drawback that new features, held outsidethe main probabilistic state, are not able to contribute to theestimation of the camera position until finally included in themap.Further, features that retain low parallax over many frames (thosevery far from the camera or close to themotion epipole) are usuallyrejected completely because they never pass the test for inclusion.

    Kwok and Dissanayake [6] presented a multiple hypothesisapproach that outlined their undelayed feature initializationscheme. Each landmark is initialized in the form of multiplehypotheses with different depths distributed along the directionof the bearing measurement. Using subsequent measurements,the depth of the landmark is estimated in an EKF framework andhypothesized landmarks with different depths are tested by asequential probability ratio test. Invalid hypotheses are removedfrom the system state. Sola et al. [7] improved computationalefficiency of this approach by approximating a Gaussian sum filter.Robotics and Autonomous S

    Contents lists available a

    Robotics and Auto

    journal homepage: www.

    Monocular SLAM with undelayed initialiKiwan Choi, Jiyoung Park, Yeon-Ho Kim, Hyoung-KiMicro Systems Lab., Samsung Advanced Institute of Technology, Samsung Electronics Inc.,

    a r t i c l e i n f o

    Article history:Received 13 May 2010Received in revised form3 February 2012Accepted 6 February 2012Available online 13 February 2012

    Keywords:Feature initializationLinearization errorsMono visionSLAM

    a b s t r a c t

    This paper presents a new feaSimultaneous Localization arather than 2D pixel coordinbearing sensor, of which themeasurement are convertedmeasurement noise covarianit is shown that the proposedfor the EKF performance. Fura low-cost gyro, and low framprice of the processor. This svacuum cleaning application0921-8890/$ see front matter 2012 Elsevier B.V. All rights reserved.doi:10.1016/j.robot.2012.02.002ystems 60 (2012) 841851

    t SciVerse ScienceDirect

    nomous Systems

    ation for an indoor robotLee Yongin-Si, Gyeonggi-Do, 449-712, South Korea

    ture initializationmethod formonocular EKF SLAM (Extended Kalman Filterd Mapping) which utilizes a 3D measurement model in the camera frameates in the image plane. The key idea is to regard a camera as a range andrange information contains numerous uncertainties. 2D pixel coordinates ofo 3D points in the camera framewith an assumed depth. The element of thece corresponding to the depth of the feature is set to a very high value. Andmeasurement model has very little linearization error, which can be criticalhermore, this paper proposes an EKF SLAM system that combines odometry,e rate (12 Hz) monocular vision. Low frame rate is crucial for reducing theystem combination is cost-effective enough to be commercialized for a real. Simulations and experimental results show the efficacy of the proposed

  • 842 K. Choi et al. / Robotics and Autono

    Eade andDrummond [8] estimate the inverse depths of featuresrather than their depths. The inverse depth concept has anadvantage in terms of linearity, allowing linear Kalman filteringto operate successfully. While this approach utilizes a FastSLAM-type particle filter, Montiel et al. [9,10] applied this concept toEKF SLAM. Using an inverse depth entry in the state vector anda modified measurement model allows unified modeling andprocessing within the standard EKF without a special initializationprocess. Measurement equations have little linearization errorfor distant features, and the estimation uncertainty is accuratelymodeled as the Gaussian in inverse depth. A feature is representedby the state vector of six dimensions at the first observationthreefor the camera position, two for the direction of the ray and one forthe reciprocal depth. This is an over-parameterization comparedto three dimensions for Euclidean XYZ coding and hence increasesthe computational cost.

    This paper presents a new method for initializing landmarksdirectly from the first measurement without any delay. Ourapproach updates themeasurement in the 3D camera frame. This isin contrast to previous methods that utilize 2D image coordinatesin the image plane. The proposed method was inspired by themeasurement model of 3D cameras (such as the SR4000 by MesaImaging) and by 3D range sensors that measure the range andbearing simultaneously. A camera is considered as a 3D rangesensor butwith a considerable amount of depth uncertainty.Whena feature is initially measured, the depth value of the feature in the3D coordinates is set to an arbitrary value with a sufficiently largecovariance. It should be noted that the arbitrary value has littleeffect on the EKF performance due to its large covariance. Anotheradvantage of the proposed scheme is its high degree of linearity.A more linear measurement equation leads to better performancefrom the EKF. Our approach is much simpler and more efficientthan earlier approaches that usemultiple hypotheses or an inversedepth technique.

    This paper is organized as follows. Section 2 presents our EKFframework with the new measurement update model. Section 3explains the undelayed initialization scheme using our EKFframework and introduces the methods adopted to implementSLAM. Section 4 shows the simulation and experimental results.Section 5 closes the paper with a conclusion.

    2. EKF framework

    2.1. Coordinate frames

    For a clear explanation, we will start with a definition of theCartesian coordinate frames. See Fig. 1. The letters w, b and c denote the world frame, the body frame and the camera frame,respectively. The origin of the camera frame is the focal point of thecamera and that of the body frame is the center of the two drivingwheels. The body frame and the camera frame are in the same rigidbody.

    2.2. State vector definition

    A robot moving on the irregular floor of a home environmentinvolves attitude changes. Thus, a 6-DOFmodel is used despite thefact that the robot moves with 3-DOF most of the time.

    The state vector to represent the position and orientation of therobot in the world frame is given by

    xv =xvp

    = x y z T (1)

    where , and are the roll, pitch, and yaw of the Euler angles

    respectively; xvp is the position vector of the origin of the cameraframe and is the orientation vector of the camera. It should bemous Systems 60 (2012) 841851

    Fig. 1. Coordinate frames.

    noticed that the robot position and orientation are represented bythe camera frame rather than the body frame, which can reducethe level of complexity. Because the pitch angle of our robotis always between /2 and /2, the Euler angles are morecomputationally efficient than the quaternion which is commonlyadopted for visual SLAM algorithms to use a hand-held camera.Because a dead-reckoning system can provide good predictioninformation, a velocity term doesnt appear in the state vector. Thestate vector includes the 3D position vector y of the landmark inthe world frame, as do the conventional SLAM frameworks. Thus,the state vector x and its error covariance matrix P for our EKFframework are given as

    x =


    , P =

    Px Pxy1 PxynPy1 Py1yn

    .... . .

    ...Pynx Pyn

    (2)where n is the number of registered landmarks.

    2.3. Prediction model

    Our dead-reckoning system consists of a yaw gyroscope andtwo wheel encoders. It can calculate a translation in the XY -planeof the body frame and the rotation around the z-axis of the bodyframe. Given that it cannot provide any information about thetranslation in the Z direction and the roll and pitch rotations,they are assumed to remain unchanged. Instead, the uncertainty inthese values will increase as the robotmoves. Feature locations arealso assumed to be stationary. Thus the prediction model at step kcan be given as

    xvp,k|k1 = xvp,k1|k1 + Cwc 1xc (3)

    k|k1 = ,k1|k1 +1 sin tan cos tan0 cos sin 0 sin sec cos sec

    1c (4)

    yi,k|k1 = yi,k1|k1 (5)where Cwc is the transformation matrix from the camera frame tothe world frame [11].1xc and1c are the incremental changes inthe position vector and the orientation vector in the camera framerespectively. They are simply calculated from the incrementalchanges in the body frame given by the dead-reckoning system.Eq. (4) is a simple forward difference representation of the partialdifferential equation which relates the Euler angle to the bodyrates. It is acceptable if1c is small enough and if the roll and pitchangles change within small values. EKF predicts the state vector xand its covariance matrix P, as follows:xv,k|k1 = f(xv,k1|k1,uk) (6)

  • cin the camera frame. As shown in the lower block in Fig. 2, thistechnique fundamentally removes the linearization error inducedby transformation from the camera frame to the image plane. Thisis described in greater detail in the linearization error analysis inthe next section.

    Our measurement and its covariance matrix are

    z =v


    , R =Rv1 0. . .

    0 Rvm

    (10)where m is the number of observed features, vc is the featureposition in the camera frame, and Rv is the error covariancematrixof vc . The observation vector z requires the 3D coordinates of thefeature, referring to Fig. 3. This involves the problem of how toconvert the 2D information of the camera measurement to 3Dinformation.

    In addition, vc can be obtained by

    vc = d ct|ct | (13)where d is an unknown depth. Here, ct is in the camera frame. Ifprior information about the depth is available, d can be estimatedwith it. Otherwise, d is set to an arbitrary value. For example, whenwe observe it for the first or second time, there is no information.Therefore, it can be assumed to be the maximum distance of theworkspace or another proper value. Instead, the covariance matrixRv will be used through careful consideration.

    To determine Rv , an error covariance of the feature in the(ct , cn1, cn2) frame is introduced as

    Rv0 =2ct 0 00 2cn1 0

    0 0 2cn2

    (14)K. Choi et al. / Robotics and Autono

    Fig. 2. Measur

    Px,k|k1 = fxv



    + fu




    Pxy,k|k1 = fxv

    Pxy,k1|k1 (8)


    uk =1xc1c



    and Q is the covariance matrix of u.

    2.4. Measurement model

    In the conventional monocular SLAM, observations are trackedfeature points in the image plane. Naturally, all in the case of theprevious monocular SLAM update the measurements in the imageplane. This means that the measurement vector z is expressed inthe image plane. In contrast, all components of the state vector xare in theworld frame. To perform EKF, the feature positions of thestate vectormust inevitably be transformed into those in the imageplane. The upper block in Fig. 2 shows this process. Particularly,the transformation from the camera frame to the image plane cancause a critical linearization error due to the large uncertainty ofthe depth.

    To overcome this, we propose a different approach inwhich themeasurement is represented in the camera frame rather than in theimage plane. In Fig. 2, y is the 3D position vector of the landmarkHere, we explain how to obtain vc from the observed value of v.The 3 4 camera projection matrix which describes the mappingmous Systems 60 (2012) 841851 843

    ement model.

    Fig. 3. Landmark initialization.

    of a camera from 3D points in the camera frame to 2D points in theimage plane is written as

    C = KC [M | t] (11)whereM is the 3 3 rotation matrix, KC is the camera calibrationmatrix and t is the translation vector which is actually the zerovector because the origin of the camera frame is the focal point.The directional vector ct in Fig. 3 can be obtained by the followingequation:

    ct =ct1ct2ct3

    = (KCM)1


    . (12)where ct , cn1 and cn2 are the standard deviations of the error inthe ct , cn1 and cn2 directions, respectively (Fig. 3). ct , cn1 and cn2

  • 844 K. Choi et al. / Robotics and Autono

    are orthogonal to each other and cn1 and cn2 can be chosen in anydirection.

    The most important point is that the value of ct must be setas large enough to reflect the uncertainty of the depth. The idealvalue of ct is an infinite value, but this is not feasible for numericalcomputation. In our system, a value over 105 cm occasionallycaused problems during the process of inverting a matrix, andSLAM resulted in failure. It is recommended that

    3000 cm ct 105 cm. (15)The values above were obtained empirically. They serve simply toprovide a guideline when selecting a proper value of ct .

    Because the value of ct cannot actually be infinite, there isone problem that must be addressed. The same observed imagescan lead to erroneous results. For example, when a robot doesnot move, it observes nearly identical images and tracks the...


View more >