
Vision-Based 3-D Trajectory Tracking for Unknown Environments

Parvaneh Saeedi, Member, IEEE, Peter D. Lawrence, Member, IEEE, and David G. Lowe, Member, IEEE

Abstract—This paper describes a vision-based system for 3-D localization of a mobile robot in a natural environment. The system includes a mountable head with three on-board charge-coupled device cameras that can be installed on the robot. The main emphasis of this paper is on the ability to estimate the motion of the robot independently from any prior scene knowledge, landmark, or extra sensory devices. Distinctive scene features are identified using a novel algorithm, and their 3-D locations are estimated with high accuracy by a stereo algorithm. Using new two-stage feature tracking and iterative motion estimation in a symbiotic manner, precise motion vectors are obtained. The 3-D positions of scene features and the robot are refined by a Kalman filtering approach with a complete error-propagation modeling scheme. Experimental results show that good tracking and localization can be achieved using the proposed vision system.

Index Terms—Feature-based tracking, robot vision, three-dimensional (3-D) localization, three-dimensional (3-D) trajectory tracking, vision-based tracking, visual motion tracking, visual trajectory tracking.

I. INTRODUCTION

REMOTELY controlled mobile robots have been a subject of interest for many years. They have a wide range of applications in science and in industries such as aerospace, marine, forestry, construction, and mining. A key requirement of such control is the full and precise knowledge of the location and motion of the mobile robot at each moment of time.

This paper describes on-going research at the University of British Columbia (Vancouver, BC, Canada) on the problem of real-time purely vision-based 3-D trajectory estimation for outdoor and unknown environments. The system includes an inexpensive trinocular stereo camera that can be mounted anywhere on the robot. It employs existing scene information and requires no prior map, nor any modification to be made in the scene. Special attention is paid to the problems of reliability in different environmental and imaging conditions. The main assumptions here are that the scene provides enough features for matching, and that most of the scene objects are static. Moreover, it is assumed that the velocity of the robot is limited in such a way that there is some overlap between each two consecutive frames.

Manuscript received August 24, 2004; revised February 21, 2005. This paper was recommended for publication by Associate Editor F. Chaumette and Editor I. Walker upon evaluation of the reviewers' comments. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, and in part by the Canadian IRIS/PRECARN Network of Centers of Excellence.

P. Saeedi is with MacDonald Dettwiler and Associates, Ltd., Richmond, BC V6V 2J3, Canada (e-mail: [email protected]).

P. D. Lawrence is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]).

D. G. Lowe is with the Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]).

Digital Object Identifier 10.1109/TRO.2005.858856

The system is mainly designed for use in autonomous navigation in natural environments, where a map or prior information about the scene is either impossible or impractical to acquire.

A. Previous Work

In visual motion and trajectory tracking, the relative motion between objects in a scene and the camera is determined through the apparent motion of objects in a sequence of images.

One class of visual trajectory-tracking methods, the motion-based approaches, detects motion through optical-flow tracking and motion-energy estimation. These methods operate by extracting the velocity field and calculating the temporal derivatives of images. They are fast; however, they cannot be used where the camera motion is more than a few pixels. Moreover, they are subject to noise, leading to imprecise values, and often the pixel motion is detected but not quantified [1], [2].

Another class of visual trajectory-tracking methods, the feature-based approaches, recognizes an object or objects (landmarks or scene structures) and extracts their positions in successive frames. The problem of recognition-based camera localization can be divided into two general domains, as follows.

1) Landmark-Based Methods: Motion tracking in these methods is performed by detecting landmarks, followed by camera-position estimation based on triangulation. These methods employ either predesigned landmarks that must be placed at different but known locations in the environment, or they automatically extract naturally occurring landmarks via a local distinctiveness criterion from the environment during a learning phase. Such systems usually require an a priori map of the environment. For example, Sim and Dudek [3] used regions of the scene images with a high number of edges as natural landmarks. MINERVA [4] is a tour-guide robot that uses camera mosaics of the ceiling along with several other sensor readings for the localization task. The main advantage of landmark-based methods is that they have a bounded cumulative error. However, they require some knowledge of the geometric model of the environment, either built into the system in advance, or acquired using sensory information during movement, the learning phase, or sometimes a combination of both. This requirement seriously limits the approach's capability in unknown environments. More examples of landmark-based methods can be found in [5]–[7].

2) Natural Feature-Based Methods: Natural feature-based approaches track the projection of preliminary features of a scene in a sequence of images. They find the trajectory and motion of the robot by tracking and finding relative changes in the position of these features.


The type of feature is highly dependent on the working environment that the system is designed for [8]. For instance, the centroid and diameter of circles are used by Harrell et al. [9] for a fruit-tracking robot for harvesting. Rives and Borrelly [10] employ edge features to track pipes with an underwater robot. The road-following vehicle of Dickmanns et al. [11] is also based on edge tracking. The main advantage of using local features is that they correspond to specific physical features of the observed objects, and once these are correctly located and matched, they provide very accurate information concerning the relative position between camera and scene. Also, in systems based on landmarks or models, it is possible that no landmark is visible, so the motion estimation cannot be accurate for some percentage of the time, while estimations based on scene features are potentially less likely to fail, due to the large number of features that can be available from any point of view. The accuracy of these methods, however, is highly dependent on the accuracy of the features. Even a small amount of positional uncertainty can eventually result in a significant trajectory drift. Due to limitations of processing and sensory technologies, early work using natural scene features was limited to 2-D estimations [12], [13]. Later attempts were directed toward 3-D tracking using monocular images [14], [15]. These methods had poor overall estimation, limited motion with small range tolerance, and large long-term cumulative error. Recently, however, more accurate systems have been developed using monocular camera systems [16], [17]. The use of multiple cameras, stereoscopy, and multiple sensor fusion provided new tools for vision-based tracking methods [18], [19]. Jung and Lacroix [20] present a system for high-resolution terrain mapping using stereo images and naturally occurring terrain feature points. Recently, Se [21] introduced an indoor system using scale-invariant features, observed by a stereo camera system, and combining the readings of an odometer with those of the vision system. Se's use of an odometer has the advantage that it helps to reduce the search space for finding feature-match correspondences. It has the disadvantage, though, that any slippage would increase the error in the position estimate and enlarge the search space. Since the proposed work was ultimately intended for outdoor applications, we anticipated considerable wheel slippage. In addition, outdoor environments have a large number of corners due to foliage, for example, compared with most indoor environments. Not only can outdoor environments have more corner features, but a significant number of the corners can be moving (e.g., leaves blowing in the wind), a second issue not addressed in the paper by Se et al.

Also recently, Olson et al. [44] suggested an approach to navigation that separated translational and rotation estimates. The translation estimates were determined from a vision system, and it was proposed that the rotation be obtained from some form of orientation sensor, since the error in orientation estimates with vision alone grew rapidly. Since various means of orientation sensing are also susceptible to vibration induced by rough terrain, and based upon previous work we had done with narrow field-of-view (FOV) cameras, which yielded problems similar to those experienced by Olson et al., we selected a wider angle of view for the camera (a 104° FOV), which better separates rotation from translation (e.g., when image features in a narrow-angle camera translate to the left, it is hard to estimate whether that is due to a translation to the right or a rotation to the right). Also not addressed in the work of Olson et al. are large numbers of foliage corners and their dynamic motion.

Unlike either Se et al. or Olson et al., the approach of this paper carries out the 3-D reconstruction of feature points in space by interpolation of the warped images; this improved the accuracy of the estimated motion by about 8% over doing the interpolation in the unwarped space, and leads to low errors in both translation and rotation in a complex outdoor scene.

B. Objective

The design presented in this paper is an exploration of the relevant issues in creating a real-time on-board motion-tracking system for a natural environment using an active camera. The system is designed with the following assumptions.

• The scene includes mostly static objects. If there are a few moving objects, the system is able to rely on static-object information, while information from moving objects can be discarded as statistical outliers.

• The camera characteristics are known. In particular, the focal length and the baseline separation of the stereo cameras are assumed to be known.

• The motion of the robot is assumed to be limited in acceleration. This allows the match-searching techniques to work on a predictable range of possible matches.

• The working environment is not a uniform scene, and it includes a number of objects and textures.

The primary novelty of this paper is a methodology for obtaining camera trajectories outdoors in the presence of possibly moving scene features, without the need for odometers or sensors other than vision.

C. Paper Outline

The basis of a novel binary corner detector, which was developed for this work, is explained in Section II. Section III describes the approach to the 3-D world reconstruction problem, in which the positional uncertainty resulting from the lens-distortion removal process is minimized. Section IV addresses a two-stage approach for tracking world features that improves the accuracy by means of more accurate match correspondences and a lower number of outliers. The 3-D motion estimation is then described in Section V. Section VI presents the error modeling for the robot and the features. Finally, the experimental results are reported in Section VII. Conclusions and future work are presented in Section VIII.

II. BINARY FEATURE DETECTION

An important requirement of a motion-tracking system is fast performance. Processing all the pixels of an image, of which only a small number carry information about the camera's motion, may not be possible within the real-time requirements of such systems. Therefore, special attention is paid to selecting regions with higher information content.


A. Features

Deciding on the feature type is critical, and depends greatly on the type of input sensors used. Common features that are generally used include the following [22]:

• raw pixel values, i.e., the intensities;
• edges, surfaces, and contours that correspond to real 3-D structures in the scene;
• salient features, such as corners, line intersections, and points of locally maximum curvature on contour lines;
• statistical features, such as moment invariance, energy, entropy, and color histograms.

Choosing simple features within the scene increases the reliability of the solution for motion tracking, and enables the system to find answers to problems most of the time, unless the scene is very uniform. In the search for a feature type that suits our application (a natural, unstructured environment with varying lighting conditions), corners were chosen, because they are discrete and partially invariant to scale and rotational changes.

B. Binary Corner Detection (BCD)

In our previous work [23], the Harris corner detector [24] was used. The Harris corner detector involves several Gaussian smoothing processes that not only may displace a corner from its real position, but also make the approach computationally expensive. A corner detector with higher positional accuracy, SUSAN, was developed in [25]. A faster corner detector with more precise localization can lead to a more accurate and/or faster motion estimation, since the changes between consecutive frames are smaller. While the SUSAN corner detector provides a more precise corner location, it is computationally more expensive. In order to take advantage of the positional accuracy of the SUSAN corner detector, a novel binary corner detector was developed [26]. This corner detector defines corners, in a manner similar to SUSAN, using geometrical descriptions. Its main emphasis, however, is on exploiting binary images and substituting arithmetic operations with logical ones.

To generate a binary image, first a Gaussian smoothing is applied to the original image. A σ of 0.8 is chosen for the smoothing process. By using this value for σ, the 1-D kernel of the filter can be approximated with coefficients that are powers of two, so that every four multiplications can be substituted by four shift operations. The Laplacian is then approximated at each point (r, c) of the smoothed intensity image by

∇²I(r, c) ≈ I(r+1, c) + I(r−1, c) + I(r, c+1) + I(r, c−1) − 4 I(r, c).     (1)

I(r, c) represents the image intensity value at row r and column c. The binary image B is then generated from the sign of the Laplacian value at each point

B(r, c) = 1 if ∇²I(r, c) ≥ 0, and B(r, c) = 0 otherwise.     (2)

Next, a circular mask is placed on each point of the binary image in the same manner as in the SUSAN corner detector. The binary value of each point (r, c) inside the mask is compared with that of the central point (r₀, c₀)

c(r, c) = 1 if B(r, c) = B(r₀, c₀), and c(r, c) = 0 if B(r, c) ≠ B(r₀, c₀).     (3)

B(r, c) represents the binary image value at location (r, c). Now a total running sum n is generated from the output of c

n(r₀, c₀) = Σ_(r,c)∈mask c(r, c).     (4)

n represents the area of the mask where the sign of the Laplacian of the image is the same as that of the central point. For each pixel to be considered a potential corner, the value of n must be smaller than at least half the size of the mask in pixels. This value is shown by g in the corner response

R(r₀, c₀) = g − n(r₀, c₀) if n(r₀, c₀) < g, and R(r₀, c₀) = 0 otherwise.     (5)

Similar to SUSAN, for each candidate with R(r₀, c₀) > 0, a center of gravity (centroid) G is computed

G(r₀, c₀) = (r̄, c̄)     (6)

where

r̄ = (1/n) Σ_(r,c)∈mask r · c(r, c),   c̄ = (1/n) Σ_(r,c)∈mask c · c(r, c).     (7)

The center of gravity provides the corner direction, as well as a condition for eliminating points with random distributions. Randomly distributed binary patches tend to have a center of gravity fairly close to the center of the patch. Therefore, all points with close centers of gravity are filtered out of the remaining process

|G(r₀, c₀) − (r₀, c₀)| > d_min.     (8)

It was found that the two conditions in (5) and (8), proposed in [25], do not (by themselves) provide enough stability for corner declaration. Therefore, in this paper, a new inspection is performed by examining the intensity variation along the centroid direction. First, the vector that connects the point under inspection to its center of gravity is computed. Next, this vector is extended to pass through the point under inspection, and the intensity variation along it is examined. If the intensity variation is small, then the point is not a corner; otherwise, it is announced as a corner. That is, a corner is detected if

|I(r₀ + Δr, c₀ + Δc) − I(r₀, c₀)| > t     (9)

where t represents the brightness variation threshold, and (Δr, Δc) is the extension of the centroid vector beyond the point under inspection.

Fig. 1 displays the output of the proposed method on one of the sample outdoor images. The binary corner detector was compared with the Harris and SUSAN corner detectors in the same manner as introduced in [27]. Harris exceeds the BCD repeatability rate by 20%. In scenes like Fig. 1, with a large number of features, this loss does not affect overall performance. However, BCD performs 1.6 times faster than Harris and 7.2 times faster than SUSAN, with a running time of 23.293 ms on a 1.14-GHz AMD Athlon processor.
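As a concrete illustration of (1)–(5), the following sketch implements the binary test on a smoothed gray-scale image. It is a minimal reading of the text, not the authors' implementation: the mask radius, the image container, and the border handling are assumptions, and the centroid and intensity-variation checks of (6)–(9) are omitted.

```cpp
// Minimal sketch of the binary corner test in (1)-(5).
// Assumptions (not from the paper): a circular mask of radius 2, a row-major
// 8-bit gray-scale image that has already been Gaussian smoothed, and simple
// border rejection.
#include <cstdint>
#include <vector>

struct Image {
    int rows, cols;
    std::vector<uint8_t> px;                                // row-major gray levels
    int at(int r, int c) const { return px[static_cast<size_t>(r) * cols + c]; }
};

// (1)-(2): the sign of the discrete Laplacian gives the binary image.
std::vector<uint8_t> binarize(const Image& I) {
    std::vector<uint8_t> B(static_cast<size_t>(I.rows) * I.cols, 0);
    for (int r = 1; r + 1 < I.rows; ++r)
        for (int c = 1; c + 1 < I.cols; ++c) {
            int lap = I.at(r + 1, c) + I.at(r - 1, c) +
                      I.at(r, c + 1) + I.at(r, c - 1) - 4 * I.at(r, c);
            B[static_cast<size_t>(r) * I.cols + c] = (lap >= 0) ? 1 : 0;
        }
    return B;
}

// (3)-(5): count the mask pixels whose binary value matches the central pixel;
// a count below half the mask area (the threshold g) yields a corner candidate.
bool isCornerCandidate(const std::vector<uint8_t>& B, int rows, int cols,
                       int r0, int c0, int radius = 2) {
    int n = 0, maskArea = 0;
    for (int dr = -radius; dr <= radius; ++dr)
        for (int dc = -radius; dc <= radius; ++dc) {
            if (dr * dr + dc * dc > radius * radius) continue;   // circular mask
            int r = r0 + dr, c = c0 + dc;
            if (r < 0 || c < 0 || r >= rows || c >= cols) return false;
            ++maskArea;
            if (B[static_cast<size_t>(r) * cols + c] ==
                B[static_cast<size_t>(r0) * cols + c0]) ++n;     // same Laplacian sign
        }
    int g = maskArea / 2;                                        // geometric threshold
    return n < g;                                                // corner response R = g - n > 0
}
```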

III. 3-D WORLD RECONSTRUCTION

Systems with no prior information about a scene require the 3-D positions of points in the scene to be determined. This section describes the problem of optical projection, 3-D world position reconstruction for feature points, and considerations for increasing the system accuracy.


Fig. 1. Corners are found using BCD.


A. Camera Model

A camera can be simply modeled using the classic pinhole model. This leads to perspective projection equations for calculating where on an image plane a point in space will appear. The projective transformations that project a world point P = (X, Y, Z) to its image point (u, v) are

u = f_u (X / Z)   and   v = f_v (Y / Z)     (10)

where f_u and f_v represent the horizontal and vertical focal lengths of the camera. Since a camera exhibits nonideal behavior, precise measurements from an image that are necessary in the 3-D reconstruction process require a more sophisticated camera model than the ideal one.
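A small sketch of the projection in (10), together with the stereo triangulation it implies, is given below. The image-center offset (u0, v0), the struct layout, and the assumption of a horizontally aligned pair with baseline b are illustrative additions, not the paper's exact formulation.

```cpp
// Sketch of the pinhole projection in (10) and the stereo triangulation it
// implies. The image-center offset (u0, v0) and the horizontally aligned
// pair with baseline b are assumptions made for the example.
struct Point3 { double X, Y, Z; };
struct Pixel  { double u, v; };

struct PinholeCamera {
    double fu, fv;   // horizontal and vertical focal lengths (pixels)
    double u0, v0;   // image center

    // Equation (10), with the image-center offset written out explicitly.
    Pixel project(const Point3& p) const {
        return { u0 + fu * p.X / p.Z, v0 + fv * p.Y / p.Z };
    }

    // Inverse mapping for a horizontally aligned stereo pair: the disparity d
    // between the reference and offset images gives depth Z = fu * b / d.
    Point3 triangulate(const Pixel& ref, double d, double b) const {
        double Z = fu * b / d;
        return { (ref.u - u0) * Z / fu, (ref.v - v0) * Z / fv, Z };
    }
};
```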

B. Camera Calibration

A camera model consists of extrinsic and intrinsic parameters. Some of the camera intrinsic parameters are f_u and f_v (horizontal and vertical focal lengths), and u_0 and v_0 (image centers). The camera model transforms real-world coordinates into their ideal image coordinates, and vice versa.

Using the camera intrinsic parameters, a lookup table is generated that transfers each pixel of the distorted image onto its location in the corresponding undistorted image. Fig. 2(a) shows an image, acquired by our camera system, that has a 104° FOV. In this image, the distortion effect is most noticeable on the curved bookshelves. Fig. 2(b) shows the same image after removal of the lens distortion. It can be clearly seen that the curved shelves in the original image are now straightened.

C. Stereo Correspondence Matching Rules

The camera system, Digiclops [28], includes three stereo cameras that are vertically and horizontally aligned. The displacement between the reference camera and each of the horizontal and vertical cameras is 10 cm. To take full advantage of the existing features in the three stereo images, the following constraints are employed in the stereo matching process.

Fig. 2. (a) Warped image. (b) Corresponding cut unwarped image.

• Feature stability constraint I: For each feature in the reference image that is located in the common region among the three stereo images, there should be two correspondences; otherwise, the feature gets rejected. The 3-D locations of the features that pass this constraint are estimated by the multiple-baseline method [29]. The multiple-baseline method uses the two (or more) sets of stereo images to obtain more precise distance estimates, and to eliminate false match correspondences that are not persistent in the two sets of stereo images.

• Feature stability constraint II: Features located in the areas common to only the reference and horizontal, or only the reference and vertical images, are reconstructed if they pass the validity check of Fua's method [30]. The validity check adds a consistency test via which false match correspondences can be identified and eliminated from the stereo process.

• Disparity constraint: The disparities of each feature from the vertical and horizontal images to the reference image have to be positive, similar (with a maximum difference of 1 pixel), and smaller than 90 pixels. This constraint allows the construction of points as close as 12.5 cm from the camera for the existing camera-system configuration.

• Epipolar constraint: The vertical disparity between the matched features in the horizontal and reference images must be within one pixel. The same rule applies to the horizontal disparity for matches between the vertical and reference images.

• Match uniqueness constraint: If a feature has more than one match candidate that satisfies all the above conditions, it is considered ambiguous, and gets omitted from the rest of the process.



The similarities between each feature and its corresponding candidates are measured by employing the normalized mean-squared differences metric [31]. After matching the features, a subset of features from the reference image is retained, and for each one, its 3-D location with respect to the current camera coordinate system is obtained using (10).
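The disparity and epipolar constraints above amount to a few numeric checks per candidate triple. The sketch below shows one way to express them; the struct and the camera-naming convention are assumptions, while the 90-pixel and 1-pixel limits come from the text.

```cpp
// Illustrative check of the disparity and epipolar constraints, assuming a
// reference camera plus horizontally and vertically displaced cameras, as in
// the Digiclops arrangement described in the text.
#include <cmath>

struct CandidateTriple {
    double uRef, vRef;     // feature in the reference image
    double uHorz, vHorz;   // candidate in the horizontally displaced image
    double uVert, vVert;   // candidate in the vertically displaced image
};

bool passesStereoConstraints(const CandidateTriple& m) {
    double dHorz = m.uRef - m.uHorz;            // disparity of the horizontal pair
    double dVert = m.vRef - m.vVert;            // disparity of the vertical pair

    // Disparity constraint: positive, similar (within 1 px), and below 90 px.
    if (dHorz <= 0.0 || dVert <= 0.0) return false;
    if (std::fabs(dHorz - dVert) > 1.0) return false;
    if (dHorz > 90.0 || dVert > 90.0) return false;

    // Epipolar constraint: off-axis disparity within one pixel for each pair.
    if (std::fabs(m.vRef - m.vHorz) > 1.0) return false;
    if (std::fabs(m.uRef - m.uVert) > 1.0) return false;

    return true;
}
```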

D. Depth Construction With Higher Accuracy

One necessary step in the stereo process is the unwarping process, in which the images are corrected for the radial lens distortion. During a conventional unwarping process, the following occurs:

1) the image coordinates of each pixel, integer values, are transformed into the corresponding undistorted image coordinates, floating-point values, Fig. 3(2);

2) an interpolation scheme is used to reconstruct the image values at an integer, equally spaced mesh grid, Fig. 3(3);

3) the resultant image is cut to the size of the original raw image, Fig. 3(4).

Each one of these steps, although necessary, can add artifacts that increase the uncertainty of the depth-construction process. For instance, for our camera system, 28.8% of the unwarped image pixels would be merely guessed at by the interpolation of neighboring pixels. This can create considerable distortion of the shape of smaller objects located near the sides, and increase the inaccuracy of the 3-D world reconstruction and of the overall system.

To minimize the error associated with the radial lens distortion-removal process, instead of using the conventional method, we employed a partial unwarping process. This means that we find the feature points in the raw (warped) images first. The image coordinates of each feature are then unwarped using unwarping lookup tables. For constructing the 3-D positions of the features, the unwarped locations are used. However, when later measuring the similarity of features, raw locations in the warped image content are used. Performing the partial unwarping procedure for only a small percentage of each image also improves the processing time of the system.

The 3-D reconstruction of the feature points can be summarized as having the following steps:

1) generation of two projection lookup tables, using the intrinsic camera parameters, for raw-image projection onto the unwarped image, and vice versa;
2) detection of features in raw images;
3) disparity measurement in raw images using the projection lookup tables;
4) 3-D reconstruction of image features using the constraints in (10).
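The sketch below illustrates the partial unwarping idea: only the coordinates of detected features are pushed through the distortion-removal lookup table, while the raw image content is left untouched for the later similarity measurements. The lookup-table layout and the function names are assumptions made for the example.

```cpp
// Sketch of partial unwarping: only the coordinates of detected features are
// passed through the distortion-removal lookup table; the raw image itself is
// never resampled.
#include <utility>
#include <vector>

struct Pixel { float u, v; };

// One entry per raw pixel: its undistorted, floating-point coordinates.
struct UnwarpLUT {
    int cols;
    std::vector<Pixel> table;                   // row-major, size rows * cols
    Pixel lookup(int r, int c) const { return table[static_cast<size_t>(r) * cols + c]; }
};

// Features are detected on the raw (warped) image; only their coordinates are
// unwarped for the 3-D reconstruction, while the similarity measurements later
// use the raw image content directly.
std::vector<Pixel> unwarpFeatures(const UnwarpLUT& lut,
                                  const std::vector<std::pair<int, int>>& rawFeatures) {
    std::vector<Pixel> out;
    out.reserve(rawFeatures.size());
    for (const auto& rc : rawFeatures)
        out.push_back(lut.lookup(rc.first, rc.second));
    return out;
}
```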

IV. FEATURE TRACKING

The measurement of local displacement between the 2-D projections of similar features in consecutive image frames is the basis for measuring the 3-D camera motion in the world coordinate system. Therefore, world and image features must be tracked from one frame (at time t) to the next frame (at time t+1).

Fig. 3. Conventional unwarping process. (1) The raw image. (2) The raw image right after the calibration. (3) The calibrated image after the interpolation. (4) The final cut unwarped image.

In order to take advantage of all the information acquired while navigating in the environment, a database is created. This database includes information about all the features seen over the most recent frames. For each feature, the 3-D location in the reference coordinate system and the number of times it has been observed are recorded. Each database entry also holds a 3×3 covariance matrix that represents the uncertainty associated with the 3-D location of that feature. The initial camera frame is used as the reference coordinate system, and all the features are represented in relation to this frame. After the world features are reconstructed using the first frame, the reference world features, as well as the robot's starting position, are initialized. By processing the next frame, in which a new set of world features is created relative to the current robot position, new entries are created in the database. This database is updated as the robot navigates in the environment.

A. Similarity Measurement

In order to measure the similarity of a feature with a set of correspondence candidates, normalized mean-squared differences [31] are employed. Each feature and its candidates are first projected onto their corresponding image planes. The normalized mean-squared differences function (11) is then estimated for each pair. In (11), Ī_t and Ī_{t+1} are average gray levels over the n×n image patches surrounding the feature and the candidate. After evaluation of the similarity metric for all pairs, the best match with the highest similarity is selected.

The highest similarity, as estimated by the cross-correlation measurement, does not, by itself, provide enough assurance of a true match. Since the patch sizes are fairly small, there may be cases where a feature (at time t) and its match correspondence (at time t+1) do not correspond to an identical feature in space. In order to eliminate such falsely matched pairs, a validity check is performed.


Fig. 4. Validity checking reduces the number of false match correspondences. In this figure, feature points are shown as white dots. (a) Matched features in the first frame with no validity check are shown in circles. (b) Matched features in the second frame are shown with arrows that connect their previous positions with their current positions. (c) Matched features in the first frame with validity check are shown in circles. (d) Matched features with validity check in the second frame are shown with arrows that connect their previous positions with their current positions.

In this check, after finding the best match for a feature, the roles of the match and the feature are exchanged. Once again, all the candidates for the match are found in the previous frame (at time t). The similarity metric is evaluated for all candidate pairs, and the most similar pair is chosen. If this pair is exactly the same as the one found before, then the pair is announced as a true match correspondence. Otherwise, the corner under inspection is eliminated from the rest of the process.

A comparison of the number of correct match correspondences with and without the validity check, for two consecutive outdoor images, is shown in Fig. 4. Fig. 4(a) and (b) shows the match correspondences without the validity check. Fig. 4(c) and (d) displays the results of the matching process for the same images in the presence of a validity check.

NMSD = Σ_{i,j} [ (I_t(i,j) − Ī_t) − (I_{t+1}(i,j) − Ī_{t+1}) ]² / √[ Σ_{i,j} (I_t(i,j) − Ī_t)² · Σ_{i,j} (I_{t+1}(i,j) − Ī_{t+1})² ]     (11)


Clearly, the number of false matches is reduced after the validity check.
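The sketch below shows a zero-mean, normalized patch-similarity score in the spirit of (11), together with the mutual-consistency (validity) check just described: a pair is kept only if each member is the other's best match. The patch container and candidate handling are placeholders, not the paper's data structures.

```cpp
// Patch similarity plus the mutual-consistency ("validity") check.
#include <cmath>
#include <cstddef>
#include <vector>

// A square gray-level patch around a feature, stored as a flat vector.
using Patch = std::vector<double>;

// Zero-mean normalized similarity; higher means more similar.
double zeroMeanNormalizedScore(const Patch& a, const Patch& b) {
    double ma = 0, mb = 0;
    for (size_t i = 0; i < a.size(); ++i) { ma += a[i]; mb += b[i]; }
    ma /= a.size(); mb /= b.size();
    double num = 0, sa = 0, sb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        double da = a[i] - ma, db = b[i] - mb;
        num += da * db; sa += da * da; sb += db * db;
    }
    return num / std::sqrt(sa * sb + 1e-12);
}

// Feature i's best candidate j must, in turn, pick feature i as its own best
// match in the previous frame; otherwise the pair is rejected (returns -1).
int bestMatchWithValidityCheck(int i,
                               const std::vector<Patch>& prevPatches,
                               const std::vector<Patch>& currPatches) {
    auto argBest = [&](const Patch& q, const std::vector<Patch>& set) {
        int best = -1; double bestScore = -2.0;
        for (size_t k = 0; k < set.size(); ++k) {
            double s = zeroMeanNormalizedScore(q, set[k]);
            if (s > bestScore) { bestScore = s; best = static_cast<int>(k); }
        }
        return best;
    };
    int j = argBest(prevPatches[i], currPatches);        // forward match
    if (j < 0) return -1;
    int iBack = argBest(currPatches[j], prevPatches);    // reverse match
    return (iBack == i) ? j : -1;                        // keep only mutual best
}
```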

B. Feature Matching

The objective of the feature-matching process is to find and match the correspondences of a feature in the 3-D world on two consecutive image planes (the current and previous frames) of the reference camera. At all times, a copy of the previous image frame is maintained. Therefore, database feature points in the reference global coordinate system are transformed to the last found robot (camera) position. They are then projected onto the previous unwarped image plane using the perspective projection transformation. Using the inverse calibration lookup table, the corresponding locations on the raw image planes are found, if their coordinates fall inside the image boundaries.

With two sets of feature points, one in the previous image frame and one in the current image frame, the goal becomes to establish a one-to-one correspondence between the members of both sets. The matching and tracking process is performed using a two-stage scheme.

1) The position of each feature in the previous frame is used to create a search boundary for corresponding match candidates in the current frame. For this purpose, it is assumed that the motion of features from the previous frame to the current frame does not produce image-projection displacements of more than 70 pixels in any of the four directions; this allows a feature point to move up to 70 pixels between frames. If a feature does not have any correspondences, it cannot be used at this stage, and therefore is ignored until the end of the first stage of tracking. The normalized n×n-pixel cross-correlation with validity check, as explained in Section IV-A, is then evaluated over windows of the search space. Using the established match correspondences between the two frames, the motion of the camera is estimated. Due to the large search window, and therefore the large number of match candidates, some of these match correspondences may be false. In order to eliminate inaccuracy due to faulty matches, the estimated motion is used as an initial guess of the amount and direction of the motion to facilitate a more precise motion estimation in the next stage.

2) Using the found motion vector and the previous robot location, all the database features are transformed into the current camera coordinate system. Regardless of the motion type or the distance of the features from the coordinate center, features with a persistent 3-D location end up in a very close neighborhood of their real matches on the current image plane. Using a small search window, the best match correspondence is found quickly and with higher accuracy. If there is more than one match candidate in the search window, the normalized cross-correlation and the image intensity values in the previous and current frames are used to find the best match correspondence. The new set of correspondences is used to estimate a motion-correction vector that is added to the previously estimated motion vector to provide the final camera motion.

Fig. 5(a) and (b) shows the match correspondences on the two frames for the first stage, and Fig. 5(c) and (d) shows the matches found using the initial motion estimation from the first stage. Not only does the number of false matches decrease when a rough motion estimate is used, but the total number of matches increases dramatically.
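The two-stage scheme can be summarized as the short routine below. The matching and motion-estimation steps are passed in as callbacks because their internals are described elsewhere in the paper; the interfaces and the small second-stage window radius are assumptions, while the 70-pixel first-stage bound comes from the text.

```cpp
// Outline of the two-stage matching and tracking scheme.
#include <functional>
#include <vector>

struct Pose6 { double t[3] = {0, 0, 0}; double r[3] = {0, 0, 0}; };
struct Correspondence { /* matched database feature and image position */ };

// Stand-ins for the windowed matching with validity check (Section IV-A) and
// the least-squares motion estimation (Section V); both interfaces are assumed.
using MatchFn  = std::function<std::vector<Correspondence>(const Pose6& predictedMotion,
                                                           int searchRadiusPx)>;
using MotionFn = std::function<Pose6(const std::vector<Correspondence>&)>;

Pose6 trackTwoStage(const MatchFn& findMatches, const MotionFn& estimateMotion) {
    // Stage 1: wide search (up to 70 px displacement) around positions predicted
    // with zero motion; gives a coarse estimate that may contain outliers.
    Pose6 coarse = estimateMotion(findMatches(Pose6{}, 70));

    // Stage 2: re-project the database features with the coarse motion. True
    // matches now fall close to their predictions, so a small window suffices,
    // and the correction found here is added to the coarse estimate.
    Pose6 corr = estimateMotion(findMatches(coarse, 5));

    Pose6 total;
    for (int i = 0; i < 3; ++i) {
        total.t[i] = coarse.t[i] + corr.t[i];
        total.r[i] = coarse.r[i] + corr.r[i];
    }
    return total;
}
```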

V. MOTION ESTIMATION

Given a set of corresponding features between a pair of consecutive images, motion estimation becomes the problem of optimizing a 3-D transformation that projects the world corners from the previous image coordinate system onto the next image. With the assumption of local linearity, the problem of 3-D motion estimation is a promising candidate for the application of Newton's minimization method.

A. Least-Squares (LS) Minimization

Rather than solving directly for the camera motion with six degrees of freedom (DOFs), the iterative Newton method is used to estimate a correction vector δ, with three rotational and three translational components, that, if subtracted from the current estimate, results in the new estimate [32]. If x_i is the vector of motion parameters at iteration i, then

x_{i+1} = x_i − δ_i.     (12)

Given the vector ε of error measurements between the projections of 3-D world features on two consecutive image frames, a vector δ is computed that minimizes this error [33]

J δ = ε.     (13)

The effect of each correction-vector element δ_j on the error measurement ε_i is defined by

J_{ij} = ∂ε_i / ∂δ_j.     (14)

Here, ε is the error vector between the predicted locations of the objects and the actual positions of the matches found in image coordinates; N represents the number of matched features. Since (13) is usually overdetermined, δ is estimated so as to minimize the error residual [34]

δ = (JᵀJ)⁻¹ Jᵀ ε.     (15)

δ includes the rotational and translational vector components of the motion.

B. Setting Up the Equations

With the assumption that the rotational components of the motion vector are small, the projection (u, v) of the transformed point P = (x, y, z) in space on the image plane can be approximated by

u ≈ f (x + Δx + φ_y z − φ_z y) / (z + Δz + φ_x y − φ_y x)
v ≈ f (y + Δy + φ_z x − φ_x z) / (z + Δz + φ_x y − φ_y x)     (16)

where Δx, Δy, and Δz are the incremental translations, φ_x, φ_y, and φ_z are the incremental rotations about the three axes, and f is the focal length of the camera. The partial derivatives in the two rows of the Jacobian matrix in (13) that correspond to each matched feature are calculated as shown in [35]. After setting up (15), it is solved iteratively until a stable solution is obtained.
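To make (13)–(15) concrete, the sketch below stacks the 2×6 image Jacobian of the small-rotation projection model (16) for every matched feature and solves the normal equations for the correction vector. Eigen is used here only as a convenient linear-algebra library and is not part of the original system; the derivative expressions follow from differentiating (16) at zero motion.

```cpp
// Least-squares motion correction, (13)-(15), with the Jacobian of (16).
#include <Eigen/Dense>
#include <vector>

struct MatchedFeature {
    Eigen::Vector3d P;     // 3-D point expressed in the previous camera frame
    double uObs, vObs;     // observed image position in the current frame
};

// Returns the correction [dx, dy, dz, phi_x, phi_y, phi_z].
Eigen::VectorXd motionCorrection(const std::vector<MatchedFeature>& matches, double f) {
    const int n = static_cast<int>(matches.size());
    Eigen::MatrixXd J(2 * n, 6);
    Eigen::VectorXd eps(2 * n);

    for (int i = 0; i < n; ++i) {
        const double x = matches[i].P.x(), y = matches[i].P.y(), z = matches[i].P.z();
        const double u = f * x / z, v = f * y / z;       // predicted projection

        eps(2 * i)     = matches[i].uObs - u;            // image-plane residuals
        eps(2 * i + 1) = matches[i].vObs - v;

        // Derivatives of u and v in (16) with respect to
        // [dx, dy, dz, phi_x, phi_y, phi_z], evaluated at zero motion.
        J.row(2 * i)     << f / z, 0.0, -f * x / (z * z),
                            -f * x * y / (z * z), f * (z * z + x * x) / (z * z), -f * y / z;
        J.row(2 * i + 1) << 0.0, f / z, -f * y / (z * z),
                            -f * (z * z + y * y) / (z * z), f * x * y / (z * z), f * x / z;
    }
    // delta = (J^T J)^{-1} J^T eps, solved with a stable decomposition.
    return (J.transpose() * J).ldlt().solve(J.transpose() * eps);
}
```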


Fig. 5. Two-stage tracking improves the accuracy of the estimated motion. Motion vectors for the matched features are shown in the previous and current images. (a) and (b) show these vectors in the first stage. The white arrows show correct matches, and the black arrows show false match correspondences. (c) and (d) show the same images with their match correspondences, using the estimated motion of the first stage. Not only does the number of correct match correspondences increase, but the number of false match correspondences is decreased.

C. Implementation Consideration

In order to minimize the effect of faulty matches or of dynamic scene features on the final estimated motion, the following considerations are taken into account during implementation.

• The estimated motion is allowed to converge to a more stable state by running the first three consecutive iterations.

• At the end of each iteration, the residual error of each matched pair in both image coordinate directions is computed.

• From the fourth iteration on, the motion is refined by elimination of outliers. For a feature to be considered an outlier, it must have a large residual error. On each iteration, at most 10% of the features, those with residual errors higher than 0.5 pixels, are discarded as outliers.

• The minimization is repeated for up to 10 iterations if the change in the variance of the error-residual vector is more than 10%. During this process, the estimation moves gradually toward the best solution.

• The minimization process stops if the number of inliers drops to 40 or fewer matches.

It is important to note that if the number of features from dynamic objects is greater than that of the static features, the robustness of the system could be compromised, and a false trajectory estimation will result.
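One way to realize this iteration schedule is sketched below: three settling iterations, then trimming of at most 10% of the matches whose residuals exceed 0.5 pixels, stopping after 10 iterations, when the residual variance changes by less than 10%, or when 40 or fewer inliers remain. The solver callback stands in for the least-squares step of Section V-A and its interface is an assumption.

```cpp
// Outlier-rejection schedule wrapped around a single least-squares solve.
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

struct IterationResult { std::vector<double> residuals; };  // one residual per match

void refineMotion(const std::function<IterationResult(const std::vector<int>&)>& solveOnce,
                  std::vector<int> inliers) {
    double prevVar = -1.0;
    for (int iter = 0; iter < 10 && inliers.size() > 40; ++iter) {
        IterationResult res = solveOnce(inliers);

        // Residual variance, used as the convergence test (<10% change stops).
        double mean = 0, var = 0;
        for (double r : res.residuals) mean += r;
        mean /= res.residuals.size();
        for (double r : res.residuals) var += (r - mean) * (r - mean);
        var /= res.residuals.size();
        if (prevVar >= 0 && std::fabs(var - prevVar) < 0.1 * prevVar) break;
        prevVar = var;

        if (iter < 3) continue;                  // first three iterations: no trimming

        // Drop at most 10% of the matches, the worst ones above 0.5 px.
        std::vector<size_t> order(inliers.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return res.residuals[a] > res.residuals[b]; });
        size_t budget = inliers.size() / 10;
        std::vector<bool> drop(inliers.size(), false);
        for (size_t k = 0; k < order.size() && budget > 0; ++k)
            if (res.residuals[order[k]] > 0.5) { drop[order[k]] = true; --budget; }

        std::vector<int> kept;
        for (size_t i = 0; i < inliers.size(); ++i)
            if (!drop[i]) kept.push_back(inliers[i]);
        inliers.swap(kept);
    }
}
```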

D. Motion-Estimation Results

The results of an entire motion-estimation cycle, for a displacement of about 5 cm in the outdoor environment, are presented in Table I.


TABLE I
ITERATION RESULTS ALONG WITH THE ERROR RESIDUAL IN PIXELS FOR ONE MOTION ESTIMATION

As shown in this table, the error is reduced in a consistent manner, and the final error residual is less than a pixel. Generally, the error residual is only a fraction of a pixel.

E. Feature Update

After the motion parameters are found, the database information must be updated. This is performed based on the prediction and observation of each feature and on the robot's motion.

• The position and uncertainty matrix of features that are expected to be seen and have corresponding matches are updated. Their count increases by one.

• Features that are expected to be seen but have no unique matches are updated. The uncertainty of these features increases at a constant rate of 10%, and their count decreases by one.

• New features, with no correspondence in the reference world, are initialized in the database, and their count is set to one.

F. Feature Retirement

After updating the global feature points, an investigation is carried out to eliminate those features that have not been seen over some distance. For this purpose, feature points whose count indicates that they have not been observed for at least five consecutive frames (which for our system corresponds to a distance of 50 cm) are eliminated. This condition removes some of the remaining unstable features that falsely pass the stability and disparity conditions in the stereo matching process, in spite of their poor quality.

VI. POSITION-ERROR MODELING

The noise associated with an image is considered to be white and Gaussian [36]. The disparity measurement using such an image inherits this noise. Since the 3-D estimation of each feature in space is a linear function of the inverse disparity, a Kalman filter estimator seems to be a promising model for reducing the system error associated with the existing noise [37]. Therefore, a Kalman filtering scheme is incorporated into the system that uses the many measurements of a feature over time and smooths out the effects of noise in the feature positions, as well as in the robot's trajectory.

For each feature, an individual Kalman filter is generated. Each Kalman filter includes a mean position vector and a 3×3 covariance matrix that, respectively, represent the mean position and the positional uncertainty associated with that feature in space. A Kalman filter is also created for the robot-mounted camera; it includes the position of the camera and the uncertainty associated with it.

A. Camera-Position Uncertainty

The robot's position is updated using the estimated motion vector found by the LS minimization. For this purpose, a simple Kalman filter model is employed. The assumption made for this model is that the robot moves with a constant velocity. The following equations represent the Kalman filter model for this application:

x_{k+1} = A x_k + w_k     (17)

z_k = H x_k + v_k     (18)

where x_k represents the state variable at frame k, consisting of the camera pose p_k = (X, Y, Z, φ_x, φ_y, φ_z) and its velocity

x_k = [p_k, ṗ_k]ᵀ     (19)

and A is a constant matrix defined by

A = [ I  T·I ]
    [ 0   I  ].     (20)

In matrix A, T is the sampling rate and is set to one. w_k and v_k are, respectively, the (unknown) system and observation noises. H is a 6×12 matrix and is defined by H = [I 0], where I is a 6×6 identity matrix and 0 is a 6×6 zero matrix.

B. Prediction

Using the standard Kalman filter notation [38], the state prediction is made by

x̂_{k+1|k} = A x̂_{k|k}.     (21)

If P_{k|k} represents the process covariance, the process covariance prediction is

P_{k+1|k} = A P_{k|k} Aᵀ + Q.     (22)

Q represents the noise associated with the process covariance and is defined by (23). Matrix Q is a constant matrix and is found experimentally.


In this matrix, the uncertainties associated with the rotational components of the state variable are defined to be smaller than those of the translational parameters. This is mainly due to the fact that the accuracy of the estimated rotational parameters of the motion is higher. These values, however, are defined to be larger than the estimated uncertainties associated with the measurements, shown in (26). Such larger uncertainties emphasize the fact that the measurement values are more reliable under normal circumstances. However, if for any reason the LS minimization for estimating the motion parameters fails, then the covariance matrix of the measurements in (26) is changed to an identity matrix, forcing the system to give larger weight to the predicted values.

C. Measurement

The measurement prediction is computed as

ẑ_{k+1|k} = H x̂_{k+1|k}.     (24)

The new position of the robot is obtained by updating its previous position p_k with the camera motion δ_{k+1} estimated by the LS minimization in (15), and it serves as the measurement

z_{k+1} = p_k + δ_{k+1}.     (25)

The covariance R of the measurement is obtained by computing the inverse of JᵀJ from the LS minimization in Section V-A [32].

Matrix (26) shows a typical measurement covariance R computed by our system during one of the tracking processes. If, for any reason, a feasible result for the measurement vector is not found by the LS minimization procedure, R is set to an identity matrix. An R with larger components, compared with Q, causes the system to give the prediction values a higher weight than the unknown measurements, which are set to zero.

Fig. 6. Camera position Kalman filtering model.


D. Update

The Kalman filtering process can be presented by the following standard set of relationships for the Kalman gain, the state update, and the covariance update:

K_{k+1} = P_{k+1|k} Hᵀ (H P_{k+1|k} Hᵀ + R)⁻¹
x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1} (z_{k+1} − H x̂_{k+1|k})
P_{k+1|k+1} = (I − K_{k+1} H) P_{k+1|k}.

Fig. 6 represents a graphical presentation of the Kalman filtering model that is used for the linear motion of the camera.
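Putting (17)–(25) and the update relationships together, a minimal constant-velocity filter for the 6-DOF camera pose can be sketched as follows. The state stacks the pose and its velocity; the Q and R values here are arbitrary placeholders for the experimentally chosen matrices (23) and (26), and Eigen is used only for the matrix algebra.

```cpp
// Constant-velocity Kalman filter for the 6-DOF camera pose.
#include <Eigen/Dense>

struct CameraKalman {
    Eigen::Matrix<double, 12, 1>  x = Eigen::Matrix<double, 12, 1>::Zero();   // [pose; velocity]
    Eigen::Matrix<double, 12, 12> P = Eigen::Matrix<double, 12, 12>::Identity();
    Eigen::Matrix<double, 12, 12> Q = Eigen::Matrix<double, 12, 12>::Identity() * 1e-3;  // placeholder for (23)
    Eigen::Matrix<double, 6, 6>   R = Eigen::Matrix<double, 6, 6>::Identity() * 1e-4;    // placeholder for (26)

    // A = [I T*I; 0 I] with sampling rate T = 1, and H = [I 0].
    void step(const Eigen::Matrix<double, 6, 1>& zMeasuredPose) {
        Eigen::Matrix<double, 12, 12> A = Eigen::Matrix<double, 12, 12>::Identity();
        A.topRightCorner<6, 6>() = Eigen::Matrix<double, 6, 6>::Identity();
        Eigen::Matrix<double, 6, 12> H = Eigen::Matrix<double, 6, 12>::Zero();
        H.leftCols<6>() = Eigen::Matrix<double, 6, 6>::Identity();

        // Prediction, (21)-(22).
        x = A * x;
        P = A * P * A.transpose() + Q;

        // Update with the pose measured by the least-squares motion estimate, (25).
        Eigen::Matrix<double, 6, 6>  S = H * P * H.transpose() + R;
        Eigen::Matrix<double, 12, 6> K = P * H.transpose() * S.inverse();
        x = x + K * (zMeasuredPose - H * x);
        P = (Eigen::Matrix<double, 12, 12>::Identity() - K * H) * P;
    }
};
```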

E. Feature Position Uncertainty

Uncertainties in the image coordinates (u, v) and in the disparity values d of the features from the stereo algorithm propagate to uncertainty in the features' 3-D positions.

(23): the process-noise covariance matrix Q, a constant matrix determined experimentally.

(26): a typical measurement covariance matrix R computed by the system during tracking.


A first-order error propagation model [39] is used to compute the positional uncertainties associated with each feature in space. With the stereo reconstruction X = (u − u_0)b/d, Y = (v − v_0)b/d, and Z = fb/d, the propagated variances are

σ_X² = (b/d)² σ_u² + ((u − u_0)b/d²)² σ_d²
σ_Y² = (b/d)² σ_v² + ((v − v_0)b/d²)² σ_d²
σ_Z² = (fb/d²)² σ_d².     (27)

Here, u_0, v_0, b, and f represent the image center, the stereo camera separation, and the camera focal length, and σ_u², σ_v², and σ_d² are the variances of u, v, and d, respectively. Based on the results given in Section V-C, where the mean error of the LS minimization is less than one pixel, σ_u, σ_v, and σ_d are each assumed to be on the order of one pixel. Therefore, the variances of each feature's 3-D position, in the current camera frame coordinates, are computed according to the above error-propagation formula.
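A direct transcription of (27) is shown below. The triangulation model X = (u − u_0)b/d, Y = (v − v_0)b/d, Z = fb/d is the standard one assumed here, and the function signature is illustrative.

```cpp
// First-order propagation of image and disparity noise to 3-D position, (27).
struct Covariance3 { double varX, varY, varZ; };

Covariance3 propagateStereoUncertainty(double u, double v, double d,
                                       double u0, double v0, double b, double f,
                                       double sigmaU, double sigmaV, double sigmaD) {
    const double bd   = b / d;                  // dX/du and dY/dv
    const double dd   = d * d;
    const double dXdD = -(u - u0) * b / dd;     // dX/dd
    const double dYdD = -(v - v0) * b / dd;     // dY/dd
    const double dZdD = -f * b / dd;            // dZ/dd

    Covariance3 c;
    c.varX = bd * bd * sigmaU * sigmaU + dXdD * dXdD * sigmaD * sigmaD;
    c.varY = bd * bd * sigmaV * sigmaV + dYdD * dYdD * sigmaD * sigmaD;
    c.varZ = dZdD * dZdD * sigmaD * sigmaD;
    return c;
}
```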

F. Feature Update

Each time a feature is observed, a new set of measurements is obtained for that feature in space. Therefore, at the end of each frame, and after estimating the motion, the world features found in the current frame are used to update the existing global feature set. This requires that these features first be transformed into the global coordinate system. Next, the positional mean and covariance of each feature are combined with those of the corresponding match in the global set. The 3-D uncertainty of a feature in the current frame is computed as described in (27). However, when this feature is transformed into the global coordinate system, the uncertainty of the motion estimation and of the robot position propagates to the feature's 3-D position uncertainty in the global frame. Therefore, before combining the measurements, the 3-D positional uncertainties of the feature are updated first.

From the LS minimization procedure, the current robot pose, as well as its covariance, can be obtained [32]. The current position of a feature can be transferred into the reference frame by

P_r = R(φ_x, φ_y, φ_z) P_c + T     (28)

where P_c and P_r are the observed 3-D position of the feature in the current frame and the corresponding transformed position in the reference frame, respectively. T = (X_r, Y_r, Z_r) and R(φ_x, φ_y, φ_z) represent the location and the orientation of the camera-head system in the reference frame, respectively.

G. Feature Covariance Update

The goal is to obtain the covariance of each feature in the reference coordinate system, given the diagonal uncertainty matrix of the observed feature in the current frame, consisting of σ_X², σ_Y², and σ_Z². Since each feature point undergoes a rotation and a translation when transformed from the local coordinate system to the global coordinate system, the corresponding covariances of X, Y, and Z must be transformed using the same transformation.

Fig. 7. Positional uncertainties associated with features in the image plane.

Considering that each motion estimation consists of a rotational and a translational component, the updated covariance of each feature after the transformation is defined by

C_r = C_rot + C_T     (29)

where the first term, C_rot, represents the covariance due to the rotation, and the second term, C_T, represents the translational covariance. The components of the translational covariance matrix C_T are the translational uncertainties associated with the motion estimated by the LS minimization.

Details of the computation of C_rot are presented in the Appendix.

H. Feature Position Update

To update the 3-D position of a feature [40], the transformed covariance matrix C_r is combined with the existing covariance C_g of the matching global feature to obtain the new covariance matrix

C_new = (C_r⁻¹ + C_g⁻¹)⁻¹.     (30)

The new global position of the feature, P_new, is then found using the covariances, the transformed position P_r [from (28)], and the previous position P_g

P_new = C_new (C_r⁻¹ P_r + C_g⁻¹ P_g).     (31)
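A small helper implementing the covariance-weighted combination of (30) and (31), written in information form, might look as follows; Eigen is again only a convenience, and the struct is illustrative.

```cpp
// Covariance-weighted fusion of an observed feature with its global estimate.
#include <Eigen/Dense>

struct FeatureEstimate {
    Eigen::Vector3d position;
    Eigen::Matrix3d covariance;
};

FeatureEstimate fuse(const FeatureEstimate& transformedObservation,   // P_r, C_r
                     const FeatureEstimate& globalFeature) {          // P_g, C_g
    const Eigen::Matrix3d infoR = transformedObservation.covariance.inverse();
    const Eigen::Matrix3d infoG = globalFeature.covariance.inverse();

    FeatureEstimate fused;
    fused.covariance = (infoR + infoG).inverse();                                  // (30)
    fused.position   = fused.covariance *
                       (infoR * transformedObservation.position +
                        infoG * globalFeature.position);                           // (31)
    return fused;
}
```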

I. Experimental Results

Fig. 7 shows the projection onto the image plane of the estimated uncertainties associated with world features. In this figure, the uncertainties associated with closer objects are very small, and therefore appear as bright dots. As expected, farther features, for instance, features around the windows in the upper right corner of the scene, have larger projected uncertainties. Some of the closer features also have large positional uncertainties associated with them.


Fig. 8. Original and updated positional uncertainties of world features.

These large positional uncertainties imply incorrect depth estimation for those features.

To get a closer look at the 3-D scene features and their positional uncertainties, Fig. 8 is generated. It demonstrates a top view of the world features and their associated uncertainties. As clearly displayed, the positional uncertainties associated with the features grow in dimension as the features move away (in depth) from the camera plane and as they move to the sides (away from the camera center). In this figure, dotted ellipsoids show the original uncertainties, and solid ellipsoids show the updated uncertainties.

Results of the Kalman filters incorporated into the trajectory-tracking system are presented in Section VII.

VII. EXPERIMENTAL RESULTS

This section contains the experimental results obtained from implementing the solution strategies put forth in the previous sections.

A. Camera System: Digiclops

The Digiclops stereo vision camera system is designed and implemented by Point Grey Research (Vancouver, BC, Canada) [28]. It provides real-time digital image capture for different applications. It includes three monochrome cameras, each a VL-ICX084 Sony charge-coupled device (CCD) with a VL-2020 2.0-mm Universe Kogaku America lens, and a software system with an IEEE-1394 interface. These three cameras are rigidly positioned so that each adjacent pair is horizontally or vertically aligned.

For this work, the intrinsic camera parameters are also provided by Point Grey Research. The camera system captures gray-scale images. In order to reduce the ambiguity between yaw rotation and lateral translation, a set of wide-angle lenses with a 104° FOV is used. These lenses incorporate information from the sides of the images, which behaves differently under translational and rotational movements.

Fig. 9. Outdoor scene for Experiment 1.

B. Trajectory Estimation

The performance of the system is evaluated based on its cumulative trajectory error, or positional drift. For this purpose, experiments are performed on closed paths. On a closed path, the robot starts from an initial point with an initial pose. After roving around, it returns to the exact initial point. To ensure returning to the exact initial pose, an extra set of images is acquired at the starting position, right before the robot starts its motion. This set is used as the last set, and with it, the starting and ending points are projected onto an exact physical location. In an ideal condition, the expected cumulative positional error is zero, and therefore anything else represents the system's drift.

C. Experiment 1

In this experiment, the robot moves along an outdoor path. The scene was a natural environment including trees, leaves, and building structures that were located at distances of 0.1–20 m from the camera image plane. The traversed path was 6 m long, and along the path, 172 raw (warped) images were captured and processed. Fig. 9 shows the scene in this experiment. In this figure, the traversed path is highlighted with a dark line, and the facing of the camera is shown using a white arrow. The robot starts the forward motion from point A to point B. At point B, the backward motion begins, until point A is reached.

Although the scene includes some structures from the building, over 90% of the features belong to the unstructured objects of the scene. Fig. 10 shows the overall estimated trajectory along the entire path. In this figure, the gradual motion of the camera system is displayed using a graphical interface that is written in Visual C and VTK 4. The estimated trajectory at each frame is shown with a dark sphere, and the orientation is displayed with a light-colored cone. The center of the reference camera is considered the center of the motion. Table II displays the cumulative trajectory error in this experiment. From this table, the translation error is about 2.651 cm, which is only 0.4% of the overall translation.


Fig. 10. Graphic representation of the traced path in Experiment 1, using Visualization Toolkit 4.

TABLE II
3-D DRIFT FOR A 6-m LONG TRANSLATION

D. Experiment 2

In this experiment, the robot moves on a closed circular path,including a full 360 yaw rotation. The orientation of the camerais toward the ground. During this experiment, 101 raw imagesare captured and processed. Fig. 11 represents the overview ofthe environment in this scenario.

Fig. 12 represents the overall estimated trajectory from acloser distance. The cumulative error in this case is representedin Table III.

From this table, the overall rotational error in this experiment is about 3.341° or 0.9%, and the translational error is 1.22 cm or 0.6%.

E. Trajectory Estimation Refinement by Kalman Filtering

Comparison of the estimated trajectory with and without the Kalman filtering scheme is carried out through the comparison of the cumulative error in the 3-D trajectory parameters. This comparison is studied for the case presented in Section VII-C, in which the traversed path is 6 m long. Table IV presents the results of this comparison; the table lists the overall translational and rotational errors. The overall translational error with the Kalman filter is considerably less than that without the Kalman filtering algorithm. The rotational error with the Kalman filtering is slightly larger. However, both these values, 1.01° and 0.39°, are very small, and they can easily be due to noise in the estimation process. Fig. 13 represents the estimated trajectories in the presence of the Kalman filtering scheme. As shown

Fig. 11. Outdoor scene used for the rotational motion.

Fig. 12. Closer view of the circular path with a radius of 30 cm in Experiment 2.

TABLE III
3-D DRIFT FOR MOTION WITH 360° ROTATION ON A CIRCULAR PATH WITH A 60-cm RADIUS

at the top of this figure, the robot moves along the path for 3 m, and then it returns to its starting point. The overall translational error for this experiment is about 2.66 cm.

Fig. 14 represents the estimated orientation for this experiment. The cumulative orientation error for this experiment is about 1°. As represented in the second row of Table IV, the positional error is increased to 6.31 cm when the Kalman filter is turned off.


TABLE IV
COMPARISON OF REDUCTION OF 3-D DRIFT FOR A 6-m LONG PATH USING KALMAN FILTER

Fig. 13. Estimated distances for a 6-m long path with Kalman filter.

Fig. 14. Estimated orientations for a 6-m long path with Kalman filter on.

F. Trinocular and Binocular Stereo Comparison

Establishing accurate match correspondences in a stereo system is a key issue in 3-D reconstruction and trajectory-tracking problems. The physical arrangement of the cameras in stereo vision plays an important role in the correspondence matching problem. The accuracy of the depth reconstruction has a direct relationship with the baseline, and it can be improved by choosing a wider separation between the stereo lenses. On the other hand, a narrower baseline facilitates a faster search scheme when establishing the correspondences in the stereo image pair. The use of more than one stereo camera was originally introduced to compensate for the tradeoff between the accuracy and ease of the match correspondences [41].
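To make the baseline tradeoff concrete, the standard triangulation relation for a rectified stereo pair (a textbook result, not derived in this paper) gives

\[
Z \;=\; \frac{f\,b}{d}
\qquad\Longrightarrow\qquad
\sigma_{Z} \;\approx\; \Big|\frac{\partial Z}{\partial d}\Big|\,\sigma_{d}
\;=\; \frac{Z^{2}}{f\,b}\,\sigma_{d},
\]

where Z is depth, f the focal length in pixels, b the baseline, d the disparity, and sigma_d the matching error in pixels: a wider baseline b shrinks the depth error, but it also widens the disparity range that must be searched.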

The stereo baselines are almost identical in length for the Digiclops camera system. Therefore, an improvement of the accuracy by means of multibaseline stereo matching is not expected. However, the noncollinear arrangement of the lenses adds a multiple view of the scene that could improve the robustness, and therefore, the long-term system accuracy. This is mainly because:

• generally, a wider scope of the scene is viewable by the three images, increasing the number of the features;

• moreover, the third image is used for a consistency check, eliminating a number of unstable features that are due to shadows and light effects.

The above improvement, however, could potentially cause a slowdown in the system, as each time, there is one extra image to be processed.

To assess the effectiveness of trinocular stereo versus binocular, an experiment was undertaken in which a closed path (of length 3 m) is traversed. Once again, the cumulative error is used as a measure of system performance. Table V shows the resultant error in both cases; the table lists the overall translational and rotational errors. These values clearly show the similarity of the motions estimated by the two systems. Considering that the cost and the complexity of a binocular system are less than those of a trinocular stereo system, binocular stereo might be a better solution for some applications.

G. Computational Cost

The presented system is implemented in the Microsoft Visual C++ 6.0 language, on a 1.14-GHz AMD Athlon processor under the Microsoft Windows operating system. The camera system captures gray-scale images. An effort has been made to optimize the code and modularize the system, in order to obtain fast subsystems with lower communication cost and memory requirements.

The most severe drawback of the system is its high computational requirement. Currently, for outdoor scenes, it has a rate of 8.4 s per frame, and for indoor scenes, it performs at a rate of 2.4 Hz [42]. The number of features has a great impact on the speed of our system. If n represents the number of corners, the stereo matching and construction stages have complexities of O(n²), and tracking 3-D features from one frame to another also has a complexity of O(n²). This is due to the fact that both the tracking and stereo processes rely heavily on the normalized mean-squared differences function for measuring similarities. For instance, when we moved from indoors to outdoors, the number of features (1200 corner points) became four times larger than in the indoor scene (300 corner points). This factor increases the running time of the tracking and stereo tasks alone by a minimum factor of 16. As expected, the running times of these two procedures increased to 5.1 and 1.35 s (from 0.21 and 0.057 s for our indoor scene).
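As a rough illustration of where this quadratic cost comes from (a sketch, not the authors' implementation), each of the n corners in one image is compared against candidate corners in the other image with a window-based similarity score such as the one below, so the matching work grows with the product of the two corner counts. The code assumes 8-bit, row-major gray-scale images of equal width, and the exact normalization used in the paper may differ.

#include <cmath>

// One common form of a normalized, mean-removed squared-difference score
// between two square patches. Lower scores mean higher similarity.
double nmsd(const unsigned char* imgA, const unsigned char* imgB, int width,
            int ax, int ay, int bx, int by, int half) {
    const int n = (2 * half + 1) * (2 * half + 1);
    double meanA = 0.0, meanB = 0.0;
    for (int dy = -half; dy <= half; ++dy)
        for (int dx = -half; dx <= half; ++dx) {
            meanA += imgA[(ay + dy) * width + (ax + dx)];
            meanB += imgB[(by + dy) * width + (bx + dx)];
        }
    meanA /= n;
    meanB /= n;
    double diff = 0.0, energyA = 0.0, energyB = 0.0;
    for (int dy = -half; dy <= half; ++dy)
        for (int dx = -half; dx <= half; ++dx) {
            const double a = imgA[(ay + dy) * width + (ax + dx)] - meanA;
            const double b = imgB[(by + dy) * width + (bx + dx)] - meanB;
            diff += (a - b) * (a - b);
            energyA += a * a;
            energyB += b * b;
        }
    return diff / (std::sqrt(energyA * energyB) + 1e-9);
}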


TABLE V
COMPARISON OF THE CUMULATIVE ERROR FOR TRINOCULAR AND BINOCULAR STEREOS

Theoretically, having three correct matches should be enough to provide an answer for the motion-estimation problem using LS minimization. However, during our work, we noticed that a minimum of 40 matched inliers is necessary for a reliable solution.

It is important to see the tradeoff between the system processing rate, the motion rate, and the search-window dimensions in the tracking process. A smaller motion between two consecutive frames results in smaller displacements of image features in the two corresponding image frames. In such conditions, corresponding features can be found by searching over smaller regions. Smaller windows speed up the system processing rate. Therefore, with a slower-moving robot, a faster performance can be achieved.
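As an illustrative back-of-the-envelope bound (our notation, not the paper's), the largest feature displacement between consecutive frames, and hence the required search-window side w, grows with the interframe motion:

\[
w \;\gtrsim\; d_{\max} \;\approx\; \frac{f\,\lVert t\rVert}{Z_{\min}} \;+\; f\,\lVert \omega \rVert
\quad\text{(pixels, near the image center)},
\]

where t and omega are the interframe translation and rotation, Z_min the nearest scene depth, and f the focal length in pixels. Because an exhaustive correlation search visits on the order of w² candidate positions per feature, halving the robot speed roughly quarters the per-feature matching cost.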

The computational cost may be reduced by creating an image-resolution pyramid. Features can be detected on the coarser level, and using them, a rough motion estimate is obtained that can then be refined by moving to a finer pyramid level. Another way to improve the processing rate is to process only selected patches of each image instead of the entire image. Employment of specific hardware (e.g., field-programmable gate arrays) that allows the system to perform bitwise parallel operations can also improve the speed of the system.
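A minimal sketch of the pyramid idea (8-bit, row-major images assumed; the helper name downsample2x is hypothetical): one coarser level is built by 2x2 averaging, a rough motion estimate is computed there, and that estimate is scaled up to seed a narrow search at full resolution.

#include <vector>

// Build one coarser pyramid level by 2x2 box averaging.
std::vector<unsigned char> downsample2x(const std::vector<unsigned char>& img,
                                        int width, int height) {
    std::vector<unsigned char> out((width / 2) * (height / 2));
    for (int y = 0; y + 1 < height; y += 2)
        for (int x = 0; x + 1 < width; x += 2) {
            const int sum = img[y * width + x] + img[y * width + x + 1] +
                            img[(y + 1) * width + x] +
                            img[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + (x / 2)] =
                static_cast<unsigned char>(sum / 4);
        }
    return out;
}
// A displacement of k pixels found at the coarse level corresponds to roughly
// 2k pixels at full resolution, so the fine-level search window can stay small.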

VIII. CONCLUSION

This paper has presented the successful development of a general-purpose 3-D trajectory-tracking system. It is applicable to unknown indoor and outdoor environments, and it requires no modifications to be made to the scene. The primary novelty of this paper is a methodology for obtaining camera trajectories outdoors, in the presence of possibly moving scene features, without the need for odometry or sensors other than vision. Contributions of this paper can be summarized as follows.

• A novel fast feature-detection algorithm named the BCD has been developed. A 60% performance improvement is gained by substituting arithmetic operations with logical ones. Since the main assumption for the whole system has been that temporal changes between consecutive frames are not large, a faster feature detector leads to fewer temporal changes between the consecutive frames, and therefore, results in a higher accuracy in the overall system.

• Due to imperfect lenses, the acquired images include some distortions that are corrected through the calibration process. Not only is the image calibration at each frame for the trinocular camera images a time-consuming process, but it can also add positional shifts (error) to image pixels. This process degrades 3-D reconstruction results, and increases the cumulative error in the overall trajectory-tracking process. To remove this undesired effect, a calibration map for each of the cameras is constructed that defines the relationship between the integer position of the uncalibrated pixels and the corresponding floating-point location on the calibrated image (see the lookup-table sketch after this list). Operating on the warped (raw) images allows one to work with sharper details. It also provides a faster processing time by eliminating the calibration process for three individual images.

• Correct identification of identical features depends on several factors, such as search boundaries, similarity-measurement window size, and the robot's motion range. Expanding the search boundaries and the window size for similarity evaluation can improve the accuracy by adding more correct matches. They can, however, slow down the performance, leading to a larger motion for the same camera speed between two consecutive frames. A larger motion introduces more inaccuracy into the system. To improve the accuracy, a two-stage tracking scheme is introduced, in which the match correspondences are first found using a large search window and a smaller similarity-measurement window. Through this set of correspondences, a rough estimation of the motion is obtained. These motion parameters are used in the second stage to find and track identical features with higher accuracy (see the sketch after this list). The process increases the tracking accuracy by up to 30%.
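As a hedged illustration of such a calibration map (the array layout and names below are assumptions, not the paper's data structures), each integer pixel of the raw image stores the precomputed floating-point coordinates of its calibrated counterpart, so undistortion reduces to a table lookup:

#include <vector>

// Hypothetical calibration map: for every integer pixel (x, y) of the raw
// (warped) image, the floating-point location of the corresponding point on
// the calibrated image is precomputed once, so undistortion becomes a lookup.
struct CalibrationMap {
    int width = 0;
    int height = 0;
    std::vector<float> calibratedX;  // width * height entries
    std::vector<float> calibratedY;  // width * height entries

    void lookup(int x, int y, float& cx, float& cy) const {
        cx = calibratedX[y * width + x];
        cy = calibratedY[y * width + x];
    }
};
// Features are detected and matched on the raw images (sharper details, no
// per-frame warping); only the matched feature positions are mapped through
// the table before the 3-D reconstruction step.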
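The two-stage scheme in the last bullet can be sketched as follows; findMatches, findMatchesPredicted, and estimateMotion are hypothetical placeholders for the system's matching and least-squares modules, and the window sizes are illustrative only:

#include <vector>

struct Match { int featureId; float x, y; };     // a 2-D correspondence
struct Motion { float t[3]; float angles[3]; };  // translation and roll/pitch/yaw

// Placeholder stubs standing in for the system's real matching and
// least-squares estimation modules.
std::vector<Match> findMatches(int searchRadius, int similarityHalfWindow) { return {}; }
std::vector<Match> findMatchesPredicted(const Motion& predicted, int searchRadius,
                                        int similarityHalfWindow) { return {}; }
Motion estimateMotion(const std::vector<Match>& matches) { return {}; }

Motion twoStageTracking() {
    // Stage 1: large search window, small similarity window -> rough motion.
    std::vector<Match> coarse = findMatches(/*searchRadius=*/20,
                                            /*similarityHalfWindow=*/3);
    Motion rough = estimateMotion(coarse);

    // Stage 2: the rough motion predicts where each feature should reappear,
    // so a narrow search with a larger similarity window yields more accurate
    // matches, from which the motion is re-estimated.
    std::vector<Match> refined = findMatchesPredicted(rough, /*searchRadius=*/4,
                                                      /*similarityHalfWindow=*/7);
    return estimateMotion(refined);
}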

APPENDIX

Feature Uncertainty Computation

Rotational Covariance Computation: Given two points related by the following relationship:

(32)

where the transformation matrix is, in our case, a rotation matrix, we would like to compute the uncertainty associated with the transformed point, given the uncertainties associated with the original point and the transformation. Here, the old and new positions are 3-D vectors. If there are errors associated with both the point (a covariance matrix for the position) and the rotation (a covariance matrix for the rotation parameters), the covariance of the resulting vector is computed by [45]

(33)

In (33), the third matrix is the transpose of the first. With the assumption that at each time step the three rotation angles are small, and therefore independent, the transformation proceeds, in order, for the roll, pitch, and yaw rotations. Variances of the three rotation angles are already found


during the last motion estimation. Required transformations for each stage and how the positional uncertainties propagate are explained next.
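Since (33) is not reproduced above, the following standard first-order propagation (our restatement under the stated independence assumption, not a quotation of the paper) is consistent with this description: for a point p with covariance Lambda_p rotated by R(theta) whose angle estimate has variance sigma_theta^2,

\[
\Lambda_{p'} \;\approx\; R(\theta)\,\Lambda_{p}\,R(\theta)^{\top}
\;+\; J_{\theta}\,\sigma_{\theta}^{2}\,J_{\theta}^{\top},
\qquad
J_{\theta} \;=\; \frac{\partial\,\big(R(\theta)\,p\big)}{\partial \theta}.
\]

Applied once per rotation axis in the roll, pitch, and yaw order, this form mirrors the per-stage covariance updates described below.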

1) Roll Transformation: The roll transformation is defined by

(34)

With the assumption that the noise associated with the rotational angles is Gaussian with zero mean, the covariance matrix for the roll transformation is computed by

(35)

where

Variance (36)

Variance (37)

The required expected value is computed [46], with the assumption that the rotation angle has a Gaussian distribution, by

(38)

(39)

therefore

(40)

(41)
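For reference, the standard identities for a Gaussian angle (our addition; presumably what (38)–(41) evaluate) are

\[
\mathrm{E}[\cos\theta] \;=\; \cos(\mu_{\theta})\,e^{-\sigma_{\theta}^{2}/2},
\qquad
\mathrm{E}[\sin\theta] \;=\; \sin(\mu_{\theta})\,e^{-\sigma_{\theta}^{2}/2},
\qquad
\theta \sim \mathcal{N}(\mu_{\theta},\,\sigma_{\theta}^{2}).
\]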

Using (33) and the rotational transformation equations, the covariance matrix after the roll rotation is computed from (42), shown at the bottom of the page. Here, the 3-D global location of the feature in the current frame is used. Since this is the first time the transformation is carried out, the initial covariance matrix of the observation, which is a diagonal matrix, is used. Applying the roll transformation to the initial position of the feature provides the transformed position of the feature. This new position is used for the next stage, and the uncertainty associated with it is the covariance matrix just computed.

2) Pitch Transformation: Given the pitch transformation

(43)

the covariance matrix for the pitch rotation is computed in a similar way, as shown in (35). Once again, substituting the pitch rotational transform in (33) and the covariance matrix in (43), the new covariance matrix can be defined by (44), shown at the top of the next page, where the two required terms are defined by

(45)

(46)

In this formula, the pitch variance is found from the last motion estimation, and the transformed 3-D location of the feature after the roll transformation is used, together with the entries of the covariance matrix computed in the previous stage. Applying the pitch transform provides the transformed position of the feature, which is used in the next stage, together with this new feature covariance.

3) Yaw Transformation: The covariance matrix after the yaw rotation

(47)

is computed by (48), shown at the top of the next page, where the two required terms are defined by

(49)

(50)

and the variance of the yaw angle, estimated earlier in the motion-estimation process, is used. The transformed 3-D location of the feature after the pitch transformation and the corresponding entries of the covariance matrix from the pitch stage complete the computation.

(42)

(44)

(48)

REFERENCES

[1] C. Shin and S. Inokuchi, “Estimation of optical-flow in real time for navigation robots,” in Proc. IEEE Int. Symp. Ind. Electron., 1995, pp. 541–546.

[2] D. Murray and A. Basu, “Motion tracking with an active camera,” IEEE Trans. Pattern Anal. Machine Intell., vol. 16, no. 5, pp. 449–459, May 1994.

[3] R. Sim and G. Dudek, “Mobile robot localization from learned landmarks,” in Proc. IEEE Int. Conf. Intell. Robots Syst., 1998, pp. 1060–1065.

[4] S. Thrun, M. Bennewitz, W. Burgard, A. Cremers, F. Dellaert, D. Fox, D. Haehnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz, “MINERVA: A second generation mobile tour-guide robot,” in Proc. IEEE Int. Conf. Robot. Autom., 1999, pp. 1999–2005.

[5] A. Munoz and J. Gonzalez, “Two-dimensional landmark-based position estimation from a single image,” in Proc. IEEE Int. Conf. Robot. Autom., 1998, pp. 3709–3714.

[6] R. Basri and E. Rivlin, “Localization and homing using combinations of model views,” Artif. Intell., vol. 78, pp. 327–354, Oct. 1995.

[7] P. Trahanias, S. Velissaris, and T. Garavelos, “Visual landmark extraction and recognition for autonomous robot navigation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 1997, pp. 1036–1042.

[8] M. Lhuillier and L. Quan, “Quasidense reconstruction from image sequence,” in Proc. Eur. Conf. Comput. Vis., 2002, pp. 125–139.

[9] R. Harrell, D. Slaughter, and P. Adsit, “A fruit-tracking system for robotic harvesting,” Mach. Vis. Appl., vol. 2, no. 2, pp. 69–80, 1989.

[10] P. Rives and J. Borrelly, “Real time image processing for image-based visual servoing,” in Proc. Notes for Workshop WS2, IEEE Int. Conf. Robot. Autom., May 1998, pp. 99–107.

[11] E. Dickmanns, B. Mysliwetz, and T. Christians, “An integrated spatio-temporal approach to automatic visual guidance of autonomous vehicles,” IEEE Trans. Syst., Man, Cybern., vol. 20, no. 11, pp. 1273–1284, Nov. 1990.

[12] I. Sethi and R. Jain, “Finding trajectories of feature points in a monocular image sequence,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 1, pp. 56–73, Jan. 1987.

[13] M. Betke, E. Haritaoglu, and L. Davis, “Highway scene analysis in hard real-time,” in Proc. IEEE Conf. Intell. Transp. Syst., 1997, pp. 812–817.

[14] R. Basri, E. Rivlin, and I. Shimshoni, “Image-based robot navigation under the perspective model,” in Proc. IEEE Int. Conf. Robot. Autom., 1999, pp. 2578–2583.

[15] C. Harris, Geometry from Visual Motion. Cambridge, MA: MIT Press, 1992.

[16] D. Nistér, “Preemptive RANSAC for live structure and motion estimation,” in Proc. Int. Conf. Comput. Vis., 2003, pp. 199–206.

[17] D. Nistér, “An efficient solution to the five-point relative pose problem,” in Proc. Comput. Vis. Pattern Recog. Conf., 2003, pp. 756–777.

[18] N. Ayache, O. Faugeras, F. Lustman, and Z. Zhang, “Visual navigation of a mobile robot,” in Proc. IEEE Int. Workshop Intell. Robots, 1988, pp. 651–658.

[19] D. Wettergreen, C. Gaskett, and A. Zelinsky, “Development of a visually-guided autonomous underwater vehicle,” in Proc. IEEE Int. Oceans Conf., 1998, pp. 1200–1204.

[20] I. Jung and S. Lacroix, “High-resolution terrain mapping using low altitude aerial stereo imagery,” in Proc. Int. Conf. Comput. Vis., 2003, pp. 946–956.

[21] S. Se, D. Lowe, and J. Little, “Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks,” Int. J. Robot. Res., vol. 21, pp. 735–758, Aug. 2002.

[22] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[23] P. Saeedi, P. Lawrence, and D. Lowe, “3D motion tracking of a mobile robot in a natural environment,” in Proc. IEEE Int. Conf. Robot. Autom., 2000, pp. 1682–1687.

[24] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. 4th Alvey Vis. Conf., 1988, pp. 147–151.

[25] S. M. Smith and J. M. Brady, “SUSAN—A new approach to low-level image processing,” Int. J. Comput. Vis., vol. 23, pp. 45–78, May 1997.

[26] P. Saeedi, D. Lowe, and P. Lawrence, “An efficient binary corner detector,” in Proc. 7th Int. Conf. Control, Autom., Robot., Vis., Dec. 2002, pp. 338–343.


[27] C. Schmid, R. Mohr, and C. Bauckhage, “Comparing and evaluating interest points,” in Proc. Int. Conf. Comput. Vis., 1998, pp. 230–235.

[28] “Triclops Stereo Vision System,” P. G. R. Inc., Dept. Computer Science, Univ. of British Columbia, Vancouver, BC, Canada, www.ptgrey.com, Tech. Rep., 1997.

[29] M. Okutomi and T. Kanade, “A multiple-baseline stereo,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 4, pp. 353–363, Apr. 1993.

[30] P. Fua, “A parallel stereo algorithm that produces dense depth maps and preserves image features,” in Machine Vision and Applications. New York: Springer-Verlag, 1993.

[31] Machine Vision and Applications. New York: Springer-Verlag, 1993.

[32] D. Lowe, “Robust model-based motion tracking through the integration of search and estimation,” Tech. Rep. TR-92-11, 1992.

[33] D. Lowe, “Fitting parameterized three-dimensional models to images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 13, no. 5, pp. 441–450, May 1991.

[34] A. Gelb, Applied Optimal Estimation. Cambridge, MA: MIT Press, 1974.

[35] D. Lowe, Artificial Intelligence, Three-Dimensional Object Recognition from Single Two-Dimensional Images. Amsterdam, The Netherlands: Elsevier, 1987.

[36] M. J. Buckingham, Noise in Electronic Devices and Systems, ser. Electrical and Electronic Engineering. New York: Springer, 1983.

[37] M. Grewal and A. Andrews, Kalman Filtering, Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[38] C. Chui and G. Chen, Kalman Filtering with Real-Time Applications. New York: Springer-Verlag, 1991.

[39] P. Bevington and D. Robinson, Data Reduction and Error Analysis for the Physical Sciences. New York: McGraw-Hill, 1992.

[40] L. Shapiro, A. Zisserman, and M. Brady, “3D motion recovery via affine epipolar geometry,” Int. J. Comput. Vis., vol. 16, pp. 147–182, Oct. 1995.

[41] S. Kang, J. Webb, C. Zitnick, and T. Kanade, “A multibaseline stereo system with active illumination and real-time image acquisition,” in Proc. 5th Int. Conf. Comput. Vis., Jun. 1995, pp. 88–93.

[42] P. Saeedi, D. Lowe, and P. Lawrence, “3D localization and tracking in unknown environments,” in Proc. IEEE Int. Conf. Robot. Autom., vol. 1, Sep. 2003, pp. 1297–1303.

[43] S. Se, D. Lowe, and J. Little, “Vision-based mobile robot localization and mapping using scale-invariant features,” in Proc. IEEE Int. Conf. Robot. Autom., 2001, pp. 2051–2058.

[44] C. F. Olson, L. H. Matthies, M. Schoppers, and M. W. Maimone, “Rover navigation using stereo ego-motion,” Robot. Auton. Syst., vol. 43, pp. 215–229, Jun. 2003.

[45] J. Clarke, “Modeling Uncertainty: A Primer,” Univ. Oxford, Dept. Eng. Sci., Oxford, U.K., Tech. Rep., 1998.

[46] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

Parvaneh Saeedi (M’05) received the B.A.Sc. degree in electrical engineering from the Iran University of Science and Technology, Tehran, Iran, the Master’s degree in 1998, and the Ph.D. degree in 2005 in electrical and computer engineering from the University of British Columbia, Vancouver, BC, Canada.

Her research interests include computer vision, motion tracking, and automated systems for fast and high-precision DNA image processing.

Peter D. Lawrence (S’64–M’73) received the B.A.Sc. degree in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 1965, the Master’s degree in biomedical engineering from the University of Saskatchewan, Saskatoon, SK, Canada, in 1967, and the Ph.D. degree in computing and information science from Case Western Reserve University, Cleveland, OH, in 1970.

He was a Guest Researcher at Chalmers University’s Applied Electronics Department, Goteborg, Sweden, between 1970 and 1972, and between 1972 and 1974 was a Research Staff Member and Lecturer in the Mechanical Engineering Department, Massachusetts Institute of Technology, Cambridge. Since 1974, he has been with the University of British Columbia, Vancouver, BC, Canada, and is a Professor in the Department of Electrical and Computer Engineering. His main research interests include the application of real-time computing in the control interface between humans and machines, image processing, and mobile hydraulic machine modeling and control.

Dr. Lawrence is a registered Professional Engineer in the province of British Columbia, Canada.

David G. Lowe (M’85) received the Ph.D. degree in computer science from Stanford University, Stanford, CA, in 1984.

He is currently a Professor of Computer Science with the University of British Columbia, Vancouver, BC, Canada. From 1984 to 1987, he was an Assistant Professor with the Courant Institute of Mathematical Sciences, New York University, New York. His research interests include object recognition, local invariant features for image matching, robot localization, object-based motion tracking, and models of human visual recognition. He is on the Editorial Board of the International Journal of Computer Vision, and was Co-Chair of ICCV 2001 in Vancouver, BC, Canada.