
Indoor Localization and 3D Scene Reconstruction for Mobile Robots Using the Microsoft Kinect Sensor

Yuhua Zou∗, Weihai Chen†, Xingming Wu‡ and Zhong Liu§
School of Automation Science and Electrical Engineering

Beihang University, Beijing, China 100191
Email: ∗[email protected], †[email protected], ‡[email protected], §[email protected]

Abstract—In this paper we present an approach to indoor localization and 3D scene reconstruction using the Microsoft Kinect sensor. The proposed system simultaneously estimates the position and orientation of a hand-held Kinect and generates a dense 3D model of the indoor environment. Furthermore, the robustness and processing time of four different feature descriptors (SURF, ORB, Shi-Tomasi and FAST) are evaluated. The experimental results demonstrate that our system can robustly deal with complicated data in common indoor scenarios while running in semi-real-time.

I. INTRODUCTION

The ability to quickly acquire accurate 3D models of the observed environment and to estimate the camera pose with respect to that model is necessary in many applications in robotics and computer vision. While navigating between places, for example, a robot needs to know its location in the world. This is a classical and challenging problem: localizing the camera in the world requires a 3D model of the world, and building the 3D model in turn requires the position and orientation of the camera. Hence, both the camera trajectory and the 3D model need to be estimated concurrently.

With the introduction of RGB-D cameras, such as the Microsoft Kinect and the Asus Xtion, a new type of sensor has appeared on the market that provides both color images and dense depth maps at full video frame rate. This allows us to create a novel approach to simultaneous localization and mapping (SLAM) that combines the metric information of the depth maps with the strengths of visual features to create dense 3D environment models.

In this paper we present a method to build more complete models of both the robot’s trajectory and its surroundings. We utilize a Microsoft Kinect sensor along with a SLAM algorithm that creates a 3D model of the robot and its surroundings (Fig. 1). The model can be examined and manipulated by the operator to achieve better situation understanding. Using RGB-D data from the Kinect as input, the SLAM algorithm generates the 3D model of the observed environment in the four processing steps illustrated in Fig. 2. First, visual features are extracted from the incoming color images. Then these features are matched against features from previous images. By evaluating the depth images at the locations of these feature points, a set of point-wise 3D correspondences between any two frames is generated. Based on these correspondences,

   

Fig. 1: 3D viewpoint from the proposed system

the relative transformations between the frames are estimated using RANSAC.

The remainder of this paper is organized as follows. Related approaches are briefly reviewed in Sec. II. Popular 3D imaging techniques and our SLAM algorithm are then discussed in Sec. III, and their performance is demonstrated in Sec. IV. Finally, a short conclusion and outlook are presented in Sec. V.

II. RELATED WORK

The general problem of SLAM has been a hot area of research for several years in robotics [1]–[7]. The mainstream approaches designed to build 3D maps of the environment employ laser scanners or Time-of-Flight (ToF) cameras to produce dense point clouds of an environment. Many modern variants apply the iterative closest point (ICP) algorithm [8]–[10] for aligning pairs of local point clouds to establish constraints between observations. These constraints are then used to find the maximum likelihood map.

Visual SLAM systems [11]–[13], often referred to in the computer vision literature as structure and motion estimation [14], [15], typically extract sparse keypoints from the camera images. Visual feature points have the advantage of being more informative, which simplifies data association. Relevant feature descriptors include SIFT [16], SURF [17], and the recently introduced ORB features [18], as well as parallelized versions thereof like SIFTGPU [19]. In the monocular setting, the absolute scale of the map cannot be determined, so that additional normalization steps are required during optimization. Stereo SLAM systems [20], [21] do not suffer from this limitation, as the depth can be calculated from the disparity between the two images. In general, however, the disparity can only be estimated for distinctive points in the image, i.e., surfaces with little or no texture cannot be matched easily.

Novel RGB-D sensors based on structured light, like the Microsoft Kinect, directly provide dense depth maps and color images. Note that in general SLAM approaches that operate on RGB-D images are structurally different from stereo systems, as the input is dense RGB-D data instead of two color images. Fioraio and Konolige [22] recently presented a system that uses bundle adjustment to align the dense point clouds of the Kinect directly, however without further exploiting the RGB images. Most similar to our work is the approach of Henry et al. [23]. Their approach uses sparse keypoint matches between consecutive color images as an initialization to ICP. In their experiments, however, they found that the computationally expensive ICP step was often not necessary. Therefore, they improved the algorithm so that ICP was only used if few (or no) keypoint matches could be established. While Henry et al. use sparse bundle adjustment [24] for the optimization of the 2.5D reprojection errors in RGB-D image space, we optimize the 3D pose graph using the g2o framework [25]. Finally, the post-processing of the two approaches is different: Henry et al. post-process the resulting point cloud into a surfel representation, while we create a volumetric voxel representation [26] that can directly be used for robot localization, path planning and navigation [27].

III. METHOD

A. The Kinect Sensor

While there are a variety of commercial 3D image sensors available, their prices range from $1000 to over $100,000. The Microsoft Kinect gaming peripheral is the first commercial off-the-shelf sensor system that retails for less than $200. Despite the low cost, the Kinect provides robust 3D visual data, which makes it a compelling choice as a sensor for our system.

The Kinect is a depth camera (also called an RGB-D camera). It provides both a 640 × 480 pixel color image from an RGB camera and a depth image from an infrared (IR) camera supported by an IR laser projector, all at 30 frames per second (fps), with each pixel correlated between sensors. In this paper the depth data will also be referred to as the depth image. Depth is measured by emitting a pattern of structured infrared dots and determining the parallax shift of those dots for each pixel in the IR camera. Each pixel of the depth image has a resolution of 11 bits, giving up to 2048 raw disparity values. This provides a reliable range from approximately 0.5 m up to 5 m or more.

A more detailed description of the Kinect’s inner workings can be found at [28]. All algorithms were implemented using the open source software package OpenKinect [29] to access the Kinect’s RGB and depth data streams via a standard USB hardware interface.

B. The RGB-D SLAM Algorithm

This section gives a detailed description of our SLAM algorithm. A schematic overview of our system is given in Fig. 2. The trajectory estimation is divided into a front-end and a back-end. Whereas the front-end extracts spatial relations between individual observations, the back-end optimizes the poses of these observations in a so-called pose graph with respect to a non-linear error function.

[Fig. 2 shows the processing pipeline as a block diagram: the Kinect provides depth and RGB images; the front-end performs pairwise feature matching/tracking (SURF, ORB / optic flow) and pairwise transformation estimation (RANSAC); the back-end performs global pose graph optimization (g2o) and, from the resulting 3D point clouds, map building and voxelization (OctoMap) into a 3D occupancy grid map.]

Fig. 2: Sketch map of our SLAM algorithm. The extracted visual features are associated to 3D point clouds. Subsequently, the image frames are matched and registered to build a pose graph, which is optimized using g2o. Finally, a textured voxel occupancy map is generated using the OctoMapping approach.

In the front-end, we use the visual image of the RGB-D sensor to detect keypoints and extract descriptors. These are matched to previously extracted descriptors, and the relative transformation between the sensor poses is computed using RANSAC. Together with the depth information, this allows us to register dense point clouds in a common coordinate system.

To compute globally optimal poses for the sensor positions w.r.t. the estimated relative transformations, we use a graph-based optimization routine. After reconstruction of the trajectory, we compute an occupancy voxel grid map.

The algorithm developed for the proposed system operates in a pair-wise fashion on a series of input frames, where the current input frame F_{k+1} is compared to the previous frame F_k to obtain the output result. In this paper, a frame consists of a single RGB image and its corresponding depth image.

The output data consists of two parts. The first is the estimated location and orientation (pose) of the Kinect in F_{k+1} with respect to the global coordinate system established by the first frame, F_0; this is represented by a rotation and translation transformation (RT) matrix. The second is an accumulated 3D point cloud model representing the environment that the system has moved through thus far.
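
For concreteness, the sketch below shows one way such data could be represented in code; the container names and fields are illustrative assumptions, not the authors' data structures.

import numpy as np
from dataclasses import dataclass

# Illustrative containers only; names and fields are assumptions.
@dataclass
class Frame:
    rgb: np.ndarray     # H x W x 3 color image
    depth: np.ndarray   # H x W raw 11-bit disparity image

def rt_matrix(R, t):
    """Pack a 3x3 rotation R and a translation t into a 4x4 RT matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T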

There are three major steps in this algorithm:

1) Combining the Depth and RGB images: In order to create a 3D point cloud relative to the Kinect camera, we need to combine the depth and RGB images of the current frame. Although depth data is obtained directly from the IR camera, the IR and RGB images can be mapped to one another in the same manner as in a traditional stereo camera setup. However, the cameras must first be calibrated. The calibration method used in our system is the common checkerboard calibration method, adapted for the Kinect by Nicolas Burrus [30].

Additionally, the depth data is not usable in its raw state. Equation 1, introduced by Stephane Magnenat in a forum [31], is used to convert the raw 11-bit depth value to a depth value in meters.

depth = 0.1236 · tan (raw disparity/2842.5 + 1.1863) (1)

This formula yields the distance along the positive Z axis, which is perpendicular to the image plane. The X and Y axis values were found in a similar manner [30]. As with any stereo system, a re-projection error will be present after the calibration; for the Kinect sensor, it was measured to be approximately 1.5 pixels. Using the above methods, each depth data value is converted to an (X, Y, Z) triplet in a Cartesian coordinate system (where the depth camera lies at (0, 0, 0)), transformed into the coordinate system of the RGB camera, and then projected onto the RGB image. At the end of Step 1, a list of (u, v)-(X, Y, Z) correspondences has been created for the RGB image. Any pixel (u, v) in the RGB image without a 3D point correspondence is discarded.
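
A minimal sketch of this conversion and back-projection is given below. The pinhole intrinsics (FX, FY, CX, CY) and the IR-to-RGB transform are placeholders that would come from the checkerboard calibration; they are assumptions, not calibrated values.

import numpy as np

FX, FY, CX, CY = 594.0, 591.0, 339.0, 242.0   # placeholder intrinsics

def raw_disparity_to_depth(raw):
    """Eq. 1: convert 11-bit Kinect raw disparity to metric depth (meters)."""
    return 0.1236 * np.tan(raw / 2842.5 + 1.1863)

def depth_image_to_points(depth_raw, T_rgb_from_ir=np.eye(4)):
    """Back-project every depth pixel to (X, Y, Z) in the IR camera frame and
    transform the points into the RGB camera frame."""
    h, w = depth_raw.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = raw_disparity_to_depth(depth_raw.astype(np.float64))
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    pts = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_rgb = (T_rgb_from_ir @ pts.T).T[:, :3]
    valid = (pts_rgb[:, 2] > 0.5) & (pts_rgb[:, 2] < 5.0)   # reliable Kinect range
    return pts_rgb[valid]

Projecting the returned points through the RGB camera intrinsics then yields the (u, v)-(X, Y, Z) correspondences described above.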

2) Feature Extraction and Matching: Step 2 is the major part of our SLAM front-end, which is responsible for establishing spatial relations from the sensor data. There are two phases in this step: the extraction of interest points or features from the RGB image of the current frame (converted to gray scale), and then matching or tracking those points back to the RGB image of the previous frame. An additional constraint must be added to the feature matching algorithm: each matching pair of features must have a corresponding 3D point in their respective frames.

This step is the most computationally expensive in the algorithm when implemented on mobile hardware. Therefore the speed of the system as a whole is dependent on the methods used for feature detection and matching. Several well-known feature detection algorithms and feature tracking/matching methods were examined, including ORB [18], SURF [17], Shi-Tomasi (ST for short) corners [32], and FAST corners [33], [34]. Features were tracked using optic flow [35], and matched via descriptors in the case of SURF and ORB. Although GPU-accelerated implementations of feature detectors are becoming more and more popular, we do not include them in our testing, because the smaller, single-board computer systems often found on mobile robots cannot yet accommodate them. We select methods that can be more universally applied.
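
The sketch below illustrates the two association strategies with OpenCV; it is an example assuming OpenCV's stock ORB, Shi-Tomasi and pyramidal Lucas-Kanade implementations, not the authors' exact code.

import cv2
import numpy as np

def match_orb(gray_prev, gray_curr, n_features=500):
    """Descriptor matching: detect ORB keypoints in both frames and match
    their binary descriptors with a brute-force Hamming matcher."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(gray_prev, None)
    kp2, des2 = orb.detectAndCompute(gray_curr, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts_curr = np.float32([kp2[m.trainIdx].pt for m in matches])
    return pts_prev, pts_curr

def track_optic_flow(gray_prev, gray_curr, max_corners=500):
    """Tracking: detect Shi-Tomasi corners in the previous frame and track
    them into the current frame with pyramidal Lucas-Kanade optic flow."""
    pts_prev = cv2.goodFeaturesToTrack(gray_prev, max_corners,
                                       qualityLevel=0.01, minDistance=7)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(gray_prev, gray_curr,
                                                   pts_prev, None)
    ok = status.ravel() == 1
    return pts_prev[ok].reshape(-1, 2), pts_curr[ok].reshape(-1, 2)

Matches whose pixel locations have no valid 3D point in either frame are then discarded, as required by the constraint above.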

ORB is a new keypoint detector and feature descriptor combination recently introduced by Rublee et al. [18]. It is based on the FAST detector [33] and the BRIEF descriptor [36]. ORB computes an unambiguous orientation from the FAST corners and uses it for descriptor extraction, thus making the combination robust to viewpoint changes.

For the SURF keypoint detector, we apply a self-adjusting variant which increases or decreases the threshold on the Hessian to keep the number of keypoints roughly constant. While this slows down keypoint detection when scene properties fluctuate, a fixed threshold can lead to large variations in the number of keypoints and hence to inaccurate or even failed motion estimates. Too many features slow the system down in the matching step and may lead to many false positives.
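
A sketch of such a self-adjusting detector is shown below, assuming the contrib build of OpenCV that exposes SURF; the target keypoint count, gain and threshold bounds are assumptions rather than the parameters used in the paper.

import cv2

class SelfAdjustingSURF:
    def __init__(self, target=400, hessian=300.0, gain=0.3, lo=100.0, hi=4000.0):
        self.target, self.hessian, self.gain = target, hessian, gain
        self.lo, self.hi = lo, hi

    def detect_and_compute(self, gray):
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=self.hessian)
        kp, des = surf.detectAndCompute(gray, None)
        # Raise the Hessian threshold when too many keypoints were found,
        # lower it when too few, and clamp it to a sane range.
        error = (len(kp) - self.target) / float(self.target)
        self.hessian = min(self.hi, max(self.lo, self.hessian * (1.0 + self.gain * error)))
        return kp, des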

After the detection of the keypoints, we project the feature locations from the image to 3D using the depth measurement at the center of the keypoint, resulting in a list of 3D point correspondences. These 3D correspondences are used to calculate an estimated pose for the Kinect. The system adopts the Random Sample Consensus (RANSAC) algorithm [37], randomly selecting four correspondences at a time. Using these four correspondences, an estimate of the 3D rotation and translation from frame F_{k+1} to F_k is calculated using Singular Value Decomposition (SVD). Once the transformation is estimated, it is applied to all the features in F_{k+1}, and the result is checked against their matching points in F_k. The Euclidean distance between the point in F_k and the result of the transformation is calculated, and if the error is within a certain threshold, the match is called an inlier. For our application, we found an error threshold of 3-5 cm to yield the best results.

The RANSAC loop continues to select different sets of four correspondences at random until a transformation that yields more than 80% inliers is discovered, or the maximum number of iterations has been reached. Through experimentation, it was discovered that a maximum of 200 iterations was more than sufficient. Most often, the loop meets the 80% inlier condition within the first three iterations. In situations where the matching is poor, however, the RANSAC loop may perform a much higher number of iterations.

Once a suitable transformation estimate is found, a least-squares SVD is performed on all the inliers of that transformation, yielding a final refined estimate of the transformation matrix. This is then applied to all the points of F_{k+1} (not just the correspondences) to build up the 3D model of the environment. To keep the final model from becoming overly dense, the system only adds point clouds after either a translation of 75 cm or a rotation of 30 degrees. However, these parameters may be application specific.
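
The following sketch implements the described estimation loop under the stated assumptions (4-point samples, a 3-5 cm inlier threshold, an 80% early-exit condition and at most 200 iterations); it is an illustration of the technique, not the system's actual code.

import numpy as np

def rigid_transform_svd(src, dst):
    """Least-squares R, t with dst ≈ R @ src + t (Kabsch/SVD)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, c_dst - R @ c_src

def ransac_pose(pts_curr, pts_prev, thresh=0.04, max_iter=200, min_ratio=0.8):
    """Estimate the rigid motion mapping 3D points of F_{k+1} onto F_k."""
    n = len(pts_curr)
    best = np.zeros(n, dtype=bool)
    rng = np.random.default_rng()
    for _ in range(max_iter):
        idx = rng.choice(n, 4, replace=False)
        R, t = rigid_transform_svd(pts_curr[idx], pts_prev[idx])
        err = np.linalg.norm(pts_curr @ R.T + t - pts_prev, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
        if inliers.sum() > min_ratio * n:
            break
    # Refinement: least-squares SVD over all inliers of the best hypothesis.
    return rigid_transform_svd(pts_curr[best], pts_prev[best])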

Since the transformation is always applied backwards, that is, from F_{k+1} back to F_k, every frame is transformed into the coordinate system of the original frame, F_0.
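
Concretely, the global pose of each new frame is obtained by composing the pairwise estimate with the global pose of the previous frame; a short sketch, reusing rt_matrix and the (R, t) returned by ransac_pose from the sketches above:

def accumulate_pose(T_0_k, R, t):
    """Pose of F_{k+1} in F_0, given the pose of F_k in F_0 and the pairwise
    motion mapping F_{k+1} coordinates into F_k."""
    return T_0_k @ rt_matrix(R, t)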

While the described procedure is very fast (the exact timing depends on the number of features and the outlier percentage), after a few seconds the number of past frames becomes too high to compare a new frame against all previous frames. Therefore, we select a subset of twenty frames, consisting of the 3 most recent frames and uniformly sampled earlier frames, and compute the pairwise transformations in parallel using threads.
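
A sketch of this candidate selection is given below; match_against is a hypothetical placeholder for the pairwise matching routine, and the thread-pool usage only illustrates the parallel evaluation.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def select_candidates(num_frames, n_recent=3, n_total=20):
    """Indices of the 3 most recent frames plus uniformly sampled earlier frames."""
    recent = list(range(max(0, num_frames - n_recent), num_frames))
    earlier = np.arange(0, max(0, num_frames - n_recent))
    n_sampled = min(len(earlier), n_total - len(recent))
    if n_sampled > 0:
        picks = earlier[np.linspace(0, len(earlier) - 1, n_sampled).astype(int)]
        return picks.tolist() + recent
    return recent

# Example of evaluating the candidates in parallel:
#   with ThreadPoolExecutor() as pool:
#       pool.map(lambda i: match_against(new_frame, frames[i]),
#                select_candidates(len(frames)))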

If a frame can be matched to any predecessor, it is added as a node to the pose graph of the SLAM back-end, with the pairwise spatial relations connecting it to the existing pose graph. The process applied when no relation to previous nodes can be found depends on the application. If it is tolerable that the map is fragmented, a sensible action would be to keep the node even though it is disconnected, starting a new map fragment that may be connected later on through a loop closure. For evaluation purposes, we do not allow fragmentation in our experiments. If a frame cannot be matched, it is connected to the prior node in the pose graph under the assumption of a constant motion model with high uncertainty. While this usually leads to higher error values than evaluation on the biggest fragment, it facilitates the comparison to other approaches by reducing the comparison to fully connected trajectories.

3) Graph Optimization and Map Building: The pairwise transformations between sensor poses, as computed by the front-end, form the edges of a pose graph. Due to estimation errors, the edges do not form a globally consistent trajectory. To create a globally consistent trajectory, we optimize the pose graph using the g2o framework [25]. The g2o framework is an easily extensible graph optimizer that can be applied to a wide range of problems including several variants of SLAM and bundle adjustment. It performs a minimization of a non-linear error function that can be represented as a graph, such as the one created by the SLAM front-end described above. For more details on the error function, we refer the interested reader to [25].
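
For reference, the error function minimized over the pose graph (in the standard formulation used by g2o [25]) can be written as

F(x) = \sum_{\langle i,j \rangle \in \mathcal{C}} e(x_i, x_j, z_{ij})^{\top} \, \Omega_{ij} \, e(x_i, x_j, z_{ij}), \qquad x^{*} = \arg\min_{x} F(x)

where x_i and x_j are the poses of two graph nodes, z_{ij} and Ω_{ij} are the mean and the information matrix of the pairwise transformation estimated by the front-end, e(x_i, x_j, z_{ij}) is the residual between the predicted and the measured relative pose, and C is the set of edges.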

Global optimization is especially beneficial in the case of a loop closure, i.e., when revisiting known parts of the map, since the loop closing edges in the graph allow the accumulated error to be diminished. Unfortunately, large errors in the motion estimation step may impede the accuracy of large parts of the graph. This is primarily a problem in areas with highly ambiguous features. We therefore use a threshold to prune edges with high error values after the initial convergence and continue the optimization.
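
A hedged sketch of this pruning step is given below; graph, edge.chi2_error() and optimize() are hypothetical placeholders standing in for the corresponding optimizer calls, not the g2o API.

def prune_and_reoptimize(graph, chi2_threshold, iterations=10):
    """Optimize, drop edges whose residual error stays above the threshold,
    then continue optimizing the pruned graph."""
    graph.optimize(iterations)                      # initial convergence
    graph.edges = [e for e in graph.edges if e.chi2_error() <= chi2_threshold]
    graph.optimize(iterations)                      # continue optimization
    return graph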

The system described above computes a globally consistent trajectory. Using this trajectory, the original data can be used to construct a representation of the environment. Projecting all point measurements into a global 3D coordinate system leads to a straightforward point-based representation. Such a model, however, is highly redundant and requires vast computational and memory resources. To overcome these limitations, we use 3D occupancy grid maps to represent the environment. In our implementation, we use the octree-based mapping framework OctoMap [26]. Voxels are managed in an efficient tree structure that leads to a compact memory representation and inherently allows for map queries at multiple resolutions. The use of probabilistic occupancy estimation furthermore provides a means of coping with noisy measurements and errors in pose estimation. In contrast to a point-based representation, it represents free space and unmapped areas explicitly, which is essential for robot navigation and exploration tasks.
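
The probabilistic occupancy estimate itself is a per-voxel log-odds filter; the toy sketch below illustrates the idea with a flat dictionary of voxel indices (OctoMap stores the voxels in an octree instead), and the hit/miss increments are assumed values rather than OctoMap's defaults.

import numpy as np

L_HIT, L_MISS = 0.85, -0.4   # assumed log-odds increments

class VoxelMap:
    def __init__(self, resolution=0.05):
        self.res = resolution
        self.log_odds = {}                     # voxel index -> accumulated log-odds

    def integrate(self, point, occupied=True):
        key = tuple(np.floor(np.asarray(point) / self.res).astype(int))
        self.log_odds[key] = self.log_odds.get(key, 0.0) + (L_HIT if occupied else L_MISS)

    def occupancy(self, point):
        key = tuple(np.floor(np.asarray(point) / self.res).astype(int))
        return 1.0 / (1.0 + np.exp(-self.log_odds.get(key, 0.0)))   # sigmoid of log-odds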

IV. EXPERIMENTS

A. Computing Hardware

We wish to provide a usable 3D model of a robot and its surroundings to a human operator in order to improve control of the robot. For the 3D data to be usable, it must be processed and displayed at an acceptable rate; such a system would be pointless if the rate of data return were so slow as to impair the operator and decrease the effectiveness of the robot. Therefore, execution speed was of utmost importance when considering the various algorithms to be used in this system.

We also wished to design a system that could be adapted to different types of robots of various sizes and means of mobility. Smaller robots are limited by their payload capacity, and will therefore have much stricter constraints on their computing power. The experiment was designed to test the speed of the system on different mobile computing platforms as well as the accuracy of the output data. Table I shows the specifications of the computing hardware used in the experiments.

TABLE I: Computing Hardware Used

                      Processor (cores)   Speed (GHz)   Power (W)   Weight (kg)
Dell Inspiron 1501    Turion TL-50 (2)    1.6           150         2.8
SBS PC/104+           Pentium M (1)       2.0           8           0.4

B. Test Environment

All vision-based navigation and mapping systems rely on feature detection and matching in order to determine the camera pose. This can be problematic in environments that are not rich in natural features. Hence, the algorithms were tested in two different environments, one rich in features and one lacking them. Environment 1 is a lab where there are many features at depths within the Kinect’s optimal range. Environment 2 is a long, narrow hallway with sparse features and good depth data. Sample images of the two environments can be seen in Fig. 3 and Fig. 4.

Fig. 3: Environment 1 (laboratory).

Fig. 4: Environment 2 (narrow hallway).

C. Test Method

In order to maintain consistency, raw data was recorded from each of the two environments using a hand-held Kinect. The data was then fed into the system as if it were live input. The recorded data was tested with all combinations of feature extraction and matching algorithms and on each of the listed computing platforms. The results are presented in section D.

D. Results

Ideally, the output of the algorithms would be compared with the ground truth at every transformation. However, it is virtually impossible to obtain the ground truth in this manner. The most accurate points that can be measured are the starting and ending points of the recorded datasets. While such metrics are limited, the accuracy of the algorithms can also be judged by visual inspection of the final 3D model created. This may seem backwards, but for a human operator it is enough to simply have a model that at least closely resembles the environment. Unlike robots operating on their own, our brains are capable of filling in any missing details relevant to the immediate task at hand. The results are shown in Table II.

Table II shows which algorithms were able to process the entire data set without any failures. A score of "partial" means the algorithm broke at one or more points, but still managed to create separate models representing different parts of the scene. A mark of "fail" means the algorithm failed to extract any sort of meaningful model from the data.

TABLE II: SLAM Completeness with Different Configurations

                   Descriptor           Optic Flow
Environment     SURF     ORB       SURF     ORB      ST        FAST
1               Done     Done      Done     Done     Done      Partial
2               Done     Done      Done     Done     Partial   Fail

Contrary to the accuracy test, the speed test is easily measured. The detectors were tested on the feature-rich Lab data set. This ensures that the results are not affected by failed pose extraction, and provides a good example of a situation where fast execution speed can be difficult to obtain due to the high number of features. We also tested at a lower resolution of 320x240 and reduced the number of RANSAC iterations by half to try to improve speeds. The results can be seen in Table III, measured in frames per second. A frame rate result followed by a number in brackets indicates the number of times the system failed to find a transformation, that is, it failed to find the required number of usable matches to compute the 3D transformation. In such a case, the system simply inserts a break point and begins a fresh model from that frame onward.

TABLE III: Feature Extraction and Tracking Speed Comparison (algorithm speed in frames per second, FPS)

                                   Descriptor          Optic Flow
                                SURF     ORB       SURF     ORB      ST       FAST
Dell Inspiron 1501 (640x480)    1.12     3.68      3.09     6.72     7.58     9.16 (1)
SBS PC/104+ (640x480)           0.42     1.54      1.09     2.07     3.06     4.16 (2)
Dell Inspiron 1501 (320x240)    3.62     8.74      8.15     12.72    9.05     10.73 (2)
SBS PC/104+ (320x240)           1.31     2.43      2.17     3.11     3.83     4.92 (3)

The experiments yielded interesting results. Optic Flow is a fast feature tracking algorithm that works well with any type of feature extractor as long as there are enough features. SURF is a robust feature extractor that works in environments with dense or sparse features, and 3D models created using SURF features tended to be more accurate than those from other feature extractors. However, SURF is computationally expensive: it is significantly slower than the other feature extractors, making it unsuitable for real-time mobile applications. The Shi-Tomasi corner finder is fast, but provides fewer features than SURF. When used with Optic Flow, the model accuracy is comparable to SURF with descriptor matching. It is far less effective in bland environments, however, often failing to find enough features for registration and therefore causing the modeling system to fail. FAST appears to be faster only because it yields far fewer features, thus speeding up Optic Flow. Balancing speed against model accuracy, ORB appears to be an appropriate choice for modeling complicated environments.

V. CONCLUSION AND OUTLOOK

We presented an approach for indoor SLAM using RGB-D sensors. Our approach extracts visual keypoints from the color images and uses the depth images to localize the keypoints in 3D. We use RANSAC to robustly estimate the transformations between RGB-D frames and optimize the pose graph using non-linear optimization. Finally, we generate a volumetric 3D map of the environment that can be used for robot localization, navigation and path planning. We found that the ORB detector and descriptor are suitable for indoor SLAM considering speed and mapping accuracy. In the near future, we plan to develop or utilize a novel keypoint detector, and to optimize the keypoint matching algorithm by adding a feature dictionary and pruning features that are never matched.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under research project 61075075, and the National 863 Program of China under research project 2011AA040902.

REFERENCES

[1] S. Thrun, "Robotic mapping: A survey," Exploring Artificial Intelligence in the New Millennium, vol. 1, pp. 1–35, 2003.
[2] A. Nuchter, K. Lingemann, J. Hertzberg, and H. Surmann, "6D SLAM with approximate data association," in Proc. of the 12th Int. Conf. on Advanced Robotics (ICAR). IEEE, 2005, pp. 242–249.
[3] G. Grisetti, S. Grzonka, C. Stachniss, P. Pfaff, and W. Burgard, "Efficient estimation of accurate maximum likelihood maps in 3D," in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2007, pp. 3472–3478.
[4] F. Dellaert and M. Kaess, "Square root SAM," in Proc. of Robotics: Science and Systems (RSS), 2005, pp. 177–184.
[5] M. Kaess, A. Ranganathan, and F. Dellaert, "iSAM: Incremental smoothing and mapping," IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1365–1378, 2008.
[6] U. Frese, P. Larsson, and T. Duckett, "A multilevel relaxation algorithm for simultaneous localization and mapping," IEEE Transactions on Robotics, vol. 21, no. 2, pp. 196–207, 2005.
[7] E. Olson, J. Leonard, and S. Teller, "Fast iterative alignment of pose graphs with poor initial estimates," in Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA). IEEE, 2006, pp. 2262–2269.
[8] P. Besl and N. McKay, "A method for registration of 3-D shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, 1992.
[9] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in Proc. of the 3rd Int. Conf. on 3-D Digital Imaging and Modeling. IEEE, 2001, pp. 145–152.
[10] A. Segal, D. Haehnel, and S. Thrun, "Generalized-ICP," in Proc. of Robotics: Science and Systems (RSS), 2009.
[11] A. Davison, "Real-time simultaneous localisation and mapping with a single camera," in Proc. of the 9th IEEE Int. Conf. on Computer Vision, vol. 2. IEEE, 2003, pp. 1403–1410.
[12] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. of the 6th IEEE and ACM Int. Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2007, pp. 225–234.
[13] H. Strasdat, J. Montiel, and A. Davison, "Scale drift-aware large scale monocular SLAM," in Proc. of Robotics: Science and Systems (RSS), vol. 2, no. 3, 2010, p. 5.
[14] H. Jin, P. Favaro, and S. Soatto, "Real-time 3D motion and structure of point features: a front-end system for vision-based control and interaction," in Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 2. IEEE, 2000, pp. 778–779.
[15] D. Nister, "Preemptive RANSAC for live structure and motion estimation," Machine Vision and Applications, vol. 16, no. 5, pp. 321–329, 2005.
[16] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[17] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[18] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: an efficient alternative to SIFT or SURF," in Proc. of IEEE Int. Conf. on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.
[19] C. Wu, "SiftGPU: A GPU implementation of scale invariant feature transform (SIFT)," http://cs.unc.edu/~ccwu/siftgpu, 2007.
[20] K. Konolige, M. Agrawal, R. Bolles, C. Cowan, M. Fischler, and B. Gerkey, "Outdoor mapping and navigation using stereo vision," in Experimental Robotics. Springer, 2008, pp. 179–190.
[21] L. Paz, P. Pinies, J. Tardos, and J. Neira, "Large-scale 6-DOF SLAM with stereo-in-hand," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 946–957, 2008.
[22] N. Fioraio and K. Konolige, "Realtime visual and point cloud SLAM," in Proc. of the RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf. (RSS), 2011.
[23] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments," in Proc. of the 12th Int. Symposium on Experimental Robotics (ISER), 2010.
[24] M. I. A. Lourakis and A. A. Argyros, "SBA: A software package for generic sparse bundle adjustment," ACM Trans. Math. Softw., vol. 36, no. 1, pp. 2:1–2:30, Mar. 2009. [Online]. Available: http://doi.acm.org/10.1145/1486525.1486527
[25] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[26] K. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic systems," in Proc. of the ICRA 2010 Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation, vol. 2, 2010.
[27] A. Hornung, K. Wurm, and M. Bennewitz, "Humanoid robot localization in complex indoor environments," in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2010, pp. 1690–1695.
[28] Wikipedia, "Kinect," http://en.wikipedia.org/wiki/kinect, 2011.
[29] OpenKinect, "OpenKinect project," http://openkinect.org/wiki/Main_Page, 2011.
[30] N. Burrus, "Kinect calibration," http://nicolas.burrus.name/index.php/Research/KinectCalibration, 2011.
[31] S. Magnenat, "Imaging information," http://openkinect.org/wiki/Imaging_Information, 2011.
[32] J. Shi and C. Tomasi, "Good features to track," in Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR). IEEE, 1994, pp. 593–600.
[33] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," Computer Vision – ECCV 2006, pp. 430–443, 2006.
[34] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking," in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1508–1515.
[35] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of the 7th Int. Conf. on Artificial Intelligence, 1981, pp. 674–679.
[36] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in Computer Vision – ECCV 2010, ser. Lecture Notes in Computer Science, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Springer Berlin / Heidelberg, 2010, vol. 6314, pp. 778–792. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-15561-1_56
[37] M. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
