3D Indoor Environment Construction from Monocular Camera on
Quadricopter
1 Teng-Hsiang Yu (虞登翔), 2 Chiou-Shann Fuh (傅楸善),
3 Feng-Li Lian (連豊力)
1 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617
E-mail: [email protected]
2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 10617
E-mail: [email protected]
3 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617
E-mail: [email protected]
ABSTRACT
In this paper, we implement Structure from Motion (SfM) with a monocular camera mounted on a quadricopter to reconstruct 3D indoor environments. To achieve robust and efficient feature matching, we use the Features from Accelerated Segment Test (FAST) algorithm to detect corner features and describe them with Scale-Invariant Feature Transform (SIFT) descriptors; the correspondences are then refined with the RANdom SAmple Consensus (RANSAC) algorithm so that the matching pairs satisfy the epipolar constraint between two images. Moreover, we estimate the camera pose from the matched features and the camera matrix, and solve the triangulation problem with an iterative least-squares method to reconstruct the scene points of the indoor environment.
Keywords — AR.Drone, structure from motion, monocular camera.
1. INTRODUCTION
In recent years, many applications of autonomous vehicles have been developed for transportation, exploration, and surveillance tasks. As Unmanned Aerial Vehicles (UAVs) become common, their additional degrees of freedom compared with unmanned ground vehicles (UGVs) give them more flexibility to cope with difficult environments, such as deep mud or a stairwell. Moreover, the quadricopter has vertical takeoff-and-landing, hovering, and maneuvering capabilities, which make it suitable for indoor environments, and with four rotors it is more stable than a helicopter.
One essential ability of an autonomous vehicle is to know its position and movement in the environment. Localization usually suffers from accumulated mechanical errors, because the measurements of the odometer, gyroscope, or accelerometer on the vehicle are inaccurate. Also, when the vehicle has no prior information about the environment, it needs to learn the environment by itself. This ability is considered one of the important and fundamental problems of robotics, addressed by SLAM (Simultaneous Localization and Mapping), visual odometry [12], and so on.
Various sensors, such as laser range finders, are used to acquire depth information and construct maps. The drawback of a laser range finder is that it can only scan a 2-D plane at a specific altitude; however, a 3-D map is more intuitive to human beings, so stereo cameras and RGB-D cameras are commonly used in recent research. In particular, RGB-D SLAM [4] has become popular following well-developed RGB-D cameras (e.g. Microsoft Kinect and Asus Xtion). But these sensors can only be used with a wire-connected computer rather than over wireless communication. Thus, a monocular camera, which transmits less information, is a good choice for a UAV.
The goal of this study is to solve the two-frame SfM problem between time steps to reconstruct a 3-D indoor environment with a vision-based quadricopter, using four processing steps. First, we calibrate the camera with a chessboard. Then, in order to find corresponding points between pairs of frames, we assume the indoor environment is filled with colorful features rather than pure white walls. Based on the matching correspondences from the previous images, we estimate the relative camera pose through epipolar geometry. Finally, the scene points are evaluated by triangulation, which solves for the crossing point of two rays re-projected from the images by the camera matrices.
The remainder of this paper is organized as follows. Section 2 introduces the hardware of the quadricopter. The detection and refined matching of corresponding points are described in Section 3, and the triangulation problem is solved in Section 4. Section 5 presents the experimental results, and Section 6 concludes the paper.
2. HARDWARE CONFIGURATION OF THE
VISION SYSTEM
2.1. Visual Sensor and System Configuration
We use the low-cost AR.Drone 2.0 quadricopter made by Parrot Inc., shown in Figure 1 (a), as a test platform. The UAV is equipped with a camera that has HD 720p resolution and a 92° field of view. Communication between the quadricopter and the computer is over Wi-Fi with a range of about 50 m, which is sufficient for indoor environments.
The flowchart of the proposed vision system is shown in Figure 2. The main stages of the proposed 3-D indoor environment mapping, which solves the SfM problem, are: (1) find the corresponding points between image pairs; (2) estimate the camera pose and compare it with the pose evaluated from the IMU; (3) two-view SfM.
(a) (b)
Figure 1 (a) AR.Drone 2.0. (b) Wide-angle camera.
Figure 2 Configuration of the overall vision system.
2.2. Camera Calibration
Before acquiring images, lens distortion, as in Figure 1 (b), can enlarge the re-projection error that will be discussed in Section 4. The other main purpose of calibrating the camera, illustrated in Figure 3, is that the extrinsic camera parameters transform object coordinates to camera coordinates, while the intrinsic parameters transform camera coordinates to pixel coordinates. Therefore, we use the 2-D plane-based calibration method [14] to obtain undistorted images and the camera intrinsic parameters. This method requires observing a sequence of planar chessboard images as in Figure 4, and then minimizing the following function:
$$\sum_{i=1}^{n} \sum_{j=1}^{m} \left\| m_{ij} - \hat{m}(A, R_i, t_i, M_j) \right\|^2 ,$$
which can be considered an index of performance. The error estimated from the images in Figure 4 is approximately 0.2, and errors from 0.1 to 0.5 are generally acceptable.
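To make this calibration step concrete, the sketch below uses OpenCV's implementation of the plane-based method [14]. It is a minimal example under assumed settings: the image filename pattern, the 9x6 inner-corner board size, and the undistorted test frame are not values taken from the paper.

import glob
import cv2
import numpy as np

board_size = (9, 6)                      # inner corners per chessboard row/column (assumed)
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("chessboard_*.png"):      # the calibration views (e.g. Figure 4)
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the RMS re-projection error (comparable to the ~0.2 figure quoted above),
# the intrinsic matrix A, the distortion coefficients, and per-view extrinsics.
rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
undistorted = cv2.undistort(cv2.imread("frame.png"), A, dist)   # "frame.png" is a placeholder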
Figure 3 The coordinate transformed by the camera
matrix.
Figure 4 There are 15 different scenes of the
chessboard.
3. ESTIMATING CORRESPONDING FEATURES
3.1. Feature Detection and Classification
In order to estimate the camera pose, we first need to find the corresponding points between two frames. Following [13], Wagner et al. use a combination of the Features from Accelerated Segment Test (FAST) corner detector [11] and the Scale-Invariant Feature Transform (SIFT) descriptor [10] to make matching faster and more robust.
The FAST algorithm, known as one of the fastest corner detectors, considers a circle of 16 pixels around a corner candidate p: if there is a set of n contiguous pixels in the circle that are all brighter or darker than the intensity of p by more than a threshold t, then p is accepted as a corner feature. As illustrated in Figure 5, n is chosen to be 12 in order to admit a high-speed test; if the intensities of pixels 1 to 12 are brighter than the pixel p by more than the threshold, then p is a corner feature.
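As a concrete illustration, the sketch below runs OpenCV's FAST detector on one frame. Note that OpenCV's detector uses the 9-of-16 segment test rather than the 12-point test described above; the threshold value and the filename are assumptions.

import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # placeholder filename
fast = cv2.FastFeatureDetector_create(
    threshold=20,                                        # assumed value of t
    nonmaxSuppression=True,
    type=cv2.FastFeatureDetector_TYPE_9_16)              # 16-pixel circle as in Figure 5
keypoints = fast.detect(img, None)
print(len(keypoints), "FAST corners detected")           # cf. the 1156 corners in Figure 6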
Figure 5 Twelve-point segment-test corner detection in part of an image. The pixel p is the corner candidate, and the numbered squares are the pixels used in corner detection.
(a) (b)
Figure 6 For this image (a), 1156 feature points are
calculated by the 12-FAST corner detector shown in
(b).
(a) Gradient of image. (b) Feature point descriptor.
Figure 7 (a) Descriptor is created by computing the
gradient magnitude and orientation at every pixel in
four regions around the feature point. (b) The gradient
information of each pixel is then accumulated into
orientation histograms summarizing the content over
4x4 subregions.
After detecting corner feature points, we compute a SIFT descriptor for each feature by estimating the local image gradients at the selected scale in the region around it. Figure 7 illustrates the estimation of the feature point descriptor. Around the location of a feature point, the image gradient magnitudes and orientations are sampled, and the coordinates and gradient orientations are rotated relative to the feature point orientation. Next, we match the two sets of corner feature points with SIFT descriptors by the 2-nearest-neighbor algorithm [2], keeping a match only if the ratio of the two distances is less than a threshold. Finally, we check whether the matching points between the image pair form a one-to-one mapping to ensure that the matching is correct.
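The sketch below illustrates this combination with OpenCV: FAST corners, SIFT descriptors computed at those corners, 2-nearest-neighbour matching with a ratio test, and a simple one-to-one check. The filenames, the FAST threshold, and the 0.8 ratio are assumptions, not values from the paper.

import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

fast = cv2.FastFeatureDetector_create(threshold=20)
sift = cv2.SIFT_create()

kp1 = fast.detect(img1, None)
kp2 = fast.detect(img2, None)
kp1, des1 = sift.compute(img1, kp1)                      # SIFT descriptors at FAST corners
kp2, des2 = sift.compute(img2, kp2)

matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)                  # two nearest neighbours per feature

good = []
for pair in knn:                                         # ratio test on the two distances
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        good.append(pair[0])

seen, unique = set(), []                                 # one-to-one check on target keypoints
for m in good:
    if m.trainIdx not in seen:
        seen.add(m.trainIdx)
        unique.append(m)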
3.2. Refine Matching and Pose Estimation
The first step is to prevent outliers from affecting the refined matching. Thus, we use the RANSAC algorithm [5], because it can estimate the confidence of each sample point with respect to the model, as shown in Figure 8. In addition, we compute a candidate fundamental matrix using the eight-point algorithm [6] and refine it during each RANSAC iteration. Figure 9 shows the 635 inliers obtained by these methods.
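Continuing the matching sketch above (it reuses kp1, kp2, unique, and the intrinsic matrix A from the calibration sketch), the fragment below estimates a fundamental matrix with RANSAC so that only matches consistent with the epipolar constraint are kept, and then recovers a relative pose. The thresholds are assumptions, and OpenCV's FM_RANSAC is used in place of a hand-written 8-point RANSAC loop.

import cv2
import numpy as np

pts1 = np.float32([kp1[m.queryIdx].pt for m in unique])  # matched pixel coordinates
pts2 = np.float32([kp2[m.trainIdx].pt for m in unique])

# 1.0 px reprojection threshold and 0.99 confidence are assumed values.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers1 = pts1[mask.ravel() == 1]
inliers2 = pts2[mask.ravel() == 1]
print(len(inliers1), "inlier matches")                   # cf. the 635 inliers in Figure 9

# With the intrinsics A from calibration, the relative pose can be recovered
# through the essential matrix.
E, _ = cv2.findEssentialMat(inliers1, inliers2, A, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, inliers1, inliers2, A)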
Figure 8 The Matlab toolbox program [9] randomly generates 500 points, of which 300 are inliers of the black-line model. The RANSAC algorithm finds 303 inliers, and the estimated model is shown as the red line.
(a)
(b)
Figure 9 (a) The input image pair. (b) Features matched by the RANSAC algorithm and the eight-point algorithm.
4. TRIANGULATION PROBLEM
In the triangulation problem, the camera matrices are used to re-project the corresponding points into rays, and their intersection gives the position of the corresponding point in the real 3D scene, as in Figure 10.
Figure 10 The triangulation problem: finding the intersection point X of the two rays.
Figure 11 The noise makes the rays miss crossing.
Unfortunately, a digital camera always has some digitization noise, so the projection of a 3D scene point onto the image plane may not be exactly where we expect; as a result, the re-projected rays of corresponding image pixels may not cross, as shown in Figure 11.
Table 1 Relative execution time of several triangulation methods.
Method            Relative time
Poly              14
Linear-Eigen      3
Iterative-Eigen   5
Midpoint          2
Poly-Abs          30
Linear-LS         2
Iterative-LS      3
Figure 12 The error between the perpendicular
projection and the distorted projection.
In order to solve this problem, several methods [7] are available. From Table 1 we can observe that the Linear-Eigen, Midpoint, Linear-LS, and Iterative-LS methods are faster, and among them the Iterative-LS method is more accurate. Thus, in order to solve the SfM problem in real time, we implement the Iterative-LS method for the triangulation problem.
$$d(u, \hat{u})^2 + d(u', \hat{u}')^2 \qquad (1)$$
First, following [7], we define the cost function in Eq. (1), which represents the error between the perpendicular projection and the distorted projection, where $d(\cdot,\cdot)$ denotes the Euclidean distance. The main task is then to minimize this cost function.
$$u = PX \qquad (2)$$
Then we recall the projection equation of the camera matrix, Eq. (2), where u is the image point in homogeneous coordinates $(u, v, 1)^T$, P is the camera matrix, and X is the scene point in the real world, also in homogeneous coordinates $(x, y, z, 1)^T$. In fact, the point in the image plane carries an unknown scale factor w, so that u is actually $w(u, v, 1)^T$. Moreover, we define $P_i^T$ as the i-th row of the camera matrix P. Then Eq. (2) becomes

$$wu = P_1^T X, \quad wv = P_2^T X, \quad w = P_3^T X .$$
Substituting the scale factor w from the last equation, we get
$$u P_3^T X = P_1^T X, \quad v P_3^T X = P_2^T X \qquad (3)$$
After obtaining Eq. (3) from the two images, the four equations can be written in the form

$$AX = 0 .$$

The homogeneous vector X has four components but only three unknowns, while A provides four equations, so the system is over-determined and we can only compute a least-squares solution.
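A minimal sketch of this linear solve is given below, assuming two 3x4 camera matrices P1, P2 and one matched pixel pair; fixing the last coordinate of X to 1 turns $AX = 0$ into an ordinary least-squares problem (the Linear-LS variant of [7]).

import numpy as np

def linear_ls_triangulate(P1, P2, uv1, uv2):
    """P1, P2: 3x4 camera matrices; uv1, uv2: matched pixel coordinates (u, v)."""
    u1, v1 = uv1
    u2, v2 = uv2
    # Each image contributes two rows of the form u*P3^T - P1^T (Eq. (3)).
    rows = [u1 * P1[2] - P1[0],
            v1 * P1[2] - P1[1],
            u2 * P2[2] - P2[0],
            v2 * P2[2] - P2[1]]
    A = np.array(rows)                              # 4x4, acting on homogeneous X
    # Fix the scale by setting X = (x, y, z, 1): A[:, :3] x = -A[:, 3].
    x, *_ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
    return np.append(x, 1.0)                        # homogeneous scene point X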
Figure 13 Adaptively update the weighting function.
However, the least-squares solution of $AX = 0$ does not directly correspond to the cost function. The algebraic residual is $\varepsilon = u P_3^T X - P_1^T X$, but what we actually want to minimize is the difference between $u$ and $\hat{u}$, which is given by $P_1^T X / P_3^T X$. This means that each error equation should be weighted by the factor $1/w$, where $w = P_3^T X$.
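A self-contained sketch of the Iterative-LS scheme described above: the linear system is re-solved with each image's rows divided by the current estimate of $w = P_3^T X$, so the algebraic residual approximates the image-space error of Eq. (1). The iteration count and tolerance are assumptions.

import numpy as np

def iterative_ls_triangulate(P1, P2, uv1, uv2, n_iter=10, eps=1e-8):
    w1 = w2 = 1.0                                   # initial weights: plain linear solve
    X = np.array([0.0, 0.0, 0.0, 1.0])
    for _ in range(n_iter):
        rows = [(uv1[0] * P1[2] - P1[0]) / w1,
                (uv1[1] * P1[2] - P1[1]) / w1,
                (uv2[0] * P2[2] - P2[0]) / w2,
                (uv2[1] * P2[2] - P2[1]) / w2]
        A = np.array(rows)
        x, *_ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
        X = np.append(x, 1.0)
        new_w1, new_w2 = P1[2] @ X, P2[2] @ X       # w = P3^T X for each camera
        if abs(new_w1 - w1) < eps and abs(new_w2 - w2) < eps:
            break                                   # weights converged
        w1, w2 = new_w1, new_w2
    return X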
5. EXPERIMENTAL RESULTS
5.1. Epipolar Geometry Test
According to [8], the epipole is the projection of the other camera center onto the current image plane. Moreover, we can confirm that the corresponding feature points all lie on their epipolar lines.
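A small sketch of this check, assuming F and the inlier coordinates come from the RANSAC step of Section 3.2: each correspondence should give $x'^T F x$ close to zero, and the epipolar lines themselves can be obtained with cv2.computeCorrespondEpilines.

import cv2
import numpy as np

def epipolar_residuals(F, pts1, pts2):
    """Return |x2^T F x1| for each matched pixel pair (Nx2 arrays)."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous pixel coordinates
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    return np.abs(np.sum(x2 * (x1 @ F.T), axis=1))

# Epipolar lines (a, b, c) in image 2, with ax + by + c = 0, for points of image 1.
lines2 = cv2.computeCorrespondEpilines(inliers1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)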
In the first experiment, we only move the UAV horizontally to the right along the edge of a table while capturing images. Camera pose 2 shows that the UAV moves approximately along the negative x-axis, although the scale of the result is not determined precisely. In the second experiment, we stick a black line on the floor and place a white box and the UAV along the line. We then move the UAV toward the box while capturing the images shown in Figure 15 (b). The epipolar lines are overlaid in both cases. Note that the motion of the UAV shifts the white box along the direction of the epipolar lines. Camera pose 2 in Figure 15 (d) matches our expectation.
(a) (b)
Camera Pose 1:
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
(c)
Camera Pose 2:
$$\begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 0.999 & 0 & 0 \\ 0 & 0 & 0.999 & 0 \end{bmatrix}$$
(d)
Figure 14 (a) The UAV shifts to the right; c and c' are the camera centers. (b) The relative coordinates. (c)(d) Epipolar lines of the images and the relative camera poses.
(a)
(b)
Camera Pose 1:
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
(c)
Camera Pose 2:
$$\begin{bmatrix} 0.999 & 0 & 0.003 & 0.016 \\ 0 & 0.999 & 0 & -0.002 \\ -0.003 & 0 & 0.999 & -0.999 \end{bmatrix}$$
(d)
Figure 15 (a) The UAV moves forward; c and c' are the camera centers, and e and e' are the epipoles in the images. (b) The original image. (c)(d) Epipolar lines of the images and the relative camera poses.
5.2. 3D Reconstruction
Following Figure 9 (a), finally, we use the iterative
linear least square method to solve the triangulation
problem. The pattern of the Tetra Pak can be seen easily
from Figure 16 (a). However, from the side view in
Figure 16 (b), we can observe that the depth of the point
cloud is inaccuracy.
(a)
(b)
Figure 16 3D reconstruction result. (a) Front view of
point cloud. (b) Side View of point cloud.
6. CONCLUSIONS AND FUTURE WORKS
In recent years, although several sensors have matured in robotics (e.g. the laser range finder and Microsoft Kinect), they can only be used with a wire-connected computer rather than over wireless communication. Thus, a monocular camera, which transmits less information, is a good choice in this scenario. Moreover, since 3D is more intuitive to human beings, we solve the SfM problem with the algorithms described in the previous sections to reconstruct the 3D environment and to localize the UAV. Although the result in Figure 16 is inaccurate, we may use local or global optimization algorithms (e.g. bundle adjustment) in the future to refine the 3D points. In order to manage the execution time on the UAV, we will also develop a strategy to implement real-time monocular visual odometry based on SfM.
REFERENCES
[1] G. Bebis, “Geometric Camera Parameters,”
http://www.cse.unr.edu/~bebis/CS791E/Notes/
CameraParameters.pdf, 2004.
[2] T. M. Cover, P. E. Hart, "Nearest neighbor
pattern classification," IEEE Transactions on
Information Theory, Vol. 13, No. 1, pp. 21–27,
1967.
[3] D. J. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher, “SfM with MRFs: Discrete-Continuous Optimization for Large-Scale Structure from Motion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, pp. 2841-2853, 2013.
[4] F. Endres, J. Hess, N. Engelhard, J. Sturm, D.
Cremers, and W. Burgard, “An Evaluation of
the RGB-D SLAM System,” IEEE
International Conference on Robotics and
Automation, Minnesota, USA, pp. 1691-1696,
2012.
[5] M. A. Fischler and R. C. Bolles, “Random
Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and
Automated Cartography,” Comm. of the ACM,
Vol. 24, No. 6, pp. 381-395, 1981.
[6] R. I. Hartley, “In Defense of the Eight-Point
Algorithm,” IEEE Transaction on Pattern
Analysis and Machine Intelligence, Vol. 19, No.
6, pp. 580-593, 1997.
[7] R. I. Hartley and P. Sturm, “Triangulation,”
Computer Vision and Image Understanding,
Vol. 68, No. 2, pp. 146-157, 1997.
[8] R. I. Hartley and A. Zisserman, Multiple View
Geometry in Computer Vision, 2nd ed.,
Cambridge University Press, Cambridge,
England, pp. 239-260, 2004.
[9] M. Kirill, G. Victor, K. Anton, and V. Vladimir,
GML RANSAC Matlab Toolbox. Available at
http://graphics.cs.msu.ru/en/science/research/m
achinelearning/ransactoolbox.
[10] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[11] E. Rosten and T. Drummond, “Machine Learning for High-Speed Corner Detection,” Proceedings of European Conference on Computer Vision, Graz, Austria, Vol. 3951, pp. 430-443, 2006.
[12] D. Scaramuzza and F. Fraundorfer, “Visual
Odometry [Tutorial],” IEEE Robotics and
Automation Magazine, Vol. 18, No. 4, pp. 80-92,
2011.
[13] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Pose Tracking from Natural Features on Mobile Phones,” Proceedings of International Symposium on Mixed and Augmented Reality (ISMAR), Cambridge, UK, pp. 125-134, 2008.
[14] Z. Zhang, “A Flexible New Technique for
Camera Calibration,” IEEE Trans. Pattern
Analysis and Machine Intelligence, Vol. 22, No.
11, pp. 1330-1334, 2000.