3D Indoor Environment Construction from Monocular Camera on
Quadricopter
1 Teng-Hsiang Yu (虞登翔), 2 Chiou-Shann Fuh (傅楸善),
3 Feng-Li Lian (連豊力)
1 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617
E-mail: [email protected]
2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 10617
E-mail: [email protected]
3 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617
E-mail: [email protected]
ABSTRACT
In this paper, we implement Structure from Motion (SfM) with a monocular camera mounted on a quadricopter to reconstruct 3D indoor environments. To achieve robust and efficient feature matching, we use the Features from Accelerated Segment Test (FAST) algorithm to detect corner features and describe them with Scale-Invariant Feature Transform (SIFT) descriptors; the correspondences are then refined with the RANdom SAmple Consensus (RANSAC) algorithm so that the matching pairs satisfy the epipolar constraint between two images. Moreover, we estimate the camera pose from the matched features and the camera matrix, and solve the triangulation problem with an iterative least-squares method to reconstruct the scene points of the indoor environment.
Keywords — AR.Drone, structure from motion, monocular camera.
1. INTRODUCTION
In recent years, many applications of autonomous vehicles have been developed for transportation, exploration, and surveillance tasks. As Unmanned Aerial Vehicles (UAVs) become common, their additional degrees of freedom compared with unmanned ground vehicles (UGVs) give them more flexibility to cope with difficult environments, such as deep mud or a stairwell. Moreover, the quadricopter has vertical takeoff-and-landing, hovering, and maneuvering capabilities, which make it suitable for indoor environments, and with four rotors it is more stable than a helicopter.
One essential ability of an autonomous vehicle is to know its position and movement in the environment. Localization usually suffers from accumulated mechanical errors, because the measurements of the odometer, gyroscope, or accelerometer on the vehicle are inaccurate. Also, when the vehicle has no prior information about the environment, it needs to learn the environment by itself. This ability is considered one of the important and fundamental problems of robotics, addressed by SLAM (Simultaneous Localization and Mapping), visual odometry [12], and so on.
Various sensors, such as laser range finders, are used to acquire depth information and construct maps. The drawback of a laser range finder is that it can only scan a 2-D plane at a specific altitude; however, a 3-D map is more intuitive to human beings, so stereo cameras and RGB-D cameras are commonly used in recent research. In particular, RGB-D SLAM [4] has become popular following well-developed RGB-D cameras (e.g. Microsoft Kinect and Asus Xtion). But these sensors can only be used with a wire-connected computer rather than over wireless communication. Thus, a monocular camera, which transmits less information, is a good choice for a UAV.
The goal of this study is to solve the two-frame SfM problem between time steps to reconstruct a 3-D indoor environment with a vision-based quadricopter, using four processing steps. First, we calibrate the camera with a chessboard. Then, in order to find corresponding points between pairs of frames, we assume the indoor environment is filled with colorful features rather than pure white walls. Based on the matching correspondences from the previous images, we estimate the relative camera pose through epipolar geometry. Finally, the scene points are evaluated by triangulation, which solves for the crossing point of two rays re-projected from the images by the camera matrices.
The remainder of this paper is organized as follows. Section 2 introduces the hardware of the quadricopter. The detection and refined matching of corresponding points are described in Section 3, and the triangulation problem is solved in Section 4. Section 5 presents the experimental results, and Section 6 concludes the paper.
2. HARDWARE CONFIGURATION OF THE
VISION SYSTEM
2.1. Visual Sensor and System Configuration
We use the low-cost AR.Drone 2.0 quadricopter made by Parrot Inc., shown in Figure 1 (a), as a test platform. The UAV is equipped with a camera that has HD 720p resolution and a 92° field of view. Communication between the quadricopter and the computer is over Wi-Fi with a range of about 50 m, which is sufficient for indoor environments.
The flowchart of the proposed vision system is shown in Figure 2. The main stages of the proposed 3-D indoor environment mapping, which solves the SfM problem, are: (1) find the corresponding points between image pairs; (2) estimate the camera pose and compare it with the pose evaluated from the IMU; (3) two-view SfM.
(a) (b)
Figure 1 (a) AR.Drone 2.0. (b) Wide-angle camera.
Figure 2 Configuration of the overall vision system.
2.2. Camera Calibration
Before acquiring images, lens distortion, as in Figure 1 (b), can enlarge the re-projection error that will be discussed in Section 4. The other main purpose of calibrating the camera, illustrated in Figure 3, is that the extrinsic camera parameters transform object coordinates to camera coordinates, while the intrinsic parameters transform camera coordinates to pixel coordinates. Therefore, we use the 2-D plane-based calibration method [14] to obtain undistorted images and the camera intrinsic parameters. This method requires observing a sequence of planar chessboard images as in Figure 4, and then minimizing the following function:
$$\sum_{i=1}^{n} \sum_{j=1}^{m} \left\| m_{ij} - \hat{m}(A, R_i, t_i, M_j) \right\|^2 ,$$
which can be considered an index of performance. The error estimated from the images in Figure 4 is approximately 0.2, and errors from 0.1 to 0.5 are generally acceptable.
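To make this calibration step concrete, the sketch below uses OpenCV's implementation of the plane-based method [14]. It is a minimal example under assumed settings: the image filename pattern, the 9x6 inner-corner board size, and the undistorted test frame are not values taken from the paper.

import glob
import cv2
import numpy as np

board_size = (9, 6)                      # inner corners per chessboard row/column (assumed)
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("chessboard_*.png"):      # the calibration views (e.g. Figure 4)
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the RMS re-projection error (comparable to the ~0.2 figure quoted above),
# the intrinsic matrix A, the distortion coefficients, and per-view extrinsics.
rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
undistorted = cv2.undistort(cv2.imread("frame.png"), A, dist)   # "frame.png" is a placeholder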
Figure 3 The coordinate transformed by the camera
matrix.
Figure 4 There are 15 different scenes of the
chessboard.
3. ESTIMATING CORRESPONDING FEATURES
3.1. Feature Detection and Classification
In order to estimate the camera pose, we first need to find the corresponding points between two frames. Following [13], Wagner et al. use a combination of the Features from Accelerated Segment Test (FAST) corner detector [11] and the Scale-Invariant Feature Transform (SIFT) descriptor [10] to make matching faster and more robust.
The FAST algorithm, known as one of the fastest corner detectors, considers a circle of 16 pixels around a corner candidate p: if there is a set of n contiguous pixels in the circle that are all brighter or darker than the intensity of p by more than a threshold t, then p is accepted as a corner feature. As illustrated in Figure 5, n is chosen to be 12 in order to admit a high-speed test; if the intensities of pixels 1 to 12 are brighter than the pixel p by more than the threshold, then p is a corner feature.
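As a concrete illustration, the sketch below runs OpenCV's FAST detector on one frame. Note that OpenCV's detector uses the 9-of-16 segment test rather than the 12-point test described above; the threshold value and the filename are assumptions.

import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # placeholder filename
fast = cv2.FastFeatureDetector_create(
    threshold=20,                                        # assumed value of t
    nonmaxSuppression=True,
    type=cv2.FastFeatureDetector_TYPE_9_16)              # 16-pixel circle as in Figure 5
keypoints = fast.detect(img, None)
print(len(keypoints), "FAST corners detected")           # cf. the 1156 corners in Figure 6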
Figure 5 Twelve-point segment-test corner detection in part of an image. The pixel p is the corner candidate, and the numbered squares are the pixels used in corner detection.
(a) (b)
Figure 6 For this image (a), 1156 feature points are
calculated by the 12-FAST corner detector shown in
(b).
(a) Gradient of image. (b) Feature point descriptor.
Figure 7 (a) Descriptor is created by computing the
gradient magnitude and orientation at every pixel in
four regions around the feature point. (b) The gradient
information of each pixel is then accumulated into
orientation histograms summarizing the content over
4x4 subregions.
After detecting corner feature points, we compute a SIFT descriptor for each feature by estimating the local image gradients at the selected scale in the region around it. Figure 7 illustrates the estimation of the feature point descriptor. Around the location of a feature point, the image gradient magnitudes and orientations are sampled, and the coordinates and gradient orientations are rotated relative to the feature point orientation. Next, we match the two sets of corner feature points with SIFT descriptors by the 2-nearest-neighbor algorithm [2], keeping a match only if the ratio of the two distances is less than a threshold. Finally, we check whether the matching points between the image pair form a one-to-one mapping to ensure that the matching is correct.
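The sketch below illustrates this combination with OpenCV: FAST corners, SIFT descriptors computed at those corners, 2-nearest-neighbour matching with a ratio test, and a simple one-to-one check. The filenames, the FAST threshold, and the 0.8 ratio are assumptions, not values from the paper.

import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

fast = cv2.FastFeatureDetector_create(threshold=20)
sift = cv2.SIFT_create()

kp1 = fast.detect(img1, None)
kp2 = fast.detect(img2, None)
kp1, des1 = sift.compute(img1, kp1)                      # SIFT descriptors at FAST corners
kp2, des2 = sift.compute(img2, kp2)

matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)                  # two nearest neighbours per feature

good = []
for pair in knn:                                         # ratio test on the two distances
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        good.append(pair[0])

seen, unique = set(), []                                 # one-to-one check on target keypoints
for m in good:
    if m.trainIdx not in seen:
        seen.add(m.trainIdx)
        unique.append(m)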
3.2. Refine Matching and Pose Estimation
The first step is to prevent outliers from affecting the refined matching. Thus, we use the RANSAC algorithm [5], because it can estimate the confidence of each sample point with respect to the model, as shown in Figure 8. In addition, we compute a candidate fundamental matrix using the eight-point algorithm [6] and refine it during each RANSAC iteration. Figure 9 shows the 635 inliers obtained by these methods.
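Continuing the matching sketch above (it reuses kp1, kp2, unique, and the intrinsic matrix A from the calibration sketch), the fragment below estimates a fundamental matrix with RANSAC so that only matches consistent with the epipolar constraint are kept, and then recovers a relative pose. The thresholds are assumptions, and OpenCV's FM_RANSAC is used in place of a hand-written 8-point RANSAC loop.

import cv2
import numpy as np

pts1 = np.float32([kp1[m.queryIdx].pt for m in unique])  # matched pixel coordinates
pts2 = np.float32([kp2[m.trainIdx].pt for m in unique])

# 1.0 px reprojection threshold and 0.99 confidence are assumed values.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers1 = pts1[mask.ravel() == 1]
inliers2 = pts2[mask.ravel() == 1]
print(len(inliers1), "inlier matches")                   # cf. the 635 inliers in Figure 9

# With the intrinsics A from calibration, the relative pose can be recovered
# through the essential matrix.
E, _ = cv2.findEssentialMat(inliers1, inliers2, A, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, inliers1, inliers2, A)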
Figure 8 The Matlab toolbox program [9] randomly generates 500 points, of which 300 are inliers of the black-line model. The RANSAC algorithm finds 303 inliers, and the estimated model is shown as the red line.
(a)
(b)
Figure 9 (a) The input image pair. (b) Features matched by the RANSAC algorithm and the eight-point algorithm.
4. TRIANGULATION PROBLEM
In the triangulation problem, the camera matrices are used to re-project the corresponding points into rays, and their intersection gives the position of the corresponding point in the real 3D scene, as in Figure 10.
Figure 10 The triangulation problem: finding the intersection point X of the two rays.
Figure 11 The noise makes the rays miss crossing.
Unfortunately, a digital camera always has some digitization noise, so the projection of a 3D scene point onto the image plane may not be exactly where we expect; as a result, the re-projected rays of corresponding image pixels may not cross, as shown in Figure 11.
Table 1 Relative execution time of several triangulation methods.
Method            Relative time
Poly              14
Linear-Eigen      3
Iterative-Eigen   5
Midpoint          2
Poly-Abs          30
Linear-LS         2
Iterative-LS      3
Figure 12 The error between the perpendicular
projection and the distorted projection.
In order to solve this problem, several methods [7] are available. From Table 1 we can observe that the Linear-Eigen, Midpoint, Linear-LS, and Iterative-LS methods are faster, and among them the Iterative-LS method is more accurate. Thus, in order to solve the SfM problem in real time, we implement the Iterative-LS method for the triangulation problem.
$$d(u, \hat{u})^2 + d(u', \hat{u}')^2 \qquad (1)$$
First, following [7], we define the cost function in Eq. (1), which represents the error between the perpendicular projection and the distorted projection, where $d(\cdot,\cdot)$ denotes the Euclidean distance. The main task is then to minimize this cost function.
$$u = PX \qquad (2)$$
Then we recall the projection equation of the camera matrix, Eq. (2), where u is the image point in homogeneous coordinates $(u, v, 1)^T$, P is the camera matrix, and X is the scene point in the real world, also in homogeneous coordinates $(x, y, z, 1)^T$. In fact, the point in the image plane carries an unknown scale factor w, so that u is actually $w(u, v, 1)^T$. Moreover, we define $P_i^T$ as the i-th row of the camera matrix P. Then Eq. (2) becomes

$$wu = P_1^T X, \quad wv = P_2^T X, \quad w = P_3^T X .$$
Substituting the scale factor w from the last equation, we get
$$u P_3^T X = P_1^T X, \quad v P_3^T X = P_2^T X \qquad (3)$$
After obtaining Eq. (3) from the two images, the four equations can be written in the form

$$AX = 0 .$$

The homogeneous vector X has four components but only three unknowns, while A provides four equations, so the system is over-determined and we can only compute a least-squares solution.
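A minimal sketch of this linear solve is given below, assuming two 3x4 camera matrices P1, P2 and one matched pixel pair; fixing the last coordinate of X to 1 turns $AX = 0$ into an ordinary least-squares problem (the Linear-LS variant of [7]).

import numpy as np

def linear_ls_triangulate(P1, P2, uv1, uv2):
    """P1, P2: 3x4 camera matrices; uv1, uv2: matched pixel coordinates (u, v)."""
    u1, v1 = uv1
    u2, v2 = uv2
    # Each image contributes two rows of the form u*P3^T - P1^T (Eq. (3)).
    rows = [u1 * P1[2] - P1[0],
            v1 * P1[2] - P1[1],
            u2 * P2[2] - P2[0],
            v2 * P2[2] - P2[1]]
    A = np.array(rows)                              # 4x4, acting on homogeneous X
    # Fix the scale by setting X = (x, y, z, 1): A[:, :3] x = -A[:, 3].
    x, *_ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
    return np.append(x, 1.0)                        # homogeneous scene point X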
Figure 13 Adaptively update the weighting function.
However, the least-squares solution of $AX = 0$ does not directly correspond to the cost function. The algebraic residual is $\varepsilon = u P_3^T X - P_1^T X$, but what we actually want to minimize is the difference between $u$ and $\hat{u}$, which is given by $P_1^T X / P_3^T X$. This means that each error equation should be weighted by the factor $1/w$, where $w = P_3^T X$.
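A self-contained sketch of the Iterative-LS scheme described above: the linear system is re-solved with each image's rows divided by the current estimate of $w = P_3^T X$, so the algebraic residual approximates the image-space error of Eq. (1). The iteration count and tolerance are assumptions.

import numpy as np

def iterative_ls_triangulate(P1, P2, uv1, uv2, n_iter=10, eps=1e-8):
    w1 = w2 = 1.0                                   # initial weights: plain linear solve
    X = np.array([0.0, 0.0, 0.0, 1.0])
    for _ in range(n_iter):
        rows = [(uv1[0] * P1[2] - P1[0]) / w1,
                (uv1[1] * P1[2] - P1[1]) / w1,
                (uv2[0] * P2[2] - P2[0]) / w2,
                (uv2[1] * P2[2] - P2[1]) / w2]
        A = np.array(rows)
        x, *_ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
        X = np.append(x, 1.0)
        new_w1, new_w2 = P1[2] @ X, P2[2] @ X       # w = P3^T X for each camera
        if abs(new_w1 - w1) < eps and abs(new_w2 - w2) < eps:
            break                                   # weights converged
        w1, w2 = new_w1, new_w2
    return X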
5. EXPERIMENTAL RESULTS
5.1. Epipolar Geometry Test
According to [8], the epipole is the projection of the other camera center onto the current image plane. Moreover, we can confirm that the corresponding feature points all lie on their epipolar lines.
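A small sketch of this check, assuming F and the inlier coordinates come from the RANSAC step of Section 3.2: each correspondence should give $x'^T F x$ close to zero, and the epipolar lines themselves can be obtained with cv2.computeCorrespondEpilines.

import cv2
import numpy as np

def epipolar_residuals(F, pts1, pts2):
    """Return |x2^T F x1| for each matched pixel pair (Nx2 arrays)."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous pixel coordinates
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    return np.abs(np.sum(x2 * (x1 @ F.T), axis=1))

# Epipolar lines (a, b, c) in image 2, with ax + by + c = 0, for points of image 1.
lines2 = cv2.computeCorrespondEpilines(inliers1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)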
In the first experiment, we only move the UAV horizontally to the right along the edge of a table while capturing images. Camera pose 2 shows that the UAV moves approximately along the negative x-axis, although the scale of the result is not determined precisely. In the second experiment, we stick a black line on the floor and place a white box and the UAV along the line. We then move the UAV toward the box while capturing the images shown in Figure 15 (b). The epipolar lines are overlaid in both cases. Note that the motion of the UAV shifts the white box along the direction of the epipolar lines. Camera pose 2 in Figure 15 (d) matches our expectation.
(a) (b)
Camera Pose 1:
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
(c)
Camera Pose 2:
$$\begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 0.999 & 0 & 0 \\ 0 & 0 & 0.999 & 0 \end{bmatrix}$$
(d)
Figure 14 (a) The UAV shifts to the right; c and c' are the camera centers. (b) The relative coordinates. (c)(d) Epipolar lines of the images and the relative camera poses.
(a)
(b)
Camera Pose 1:
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
(c)
Camera Pose 2:
$$\begin{bmatrix} 0.999 & 0 & 0.003 & 0.016 \\ 0 & 0.999 & 0 & -0.002 \\ -0.003 & 0 & 0.999 & -0.999 \end{bmatrix}$$
(d)
Figure 15 (a) The UAV moves forward; c and c' are the camera centers, and e and e' are the epipoles in the images. (b) The original image. (c)(d) Epipolar lines of the images and the relative camera poses.
5.2. 3D Reconstruction
Following Figure 9 (a), finally, we use the iterative
linear least square method to solve the triangulation
problem. The pattern of the Tetra Pak can be seen easily
from Figure 16 (a). However, from the side view in
Figure 16 (b), we can observe that the depth of the point
cloud is inaccuracy.
(a)
(b)
Figure 16 3D reconstruction result. (a) Front view of
point cloud. (b) Side View of point cloud.
6. CONCLUSIONS AND FUTURE WORKS
In recent years, although several sensors have matured in robotics (e.g. the laser range finder and Microsoft Kinect), they can only be used with a wire-connected computer rather than over wireless communication. Thus, a monocular camera, which transmits less information, is a good choice in this scenario. Moreover, since 3D is more intuitive to human beings, we solve the SfM problem with the algorithms described in the previous sections to reconstruct the 3D environment and to localize the UAV. Although the result in Figure 16 is inaccurate, we may use local or global optimization algorithms (e.g. bundle adjustment) in the future to refine the 3D points. In order to manage the execution time on the UAV, we will also develop a strategy to implement real-time monocular visual odometry based on SfM.
REFERENCES
[1] G. Bebis, “Geometric Camera Parameters,”
http://www.cse.unr.edu/~bebis/CS791E/Notes/
CameraParameters.pdf, 2004.
[2] T. M. Cover, P. E. Hart, "Nearest neighbor
pattern classification," IEEE Transactions on
Information Theory, Vol. 13, No. 1, pp. 21–27,
1967.
[3] D. J. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher, “SfM with MRFs: Discrete-Continuous Optimization for Large-Scale Structure from Motion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, pp. 2841-2853, 2013.
[4] F. Endres, J. Hess, N. Engelhard, J. Sturm, D.
Cremers, and W. Burgard, “An Evaluation of
the RGB-D SLAM System,” IEEE
International Conference on Robotics and
Automation, Minnesota, USA, pp. 1691-1696,
2012.
[5] M. A. Fischler and R. C. Bolles, “Random
Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and
Automated Cartography,” Comm. of the ACM,
Vol. 24, No. 6, pp. 381-395, 1981.
[6] R. I. Hartley, “In Defense of the Eight-Point
Algorithm,” IEEE Transaction on Pattern
Analysis and Machine Intelligence, Vol. 19, No.
6, pp. 580-593, 1997.
[7] R. I. Hartley and P. Sturm, “Triangulation,”
Computer Vision and Image Understanding,
Vol. 68, No. 2, pp. 146-157, 1997.
[8] R. I. Hartley and A. Zisserman, Multiple View
Geometry in Computer Vision, 2nd ed.,
Cambridge University Press, Cambridge,
England, pp. 239-260, 2004.
[9] M. Kirill, G. Victor, K. Anton, and V. Vladimir,
GML RANSAC Matlab Toolbox. Available at
http://graphics.cs.msu.ru/en/science/research/m
achinelearning/ransactoolbox.
[10] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[11] E. Rosten and T. Drummond, “Machine Learning for High-Speed Corner Detection,” Proceedings of European Conference on Computer Vision, Graz, Austria, Vol. 3951, pp. 430-443, 2006.
[12] D. Scaramuzza and F. Fraundorfer, “Visual
Odometry [Tutorial],” IEEE Robotics and
Automation Magazine, Vol. 18, No. 4, pp. 80-92,
2011.
[13] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Pose Tracking from Natural Features on Mobile Phones,” Proceedings of International Symposium on Mixed and Augmented Reality (ISMAR), Cambridge, UK, pp. 125-134, 2008.
[14] Z. Zhang, “A Flexible New Technique for
Camera Calibration,” IEEE Trans. Pattern
Analysis and Machine Intelligence, Vol. 22, No.
11, pp. 1330-1334, 2000.