tranngoc
From 2D to 3D: Monocular Vision
With application to robotics/AR
Motivation
How many sensors do we really need?
Motivation
● What is the limit of what can be inferred from a single embodied (moving) camera frame?
Aim
● AR with a hand-held camera
● Visual tracking provides registration
● Track without prior model of world
● Challenges
  ● Speed
  ● Accuracy
  ● Robustness
  ● Interaction with real world
Existing attempts: SLAM
● Simultaneous Localization and Mapping
● Well-established in robotics (using a rich array of sensors)
● Demonstrated with a single hand-held camera by Davison 2003
Model-based tracking vs SLAM
● Model-based tracking is
  ● More robust
  ● More accurate
● Why?
  ● Is SLAM fundamentally harder?
Pinhole camera model
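$$(X, Y, Z) \mapsto (fX/Z,\; fY/Z)$$
$$\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} fX \\ fY \\ Z \end{pmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \qquad x = PX$$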
Pinhole camera model
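$$\begin{pmatrix} fX + Zp_x \\ fY + Zp_y \\ Z \end{pmatrix} = \begin{bmatrix} f & 0 & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
$$K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix} \quad \text{(calibration matrix)}, \qquad P = K[I \mid 0]$$
principal point: $(p_x, p_y)$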
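As a minimal sketch of the pinhole model with a principal point offset, the snippet below builds K and projects one point with P = K[I | 0]. The focal length, principal point and 3D point are made-up illustration values, and the function names are hypothetical.

```python
import numpy as np

def calibration_matrix(f, px, py):
    """Build K for focal length f and principal point (px, py), assuming zero skew."""
    return np.array([[f, 0.0, px],
                     [0.0, f, py],
                     [0.0, 0.0, 1.0]])

def project(K, X):
    """Project a 3D point X (given in the camera frame) with P = K [I | 0]."""
    P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    x = P @ np.append(X, 1.0)      # homogeneous image point (fX + Z px, fY + Z py, Z)
    return x[:2] / x[2]            # divide by Z to get pixel coordinates

# Illustration values only (not from the slides).
K = calibration_matrix(f=500.0, px=320.0, py=240.0)
uv = project(K, np.array([0.2, -0.1, 2.0]))
```

Dividing by the third homogeneous coordinate recovers (fX/Z + p_x, fY/Z + p_y).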
Camera rotation and translation
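In non-homogeneous coordinates:
$$\tilde{X}_{cam} = R(\tilde{X} - \tilde{C})$$
$$X_{cam} = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} \begin{pmatrix} \tilde{X} \\ 1 \end{pmatrix} = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} X$$
$$x = K[I \mid 0]\,X_{cam} = K[R \mid -R\tilde{C}]\,X, \qquad P = K[R \mid t], \qquad t = -R\tilde{C}$$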
Note: C is the null space of the camera projection matrix (PC=0)
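The null-space note can be checked numerically. This sketch assembles P = K[R | t] with t = -R C̃ for an arbitrary rotation and camera centre (illustration values, hypothetical function name) and confirms that the homogeneous camera centre projects to the zero vector.

```python
import numpy as np

def camera_matrix(K, R, C):
    """Return P = K [R | t] with t = -R C, where C is the camera centre in world coordinates."""
    t = -R @ C
    return K @ np.hstack([R, t.reshape(3, 1)])

# Illustration values only: a 30 degree rotation about the y axis and an arbitrary centre.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
C = np.array([1.0, 0.5, -2.0])

P = camera_matrix(K, R, C)
residual = P @ np.append(C, 1.0)   # P C, which should vanish up to rounding
```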
Triangulation
• Given projections of a 3D point in two or more images (with known camera matrices), find the coordinates of the point
[Figure: camera centres O1 and O2, image points x1 and x2, unknown 3D point X]
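One common way to solve this is linear (DLT) triangulation; the slides do not say which method they have in mind, so treat the following as a sketch under that assumption. Each view contributes two linear equations in the homogeneous point X, and the solution is the right singular vector with the smallest singular value. The cameras and the test point are illustration values.

```python
import numpy as np

def triangulate(Ps, xs):
    """Linear (DLT) triangulation from two or more views.
    Ps: list of 3x4 camera matrices; xs: list of (u, v) image points."""
    rows = []
    for P, (u, v) in zip(Ps, xs):
        rows.append(u * P[2] - P[0])   # u * (p3 . X) - (p1 . X) = 0
        rows.append(v * P[2] - P[1])   # v * (p3 . X) - (p2 . X) = 0
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                          # null vector of the stacked system
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Illustration setup: two cameras one unit apart along the x axis.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])
X_est = triangulate([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

With noisy measurements the linear estimate is usually only a starting point for a non-linear refinement of the re-projection error.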
Structure from Motion (SfM)
• Given: m images of n fixed 3D points, $x_{ij} = P_i X_j$, i = 1, … , m, j = 1, … , n
• Problem: estimate m projection matrices $P_i$ and n 3D points $X_j$ from the mn correspondences $x_{ij}$
[Figure: 3D point Xj projecting to x1j, x2j, x3j in cameras P1, P2, P3]
SfM ambiguity
• If we scale the entire scene by some factor k and, at the same time, scale the camera matrices by the factor of 1/k, the projections of the scene points in the image remain exactly the same:
$$x = PX = \left(\frac{1}{k}P\right)(kX)$$
It is impossible to recover the absolute scale of the scene!
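A small numeric check of the ambiguity, read in Euclidean terms: scaling every scene point and the camera translation by the same factor k leaves the pixel coordinates unchanged. All values are illustrative and the helper name is hypothetical.

```python
import numpy as np

def project(K, R, t, X):
    """Project a world point X with P = K [R | t] and return pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

# Illustration values only.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 0.5])
X = np.array([0.4, 0.3, 3.0])
k = 7.5

pixels_original = project(K, R, t, X)
pixels_scaled = project(K, R, k * t, k * X)   # scene and baseline both scaled by k
```

The factor k cancels in the division by the third homogeneous coordinate, which is why image measurements alone cannot fix the scale.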
Structure from Motion (SfM)
• Given: m images of n fixed 3D points, $x_{ij} = P_i X_j$, i = 1, … , m, j = 1, … , n
• Problem: estimate m projection matrices $P_i$ and n 3D points $X_j$ from the mn correspondences $x_{ij}$
• With no calibration info, cameras and points can only be recovered up to a 4x4 projective transformation Q: $X \to QX$, $P \to PQ^{-1}$
• We can solve for structure and motion when $2mn \ge 11m + 3n - 15$ (2mn equations, 11m + 3n unknowns, minus the 15 degrees of freedom of Q)
• For two cameras, at least 7 points are needed
Bundle Adjustment
• Non-linear method for refining structure and motion (Levenberg-Marquardt)
• Minimizing re-projection error
$$E(P, X) = \sum_{i=1}^{m} \sum_{j=1}^{n} D(x_{ij}, P_i X_j)^2$$
[Figure: 3D point Xj observed as x1j, x2j, x3j in cameras P1, P2, P3, with re-projections P1Xj, P2Xj, P3Xj]
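The cost that bundle adjustment minimizes can be written in a few lines. This sketch only evaluates E(P, X) for given cameras and points; the Levenberg-Marquardt optimizer itself is left out. It assumes every point is visible in every camera, and the array layout and function names are choices made for this illustration.

```python
import numpy as np

def project_all(Ps, Xs):
    """Project n points into m cameras. Ps: (m, 3, 4); Xs: (n, 3). Returns (m, n, 2) pixels."""
    Xh = np.hstack([Xs, np.ones((len(Xs), 1))])     # homogeneous points, shape (n, 4)
    proj = np.einsum('mij,nj->mni', Ps, Xh)          # P_i X_j for every pair, shape (m, n, 3)
    return proj[..., :2] / proj[..., 2:3]

def reprojection_error(Ps, Xs, obs):
    """E(P, X): sum over i, j of the squared image distance D(x_ij, P_i X_j)^2."""
    return float(np.sum((project_all(Ps, Xs) - obs) ** 2))

# Illustration data: two cameras, three points, perfect observations.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
Ps = np.stack([P1, P2])
Xs = np.array([[0.3, -0.2, 4.0], [-0.5, 0.1, 5.0], [0.0, 0.4, 3.0]])
obs = project_all(Ps, Xs)

E_perfect = reprojection_error(Ps, Xs, obs)
obs_shifted = obs.copy()
obs_shifted[0, 0] += np.array([3.0, 4.0])           # move one observation by 5 pixels
E_shifted = reprojection_error(Ps, Xs, obs_shifted)
```

A real system would pass the per-observation residuals to a sparse non-linear least-squares solver and handle points that are visible in only some cameras.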
Self-calibration
• Self-calibration (auto-calibration) is the process of determining intrinsic camera parameters directly from uncalibrated images
• For example, when the images are acquired by a single moving camera, we can use the constraint that the intrinsic parameter matrix remains fixed for all the images
• Compute initial projective reconstruction and find 3D projective transformation matrix Q such that all camera matrices are in the form $P_i = K [R_i \mid t_i]$
• Can use constraints on the form of the calibration matrix: zero skew
Why is this cool?
http://www.youtube.com/watch?v=sQegEro5Bfo
Why is this still cool?
http://www.youtube.com/watch?v=p16frKJLVi0
The SLAM Problem
• Simultaneous Localization And Mapping
• A robot is exploring an unknown, static environment
• Given:
  • The robot's controls
  • Observations of nearby features
• Estimate:
  • Map of features
  • Path of the robot
Structure of the landmark-based SLAM Problem
Why is SLAM a hard problem?
SLAM: robot path and map are both unknown
Robot path error correlates errors in the map
[Figure: robot pose uncertainty]
• In the real world, the mapping between observations and landmarks is unknown
• Picking wrong data associations can have catastrophic consequences
• Pose error correlates data associations
SLAM
● Full SLAM: estimates entire path and map!
$$p(x_{1:t}, m \mid z_{1:t}, u_{1:t})$$
● Online SLAM: estimates most recent pose and map!
$$p(x_t, m \mid z_{1:t}, u_{1:t}) = \int\!\!\int\!\cdots\!\int p(x_{1:t}, m \mid z_{1:t}, u_{1:t})\; dx_1\, dx_2 \cdots dx_{t-1}$$
Integrations typically done one at a time
Graphical Model of Full SLAM
$$p(x_{1:t}, m \mid z_{1:t}, u_{1:t})$$
Graphical Model of Online SLAM
$$p(x_t, m \mid z_{1:t}, u_{1:t}) = \int\!\!\int\!\cdots\!\int p(x_{1:t}, m \mid z_{1:t}, u_{1:t})\; dx_1\, dx_2 \cdots dx_{t-1}$$
Scan Matching
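● Maximize the likelihood of the i-th pose and map relative to the (i-1)-th pose and map
$$\hat{x}_t = \arg\max_{x_t} \left\{ p(z_t \mid x_t, \hat{m}^{[t-1]}) \cdot p(x_t \mid u_{t-1}, \hat{x}_{t-1}) \right\}$$
Here $z_t$ is the current measurement, $\hat{m}^{[t-1]}$ is the map constructed so far, and the second factor models the robot motion.
● Calculate the map $\hat{m}^{[t]}$ according to “mapping with known poses” based on the poses and observations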
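The arg max can be illustrated with a deliberately tiny toy problem: a robot on a line, a map consisting of a single landmark, a range measurement and an odometry step. Both likelihoods are assumed Gaussian, and the search is a brute-force grid; none of these choices come from the slides, and real scan matchers align full range scans.

```python
import numpy as np

def scan_match_1d(z, u, x_prev, landmark, sigma_z=0.1, sigma_u=0.5):
    """Return the pose on a grid that maximizes p(z | x, map) * p(x | u, x_prev).
    Toy model: z = landmark - x plus Gaussian noise; x = x_prev + u plus Gaussian noise."""
    predicted = x_prev + u
    candidates = np.linspace(predicted - 2.0, predicted + 2.0, 4001)
    log_measurement = -0.5 * ((z - (landmark - candidates)) / sigma_z) ** 2
    log_motion = -0.5 * ((candidates - predicted) / sigma_u) ** 2
    # Maximizing the sum of log-likelihoods is the same as maximizing the product.
    return candidates[np.argmax(log_measurement + log_motion)]

# Odometry says the robot moved to 1.0; the range reading says it is at 5.0 - 3.8 = 1.2.
x_hat = scan_match_1d(z=3.8, u=1.0, x_prev=0.0, landmark=5.0)
```

Because the measurement is assumed more precise than the odometry, the estimate lands much closer to 1.2 than to 1.0.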
SLAM approach
PTAM approach
Tracking & Mapping threads
Mapping thread
Stereo Initialization
● 5-point pose algorithm (Stewenius et al. '06)
● Requires a pair of frames and feature correspondences
● Provides initial (sparse) 3D point cloud
Wait for new keyframe
● Keyframes are only added if:
  ● There is a baseline to the other keyframes
  ● Tracking quality is good
● When a keyframe is added:
  ● The mapping thread stops whatever it is doing
  ● All points in the map are measured in the keyframe
  ● New map points are found and added to the map
Add new map points
● Want as many map points as possible
● Check all maximal FAST corners in the keyframe:
  ● Check Shi-Tomasi score
  ● Check if already in map
● Epipolar search in a neighboring keyframe
● Triangulate matches and add to map
● Repeat in four image pyramid levels
Optimize map
● Use batch SFM method: Bundle Adjustment*
● Adjusts map point positions and keyframe poses
● Minimizes re-projection error of all points in all keyframes (or use only last N keyframes)
● Cubic complexity with keyframes, linear with map points
● Compatible with M-estimators (we use Tukey)
Map maintenance
● When camera is not exploring, mapping thread has idle time – use this to improve the map
● Data association in bundle adjustment is reversible
● Re-attempt outlier measurements
● Try to measure new map features in all old keyframes
Tracking thread
Pre-process frame
● Make mono and RGB version of image
● Make 4 pyramid levels
● Detect FAST corners
Project Points
● Use motion model to update camera pose
● Project all map points into image to see which are visible, and at what pyramid level
● Choose subset to measure
  ● ~50 biggest features for coarse stage
  ● 1000 randomly selected for fine stage
Measure Points
● Generate 8x8 matching template (warped from source keyframe)
● Search a fixed radius around projected position
  ● Use zero-mean SSD
  ● Only search at FAST corner points
● Up to 10 inverse composition iterations for subpixel position (for some patches)
● Typically find 60-70% of patches
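The matching score above can be sketched as follows. Zero-mean SSD subtracts each patch's mean before comparing, so a uniform brightness change does not affect the score. The candidate list stands in for FAST corner locations inside the search radius; the image, the corners and the function names are made up for this illustration, and the template warp is omitted.

```python
import numpy as np

def zmssd(a, b):
    """Zero-mean sum of squared differences between two equally sized patches."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.sum(((a - a.mean()) - (b - b.mean())) ** 2))

def best_match(template, image, corners):
    """Return the candidate top-left corner whose patch has the lowest ZMSSD score."""
    h, w = template.shape
    scores = [zmssd(template, image[r:r + h, c:c + w]) for r, c in corners]
    return corners[int(np.argmin(scores))]

# Synthetic illustration data.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(40, 40))
template = image[10:18, 12:20] + 15.0        # the same 8x8 patch, uniformly brighter
corners = [(5, 5), (10, 12), (20, 3)]        # hypothetical corner locations to test
match = best_match(template, image, corners)
```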
Update camera pose
● 6-DOF problem
● 10 iterations
● Tukey M-estimator to minimize a robust objective function of re-projection error, where $e_j$ is the re-projection error vector
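To show how a Tukey M-estimator suppresses outliers, here is a sketch of iteratively reweighted least squares on a much simpler problem than the 6-DOF pose: estimating a single 2D image offset from re-projection errors. The biweight formula is the standard one; the cutoff c, the median initialization and the synthetic data are assumptions for this illustration, not the tracker's actual settings.

```python
import numpy as np

def tukey_weights(r, c):
    """Tukey biweight: w = (1 - (r/c)^2)^2 for |r| < c, and 0 beyond the cutoff."""
    r = np.abs(np.asarray(r, dtype=np.float64))
    w = (1.0 - (r / c) ** 2) ** 2
    return np.where(r < c, w, 0.0)

def robust_offset(pred, obs, c=3.0, iterations=10):
    """Estimate a 2D offset d so that obs is close to pred + d, down-weighting outliers."""
    d = np.median(obs - pred, axis=0)                # robust starting point
    for _ in range(iterations):
        e = obs - (pred + d)                         # error vectors e_j
        w = tukey_weights(np.linalg.norm(e, axis=1), c)
        d = d + (w[:, None] * e).sum(axis=0) / w.sum()
    return d

# Synthetic data: true offset (2, -1), small noise, and 10 gross outliers out of 50.
rng = np.random.default_rng(1)
pred = rng.uniform(0.0, 640.0, size=(50, 2))
obs = pred + np.array([2.0, -1.0]) + rng.normal(0.0, 0.1, size=(50, 2))
obs[:10] += 50.0
d_hat = robust_offset(pred, obs)
```

Measurements whose error exceeds the cutoff get zero weight, so the ten outliers do not pull the estimate.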
Bundle-adjustment
● Global bundle-adjustment
● Local bundle-adjustment
  ● X - The newest 5 keyframes in the keyframe chain
  ● Z - All of the map points visible in any of these keyframes
  ● Y - Keyframes for which a measurement of any point in Z has been made
● That is, local bundle adjustment optimizes the pose of the most recent keyframe and its closest neighbors, and all of the map points seen by these, using all of the measurements ever made of these points.
Video
http://www.youtube.com/watch?v=Y9HMn6bd-v8
http://www.youtube.com/watch?v=pBI5HwitBX4
Capabilities
Multi-scale Compactly Supported Basis Functions
Bundle adjusted point cloud with PTAM
RGB-D Sensor
● Principle: structured light
  ● IR projector + IR camera
  ● RGB camera
● Dense depth images
Kinect-based mapping
System Overview
● Frame-to-frame alignment
● Global optimization (SBA for loop closure)
Feature matching
RANSAC
● Feature correspondences are established; outliers robustly removed
● Homography (Transformation) between the two keyframes can now be estimated
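A hedged sketch of the RANSAC step: since an RGB-D sensor gives a 3D position for each matched feature, this example assumes the transformation being estimated is a rigid 3D motion (rotation R and translation t) and fits it with the Kabsch algorithm on random 3-point samples. The threshold, iteration count, function names and synthetic data are illustration choices.

```python
import numpy as np

def rigid_fit(A, B):
    """Least-squares rotation R and translation t with B approximately A @ R.T + t (Kabsch)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
    D = np.eye(3)
    if np.linalg.det(Vt.T @ U.T) < 0:        # avoid returning a reflection
        D[2, 2] = -1.0
    R = Vt.T @ D @ U.T
    return R, cb - R @ ca

def ransac_rigid(A, B, threshold=0.05, iterations=200, seed=0):
    """Fit on random 3-point samples, keep the largest inlier set, then refit on it."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(A), dtype=bool)
    for _ in range(iterations):
        idx = rng.choice(len(A), size=3, replace=False)
        R, t = rigid_fit(A[idx], B[idx])
        inliers = np.linalg.norm(A @ R.T + t - B, axis=1) < threshold
        if inliers.sum() > best.sum():
            best = inliers
    R, t = rigid_fit(A[best], B[best])
    return R, t, best

# Synthetic correspondences: 60 matches, the first 18 replaced by wrong matches.
rng = np.random.default_rng(42)
angle = np.deg2rad(20.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.3, -0.1, 0.5])
A = rng.uniform(-1.0, 1.0, size=(60, 3))
B = A @ R_true.T + t_true + rng.normal(0.0, 0.005, size=(60, 3))
B[:18] = rng.uniform(-2.0, 2.0, size=(18, 3))
R_est, t_est, inliers = ransac_rigid(A, B)
```

The final refit on all inliers is what produces the estimate; the 3-point fits only serve to separate good matches from bad ones.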
Global Optimization (RGBD-ICP)
Benefits
● Visual and depth information used jointly for real-time mapping applications
● Reconstruct a dense map of the environment
● Avoid dense stereo for every pair of keyframes
● Optimize over sparse set of feature points
● Results in dramatic speed improvements
● Allows other valuable algorithms to run simultaneously (e.g. navigation, obstacle avoidance, scene understanding)
Video
http://www.cs.washington.edu/ai/Mobile_Robotics/projects/rgbd-3d-mapping/
Kinect + Real-time reconstruction
Video
http://research.microsoft.com/apps/video/dl.aspx?id=152815
Conclusion
● So much information available from a single camera
● We have yet to truly understand what we can infer from a single camera
● Several exciting technologies in the recent past
● Software problem; not a hardware limitation
● Monocular vision can be sufficient for a lot of use cases
Thanks!