AIMS Computer vision - University of Oxfordvedaldi/assets/teach/2016/... · AIMS Computer Vision 1.Matching, indexing, and search 2.Object category detection 3.Visual geometry 1/2:

AIMS Computer visionLecture 4.1: ReconstructionHT 2017Andrea Vedaldi

For slides and up-to-date information:http://www.robots.ox.ac.uk/~vedaldi/teach.html

AIMS Computer Vision

1. Matching, indexing, and search

2. Object category detection

3. Visual geometry 1/2: Camera models and triangulation

4. Visual geometry 2/2: Reconstruction from multiple views

5. Segmentation, tracking, and depth sensors

2 / 56

Outline

Introduction

Computing H or F from point matches

Feature detection and matching

RANSAC

Determining the ego-motion from F

Structure and motion from more than two views

3 / 56

Outline

Introduction



RANSAC



4 / 56

Introduction

In the previous lectures we have seen stereo reconstruction from two views:

1. Obtain (somehow) the camera parameters P = K[I|0] and P ′ = K ′[R|t].

2. Compute the fundamental matrix F = K ′−>[t]×RK−1

3. Match points x in an image to corresponding points x ′ in the second alongthe epipolar lines l ′ = Fx.

4. Triangulation: compute the 3D points X from x, x ′, P, P ′.

Next – What happens if:

1. you do not know how the camera parameters?

2. you have more than two images?

5 / 56

Introduction

In the previous lectures we have seen stereo reconstruction from two views:

1. Obtain (somehow) the camera parameters P = K[I|0] and P ′ = K ′[R|t].

2. Compute the fundamental matrix F = K ′−>[t]×RK−1

3. Match points x in an image to corresponding points x ′ in the second alongthe epipolar lines l ′ = Fx.

4. Triangulation: compute the 3D points X from x, x ′, P, P ′.

Next – What happens if:

1. you do not know how the camera parameters?

2. you have more than two images?

5 / 56

You get Structure from Motion

[Carl Olsson]6 / 56

The Structure from Motion (SFM) problem

Given two or more images of a scene:

Camera C Camera C ′

compute (i) the camera motion and (ii) the scene structure.

Assumptions:

I Known intrinsic calibration K, K ′.

I Unknown extrinsic calibration R, t (egomotion).7 / 56

Prototypical SFM pipeline

1. Match corner points to find point correspondence. This is harder thanbefore as the epipolar geometry is unavailable.

2. Compute the egomotion R, t:

I For planar scenes:

I Compute the homography matrix H (e.g. four points algorithmseen in B14);

I Extract the egomotion from H.

I For general 3D scenes:

I Compute the fundamental matrix F (e.g. eight points algorithm);

I Extract the egomotion from F.

3. Triangulate as before to obtain the 3D points.

8 / 56

Outline

Introduction



RANSAC



9 / 56

Corner Points computed for each frame

Start from two views:

10 / 56


Extract some corner points, for example using the Harris detector:

11 / 56

Egomotion from corner points

Egomotion = transformation between the cameras.

C'

x'

C

x ⇒

C'

x'

C

x

(R,t)

X

I Given point correspondences xi ↔ x ′i for i = 1 . . . n, we want to determineR and t.

I Intuition: Keep C still, and move C ′ until all rays intersect.

I Obviously three correspondences are not enough to fix C ′. How many dowe need?

12 / 56

Outline of egomotion computation

1. Compute the fundamental matrix F from the correspondences xi ↔ x ′i .

2. Decompose F = K ′−>[t]×RK−1 to find R, t (given the known K and K ′).

3. Compute the projection matrices P and P ′ if needed.

13 / 56

Actually, t found only up to scale

I F is a homogeneous matrix, so

F ∝ K ′−1[t]×RK−1 ∝ K ′−1

[λt]×RK−1

I Therefore translation and all lengths are recovered only up to scale:

t ≡ λt, Xi ≡ λXi .

Depth/scale ambiguity

We cannot distinguish:

I a large translation when viewing a large distant scene; from

I a small translation when viewing a small near-to scene.

Question: How might you resolve the depth/scale scaling ambiguity?

14 / 56

How many correspondences are required?

I Because of the depth/speed scaling ambiguity the rotation (3 DoF) can bedetermined completely but only the translation direction (2 DoF) isrecoverable.

I This allows us to evaluate the number of correspondence needed:

1. For n scene points there are 3n unknowns

2. Between 2 views there are 5 = (3 rot + 2 trans) unknowns

3. Each correspondence yields 4 measurements

4. Hence 4n > 3n + 5 and

n > 5 correspondences are needed

I For n < 7 the solutions are non-linear, so we’ll see solutions for n = 7 andn = 8.

15 / 56

Computing the fundamental matrix for n > 8

I Task: Given n correspondences xi ↔ x ′i compute F such that

∀i : x ′i>Fxi = 0.

I Solution: Each correspondence generates one constraint

[x ′i y ′i 1

]

f1 f2 f3f4 f5 f6f7 f8 f9

xi

yi

1

= 0

which can be written as

x ′i xi f1 + x ′i yi f2 + x ′i f3 + y ′i xi f4 + y ′i yi f5 + y ′i f6 + xi f7 + yi f8 + f9 = 0

or

[x ′i xi x ′i yi x ′i y ′i xi y ′i yi y ′i xi yi 1

]

f1...f9

= 0.

16 / 56

Computing the fundamental matrix /ctd

I For n correspondences build up the n × 9 system

An×9f =

x ′1x1 x ′1y1 x ′1 y ′1x1 y ′1y1 y ′1 x1 y1 1...

......

......

......

......

x ′nxn x ′nyn x ′n y ′nxn y ′nyn y ′n xn yn 1

f1...f9

.

I For n = 8 points f cand be found as the null-space of A, and so f and Fare determined up to scale (as expected).

I Since the points are noisy, in general one wants to use n > 8. This can bedone using least square.

17 / 56

A least squares version of the 8-point algorithm

Due to noise, there will not be an exact solution to Af = 0 (A has full rank).

Least square formulation

Find the unit vector f that minimizes the norm of the residual r = Af:

f∗ = argminf:‖f‖=1 ‖Af‖2

Solution with eigenvalues

Compute the eigen-decomposition of the matrix M = A>A and set f to the(unit) eigenvector e1 corresponding to the smallest eigenvalue λ1.

Solution with SVDCompute the SVD of the matrix A and set f to the (unit) right singular vectore1 corresponding to the smallest singular value σ1.

18 / 56









18 / 56









18 / 56

Proof of the eigendecomposition solution

I The squared sum of the residuals r = Af is

‖r‖2 = r>r = f>A>Af = f>Mf

I M = A>A is a n × n symmetric real matrix; hence it can be decomposedas

M = VΛV> = V

λ1

λ2. . .

λn

V> =

n∑

i=1

λi [ei e>i ]

where

I V =[e1 . . . en

]is the orthonormal matrix of eigenvectors

I eigenvalues are non-decreasing: 0 6 λ1 6 λ2 6 . . . 6 λn.

I The eigenvalues are non-negative because:

Mei = eiλi ⇒ e>i Mei = e>i eiλi ⇒ [Aei ]>[Aei ] = λi > 0.

Thenf>Mf = λ1(f>e1)

2 + λ2(f>e2)2 + . . .+ λn(f>en)

2.

I This expression is minimised when f = e1.

19 / 56



‖r‖2 = r>r = f>A>Af = f>Mf


M = VΛV> = V

λ1

λ2. . .

λn

V> =

n∑

i=1

λi [ei e>i ]

where

I V =[e1 . . . en






2 + λ2(f>e2)2 + . . .+ λn(f>en)

2.

I This expression is minimised when f = e1.

19 / 56



‖r‖2 = r>r = f>A>Af = f>Mf


M = VΛV> = V

λ1

λ2. . .

λn

V> =

n∑

i=1

λi [ei e>i ]

where

I V =[e1 . . . en






2 + λ2(f>e2)2 + . . .+ λn(f>en)

2.

I This expression is minimised when f = e1.19 / 56

Proof of the SVD solution

I Any m × n matrix A where m > n can be decomposed as

Am×n = Um×n

σ1

σ2. . .

σn

n×n

V>n×n

where U is column-orthogonal, V is fully orthogonal, and Σ contains thesingular values ordered so 0 6 σ1 6 σ2 6 . . . 6 σn.

I The singular vectors V of A are the same as the eigenvectors ofM = A>A:

M = A>A = VΣ>U>UΣV> = VΣ2V>

I In particular f = e1 is the first column of V.

I The SVD is usually preferred to the eigenvalue decomposition because itis numerically more stable.

20 / 56

Proof of the SVD solution

I Any m × n matrix A where m > n can be decomposed as

Am×n = Um×n

σ1

σ2. . .

σn

n×n

V>n×n

where U is column-orthogonal, V is fully orthogonal, and Σ contains thesingular values ordered so 0 6 σ1 6 σ2 6 . . . 6 σn.

I The singular vectors V of A are the same as the eigenvectors ofM = A>A:

M = A>A = VΣ>U>UΣV> = VΣ2V>

I In particular f = e1 is the first column of V.

I The SVD is usually preferred to the eigenvalue decomposition because itis numerically more stable.

20 / 56

Computing F from 7 points

I For the 7× 9 set of equations Af = 0 we know that f is in the null space ofA

I This null space is 2-dimensional and hence spanned by two vectors f1 andf2. Since f is determined up to scale, all solutions are given by:

f = αf1 + (1 − α)f2

I Reshaping the vectors, results in a family of candidate fundamentalmatrices

F = αF1 + (1 − α)F2

I To find which one is a “proper” fundamental matrix, use the non-linearconstraint det F = 0. This gives a cubic equation in α. Can you see why?

I The cubic has either one or three real solutions for α.

21 / 56

A Visual Compass

If the motion of the camera is known to be a pure rotation, then the imagesare related by an homography

x ′ = H∞x

whereH∞ = K ′RK−1.

Algorihtm:

I Find correspondences xi ↔ x ′i

I Compute H∞ from the correspondences (see B14)

I Extract R to find relative rotations

But of course we cannot recover any scene structure!

22 / 56

A Visual Compass

Use H∞ to register images to a common reference frame to create panoramicmosaic:

��

��

��

��

��

��

��

��

23 / 56

Outline

Introduction



RANSAC



24 / 56

Feature detection, matching and the F matrix

So far, we have not discussed matching. The reason is that computation ofthe fundamental matrix can be incorporated into the matching.

Outline:

I Extract image points as corners. Why corners?

I Obtain an initial corner matches using local descriptors.

I Remove outlier and estimate the fundamental matrix F using RANSAC.

I Obtain further corner matches using F.

25 / 56

Why corner points, especially?

I Why not use lines, or take a dense pixel-based approach?

I The key reason is that the search for matches is no longer 1D when thecamera motion is unknown. A 2D region has to be searched.

I A dense approach is then likely to be too expensive, and matchingsections of a line suffers the the aperture problem.

I Corners are:

I relatively sparse;

I reasonably cheap to compute;

I well-localized;

I appear quite robustly from frame to frame.

I Hence corners are good for matching.

26 / 56


Recall that points with distinctively high autocorrelation provide the bestchance of deriving a distinctive cross-correlation signal.

2D

1D

Uniform

27 / 56

Initial matching

I Extract corners in both images (feature detection).

I For each corner x in C, make a list of potential matches x ′ in a region in C ′

around x (heuristic).

I Rank the matches by comparing the regions around the corners usingcross-correlation.

I Sift them to reconcile forward-backward inconsistencies.

I The idea here is to not to do too much work — just enough to get somegood matches.

28 / 56

Initial matching /ctd

. . .Source Patch Target (a) Target (b) Target (c) etc ...

29 / 56

Initial Matching /ctd

Matches — some good matches, some mismatches. Can still compute F witharound 50% mismatches. How?

30 / 56

Outline

Introduction



RANSAC



31 / 56

RANSAC – RANdom SAmple Concensus

I Suppose you tried to fit a straight line to data containing outliers — pointswhich are not properly described by the assumed probability distribution.

I The usual methods of least squares are hopelessly corrupted.

I Need to detect outliers and exclude them.

⇒I Use estimation based on robust statistics RANSAC was the first,

devised by vision researchers, Fischler & Bolles (1981)

32 / 56

RANSAC algorithm for lines

1. For many repeated trials:

1.1 Select a random sample of two points

1.2 Fit a line through them

1.3 Count how many other points are within a threshold distance of the line(inliers)

2. Select the line with the largest number of inliers

3. Refine the line by fitting it to all the inliers (using least squares)

Remarks:

I Sample a minimal set of points for your problem (2 for lines).

I Repeat such that there is a high chance that at least one minimal setcontains only inliers (see tutorial sheet).

33 / 56


Data ∼ 50% corrupt Random Sample

34 / 56


Support ∼ 10 Support ∼ 50

35 / 56

RANSAC algorithm for F


1.1 Select a random sample of seven correspondences

1.2 Compute F using the cubic method

1.3 Count how many other correspondences are within threshold distanceof the epipolar lines (inliers)

2. Select the F with the largest number of inliers

3. Refine F by fitting it to all the inliers (using the SVD method)

36 / 56

RANSAC algorithm for H


1.1 Select a random sample of four correspondences

1.2 Compute H (as in B14)

1.3 Count how many other correspondences are within threshold distanceof the predicted locations (inliers)

2. Select the H with the largest number of inliers

3. Refine H by fitting it to all the inliers, optimizing the reprojection error

minH

∑

(x,x′)∈Inliers

d2(x ′,Hx) + d2(H−1x ′, x)

37 / 56

Correspondences consistent with epipolar geometry

Initial matches Inliers

38 / 56

Epipolar geometry

39 / 56

Outline

Introduction



RANSAC



40 / 56

Computing R and t from F

Recall that F = K ′−1[t]×RK−1. We now show how to recover R and t from F(given K and K ′).

1. Compute the essential matrix E = [t]×R = K ′>FK.

2. Compute t as the null vector of E> (i.e.E>t = 0).

I t is determined up to a scaling factor µ

I there are two solutions ±µt

3. Compute R from E

I the algorithm for this step is given later.

I it returns two solutions R1 and R2.

4. Overall, there are four solutions for the projection matrix:

P ′ = K ′[R1|µt] P ′ = K ′[R1|− µt]

P ′ = K ′[R2|µt] P ′ = K ′[R2|− µt]

5. Exclude 3 of these using a visibility test41 / 56

The four solutions

The 3D point is in front of both cameras in only one case.

Visible Visible

C C’

Visible

Invisible

C C’

Visible

CC’

Invisible

Invisible

CC’

Invisible

Note these are “computer vision” cameras, so to be visible a ray must passthrough the image on its way to the optic centre!

42 / 56

Computing R1,2 from the essential matrix ENon-examinable

Recall that E = [t]×R; we now recover R from E. Algorithm:

1. Compute the Singular Value Decomposition (SVD) of E.

U

1 0 00 1 00 0 0

V> ← M

2. Set

W =

0 −1 01 0 00 0 1

3. The two solutions are:

R1 = UWV>, R2 = UW>V>

43 / 56

Outline

Introduction



RANSAC



44 / 56

Structure and motion for more than two views

I Why bother?

1. Matching becomes more verifiable, as 3D point estimates are availableto reproject.

2. 3D point estimates improve as further views over a range of angles isobtained

3. There is no increase in the degree of ambiguity, though the overallscale ambiguity persists.

45 / 56

Notation for three+ views

I For three views let the cameras be C, C ′, C ′′ with projection matrices P, P ′

and P ′′, and with image points x, x ′ and x ′′.

C C’

xx’

x’’X

C’

I For m views, a point xj is imaged in the i-th camera Ci at xij = PiXj

46 / 56

Point correspondence over 3 views

I Given the projection matrices and x↔ x ′ how is the point x ′′ found?

C C’

xx’

X

C’

?

?

??

I Algorithm:

1. Compute the 3D point from x and x ′

2. Then re-project using P ′′.

I The search in the third image is zero-D, and the size of the search regiondepends only on uncertainty.

47 / 56

Problem statement: structure and motion

I Given: n matching image points xij over m views

I Find: the cameras Pi and the 3D points Xj such that xij ≈ PiXj by finding:

minPi ,Xj

n∑

j=1

m∑

i=1

d2(xij ,P

iXj)

I This is a serious minimization:I For each camera, 6 parametersI For each 3D point, 3

parametersI Total of 6m + 3n − 1 (−1 for

scale) parameters overall

C C’

xx’

X

C’x’’

I For 50 frames, 1000 points, we have 3.3× 103 unknowns!

48 / 56

/ctd

Building block is computing correspondences xij ↔ xi+1

j , finding Fi+1i and

then matrices Pi ,Pi+1

Algorithm1. Compute interest points in each image2. Compute matches between consecutive image pairs

i, i + 13. Compute Fi+1

i . Recover Pi ,Pi+1

4. Compute scene points5. Extend correspondences over image triples6. Extend correspondences over all images7. Optimize over all Pi ,Xj

Images

xj

xj

xj

x1j

1

2

3

4

2

3

4

P

P

P

P

1

2

3

4

Xj

49 / 56

2d3’s Boujou systemZisserman, Fitzgibbon, Torr, Beardsley

Original Sequence

50 / 56

2d3’s Boujou systemZisserman, Fitzgibbon, Torr, Beardsley

Augmentation

51 / 56

Batch SFM

I Up to now batch, offline processing of video sequences

DATADATA

DATA

I Post-production, 3D model reconstruction, etc.

52 / 56

Real-time, sequential SFM

I Real-time, sequential, fixed time budget (10s of milliseconds)

I Build and maintain a map, and localise w.r.t. the map

DATA

+

DATA

+

I Real-time robotics applications, but in simplified 2D environments,specialised sensors, etc

I Reliable, repeated measurement is crucial – mitigates against drift givingrepeatable accuracy.

53 / 56

Sequential structure from motion: visual SLAM

I Represent joint distribution over camera and feature positions using asingle multi-variate Gaussian.

x =

xv

y1

y2...

, P =

Pxx Pxy1 Pxy2 . . .Py1x Py1y1 Py1y2 . . .Py2x Py2y1 Py2y2 . . .

......

...

I Use Kalman Filter (see C4B Mobile Robotics)

predict→ measure→ update

framework to propagate uncertainty, and fuse measurement data

54 / 56

Example: real-time, sequential structure from motion

state

estimate

measurementsearch

image

patch

predict

update

55 / 56

Davison, Reid, Smith, Williams, Klein, et al.

Documents

AIMS Computer vision - University of Oxfordvedaldi/assets/teach/2016/... · AIMS Computer Vision 1.Matching, indexing, and search 2.Object category detection 3.Visual geometry 1/2: