ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE Fast Structure from Motion for Planar Image Sequences Andreas Weishaupt, Luigi Bagnato, Pierre Vandergheynst

Fast Structure From Motion in Planar Image Sequences


DESCRIPTION

These are the slides of my talk at EUSIPCO 2010.


Page 1: Fast Structure From Motion in Planar Image Sequences


Fast Structure from Motion for Planar Image Sequences

Andreas Weishaupt, Luigi Bagnato, Pierre Vandergheynst

Page 2: Fast Structure From Motion in Planar Image Sequences

Depth Map

Page 3: Fast Structure From Motion in Planar Image Sequences

Depth Map

Motion

Page 6: Fast Structure From Motion in Planar Image Sequences

Cinema 3D

Autonomous Navigation (SLAM)

3D scanning/modeling

ÉCOLE POLYTECHNIQUEFÉDÉRALE DE LAUSANNE

Motivations

6

Philips 2D+Depth

Page 7: Fast Structure From Motion in Planar Image Sequences

Cinema 3D

Autonomous Navigation (SLAM)

3D scanning/modeling

ÉCOLE POLYTECHNIQUEFÉDÉRALE DE LAUSANNE

Motivations

6

Philips 2D+Depth

Target:- Real time performances- Good Accuracy

Page 8: Fast Structure From Motion in Planar Image Sequences

I1(x+ u)! I0(x) = 0

ÉCOLE POLYTECHNIQUEFÉDÉRALE DE LAUSANNE

Problem Formulation

7

Brightness Consistency Equation (BCE)

We consider only 2 consecutive frames I0 and I1

f: focal t: camera translationd: distance/depth : camera rotationu: optical flow!

Page 11: Fast Structure From Motion in Planar Image Sequences

Problem Formulation

Brightness Consistency Equation (BCE):

I_1(x + u) - I_0(x) = 0

Linearization, if we assume that the motion between frames is small:

I_1(x) + \nabla I_1^T(x)\, u - I_0(x) \approx 0

f: focal length; t: camera translation; d: distance/depth; camera rotation; u: optical flow

Page 12: Fast Structure From Motion in Planar Image Sequences

Problem Formulation - Projection Model

nonlinear → linearization

Z: depth map = inverse of d

1. Parallel projection

2. Optical flow




FAST STRUCTURE FROM MOTION FOR PLANAR IMAGE SEQUENCES

Andreas Weishaupt, Luigi Bagnato, and Pierre Vandergheynst

Signal Processing Laboratory (LTS2), Ecole Polytechnique Federale de Lausanne (EPFL), Station 11, CH-1015 Lausanne, Switzerland

phone: +41 21 69 36874, email: {andreas.weishaupt,luigi.bagnato,pierre.vandergheynst}@epfl.ch

ABSTRACT

Dense three-dimensional reconstruction of a scene from images is a challenging task. Usually, it is achieved by finding correspondences in successive images and computing the distance by means of epipolar geometry. In this paper, we propose a variational framework to solve the depth from motion problem for planar image sequences. We derive camera ego-motion estimation equations and we show how to combine the depth map and ego-motion estimation in a single algorithm. We successfully test our method on synthetic image sequences for general camera translation. Our method is highly parallelizable and thus well adapted for real-time implementation on the GPU.

1. INTRODUCTION

The efficient three-dimensional recovery of a scene structure from images has been a long-term aim in computer vision. Successful methods would have a big impact on a broad range of fields such as autonomous navigation, biomedical imaging or architecture. The geometry that links the 3D structure of a scene and its projection on images has been studied thoroughly, e.g. in [5] or [4]. Traditionally, there is one main approach: from a pair of stereo images of the scene, the 3D recovery is based on epipolar geometry. Such an approach has been extended to methods that handle multiple camera inputs, which are now generally known as multi-view stereo methods (see [7] for a good overview and comparison). Structure from motion means 3D scene reconstruction from images captured by a moving camera. Usually, similar methods are used in structure from motion recovery. They rely on finding pairs of corresponding points in successive images. This has the following consequences:

• The final result depends on the quality of the found correspondences. If the match is not exact, the reconstruction will not be accurate.

• Finding correspondences is a computationally expensive task: dense reconstruction cannot be performed in real-time.

• For real-time reconstruction, the recovery has to be limited to a few feature points. Often, tracking of the found feature points is employed to reduce additional computation cost. Tracking introduces a second source of error for 3D reconstruction.

Another class of recent methods to obtain dense depth maps is based on the fusion of sparse depth maps by image registration techniques, e.g. [8]. Those have the advantage that traditional structure from motion systems can be employed. However, in order to provide accurate results, a large number of depth maps has to be input to such methods. Finally, in [10] it is shown how robust depth map reconstruction can be achieved from a video sequence by belief propagation and bundle optimization. Even if the results are promising and the method does not need hundreds of input frames, it is computationally expensive and intended only for off-line processing.

Bagnato et al. [2, 1] have shown how to recover dense depth maps from omnidirectional image sequences by employing a variational framework. The approach does not have the usual drawbacks found in the methods above:

• Dense depth maps can be obtained without finding correspondences and combining sparse depth maps.

• One depth map can be recovered using only two successive input images.

• The framework can be implemented for dense frame-by-frame recovery in real-time.

In this paper, we will show how to modify this approach in order to handle regular images obtained by a planar image sensor. In Section 2, we derive a general projection model that relates depth and camera motion. The model is nonlinear due to the central projection on the image plane, and we show how it can be linearized. In Section 3, the linearized projection is used in a TV-L1 optimization framework for depth from motion reconstruction. In Section 4, we solve the camera ego-motion estimation problem. As a last step, in Section 5 we combine both the ego-motion and the depth from motion estimation into a complete structure from motion framework for planar image sequences. In Section 6, we separately evaluate the ego-motion and depth from motion estimation on synthetic images, and then we evaluate the complete algorithm with quite remarkable results.

2. MOTION IN PLANAR IMAGES

We model the camera movement during acquisition of two successive frames by the rigid 3D translation vector t = (t_x, t_y, t_z)^T. Consequently, during camera motion, a point p = (X, Y, Z)^T becomes p' = p - \Delta p = p - t, where \Delta p denotes the relative camera motion. We parametrize p by p = d(r)\, e_r, where r = (x, y, f) is a point on the planar sensor, f the focal distance and d(r) the distance of the optical center to a point in the scene. The pinhole camera model and motion are shown in Figure 1. We denote Z(r) = 1/d(r) the inverse depth or depth map. A point on the sensor plane can be obtained by central projection:

r = \frac{\|r\|}{d(r)}\, p = \|r\| Z(r)\, p.  (1)

In Figure 2 we show a side view of the camera model and motion, as well as the projections on the sensor plane and on the parallel object plane. Based on Eq. 1, we can derive a projection model that links camera movement, scene depth and optical flow.
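To make the central projection of Eq. 1 concrete, here is a tiny numerical check (an illustrative sketch; the point coordinates, focal length and variable names are ours, not from the paper):

```python
import numpy as np

f = 1.0                          # focal distance (assumed value)
p = np.array([0.4, -0.2, 2.5])   # scene point (X, Y, Z)^T with Z > 0 (made up)

# Pinhole projection onto the sensor plane z = f
r = f / p[2] * p                 # r = (x, y, f)
d = np.linalg.norm(p)            # distance d(r) from the optical center
Z = 1.0 / d                      # inverse depth, the "depth map" value Z(r)

# Eq. 1: r = ||r|| Z(r) p
assert np.allclose(r, np.linalg.norm(r) * Z * p)
print(r)  # the sensor point, third component equals f
```

The assertion holds because r and p point in the same direction, with scale factor ||r||/||p|| = f/Z.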

Page 13: Fast Structure From Motion in Planar Image Sequences

Problem formulation

t^* = \arg\min_t \sum_{x \in D} \left[ I_1 - I_0 + \nabla I_1^T (\|r\|Z\, t_p - u_0) \right]^2

Figure 1: pinhole camera model and rigid camera motion

The optical flow u is defined as the apparent motion of the brightness pattern between two images. The central projection on the object plane parallel to the sensor plane is given by:

t_p = \frac{r_z}{\|r\|Z(r)} \cdot \frac{r + \|r\|Z(r)\,t}{r_z + \|r\|Z(r)\,t_z} - \frac{r}{\|r\|Z(r)}.  (2)

Let us define Eq. 2 as the parallel projection. The optical flow can be approximated by the following projection on the sensor plane:

u = \frac{r_z\,(r + \|r\|Z(r)\,t)}{r_z + \|r\|Z(r)\,t_z} - r.  (3)

Eq. 3 shows the nonlinear dependency of the estimated optical flow on the depth map Z(r) as well as on the translation t_z perpendicular to the sensor plane. In this nonlinear form, it is difficult to include the projection in a variational framework. Nevertheless, combining Eqs. 2 and 3 we find a linearized relationship between the parallel projection and the optical flow:

u = \|r\|Z(r)\,t_p.  (4)

3. TV-L1 DEPTH FROM MOTION

We assume for the moment that we know the camera translation parameters t for two successive frames I_0 and I_1. Furthermore, we assume that the brightness does not change between those images. Using the definition of optical flow and the projection in Eq. 4, we can express the image residual \rho(Z) as in [6]:

\rho(Z) = I_1(x + u_0) + \nabla I_1^T (\|r\|Z\, t_p - u_0) - I_0.  (5)

A depth map Z = Z(r) can be obtained by solving the following optimization problem [2]:

Z^* = \arg\min_Z \sum_{x \in D} |\nabla Z| + \lambda \sum_{x \in D} |\rho(Z, I_0, I_1)|,  (6)

where D is the discrete domain of pixels and x their position on the image. The left term in Eq. 6 is the regularization term. Here we set it to the TV norm of Z, which imposes a sparseness constraint on Z and is edge-preserving. The right term is the data term, which we set to the image residual as defined in Eq. 5. We have chosen the robust L1 norm as it has some advantages when compared to the usually employed L2 norm [9]. Eq. 6 is not a strictly convex

Figure 2: side view with the projections of camera motion

optimization problem and is thus hard to solve. From [2] and [9] we know that a convex relaxation can be formulated:

Z^* = \arg\min_Z \sum_{x \in D} |\nabla Z| + \frac{1}{2\theta} \sum_{x \in D} (V - Z)^2 + \lambda \sum_{x \in D} |\rho(V)|,  (7)

where V is a close approximation of Z, and for \theta \to 0 we have V \to Z. We solve Eq. 7 using an alternating two-step iteration scheme:

1. For fixed Z, solve for V:

V^* = \arg\min_V \sum_{x \in D} \frac{1}{2\theta} (V - Z)^2 + \lambda \sum_{x \in D} |\rho(V)|.  (8)

2. For fixed V, solve for Z:

Z^* = \arg\min_Z \sum_{x \in D} |\nabla Z| + \frac{1}{2\theta} \sum_{x \in D} (V - Z)^2.  (9)

Eq. 8 can be solved by the following soft-thresholding:

V = Z + \begin{cases} \lambda\theta\,\|r\|\nabla I_1^T t_p & \text{if } \rho(Z) < -\lambda\theta\,(\|r\|\nabla I_1^T t_p)^2 \\ -\lambda\theta\,\|r\|\nabla I_1^T t_p & \text{if } \rho(Z) > \lambda\theta\,(\|r\|\nabla I_1^T t_p)^2 \\ -\rho(Z)/(\|r\|\nabla I_1^T t_p) & \text{if } |\rho(Z)| \le \lambda\theta\,(\|r\|\nabla I_1^T t_p)^2. \end{cases}  (10)

In order to solve Eq. 9, the dual formulation of the TV norm can be exploited. It is given by TV(Z) = \max\{p \cdot \nabla Z : \|p\| \le 1\}. With the introduced dual variable p, Eq. 9 can be solved iteratively by the Chambolle algorithm [3, 2]:

p^{n+1} = \frac{p^n + \tau\,\nabla(\nabla \cdot p^n - V/\theta)}{1 + \tau\,|\nabla(\nabla \cdot p^n - V/\theta)|}.  (11)

In the discrete domain, the stability and properties of the solution depend on the implementation of the differential operators. In Eq. 11, \nabla represents the discrete gradient operator and \nabla\cdot the discrete divergence operator, as defined in [3]. From Eq. 11, the depth map can be recovered by Z = V - \theta\,\nabla \cdot p. Furthermore, the depth positivity constraint has to be imposed on the recovered depth map, i.e. if Z(r) < 0 we set Z(r) \leftarrow 0. In order to provide global convergence and to handle different levels of detail in the depth map Z, we propose solving Eq. 7 using a multi-scale resolution approach.
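The inner loop of Eqs. 8-11 can be sketched in a few lines of NumPy. This is our own illustrative implementation, not the authors' GPU code: the function names, the step size tau = 0.25, the iteration count and the guard against a zero coefficient are all assumptions, and the discrete grad/div pair follows the usual forward-difference convention.

```python
import numpy as np

def grad(Z):
    # forward-difference gradient with Neumann boundary conditions
    gx = np.zeros_like(Z); gy = np.zeros_like(Z)
    gx[:, :-1] = Z[:, 1:] - Z[:, :-1]
    gy[:-1, :] = Z[1:, :] - Z[:-1, :]
    return gx, gy

def div(px, py):
    # discrete divergence, the negative adjoint of grad
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]; dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]; dy[1:-1, :] = py[1:-1, :] - py[:-2, :]; dy[-1, :] = -py[-2, :]
    return dx + dy

def tvl1_depth_step(Z, rho0, b, lam, theta, tau=0.25, n_inner=30):
    """One outer iteration of Eqs. 8-11 (sketch).
    rho0 : residual rho(Z) evaluated at the current Z
    b    : per-pixel scalar coefficient ||r|| grad(I1)^T t_p
    """
    # Eq. 10: point-wise soft-thresholding gives V
    th = lam * theta * b * b
    V = np.where(rho0 < -th, Z + lam * theta * b,
        np.where(rho0 > th, Z - lam * theta * b,
                 Z - rho0 / np.where(b == 0, 1.0, b)))
    # Eq. 11: Chambolle iterations for the TV proximal step (Eq. 9)
    px = np.zeros_like(Z); py = np.zeros_like(Z)
    for _ in range(n_inner):
        gx, gy = grad(div(px, py) - V / theta)
        denom = 1.0 + tau * np.sqrt(gx ** 2 + gy ** 2)
        px = (px + tau * gx) / denom
        py = (py + tau * gy) / denom
    Znew = V - theta * div(px, py)       # Z = V - theta div p
    return np.maximum(Znew, 0.0)         # depth positivity constraint
```

In practice this step would be called once per warp at every pyramid level, with rho0 and b recomputed from the warped image I_1(x + u_0).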

I_1(x) + \nabla I_1^T(x)\, u - I_0(x) \approx 0


Linearization with respect to a known u_0

We cast the depth estimation as a TV-L1 optimization problem

BC Equation

Data Term - bilinear in Z and t_p

If we know the motion t:

If we have an estimate of Z: we estimate ego-motion by least squares

Projection Model


Page 14: Fast Structure From Motion in Planar Image Sequences

Camera Ego-Motion Estimation

t^* = \arg\min_t \sum_{x \in D} \left[ I_1 - I_0 + \nabla I_1^T (\|r\|Z\, t_p - u_0) \right]^2

Special case t = (t_x, t_y, 0)^T

General case t = (t_x, t_y, t_z)^T

Figure 3: Some of the used input images. Left: two successive frames for movement in the x direction with translation vector t = (-0.1, 0, 0)^T. Right: two successive frames for movement in the z direction with translation vector t = (0, 0, 0.1)^T.

This means that we use downsampled images kI0 and kI1 of L different sizes (the prefix k denotes the scale level). We start at the coarsest resolution, where we solve for LZ. Then we upsample LZ to the next level L-1 and use it as input for the projection in the image residual. This is repeated until level 0 is reached, where we obtain the final depth map 0Z.
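The downsampled images kI0 and kI1 can come from a simple block-averaging pyramid. A minimal sketch (our own helper, not from the paper; it assumes image sides divisible by 2 at every level and returns the coarsest level first, matching the coarse-to-fine order):

```python
import numpy as np

def downsample2(I):
    # 2x2 block average (assumes even height and width)
    return 0.25 * (I[0::2, 0::2] + I[1::2, 0::2] + I[0::2, 1::2] + I[1::2, 1::2])

def build_pyramid(I, L):
    """Return L levels of I, coarsest first."""
    pyr = [I]
    for _ in range(L - 1):
        pyr.append(downsample2(pyr[-1]))
    return pyr[::-1]
```

A real implementation would typically low-pass filter before subsampling to avoid aliasing; block averaging is the crudest form of that.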

4. EGO-MOTION ESTIMATION

Let us assume now that we have an estimate of the depth map Z. Given two successive images I_0 and I_1, we can recover the camera translation parameters by optimizing the L2 norm of the image residual with respect to t:

\sum_{x \in D} \left[ I_1 - I_0 + \|r\|Z\, t_p^T \nabla I_1 \right] \|r\|Z\, \nabla I_1 = 0.  (12)

In the special case of camera movement parallel to thesensor plane solving Eq. 12 results in the linear systemA(x)b = c(x) with

$$A(x) = \begin{pmatrix} \sum_{x\in D} \|r\|^2 Z^2 \left(\dfrac{\partial I_1}{\partial x}\right)^2 & \sum_{x\in D} \|r\|^2 Z^2\, \dfrac{\partial I_1}{\partial x}\dfrac{\partial I_1}{\partial y} \\ \sum_{x\in D} \|r\|^2 Z^2\, \dfrac{\partial I_1}{\partial x}\dfrac{\partial I_1}{\partial y} & \sum_{x\in D} \|r\|^2 Z^2 \left(\dfrac{\partial I_1}{\partial y}\right)^2 \end{pmatrix}$$

and

$$c(x) = \begin{pmatrix} -\sum_{x\in D} \|r\|Z\, \dfrac{\partial I_1}{\partial x}\,(I_1 - I_0) \\ -\sum_{x\in D} \|r\|Z\, \dfrac{\partial I_1}{\partial y}\,(I_1 - I_0) \end{pmatrix}.$$
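For the special case $t = (t_x, t_y, 0)^T$, the 2x2 system above can be accumulated over all pixels and solved in closed form. A minimal sketch; the function name and the use of `np.gradient` for the image derivatives are our assumptions, not the paper's implementation:

```python
import numpy as np

def estimate_translation_xy(I0, I1, Z, r_norm):
    """Closed-form ego-motion for t = (tx, ty, 0): accumulate the 2x2
    normal equations A b = c over all pixels and solve for b = (tx, ty).
    `r_norm` holds ||r|| per pixel; Z is the current depth estimate."""
    Iy, Ix = np.gradient(I1)             # image derivatives of I1
    w = r_norm * Z                       # ||r|| Z per pixel
    It = I1 - I0                         # temporal difference
    A = np.array([[np.sum(w**2 * Ix**2), np.sum(w**2 * Ix * Iy)],
                  [np.sum(w**2 * Ix * Iy), np.sum(w**2 * Iy**2)]])
    c = np.array([-np.sum(w * Ix * It), -np.sum(w * Iy * It)])
    return np.linalg.solve(A, c)         # b = (tx, ty)
```

With identical input frames the temporal difference vanishes and the estimated translation is zero, as expected.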

For general camera motion, we can solve Eq. 12 by iterative methods, e.g. Levenberg-Marquardt or gradient descent: $x^{n+1} = x^n - \varepsilon\,\nabla E(x^n)$, where $x$ contains the three translation parameters and $E$ is the energy of the image residual,

$$\frac{\partial E}{\partial x_i} = \sum_{x\in D} \left( I_1 - I_0 + \nabla I_1^T u \right) \nabla I_1^T \frac{\partial u}{\partial x_i}.$$

The partial derivatives of $u$ with respect to the motion parameters are given by the Jacobian matrix

$$J_u^T = \begin{pmatrix} \dfrac{r_z\|r\|Z}{r_z + \|r\|Z t_z} & 0 \\ 0 & \dfrac{r_z\|r\|Z}{r_z + \|r\|Z t_z} \\ -\dfrac{r_z\|r\|Z\,(r_x + \|r\|Z t_x)}{(r_z + \|r\|Z t_z)^2} & -\dfrac{r_z\|r\|Z\,(r_y + \|r\|Z t_y)}{(r_z + \|r\|Z t_z)^2} \end{pmatrix}.$$
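A sketch of the general-case update, assuming per-pixel rays r and the Jacobian entries above; `flow_and_jacobian` and `descend` are hypothetical names, and the plain fixed-step descent stands in for Levenberg-Marquardt:

```python
import numpy as np

def flow_and_jacobian(r, Z, t):
    """Optical flow u (Eq. 3) and its Jacobian du/dt for one pixel ray
    r = (rx, ry, rz); returns u of shape (2,) and J of shape (2, 3)."""
    rx, ry, rz = r
    w = np.linalg.norm(r) * Z                     # ||r|| Z
    denom = rz + w * t[2]
    u = np.array([rz * (rx + w * t[0]) / denom - rx,
                  rz * (ry + w * t[1]) / denom - ry])
    J = np.array([
        [rz * w / denom, 0.0, -rz * w * (rx + w * t[0]) / denom**2],
        [0.0, rz * w / denom, -rz * w * (ry + w * t[1]) / denom**2],
    ])
    return u, J

def descend(residual0, grad_I1, rays, Z, t0, step=1e-3, iters=100):
    """Fixed-step gradient descent on E(t) = sum (I1 - I0 + grad_I1 . u(t))^2.
    residual0[i] = (I1 - I0) at pixel i; grad_I1[i] is its image gradient."""
    t = np.asarray(t0, dtype=float)
    for _ in range(iters):
        g = np.zeros(3)
        for res, gI, r, z in zip(residual0, grad_I1, rays, Z):
            u, J = flow_and_jacobian(r, z, t)
            g += 2.0 * (res + gI @ u) * (J.T @ gI)  # dE/dt per Eq. above
        t = t - step * g
    return t
```

The Jacobian rows match the entries of $J_u^T$ and can be checked against finite differences of Eq. 3.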

5. JOINT DEPTH AND EGO-MOTIONESTIMATION

For a complete depth map reconstruction from input images,we must show how to combine the depth from motion esti-mation described in Section 3 and the ego-motion estimationdescribed in Section 4. Since both parts rely on each other,it is very likely that we can combine them by performing al-ternating depth and ego-motion estimation. We find that itis best to include the alternation scheme in the multi-scaleapproach:1. At the coarsest resolution level L, we initialize Lt by zero

and LZ by some small constant. We can first solve for LZas explained in Section 3. Since the ego-motion parame-ters are zero the estimated depth map will be very flat.

2. With the flat depth map as input we estimate the motionparameters according to Section 4.

3. Given the estimated motion parameters k+1t and thedepth map k+1Z, we first estimate the optical flow u0 =#r#Z(r)tp, then we compute the depth map at level kZ.

4. From the refined depth map kZ, we compute the motionparameters kt.

5. Steps 3 and 4 are repeated until the finest resolution isreached and the final depth map 0Z is obtained.
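The five steps above can be sketched as one loop. A sketch under the paper's scale-factor-2 pyramid; `depth_step` and `motion_step` are hypothetical stand-ins for the solvers of Sections 3 and 4:

```python
import numpy as np

def joint_estimation(pyr0, pyr1, depth_step, motion_step):
    """Alternating depth / ego-motion estimation inside the multi-scale
    loop. `pyr0`/`pyr1` are image pyramids (coarsest LAST, scale factor 2);
    `depth_step(I0, I1, Z, t)` and `motion_step(I0, I1, Z)` are stand-ins
    for the solvers of Sections 3 and 4."""
    L = len(pyr0) - 1
    t = np.zeros(3)                       # step 1: zero motion ...
    Z = np.full(pyr0[L].shape, 1e-3)      # ... and a small constant depth
    Z = depth_step(pyr0[L], pyr1[L], Z, t)
    t = motion_step(pyr0[L], pyr1[L], Z)  # step 2: motion from flat depth
    for k in reversed(range(L)):          # steps 3-5: refine down to level 0
        Z = np.repeat(np.repeat(Z, 2, axis=0), 2, axis=1)  # upsample x2
        Z = depth_step(pyr0[k], pyr1[k], Z, t)
        t = motion_step(pyr0[k], pyr1[k], Z)
    return Z, t
```

The nearest-neighbour doubling assumes a strict factor-2 pyramid; any other scale factor needs a general resampler.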

6. RESULTS

In order to verify our approach, we use synthetic images of size 512x512 and ground truth depth maps generated by ray-tracing of a 3D model of a living room. We have generated multiple sequences for various types of camera translation, i.e. for movement parallel and perpendicular to the image plane as well as for linear combinations of both. The purpose is to first evaluate ego-motion estimation and depth from motion separately.

We run the ego-motion estimation with ground truth depth maps on the different sequences and we obtain the translation vector estimates as listed in Table 1. For simplicity we only show the mean and standard deviation of the normalized vectors.

We evaluate the depth from motion part by using the ground truth translation vectors as inputs. We normalize the input images, which is convenient for comparing the used parameters. In our experiments, we use 5 levels of resolution with a constant scale factor of 2 from level to level. The functional splitting parameter $\theta$ is set to 0.05. $\theta \to 0$ means approaching the true TV-L1 model and thus better accounting

linear system of equations



$b = (t_x, t_y)^T$

Gradient descent

Page 15: Fast Structure From Motion in Planar Image Sequences

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Depth Estimation - TV-L1

11

|

Figure 1: Pinhole camera model, rigid camera motion and optical flow.

The optical flow u is defined as the apparent motion of brightness patterns between two images. The central projection on the object plane parallel to the sensor plane is given by:

$$t_p = \frac{r_z}{\|r\|Z(r)} \cdot \frac{r + \|r\|Z(r)\,t}{r_z + \|r\|Z(r)\,t_z} - \frac{r}{\|r\|Z(r)}. \qquad (2)$$

Let us define Eq. 2 as the parallel projection. The optical flow can be approximated by the following projection on the sensor plane:

$$u = r_z \cdot \frac{r + \|r\|Z(r)\,t}{r_z + \|r\|Z(r)\,t_z} - r. \qquad (3)$$

Eq. 3 shows the nonlinear dependency of the estimated optical flow on the depth map Z(r) as well as on the translation tz perpendicular to the sensor plane. In this nonlinear form, it is difficult to include the projection in a variational framework. Nevertheless, combining Eqs. 2 and 3 we find a linearized relationship between the parallel projection and the optical flow:

$$u = \|r\|Z(r)\,t_p. \qquad (4)$$
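This relation follows algebraically from Eqs. 2 and 3; a one-pixel numeric check (values arbitrary) confirms that $\|r\|Z(r)\,t_p$ reproduces the sensor-plane flow of Eq. 3:

```python
import numpy as np

# Single-pixel check of Eq. 4 with arbitrary values.
r2 = np.array([0.2, -0.1])                      # (rx, ry); rz = 1.0
rz = 1.0
w = np.linalg.norm([r2[0], r2[1], rz]) * 1.5    # ||r|| Z(r) with Z(r) = 1.5
t = np.array([0.05, 0.02, 0.1])                 # camera translation

u = rz * (r2 + w * t[:2]) / (rz + w * t[2]) - r2            # Eq. 3
tp = (rz / w) * (r2 + w * t[:2]) / (rz + w * t[2]) - r2 / w  # Eq. 2
print(np.allclose(u, w * tp))                   # Eq. 4: prints True
```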

3. TV-L1 DEPTH FROM MOTION

We assume for the moment that we know the camera translation parameters t for two successive frames I0 and I1. Furthermore, we assume that the brightness does not change between those images. Using the definition of optical flow and the projection in Eq. 4, we can express the image residual $\rho(Z)$ as in [6]:

$$\rho(Z) = I_1(x + u_0) + \nabla I_1^T\left(\|r\|Z\,t_p - u_0\right) - I_0. \qquad (5)$$

A depth map Z = Z(r) can be obtained by solving the following optimization problem [2]:

$$Z^* = \arg\min_Z \sum_{x\in D} |\nabla Z| + \lambda \sum_{x\in D} |\rho(Z, I_0, I_1)|, \qquad (6)$$

Figure 2: Side view with the projections of camera motion.

where D is the discrete domain of pixels and x their position on the image. The left term in Eq. 6 represents the regularization term. Here we set it to the TV norm of Z, which imposes a sparseness constraint on Z and is edge-preserving. The right term is the data term, which we set to the image residual as defined in Eq. 5. We have chosen the robust L1 norm as it has some advantages when compared to the usually employed L2 norm [9]. Eq. 6 is not a strictly convex optimization problem and thus hard to solve. From [2] and [9] we know that a convex relaxation can be formulated:

$$Z^* = \arg\min_Z \sum_{x\in D} |\nabla Z| + \frac{1}{2\theta} \sum_{x\in D} (V - Z)^2 + \lambda \sum_{x\in D} |\rho(V)|, \qquad (7)$$

where $V$ is a close approximation of $Z$ and for $\theta \to 0$ we have $V \to Z$. We solve Eq. 7 using an alternating two-step iteration scheme:

1. For fixed $Z$, solve for $V$:

$$V^* = \arg\min_V \sum_{x\in D} \frac{1}{2\theta}(V - Z)^2 + \lambda \sum_{x\in D} |\rho(V)|. \qquad (8)$$

2. For fixed V, solve for Z:

$$Z^* = \arg\min_Z \sum_{x\in D} |\nabla Z| + \frac{1}{2\theta} \sum_{x\in D} (V - Z)^2. \qquad (9)$$

Eq. 8 can be solved by the following soft-thresholding:

$$V = Z + \begin{cases} \lambda\theta\,\|r\|\nabla I_1^T t_p & \text{if } \rho(Z) < -\lambda\theta\left(\|r\|\nabla I_1^T t_p\right)^2 \\ -\lambda\theta\,\|r\|\nabla I_1^T t_p & \text{if } \rho(Z) > \lambda\theta\left(\|r\|\nabla I_1^T t_p\right)^2 \\ -\rho(Z)/\left(\|r\|\nabla I_1^T t_p\right) & \text{if } |\rho(Z)| \le \lambda\theta\left(\|r\|\nabla I_1^T t_p\right)^2 \end{cases} \qquad (10)$$
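The thresholding step can be written directly as a per-pixel case distinction. A minimal sketch, with `b` denoting $\|r\|\nabla I_1^T t_p$ per pixel; the guard against b = 0 is our own safeguard, not part of the paper:

```python
import numpy as np

def soft_threshold_step(Z, rho, b, lam, theta):
    """TV-L1 data-term step of Eq. 10: per-pixel thresholding of the
    residual rho(Z), with b = ||r|| * grad(I1)^T t_p per pixel."""
    thr = lam * theta * b**2
    step = np.where(rho < -thr, lam * theta * b,
           np.where(rho > thr, -lam * theta * b,
                    -rho / np.where(b != 0, b, 1.0)))  # guard b == 0
    return Z + step
```

With lam = theta = b = 1, residuals outside [-1, 1] are clipped by a constant step while small residuals are cancelled exactly.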

In order to solve Eq. 9, the dual formulation of the TV norm can be exploited. It is given by $TV(Z) = \max\{p \cdot \nabla Z : \|p\| \le 1\}$. With the introduced dual variable $p$, Eq. 9 can be solved iteratively by the Chambolle algorithm [3, 2]:

$$p^{n+1} = \frac{p^n + \tau\,\nabla(\nabla\cdot p^n - V/\theta)}{1 + \tau\,|\nabla(\nabla\cdot p^n - V/\theta)|}. \qquad (11)$$
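A sketch of the Chambolle iteration with a forward-difference gradient and its adjoint divergence; the step size tau = 1/8 is Chambolle's classical convergence bound and our own choice, since the paper does not specify it here:

```python
import numpy as np

def grad(Z):
    """Forward-difference gradient with Neumann boundary (as in [3])."""
    g = np.zeros((2,) + Z.shape)
    g[0, :, :-1] = Z[:, 1:] - Z[:, :-1]
    g[1, :-1, :] = Z[1:, :] - Z[:-1, :]
    return g

def div(p):
    """Backward-difference divergence, adjoint to `grad`."""
    dx = np.zeros_like(p[0]); dy = np.zeros_like(p[1])
    dx[:, 0] = p[0][:, 0]; dx[:, 1:] = p[0][:, 1:] - p[0][:, :-1]
    dy[0, :] = p[1][0, :]; dy[1:, :] = p[1][1:, :] - p[1][:-1, :]
    return dx + dy

def chambolle(V, theta, tau=0.125, iters=50):
    """Fixed-point iteration of Eq. 11; returns Z = V - theta * div(p)."""
    p = np.zeros((2,) + V.shape)
    for _ in range(iters):
        g = grad(div(p) - V / theta)
        norm = np.sqrt(g[0]**2 + g[1]**2)   # per-pixel |grad(...)|
        p = (p + tau * g) / (1.0 + tau * norm)
    return V - theta * div(p)
```

A constant input is a fixed point (the dual variable stays zero), and on noisy input the iteration reduces the total variation, which is the intended regularizing behaviour.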


|

Page 16: Fast Structure From Motion in Planar Image Sequences

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Depth Estimation - TV-L1

11

|


|

Page 17: Fast Structure From Motion in Planar Image Sequences

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Depth Estimation - TV-L1

11

|


|

Page 18: Fast Structure From Motion in Planar Image Sequences

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Depth Estimation - TV-L1

11

Functional Splitting


|

Page 19: Fast Structure From Motion in Planar Image Sequences

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Depth Estimation - TV-L1

11

Functional Splitting

Figure 1: pinhole camera model and rigid camera motion

and optical flow. The optical flow u is defined as the appar-ent motion of brightness pattern between two images. Thecentral projection on the object plane parallel to the sensorplane is given by:

tp =rz

!r!Z(r)· r+!r!Z(r)t

rz +!r!Z(r)tz" r!r!Z(r)

. (2)

Let us define Eq. 2 as the parallel projection. The opticalflow can be approximated by the following projection on thesensor plane:

u = rz ·r+!r!Z(r)trz +!r!Z(r)tz

"r. (3)

Figure 1: pinhole camera model, rigid camera motion and optical flow.

The optical flow u is defined as the apparent motion of the brightness pattern between two images. The central projection onto the object plane parallel to the sensor plane is given by

t_p = \frac{r_z \mathbf{r} + \|\mathbf{r}\| Z(\mathbf{r})\, \mathbf{t}}{\|\mathbf{r}\| Z(\mathbf{r}) \left( r_z + \|\mathbf{r}\| Z(\mathbf{r})\, t_z \right)} - \frac{\mathbf{r}}{\|\mathbf{r}\| Z(\mathbf{r})}.  (2)

We refer to Eq. 2 as the parallel projection. The optical flow can be approximated by the following projection onto the sensor plane:

u = \frac{r_z \mathbf{r} + \|\mathbf{r}\| Z(\mathbf{r})\, \mathbf{t}}{r_z + \|\mathbf{r}\| Z(\mathbf{r})\, t_z} - \mathbf{r}.  (3)

Eq. 3 shows the nonlinear dependency of the estimated optical flow on the depth map Z(r) as well as on the translation t_z perpendicular to the sensor plane. In this nonlinear form it is difficult to include the projection in a variational framework. Nevertheless, combining Eqs. 2 and 3 we obtain a linearized relationship between the parallel projection and the optical flow:

u = \|\mathbf{r}\| Z(\mathbf{r})\, t_p.  (4)

3. TV-L1 DEPTH FROM MOTION

We assume for the moment that the camera translation parameters t are known for two successive frames I_0 and I_1, and that the brightness does not change between those images. Using the definition of optical flow and the projection of Eq. 4, we can express the image residual \rho(Z) as in [6]:

\rho(Z) = I_1(\mathbf{x} + \mathbf{u}_0) + \nabla I_1^T \left( \|\mathbf{r}\| Z\, t_p - \mathbf{u}_0 \right) - I_0.  (5)

A depth map Z = Z(r) can then be obtained by solving the following optimization problem [2]:

Z^* = \arg\min_Z \sum_{\mathbf{x} \in D} |\nabla Z| + \lambda \sum_{\mathbf{x} \in D} |\rho(Z, I_0, I_1)|,  (6)

where D is the discrete pixel domain and x the pixel position in the image. The first term of Eq. 6 is the regularization term; we set it to the TV norm of Z, which imposes a sparseness constraint on the gradient of Z and preserves edges. The second term is the data term, which we set to the image residual defined in Eq. 5. We choose the robust L1 norm for the data term because of its advantages over the commonly used L2 norm [9]. Eq. 6 is not a strictly convex optimization problem and is therefore hard to solve. From [2] and [9] we know that a convex relaxation can be formulated:

Z^* = \arg\min_Z \sum_{\mathbf{x} \in D} |\nabla Z| + \frac{1}{2\theta} \sum_{\mathbf{x} \in D} (V - Z)^2 + \lambda \sum_{\mathbf{x} \in D} |\rho(V)|,  (7)

where V is a close approximation of Z; for \theta \to 0 we have V \to Z. We solve Eq. 7 with an alternating two-step iteration scheme:

1. For fixed Z, solve for V:

V^* = \arg\min_V \sum_{\mathbf{x} \in D} \frac{1}{2\theta} (V - Z)^2 + \lambda \sum_{\mathbf{x} \in D} |\rho(V)|.  (8)

2. For fixed V, solve for Z:

Z^* = \arg\min_Z \sum_{\mathbf{x} \in D} |\nabla Z| + \frac{1}{2\theta} \sum_{\mathbf{x} \in D} (V - Z)^2.  (9)

Eq. 8 is solved by the following soft-thresholding:

V = Z + \begin{cases}
\lambda\theta\, \|\mathbf{r}\| \nabla I_1^T t_p & \text{if } \rho(Z) < -\lambda\theta \left( \|\mathbf{r}\| \nabla I_1^T t_p \right)^2 \\
-\lambda\theta\, \|\mathbf{r}\| \nabla I_1^T t_p & \text{if } \rho(Z) > \lambda\theta \left( \|\mathbf{r}\| \nabla I_1^T t_p \right)^2 \\
-\rho(Z) / \left( \|\mathbf{r}\| \nabla I_1^T t_p \right) & \text{if } |\rho(Z)| \le \lambda\theta \left( \|\mathbf{r}\| \nabla I_1^T t_p \right)^2.
\end{cases}  (10)

To solve Eq. 9, the dual formulation of the TV norm can be exploited: TV(Z) = \max \{ \mathbf{p} \cdot \nabla Z : \|\mathbf{p}\| \le 1 \}. With the dual variable p, Eq. 9 is solved iteratively by the Chambolle algorithm [3, 2]:

\mathbf{p}^{n+1} = \frac{\mathbf{p}^n + \tau \nabla (\nabla \cdot \mathbf{p}^n - V/\theta)}{1 + \tau \left| \nabla (\nabla \cdot \mathbf{p}^n - V/\theta) \right|}.  (11)

In the discrete domain, the stability and the properties of the solution depend on the implementation of the differential operators. In Eq. 11, \nabla denotes the discrete gradient operator and \nabla \cdot the discrete divergence operator, as defined in [3]. From Eq. 11, the depth map is recovered as Z = V - \theta \nabla \cdot \mathbf{p}. Furthermore, the depth positivity constraint has to be imposed on the recovered depth map: if Z(r) < 0 we set Z(r) \leftarrow 0. To ensure global convergence and to handle different levels of detail in the depth map Z, we propose solving Eq. 7 with a multi-scale approach, using downsampled images {}^k I_0 and {}^k I_1 of L different sizes.

Figure 2: side view with the projections of camera motion.
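As a pointwise illustration of the linearized data term, the residual of Eq. 5 can be evaluated around u_0 = 0, where it reduces to rho(Z) = I_1 + grad(I_1)^T (||r|| Z t_p) - I_0. A minimal numpy sketch; the field names tpy, tpx (components of t_p) and r_norm (per-pixel ||r||) are hypothetical, not from the source:

```python
import numpy as np

def residual(I0, I1, Z, tpy, tpx, r_norm):
    """Linearized image residual rho(Z) of Eq. 5, evaluated around u0 = 0.

    tpy, tpx : components of the parallel projection t_p (hypothetical names)
    r_norm   : per-pixel ||r||
    """
    gy, gx = np.gradient(I1)       # image gradient of I1 (rows, cols)
    uy = r_norm * Z * tpy          # linearized optical flow, Eq. 4
    ux = r_norm * Z * tpx
    return I1 + gy * uy + gx * ux - I0
```

For Z = 0 the linearized flow vanishes and rho reduces to the plain frame difference I_1 - I_0.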


Fixed Z:


Page 20: Fast Structure From Motion in Planar Image Sequences

ÉCOLE POLYTECHNIQUEFÉDÉRALE DE LAUSANNE

Depth Estimation - TV-L1

11

Functional Splitting


Fixed Z:
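The fixed-Z step is the pointwise soft-thresholding of Eq. 10. A hedged numpy sketch, where b stands for the scalar field ||r|| grad(I_1)^T t_p (the derivative of the residual with respect to Z; the name b is an assumption for brevity):

```python
import numpy as np

def threshold_step(Z, rho_Z, b, lam, theta):
    """Soft-thresholding of Eq. 10: pointwise minimizer of Eq. 8 for fixed Z.

    b = ||r|| * grad(I1)^T t_p per pixel (assumed name).
    """
    t = lam * theta * b ** 2                     # per-pixel threshold
    safe_b = np.where(np.abs(b) > 1e-12, b, 1.0) # avoid dividing by zero
    step = np.where(rho_Z < -t, lam * theta * b,
           np.where(rho_Z > t, -lam * theta * b,
                    -rho_Z / safe_b))
    # where the image gradient vanishes (b ~ 0), leave Z unchanged
    return Z + np.where(np.abs(b) > 1e-12, step, 0.0)
```

In the inner case the update exactly cancels the linearized residual: rho(V) = rho(Z) + b (V - Z) = 0.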


Fixed V:
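The fixed-V step (Eq. 9) is a TV-L2 denoising of V, solved by the dual iteration of Eq. 11 with Chambolle's forward-difference gradient and matching divergence [3]. A minimal numpy sketch; the step size tau = 0.125 and the iteration count are illustrative choices, not values from the source:

```python
import numpy as np

def grad(z):
    """Forward-difference gradient with Neumann boundary."""
    gy = np.zeros_like(z)
    gx = np.zeros_like(z)
    gy[:-1, :] = z[1:, :] - z[:-1, :]
    gx[:, :-1] = z[:, 1:] - z[:, :-1]
    return gy, gx

def div(py, px):
    """Discrete divergence, the negative adjoint of grad."""
    d = np.zeros_like(py)
    d[0, :] += py[0, :]
    d[1:-1, :] += py[1:-1, :] - py[:-2, :]
    d[-1, :] -= py[-2, :]
    d[:, 0] += px[:, 0]
    d[:, 1:-1] += px[:, 1:-1] - px[:, :-2]
    d[:, -1] -= px[:, -2]
    return d

def tv_l2_denoise(V, theta, tau=0.125, n_iter=100):
    """Chambolle iteration of Eq. 11; recovers Z = V - theta * div(p),
    then clamps to enforce the depth positivity constraint."""
    py = np.zeros_like(V)
    px = np.zeros_like(V)
    for _ in range(n_iter):
        gy, gx = grad(div(py, px) - V / theta)
        denom = 1.0 + tau * np.sqrt(gy ** 2 + gx ** 2)
        py = (py + tau * gy) / denom
        px = (px + tau * gx) / denom
    Z = V - theta * div(py, px)
    return np.maximum(Z, 0.0)  # if Z(r) < 0, set Z(r) <- 0
```

Since Z = V attains objective value TV(V) in Eq. 9, the minimizer always satisfies TV(Z) <= TV(V): the step can only smooth the depth map.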

Figure 1: pinhole camera model and rigid camera motion

and optical flow. The optical flow u is defined as the appar-ent motion of brightness pattern between two images. Thecentral projection on the object plane parallel to the sensorplane is given by:

tp =rz

!r!Z(r)· r+!r!Z(r)t

rz +!r!Z(r)tz" r!r!Z(r)

. (2)

Let us define Eq. 2 as the parallel projection. The opticalflow can be approximated by the following projection on thesensor plane:

u = rz ·r+!r!Z(r)trz +!r!Z(r)tz

"r. (3)

Eq.3 shows the nonlinear dependency of the estimated opti-cal flow on the depth map Z(r) as well as on the translation tzperpendicular to the sensor plane. In this nonlinear form, it isdifficult to include the projection in a variational framework.Nevertheless, combining Eqs. 2 and 3 we find a linearizedrelationship between the parallel projection and the opticalflow:

u = !r!Z(r)tp. (4)

3. TV-L1 DEPTH FROM MOTIONWe assume for the moment that we know the camera trans-lation parameters t for two successive frames I0 and I1. Fur-thermore, we assume that the brightness does not change be-tween those images. Using the definition of optical flow andthe projection in Eq. 4, we can express the image residual!(Z) as in [6]:

!(Z) = I1(x+u0)+!IT1 (!r!Ztp"u0)" I0. (5)

A depth map Z = Z(r) can be obtained by solving the fol-lowing optimization problem [2]:

Z# = argminZ "

x$D|!Z|+" "

x$D!(Z, I0, I1), (6)

where D is the discrete domain of pixels and x their positionon the image. The left term in Eq. 6 represents the regular-ization term. Here we set it to the TV norm of Z which im-poses a sparseness constraint on Z and acts edge-preserving.The right term is the data term which we set to the imageresidual as defined in Eq. 5. We have chosen the robust L1norm as it has some advantages when compared to the usu-ally employed L2 norm [9]. Eq. 6 is not a strictly convex

Figure 2: side view with the projections of camera motion

optimization problem and thus hard to solve. From [2] and[9] we know that a convex relaxation can be formulated:

Z# = argminZ "

x$D|!Z|+ 1

2# "x$D

(V "Z)2 +" "x$D

|!(V )|,

(7)where V is a close approximation of Z and for # % 0 wehave V % Z. We solve Eq. 7 using an alternative two-stepiteration scheme:1. For fixed Z, solve for V:

V # = argminV "

x$D

12#

(V "Z)2 +" "x$D

|!(V )|. (8)

2. For fixed V, solve for Z:

Z# = argminZ "

x$D|!Z|+ 1

2# "x$D

(V "Z)2. (9)

Eq. 8 can be solved by the following soft-thresholding:

V = Z+

!"#

"$

"#!r!!IT1 tp if !(Z) <""#(!r!!IT

1 tp)2

""#!r!!IT1 tp if !(Z) > "#(!r!!IT

1 tp)2

!(Z)!r!!IT

1 tpif |!(Z)|& "#(!r!!IT

1 tp)2.

(10)In order to solve Eq. 9, the dual formulation of the TV normcan be exploited. It is given by: TV (Z) = max{p · !Z :!p! & 1}. With the introduced dual variable p, Eq. 9 canbe solved iteratively by the Chambolle algorithm [3, 2]:

pn+1 =pn + $!(! ·pn"V/#)1+ $!(! ·pn"V/#)

. (11)

In the discrete domain the stability and properties of the so-lution depends on the implementation of the differential op-erators. In Eq. 11, ! represents the discrete gradient operatorand the scalar product with ! represents the discrete diver-gence operator as defined in [3]. From Eq. 11, the depthmap can be recovered by Z = V "#! ·p. Furthermore, thedepth positivity constraint has to be imposed on the recov-ered depth map, i.e. if Z(r) < 0 we set Z(r)' 0. In orderto provide global convergence and to handle different levelsof detail in the depth map Z we propose solving Eq. 7 us-ing a multi-scale resolution approach. This means that weuse downsampled images kI0 and kI1 of L different sizes (the

Figure 1: pinhole camera model and rigid camera motion

and optical flow. The optical flow u is defined as the appar-ent motion of brightness pattern between two images. Thecentral projection on the object plane parallel to the sensorplane is given by:

tp =rz

!r!Z(r)· r+!r!Z(r)t

rz +!r!Z(r)tz" r!r!Z(r)

. (2)

Let us define Eq. 2 as the parallel projection. The opticalflow can be approximated by the following projection on thesensor plane:

u = rz ·r+!r!Z(r)trz +!r!Z(r)tz

"r. (3)

Eq.3 shows the nonlinear dependency of the estimated opti-cal flow on the depth map Z(r) as well as on the translation tzperpendicular to the sensor plane. In this nonlinear form, it isdifficult to include the projection in a variational framework.Nevertheless, combining Eqs. 2 and 3 we find a linearizedrelationship between the parallel projection and the opticalflow:

u = !r!Z(r)tp. (4)

3. TV-L1 DEPTH FROM MOTIONWe assume for the moment that we know the camera trans-lation parameters t for two successive frames I0 and I1. Fur-thermore, we assume that the brightness does not change be-tween those images. Using the definition of optical flow andthe projection in Eq. 4, we can express the image residual!(Z) as in [6]:

!(Z) = I1(x+u0)+!IT1 (!r!Ztp"u0)" I0. (5)

A depth map Z = Z(r) can be obtained by solving the fol-lowing optimization problem [2]:

Z# = argminZ "

x$D|!Z|+" "

x$D!(Z, I0, I1), (6)

where D is the discrete domain of pixels and x their positionon the image. The left term in Eq. 6 represents the regular-ization term. Here we set it to the TV norm of Z which im-poses a sparseness constraint on Z and acts edge-preserving.The right term is the data term which we set to the imageresidual as defined in Eq. 5. We have chosen the robust L1norm as it has some advantages when compared to the usu-ally employed L2 norm [9]. Eq. 6 is not a strictly convex

Figure 2: side view with the projections of camera motion

optimization problem and thus hard to solve. From [2] and[9] we know that a convex relaxation can be formulated:

Z^* = \arg\min_Z \sum_{x \in D} |\nabla Z| + \frac{1}{2\theta} \sum_{x \in D} (V - Z)^2 + \lambda \sum_{x \in D} |\rho(V)|,   (7)

where V is a close approximation of Z, and for \theta \to 0 we have V \to Z. We solve Eq. 7 using an alternating two-step iteration scheme:

1. For fixed Z, solve for V:

   V^* = \arg\min_V \sum_{x \in D} \frac{1}{2\theta} (V - Z)^2 + \lambda \sum_{x \in D} |\rho(V)|.   (8)

2. For fixed V, solve for Z:

   Z^* = \arg\min_Z \sum_{x \in D} |\nabla Z| + \frac{1}{2\theta} \sum_{x \in D} (V - Z)^2.   (9)

Eq. 8 can be solved by the following soft-thresholding:

V = Z + \begin{cases}
  \lambda \theta\, \|r\| \nabla I_1^T t_p & \text{if } \rho(Z) < -\lambda \theta (\|r\| \nabla I_1^T t_p)^2 \\
  -\lambda \theta\, \|r\| \nabla I_1^T t_p & \text{if } \rho(Z) > \lambda \theta (\|r\| \nabla I_1^T t_p)^2 \\
  -\rho(Z) / (\|r\| \nabla I_1^T t_p) & \text{if } |\rho(Z)| \le \lambda \theta (\|r\| \nabla I_1^T t_p)^2.
\end{cases}   (10)

In order to solve Eq. 9, the dual formulation of the TV norm can be exploited. It is given by TV(Z) = \max \{ p \cdot \nabla Z : \|p\| \le 1 \}. With the introduced dual variable p, Eq. 9 can be solved iteratively by the Chambolle algorithm [3, 2]:

p^{n+1} = \frac{p^n + \tau\, \nabla(\nabla \cdot p^n - V/\theta)}{1 + \tau\, |\nabla(\nabla \cdot p^n - V/\theta)|}.   (11)

In the discrete domain, the stability and properties of the solution depend on the implementation of the differential operators. In Eq. 11, \nabla represents the discrete gradient operator, and the scalar product with \nabla represents the discrete divergence operator as defined in [3]. From Eq. 11, the depth map can be recovered by Z = V - \theta\, \nabla \cdot p. Furthermore, the depth positivity constraint has to be imposed on the recovered depth map, i.e. if Z(r) < 0 we set Z(r) := 0. In order to provide global convergence and to handle different levels of detail in the depth map Z, we propose solving Eq. 7 using a multi-scale resolution approach. This means that we use downsampled images ^k I_0 and ^k I_1 of L different sizes (the
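Step 1 (Eq. 8) reduces to the pointwise soft-thresholding of Eq. 10, followed by the depth positivity clamp. A minimal NumPy sketch for illustration (the function names and the eps guard are assumptions, not the paper's code; d stands for ‖r‖∇I₁ᵀt_p):

```python
import numpy as np

def soft_threshold_step(Z, rho, d, lam, theta, eps=1e-8):
    """Eq. 10: pointwise update V = Z + shrinkage of the residual rho.

    Z, rho, d are per-pixel arrays; d plays the role of the residual's
    derivative ||r|| * grad(I1)^T t_p with respect to Z.
    """
    d2 = d * d + eps                        # guard against division by zero
    V = Z - rho * d / d2                    # case |rho| <= lam*theta*d^2
    V = np.where(rho < -lam * theta * d2, Z + lam * theta * d, V)
    V = np.where(rho > lam * theta * d2, Z - lam * theta * d, V)
    return V

def enforce_positivity(Z):
    """Depth positivity constraint: if Z(r) < 0, set Z(r) to 0."""
    return np.maximum(Z, 0.0)
```

Pixels with a small residual take a Newton-like step -rho/d, while large residuals are clipped to ±lam*theta*d, which is what makes the L1 data term robust to outliers.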

TV-L2 denoising scheme


Chambolle projection algorithm

Page 21: Fast Structure From Motion in Planar Image Sequences


Joint Depth and Ego-motion Estimation


Page 22: Fast Structure From Motion in Planar Image Sequences


Joint Depth and Ego-motion Estimation


Multiple images / video sequence:
‣ propagate the solution
‣ initialize motion parameters with zero

Page 23: Fast Structure From Motion in Planar Image Sequences


GPU Implementation


Image processing on the GPU:

‣ Host : I/O + render calls

‣ Device: parallel processing per pixel by fragment shading kernels

• Fixed number of iterations and parameters
• Non-optimized code
• We reach a performance of 5 fps

Page 24: Fast Structure From Motion in Planar Image Sequences


Results


(the synthetic room sequence – camera moving left)

Page 25: Fast Structure From Motion in Planar Image Sequences


Results


(Figure: estimated depth maps vs. ground truth (GT) for camera movement in x, in y, in z, and in x+y+z; MSE values: 4.5·10^-4, 2.7·10^-4, 3.1·10^-4, 3·10^-4.)

Depth from motion

Page 26: Fast Structure From Motion in Planar Image Sequences



(Figure: estimated depth maps vs. ground truth (GT) for movement in x, x, z, z; MSE values: 6.2·10^-4, 10.1·10^-4, 4.1·10^-4, 3.9·10^-4.)

Results

Joint ego-motion and depth estimation

Page 27: Fast Structure From Motion in Planar Image Sequences


Conclusions and Future Perspectives

We propose a new SfM algorithm based on the TV-L1 model:
- Joint estimation of motion parameters and dense depth map
- Fast and highly parallelizable
- (Almost) real-time implementation (5 fps) based on GLSL

Limitations:
- No camera rotation
- Tested only on synthetic images

Future work:
- Include rotation
- Include a robust tracker for the ego-motion


Page 28: Fast Structure From Motion in Planar Image Sequences


Thank you for your attention

Questions?
