
Variational Light Field Analysis for Disparity Estimation and Super-Resolution

Sven Wanner, Member, IEEE, and Bastian Goldluecke, Member, IEEE

Abstract—We develop a continuous framework for the analysis of 4D light fields, and describe novel variational methods for disparity reconstruction as well as spatial and angular super-resolution. Disparity maps are estimated locally using epipolar plane image analysis without the need for expensive matching cost minimization. The method works fast and with inherent subpixel accuracy, since no discretization of the disparity space is necessary. In a variational framework, we employ the disparity maps to generate super-resolved novel views of a scene, which corresponds to increasing the sampling rate of the 4D light field in spatial as well as angular direction. In contrast to previous work, we formulate the problem of view synthesis as a continuous inverse problem, which allows us to correctly take into account foreshortening effects caused by scene geometry transformations. All optimization problems are solved with state-of-the-art convex relaxation techniques. We test our algorithms on a number of real-world examples as well as our new benchmark dataset for light fields, and compare results to a multiview stereo method. The proposed method is both faster and more accurate. Data sets and source code are provided online for additional evaluation.

Index Terms—Light fields, epipolar plane images, 3D reconstruction, super-resolution, view interpolation, variational methods

1 INTRODUCTION

The 4D light field has been established as a promising paradigm to describe the visual appearance of a scene. Compared to a traditional 2D image, it offers information about not only the accumulated intensity at each image point, but separate intensity values for each ray direction. Thus, the light field implicitly captures 3D scene geometry and reflectance properties.

The additional information inherent in a light field allows a wide range of applications. Popular in computer graphics, for example, is light field rendering, where the scene is displayed from a virtual viewpoint [26], [21]. The light field data also allows one to add effects like synthetic aperture, i.e. virtual refocusing of the camera, stereoscopic display, and automatic glare reduction, as well as object insertion and removal [18], [16], [10]. As we will exploit in this work, the continuous disparity space also admits non-traditional approaches to the multiview stereo problem that do not rely on feature matching [7], [35].

However, in practice, it used to be difficult to achieve a dense enough sampling of the full light field. Expensive custom-made hardware was designed to be able to acquire several views of a scene. Straightforward but hardware-intensive are camera arrays [31]. Somewhat more practical and less expensive is a gantry construction consisting of a single moving camera [19], [30], which is restricted to static scenes. Recently, however, the first commercial plenoptic cameras have become available on the market.

• S. Wanner and B. Goldluecke are with the Heidelberg Collaboratory for Image Processing (HCI), Heidelberg, Germany.

Using an array of microlenses, a single one of these cameras essentially captures a full array of views simultaneously. This makes such cameras very attractive for a number of industrial applications, in particular depth estimation and surface inspection, and they can also acquire video streams of dynamic scenes [11], [22], [23].

However, plenoptic cameras usually have to deal with a trade-off between spatial and angular resolution. Since the total sensor resolution is limited, one can either opt for a dense sampling in the spatial (image) domain with sparse sampling in the angular (view point) domain [23], or vice versa [22], [6], [11]. Increasing angular resolution is therefore a paramount goal if one wants to make efficient use of plenoptic cameras. It is equivalent to the synthesis of novel views from new viewpoints, which has also been a prominent research topic in the computer graphics community [21], [19].

Beyond the need for super-resolution, there is a growing demand for efficient and robust algorithms which reconstruct information directly from light fields. However, while there has been a lot of work on, for example, stereo and optical flow algorithms for traditional image pairs, there is a lack of similar modern methods which are specifically tailored to the rich structure inherent in a light field. Furthermore, much of the existing analysis is local in nature, and does not enforce global consistency of results.


Contributions. In this paper, we first address the problem of disparity estimation in light fields, where we introduce a novel local data term tailored to the continuous structure of light fields. The proposed method can locally obtain robust results very fast, without having to quantize the disparity space into discrete disparity values. The local results can further be integrated into globally consistent depth maps using state-of-the-art labeling schemes based on convex relaxation methods [24], [29].

In this way, we obtain an accurate geometry estimate with subpixel precision matching, which we can leverage to simultaneously address the problems of spatial and angular super-resolution. Following state-of-the-art spatial super-resolution research in computer vision [25], [14], we formulate a variational inverse problem whose solution is the synthesized super-resolved novel view. As we work in a continuous setting, we can for the first time correctly take into account foreshortening effects caused by the scene geometry.

The present work unifies and extends our previous conference publications [35] and [36]. As additional contributions, we focus in depth on interactive techniques applicable in practice, and analyze parameter choices, scope and limitations of our method in detailed experiments. To facilitate speed, we provide a new fast method to combine local epipolar plane image disparity estimates into a global disparity map. Furthermore, we give an extensive comparison to a multiview stereo method on our new light field benchmark dataset.

Advantages and limitations of the method. Our method exploits the fact that in a light field with densely sampled view points, derivatives of the intensity can be computed with respect to the view point location. The most striking difference to standard disparity estimation is that at no point do we actually compute stereo correspondences in the usual sense - we never try to match pixels at different locations. As a consequence, the run-time of our method is completely independent of the desired disparity resolution, and we beat the other methods we have compared against in terms of speed, while maintaining on average similar or better accuracy. Typical failure cases (just as for stereo methods) are regions with strong specular highlights or regions devoid of any texture. By design, there are no problems with repetitive structures due to the local nature of the slope estimation.

However, as we will explore in experiments, the sampling of view points must be sufficiently dense such that disparities between neighbouring views are less than around two pixels to achieve reasonable accuracy. Furthermore, for optimal results it is also recommended that the view points form a two-dimensional rectangular grid, although in principle a one-dimensional line of view points is sufficient.

Fig. 1: Our novel paradigm for depth reconstruction in a light field allows us to efficiently estimate sub-pixel accurate depth maps for all views, which can be used for light field super-resolution.

In particular, this is not the case for established data sets like the Middlebury stereo benchmark, which has far too large disparities and therefore is beyond the design scope of our algorithm. For this reason, we evaluate on our own set of synthetic benchmarks, which more closely resembles the data from a plenoptic camera and is more representative of what we consider the advantages of light field data compared to traditional multi-view input. It is available online together with complete source code to reproduce all experiments [33], [13].

2 RELATED WORK

The concept of light fields originated mainly in computer graphics, where image based rendering [27] is a common technique to render new views from a set of images of a scene. Adelson and Bergen [1] as well as McMillan and Bishop [21] treated view interpolation as a reconstruction of the plenoptic function. This function is defined on a seven-dimensional space and describes the entire information about light emitted by a scene, storing an intensity value for every 3D point, direction, wavelength and time. A dimensionality reduction of the plenoptic function to 4D, the so-called Lumigraph, was introduced by Gortler et al. [15] and Levoy and Hanrahan [19]. In their parametrization, each ray is determined by its intersections with two planes.

A main benefit of light fields compared to traditional images or stereo pairs is the expansion of the disparity space to a continuous space. This becomes apparent when considering epipolar plane images (EPIs), which can be viewed as 2D slices of constant angular and spatial coordinate through the Lumigraph. Due to a dense sampling in angular direction, corresponding pixels are projected onto lines in EPIs, which can be more robustly detected than point correspondences.

Geometry estimation in a 4D light field. One of the first approaches using EPIs to analyze scene geometry was published by Bolles et al. [7]. They detect edges, peaks and troughs with a subsequent line fitting in the EPI to reconstruct 3D structure. Another approach is presented by Criminisi et al. [10], who use an iterative extraction procedure for collections of EPI-lines of the same depth, which they call an EPI-tube. Lines belonging to the same tube are detected via shearing the EPI and analyzing photo-consistency in the vertical direction. They also propose a procedure to remove specular highlights from already extracted EPI-tubes.

There are also two less heuristic methods which work in an energy minimization framework. In Matousek et al. [20], a cost function is formulated to minimize a weighted path length between points in the first and the last row of an EPI, preferring constant intensity in a small neighborhood of each EPI-line. However, their method only works in the absence of occlusions. Berent et al. [4] deal with the simultaneous segmentation of EPI-tubes by a region competition method using active contours, imposing geometric properties to enforce correct occlusion ordering.

As a novelty compared to previous work, we suggest employing the structure tensor of an EPI to obtain a fast and robust local disparity estimation. Furthermore, we enforce globally consistent visibility across views by restricting the spatial layout of the labeled regions. Compared to methods which extract EPI-tubes sequentially [7], [10], this is independent of the order of extraction and does not suffer from an associated propagation of errors. While a simultaneous extraction is also performed in [4], they use a level set approach, which makes it expensive and cumbersome to deal with a large number of regions.

Spatial and angular super-resolution. Super-resolving the four dimensions in a light field amounts to generating high-resolution novel views of the scene which have not originally been captured by the cameras. A collection of images of a scene can be interpreted as a sparse sampling of the plenoptic function. Consequently, image-based rendering approaches [27] treat the creation of novel views as a resampling problem, circumventing the need for any explicit geometry reconstruction [21], [19], [17]. However, this approach ignores occlusion effects, and therefore is only really suitable for synthesis of views reasonably close to the original ones.

Overall, it quickly became clear that one faces a trade-off, and interpolation of novel views in sufficient quality requires either an unreasonably dense sampling or knowledge about the scene [8]. A different line of approaches to light field rendering therefore tries to infer at least some geometric knowledge about the scene. They usually rely on image registration, for example via robust feature detection and tracking [28], or view-dependent depth map estimation based on color consistency [12].

The creation of super-resolved images requires subpixel-accurate registration of the input images. Approaches which are based on pure 2D image registration [25] are unsuitable for the generation of novel views, since a reference image for computing the motion is not available yet. Super-resolved depth maps and images are inferred in [6] with a discrete super-resolution model tailored to a particular plenoptic camera. A full geometric model with a super-resolved texture map is estimated in [14] for scenes captured with a surround camera setup.

(a) Light field geometry   (b) Pinhole view at (s∗, t∗) and epipolar plane image Sy∗,t∗

Fig. 2: Each camera location (s∗, t∗) in the image plane Π yields a different pinhole view of the scene. By fixing a horizontal line of constant y∗ in the image plane and a constant camera coordinate t∗, one obtains an epipolar plane image (EPI) in (x, s) coordinates. A scene point P is projected onto a line in the EPI due to a linear correspondence between its s- and projected x-coordinate, see figure (a) and equation (3).

Our approach is mathematically closely related to the latter, since [14] is also based on continuous geometry which leads to correct point-wise weighting of the energy gradient contributions. However, we do not perform expensive computation of a global model and texture atlas, but instead compute the target view directly.

3 4D LIGHT FIELD STRUCTURE

Several ways to represent light fields have been proposed. In this paper, we adopt the light field parametrization from early works in motion analysis [7]. One way to look at a 4D light field is to consider it as a collection of pinhole views from several view points parallel to a common image plane, figure 2. The 2D plane Π contains the focal points of the views, which we parametrize by the coordinates (s, t), and the image plane Ω is parametrized by the coordinates (x, y). A 4D light field or Lumigraph is a map

L : \Omega \times \Pi \to \mathbb{R}, \quad (x, y, s, t) \mapsto L(x, y, s, t). \qquad (1)

It can be viewed as an assignment of an intensity value to the ray passing through (x, y) ∈ Ω and (s, t) ∈ Π.

For the problem of estimating 3D structure, we consider the structure of the light field, in particular on 2D slices through the field.


(a) Allowed   (b) Forbidden

Fig. 3: Global labeling constraints on an EPI: if depth λi is less than λj and corresponds to direction ni, then the transition from λi to λj is only allowed in a direction orthogonal to ni in order not to violate occlusion ordering.

We fix a horizontal line of constant y∗ in the image plane and a constant camera coordinate t∗, and restrict the light field to an (x, s)-slice Σy∗,t∗. The resulting map is called an epipolar plane image (EPI),

S_{y^*,t^*} : \Sigma_{y^*,t^*} \to \mathbb{R}, \quad (x, s) \mapsto S_{y^*,t^*}(x, s) := L(x, y^*, s, t^*). \qquad (2)
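To make the slicing concrete, the following minimal sketch (in Python with numpy, an implementation choice and not part of the paper) extracts horizontal and vertical EPIs from a 4D light field stored as an array; the array name lf and the index order lf[s, t, y, x] are assumptions for illustration.

```python
import numpy as np

def horizontal_epi(lf, y_star, t_star):
    """Horizontal EPI S_{y*,t*}: fix image row y* and view row t*, vary (x, s)."""
    return lf[:, t_star, y_star, :]      # shape (num_s, width), axes (s, x)

def vertical_epi(lf, x_star, s_star):
    """Vertical EPI S_{x*,s*}: fix image column x* and view column s*, vary (y, t)."""
    return lf[s_star, :, :, x_star]      # shape (num_t, height), axes (t, y)

# Example: a synthetic 9x9 light field of 768x768 grayscale views.
lf = np.random.rand(9, 9, 768, 768).astype(np.float32)
epi = horizontal_epi(lf, y_star=384, t_star=4)
```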

Let us consider the geometry of this map, figure 2. A point P = (X, Y, Z) within the epipolar plane corresponding to the slice projects to a point in Ω depending on the chosen camera center in Π. If we vary s, the coordinate x changes according to [7]

\Delta x = \frac{f}{Z}\,\Delta s, \qquad (3)

where f is the distance between the parallel planes. Note that to obtain this formula from figure 2, ∆x has to be corrected by the translation ∆s to account for the different local coordinate systems of the views.

Interestingly, a point in 3D space is thus projected onto a line in Σy∗,t∗, where the slope of the line is related to its depth. This means that the intensity of the light field should not change along such a line, provided that the objects in the scene are Lambertian. Thus, computing depth is essentially equivalent to computing the slope of level lines in the epipolar plane images. Of course, this is a well-known fact, which has already been used for depth reconstruction in previous works [7], [10]. In the next section, we describe and evaluate our novel approach for obtaining consistent slope estimates.

4 DISPARITY ESTIMATION

The basic idea of our approach is as follows. We first compute local slope estimates on epipolar plane images for the two different slice directions using the structure tensor. This gives two local disparity estimates for each pixel in each view. These can be merged into a single disparity map in two different ways: just locally choosing the estimate with the higher reliability, optionally smoothing the result (which is very fast), or solving a global optimization problem (which is slow). In the experiments, we will show that fortunately, the fast approach leads to estimates which are even slightly more accurate.

(a) Typical epipolar plane image Sy∗,t∗   (b) Noisy local depth estimate   (c) Consistent depth estimate after optimization

Fig. 4: With the consistent labeling scheme described in section 4.1, one can enforce global visibility constraints in order to improve the depth estimates for each epipolar plane image.

Obviously, our approach does not use the full 4D light field information around a ray to obtain the local estimates - we just work on two different 2D cuts through this space. The main reason is performance: in order to be able to achieve close to interactive speeds, which is necessary for most practical applications, the amount of data which is used locally must be kept to a minimum. Moreover, in experiments with a multi-view stereo method, it turns out that using all of the views for the local estimate, as opposed to only the views in the two epipolar plane images, does not lead to overall more accurate estimates. While it is true that the local data term becomes slightly better, the result after optimization is the same. A likely reason is that the optimization or smoothing step propagates the information across the view.

4.1 Disparities on epipolar plane images

a) Local disparity estimation on an EPI. We first consider how we can estimate the local direction of a line at a point (x, s) in an epipolar plane image Sy∗,t∗, where y∗ and t∗ are fixed. The case of vertical slices is analogous. The goal of this step is to compute a local disparity estimate dy∗,t∗(x, s) for each point of the slice domain, as well as a reliability estimate ry∗,t∗(x, s) ∈ [0, 1], which gives a measure of how reliable the local disparity estimate is. Both local estimates will be used in subsequent sections to obtain a consistent disparity map in a global optimization framework.

In order to obtain the local disparity estimate, we need to estimate the direction of lines on the slice. This is done using the structure tensor J of the epipolar plane image S = Sy∗,t∗,

J = \begin{bmatrix} G_\sigma * (S_x S_x) & G_\sigma * (S_x S_y) \\ G_\sigma * (S_x S_y) & G_\sigma * (S_y S_y) \end{bmatrix} = \begin{bmatrix} J_{xx} & J_{xy} \\ J_{xy} & J_{yy} \end{bmatrix}. \qquad (4)

Here, Gσ represents a Gaussian smoothing operator at an outer scale σ, and Sx, Sy denote the gradient components of S calculated on an inner scale ρ.


(a) optimal structure tensor parameters   (b) grid search example on ConeHead 9x9 (768x768)

Fig. 5: Using grid search, we find the ideal structure tensor parameters over a range of both angular and spatial resolutions (a). Blue data points show the optimal outer scale, red data points the optimal inner scale, respectively. The thick streaks are added only for visual orientation. In (b) an example of a single grid search is depicted. Color-coded is the amount of pixels with a relative error to the ground truth of less than 1%, which is the target value to be optimized for in (a).

The direction of the local level lines can then be computed via [5]

n_{y^*,t^*} = \begin{bmatrix} \Delta x \\ \Delta s \end{bmatrix} = \begin{bmatrix} \sin\varphi \\ \cos\varphi \end{bmatrix} \quad \text{with} \quad \varphi = \frac{1}{2}\arctan\!\left(\frac{J_{yy} - J_{xx}}{2\,J_{xy}}\right), \qquad (5)

from which we derive the local depth estimate via equation (3) as

Z = -f\,\frac{\Delta s}{\Delta x}. \qquad (6)

Frequently, a more convenient unit is the disparity d_{y^*,t^*} = f/Z = \Delta x / \Delta s = \tan\varphi, which describes the pixel shift of a scene point when moving between the views. We will usually use disparity instead of depth in the remainder of the paper. As the natural reliability measure we use the coherence of the structure tensor [5],

r_{y^*,t^*} := \frac{(J_{yy} - J_{xx})^2 + 4 J_{xy}^2}{(J_{xx} + J_{yy})^2}. \qquad (7)
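As an illustration of this local estimation step, the sketch below computes a per-pixel disparity and reliability map for a single EPI. It is a hedged example, not the paper's GPU implementation: scipy Gaussian filters stand in for the inner and outer smoothing, the level-line direction is taken as the structure tensor eigenvector with the smallest eigenvalue (rather than the closed-form angle of eq. (5)), and sign and axis conventions may need to be adapted to the layout of a particular light field.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def epi_disparity(epi, inner_scale=0.8, outer_scale=2.0):
    """Local disparity and reliability for one EPI, indexed as epi[s, x]."""
    S = gaussian_filter(epi, inner_scale)          # pre-smoothing (inner scale)
    Sx = np.gradient(S, axis=1)                    # derivative along x
    Ss = np.gradient(S, axis=0)                    # derivative along s
    Jxx = gaussian_filter(Sx * Sx, outer_scale)    # structure tensor entries (eq. 4),
    Jxy = gaussian_filter(Sx * Ss, outer_scale)    # smoothed at the outer scale
    Jyy = gaussian_filter(Ss * Ss, outer_scale)

    # Level lines follow the structure tensor eigenvector with the smallest
    # eigenvalue; the ratio of its components gives the disparity d = dx/ds.
    J = np.stack([np.stack([Jxx, Jxy], -1), np.stack([Jxy, Jyy], -1)], -2)
    _, vecs = np.linalg.eigh(J)                    # eigenvalues in ascending order
    n = vecs[..., :, 0]                            # eigenvector of smallest eigenvalue
    disparity = n[..., 0] / np.where(np.abs(n[..., 1]) > 1e-9, n[..., 1], 1e-9)

    # Coherence as reliability measure (eq. 7).
    reliability = ((Jyy - Jxx) ** 2 + 4.0 * Jxy ** 2) / ((Jxx + Jyy) ** 2 + 1e-12)
    return disparity, reliability
```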

Using the local disparity estimates dy∗,t∗, dx∗,s∗ and reliability estimates ry∗,t∗, rx∗,s∗ for all the EPIs in horizontal and vertical direction, respectively, one can now proceed to directly compute disparity maps in a global optimization framework, which is explained in section 4.2. However, it is possible to first enforce global visibility constraints separately on each of the EPIs, which we explain in the next section.

b) Consistent disparity labeling on an EPI. The computation of the local disparity estimates using the structure tensor only takes into account the immediate local structure of the light field. In truth, the disparity values within a slice need to satisfy global visibility constraints across all cameras for the labeling to be consistent. In particular, a line which is labeled with a certain depth cannot be interrupted by a transition to a label corresponding to a greater depth, since this would violate occlusion ordering, figure 3.

In our conference paper [35], we have shown that by using a variational labeling framework based on ordering constraints [29], one can obtain globally consistent estimates for each slice which take into account all views simultaneously. While this is a computationally very expensive procedure, it yields convincing results, see figure 4. In particular, consistent labeling greatly improves robustness to non-Lambertian surfaces, since they typically lead only to a small subset of outliers along an EPI-line. However, at the moment this is only a proof-of-concept, since it is far too slow to be usable in any practical applications. For this reason, we do not pursue this method further in this paper, and instead evaluate only the interactive technique, using results from the local structure tensor computation directly.

4.2 Disparities on individual views

After obtaining EPI disparity estimates dy∗,t∗ and dx∗,s∗ from the horizontal and vertical slices, respectively, we integrate those estimates into a consistent single disparity map u : Ω → R for each view (s∗, t∗). This is the objective of the following section.


(a) Accuracy depending on angular resolution   (b) Mean error depending on disparity for Buddha 9x9

Fig. 6: Analysis of the error behaviour from two different points of view. In a), we plot the percentage of pixels which deviate from the ground truth (gt) by less than a given threshold over the angular resolution. Very high accuracy (i.e. more than 50% of pixels deviate by less than 0.1%) requires an angular resolution of the light field of at least 9×9 views. In b), we show the relative deviation from ground truth over the disparity value in pixels per angular step. Results were plotted for local depth estimations calculated from the original (clean) light field, local depth estimated from the same light field with additional Poisson noise (noisy) as well as the same result after TV-L1 denoising, respectively. While the ideal operational range of the algorithm are disparities within ±1 pixel per angular step, denoising significantly increases overall accuracy outside of this range.

a) Fast denoising scheme. Obviously, the fastest way to obtain a sensible disparity map for the view is to just point-wise choose the disparity estimate with the higher reliability rx∗,s∗ or ry∗,t∗, respectively. We can see that it is still quite noisy; furthermore, edges are not yet localized very well, since computing the structure tensor entails an initial smoothing of the input data. For this reason, a fast method to obtain quality disparity maps is to employ a TV-L1 smoothing scheme, where we encourage discontinuities of u to lie on edges of the original input image by weighting the local smoothness with a measure of the edge strength. We use g(x, y) = 1 − rs∗,t∗(x, y), where rs∗,t∗ is the coherence measure for the structure tensor of the view image, defined similarly as in (7). Higher coherence means a stronger image edge, which thus increases the probability of a depth discontinuity.

We then minimize the weighted TV-L1 smoothing energy

E(u) = \int_\Omega g\,|Du| + \frac{1}{2\lambda}\,|u - f| \; d(x, y), \qquad (8)

where f is the noisy disparity estimate and λ > 0 a suitable smoothing parameter. The minimization is implemented in our open-source library cocolib [13] and performs in real-time.
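A minimal sketch of the preparation steps for this scheme is shown below: the point-wise winner-take-all merge of the two EPI estimates and the edge weight g = 1 − r. The actual weighted TV-L1 minimization in the paper is performed with cocolib; the numpy code and the array names here are illustrative assumptions.

```python
import numpy as np

def merge_epi_estimates(d_h, r_h, d_v, r_v):
    """Point-wise choice between horizontal (d_h, r_h) and vertical (d_v, r_v)
    EPI disparity/reliability estimates for one view (all HxW arrays)."""
    use_h = r_h >= r_v
    disparity = np.where(use_h, d_h, d_v)        # keep the more reliable estimate
    reliability = np.where(use_h, r_h, r_v)
    return disparity, reliability

def smoothness_weight(r_view):
    """Edge weight g = 1 - r from the coherence of the view image itself;
    strong image edges reduce the smoothing and allow depth discontinuities."""
    return 1.0 - np.clip(r_view, 0.0, 1.0)
```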

b) Global optimization scheme. From a modeling perspective, a more sophisticated way to integrate the vertical and horizontal slice estimates is to employ a globally optimal labeling scheme in the domain Ω, where we minimize a functional of the form

E(u) = \int_\Omega g\,|Du| + \rho(u, x, y) \; d(x, y). \qquad (9)

In the data term, we want to encourage the solution to be close to either dx∗,s∗ or dy∗,t∗, while suppressing impulse noise. Also, the two estimates dx∗,s∗ and dy∗,t∗ shall be weighted according to their reliability rx∗,s∗ and ry∗,t∗. We achieve this by setting

\rho(u, x, y) := \min\bigl( r_{y^*,t^*}(x, s^*)\,|u - d_{y^*,t^*}(x, s^*)|,\; r_{x^*,s^*}(y, t^*)\,|u - d_{x^*,s^*}(y, t^*)| \bigr). \qquad (10)

We compute globally optimal solutions to the functional (9) using the technique of functional lifting described in [24], which is also implemented in cocolib [13]. While being more sophisticated modeling-wise, the global approach requires minutes per view instead of being real-time, and a discretization of the disparity range into labels.
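Since the functional lifting solver works on a discrete label set, the fused data term of equation (10) can be precomputed as a cost volume. The sketch below is a hedged illustration with assumed array names; the solver itself (from [24], implemented in cocolib) is not reproduced here.

```python
import numpy as np

def fused_data_term(labels, d_h, r_h, d_v, r_v):
    """Cost volume rho(u, x, y) of eq. (10) for a 1D array of disparity
    labels and per-view estimates (HxW arrays). Shape: (len(labels), H, W)."""
    cost = np.empty((len(labels),) + d_h.shape, dtype=np.float32)
    for i, u in enumerate(labels):
        cost_h = r_h * np.abs(u - d_h)        # horizontal-EPI term
        cost_v = r_v * np.abs(u - d_v)        # vertical-EPI term
        cost[i] = np.minimum(cost_h, cost_v)  # eq. (10): take the smaller one
    return cost
```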

4.3 Performance analysis for interactive labeling

In this section, we perform detailed experiments with the local disparity estimation algorithm of section 4.1(a) to analyze both quality and speed of this method. The aim is to investigate how well our disparity estimation paradigm performs when the focus lies on interactive applications, as well as to find out more about the requirements regarding light field sampling and the necessary parameters.

Optimal parameter selection. In a first experiment, we establish guidelines to select optimal inner and outer scale parameters of the structure tensor.


Panels per dataset: (a) ground truth, (b) disparity estimate, (c) TV-L1 denoising, (d) pixels deviating by less than 1% (Buddha: 96.5%, Mona: 95.4%, Conehead: 98.3%), (e) pixels deviating by less than 0.1% (Buddha: 72.6%, Mona: 60.1%, Conehead: 77.95%).

Fig. 7: Results of disparity estimation on the datasets Buddha (top), Mona (center) and Conehead (bottom). (a) shows ground truth data, (b) the local structure tensor disparity estimate from section 4.1 and (c) the result after TV-L1 denoising according to section 4.2. In (d) and (e), one can observe the amount and distribution of error, where green labels mark pixels deviating by less than the given threshold from ground truth, red labels pixels which deviate by more. Most of the larger errors are concentrated around image edges.

As a quality measurement, we use the percentage of depth values below a relative error ε = |u(x, y) − r(x, y)|/r(x, y), where u is the depth map for the view and r the corresponding ground truth. Optimal parameters are then found with a simple grid search strategy, where we test a number of different parameter combinations. Results are depicted in figure 5, and determine the optimal parameters for each light field resolution and data set. The following evaluations are all done with these optimal parameters. In general, it can be noted that an inner scale parameter of 0.08 is always reasonable, while the outer scale should be chosen larger with larger spatial and angular resolution to increase the overall sampling area.
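For reference, the quality measure used in this grid search can be written in a few lines; the function name and the threshold default below are illustrative.

```python
import numpy as np

def fraction_within(u, r, threshold=0.01):
    """Fraction of pixels whose relative depth error |u - r| / r is below
    the threshold (e.g. 0.01 for the 1% criterion of figure 5)."""
    rel_err = np.abs(u - r) / np.abs(r)
    return float(np.mean(rel_err < threshold))
```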

Minimum sampling density. In a second step, we investigate what sampling density we need for an optimal performance of the algorithm. To achieve this, we tested all datasets over the full angular resolution range with the optimal parameter selection found in figure 5. The results are illustrated in figure 6, and show that for very high accuracy, i.e. less than 0.1% deviation from ground truth, we require about nine views in each angular direction of the light field.

Moreover, the performance degrades drastically when the disparities become larger than around ±1 pixel, which makes sense from a sampling perspective since the derivatives in the structure tensor are computed on a 3 × 3 stencil. Together with the characteristics of the camera system used (baseline, focal length, resolution), this places constraints on the depth range where we can obtain estimates with our method. For the Raytrix plenoptic camera we use in the later experiments, for example, it turns out that we can reconstruct scenes which are roughly contained within a cube-shaped volume, whose size and distance is determined by the main lens we choose.

Noisy input. A second interesting fact is observable on the right hand side of figure 6, where we test the robustness against noise. Within a disparity range of ±1, the algorithm is very robust, while the results quickly degrade for larger disparity values when impulse noise is added to the input images. However, when we apply TV-L1 denoising, which requires insignificant extra computational cost, we can see that the deviation from ground truth is on average reduced below the error resulting from a noise-free input. Unfortunately, denoising always comes at a price: since it naturally incurs some averaging, while accuracy is globally increased, some sub-pixel details can be lost.


                                           Average run time¹          Accuracy        Disparity error          Depth error
Algorithm                                  all views   single view    SSIM²   MSE     ≥1.0   ≥0.5   ≥0.1      ≥1.0%  ≥0.5%  ≥0.1%

Proposed EPI algorithm
  Structure tensor only                    6s³         –⁴             0.87    0.029   0.70   1.95   16.60     0.32   3.66   55.69
  TV-L1 smoothing                          1.6s⁵       0.02s          0.94    0.018   0.43   1.28   9.85      0.18   2.45   47.83
  Global optimization                      5.5h⁵       240s           0.93    0.019   0.54   1.4    8.82      0.23   2.44   46.71

Multiview stereo (all 9×9 views)
  Dataterm only                            101s        1.25s          0.77    0.037   0.85   2.89   14.61     0.45   6.93   46.36
  TV-L1 smoothing                          1.6s        0.02s          0.91    0.023   0.54   1.71   9.13      0.19   3.68   41.65
  Global optimization                      5.5h        240s           0.92    0.027   0.71   1.9    7.51      0.08   3.48   37.34

Multiview stereo (center view crosshair)
  Dataterm only                            44s         0.55s          0.74    0.043   0.93   3.48   18.73     0.58   8.77   51.44
  TV-L1 smoothing                          1.6s        0.02s          0.91    0.021   0.46   1.64   10.91     0.18   4.14   45.21
  Global optimization                      5.5h        240s           0.92    0.023   0.57   1.71   8.34      0.09   3.17   39.64

¹ All algorithms implemented on the GPU, running on an nVidia GTX 580 hosted on an Intel Core i7 board.
² The structural similarity measure (SSIM) ranges from -1 to 1, larger values mean higher similarity.
³ Tuned for optimal accuracy instead of speed: three channel RGB structure tensor, convolution kernel size 9 × 9.
⁴ By design, the algorithm always computes disparity for all views at the same time.
⁵ Run time for optimization only, excludes computation of the data term (same for subsequent optimization methods).

Fig. 8: Average disparity reconstruction accuracy and speed for our method compared to a multi-view stereo method on our light field benchmark database. Parameters for all methods were tuned to yield an optimal structural similarity (SSIM) measure [32] (bolded column), but for completeness we also show mean squared disparity error (MSE), as well as the percentage of pixels with disparity or depth error larger than a given quantity. Our method is the only one which is fast enough to yield disparity maps for all 9 × 9 views at near-interactive frame rates, while also being the most accurate. Results on individual data sets can be observed in figure 9.

In figure 7 we observe the distribution of the errors, and can see that almost all large-scale error is concentrated around depth discontinuities. This is a quite common behaviour of depth reconstruction schemes, and improving it is a central topic of possible further investigations.

4.4 Comparison to multi-view stereo

Our method uses a paradigm which is quite different from multi-view stereo, so it is of course of interest how it fares in competition with these methods. We therefore compare it to a straight-forward stereo data term, while using the exact same smoothing and optimization schemes for all data terms. Since in the proposed algorithm we restrict the computation of the local disparity to slices in s- and t-direction, it is also of interest how many views are actually required to produce optimal results. The question is whether by means of this restriction we do not throw away potentially useful information. However, during optimization, all local information is integrated into a global functional, and spatially propagated. We will see that using more views for the local data term than the ones in those two slices does not actually improve the optimized results anymore.

Competing method. We compute a local stereo matching cost for a single view as follows. Let V = {(s1, t1), ..., (sN, tN)} be the set of N view points with corresponding images I1, ..., IN, with (sc, tc) being the location of the current view Ic for which the cost function is being computed. We then choose a set Λ of 64 disparity labels within an appropriate range; for our test we choose equidistant labels within the ground truth range for optimal results. The local cost ρall(x, l) for label l ∈ Λ at location x ∈ Ic computed on all neighbouring views is then given by

\rho_{\text{all}}(x, l) := \sum_{(s_n, t_n) \in V} \| I_n(x + l\,v_n) - I_c(x) \|, \qquad (11)

where vn := (sn − sc, tn − tc) is the view point displacement. To test the influence of the number of views, we also compute a cost function on a "crosshair" of view points along the s- and t-axis from the view (sc, tc), which is given by

\rho_{\text{crosshair}}(x, l) := \sum_{\substack{(s_n, t_n) \in V \\ s_n = s_c \ \text{or}\ t_n = t_c}} \| I_n(x + l\,v_n) - I_c(x) \|. \qquad (12)

In effect, this cost function thus uses exactly the same number of views as we do to compute the local structure tensor.
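The sketch below illustrates this matching cost as a brute-force cost volume; it is a hedged reference implementation, not the paper's GPU code. The bilinear image shifts use scipy, and the mapping between view-point displacement and pixel shift (including its sign) depends on the light field parametrization, so it may need to be adapted.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def mvs_cost_volume(views, positions, center_idx, labels):
    """Brute-force data term of eq. (11). views: list of HxW grayscale images,
    positions: list of (s, t) view coordinates, labels: sequence of disparity
    labels. Returns a cost volume of shape (num_labels, H, W)."""
    Ic = views[center_idx]
    sc, tc = positions[center_idx]
    cost = np.zeros((len(labels),) + Ic.shape, dtype=np.float32)
    for n, (In, (sn, tn)) in enumerate(zip(views, positions)):
        if n == center_idx:
            continue
        vs, vt = sn - sc, tn - tc                 # view point displacement v_n
        for i, l in enumerate(labels):
            # Evaluate I_n at x + l*v_n by shifting the image by -l*v_n
            # (row axis <-> t, column axis <-> s in this illustration).
            In_warped = nd_shift(In, shift=(-l * vt, -l * vs), order=1, mode='nearest')
            cost[i] += np.abs(In_warped - Ic)     # accumulate matching cost
    return cost
```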

The local cost function can be used to compute fast point-wise results, optionally smoothing them afterwards, or it can also be integrated into a global energy functional

E(u) = \int_\Omega \rho(x, u(x)) \, dx + \lambda \int_\Omega |Du| \qquad (13)

for a labeling function u : Ω → Λ on the image domain Ω, which is solved to global optimality using the method in [24].

Experiments and discussion. In figures 8 and 9, we show detailed visual and quantitative disparity estimation results on our benchmark datasets. Algorithm parameters for all methods were tuned for an optimal structural similarity (SSIM) measure.


Panels per scene: center view, ground truth, and disparity maps for the listed methods.

                 Multi-view stereo               Proposed EPI method
Scene            Data term only  Global optimum  Structure tensor  TV-L1 smoothing  Global optimum
Buddha           0.86            0.94            0.91              0.94             0.94
Horses           0.75            0.92            0.84              0.92             0.89
Medieval         0.61            0.93            0.85              0.97             0.96
Mona             0.85            0.91            0.90              0.94             0.93
Papillon         0.72            0.90            0.83              0.93             0.92
StillLife        0.89            0.91            0.92              0.94             0.93

Fig. 9: Structural similarity error for different methods on individual scenes from our benchmark data set. All light fields have 9×9 views, with image resolutions between 768×768 and 1024×768. The multi-view stereo method makes use of all views to compute the data term. One can see that we obtain in most cases more accurate results in much shorter computation time.

Strong arguments why this measure should be preferred to the MSE are given in [32], but we also have computed a variety of other quantities for comparison (however, the detailed results vary when parameters are optimized for different quantities).

First, one can observe that our local estimate is always more accurate than any of the multi-view stereo data terms, while using all of the views gives slightly better results for multi-view stereo than using only the crosshair. Second, our results after applying the TV-L1 denoising scheme (which takes altogether less than two seconds for all views) are more accurate than all other results, even those obtained with global optimization schemes (which take minutes per view). A likely reason why our results do not become better with global optimization is that the latter requires a quantization into a discrete set of disparity labels, which of course leads to an accuracy loss. Notably, after either smoothing or global optimization, both multiview stereo data terms achieve the same accuracy, see figure 8 - it does not matter that the crosshair data term makes use of fewer views, likely since information is propagated across the view in the second step. This also justifies our use of only two epipolar plane images for the local estimate.

Our method is also the fastest, achieving near-interactive performance for computing disparity maps for all of the views simultaneously. Note that in figure 8, we give computation times when our method is tuned for maximum quality (i.e. three channel RGB structure tensor with a convolution kernel size of 9×9). At the loss of some accuracy, one can work with grayscale light fields (three times faster) or reduce the convolution kernel size (again up to three times faster). Note that by construction, the disparity maps for all views are always computed simultaneously. Performance could further be increased by restricting the computation on each EPI to a small stripe if only the result of a specific view is required.


5 SUPER-RESOLUTION VIEW SYNTHESIS

In this section, we propose a variational model for the synthesis of super-resolved novel views. To the best of our knowledge, it is the first of its kind. Since the model is continuous, we will be able to derive Euler-Lagrange equations which correctly take into account foreshortening effects of the views caused by variations in the scene geometry. This makes the model essentially parameter-free. The framework is in the spirit of [14], which computes super-resolved textures for a 3D model from multiple views, and shares the same favourable properties. However, it has substantial differences, since we do not require a complete 3D geometry reconstruction and costly computation of a texture atlas. Instead, we only make use of disparity maps on the input images, and model the super-resolved novel view directly.

The following mathematical framework is formulated for views with arbitrary projections. However, an implementation in this generality would be quite difficult to achieve. We therefore specialize to the scenario of a 4D light field in the subsequent section, and leave a generalization of the implementation for future work.

For the remainder of the section, assume we have images vi : Ωi → R of a scene available, which are obtained by projections πi : R³ → Ωi. Each pixel of each image stores the integrated intensities from a collection of rays from the scene. This subsampling process is modeled by a blur kernel b for functions on Ωi, and essentially characterizes the point spread function for the corresponding sensor element. It can be measured for a specific imaging system [2]. In general, the kernel may depend on the view and even on the specific location in the images. We omit the dependency here for simplicity of notation.

The goal is to synthesize a view u : Γ → R of the light field from a novel view point, represented by a camera projection π : R³ → Γ, where Γ is the image plane of the novel view. The basic idea of super-resolution is to define a physical model for how the subsampled images vi can be explained using high-resolution information in u, and then solve the resulting system of equations for u. This inverse problem is ill-posed, and is thus reformulated as an energy minimization problem with a suitable prior or regularizer on u.

5.1 Image formation and model energy

In order to formulate the transfer of information from u to vi correctly, we require geometry information [8]. Thus, we assume we know (previously estimated) depth maps di for the input views. A point x ∈ Ωi is then in one-to-one correspondence to a point P which lies on the scene surface Σ ⊂ R³. The color of the scene point can be recovered from u via u(π(P)), provided that x is not occluded by other scene points, see figure 10.

Fig. 10: Transfer map τi from an input image plane Ωi to the image plane Γ of the novel view point. The scene surface Σ can be inferred from the depth map on Ωi. Note that not all points x ∈ Ωi are visible in Γ due to occlusion, which is described by the binary mask mi on Ωi. Above, mi(x) = 1 while mi(x′) = 0.

The process explained above induces a backwards warp map τi : Ωi → Γ which tells us where to look on Γ for the color of a point, as well as a binary occlusion mask mi : Ωi → {0, 1} which takes the value 1 if and only if a point in Ωi is also visible in Γ. Both maps only depend on the scene surface geometry as seen from vi, i.e. the depth map di. The different terms and mappings appearing above and in the following are visualized for an example light field in figure 11.

Having computed the warp map, one can formulate a model of how the values of vi within the mask can be computed, given a high-resolution image u. Using the downsampling kernel, we obtain vi = b ∗ (u ∘ τi) on the subset of Ωi where mi = 1, which consists of all points in vi which are also visible in u. Since this equality will not be satisfied exactly due to noise or inaccuracies in the depth map, we instead propose to minimize the energy

E(u) = \sigma^2 \int_\Gamma |Du| + \sum_{i=1}^{n} \underbrace{\frac{1}{2} \int_{\Omega_i} m_i \bigl( b * (u \circ \tau_i) - v_i \bigr)^2 \, dx}_{=:\, E^i_{\text{data}}(u)}, \qquad (14)

which is the MAP estimate under the assumption of Gaussian noise with standard deviation σ on the input images. It resembles a classical super-resolution model [2], which is made slightly more complex by the inclusion of the warp maps and masks.

In the energy (14), the total variation acts as a regularizer or objective prior on u. Its main tasks are to eliminate outliers and enforce a reasonable inpainting of regions for which no information is available, i.e. regions which are not visible in any of the input views. It could be replaced by a more sophisticated prior for natural images; however, the total variation leads to a convex model which can be minimized very efficiently.
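To make the image formation model concrete, the following sketch evaluates the residual of one data term E^i_data for a candidate high-resolution view u. It is a hedged illustration: a box filter stands in for the blur kernel b (as also used in the experiments of section 5.3), scipy interpolation stands in for u ∘ τi, and all function and array names are assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates, uniform_filter

def data_term_residual(u, v_i, tau_x, tau_y, mask_i, box_size=3):
    """Residual m_i * (b * (u o tau_i) - v_i) of eq. (14) for one input view.
    u: high-resolution novel view on Gamma; v_i: low-resolution input view on
    Omega_i; tau_x, tau_y: coordinates of tau_i(x) in u for every pixel of v_i;
    mask_i: binary visibility mask m_i."""
    u_warped = map_coordinates(u, [tau_y, tau_x], order=1, mode='nearest')  # u o tau_i
    blurred = uniform_filter(u_warped, size=box_size)                       # b * (u o tau_i)
    return mask_i * (blurred - v_i)

def data_term_energy(residual):
    """E^i_data(u): one half times the sum of the squared masked residual."""
    return 0.5 * float(np.sum(residual ** 2))
```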

Functional derivative. The functional derivative for the inverse problem above is required in order to find solutions.


Panels (low resolution): input view vi, disparity map di, forward warp vi ∘ βi. Panels (high resolution): backward warp u ∘ τi, visibility mask mi, weighted mask m̄i, novel view u.

Fig. 11: Illustration of the terms in the super-resolution energy. The figure shows the ground truth depth map for a single input view and the resulting mappings for forward and backward warps as well as the visibility mask mi. White pixels in the mask denote points in Ωi which are visible in Γ as well.

It is well-known in principle, but one needs to take into account complications caused by the different domains of the integrals. Note that τi is one-to-one when restricted to the visible region Vi := {mi = 1}, thus we can compute an inverse forward warp map βi := (τi|Vi)⁻¹, which we can use to transform the data term integral back to the domain Γ, see figure 11. We obtain for the derivative of a single term of the sum in (14)

dE^i_{\text{data}}(u) = |\det D\beta_i| \, \bigl( m_i\, b * (b * (u \circ \tau_i) - v_i) \bigr) \circ \beta_i. \qquad (15)

The determinant is introduced by the variable substitution of the integral during the transformation. A more detailed derivation for a structurally equivalent case can be found in [14].

The term |det Dβi| in equation (15) introduces a pointwise weight for the contribution of each image to the gradient descent. However, βi depends on the depth map on Γ, which needs to be inferred and is not readily available. Furthermore, for efficiency it needs to be pre-computed, and storage would require another high-resolution floating point matrix per view. Memory is a bottleneck in our method, and we need to avoid this. For this reason, it is much more efficient to transform the weight to Ωi and multiply it with mi to create a single weighted mask. Note that

|\det D\beta_i| = \bigl|\det D\tau_i^{-1}\bigr| = |\det D\tau_i|^{-1} \circ \beta_i. \qquad (16)

Thus, we obtain a simplified expression for the functional derivative,

dE^i_{\text{data}}(u) = \bigl( \bar m_i \, b * (b * (u \circ \tau_i) - v_i) \bigr) \circ \beta_i \qquad (17)

with m̄i := mi |det(Dτi)|⁻¹. An example weighted mask is visualized in figure 11.

5.2 Specialization to 4D light fields

The model introduced in the previous section is hard to implement efficiently in fully general form. This paper, however, focuses on the setting of a 4D light field, where we can make a number of significant simplifications. The main reason is that the warp maps between the views are given by parallel translations in the direction of the view point change. The amount of translation is proportional to the disparity of a pixel, which is in one-to-one correspondence to the depth, as explained in section 3.

How the disparity maps are obtained does not matter, but in this work, naturally, they will be computed using the technique described in the previous section.

View synthesis in the light field plane. The warp maps required for view synthesis become particularly simple when the target image plane Γ lies in the common image plane Ω of the light field, and π resembles the corresponding light field projection through a focal point c ∈ Π. In this case, τi is simply given by a translation proportional to the disparity,

\tau_i(x) = x + d_i(x)\,(c - c_i), \qquad (18)

see figure 12. Thus, one can compute the weight in equation (17) to be

|\det D\tau_i|^{-1} = |1 + \nabla d_i \cdot (c - c_i)|^{-1}. \qquad (19)

There are a few observations to make about this weight. Disparity gradients which are not aligned with the view translation ∆c = c − ci do not influence it, which makes sense since it does not change the angle under which the patch is viewed. Disparity gradients which are aligned with ∆c and tend to infinity lead to a zero weight, which also makes sense since they lead to a large distortion of the patch in the input view and thus unreliable information.


x + d_i(x)\,\Delta c = y + d_i(y)\,\Delta c \;\Leftrightarrow\; \Delta c\,\frac{d_i(x) - d_i(y)}{x - y} = -1

Fig. 12: The slope of the solid blue line depends on the disparity gradient in the view vi. If ∆c · ∇di = −1, then the line is projected onto a single point in the novel view u.

A very interesting result is the location of maximum weight. The weights become larger when ∆c · ∇di approaches −1. An interpretation can be found in figure 12. If ∆c · ∇di gets closer to −1, then more information from Ωi is being condensed onto Γ, which means that it becomes more reliable and should be assigned more weight. The extreme case is a line segment with a disparity gradient such that ∆c · ∇di = −1, which is projected onto a single point in Γ. In this situation, the weight becomes singular. This does not pose a problem: from a theoretical point of view, the set of singular points is a null set according to the theorem of Sard, and thus not seen by the integral. From a practical point of view, all singular points lead to occlusion and the mask mi is zero anyway.

Note that formula (19) is non-intuitive, but the correct one to use when geometry is taken into account. We have not seen anything similar being used in previous work. Instead, weighting factors for view synthesis are often imposed according to measures based on distance to the interpolated rays or matching similarity scores, which certainly work, but are also somewhat heuristic strategies [19], [12], [17], [25].
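The sketch below illustrates this specialization: the translational warp of equation (18) and the geometric weight of equation (19), which together yield the weighted mask used in equation (17). It is a hedged numpy illustration with assumed names; the view-point displacement convention and pixel units must match the disparity maps actually used.

```python
import numpy as np

def warp_and_weight(d_i, dc, eps=1e-6):
    """d_i: disparity map of input view i (HxW), dc: (dc_x, dc_y) view point
    displacement c - c_i. Returns the warp coordinates tau_i(x) (eq. 18)
    and the point-wise weight |1 + grad(d_i) . dc|^(-1) (eq. 19)."""
    h, w = d_i.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    tau_x = xs + d_i * dc[0]                      # horizontal component of eq. (18)
    tau_y = ys + d_i * dc[1]                      # vertical component of eq. (18)
    grad_y, grad_x = np.gradient(d_i)             # nabla d_i
    det = 1.0 + grad_x * dc[0] + grad_y * dc[1]   # 1 + nabla d_i . (c - c_i)
    weight = 1.0 / np.maximum(np.abs(det), eps)   # eq. (19)
    return tau_x, tau_y, weight

# The weighted mask of eq. (17) is then simply m_bar_i = mask_i * weight.
```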

5.3 Super-resolution results

For the optimization of the (convex) energy (14), we transform the gradient to the space of the target view via equation (17), discretize, and employ the fast iterative shrinkage and thresholding algorithm (FISTA) found in [3]. All steps are explained in our previous work [36], and an implementation is available on our web site, so we omit the details here for brevity. In order to demonstrate the validity and robustness of our algorithm, we perform extensive tests on our synthetic light fields, where we have ground truth available, as well as on real-world data sets from a plenoptic camera. As a by-product, this establishes again that disparity maps obtained by our proposed method have subpixel accuracy, since this is a necessary requirement for super-resolution to work.

Panels: closeup of center view at low resolution; bilinear upsampling to 4× resolution; TV zooming [9]; super-resolution result, 7×7 input views; original high-resolution center view.

Fig. 13: Comparison of the different upsampling schemes on the light field of a resolution chart. Input resolution is 512×512, which is 4× upsampled. The bottom image shows the original 1024×1024 center view for comparison. All images shown are closeups.

View interpolation and super-resolution. In a first set of experiments, we show the quality of view interpolation and super-resolution, both with ground truth as well as estimated disparity. In figure 14, we synthesize the center view of a light field with our algorithm using the remaining views as input, and compare the result to the actual view. For the downsampling kernel b, we use a simple box filter of size equal to the downsampling factor, so that it fits exactly on a pixel of the input views. We compute results both with ground truth disparities, to show the maximum theoretical performance of the algorithm, as well as for the usual real-world case that disparity needs to be estimated. This estimation is performed using the local method described in section 4.1, and requires less than five seconds for all of the views. Synthesizing a single super-resolved view requires about 15 seconds on an nVidia GTX 580 GPU.
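For completeness, a minimal sketch of such a box-filter downsampling operator (Python/NumPy; function name hypothetical, grayscale image assumed):

```python
import numpy as np

def box_downsample(img, factor):
    """Downsample a 2D image by averaging non-overlapping factor x factor
    blocks, so that the kernel covers exactly one low-resolution pixel."""
    H, W = img.shape
    img = img[:H - H % factor, :W - W % factor]   # crop to a multiple of factor
    return img.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))
```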

In order to test the quality of super-resolution, we compute the 3 × 3 super-resolved center view and compare with ground truth. For reference, we also compare the result of bilinear interpolation (IP) as well as TV-zooming [9] of the center view synthesized in the first experiment.
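The error measure reported in figure 14 is the PSNR; for reference, a minimal sketch of the standard definition (Python/NumPy; intensities assumed to lie in [0, 255], the exact peak value used in the experiments is not stated here):

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and an
    estimate, assuming intensities in the range [0, peak]."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```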


                 Conehead                    Buddha                      Mona
   Views     1×1   3×3   TV    IP      1×1   3×3   TV    IP      1×1   3×3   TV    IP
GT  5×5     31.6  29.3  27.4  26.5    32.2  28.9  27.5  26.5    30.1  28.3  27.4  26.4
GT  9×9     31.6  29.4  27.5  26.5    32.2  29.1  27.5  26.5    30.0  28.3  27.4  26.3
GT 17×17    31.2  30.4  27.3  26.0    31.8  30.2  28.8  27.2    30.2  28.9  27.8  26.5
ED  5×5     31.1  29.3  27.1  25.8    28.0  28.9  25.8  24.3    26.4  28.3  25.7  23.8
ED  9×9     31.4  29.4  27.6  26.2    30.7  29.1  28.9  27.7    28.9  28.3  26.8  25.1
ED 17×17    31.5  30.9  25.9  24.3    31.4  29.5  27.9  26.8    29.5  28.3  27.1  25.8

Fig. 14: Reconstruction error for the data sets obtained with a ray-tracer. The table shows the PSNR of the center view without super-resolution (1×1), at super-resolution magnification 3 × 3, and for bilinear interpolation (IP) and TV-Zooming (TV) [9] to 3 × 3 resolution as a comparison. The set of experiments is run with both ground truth (GT) and estimated disparities (ED). The estimation error for the disparity map can be found in figure 6. Input image resolution is 384 × 384.

(a) Buddha (b) Mona

Fig. 15: Closeups of the upsampling results for the light fields generated with a ray tracer. From left to right: low-resolution center view (not used for reconstruction), high resolution center view obtained by bilinear interpolation of a low-resolution reconstruction from 24 other views, TV-Zooming [9], super-resolved reconstruction. The super-resolved result shows increased sharpness and details.

While the reconstruction with ground truth disparities is very precise, we can see that in the case of estimated disparity, the result strongly improves with larger angular resolution due to better disparity estimates (figure 6). Super-resolution is superior to both competing methods. This also emphasizes the sub-pixel accuracy of the disparity maps, since without accurate matching, super-resolution would not be possible. Figures 1, 13 and 15 show closeup comparison images of the input light fields and upsampled novel views obtained with different strategies. At this zoom level, it is possible to observe increased sharpness and details in the super-resolved results. Figure 13 indicates that the proposed scheme also produces the fewest artifacts.

Figures 18 and 19 show the results of the same set of experiments for two real-world scenes captured with the Raytrix plenoptic camera. The plenoptic camera data was transformed to the standard representation as an array of 9 × 9 views using the method in [34]. Since no ground truth for the scenes is available, the input views were downsampled to lower resolution before performing super-resolution, and the result was compared against the original view. We can see that the proposed algorithm allows us to accurately reconstruct both subpixel disparity as well as a high-quality super-resolved intermediate view.

Disparity refinement. As we have seen in figure 16, the disparity estimate is more accurate when the angular sampling of the light field is denser. An idea is therefore to increase the angular resolution, and thereby improve the disparity estimate, by synthesizing intermediate views.

We first synthesize novel views to increase angular resolution by factors of 2 and 4. Figure 16 shows resulting epipolar plane images, which can be seen to be of high quality with accurate occlusion boundaries. Nevertheless, it is highly interesting that the quality of the disparity map increases significantly when recomputed with the super-resolved light field, figure 17. This is a striking result, since one would expect that the intermediate views merely reflect the error in the original disparity maps. However, they actually provide more accuracy than a single disparity map, since they represent a consensus of all input views. Unfortunately, due to the high computational cost, this is not really a viable strategy in practice.
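Schematically, the refinement strategy amounts to the following loop; all helper functions are hypothetical placeholders standing in for the local disparity estimation of section 4.1 and the view synthesis of section 5.

```python
def refine_disparity(light_field, factor=2):
    """Disparity refinement by angular super-resolution (illustrative sketch,
    hypothetical helpers).

    1. Estimate disparity locally on the original views (section 4.1).
    2. Synthesize intermediate views to densify the angular sampling.
    3. Re-run the EPI-based disparity estimation on the denser light field,
       where the structure tensor yields more reliable orientation estimates.
    """
    disparity = estimate_disparity_epi(light_field)
    dense_lf = synthesize_intermediate_views(light_field, disparity, factor=factor)
    return estimate_disparity_epi(dense_lf)
```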

Fig. 16: Upsampling of epipolar plane images (EPIs). Panels: 5 × 5 input views; super-resolved to 9 × 9; super-resolved to 17 × 17. The left image shows the five layers of an epipolar plane image of the input data set with 5 × 5 views. We generate intermediate views using our method to achieve angular super-resolution. One can observe the high quality and accurate occlusion boundaries of the resulting view interpolation.

6 CONCLUSIONS

We developed a continuous framework for light field analysis which allows us both to introduce novel data terms for robust disparity estimation and to formulate the first fully continuous model for variational super-resolution view synthesis. Disparity is estimated locally using dominant directions on epipolar plane images, which are computed with the structure tensor. The local estimates can be consolidated into global disparity maps using state-of-the-art convex optimization techniques. Several such methods are compared, trading off more and more modeling accuracy and sophistication for speed.

We also give a detailed analysis of optimal parameter choices, the ideal use cases, as well as limitations of the method. As expected, the method is best suited to densely sampled light fields, as for example obtained by recent commercial plenoptic camera models. Experiments on new benchmark data sets tailored to the light field paradigm show state-of-the-art results, which surpass a traditional stereo-based method in both accuracy and speed.

The subpixel-accurate disparity maps we obtain are the prerequisite for super-resolved view synthesis. As a theoretical novelty, we can analytically derive, within our framework, the weighting factors for the contributions of the input views that arise from foreshortening effects due to scene geometry. Extensive experiments on synthetic ground truth as well as real-world images from a recent plenoptic camera give numerical evidence of the competitive performance of our method, which is capable of achieving near-interactive frame rates.

REFERENCES

[1] E. Adelson and J. Bergen. The plenoptic function and the elements of early vision. Computational Models of Visual Processing, 1, 1991.

[2] S. Baker and T. Kanade. Limits on super-resolution and how to break them. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(9):1167–1183, 2002.

[3] A. Beck and M. Teboulle. Fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2:183–202, 2009.

[4] J. Berent and P. Dragotti. Segmentation of epipolar-plane image volumes with occlusion and disocclusion competition. In IEEE 8th Workshop on Multimedia Signal Processing, pages 182–185, 2006.

Disparity MSE
Views           Cone    Buddha   Mona
input  5×5      4.534   0.883    2.125
SR     9×9      1.084   0.559    1.058
input  9×9      0.066   0.080    0.192
SR    17×17     0.044   0.066    0.105

Fig. 17: By computing intermediate views, one can increase the resolution of the epipolar plane images, see figure 16, which in turn leads to an improved disparity estimate. The table shows the mean squared error for the depth maps at original and super-resolved (SR) angular resolution; the images illustrate the distribution of the depth error before and after super-resolution.

Reconstruction PSNR
Method   Demo    Motor
1×1      36.91   35.36
3×3      30.82   31.72
TV       25.21   24.99
IP       23.89   22.84

Fig. 19: Reconstruction error for light fields captured with the Raytrix plenoptic camera. The table shows PSNR for the reconstructed input view at original resolution (1×1) as well as 3 × 3 super-resolution, and 3 × 3 interpolation (IP) and TV-Zooming (TV) [9] for comparison.

[5] J. Bigun and G. H. Granlund. Optimal orientation detection of linear symmetry. In Proc. International Conference on Computer Vision, pages 433–438, 1987.

[6] T. Bishop and P. Favaro. Full-resolution depth map estimation from an aliased plenoptic light field. Computer Vision–ACCV 2010, 1:186–200, 2011.

[7] R. Bolles, H. Baker, and D. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1(1):7–55, 1987.

[8] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum. Plenoptic sampling. Proc. SIGGRAPH, pages 307–318, 2000.

[9] A. Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20(1-2):89–97, 2004.

[10] A. Criminisi, S. Kang, R. Swaminathan, R. Szeliski, and P. Anandan. Extracting layers and analyzing their specular properties using epipolar-plane-image analysis. Computer Vision and Image Understanding, 97(1):51–85, 2005.

[11] T. Georgiev and A. Lumsdaine. Focused plenoptic camera and rendering. Journal of Electronic Imaging, 19:021106, 2010.

[12] I. Geys, T. P. Koninckx, and L. V. Gool. Fast interpolated cameras by combining a GPU based plane sweep with a max-flow regularisation algorithm. In 3DPVT, pages 534–541, 2004.

[13] B. Goldluecke. cocolib - a library for continuous convex optimization. http://cocolib.net, 2013.

[14] B. Goldluecke and D. Cremers. Superresolution texture maps for multiview reconstruction. In Proc. International Conference on Computer Vision, 2009.

[15] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The Lumigraph. In Proc. SIGGRAPH, pages 43–54, 1996.

[16] A. Katayama, K. Tanaka, T. Oshino, and H. Tamura. Viewpoint-dependent stereoscopic display using interpolation of multiviewpoint images. In Proceedings of SPIE, volume 2409, page 11, 1995.


(a) Demo (b) Motor

Fig. 18: Super-resolution view synthesis using light fields from a plenoptic camera. Scenes were recorded with a Raytrix camera at a resolution of 962 × 628 and super-resolved by a factor of 3 × 3. The light field contains 9 × 9 views. Numerical quality of the estimate is reported in figure 19. From left to right: low-resolution center view (not used for reconstruction), high resolution center view obtained by bilinear interpolation of a low-resolution reconstruction from 24 other views, TV-Zooming [9], super-resolved reconstruction. One can find additional detail, for example the diagonal stripes in the Euro note, which were not visible before.

[17] A. Kubota, K. Aizawa, and T. Chen. Reconstructing dense light field from array of multifocus images for novel view synthesis. IEEE Transactions on Image Processing, 16(1):269–279, 2007.

[18] M. Levoy. Light fields and computational imaging. Computer, 39(8):46–55, 2006.

[19] M. Levoy and P. Hanrahan. Light field rendering. In Proc. SIGGRAPH, pages 31–42, 1996.

[20] M. Matousek, T. Werner, and V. Hlavac. Accurate correspondences from epipolar plane images. In Proc. Computer Vision Winter Workshop, pages 181–189, 2001.

[21] L. McMillan and G. Bishop. Plenoptic modeling: An image-based rendering system. In Proc. SIGGRAPH, pages 39–46, 1995.

[22] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan. Light field photography with a hand-held plenoptic camera. Technical Report CSTR 2005-02, Stanford University, 2005.

[23] C. Perwass and L. Wietzke. The next generation of photography, 2010. www.raytrix.de.

[24] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. Global solutions of variational models with convex regularization. SIAM Journal on Imaging Sciences, 2010.

[25] M. Protter and M. Elad. Super-resolution with probabilistic motion estimation. IEEE Transactions on Image Processing, 18(8):1899–1904, 2009.

[26] S. Seitz and C. Dyer. Physically-valid view synthesis by image interpolation. In Proc. IEEE Workshop on Representation of Visual Scenes, pages 18–25, 1995.

[27] H. Shum, S. Chan, and S. Kang. Image-based rendering. Springer-Verlag, New York, 2007.

[28] A. Siu and E. Lau. Image registration for image-based rendering. IEEE Transactions on Image Processing, 14(1):241–252, 2005.

[29] E. Strekalovskiy and D. Cremers. Generalized ordering constraints for multilabel optimization. In Proc. International Conference on Computer Vision, 2011.

[30] V. Vaish and A. Adams. The (New) Stanford Light Field Archive. http://lightfield.stanford.edu, 2008.

[31] V. Vaish, B. Wilburn, N. Joshi, and M. Levoy. Using plane + parallax for calibrating dense camera arrays. In Proc. International Conference on Computer Vision and Pattern Recognition, 2004.

[32] Z. Wang and A. Bovik. Mean squared error: Love it or leave it? IEEE Signal Processing Magazine, 26(1):98–117, 2009.

[33] S. Wanner. HCI light field archive. http://lightfield-analysis.net, 2012.

[34] S. Wanner, J. Fehr, and B. Jähne. Generating EPI representations of 4D light fields with a single lens focused plenoptic camera. Advances in Visual Computing, pages 90–101, 2011.

[35] S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4D light fields. In Proc. International Conference on Computer Vision and Pattern Recognition, pages 41–48, 2012.

[36] S. Wanner and B. Goldluecke. Spatial and angular variational super-resolution of 4D light fields. In Proc. European Conference on Computer Vision, 2012.

Sven Wanner received his Diploma from the University of Heidelberg for his work on "Interactive rendering of data from wind-driven water surfaces and event classification". His main research interests lie in the topic area of light field image processing. Currently, he focuses on developing algorithms for robust 3D reconstruction and segmentation of light fields within the Lumigraph parametrization, which he obtains for different kinds of raw data from camera arrays, plenoptic cameras as well as from simulation. He is a member of the "Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences" (HGS).

Bastian Goldluecke received a PhD on "Multi-Camera Reconstruction and Rendering for Free-viewpoint Video" from the MPI for Computer Science in Saarbrücken in 2005. Subsequently, he held PostDoc positions at the University of Bonn and TU Munich, where he developed variational methods and convex optimization techniques for high-accuracy 3D and texture reconstruction. His speciality is GPU implementations of efficient algorithms for variational inverse problems and multilabel problems, for which he develops and maintains the open source library cocolib. In current research, he focuses on variational methods for light field analysis, and he is heading the corresponding junior research group at the Heidelberg Collaboratory for Image Processing.