
3D Reconstruction from Finite-Aperture Lenses

(Ph.D. Depth Oral Report)

Samuel W. Hasinoff
hasinoff@cs.toronto.edu

April 14, 2006

Abstract

In this report we investigate the problem of reconstructing 3D scene geometry from images acquired from a stationary camera with a controllable finite-aperture lens. We review the relevant literature in computer vision and optics, present some new insights into the problem [21], and suggest directions for future research.


Contents

1 Introduction

2 Imaging with Finite-Aperture Lenses
  2.1 Lens parameters
  2.2 Generic finite-aperture lens model

3 Models for Finite-Aperture Lenses
  3.1 Basic analytic models
  3.2 Distortions in real lenses
  3.3 Calibration methods

4 Models of Defocus
  4.1 Defocus as linear filtering
  4.2 Spatially variant filtering
  4.3 Windowed linear filtering
  4.4 Defocus for local tangent planes
  4.5 Analytic defocus models
  4.6 Empirical defocus models

5 Focus Measures
  5.1 Known blur kernel
  5.2 Blind focus measures

6 Depth-from-Focus
  6.1 Maximizing depth resolution
  6.2 Analysis

7 Depth-from-Defocus
  7.1 Direct deconvolution
  7.2 Depth from relative defocus
  7.3 Strong scene models
  7.4 Analysis

8 Future research directions

References


You cannot depend on your eyes when your imagination is out of focus.

A Connecticut Yankee in King Arthur’s Court

Mark Twain (1835–1910)

I never hit a shot, not even in practice, without having a very sharp, in-focus picture of it in my head.

Jack Nicklaus

1 Introduction

According to legend, Archimedes created the world's first "death ray" around 212 BCE, using an arrangement of mirrors to focus sunlight on the invading Roman fleet [1]. While the story may be apocryphal, the phenomenon of focus has held our fascination since antiquity.

Focus, the convergence of light to a sharp point, has been studied as part of classical optics since before the time of Newton [45]. The phenomenon of focus is also familiar to photographers, who frequently exploit defocus for artistic effect, to draw the viewer's attention to certain parts of the scene, or to represent our subjective experience of the world.

By contrast, defocus is typically regarded by the computer vision community as a form of image degradation, analogous to motion blur or image saturation, corrupting the ideal pinhole image. Indeed, the high frequencies lost to defocus cannot be recovered in general, without a strong model of the underlying scene. For many applications, defocus can be avoided by simply choosing the camera settings appropriately.

Importantly, the defocus arising from finite-aperture lenses has a particular structure that encodes information about the scene not present in an ideal pinhole image. In particular, the depth of a given point in the scene is related to the amount of its defocus. Although this idea has received significant study in computer vision [13, 18, 21, 25, 39, 42, 54, 60], methods exploiting defocus for 3D reconstruction, or otherwise making use of a more sophisticated finite-aperture lens model, are by far the exception.

Despite the relative lack of attention, 3D reconstruction methods from finite-aperture lenses hold great potential for many common scenes for which obtaining detailed models is beyond the state of the art [22, 62, 64, 65]. Generally speaking, a stationary finite-aperture camera is particularly well-suited to reconstructing scenes with large appearance variation over moderate changes in viewpoint.

In particular, finite-aperture lenses are well-suited to reconstructing scenes that are highly specular (crystals), finely detailed (steel wool), composed of multiple semi-transparent layers (dirty plastic), or possessing complex self-occlusion relationships (tangled hair). In all these cases, matching over viewpoint changes may be difficult, due to large variations in appearance, and active illumination methods may also fail, due to the fine geometric structure or materials making up these scenes. A further advantage of reconstruction methods using defocus


is the ability to detect camouflaged objects, which allows segmentation of the scene based on shape rather than texture cues [15].

On the other hand, 3D reconstruction methods from finite-aperture lenses may not be practical for dynamic scenes, since they rely on acquiring images from a stationary camera with different lens settings. Although this problem may be addressed, for example, using beam-splitters and multiple synchronized cameras [32], reconstruction from finite-aperture lenses may only be practical over a small number of lens settings. Another issue, common to passive multi-view 3D reconstruction methods, is that depth estimation is poorly defined for untextured regions. The precise relationship between surface albedo, geometric structure, and defocus is a complex one, but for some applications, like image-based rendering, the depth estimates in untextured regions may be less important.

Note that exploiting finite-aperture imaging for 3D reconstruction requires recovering a detailed, empirical lens model suitable for high-resolution cameras. While significant calibration may be required to recover such a model, this is thankfully a one-time process, and the recovered lens model can be reused for a variety of applications.

In this report, we begin by describing design considerations and general properties for finite-aperture lenses (Sec. 2). Next, we present concrete models for finite-aperture lenses and methods for their calibration (Sec. 3). We then describe models of defocus (Sec. 4), and measures for evaluating the level of defocus in an image (Sec. 5). Finally, we discuss two major categories of 3D reconstruction methods from finite-aperture lenses: depth-from-focus (Sec. 6) and depth-from-defocus (Sec. 7). We conclude with a discussion of open problems and possible topics for future research (Sec. 8).

2 Imaging with Finite-Aperture Lenses

The classic minimalist design for a finite-aperture lens consists of a simple convex lens element, at a controllable distance from the sensor plane, with a controllable aperture in front [45]. By contrast, modern commercially available SLR lenses are significantly more complex devices, designed to correct and balance a variety of distortions (Sec. 3.2) throughout their range of operation.

Modern lenses are typically composed of four or more lens elements, and up to 25–30 elements is not uncommon for a telephoto zoom lens (Fig. 1). These lens elements are arranged in various groups, whose axial spacing controls the focusing behavior of the lens. Compared to zoom lenses, fixed focal length or prime lenses require fewer elements.

The most common lens element shape is the spherical segment, because of its first-order ideal focusing property [45], and the ease with which it may be machined to demanding tolerances. Lens elements are typically created from glass (n ≈ 1.5), although more exotic materials like fluorite (n ≈ 1.4) may also be used in higher-end lenses. The lens elements are often thinly coated with other materials to reduce inter-reflections, or to filter certain wavelengths of light.



Figure 1: (a) Cut-away view of a real zoom lens, the Panasonic Lumix DMC-FZ30, reveals 14 lens elements arranged in 10 groups. In response to user-applied focus and zoom settings, some of the lens groups translate along the optical axis by different amounts. (b) Screenshot of Optis Solidworks computer-aided design software, which allows such complex lens designs to be modeled and evaluated in simulation.

2.1 Lens parameters

Although modern SLR lenses are complex mechanical devices, composed of many internal parts (Fig. 1), they are typically controlled using a standard set of parameters: (1) focus setting, (2) aperture setting, and (3) zoom setting (available for zoom lenses only). Note that from the user's point of view, these lens parameters are the only way to control how the scene is focused onto the image plane. While more specialized lenses, such as tilt-shift lenses, may offer additional lens parameters to control, we ignore such parameters in this treatment.

The parameters of the lens are often under electronic control from the camera body, and may further be exposed to the user through a software interface, e.g., from a laptop computer over a USB cable. Such remote programmable operation allows unattended capture sessions, which helps avoid the external vibrations associated with manual operation.

Below we give a phenomenological account of these three lens parameters, describing their effects at the level that a technically-minded amateur photographer might understand them, and describe their realization in typical lens designs (Sec. 2). Later we discuss these lens parameters more concretely, by describing analytic lens models that are explicitly defined in terms of these parameters (Sec. 3.1). We assume that any other camera parameters, such as white balancing or exposure time, are either held constant or may be fully accounted for across different images.

Focus setting The focus setting, f, is used to choose the region of the scene from which light is brought into perfect focus (Fig. 2). In contrast to the idealized



Figure 2: As the focus setting is varied, the 3D in-focus region moves from (a) close to the camera, to (b) far from the camera. © dpreview.com.

pinhole model (Sec. 3.1.1), in which every pixel on the sensor plane corresponds to a single ray of light, integration over the finite aperture means that only light from certain 3D points in the scene will be perfectly focused to single points on the sensor plane. For typical lenses, at a given focus setting, the corresponding scene points in perfect focus will lie on a smooth surface roughly parallel to the sensor plane.

We assume that any scene point further away than some minimum focusing distance can potentially be brought into perfect focus. Although the optics define such a minimum distance, this distance is also limited by engineering constraints such as the maximum physical extension of the lens.

Changes to the focus setting can be realized by translating the whole assembly of lens elements together, in an axial direction perpendicular to the sensor plane. In practice, changing the focus setting typically adjusts the inter-element spacing as well; however, the main effect can be achieved by simply changing the distance between the lens elements and the sensor plane.

Note that in addition to changing the focusing behavior, modifying the lens-sensor distance also has the side-effect of magnifying the image, which must be accounted for in certain applications (Sec. 3.3.1).

Aperture setting The aperture setting, α, is used to control the shape of an opaque mask or "stop" in front of the lens. This mask controls what light from the scene is blocked, and what light is integrated on the sensor plane. Clearly the largest, completely-open aperture setting is limited by the physical dimensions of the lens.

In practice, the front aperture is typically formed using a set of 5–12 opaque mechanical blades that pinwheel around to block off an approximately circular opening (Fig. 3). While arbitrary aperture masks can be implemented in theory, e.g., using specially designed filters [13], the aperture masks for standard cameras are nested and connected regions. Note that lenses effectively have internal apertures that block the incoming light as well (see Sec. 3.2), but these


Figure 3: Images of a real 50mm SLR lens showing the non-circular variation in aperture, at f/16, f/5.6, and f/1.8.


Figure 4: Changing the depth-of-field by varying the aperture setting. (a) A small aperture (f/8.0) yields a large depth-of-field, with most of the scene "acceptably" in focus, whereas (b) a larger aperture (f/1.4) yields a shallower depth-of-field, with a limited region of the 3D scene in focus. © dpreview.com.

apertures are not directly controllable.

The most obvious effect of changing the aperture setting is to control the

total amount of light integrated from the scene. To a first approximation, we can compensate for this effect simply by modifying the exposure time accordingly. Changing the aperture setting also leads to secondary radiometric effects, most notably increased vignetting, or relative darkening, at the corners of the image, especially for larger apertures (Sec. 3.3.2).

More importantly, the aperture setting affects the focusing behavior of the lens as well. In particular, for any 3D scene point that does not lie on the surface of perfect focus, changing the size of the aperture will affect the level of defocus for that point, namely the extent of its footprint on the sensor plane (see Sec. 4).

Note that because real sensors are composed of discrete sensor elements, the size of the elements can be used to designate an "acceptable" level of defocus, below which a scene point would effectively project to a single pixel anyway. Therefore, for a particular threshold representing an acceptable level of defocus, changing the aperture setting controls the depth-of-field, defined as the extent of the 3D region in the scene whose points are acceptably in focus



Figure 5: Changing the zoom setting from (a) telephoto (100mm), to (b) wide-angle (28mm), but moving the camera to keep the figurine roughly constant-size in the image. With this compensation, at an aperture of f/4, the depths-of-field are nearly the same. Also note how flattened perspective in the telephoto case causes the background to be relatively magnified. © Paul van Walree.

(Fig. 4). Large apertures yield a shallow depth-of-field, with a limited region of the 3D scene in focus, whereas smaller apertures yield a larger depth-of-field. In the limit, the depth-of-field for an idealized pinhole camera is infinite, with the whole scene perfectly in-focus.

The aperture setting is usually written using a special f-number notation, where "f/α" corresponds to an aperture whose effective diameter is λ/α, where λ is a lens parameter representing the focal length (see below). In typical lenses the aperture setting is a discrete parameter, whose values are spaced geometrically, e.g., so that the effective diameter grows by a factor of 2^(1/6) between successive aperture settings.
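To make the notation concrete, the following small Python sketch (an illustration, not from the report; the 50 mm focal length and the f/1.8 starting stop are arbitrary assumptions) lists the effective aperture diameters for a geometric sequence of third-stop settings:

```python
# Effective aperture diameter lam/alpha for a hypothetical 50 mm lens, with
# successive settings spaced so the diameter changes by a factor of 2**(1/6).
lam = 50.0                                        # focal length, in mm (assumed)
alphas = [1.8 * 2 ** (k / 6) for k in range(7)]   # f/1.8 through f/3.6
for alpha in alphas:
    print(f"f/{alpha:4.2f}  ->  diameter {lam / alpha:4.1f} mm")
```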

Zoom setting For zoom lenses, the zoom setting is used to manipulate the focal length of the lens, λ. The focal length controls the focusing power of a lens, or the degree to which light is redirected. Note that the zoom setting is inapplicable for prime lenses, which are designed with a fixed focal length.

In practice, changes to the zoom setting are effected by modifying the relative spacing of various groups of lens elements. Zoom lenses are also more mechanically complex than fixed focal length lenses.

The main effect of changing the focal length is to change the field of view and magnification; however, differences due to changes in perspective projection are a notable side-effect (Fig. 5). Large (telephoto) focal lengths correspond to a narrow field of view, high magnification, and relatively flattened depth variation. By contrast, small (wide-angle) focal lengths correspond to a wide field of view, low magnification, and relatively exaggerated depth variation.

The zoom setting also has a subtle effect on the focusing behavior of a lens. While telephoto lenses appear to create a shallower depth-of-field, this effect is primarily due to magnification and perspective flattening, which causes the background to be relatively magnified without resolving any additional detail.



Figure 6: Generic finite-aperture lens model. (a) At the ideal focus setting of pixel (x, y), the lens collects outgoing radiance from a scene point p and directs it toward the pixel. The 3D position of point p is uniquely determined by pixel (x, y) and its ideal focus setting. The shaded cone of rays, Cxy(α, f), determines the radiance reaching the pixel. This cone is a subset of the cone subtended by p and the front aperture because some rays may be blocked by internal components of the lens, or its back aperture. (b) For non-ideal focus settings, the lens integrates outgoing radiance from a region of the scene.

If we compensate for the magnification of a given subject, i.e., by moving the camera and adjusting the focus accordingly, the depth-of-field remains nearly constant across focal length (Fig. 5); however, slight differences still remain.

The zoom setting is typically held fixed in the context of 3D reconstruction from finite-aperture images, since magnification limits the achievable resolution, the focusing behavior is more complex to model, and perspective distortions make correspondence and calibration more difficult.

2.2 Generic finite-aperture lens model

To analyze the fundamental properties of finite-aperture lenses, it is useful to start from a generic black box model that covers a wide class of real lenses found in SLR cameras (Fig. 6). Later, by assuming a more restrictive concrete lens model (e.g., Sec. 3.1.2), we can derive stronger results about focusing behavior.

A generic way of thinking about a lens at a particular setting, (α, f, λ), is as a transformation between two light fields [19, 30]: the external light field present in the scene, incident upon the lens, and the internal light field that occurs between the back aperture of the lens and the sensor plane. We assume that for real lenses this transformation is smooth, i.e., a diffeomorphism. Note that if we can somehow recover the light field transformation of a lens in full, i.e., through a hypothetical extensive calibration process, we would possess a complete forward model for the lens, which would allow us to simulate finite-aperture images given a description of the scene.

We proceed by identifying three fundamental properties we claim are well-approximated by real lenses, after suitable correction for geometric and radiometric distortions (Sec. 3.3).

Property 1 (Negligible absorption) Light that enters the lens in a given direction is either blocked from exiting or is transmitted with no absorption.

If a constant fraction of light is in fact reflected, this attenuation may simply be folded into the camera response function.

As a consequence of the negligible absorption property, the light field transformation corresponding to the lens can be described more simply as a smooth mapping between rays in the external light field and rays in the internal light field.

Property 2 (Perfect focus) For every 3D point in front of the lens there is a unique focus setting that causes rays through the point to converge to a single pixel on the sensor plane.

The perfect focus property has important implications for 3D reconstruction, because it suggests that the ideal focus setting for a given 3D point can act as a direct proxy for depth (Sec. 6). Consider the function mapping from 3D scene points to their corresponding ideal focus settings. For real lenses, the level sets of this function, or equifocal surfaces, are smooth and connected surfaces. Posed differently, for a given pixel on the sensor plane, the locus of scene points focusing perfectly to the pixel will be a smooth trajectory in 3D space, parameterized by the focus setting.

More generally, the amount of defocus for a given 3D point can also act as a proxy for depth (Sec. 7). Defocus for a given point is typically quantified by the extent of its footprint on the sensor plane (Sec. 4).

Certain lens distortions, such as chromatic aberration, cannot be easily corrected through post-processing and so violate the perfect focus assumption (Sec. 3.2). Nevertheless, perfect focus is closely approximated by many high-quality SLR lenses in practice. Even if perfect focus cannot actually be achieved, the existence of a unique ideal focus setting is sufficient for many applications.

Property 3 (Aperture-focus independence) The aperture setting controls only which light rays are blocked from entering the lens; it does not affect the way the lens redirects transmitted light.

The property of aperture-focus independence implies that the only role for the aperture setting is to select the domain of unblocked rays from the external light field to be integrated onto the sensor. Thus, because the apertures in a real camera are nested, the rays integrated by a given pixel will be a subset of those integrated using any larger aperture, all other lens settings being equal. This means that the intensity at a given pixel will vary monotonically over the range of aperture settings.

Hasinoff and Kutulakos recently described a method to exploit cross-aperture intensity variation for 3D reconstruction [21]. They introduced the confocal constancy property, which states that the pixel intensity of a visible in-focus scene point will vary in a scene-independent way that may be predicted by prior radiometric lens calibration. The only requirement is that the incoming radiance within the cone subtended by the largest aperture is nearly constant.


3 Models for Finite-Aperture Lenses

Any concrete finite-aperture lens model needs to specify two things: (1) geometric properties, or how incoming light is redirected, and (2) radiometric properties, or how light from different incoming rays is blocked or attenuated. Specifying such a model completely defines the image formation process, and allows us to apply the lens model synthetically given a description of the scene.

In practice, simple analytic models (Sec. 3.1) are typically used to approximate the behavior of the lens, after factoring out the effects of various geometric and radiometric distortions (Sec. 3.2). The parameters of these analytic models may be provided by the lens manufacturer, but more often are fit empirically using a calibration procedure (Sec. 3.3).

The gold standard lens model consists of a detailed simulation of the optics, given a complete physical description of the lens design (Fig. 1b). Unfortunately, these detailed simulations belong exclusively in the domain of lens designers. For all commercially available SLR lenses, information about their designs is proprietary, or else is not specified accurately enough to be used for this purpose. Instead, to achieve a high level of modeling accuracy we resort to constructing detailed empirical models based on calibration (Sec. 3.3). In practice, such empirical models may actually be more suitable for describing the quirks of an individual lens, which may not be manufactured exactly to specification.

By contrast, some applications using finite-aperture lenses require no explicit lens model whatsoever. These applications instead exploit generic lens properties (Sec. 2.2), such as the perfect focus property for depth-from-focus methods (Sec. 6), or confocal constancy in the case of confocal stereo [21].

3.1 Basic analytic models

3.1.1 Pinhole model

The well-known pinhole model is the standard model used in computer vision and computer graphics to represent an idealized perspective camera. Because the pinhole model assumes the aperture to be an infinitesimally small point, it cannot express finite-aperture effects such as defocus. Nevertheless, it is useful to understand this model as a limiting case of more sophisticated finite-aperture lens models.

In practice, very small apertures such as f/22 can serve as a reasonable approximation for the idealized pinhole model, particularly for a relatively distant scene. There are limits, however, to the accuracy of this approximation, because diffraction effects limit the sharpness that may be achieved with an extremely small aperture. Furthermore, because such small apertures collect less light, using them requires relatively long exposure times or strong external lighting.

The pinhole model can be defined by a center of projection, C, coincident with the pinhole aperture (Fig. 7). This geometry implies that every pixel on the sensor plane, X, corresponds to a single perfectly focused ray from the scene, CX, so the entire scene will be in focus. Moreover, since all points along a ray



Figure 7: Geometry of the thin lens model, represented in 2D. When the sensor plane is positioned at an axial distance of v from the lens, the set of in-focus points lies on a corresponding equifocal plane at an axial distance of u, as given by Eq. (1). As shown, the rays converging on a particular point on the sensor plane, X, originate from a corresponding in-focus scene point, P. When the scene surface and P do not coincide, X is defocused and integrates light from a cone, whose projected diameter, 2σ, is given by Eq. (4). Note that the principal ray passing through the lens center, C, is undeflected. The aperture is assumed to be circular, with diameter a = λ/α, and centered on C.

through the pinhole correspond to the same pixel, information about distance to objects in the scene is lost through projection.

Photometrically, the image irradiance, E, depends only on the radiance, L, associated with the corresponding ray. Assuming a linear sensor response, this gives the relationship E(X) ∝ L(CX).

Although aperture and zoom setting have no meaningful interpretation for a pinhole lens, the distance from the pinhole to the sensor plane, v, can be interpreted as a degenerate form of focus setting (Fig. 7). For any such distance, the pinhole lens will still achieve perfect focus; however, moving the sensor plane has the side-effect of magnifying the image.

Note that the pinhole lens does not actually redirect light from the scene, but simply restricts which rays reach the sensor. Therefore, an alternate way of thinking about a pinhole lens is as a mechanism for selecting a 2D slice from the 4D external light field from the scene (Sec. 2.2).

3.1.2 Thin lens model

The thin lens model is a classic lens model based on two spherical refracting surfaces with negligible separation, assuming a simple first-order approximation of geometric optics (Fig. 7). In a sense the thin lens model is the simplest reasonable model accounting for finite-aperture lenses with variable aperture, focus setting, and focal length. The model has been widely used for 3D reconstruction from finite-aperture images [6, 12, 38, 39, 42, 54, 60]. Note that for a pinhole


aperture at the lens center, the thin lens model reduces to the pinhole model.

An important consequence of the thin lens model is that for a given focus

setting, namely the position of the sensor plane, v, the surface defining the corresponding set of perfectly focused scene points is a plane parallel to the sensor plane. In other words, the equifocal surfaces (Sec. 2.2) for the thin lens model are simply fronto-parallel planes.

The thin lens model is based on a distinguished line known as the optical axis. The optical axis is perpendicular to the sensor plane, passes through the lens center, C, and is normal to both refracting surfaces (Fig. 7). According to the first-order optics assumed by the thin lens model, we can assume that the angle between a ray and the optical axis is negligible. This approximation, known as the paraxial assumption [42, 46], gives an invariance to transversal shifts perpendicular to the optical axis. In algebraic terms, the paraxial assumption linearizes the trigonometric functions as sin(x) ≈ x and cos(x) ≈ 1. Therefore, Snell's law of refraction, n1 sin θ1 = n2 sin θ2, may be approximated as n1 θ1 = n2 θ2.
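As a quick numeric illustration (not from the report), the following Python sketch compares exact refraction angles with their paraxial approximations at an air–glass interface with n2 ≈ 1.5, showing that the approximation is excellent only for rays nearly parallel to the optical axis:

```python
# Compare Snell's law, n1*sin(t1) = n2*sin(t2), with its paraxial
# linearization, n1*t1 = n2*t2, for a few incidence angles.
import math

n1, n2 = 1.0, 1.5                       # air to glass (assumed indices)
for deg in (1, 5, 15, 30):
    t1 = math.radians(deg)
    exact = math.degrees(math.asin(n1 * math.sin(t1) / n2))
    paraxial = math.degrees(n1 * t1 / n2)
    print(f"{deg:2d} deg incidence: exact {exact:6.3f}, paraxial {paraxial:6.3f}")
```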

Under the paraxial assumption, a spherical refracting surface can be shown to focus incident rays of light to a common point. Then, using basic geometry, we can derive the classic focusing relationship between points on either side of the lens, also known as the thin lens law:

1/u + 1/v = 1/λ ,    (1)

where v is the axial distance from a point on the sensor plane to the lens, u is the axial distance from the lens to the corresponding in-focus scene point, and λ is the focal length (see Sec. 2.1). Note that under the thin lens model, the focal length also corresponds to the distance behind the lens at which rays parallel to the optical axis, i.e., from an infinitely distant scene point, will converge.

For a given pixel on the sensor plane, X, the ray passing through the lens center, XC, also known as the principal ray, will not be refracted. This is due to the paraxial assumption, which treats the principal ray analogously to the optical axis. By definition, the corresponding in-focus scene point, P, must lie on the principal ray, giving rise to the following explicit construction:

v = ||C − X|| cos θ ,    (2)

P = X + ((u + v)/v) (C − X) ,    (3)

where θ is the angle between the principal ray and the optical axis (Fig. 7), and the scene-side axial distance u may be computed according to Eq. (1).

If the closest scene point along the principal ray, P′, lies on the equifocal plane, i.e., P′ = P, then the corresponding pixel on the sensor plane, X, is perfectly in focus. Otherwise, X integrates light from some region of the scene.

From simple geometry, the domain of integration is a cone whose cross-section is the shape of the aperture. Thus for an idealized circular aperture, the


projection of the cone onto a fronto-parallel plane passing through P′ is circular. By similar triangles, the diameter, 2σ, of this blur circle may be computed as

2σ = a |d − u| / u ,    (4)

where a = λ/α is the aperture diameter, and d is the axial distance of P′ from the lens. By rearranging this equation, we obtain:

d = u (1 ± 2σ/a) ,    (5)

which expresses the depth of the scene, d, in terms of the degree to which pixel X is defocused, as represented by the blur diameter, 2σ. This is the basic idea that enables depth-from-defocus methods (Sec. 7).
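To make Eqs. (1), (4), and (5) concrete, here is a minimal Python sketch (an illustration, not code from the report) that computes the in-focus distance, the blur diameter for a given depth, and the two depth hypotheses consistent with a given blur; all distances are in meters, and the 50 mm f/1.8 lens focused at 1 m is an assumed example:

```python
# Thin-lens defocus relations, following Eqs. (1), (4) and (5) in the text.
# lam = focal length, alpha = f-number, v = lens-to-sensor distance,
# u = in-focus scene distance, d = actual depth of the scene point.

def in_focus_distance(v, lam):
    """Scene distance u brought into perfect focus for sensor distance v (Eq. 1)."""
    return 1.0 / (1.0 / lam - 1.0 / v)        # requires v > lam

def blur_diameter(d, v, lam, alpha):
    """Blur-circle diameter 2*sigma for a scene point at depth d (Eq. 4)."""
    a = lam / alpha                           # effective aperture diameter
    u = in_focus_distance(v, lam)
    return a * abs(d - u) / u

def depths_from_blur(two_sigma, v, lam, alpha):
    """The two depth hypotheses consistent with a given blur diameter (Eq. 5)."""
    a = lam / alpha
    u = in_focus_distance(v, lam)
    return u * (1.0 + two_sigma / a), u * (1.0 - two_sigma / a)

# Example: a 50 mm lens at f/1.8 focused at u = 1 m; defocus of a point at 1.2 m.
lam, alpha = 0.050, 1.8
v = 1.0 / (1.0 / lam - 1.0 / 1.0)             # sensor distance giving u = 1 m
two_sigma = blur_diameter(1.2, v, lam, alpha)
print(two_sigma)                              # ~5.6 mm blur diameter (scene-side)
print(depths_from_blur(two_sigma, v, lam, alpha))   # the +/- ambiguity of Eq. (5)
```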

Photometrically, the image irradiance, E, for the thin lens model depends not only on the radiance, L, associated with the corresponding ray, but also on factors causing radiometric variation over the sensor plane and for objects at different depths. Assuming a linear sensor response, the thin lens irradiance may be derived as [2, 45]:

E(X) ∝ (A/v²) L(CX) cos⁴ θ ,    (6)

where A = π(a/2)² is the area of the aperture. The distance to the sensor plane, v, causes an inverse-squared distance falloff, and the angle from the optical axis, θ, contributes to several off-axis effects.
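For a rough sense of the cos⁴ θ term in Eq. (6), the short sketch below (an illustration, not from the report) prints the relative off-axis falloff predicted for a few field angles:

```python
# Relative irradiance falloff cos^4(theta) predicted by Eq. (6),
# for a point imaged at angle theta from the optical axis.
import math

for deg in (0, 10, 20, 30):
    falloff = math.cos(math.radians(deg)) ** 4
    print(f"theta = {deg:2d} deg  ->  relative irradiance {falloff:.3f}")
```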

Note that the thin lens formula is an idealization, satisfied only for an aberration-free lens near the optical axis, so that the paraxial assumption holds. In practice, the model is still a reasonable approximation for many finite-aperture lenses; however, calibrating its parameters is non-trivial (Sec. 3.3). For additional accuracy, more detailed empirical calibration may be used to factor out residual geometric and radiometric distortions, to reduce the behavior of a real finite-aperture lens to the thin lens model.

3.1.3 Thick (Gaussian) lens model

Another classic imaging model favored by some authors is the thick Gaussian lens model [25, 46, 48]; however, it can easily be reduced to the thin lens model, provided the medium, e.g., air, is the same on both sides of the lens. The thick lens model defines two distinct refracting surfaces with fixed separation, where the axial distances u and v are measured with respect to those planes. By simply collapsing the planes we obtain the thin lens model, as the model of refraction with respect to the planes is identical in both cases. In any case, the "thickness" of the lens model has no physical meaning for real multi-element lenses.


3.1.4 Pupil-centric model

As Aggarwal and Ahuja note, the thin lens model assumes that the position of the aperture is coincident with the effective scene-side refractive surface; however, real lens designs often violate this assumption [2]. To address this deficiency, they propose a richer analytic model, called the pupil-centric model, which incorporates the positions of the entrance and exit pupils, and possibly the tilt of the sensor plane relative to the optical axis.

For a given setting of the lens parameters, the pupil-centric model reduces to an instance of the thin lens model, whose effective parameters could be calibrated empirically. The real advantage of the pupil-centric model is that it provides a more accurate analytic model across all lens settings, from a small number of extra model parameters. These pupil-centric parameters may be empirically fit through calibration; however, the authors suggest measuring some of them directly for the lens in question, using a second camera to perform manual depth-from-focus (Sec. 6) on the internal lens components!

3.2 Distortions in real lenses

Analytic imaging models, like the thin lens model, serve as a useful first approximation to the behavior of real lenses. In practice, however, real lenses suffer from significant geometric and radiometric distortions from those basic models, also known as aberrations. The bulk of these distortions are due to fundamental limitations in the analytic model, i.e., the approximate first-order model of optics assumed by the thin lens model, and so take one of the predictable forms described below. However, physical design constraints, such as aperture placement, as well as limited manufacturing tolerances, can also contribute to these distortions.

Lens distortions manifest themselves most clearly for the minimalist finite-aperture lens design mentioned earlier, consisting of a single spherically-shaped convex lens element. More complicated real lens designs suffer less from distortions, since a great deal of design effort has already been expended to reduce these distortions, or otherwise to make compromises that balance different distortions over the operational range of the lens settings. Even the highest quality professional SLR lenses suffer from residual distortions, due to inherent limitations of the optical design. While complex lens systems make the analysis of distortions more difficult, to some extent we can model and correct for such distortions using an empirical calibration procedure (Sec. 3.3).

Seidel aberrations The first category of distortions we consider are geometric distortions from the first-order paraxial model of optics, which prevent rays from the scene from focusing perfectly on the sensor, or else from focusing at the expected location. Five common types of geometric distortions, known as Seidel aberrations, may be accounted for by considering a richer third-order model of optics. Algebraically, third-order optics involves adding an extra Taylor series term to the trigonometric functions, as sin(x) ≈ x − x³/3! and cos(x) ≈ 1 − x²/2!.

15

Page 16: 3D Reconstruction from Finite-Aperture Lensespeople.csail.mit.edu/hasinoff/pubs/depthoral.pdf · 2006-04-14 · 3D Reconstruction from Finite-Aperture Lenses (Ph.D. Depth Oral Report)

While the first of these distortions, spherical aberration, affects the focusing even for scene points along the optical axis, the remaining Seidel aberrations do not affect such points, but rather affect off-axis scene points to different degrees, depending on their angle from the optical axis:

• Spherical aberration Due to the fact that a spherical lens is not the ideal shape for focusing, rays that reach the margins of the lens will be refracted to a relatively closer position, preventing rays from converging perfectly at a point on the sensor plane. To minimize this distortion, lens elements may be designed with special aspheric shapes.

• Coma For large apertures, off-axis scene points will be defocused in a characteristic comet shape, whose scale increases with the angle from the optical axis.

• Astigmatism From the point of view of an off-axis scene point, the lens is effectively tilted with respect to the principal ray. This causes foreshortening and leads to focusing differences in the radial and tangential directions. The classic example is a wheel centered in the image, whose spokes and rim can both be perfectly focused, but not simultaneously.

• Field curvature Even using a perfectly focusing aspheric lens element, the resulting equifocal surfaces in the scene may be slightly curved. This incompatibility between the curved shape of the equifocal surfaces and the planar sensor causes objects at infinity, or fronto-planar objects, to be radially defocused.

• Radial distortion If the aperture stop is displaced from the front of the lens, rays through the center of the aperture will be refracted, leading to radially symmetric magnification which depends on the angle of the incoming ray. This distortion depends on the displacement but not the size of the aperture, and its magnitude is dependent on the position of the sensor plane as well. Although radial distortion does not affect sharpness, it causes straight lines in the scene to be imaged as curved lines.

Chromatic aberrations Another fundamental type of distortion stems from the dependence of refractive index on the wavelength of light, according to the same physical principle which causes blue light to be more refracted than red light through a prism. In addition to reducing the overall sharpness of the image, chromatic aberrations can also lead to color fringing artifacts that are especially pronounced at high-contrast edges.

One component of chromatic aberration is axial, which prevents the lens from focusing simultaneously on different colored rays originating from the same scene point. Chromatic aberration also has a lateral component, causing off-axis scene points to be focused with magnification that is dependent on their color, leading to prism-like dispersion effects. In practice, systems using multiple lens elements with special spacing or different refractive indexes can largely eliminate both types of chromatic aberration.


Radiometric distortions The last category of distortions we consider are radiometric distortions, which cause intensity variations on the sensor even when the radiance of every ray in the scene is constant. The most common type of radiometric distortion is vignetting, which refers to darkening, or even complete occlusion, at the periphery of the sensor. There are a variety of sources for vignetting:

• Mechanical vignetting Some light paths are completely blocked by the main aperture, internal apertures, or external attachments such as filters or lens hoods.

• Natural vignetting Also known as off-axis illumination, natural vignetting refers to the cos⁴ θ falloff already accounted for by the thin lens model (Sec. 3.1.2), arising from integration over oblique differential solid angles [45].

• Optical vignetting The displacement of the aperture from the front of the lens causes reduced visibility for oblique rays, particularly for large aperture settings, as portions of the entrance pupil become effectively occluded. This type of vignetting leads to characteristic "cat's eye" defocus, narrowing toward the edges of the image and corresponding to the shape of the visible aperture.

Another radiometric distortion related to optical vignetting, known as pupil aberration, is the nonuniformity of radiometric variation of a scene point across the visible aperture. This effect may be especially pronounced for small asymmetric apertures whose centroid is off-axis [3].

As shown by Kang and Weiss, it is even possible in principle to fit models of vignetting to a single image of a diffuse white plane, in order to recover intrinsic camera calibration [24]. While some camera parameters could not be estimated in a stable way, their results nonetheless suggest that radiometric distortions can be accounted for, and even carry useful information about the imaging system.

3.3 Calibration methods

In order to relate images captured at different lens settings, (α, f, λ), it is necessary to align them in both geometric and radiometric terms. Any analytic lens model will predict such an alignment, so a naïve approach to camera calibration is to assume that the chosen analytic lens model is sufficiently accurate, and that its effective parameters may be obtained directly from the lens manufacturer [48]. We assume that the lens parameters, (α, f, λ), that we manipulate are well instrumented, or else can be made so through additional calibration.

For higher accuracy, however, known calibration patterns may be used to estimate the parameters of the analytic lens model empirically, as well as to estimate various distortions from that model (Sec. 3.2) [5, 10, 21, 27, 56]. Taking this idea to an extreme, extensive empirical calibration could theoretically be used to reduce lens calibration to a pixel-level table look-up, with entries for every image coordinate, (x, y), at every tuple of lens parameters, (α, f, λ) [46].


In the following, we describe geometric and radiometric calibration methods formulated in terms of in-focus 3D points, or else in terms of the principal ray projecting to the "center" of the defocus on the sensor plane. We defer discussion of the specific form of defocus, including methods for its empirical calibration, to a later section (Sec. 4).

3.3.1 Geometric calibration

Geometric calibration means that we can locate the projection of the same 3D point in multiple images taken with different settings. Real lenses, however, map 3D points onto the image plane in a non-linear fashion that cannot be predicted by ordinary perspective projection. While the main source of these distortions is changes to the focus or zoom setting, the aperture setting affects this distortion in a subtle way as well, by amplifying certain aberrations which cause small changes to the image magnification.

The single geometric distortion with the largest effect is image magnification caused by changes to the focus or zoom setting. Such magnification can actually be predicted using the thin lens model (Sec. 3.1.2), but in practice the extent of the magnification must be recovered empirically, as typical lenses are not sufficiently calibrated a priori with respect to their settings.

For some reconstruction methods, image magnification is not mentioned explicitly [63], or else is consciously ignored [34, 46, 48]. Since such magnification is on the order of 5% at the image margins, methods that use very low-resolution images or consider large image patches can sidestep modeling these effects for simplicity. Other reconstruction approaches circumvent the image magnification problem by changing the aperture setting instead [38, 39, 46, 48], by moving the object [36], or by changing the zoom setting to compensate [10, 57].

Another interesting approach for avoiding the image magnification effect is to use special image-side telecentric optics, designed so that the principal ray always emerges parallel to the optical axis [52–54]. The telecentric lens design has the effect of avoiding magnification with sensor plane motion, and has the added benefit of avoiding any radiometric falloff due to the position of the sensor plane. In practice, telecentric lens designs can be realized by placing an additional aperture at an analytically-derived position, e.g., at a distance of λ in front of the lens, for the thin lens model.

A more direct approach for handling image magnification involves fitting this magnification, either directly [5, 27] or in a prior calibration step [10, 21, 56], which allows us to warp and resample the input images to some reference lens setting. However, as Willson and Shafer note, simply using the center pixel of the sensor is insufficient for accurately modeling magnification [58]. They suggest a simple method for fitting the magnification center, based on the pairwise relative expansions of image features, but the image center may also be recovered as another parameter in an overall optimization [21].
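As an illustration of this warp-and-resample step, the following Python sketch (hypothetical, not the calibration code from the cited work) undoes a fitted magnification m about a fitted center (cx, cy) so that an image can be compared against a reference lens setting; m, cx, and cy are assumed to come from prior calibration:

```python
# Resample an image onto the coordinate frame of a reference lens setting,
# undoing an isotropic magnification m about a fitted center (cx, cy).
import cv2
import numpy as np

def undo_magnification(img, m, cx, cy):
    # For each reference pixel x_ref, sample the observed image at
    # x_src = m * (x_ref - c) + c  (and similarly for y).
    M = np.float32([[m, 0, (1 - m) * cx],
                    [0, m, (1 - m) * cy]])
    h, w = img.shape[:2]
    return cv2.warpAffine(img, M, (w, h),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```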

Beyond simple image magnification with an empirically fit center, even richer models of geometric distortion, including radial distortion, have been proposed as well [21, 27, 34, 56]. Kubota et al. proposed a hierarchical registration


method, analogous to block-based optical flow, for geometrically aligning defocused images captured at different lens settings [27]. Willson implemented extensive geometric calibration as well, by fitting polynomials in the lens parameters, (α, f, λ), to a complete parameterization of the standard 3 × 4 perspective projection camera matrix, inferring the degree of these polynomials automatically [56]. Nair and Stewart suggest correcting for field curvature, by fitting a quadratic surface to a depth map obtained by applying their reconstruction method to a known fronto-planar scene [34].

Another class of geometric distortion, not accounted for in analytic optical models, is non-deterministic distortion, caused by random vibrations, both internal and external to the camera, and hysteresis of the lens mechanism [21, 53, 58]. These effects can be especially significant for high-resolution images, and can even occur when the camera is controlled remotely without any change in settings, and is mounted securely on an optical table [21]. To offset non-deterministic distortions, a first-order translational model can be fit to subpixel shifts [21, 53]. Unlike other geometric distortions, which may be calibrated offline, non-deterministic distortion must be recomputed online, in addition to being accounted for in any offline calibration process.

3.3.2 Radiometric calibration

Radiometric calibration means that we can relate the intensity of the same 3D point in multiple images taken with different settings. While the main source of radiometric distortion is changes to the aperture setting, the focus and zoom settings affect this distortion in a more subtle way as well, e.g., by the inverse-squared falloff due to the distance to the sensor plane.

Some reconstruction methods that rely on cross-aperture image comparisons do not mention radiometric distortion explicitly [38, 39]. The most common approach to handling radiometric distortion is simply to normalize a given image region by its mean brightness [46, 48, 52]. This normalization provides some invariance to radiometric distortion, provided that the level of distortion does not vary too much across the image region.
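A minimal sketch of this mean-brightness normalization (an illustration only; the cited methods differ in detail) might look like:

```python
# Normalize an image patch by its mean brightness, giving partial invariance
# to radiometric changes across lens settings.
import numpy as np

def normalize_patch(patch, eps=1e-8):
    return patch / (patch.mean() + eps)   # eps guards against an all-black patch
```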

Richer models of radiometric variation may also be fit to images of a calibration target such as a diffuse white plane [21, 24]. One approach is to fit a parametric model of vignetting to each single image, e.g., off-axis illumination with a simple linear falloff with radius [24]. By contrast, we can use a more direct approach to obtain an empirical measure of cross-aperture radiometric variation on a per-pixel level [21].

Another rarely mentioned radiometric distortion that should really be accounted for is the camera response function, which maps image irradiance to pixel intensity in a non-linear way [11, 20, 33]. By recovering and inverting this function, we can compare measured image irradiances directly, in agreement with the additive nature of light (Sec. 3) [21].
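The sketch below illustrates this linearization step under the assumption that a monotonically increasing response curve has already been recovered (e.g., by the techniques of [11, 20, 33]); the lookup-table representation is an assumption for illustration:

```python
# Invert a recovered camera response function so that pixel values can be
# compared as (relative) irradiances.
import numpy as np

def linearize(img, response_lut):
    """response_lut[k] holds the pixel value produced by irradiance level k.

    Assumes the response is monotonically increasing, so it can be inverted
    by swapping the roles of x and y in a 1D interpolation.
    """
    irradiance = np.linspace(0.0, 1.0, len(response_lut))
    return np.interp(img, response_lut, irradiance)
```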


4 Models of Defocus

If the rays incident on the lens from a given 3D scene point do not converge to a unique pixel on the sensor plane, the scene point is considered to be defocused, and the extent of its defocus is given by the footprint of these rays on the sensor plane. In a dual sense, a pixel on the sensor plane is defocused if the rays converging on that pixel do not all originate from a single 3D point lying on the scene surface.

Given a concrete lens model describing how every ray in the scene is redirected and attenuated (Sec. 3), the phenomenon of defocus will be completely defined. But while the analytic lens models we described lead directly to models of defocus (see Sec. 3.1.2), these simple models may not be detailed enough for some applications, nor is it clear how geometric warping for image alignment interacts with defocus. For these reasons, defocus is often treated separately from other aspects of the finite-aperture imaging model.

Although defocus is overwhelmingly modeled as some form of linear filtering, this approximation cannot accurately represent defocus at sharp occlusion boundaries, nor the presence of any self-occlusion over the 2D baseline defined by the lens [6, 17]. In fact, simulating defocus in general requires full knowledge of the scene geometry, which adds significant complexity to the reconstruction problem. By properly modeling occlusions, more general models of defocus predict such effects as the ability to see "behind" severely defocused foreground objects.

To date, however, only a few focus-based reconstruction methods have attempted to accurately model defocus at occluding edges [7, 17]. Both of these methods require strong assumptions about the geometry of the scene: that depth discontinuities are due to opaque step edges aligned with the optical axis [7], or that the scene lies on two non-intersecting smooth surfaces [17].

For some applications such as depth-from-focus (Sec. 6), an explicit model for defocus is unnecessary. Instead, the basic realization that defocus causes a reduction in contrast is sufficient to devise a focus measure to identify the lens setting at which defocus is minimized. While this approach holds for general lens models without defocus calibration, it implicitly assumes that the scene geometry is smooth; otherwise occluding foreground objects could contaminate the estimation of optimal focus and lead to arbitrary errors.

4.1 Defocus as linear filtering

In computer vision, defocus has overwhelmingly been modeled as linear filtering acting on an ideal, perfectly focused pinhole version of the image. This model has the advantage that it allows us to describe an observed defocused image, Î, as a simple convolution,

Î = Bσ ⊗ I ,    (7)

where Bσ is the blur kernel, or 2D point-spread function, σ is a parameter corresponding to the level of defocus, and I is the ideal pinhole image of the


scene. We assume that the blur kernel is normalized, ∫∫ Bσ(x, y) dx dy = 1, implying that radiometric calibration between Î and I has been taken into account.

This model of defocus as linear filtering corresponds to the paraxial assumption of the thin lens model (Sec. 3.1.2), under the restrictive assumption that the scene consists of a single plane parallel to the sensor plane.

The blur kernel acts as a low-pass filter, so that as the image is defocused, contrast is lost and high frequencies rapidly decrease. Although the response of the blur kernel need not decay monotonically for all higher frequencies (i.e., side lobes may exist), none of its local frequency maxima are as high as the DC response in any reasonable physical system.

To make the identification of blur tractable, we typically require that the blur kernel may be parameterized by a single quantity, σ. We often make the further assumption that the blur kernel Bσ is radially symmetric, or at least that it may be parameterized according to its radius of gyration (see Sec. 5.1).
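For intuition, the following Python sketch (an illustration under the single fronto-parallel plane assumption of Eq. (7), not code from the report) simulates defocus by convolving a pinhole image with a normalized disc-shaped kernel; the pillbox shape is itself an idealization of a circular aperture (empirical kernels are discussed in Sec. 4.6):

```python
# Simulate Eq. (7): observed image = B_sigma ⊗ pinhole image, for one plane.
import numpy as np
from scipy.signal import fftconvolve

def disc_psf(radius_px):
    """Normalized pillbox blur kernel of the given radius, in pixels."""
    r = int(np.ceil(radius_px))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    B = (x**2 + y**2 <= radius_px**2).astype(float)
    return B / B.sum()                     # kernel integrates to 1

def defocus(pinhole_img, radius_px):
    return fftconvolve(pinhole_img, disc_psf(radius_px), mode='same')
```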

4.2 Spatially variant filtering

In practice, the assumption that the scene consists of a single fronto-parallel plane is too restrictive. We can relax this assumption somewhat by modeling the blur parameter as spatially varying, i.e., σ(x, y), corresponding to a scene that is locally fronto-parallel at every point [7, 40, 41]. This results in the following spatially variant convolution,

Î(x, y) = ∫∫ Bσ(s,t)(x − s, y − t) · I(s, t) ds dt ,    (8)

which can be thought of as independently defocusing every pixel in the pinhole image, I(x, y), according to varying levels of blur, and integrating the results. Note that although this defocusing model is no longer a simple convolution, it is still linear, since every pixel Î(x, y) is a linear function of I.

In practice, smoothness priors are often introduced on the spatially variant blur, σ(x, y), corresponding to smoothness priors on the scene geometry [40, 41]. These priors help regularize the solution of Eq. (8), and balance reconstruction fidelity against discontinuities in depth.
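As an illustration of Eq. (8), the direct (and deliberately unoptimized) sketch below scatters each pinhole pixel over the output according to its own blur kernel and sums the contributions; kernel_fn stands for any single-parameter kernel generator, such as the Gaussian sketch above:

    import numpy as np

    def defocus_spatially_variant(I, sigma_map, kernel_fn):
        """Eq. (8): defocus each pinhole pixel I(s,t) by its own kernel B_sigma(s,t),
        then integrate (sum) the contributions into the observed image."""
        h, w = I.shape
        out = np.zeros_like(I, dtype=float)
        for s in range(h):
            for t in range(w):
                k = kernel_fn(sigma_map[s, t])
                r = k.shape[0] // 2
                y0, y1 = max(0, s - r), min(h, s + r + 1)
                x0, x1 = max(0, t - r), min(w, t + r + 1)
                out[y0:y1, x0:x1] += I[s, t] * k[r - (s - y0): r + (y1 - s),
                                                 r - (t - x0): r + (x1 - t)]
        return out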

4.3 Windowed linear filtering

In general, the spatially variant filtering model of Eq. (8) means that we can no longer relate a particular observed defocused pixel, Î(x, y), to a single blur parameter, σ. But provided that σ(x, y) is constant within a sufficiently large window centered on (x, y), Eq. (8) reduces locally to

Î(x, y) = [B_{σ(x,y)} ⊗ I](x, y) .   (9)

This observation motivates the popular sliding window model [5, 9, 10, 12, 16, 18, 34, 35, 38, 39, 46, 48, 52, 54], where for a particular pixel, (x, y), we can express defocusing as filtering within its local window,

Î · W_{(x,y)} = (B_{σ(x,y)} ⊗ I) · W_{(x,y)} ,   (10)

where W_{(x,y)} represents the windowing function centered at (x, y).

The choice of the window size in this model presents a dilemma. While larger windows may improve the robustness of depth estimation because they provide more data, they are also more likely to violate the assumption that the scene is locally fronto-parallel, and lead to a lower effective resolution. Therefore no single window size for a given scene may lead to both accurate and precise depth estimates.

Note that strictly speaking, the geometric model implied by the sliding window model is inconsistent, in the sense that two nearby pixels assigned to different depths contradict each other's assumption that the scene is locally fronto-planar, wherever their windows overlap. Therefore, the windowed model is only a reasonable approximation if the scene is smooth enough that depth within the sliding window can be locally approximated as fronto-parallel.

A problem caused by analyzing overlapping windows in isolation is that blur from points outside the window may intrude and contaminate the reconstruction [34, 46]. This problem can be partially mitigated using a smooth falloff, such as a Gaussian, for the windowing function [18, 39, 46, 48].

4.4 Defocus for local tangent planes

One way to generalize the defocus model beyond even shift-variant filtering is to relax the assumption that the scene is locally fronto-parallel. In particular, by estimating the normal at each point as well as its depth, the scene can be modeled as a set of local tangent planes, accounting for the effects of foreshortening on defocus [23, 60]. Note that local tangent planes are not sufficient to model sharp occlusion boundaries or generic self-occluding scenes.

When the neighborhood of a point is modeled by a tangent plane, the defocus parameter, σ, may vary across that neighborhood. Even if this variation is linear, the defocus integral can no longer be expressed as a simple spatial convolution, even locally. To address this issue, the resulting defocus integral can be linearized by truncating higher-order terms, assuming that the defocus parameter varies sufficiently slowly [23, 60].

For reconstruction methods recovering relative defocus operators over a basis of fronto-parallel planes [18], the defocus model can be generalized to local tangent planes in a straightforward way. In theory, this simply involves expanding the basis of shapes, by sampling over the orientation of the planes as well. However, because of the additional calibration burden and the large patch sizes that would be required to avoid overfitting, this approach has not yet been implemented.

The local tangent plane model leads to a more complex estimation problem, however it also gives robust estimates for shape [23, 60], and provides a more direct means of evaluating the smoothness of the scene [23]. Furthermore, the recovered normal is a useful cue for reliability, as severe foreshortening often corresponds to unstable depth estimation.

Figure 8: Point-spread functions for two common defocus models, (a) the pillbox, P_r(x, y), and (b) the isotropic Gaussian, G_r(x, y). The top row corresponds to the spatial domain, and the bottom row to the frequency domain.

4.5 Analytic defocus models

Assuming a linear filtering model of defocus, the two most commonly used analytic models for the blur kernel B_σ are the pillbox and the isotropic Gaussian (Fig. 8). We present these models along with their Fourier transforms, anticipating a frequency-based analysis later.

Pillbox defocus model Starting from the thin lens model, geometric optics predict that the footprint of a point on the sensor plane, as projected onto a fronto-parallel plane in the scene, is just a scaled version of the aperture (Fig. 7). So under the idealization that the aperture is circular, the footprint will be circular as well, leading to a cylindrical, or pillbox, model of defocus [42, 54]:

P_r(x, y) = { 1/(πr²)  if x² + y² ≤ r²,   0 otherwise }   (11)

⇐⇒   F[P_r](ω, ν) = 2 J1( 2πr √(ω² + ν²) ) / ( 2πr √(ω² + ν²) ) ,   (12)

where r is the radius of the blur circle, and J1 represents the first-order Bessel function of the first kind, which produces cylindrical harmonics that are qualitatively similar to the ringing behavior of the 1D function sinc(x) = (1/x) sin(x). For the thin lens model we simply have the relation r = σ.


Gaussian defocus model Although first-order geometric optics predict that defocus within the blur circle should be constant, as in the pillbox function, the combined effects of such phenomena as diffraction, lens imperfections, and aberrations mean that a 2D circular Gaussian is arguably a more realistic empirical model for defocus [18, 39, 46]:

G_r(x, y) = 1/(2πr²) · e^{−(x² + y²)/(2r²)}   (13)

⇐⇒   F[G_r](ω, ν) = e^{−(1/2)(ω² + ν²) r²} ,   (14)

where r is the standard deviation of the Gaussian. Note that this model is analytically simpler than the pillbox defocus function, because the Fourier transform of a Gaussian is simply an unnormalized Gaussian, which simplifies further mathematical manipulation. For the thin lens model, r can be related to the blur circle radius according to some empirically determined factor, r = kσ.
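For reference, the sketch below constructs discrete versions of these two kernels by sampling Eqs. (11) and (13) on a pixel grid and renormalizing (only an approximation for small radii, where the pillbox boundary is poorly resolved):

    import numpy as np

    def pillbox_psf(r, size):
        """Discrete pillbox P_r, Eq. (11): constant inside the blur circle of radius r."""
        c = size // 2
        y, x = np.mgrid[-c:size - c, -c:size - c]
        k = (x**2 + y**2 <= r**2).astype(float)
        return k / k.sum()

    def gaussian_psf(r, size):
        """Discrete isotropic Gaussian G_r, Eq. (13), with standard deviation r."""
        c = size // 2
        y, x = np.mgrid[-c:size - c, -c:size - c]
        k = np.exp(-(x**2 + y**2) / (2.0 * r**2))
        return k / k.sum()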

4.6 Empirical defocus models

Purely empirical measurements can also be used to recover the blur kernel, with no special assumptions about its form beyond linearity. In blind deconvolution methods (see Sec. 7.1), the blur kernel may also be estimated simultaneously with the geometry and radiance of the perfectly-focused scene.

One common method for calibrating the blur kernel in microscopy applications uses small fluorescent beads mounted on a fronto-planar surface [51], each projecting to approximately one pixel at the in-focus setting. Since the perfectly focused beads approximate the impulse function, the 2D blur kernel may be recovered directly from the blurred image observed at a given lens setting. By assuming rotational symmetry, the blur kernel may also be recovered empirically from the spread across sharp edges or other known patterns such as sine gratings.

Favaro and Soatto suggest an alternative method for recovering defocus calibration, using a more general type of fronto-planar calibration pattern placed at a discretized set of known depths [18]. For each depth, they propose using a rank-based approximation over a set of patches, to recover the corresponding linear operator that relates defocus between several lens settings, factoring out the variation in radiance over the sample patches.

5 Focus Measures

Even in the absence of an explicit model for defocus, it is still possible to formulate a focus measure with the ability to distinguish the lens setting at which a given point is optimally in-focus. Such a focus measure is the basis for both image-based auto-focusing [25] and a 3D reconstruction method known as depth-from-focus (Sec. 6). We start by analyzing the level of defocus for a known blur kernel, then discuss a variety of possible focus measures for the blind case.


5.1 Known blur kernel

For a known blur kernel, B(x, y), a widely used measure of defocus is the radius of gyration [8, 46],

σ = [ ∫∫ (x² + y²) B(x, y) dx dy ]^{1/2} ,   (15)

where B is assumed to be normalized with zero mean. For an isotropic Gaussian blur kernel, the radius of gyration is equivalent to the standard deviation. Moreover, as Buzzi and Guichard show, under the assumption that defocus corresponds to convolution, the radius of gyration is the only defocus measure that satisfies additivity and several other natural properties [8].
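A minimal numerical version of Eq. (15), assuming a discrete kernel stored as an array whose geometric center corresponds to zero displacement:

    import numpy as np

    def radius_of_gyration(B):
        """Eq. (15): square root of the second moment of a normalized blur kernel."""
        B = B / B.sum()                           # enforce normalization
        cy, cx = (np.array(B.shape) - 1) / 2.0    # assume the kernel is centered on its grid
        y, x = np.indices(B.shape)
        return float(np.sqrt(np.sum(((x - cx)**2 + (y - cy)**2) * B)))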

Buzzi and Guichard also remind us that central moments in the spatial domain are related to derivatives at the origin in the frequency domain [8]. This property allows us to reformulate our analytic measure of defocus, Eq. (15), in terms of the Laplacian at the DC component in the Fourier domain,

σ = ( −∇² F[B(x, y)] ) |_{(0,0)} .   (16)

This relationship leads to the counter-intuitive result that the defocus measure can be thought of exclusively as a low-frequency phenomenon, where perceived contrast reduction is due entirely to the falloff at the DC component. To support their argument, Buzzi and Guichard present the results of a small perceptual study, including images blurred with several artificially constructed defocus functions preserving high frequencies [8]. This defies the conventional view of defocus as a low-pass filter, where the level of defocus depends on the extent to which high frequencies are filtered out.

5.2 Blind focus measures

Even when the form of the blur kernel is completely unknown, it may still be possible to detect the lens setting which brings some portion of the scene into optimal focus. To this end, a variety of blind focus measures have been proposed, all of which essentially function as contrast detectors within a small spatial window in the image. The justification for this approach is that any amount of defocus will reduce contrast, regardless of the underlying scene radiance, because of the underlying integration over the aperture.

The image processing literature is a rich source of ideas for such contrast detectors. One approach is to apply a contrast-detecting filter, such as the gradient [25, 57] or the Laplacian [10, 25, 36], and to sum the magnitude of those filter responses over the window. Note that these kinds of contrast-detecting filters can be formulated in the Fourier domain as well, which connects more directly with our intuition that they measure the energy present at high frequencies [25, 63].

An alternative approach for contrast detection is to consider the pixel intensities in the patch as an unordered set, ignoring their spatial relationship.


Various focus measures along these lines include the entropy of the binned intensities [25], the variance of the intensities [21, 25], or simply the raw maximum pixel intensity.

Averaging the focus measure over a patch can cause mutual interference between multiple focus peaks representing real structure, so several proposed measures explicitly model focus as multi-modal [43, 61]. Xu, et al. assume a bimodal intensity distribution for the in-focus scene, and define a measure of defocus based on closeness to either of the extreme intensities in the 3D volume consisting of the image window over all focus settings [61]. Their bimodal model of intensity also has the advantage of mitigating bleeding artifacts across sharp intensity edges (see Sec. 6). Similarly, Schechner, et al. propose a voting scheme over the 3D volume, where each pixel votes individually for local maxima across focus settings, and then votes are aggregated over the window, weighted by the strength of the maxima [43].

Krotkov reviewed a number of classic focus measures, and concluded that the sum of squared gradients, or Tenengrad measure, has the best qualitative performance in terms of producing a sharp unimodal peak, with robustness to noise [25]. Subbarao and Tyan undertook a more detailed study of several focus measures, and proposed theoretical metrics for evaluating their noise sensitivity [49]. Based on both theory and experimental evaluation, they instead recommend the Laplacian measure over either the Tenengrad measure or spatial variance.
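As an illustration, minimal window-based versions of the two measures discussed above (sums of squared gradient and Laplacian responses), assuming SciPy's standard derivative filters:

    import numpy as np
    from scipy.ndimage import sobel, laplace

    def tenengrad(window):
        """Sum of squared gradient magnitudes over the window (Tenengrad-style)."""
        gx, gy = sobel(window, axis=1), sobel(window, axis=0)
        return float(np.sum(gx**2 + gy**2))

    def laplacian_energy(window):
        """Sum of squared Laplacian responses over the window."""
        return float(np.sum(laplace(window)**2))

    # either measure should decrease as the window becomes more defocused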

A new type of focus measure was recently proposed by Hasinoff and Kutulakos, defined for individual pixels. The confocal constancy error evaluates focus by considering cross-aperture intensity variation at a given focus setting, rather than measuring differences in contrast between pixels in the spatial neighborhood [21]. However, since low confocal constancy error does not by itself guarantee optimal focus, in the absence of nearby fine-scale texture this measure may be multi-modal or noisy.

6 Depth-from-Focus

Depth-from-focus (DFF) is a straightforward 3D reconstruction method, which exploits a blind focus measure (Sec. 5) to determine the lens setting at which some region of the scene is brought into optimal focus. This lens setting can then be related to depth, according to prior lens calibration (Sec. 3). DFF has the advantage of being simple to implement and not requiring an explicit calibrated model of defocus (Sec. 4).

DFF is most commonly realized by varying the focus setting and holding all other lens settings fixed, which may be thought of as scanning a test surface through the 3D scene volume and evaluating the degree of focus at different depths [25, 34, 57]. Alternative schemes involve moving the object relative to the camera [36].

One disadvantage of DFF is that the scene must remain stationary while a significant number of images are captured with different lens settings. For an online version of DFF, such as image-based auto-focusing, we would prefer to minimize the number of images required. As Krotkov points out, if the focus measure is unimodal and decreases monotonically from its peak, Fibonacci search is optimal for locating this peak [25].
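A sketch of such a search over a discrete set of focus settings appears below; focus_score is a hypothetical callable that captures (or looks up) the image at a given setting index and returns its focus measure, which is assumed to be unimodal:

    import functools

    def fibonacci_peak(focus_score, lo, hi):
        """Fibonacci search for the peak of a unimodal focus measure over the integer
        focus settings lo..hi, evaluating (i.e., imaging) as few settings as possible."""
        score = functools.lru_cache(maxsize=None)(focus_score)  # never re-measure a setting
        clamp = lambda i: min(i, hi)        # pad the range up to a Fibonacci length
        fib = [1, 1]
        while fib[-1] < hi - lo + 1:
            fib.append(fib[-1] + fib[-2])
        k, a = len(fib) - 1, lo             # invariant: the peak lies in [a, a + fib[k] - 1]
        while k > 2:
            x1, x2 = a + fib[k - 2] - 1, a + fib[k - 1] - 1
            if score(clamp(x1)) < score(clamp(x2)):
                a = x1 + 1                  # the peak must lie to the right of x1
            k -= 1                          # either way the bracket shrinks to length fib[k-1]
        return max(range(a, min(a + fib[k] - 1, hi) + 1), key=lambda i: score(clamp(i)))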

Instead of greedily optimizing the focus measure for each pixel independently, it is also possible to construct a prior favoring surface smoothness, and to solve a regularized version of DFF, e.g., using graph cuts [61].

6.1 Maximizing depth resolution

To maximize the depth resolution, DFF should use the largest aperture available, corresponding to the narrowest depth-of-field. Furthermore, DFF should use a relatively large number of lens settings (up to several dozen) to densely sample the range of depths covered by the workspace.

The most efficient sampling of depths for DFF is at intervals corresponding to the depth-of-field, as any denser sampling would mean that the highest frequencies may not be detectably influenced by defocus [42]. Note that the optics predict that depth precision will fall off quadratically with depth in the scene, according to the quadratic relationship between depth and depth-of-field [25].

Although the depth resolution of DFF is limited by both the number of images acquired and the depth-of-field, it is possible to recover depth at sub-interval resolution by interpolating the focus measure about the optimal lens setting, for example, by fitting a Gaussian to the peak [25, 57].
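One common realization is a three-point fit about the best discrete setting; the sketch below fits a parabola to the log of the focus measure at the peak and its two neighbors (equivalent to fitting a Gaussian, assuming strictly positive scores and evenly spaced settings):

    import numpy as np

    def refine_focus_peak(settings, scores):
        """Refine the best discrete focus setting to sub-interval resolution by fitting
        a Gaussian (a parabola in log space) to the peak and its two neighbors."""
        i = int(np.argmax(scores))
        if i == 0 or i == len(scores) - 1:
            return settings[i]                    # no neighbors to interpolate with
        y0, y1, y2 = np.log(scores[i - 1]), np.log(scores[i]), np.log(scores[i + 1])
        offset = 0.5 * (y0 - y2) / (y0 - 2.0 * y1 + y2)   # vertex of the fitted parabola
        return settings[i] + offset * (settings[i + 1] - settings[i])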

6.2 Analysis

For DFF to identify an optimal focus peak, there must be enough radiometric variation within the window considered by the focus measure. While an obvious failure case for DFF is an untextured surface, a linear intensity gradient is a failure case for DFF as well, since any symmetric defocus function integrated about a point on the gradient will produce the same intensity [14, 50]. In other words, unlike stereo, the mean of the window must also vary with depth. Indeed, theory predicts that for DFF to be discriminative, the in-focus radiance must have non-zero second-order spatial gradients [14].

Because nearly all blind focus measures (Sec. 5.2) are based on spatial image windows, DFF inherits the problems of assuming a windowed, locally fronto-parallel model of the scene, including the lack of occlusion modeling (Sec. 4.3). A notable exception is DFF based on confocal constancy, which operates at the single-pixel level [21].

Another related problem with DFF is that defocused features may bleed in from outside the window, contaminating the focus measure and biasing the reconstruction to a false peak [34, 46, 61]. This problem may be avoided by considering only image windows that are at least as large as the largest blur kernel observed over the workspace, but this can severely limit the effective resolution when large blurs are present. Alternatively, Nair and Stewart suggest restricting the DFF computation to a sparse set of pixels corresponding to sufficiently isolated edges [34]. Modeling the intensity distribution as bimodal may also mitigate this problem when the contaminating features are strong [61].

7 Depth-from-Defocus

Depth-from-defocus (DFD) is a 3D reconstruction method based on fitting a model of defocus to images acquired at different lens settings. The level of defocus recovered for a given pixel can then be related to depth, according to the defocus model (Sec. 4), the lens model (Sec. 3), and the particular lens settings used. In general, DFD requires far fewer image measurements than DFF, since two images are sufficient for many DFD methods. Given a strong enough scene model, 3D reconstruction may even be possible from a single image (Sec. 7.3), however most DFD methods benefit from more data.

Note that depth recovery using DFD is potentially ambiguous, since for a particular pixel there may be two points, one on either side of the in-focus 3D surface, that give rise to the observed level of defocus, σ. For the thin lens model, this effect is represented in Eq. (5). In practice the ambiguity may be resolved by combining results from more than two images [48], or by requiring, for example, that the camera is focused on the nearest scene point in one condition [38].

Because nearly all DFD methods are based on linear models of defocus (Sec. 4.1), recovering the defocus function can be viewed as a form of inverse filtering, or deconvolution. In particular, DFD is equivalent to recovering B_{σ(x,y)} from Eq. (8), or in the simplified case, B_σ from Eq. (10). Note that the blur kernel is parameterized by a single variable, σ, so this deconvolution is in effect constrained to the family of blur kernels defined by this parameter.

DFD methods can be broken into several broad categories. The most straightforward approach is to tackle the deconvolution problem directly, seeking the scene radiance and defocus parameters best reproducing two or more input images acquired at different camera settings. Alternatively, we can factor out the radiance of the underlying scene, and estimate the relative defocus between the input images instead. Relative defocus may also be related to depth, according to the lens calibration. Finally, if our prior knowledge of the scene is strong enough, we can directly evaluate different defocus hypotheses using as little as a single image.

7.1 Direct deconvolution

The most direct approach to DFD is to formulate an optimization that seeks the scene radiance and defocus parameters best reproducing the input images acquired at different lens settings. Note that this optimization is commonly regularized with additional smoothness terms, to address the ill-posedness of deconvolution, to reduce noise, and to enforce prior knowledge of scene smoothness, e.g., [17]. Although this approach represents a form of deconvolution, it can best be thought of as working forwards from various hypotheses about the underlying scene, to maximize consistency with the input images.


Since a global optimization of direct deconvolution is intractable for images of reasonable size, practical deconvolution methods resort to various iterative refinement techniques, such as gradient descent flow [23], EM-like alternating minimization [14, 16], or simulated annealing [41]. These iterative methods have the disadvantage of being sensitive to the initial estimate, and may become trapped in a local extremum.

Layered decomposition A common approach for direct deconvolution is to discretize the scene into a fixed number of fronto-parallel depth layers, typically one per input image [4, 26, 28, 32]. This formulation reduces the deconvolution problem to the distribution of scene radiance over the layered volume, where the input images can be reproduced by a linear combination of the layers defocused in a known way. In particular, provided that the input images correspond to an even sampling of the focus setting, the finite-aperture imaging model may be expressed more succinctly as a 3D convolution between the layered scene volume and the 3D point-spread function [31, 51].

For an opaque scene model, a given pixel must be assigned to a single depth layer, which casts depth recovery as a combinatorial assignment problem (see the MRF-based models below). More commonly, however, the imaging model is taken to be additive and semi-transparent, so that scene radiance may be distributed volumetrically over the different layers [4, 26, 28, 51]. McGuire, et al. suggest a hybrid formulation where radiance is assigned to both of two fixed layers, but with an alpha value for the front-most layer represented explicitly as well [32].

To implement this deconvolution, radiance is distributed among the depth layers in an iterative way, where the contribution of every 3D voxel is updated given the discrepancy between the input images and the images synthesized from the current estimate, according to the known defocus model [4, 26, 28, 31, 32, 51].

This layered scene model is featured in the application of deconvolution microscopy [31, 51], which involves deconvolving a set of microscopy images corresponding to a dense sampling of focus settings, similar to the input for depth-from-focus (Sec. 6). Since many microscopy applications involve semi-transparent biological specimens, assuming an additive imaging model is well justified.

Information divergence Note that all iterative deconvolution methods involve updating the estimated radiance and shape of the scene, based on the discrepancy between the input images and the synthetically defocused estimate. Favaro, et al. have argued that the discrepancy measure used should be the information divergence,

Φ(A‖B) = Σ [ A log(A/B) − A + B ] ,   (17)

because it is the only such measure consistent with certain axioms, given positivity constraints on scene radiance and the blur kernel [14, 16, 23]. The information divergence can be viewed as a generalization of the Kullback-Leibler (KL) divergence.

This new discrepancy measure has been used in the context of alternating minimization for surface and radiance [14, 16], as well as minimization by PDE gradient descent flow, using level set methods [23].
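As a concrete form of Eq. (17), a minimal sketch comparing an observed image against a synthetically defocused estimate, both assumed to be arrays of positive values:

    import numpy as np

    def information_divergence(A, B, eps=1e-12):
        """Eq. (17): information divergence between two positive-valued arrays,
        summed over all pixels; it is zero iff the arrays are identical."""
        A = np.maximum(np.asarray(A, dtype=float), eps)
        B = np.maximum(np.asarray(B, dtype=float), eps)
        return float(np.sum(A * np.log(A / B) - A + B))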

Defocus as diffusion Favaro, et al. approach direct deconvolution by posing defocus in terms of a partial differential equation (PDE) for a diffusion process [15]. Their strategy is to run the PDE on the more focused of the two images, until it becomes identical to the other image. This is a form of “simulation-based” inference, where the time variable is related to the amount of relative blur. For isotropic diffusion, their formulation is equivalent to the simple isotropic heat equation, whereas for shift-variant diffusion, the anisotropy of the diffusion tensor characterizes the local variance of defocus. Note that to avoid backwards diffusions, the image is partitioned into two regions so that diffusion always runs from more to less focused regions.

MRF-based models Another framework for performing direct deconvolution is the Markov random field (MRF), which describes data costs for assigning discrete defocus labels to each pixel, as well as smoothness costs favoring adjacent pixels with similar defocus labels [41]. Rajagopalan and Chaudhuri formulate a spatially-variant model of defocus (Sec. 4.2) in terms of an MRF, and suggest optimizing the MRF using a simulated annealing procedure, initialized using classic window-based DFD methods (Sec. 7.2.1).

Deconvolution with occlusion Recently, Favaro and Soatto proposed a DFD method that explicitly models occlusion effects, following the reversed-projection blurring model [6], which describes a richer model of defocus than simple linear filtering [17]. To make the reconstruction tractable, they assume a scene model consisting of two smooth, non-intersecting surfaces, where the front-most surface is opaque and has an associated binary alpha mask defining “holes” with arbitrary topology.

Even though defocus is no longer a linear operator under the more powerful occlusion model, the simultaneous reconstruction of scene radiance, depth, and alpha mask is still amenable to direct deconvolution techniques, using regularized gradient-based optimization [17].

7.2 Depth from relative defocus

While direct deconvolution methods rely on simultaneously reconstructing the underlying scene radiance and depth, it is also possible to factor out the scene radiance, by considering the relative amount of defocus over a particular image window, between two different lens settings.

By itself, relative defocus is not enough to determine the depth of the scene, however the lens calibration may be used to disentangle relative focus into absolute blur parameters, which can then be related to depth as before. In general, this approach can be thought of as working backwards from the defocused images to recover the underlying scene.

If one of the blur parameters is known in advance, the other blur parameter can be resolved by simple equation fitting. As Pentland describes, when one image is acquired with a pinhole aperture, i.e., σ1 = 0, the relative blur directly determines the other blur parameter, e.g., according to Eq. (19) [38, 39].

In fact, the restriction that one of the images is a pinhole image can easily be relaxed, as the lens calibration provides an additional constraint between absolute blur parameters. For the thin lens model, Eq. (5) may be used to derive a linear constraint in the underlying blur parameters, i.e., σ1 = Aσ2 + B, between any two lens settings [46]. Technically speaking, for this linear constraint to be unique, the sign ambiguity in Eq. (5) must be resolved as described earlier in Sec. 7.

7.2.1 Frequency-domain analysis

The most common method of recovering relative defocus is by analyzing relative frequency response in corresponding image windows. As we show below, the relative frequency response is invariant to the underlying scene radiance, and provides evidence about the level of relative defocus.

Convolution ratio In the Fourier domain, the simple convolutional model of defocus given by Eq. (7) reduces elegantly. Using the fact that convolution in the spatial domain corresponds to multiplication in the Fourier domain, we have

F[Î1] / F[Î2] = ( F[B_{σ1}] · F[I] ) / ( F[B_{σ2}] · F[I] ) = F[B_{σ1}] / F[B_{σ2}] .   (18)

This formula, also known as the convolution ratio, has the important feature of canceling all frequencies F[I] due to the underlying scene radiance [39, 60].

Thought of another way, the convolution ratio provides us with a relationship between the unknown defocus parameters, σ1 and σ2, that is invariant to the underlying scene. For example, by assuming a Gaussian defocus function as in Eqs. (13)–(14), the convolution ratio in Eq. (18) reduces to:

σ2² − σ1² = 2/(ω² + ν²) · ln( F[Î1](ω, ν) / F[Î2](ω, ν) ) .   (19)

While other defocus functions may not admit such a simple closed-form solution, the convolution ratio will nevertheless describe a relationship between the blur parameters σ1 and σ2 that can be expressed through a lookup table.

In theory, we can fully define the relative defocus, as in the prototypical Eq. (19), simply by considering the response of the defocused images at a single 2D frequency, (ω, ν). However, using a fixed particular frequency can cause arbitrarily large errors and instability if the images do not contain sufficient energy at that frequency. Provided the defocus function is symmetric, for additional robustness we can integrate over radial frequency, λ = √(ω² + ν²), without affecting the relationship between relative blur and the convolution ratio [38, 39].

Note that an alternate version of the convolution ratio can be formulated using the total Fourier power, P(ω, ν) = |F(ω, ν)|², instead. This gives analogous equations for relative defocus, but has the advantage that Parseval's theorem, ∫∫ |I(x, y)|² dx dy = (1/4π²) ∫∫ |F[I](ω, ν)|² dω dν, may be used to compute relative defocus more efficiently in the spatial domain [38, 46].
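The sketch below illustrates this frequency-domain estimate for the Gaussian defocus model: it evaluates Eq. (19) on the discrete Fourier grid and averages the per-frequency estimates, weighted by spectral energy, to avoid unstable frequencies (a simplification that ignores windowing and noise handling; with frequencies in radians per pixel, the result is in pixel² units):

    import numpy as np

    def relative_blur_gaussian(I1, I2, min_power=1e-6):
        """Estimate sigma2^2 - sigma1^2 from two registered windows of the same scene
        using the Gaussian convolution ratio, Eq. (19), averaged over frequencies
        that carry sufficient energy. Assumes I2 is the more defocused image."""
        F1, F2 = np.fft.fft2(I1), np.fft.fft2(I2)
        wy = 2.0 * np.pi * np.fft.fftfreq(I1.shape[0])
        wx = 2.0 * np.pi * np.fft.fftfreq(I1.shape[1])
        w2 = wx[None, :]**2 + wy[:, None]**2              # omega^2 + nu^2 on the FFT grid
        P1, P2 = np.abs(F1)**2, np.abs(F2)**2
        valid = (w2 > 0) & (P1 > min_power) & (P2 > min_power)
        per_freq = 2.0 / w2[valid] * np.log(np.sqrt(P1[valid] / P2[valid]))
        weights = P1[valid]
        return float(np.sum(weights * per_freq) / np.sum(weights))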

As a recent application of the convolution ratio, Chaudhuri suggested a method for interpolating defocus between two images taken using a thin lens camera with different settings [9]. Unfortunately, this method does not address the inherent ambiguity in Eq. (5), and fails to admit the case where the optimal in-focus setting lies between the two focus settings used to generate the input images, potentially leading to inconsistent results.

Windowing effects By generalizing the convolution ratio to the windowed, locally fronto-parallel scene model described by Eq. (10), we obtain the more complicated formula,

( F[Î1] ⊗ F[W1] ) / ( F[Î2] ⊗ F[W2] ) = ( (F[B_{σ1}] · F[I]) ⊗ F[W1] ) / ( (F[B_{σ2}] · F[I]) ⊗ F[W2] ) ≈ F[B_{σ1}] / F[B_{σ2}] .   (20)

For the underlying scene radiance to cancel in this case, the windowing functions must be compact in the Fourier domain, i.e., F[W1] = F[W2] ≈ δ, which corresponds to large spatial windows, over which the scene must be fronto-parallel.

In addition to the problems with windowing previously described (Sec. 4.3), inverse filtering in the Fourier domain presents additional difficulties due to finite-window effects [12]. Firstly, accurate spectral analysis requires large windows, which corresponds to low depth resolution or very smoothly varying scenes. Secondly, since windowing can be thought of as an additional convolution in the Fourier domain, it may cause zero-crossings in the Fourier domain to shift slightly, causing potentially large variations in the convolution ratio. Finally, using the same size windows in both images can lead to border artifacts, as the different levels of defocus imply that the actual source areas in I are different.

Tuned inverse filtering To mitigate the errors and instability caused by finite-width filters, one approach is to use filters specially tuned to the dominant frequencies in the image. Around the dominant frequencies, finite-width effects are negligible [60], however a priori we do not know which frequencies over an image window are dominant, or even if any exist.

A straightforward approach for identifying the dominant frequencies, which provides an appealing formal invariance to surface radiance [18], is to use a large bank of tuned narrow-band filters, densely sampling the frequency domain [38, 59, 60]. Dominant frequencies can then be identified as filter responses of significant magnitude, and satisfying a stability criterion that detects contamination due to finite-width windowing artifacts [59, 60]. By assigning higher weights to the dominant frequencies, the depth estimates over all narrow-band filters may be aggregated, e.g., using weighted least-squares regression.

Note that the uncertainty relation means that highly tuned filters with narrow response in the frequency domain require large kernel support in the spatial domain. For a fixed window size in the spatial domain, Xiong and Shafer improve the resolution in the frequency domain by using additional moment-based filters, up to five times as many, to better model the spectrum within each narrow-band filter [60].

As Nayar, et al. suggest, another way to constrain the dominant frequencies in the scene is to use active illumination to project structured patterns onto the scene [35]. They propose projecting a checkerboard pattern, paired with a Laplacian-like focus operator that is optimally tuned to the dominant frequency produced by the checkerboard over the workspace. Although this approach works well even for textureless surfaces, it assumes that the light source is much stronger than the surface radiance, and that the scene is not overly specular or geometrically complex.

Broadband inverse filtering A contrasting approach to inverse filtering involves using broadband filters that integrate over many different frequencies [46, 54]. Broadband filters lead to smaller windows in the spatial domain, and therefore to higher resolution; they are more computationally efficient, since fewer of them are required to estimate defocus; and they are more stable in the presence of low-magnitude frequency responses. However, because defocus is not uniform over frequency (see Eq. (14), for example), the relative defocus estimated by integrating over a broad range of frequencies is potentially less accurate.

Watanabe and Nayar designed a set of three broadband filters for DFD, motivated as a higher-order expansion of the convolution ratio, Eq. (18), for a relatively small 7 × 7 spatial kernel [54]. In fact, they consider a normalized version of the convolution ratio instead,

( F[Î1] − F[Î2] ) / ( F[Î1] + F[Î2] ) ≈ ( F[B_{σ1}] − F[B_{σ2}] ) / ( F[B_{σ1}] + F[B_{σ2}] ) ,   (21)

which is constrained to [−1, 1] for positive frequencies, and shares the property that frequencies due to the underlying surface radiance cancel out [35, 54].

Watanabe and Nayar also suggest that it is important to pre-filter the image before inverse filtering, to remove any bias caused by the DC component, and to remove higher frequencies that violate the assumed monotonicity of the blur kernel [54]. Furthermore, to address the instability in low-texture regions, they define a confidence measure, derived using a perturbation analysis, and adaptively smooth the results until confidence everywhere is below some acceptable threshold [54].


7.2.2 Modeling relative defocus

A different method for evaluating the relative defocus between two image windows is to explicitly model the operator representing relative defocus. Given such a model for relative defocus, we can potentially build a set of detectors corresponding to different levels of relative defocus, producing output somewhat analogous to a focus measure (Sec. 5).

Cubic polynomial patches Subbarao and Surya suggest modeling the ideal pinhole image as a cubic bivariate polynomial, that is, I = Σ k_{m,n} x^m y^n, with m + n ≤ 3 [48]. Under this simple scene model, the convolution of Eq. (7) can be expressed in terms of low-order moments of the defocus function. Then by assuming the defocus function is radially symmetric, the deconvolution reduces to the analytic form, I = Î − (σ²/4) · ∇²Î, which is equivalent to a well-known sharpening filter. Therefore the relative blur can be expressed analytically as

Î2 − Î1 = (1/4)(σ2² − σ1²) · ∇²( (Î1 + Î2)/2 ) ,   (22)

for an arbitrary radially symmetric defocus function. Note that this expression contains no terms related to I, and so also provides invariance to scene radiance.
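In the spatial domain this yields a particularly simple estimator; the sketch below recovers σ2² − σ1² for a pair of registered windows by a least-squares fit of Eq. (22), using a discrete Laplacian (a rough illustration only, omitting the polynomial fitting and windowing details of the original method):

    import numpy as np
    from scipy.ndimage import laplace

    def relative_blur_spatial(I1, I2):
        """Estimate sigma2^2 - sigma1^2 via Eq. (22): I2 - I1 is proportional to
        one quarter of the Laplacian of the mean image, with the relative blur as slope."""
        lhs = (I2 - I1).ravel()
        rhs = 0.25 * laplace(0.5 * (I1 + I2)).ravel()
        return float(np.dot(rhs, lhs) / np.dot(rhs, rhs))   # least-squares slope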

Matrix-based deconvolution For the simple convolutional model of defocus described by Eq. (7), the convolution ratio (Sec. 7.2.1) can easily be reformulated in the spatial domain as

Î2 = B_∆ ⊗ Î1 ,   (23)

where B_∆ is a kernel representing the relative defocus.

Because convolution is a linear operator, Eq. (23) can be expressed as matrix multiplication, where the matrix representing convolution is sparse, and has banded Toeplitz block structure. This suggests that B_∆ may be recovered naïvely by matrix inversion, however in the presence of noise, the problem is ill-posed and unstable. Ens and Lawrence propose performing this matrix inversion, but regularizing the solution by including a term that measures the fit of B_∆ to a low-order polynomial [12].

Note that by manipulating the spatial domain convolution ratio, Eq. (23), we can obtain another expression for relative defocus, in terms of the underlying blur kernels,

B_{σ2} = B_∆ ⊗ B_{σ1} .   (24)

Therefore, given a pair of known blur kernels (Sec. 4), we may fully recover a model of the relative defocus, B_∆, by deconvolving Eq. (24). Ens and Lawrence suggest applying this method to recover B_∆ over a range of different blur kernel pairs, corresponding to different depths [12]. Then each relative defocus hypothesis, B_∆, can be evaluated by applying Eq. (23) and measuring the least-squares error.
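The sketch below illustrates this hypothesize-and-test scheme. For simplicity it recovers each relative defocus kernel by regularized division in the Fourier domain rather than by the polynomial-regularized matrix inversion described above, and it assumes the calibrated kernel pairs are same-size, centered arrays:

    import numpy as np

    def hypothesis_error(I1, I2, B1, B2, eps=1e-3):
        """Recover the relative defocus kernel of Eq. (24) by regularized Fourier
        division, then measure how well Eq. (23), I2 = B_delta (x) I1, is satisfied."""
        shape = I1.shape
        F1, F2 = np.fft.rfft2(B1, shape), np.fft.rfft2(B2, shape)
        Fd = F2 * np.conj(F1) / (np.abs(F1)**2 + eps)       # F[B_delta], from Eq. (24)
        pred = np.fft.irfft2(Fd * np.fft.rfft2(I1), shape)  # circular version of Eq. (23)
        return float(np.sum((pred - I2)**2))

    def best_depth(I1, I2, kernel_pairs):
        """Pick the calibrated depth whose kernel pair (B1, B2) best explains the images."""
        errors = [hypothesis_error(I1, I2, B1, B2) for B1, B2 in kernel_pairs]
        return int(np.argmin(errors))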


Rank-based methods As discussed in Sec. 4.6, in one recent approach, the blur kernel may be recovered empirically as a set of rank-based approximations characterizing the relative defocus at particular calibrated depths [18]. This representation leads to a simple data-driven focus measure. In particular, the null space of a given rank-based approximation can be used to form an orthogonal projection operator, measuring the distance to the subspace that represents the defocused images for scenes of all possible radiances at the corresponding depth.
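A minimal sketch of such a subspace-based measure (illustrating only the projection step, not the calibration procedure of [18]): example defocused observations collected at one calibrated depth are stacked as rows, a rank-r basis is extracted by SVD, and a new observation is scored by the energy left outside that subspace:

    import numpy as np

    def subspace_basis(examples, rank):
        """Orthonormal rank-r basis (via SVD) for flattened defocused observations,
        one observation per row, all collected at the same calibrated depth."""
        _, _, Vt = np.linalg.svd(np.asarray(examples, dtype=float), full_matrices=False)
        return Vt[:rank].T                       # columns span the rank-r subspace

    def subspace_distance(observation, basis):
        """Residual energy of an observation outside the subspace; a small value means
        the observation is consistent with the corresponding calibrated depth."""
        v = observation.ravel().astype(float)
        proj = basis @ (basis.T @ v)
        return float(np.sum((v - proj)**2))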

Differential defocus Another way to model relative defocus is according to differential changes to the lens settings. As Subbarao proposed in theory, the relative defocus can be fully specified by a differential analysis of the lens model [46].

Farid and Simoncelli realized an interesting version of differential DFD, by using specially designed pairs of optical filters that “directly” measure derivatives with respect to aperture size or viewpoint [13]. By comparing the image produced with one filter, and the spatial derivative of the image produced with another filter, they obtain a scale factor for every point, which can then be related to depth. This method relies on the presence of defocus, otherwise the scale factor will be undefined, and moreover it implicitly assumes a locally fronto-parallel scene over the extent of defocus [13].

7.3 Strong scene models

Given strong enough prior knowledge about the scene, the scene may be reconstructed from as little as a single defocused image. A strong model of the scene may also be used to simplify the depth estimation problem or to increase robustness.

Sharp reflectance edges Various reconstruction methods from focus are based on the assumption that the in-focus scene contains perfectly sharp reflectance edges, i.e., step discontinuities in surface albedo. Given this strong model of the scene, depth may be recovered by analyzing 1D intensity profiles across the blurred edges.

This approach was first proposed by Pentland for a single defocused image, as a qualitative measure over a sparse set of pixels containing detected edges [39]. Pentland performed linear regression over the pixels around a given edge to obtain the standard deviation of a Gaussian blur kernel, σ, which he then related to depth. The analysis was subsequently generalized for any rotationally symmetric blur kernel [47]. In particular, the radius of gyration of the blur kernel (Sec. 5.1) is simply √2 times the second moment of the line spread function for a sharp edge.

Asada, et al. proposed a more robust method for detecting sharp edges and their depths, using a dense set of focus settings [5]. They implement edge detection using a voting scheme that accumulates differences between the geometrically aligned images, where the differing defocus across input images effectively provides multi-scale edge detection. To analyze a particular edge, they form a 2D spatial-focus image, collecting the intensities from pixels across the edge and over the focus settings. Under the assumption of a rotationally symmetric blur model, a simple model may be fit to the spatial-focus image, where the extremal constant-intensity lines correspond to the unblurred colors, and the intersection of these lines determines the depth of the edge.

Confocal lighting As already noted, active illumination can be used to constrain the frequency characteristics of the scene, and increase the robustness of estimating the relative defocus (Sec. 7.2.1). However, an even stronger use of active illumination is confocal lighting, which involves selectively illuminating the equifocal surface with a light source sharing the same optical path as the lens [55].

By using a large aperture together with confocal lighting, parts of the scene away from the equifocal surface are both blurred and dark, which greatly enhances contrast. In theory, we can directly obtain cross-sections of the scene from such images at different focus settings, and assemble them into a 3D volumetric model.

In its most basic design, a confocal imaging system can be built by scanning a pinhole mask across the sensor plane, with a controllable point light source that focuses to the same 3D point in the scene [55]. More practical designs, however, involve using coded masks, with multiple pinholes for both the detector and light source [29, 55]. Confocal lighting is particularly well-suited to semi-transparent scenes, such as the biological specimens found in microscopy, because opaque occluding objects may prevent all of the confocal lighting from reaching the equifocal surface.

Levoy, et al. suggest a version of confocal imaging implemented at the macro scale with opaque objects [29]. In their design, a single camera and projector share the same optical path using a beam splitter, and they use an array of mirrors to create 16 virtual viewpoints. Then, multi-pixel tiles at a given focal depth are illuminated according to coded masks, by focusing the projector from the virtual viewpoints. Although they obtain good results with their system for matting, the depth-of-field is currently too large to provide adequate resolution for 3D reconstruction.

7.4 Analysis

Feasibility of DFD Favaro, et al. provide a basic result that if the radiance of the scene can be arbitrarily controlled, for example, by active illumination, then any piecewise-smooth surface can be distinguished from all others, i.e., fully reconstructed, from a set of defocused images [14].

They also formalize the “complexity” of scene radiance in terms of the degree of a 2D linear basis, and show that two piecewise-smooth surfaces can in theory be distinguished up to a resolution that depends on this complexity, further limited by the optics [14].


Optimal interval for DFD Using a perturbation analysis of the thin lens model, assuming a pillbox defocus function, Schechner and Kiryati showed that the optimal interval between the two focus settings, with respect to perturbations at the Nyquist frequency, corresponds to the depth of field [42]. For smaller intervals no frequency will satisfy optimality, whereas for larger intervals the Nyquist frequency will be suboptimal, but some lower frequency will be optimal with respect to perturbations.

Shih, et al. give an alternate analysis of optimality to perturbations in information-theoretic terms, assuming a symmetric Gaussian defocus function [44]. According to their analysis, the lowest-variance unbiased estimator from a pair of defocused images is attained when the corresponding levels of blur are related by σ1 = (3/2) σ2. However, this result is difficult to apply in practice, since knowledge of these blur parameters implies that depth has already been recovered.

8 Future research directions

In the last twenty years, much progress has been made in 3D reconstruction from finite-aperture lenses. Our theoretical understanding of finite-aperture imaging has greatly improved [6, 37, 42], and various practical working systems have also been proposed [13, 18, 21, 25, 39, 54, 60]. However, many of the ideas in this field are fairly mature. For example, fundamental progress is unlikely to come from a new inverse filtering method based on local windowing, or from a new iterative optimization method applied to the standard formulation of deconvolution.

In our recent paper [21], we describe some new ideas, the details of which will be discussed more fully in our upcoming research proposal. Here we outline current limitations of 3D reconstruction from finite-aperture lenses and suggest possible directions of future research.

New visual effects By capturing a scene over a variety of lens settings, we obtain a multi-dimensional dataset that is much richer than a standard image, where every pixel measurement represents the integration of a different cone of rays. Of course, for this dataset to be useful, accurate geometric and radiometric image alignment is a prerequisite.

It would be interesting to use this richer representation to develop visualizations of the scene, or to develop tools that a photographer could use to produce new artistic effects. For example, a photographer may wish to specify a non-physical depth-of-field that lies along an arbitrarily curved surface, but where the blur elsewhere in the scene is consistent with the new distorted optics. So far it remains to demonstrate an interesting effect of this kind that could not otherwise be implemented using simple 2D image processing.

High resolution lens calibration As high resolution 12 Mpixel+ SLR cameras become more widely used, the absolute accuracy demanded of finite-aperture lens calibration will increase accordingly. For very high resolution sensors, previously negligible effects, including subtle geometric lens distortions and non-deterministic internal vibrations, can lead to alignment errors of several pixels.

For this reason, we have developed new empirical calibration methods, specifically designed to achieve geometric and radiometric image alignment that closely matches the capabilities of existing high-end sensors [21]. In this direction, we are especially interested in exploiting the hitherto unnoticed, non-deterministic sub-pixel distortions in SLR cameras in order to achieve super-resolution, as well as for other applications.

Simulating the full imaging model Ideally, we would like to be able to reconstruct general 3D scenes using the “full” model of defocus that properly accounts for occlusion and global scene geometry [6]. Although preliminary steps have been made in this direction [6, 17], the methods proposed do not seem to generalize easily.

While for certain pathological scenes, e.g., a dense random “asteroid field” of voxels, 3D reconstruction from defocus may be infeasible, many other structured but geometrically complex scenes could benefit from an improved ability to model occlusions. Our initial analysis suggests that voxel-based carving methods may not be feasible in general, however additional assumptions may allow restricted types of scenes to be reconstructed.

Low texture To handle low surface texture, we are exploring the possibility of analyzing image measurements at multiple levels of detail and for multiple pixels simultaneously. We are also investigating methods for evaluating the confidence of depth estimation, and perhaps adaptively increasing the region of support in the scene until sufficient radiance variation is found.

Light field analysis Our initial analysis suggests that the 2D set of finite-aperture images acquired by varying focus setting and aperture setting can span the full 4D light field [37]. For slit apertures that scan across the lens, this is definitely the case, however our analysis for standard nested circular apertures is not yet definitive. But in theory, the corresponding perfectly focused light field could be recovered from a set of finite-aperture images.

It may also be possible to extend the frequency-based sampling analysis of Ng [37] to show a deeper relationship between surface albedo, geometric structure, defocus, and resolution.

Dynamic scenes One of the major drawbacks of 3D reconstruction from finite-aperture images is that the scene must remain static while multiple images are taken with different lens settings. Although a beam splitter may allow 3–4 synchronized cameras to share the same optical path, if too many synchronized images are required then this approach becomes infeasible. Another possibility is multiplexing over pixels in the sensor [37], or else multiplexing over time using a high-speed camera.


While none of these solutions is completely satisfactory, addressing dynamic scenes remains an important concern. We have observed that even for supposedly static scenes, slight movement due to air currents and external vibrations, over a fifteen-minute image acquisition session, may interfere with pixel-level 3D reconstruction.

References

[1] MIT, Product Engineering Processes, http://web.mit.edu/2.009/www/.

[2] M. Aggarwal and N. Ahuja. A pupil-centric model of image formation. International Journal of Computer Vision, 48(3):195–214, 2002.

[3] M. Aggarwal, H. Hua, and N. Ahuja. On cosine-fourth and vignetting effects in real lenses. In Proc. International Conference on Computer Vision, volume 1, pages 472–479, 2001.

[4] K. Aizawa, K. Kodama, and A. Kubota. Producing object-based special effects by fusing multiple differently focused images. IEEE Trans. on Circuits and Systems for Video Technology, 10(2), Mar. 2000.

[5] N. Asada, H. Fujiwara, and T. Matsuyama. Edge and depth from focus. International Journal of Computer Vision, 26(2):153–163, 1998.

[6] N. Asada, H. Fujiwara, and T. Matsuyama. Seeing behind the scene: Analysis of photometric properties of occluding edges by the reversed projection blurring model. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(2):155–167, 1998.

[7] S. S. Bhasin and S. Chaudhuri. Depth from defocus in presence of partial self occlusion. In Proc. International Conference on Computer Vision, volume 2, pages 488–493, 2001.

[8] J. Buzzi and F. Guichard. Uniqueness of blur measure. In Proc. International Conference on Image Processing, volume 5, pages 2985–2988, 2004.

[9] S. Chaudhuri. Defocus morphing in real aperture images. J. Optical Society of America A, 22(11):2357–2365, Nov. 2005.

[10] T. Darrell and K. Wohn. Pyramid based depth from focus. In Proc. Computer Vision and Pattern Recognition, pages 504–509, 1988.

[11] P. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. In Proc. ACM SIGGRAPH, pages 369–378, 1997.

[12] J. Ens and P. Lawrence. An investigation of methods for determining depth from focus. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(2):97–108, 1993.


[13] H. Farid and E. P. Simoncelli. Range estimation by optical differentiation. J. Optical Society of America A, 15(7):1777–1786, 1998.

[14] P. Favaro, A. Mennucci, and S. Soatto. Observing shape from defocused images. International Journal of Computer Vision, 52(1):25–43, 2003.

[15] P. Favaro, S. Osher, S. Soatto, and L. A. Vese. 3D shape from anisotropic diffusion. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 179–186, 2003.

[16] P. Favaro and S. Soatto. Shape and radiance estimation from the information divergence of blurred images. In Proc. European Conference on Computer Vision, volume 1, pages 755–768, 2000.

[17] P. Favaro and S. Soatto. Seeing beyond occlusions (and other marvels of a finite lens aperture). In Proc. Computer Vision and Pattern Recognition, volume 2, pages 579–586, 2003.

[18] P. Favaro and S. Soatto. A geometric approach to shape from defocus. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3), Mar. 2005.

[19] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Proc. ACM SIGGRAPH, pages 43–54, 1996.

[20] M. D. Grossberg and S. K. Nayar. Modeling the space of camera response functions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(10):1272–1282, 2004.

[21] S. W. Hasinoff and K. N. Kutulakos. Confocal stereo. In Proc. European Conference on Computer Vision, 2006. To appear.

[22] A. Hertzmann and S. M. Seitz. Example-based photometric stereo: Shape reconstruction with general, varying BRDFs. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(8):1254–1264, 2005.

[23] H. Jin and P. Favaro. A variational approach to shape from defocus. In Proc. European Conference on Computer Vision, volume 2, pages 18–30, 2002.

[24] S. B. Kang and R. S. Weiss. Can we calibrate a camera using an image of a flat, textureless Lambertian surface? In Proc. European Conference on Computer Vision, volume 2, pages 640–653, 2000.

[25] E. Krotkov. Focusing. International Journal of Computer Vision, 1(3):223–237, 1987.

[26] A. Kubota and K. Aizawa. Inverse filters for reconstruction of arbitrarily focused images from two differently focused images. In Proc. International Conference on Image Processing, volume 1, pages 101–104, 2000.


[27] A. Kubota, K. Kodama, and K. Aizawa. Registration and blur estimationmethods for multiple differently focused images. In Proc. International

Conference on Image Processing, volume 2, pages 447–451, 1999.

[28] A. Kubota, K. Takahashi, K. Aizawa, and T. Chen. All-focused light fieldrendering. In Proc. Eurographics Symposium on Rendering, 2004.

[29] M. Levoy, B. Chen, V. Vaish, M. Horowitz, I. McDowall, and M. T. Bolas.Synthetic aperture confocal imaging. In Proc. ACM SIGGRAPH, pages825–834, 2004.

[30] M. Levoy and P. Hanrahan. Light field rendering. In Proc. ACM SIG-

GRAPH, pages 31–42, 1996.

[31] M. Levoy, R. Ng, A. Adams, M. Footer, and M. Horowitz. Light fieldmicroscopy. In Proc. ACM SIGGRAPH, 2006. to appear.

[32] M. McGuire, W. Matusik, H. Pfister, J. F. Hughes, and F. Durand. Defocusvideo matting. In Proc. ACM SIGGRAPH, pages 567–576, 2005.

[33] T. Mitsunaga and S. K. Nayar. Radiometric self calibration. In Proc. Computer Vision and Pattern Recognition, pages 1374–1380, 1999.

[34] H. Nair and C. Stewart. Robust focus ranging. In Proc. Computer Vision and Pattern Recognition, pages 309–314, 1992.

[35] S. Nayar, M. Watanabe, and M. Noguchi. Real-time focus range sensor. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(12):1186–1198, Dec. 1996.

[36] S. K. Nayar and Y. Nakagawa. Shape from focus: An effective approach for rough surfaces. In Proc. International Conference on Robotics and Automation, volume 2, pages 218–225, 1990.

[37] R. Ng. Fourier slice photography. In Proc. ACM SIGGRAPH, pages 735–744, 2005.

[38] A. Pentland, T. Darrell, M. Turk, and W. Huang. A simple, real-time range camera. In Proc. Computer Vision and Pattern Recognition, pages 256–261, 1989.

[39] A. P. Pentland. A new sense for depth of field. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(4):523–531, July 1987.

[40] A. N. Rajagopalan and S. Chaudhuri. A variational approach to recovering depth from defocused images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(10):1158–1164, Oct. 1997.

[41] A. N. Rajagopalan and S. Chaudhuri. An MRF model-based approach to simultaneous recovery of depth and restoration from defocused images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(7):577–589, July 1999.


[42] Y. Y. Schechner and N. Kiryati. Depth from defocus vs. stereo: How different really are they? International Journal of Computer Vision, 39(2):141–162, Sept. 2000.

[43] Y. Y. Schechner, N. Kiryati, and R. Basri. Separation of transparent layers using focus. International Journal of Computer Vision, 39(1):25–39, 2000.

[44] S.-W. Shih, P.-S. Kao, and W.-S. Guo. Error analysis and accuracy improvement of depth from defocusing. In Proc. IPPR Conference on Computer Vision, Graphics and Image Processing, volume 4, pages 250–257, 2003.

[45] W. J. Smith. Modern Optical Engineering. McGraw-Hill, New York, 3rd edition, 2000.

[46] M. Subbarao. Parallel depth recovery by changing camera parameters. In Proc. International Conference on Computer Vision, pages 149–155, 1988.

[47] M. Subbarao and N. Gurumoorthy. Depth recovery from blurred edges. In Proc. Computer Vision and Pattern Recognition, pages 498–503, 1988.

[48] M. Subbarao and G. Surya. Depth from defocus: A spatial domain approach. International Journal of Computer Vision, 13(3):271–294, Dec. 1994.

[49] M. Subbarao and J.-K. Tyan. Selecting the optimal focus measure for autofocusing and depth-from-focus. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(8):864–870, Aug. 1998.

[50] V. Vaish, R. Szeliski, C. L. Zitnick, and S. B. Kang. Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures. In Proc. Computer Vision and Pattern Recognition, 2006. To appear.

[51] W. Wallace, L. H. Schaefer, and J. R. Swedlow. A workingperson's guide to deconvolution in light microscopy. Biotechniques, 31(5):1076–1097, 2001.

[52] M. Watanabe and S. K. Nayar. Minimal operator set for passive depth from defocus. In Proc. Computer Vision and Pattern Recognition, pages 431–438, 1996.

[53] M. Watanabe and S. K. Nayar. Telecentric optics for focus analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(12):1360–1365, Dec. 1997.

[54] M. Watanabe and S. K. Nayar. Rational filters for passive depth from defocus. International Journal of Computer Vision, 27(3):203–225, 1998.

[55] R. H. Webb. Confocal optical microscopy. Reports on Progress in Physics, 59(3):427–471, 1996.


[56] R. Willson. Modeling and calibration of automated zoom lenses. In Proc. SPIE #2350: Videometrics III, pages 170–186, Oct. 1994.

[57] R. Willson and S. Shafer. Dynamic lens compensation for active color imaging and constant magnification focusing. Technical Report CMU-RI-TR-91-26, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, Nov. 1991.

[58] R. Willson and S. Shafer. What is the center of the image? J. Optical Society of America A, 11(11):2946–2955, Nov. 1994.

[59] Y. Xiong and S. Shafer. Depth from focusing and defocusing. In Proc. Computer Vision and Pattern Recognition, pages 68–73, June 1993.

[60] Y. Xiong and S. Shafer. Moment and hypergeometric filters for high precision computation of focus, stereo and optical flow. International Journal of Computer Vision, 22(1):25–59, Feb. 1997.

[61] N. Xu, K. Tan, H. Arora, and N. Ahuja. Generating omnifocus images using graph cuts and a new focus measure. In Proc. International Conference on Pattern Recognition, volume 4, pages 697–700, 2004.

[62] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz. Spacetime faces: High-resolution capture for modeling and animation. In Proc. ACM SIGGRAPH, pages 548–558, Aug. 2004.

[63] N. F. Zhang, M. T. Postek, R. D. Larrabee, A. E. Vladar, W. J. Keery, and S. N. Jones. Image sharpness measurement in the scanning electron microscope—Part III. Scanning, 21:246–252, 1999.

[64] T. E. Zickler, P. N. Belhumeur, and D. J. Kriegman. Helmholtz stereopsis: Exploiting reciprocity for surface reconstruction. International Journal of Computer Vision, 49(2-3):215–227, 2002.

[65] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. In Proc. ACM SIGGRAPH, pages 600–608, 2004.
