Noname manuscript No. (will be inserted by the editor)
Focal Sweep Camera for Space-Time Refocusing
Changyin Zhou · Daniel Miau · Shree K. Nayar
Received: date / Accepted: date
Abstract In this paper, we present an imaging system that captures a sequence of images of a scene as focus is swept over a large depth range. We refer to this sequence as a space-time focal stack and propose algorithms for space-time image refocusing at sensor resolution: when a user clicks on a pixel, we show the user the image that was captured at the time when that pixel is best focused.
We first describe how to synchronize focus sweeping with image capture so that the summed depth of field of the focal stack efficiently covers the entire depth range without gaps or overlaps. We then propose an efficient algorithm for computing a space-time in-focus index map from a focal stack. The map indicates the time at which each pixel is best focused, and the algorithm is specifically designed to enable a seamless refocusing experience, even in textureless regions or at depth discontinuities. We implemented two prototype cameras to capture a large number of space-time focal stacks, and made an interactive refocusing viewer available online at www.focalsweep.com.
Keywords Focal Sweep · Focal Stack · Depth Recovery · Refocusing · Extended Depth of Field
Changyin Zhou, Department of Computer Science, Columbia University. E-mail: [email protected]
Daniel Miau, Department of Computer Science, Columbia University. E-mail: [email protected]
Shree K. Nayar, Department of Computer Science, Columbia University. E-mail: [email protected]
* Changyin Zhou is a software engineer at Google at the time of submission.
1 Introduction
Depth of field (DOF) is the range of scene depths that appear focused in an image. A conventional lens camera has a limited DOF due to a fundamental trade-off between DOF and signal-to-noise ratio (SNR). One can increase DOF by stopping down the lens aperture, but a small aperture limits the number of photons captured in a given time and therefore reduces SNR. Focal sweep is one of the extended depth of field (EDOF) techniques that overcome this limitation. In an EDOF focal sweep imaging system, focus is swept over a large depth range as an image is captured to produce a single all-in-focus image (Hausler, 1972; Nagahara et al., 2008). In this paper, we make use of focal sweep for a new application: space-time image refocusing.
We capture a sequence of images during the focal sweep. Concatenating all the captured images along the temporal dimension forms a 3-dimensional space-time volume. Figure 1 (a) shows an XYT 3D space-time focus volume of a synthetic scene of colored balls in motion. To reveal the 3D structure, Figure 1 (b) shows one 2D XT slice of the 3D volume, in which balls appear as double-cones. The apex of a double-cone marks the time when a ball is focused. This phenomenon has been observed and well studied in deconvolution microscopy (McNally et al., 1999). Unlike the traditional static focus volume, the space-time focus volume captures object motion as well. For objects in motion in both the x and y directions, these double-cones appear tilted, like the cyan cone shown in Figure 1 (b). By finding the apexes in the 3D volume (shown as the yellow band in Figure 1 (b)), we can obtain a space-time in-focus image, and the shape of the band reveals the 3D structure of the scene.
In practice, one can only capture a limited number of images due to constraints on time and camera capacity. The photographer must usually decide how many images to capture and where the sensor should be positioned for each capture so that every object in the depth range appears focused in at least one of the captured images. We show a capturing strategy that yields efficient and complete focus sampling of 3D focus volumes.
Fig. 1 Space-time focus volume. (a) A space-time focus volume of a synthetic scene of colored balls in motion. Objects move as the focus changes with time along the T dimension. (b) A 2D XT slice of the 3D volume, in which each small ball appears as a double-cone. The double-cones of moving balls are tilted. (c) Integrating the volume along the T dimension produces an EDOF image as captured by a typical focal sweep technique. Each object appears sharp in the EDOF image regardless of its depth.
Our major challenge in designing the algorithm was to compute an in-focus index map (or depth map) that enables a seamless refocusing experience. First, since users expect each tap to bring them to the image layer where the region boundary is well focused, even in texture-less regions, there cannot be any holes in the index map. Second, the index precision must be high enough (especially for textured regions or region boundaries) that the refocused layer appears perfectly focused. As we will show in Section 6, our algorithm design meets both of these challenges.
Most existing refocusing techniques (e.g., plenoptic cameras (Lippmann, 1908; Ng et al., 2005), camera arrays (Wilburn et al., 2005), and depth from defocus (Levin et al., 2007)) capture an image at a single moment. The proposed focal sweep camera differentiates itself from these techniques by capturing a space-time focal stack over a longer period, during which objects can be moving. While capturing an instant of time can be favorable in some situations, capturing a duration of time yields a unique and appealing user experience: the focus transition becomes naturally intertwined with object motion. In addition, our image refocusing is done at sensor resolution without any trade-off in image quality.
We implemented two prototype focal sweep cameras: one using a voice coil actuator to drive the sensor sweep and one using a linear actuator to drive the lens sweep. Both are able to synchronize the focal sweep with image capture so as to acquire focal stacks in a complete and efficient manner. A collection of focal sweep photographs has been captured using our prototypes, and users can refocus these images interactively at www.focalsweep.com.
2 Related Work

2.1 Focal Sweep and Focal Stack
A conventional lens camera has a finite depth of field. A variety of EDOF techniques have been proposed to extend depth of field over the past several decades (Castro and Ojeda-Castaneda, 2004; Cossairt et al., 2010; Dowski and Cathey, 1995; George and Chi, 2003; Guichard et al., 2009; Indebetouw and Bai, 1984; Mouroulis, 2008; Nagahara et al., 2008; Poon and Motamedi, 1987). Focal sweep is one such technique. A focal sweep EDOF camera captures a single image while its focus is quickly swept over a large range of depths. Hausler (1972) extended the DOF of microscopy by sweeping the specimen along the optical axis during exposure. Nagahara et al. (2008) extended DOF for consumer photography by sweeping the image sensor. They also show that the point-spread-function (PSF) of a focal sweep camera is depth invariant, allowing one to deconvolve a captured image with a single PSF to recover a sharp image without knowing the 3D structure of the scene. We build an imaging system similar to that of Nagahara et al. (2008), but use it to capture image stacks for image refocusing.
Several techniques have been proposed to capture a stack of images for extended depth of field and 3D reconstruction. In deconvolution microscopy, for example, a stack of images of a specimen is captured at different foci to form a 3D image (McNally et al., 1999; Sibarita, 2005). The 3D point-spread-functions in these 3D images are shown to be depth-invariant double-cones. By deconvolving with the 3D PSF, a sharp 3D image can be recovered. We observe a similar double-cone structure in image stacks captured using our own imaging system. Kuthirummal et al. (2011) integrate focal stacks into an EDOF image and recover all-in-focus images by deconvolution, as in focal sweep EDOF. Guichard et al. (2009) and Cossairt and Nayar (2010) make use of chromatic aberration to capture images at different foci in the three color channels with a single shot, and then combine the sharpness from all color channels to produce an all-in-focus image. Agarwala et al. (2004) propose using a global maximum contrast image objective to merge a focal stack into a single all-in-focus image.
Hasinoff et al. (2009) compare the optimality of various capture strategies for reducing optical blur in a comprehensive framework where both a sensor noise model and deblurring error are taken into account. Their analysis shows that focal stack photography has two performance advantages over one-shot photography in extending depth of field: 1) it allows one to capture a given DOF faster; 2) it achieves higher SNR in a given exposure time.
Hasinoff and Kutulakos (2009) consider the problem of minimizing the capture time for a scene with a given DOF and a given exposure level. This is closely related to the optimization problem in our technique, and similar analyses of camera DOF can be seen in both papers. While Hasinoff and Kutulakos (2009) emphasize the lens f-number and the number of images for focal stack capture, we optimize the speed of the focal sweep in synchronization with image exposure for a given lens f-number.
Further analysis by Kutulakos and Hasinoff (2009) shows that focal stack photography can not only capture a given DOF with reduced exposure time, but also achieve higher SNR with a restricted exposure time. Kutulakos and Hasinoff (2009) use an algorithm similar to that of Agarwala et al. (2004) to synthesize EDOF images from focal stacks, assuming that scenes are static. While their algorithms are optimized to produce artifact-free images with minimal blur, our algorithm is designed to yield a seamless space-time refocusing experience.
2.2 Light Field Camera and Image Refocusing
The concept of the light field has a long history. In the early 20th century, Lippmann (1908) and Ives (1930) proposed plenoptic camera designs to capture light fields. The idea of the light field resurfaced in the computer vision and graphics community in the late 1990s, when Levoy and Hanrahan (1996) and Gortler et al. (1996) described the representation of the 4D light field and showed how new views can be rendered using light field data. A stack of images with different foci can also be rendered from a light field and then used for image refocusing.
A number of light field cameras have been designed and built in recent years. Levoy et al. (2006) use a plenoptic camera to capture the light field of specimens and propose algorithms to compute a focal stack from a single light field image, which can be processed as in deconvolution microscopy to produce a sharp 3D volume. Ng et al. (2005) and Ng (2006) use the same plenoptic camera design and emphasize its application to image refocusing. Georgeiv et al. (2006) and Georgiev and Intwala (2006) show a number of variants of light field camera designs with different trade-offs between spatial and angular resolution. Light field cameras can also be built using camera arrays (Wilburn et al., 2005).
2.3 Depth from Focus or Defocus
Two other techniques closely related to ours are depth from focus and depth from defocus. Instead of capturing a complete focal stack, depth from focus or defocus captures only a few images with varying focus or aperture patterns and uses focus as a cue to compute depth maps of scenes (Chaudhuri and Rajagopalan, 1999; Nayar et al., 1996; Pentland, 1987; Rajagopalan and Chaudhuri, 1997; Zhou et al., 2009).
Focus measurement is a critical component of all depth from focus approaches, but several difficulties are associated with it. The scale-space effect (Perona and Malik, 1990) presents a major difficulty, since it can lead to depth ambiguity at different scales. In addition, image patches at depth discontinuities may cross multiple depth layers and make focus measures inaccurate. Furthermore, since a focus measure does not reveal anything about depth in non-textured regions, techniques such as plane fitting (Tao et al., 2001), graph-cut (Boykov and Kolmogorov, 2004), and belief propagation (Yedidia et al., 2001) have been employed to fill the resulting holes in the depth map.
3 Space-time Focus Volume, Focus Sampling, and Refocusing
Concatenating all the images captured during the focal sweep along the temporal dimension forms a 3D space-time volume that encodes more visual information about a scene than a single image. Figure 1 (a) shows a space-time focus volume of a synthetic scene of colored balls in motion, and (b) shows one 2D XT slice of the 3D volume, in which balls appear as double-cones. As mentioned in the introduction, the double-cones appear tilted for objects moving in the x and y directions; by finding the apexes in the 3D volume (shown as a yellow band in Figure 1 (b)), we can obtain a space-time in-focus image; and the shape of the band reveals the 3D structure of the scene. Motion along the optical axis is usually negligible in practice, since (according to the Thin Lens Law) a small sensor motion corresponds to a huge focus sweep on the object side.
3.1 Space-time Focal Stack and Focus Sampling
Within a given time budget, one can only capture a finite number of images during the focal sweep due to frame-rate limits and SNR considerations. The photographer must decide how many images to capture and where the sensor should be positioned for each capture so that every object in the depth range appears focused in at least one of the captured images.
Fig. 2 Efficient and complete focus sampling. (a) Left: a geometric illustration of depth of field; objects in the range $[Z_1, Z_2]$ appear focused when $v$ and $z$ satisfy the Thin Lens Law. Right: the Thin Lens Law shown as an orange line in the reciprocal domain; $\bar{Z}_1$ and $\bar{Z}_2$ can be easily located in the reciprocal domain (i.e., in diopters) by $|\bar{Z}_i - \bar{Z}| = \bar{u} \cdot c/N$. (b) For an efficient and complete focus sampling, the DOFs of consecutive sensor positions (e.g., $v_{i-1}$, $v_i$, $v_{i+1}$) must have no gap or overlap.
This is, in essence, a sampling problem over the 3D focus volume, and we argue that an ideal capture should satisfy two conditions:
– Completeness: the DOFs of all captured images should sum up to cover the entire desired depth range. If the desired depth range is $\mathrm{DOF}^*$, we have

  $\mathrm{DOF}_1 \cup \mathrm{DOF}_2 \cup \cdots \cup \mathrm{DOF}_n \supset \mathrm{DOF}^*$   (1)

– Efficiency: all DOFs should have no overlap, so that only a minimal number of images is required:

  $\mathrm{DOF}_1 \cap \mathrm{DOF}_2 \cap \cdots \cap \mathrm{DOF}_n = \emptyset$   (2)
Hasinoff and Kutulakos (2009) refer to an image sequence as a sequence with sequential DOFs if the end-point of one image's DOF is the start-point of the next image's DOF. The ideal capture sequence described above is a sequence with sequential DOFs that covers the entire desired depth range.
We start our DOF analysis from the Thin Lens Law, $1/f = 1/u + 1/z$, where $f$ is the focal length of the lens, $u$ is the sensor-lens distance, and $z$ is the object distance. As is common practice, we transform the equation to the reciprocal domain, where the Thin Lens Law becomes a linear equation: $\bar{f} = \bar{u} + \bar{z}$, with $\bar{x} = 1/x$. The reciprocal of the object distance, $\bar{z}$, is often referred to as the diopter. The depth of field, $[z_1, z_2]$, is the depth range where the blur radius is less than the circle of confusion, $c$. As is common in digital photography, the pixel size is used as the circle of confusion in this paper. The DOF in the reciprocal domain, $\bar{z}_1$ and $\bar{z}_2$, can be derived as:

$$\bar{z}_1 = \bar{z} + \bar{u} \cdot c/A, \qquad \bar{z}_2 = \bar{z} - \bar{u} \cdot c/A, \qquad (3)$$

where $A$ is the aperture diameter of the lens. Both the position and the range of the DOF change with the sensor position. Figure 2 (a) shows the geometry of the DOF for a sensor position $u$ on the left and illustrates the DOF in the reciprocal domain on the right. The yellow line in the figure represents the Thin Lens Law. For an arbitrary sensor position $u$, the extent of its DOF in the reciprocal domain is $\Delta = 2 \cdot \bar{u} \cdot c/A$ according to Equation 3.

For an efficient and complete focus sampling, we require that consecutive DOFs have no overlap and no gap, as shown in Figure 2 (b). From Equation 3, we derive:

$$|\bar{u}_i - \bar{u}_{i+1}| = (\bar{u}_i + \bar{u}_{i+1}) \cdot c/A. \qquad (4)$$

In consumer photography we have $z \gg u$ and therefore $u_i \approx f$. Approximating Equation 4 then gives:

$$\delta u \approx 2 \cdot c \cdot N, \qquad (5)$$

where $N = f/A$ is the f-number of the lens. Equation 5 shows that to achieve an efficient and complete focus sampling, we only need to move the sensor in constant steps, with a step size determined by the pixel size and the f-number. Notice that this is a constant step in the normal domain; in the reciprocal domain, the step is not constant, as shown in Figure 2 (b). In particular, for a fixed frame rate $P$, the sensor should be swept at a constant speed:

$$s = \frac{\delta u}{\delta t} = 2 \cdot c \cdot N \cdot P. \qquad (6)$$
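To make the capture planning concrete, the following sketch evaluates Equations 3–6 for an idealized thin lens. The helper name and defaults are ours, not from the paper's implementation; we plug in the pixel size c = 3.3 µm quoted in Section 4.2.

```python
import numpy as np

def focal_sweep_plan(f, N, c, P, z_min, z_max=np.inf):
    """Plan an efficient and complete focus sweep (Eqs. 3-6).

    f : focal length (m);  N : f-number;  c : circle of confusion,
    taken to be the pixel size (m);  P : frame rate (fps);
    [z_min, z_max] : desired depth range (m).
    """
    # Sensor positions focusing the two ends of the depth range
    # (Thin Lens Law: 1/f = 1/u + 1/z  =>  u = 1/(1/f - 1/z)).
    u_near = 1.0 / (1.0 / f - 1.0 / z_min)
    u_far = 1.0 / (1.0 / f - 1.0 / z_max)         # equals f when z_max = inf

    du = 2.0 * c * N                              # constant sensor step (Eq. 5)
    s = du * P                                    # constant sweep speed (Eq. 6)
    k = int(np.ceil((u_near - u_far) / du)) + 1   # number of images
    T = k / P                                     # overall capture time (s)
    return du, s, k, T

# Typical setting from Section 4.2: 9 mm F/1.4 lens, 120 fps, scene from
# 0.4 m to infinity. This yields roughly a 1.1 mm/s sweep and about
# 24 images in about 0.2 s, close to the numbers reported in Section 4.2.
du, s, k, T = focal_sweep_plan(f=9e-3, N=1.4, c=3.3e-6, P=120, z_min=0.4)
```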
In the case where the overall time budget is too short (or the sensor moves too slowly) to perform a complete focus sampling, deblurring must be used to recover the sharpness of objects in the DOF gaps. Hasinoff et al. (2009) propose a comprehensive framework for optimizing focus sampling in this case by taking the noise model, capture overhead, and the effect of deblurring into account. In this paper, we concentrate on the problem of how to sample focus in an efficient and complete manner for space-time image refocusing. By avoiding aggressive deblurring in the process, we save a large amount of computation and produce higher-quality refocusing results that are more natural and artifact-free.
3.2 Space-time In-focus Index Map and Refocusing
In a system where the speed of the focal sweep is much faster than the speed of an object's motion along the optical axis (z), the object appears in focus only once in the focus volume (or, equivalently, in only one image of the focal stack). Let $F_1, F_2, \ldots, F_k$ be the focal stack of $k$ images. For each pixel, we find the index of the frame where the pixel is best focused. The indices of all the best-focused pixels comprise a space-time in-focus index map.
In a dynamic scene, both the focus and the objects themselves are free to move, which leads to ambiguities in the definition of the space-time in-focus index map. Hence, there are different ways of defining it:

1. For each pixel $(x, y)$, look into a small tube at $(x, y)$ in the 3D XYT focal stack and find which layer appears the sharpest. If $N(x, y)$ is the sharpest layer (or time), then $N(x, y)$ is the in-focus index map for the focal sweep.
2. Explicitly consider object motion. At an arbitrary layer (or time) $t$, for any object at a spatial location $(x, y)$, we could track this object and find the layer (or time) $t'$ and spatial location $(x', y')$ at which it is best focused. In this case, the in-focus index map is $t' = N_1(x, y, t)$, which is a 3D index map.
3. Track all objects at location $(x, y)$ in the space-time focal stack to the layers where they are best focused, and then, among all the final layers, pick the layer closest to the present layer. In this case, the in-focus index map is $t' = N_2(x, y, t) = \arg\min_t |N_1(x, y, t) - t_0|$.
In this paper, we choose the first definition for simplicity. A refocusing viewer takes a user click $(x, y)$ as input, finds the target frame $t = N(x, y)$, and smoothly transitions the displayed image from the current frame to that frame. The other two definitions can yield different refocusing behaviors and user experiences. Ideally, the choice should be made based on the user's intention or preference when they click on a pixel in the viewer. We leave the study of the other two (or more) definitions and their impact on user experience to future work.
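As an illustration of the first definition, a naive per-pixel index map can be computed by maximizing a local sharpness measure along each temporal tube. This is only a baseline sketch with names of our choosing; the robust algorithm we actually use is described in Section 5.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def naive_index_map(stack, d=7):
    """Definition 1: per-pixel argmax of local sharpness over time.

    stack : (k, H, W) array, the focal stack in grayscale
    d     : side of the local window (the "tube" cross-section)
    """
    stack = np.asarray(stack, dtype=float)
    sharpness = np.empty_like(stack)
    for i, frame in enumerate(stack):
        lap = laplace(frame)                        # high-frequency response
        # Local variance of the Laplacian over a d x d window:
        sharpness[i] = uniform_filter(lap**2, size=d) \
                       - uniform_filter(lap, size=d)**2
    return np.argmax(sharpness, axis=0)             # N(x, y)
```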
With this definition, a space-time in-focus index map $N(x, y)$ is a mapping from a spatial location $(x, y)$ to a temporal point $t = N(x, y)$. Since the focus is swept over a depth range over time, each index is also related to a depth, although it might correspond to that depth at a different temporal point. Because of this correlation between depth and the focal sweep, users will observe objects moving as the refocus changes.
4 Focal Sweep Camera

4.1 Prototypes
Focal sweep can be implemented in multiple ways. One way is to directly sweep the image sensor, as we do in our analysis. A variety of actuators, such as voice coil motors, piezoelectric motors, ultrasonic transducers, and DC motors, can be used to translate the sensor in a prescribed manner during the capture duration. Another way is to sweep the camera lens; the auto-focus mechanism commonly built into many commercial lenses may be programmed to perform focal sweep photography. Liquid lenses (Ren and Wu, 2007; Ren et al., 2006), which focus at different distances when different voltages are applied, are yet another power-efficient way of performing focal sweep.
We built two prototype focal sweep cameras, as shown in Figure 3. Prototype 1, shown in Figure 3 (a), uses a Fujinon HF9HA-1B 9mm F/1.4 C-mount lens and a Point Grey Flea 3 camera with a maximum resolution of 1328 × 1048, whose sensor is driven by a voice coil actuator (BEI LA15-16-024). This setup is similar to the one used in (Nagahara et al., 2008) for extended depth of field. The sensor is tethered to a laptop via a USB 3.0 cable and synced with the motor start/stop signal. The voice coil motor and the motor controller are able to translate the sensor at a speed of 1.47 mm/s. The major advantage of this implementation is that all of the parts are off-the-shelf components. This first prototype demonstrates that a focal sweep camera can be built with minimal effort. A collection of focal sweep photographs captured by this prototype is shown on the website www.focalsweep.com.
Prototype 2, shown in Figure 3 (b), is a more compact design, in which the sensor is secured to a fixed structure and the lens is translated during the sensor's integration time. In this prototype we use a compact linear actuator instead of a voice coil motor, allowing us to reduce the camera's overall size. The same lens and camera as in Prototype 1 are used. During the integration time, the lens is translated from the near focus position to the far focus position; with this prototype, we are able to translate it at a top speed of 0.9 mm/s. The major advantages of this implementation are its compactness and its close resemblance to existing camera architectures.
4.2 Camera Settings
As in conventional photography, users first determine the frame rate $P$ and f-number $N$ for each scene according to the speed of object motion, the lighting conditions, and the desired amount of defocus in the captured images. Then, the ideal sensor sweep speed $s$ can be computed using Equation 6. This ideal focal sweep speed is independent of the camera focus and the distance range of the scene, which makes configuration friendly to users.

Although the sweep speed is independent of camera focus and scene depth range, the number of images, $k$, and the overall time to capture the image stack, $T$, depend strongly on both.
Fig. 3 Two focal sweep camera prototypes. (a) Prototype 1 drives the sensor sweep using a voice coil; (b) Prototype 2 drives the lens sweep using a linear actuator.
Fig. 4 For a given pixel size, frame rate, and f-number, the overall capture time and total image count are highly related to the focal length and the scene distance range. (a) The overall capture time $T$ versus focal length $f$; (b) the $f$–$k$ plot; (c) and (d) the plots of $T$ and $k$ with respect to the depth range (in diopters) for a camera with a 9mm lens. In each plot, the red spot indicates the most typical setting in our implementation. (Plot parameters: $N = 1.4$, $c = 2.2\,\mu\mathrm{m}$, $P = 120$ fps.)
Figure 4 (a) shows how the overall capture time $T$ varies with the camera focal length $f$ for a camera with $N = 1.4$, $c = 3.3\,\mu\mathrm{m}$, and $P = 120$ fps; $T$ increases in proportion to the square of $f$. Figure 4 (b) plots the total number of captured images, $k$, with respect to the focal length $f$ for the same camera. Again, $k$ is proportional to $f^2$.

In the most common scenarios, the desired scene ranges from a certain distance, $Z_{\min}$, to infinity. Figure 4 (c) and (d) plot the overall capture time $T$ and total image count $k$ with respect to the inverse of $Z_{\min}$ (in diopters). We can see that both are linear: the closer the foreground is to the camera, the longer the capture time. The x-axis is labeled in both diopters and distance for easy reference.
It can be noted from the figure that the required capture time and image count vary over a huge range across settings. The red dot in each plot indicates the typical setting in our implementation: we use a 9mm lens and our scenes' depths range from 0.4 m to infinity, so it takes us 0.2 s to capture 20 images. Our prototypes are also able to capture scenes with a smaller $Z_{\min}$, but the sweep then takes considerably longer to capture, as shown in Figure 4 (c). Figure 5 shows a sample space-time focal stack captured using our first prototype camera.
Fig. 6 An overall diagram illustrating the pipeline: (1) capture a space-time focal stack; (2) image stabilization; (3) compute the space-time in-focus image; (4) build pyramids of the image stack and the in-focus image; (5) compute space-time index maps at multiple scales; (6) merge the multi-scale maps into a reliable sparse map; (7) image over-segmentation; (8) index interpolation; (9) user interactive refocusing.
5 Algorithm
Figure 6 shows an overview of the proposed algorithm. After a stack of images, $\{F_i^0\}$, is captured, we first apply a typical multi-scale optical flow algorithm to estimate frame-to-frame global transformations due to hand shake, and then stabilize the image stack.
Fig. 5 A sample space-time focal stack captured using focal sweep camera Prototype 1. (a) A space-time focal stack of 25 images; (b) a 2D slice of the 3D stack; (c) the first frame of the stack, where the foreground is in focus; (d) the last frame of the stack, where the background is in focus. The capture frame rate is 120 fps. It took the focal sweep camera about 0.2 s to capture the whole sequence, allowing us to capture the motions of the persons.
The stabilized image stack $\{F_i\}$ is then used to compute a space-time in-focus image (Section 5.1) and index maps at various scales (Section 5.2). In Section 5.3, we describe a new approach to merging the multi-scale index maps into one high-quality index map.
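The paper describes the stabilization step only at a high level; a minimal sketch in the same spirit (sparse pyramidal optical flow plus a robust global similarity transform, using standard OpenCV calls) might look as follows. The function name and parameter values are ours, not the authors'.

```python
import cv2
import numpy as np

def stabilize_stack(frames):
    """Undo hand-shake by composing frame-to-frame global transforms.

    frames : list of HxWx3 uint8 images, the raw focal stack {F_i^0}.
    Returns the stabilized stack {F_i}, all aligned to frame 0.
    """
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    h, w = prev.shape
    out = [frames[0]]
    M_acc = np.eye(2, 3)                  # accumulated warp to frame 0
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Track corners from the previous frame (pyramidal Lucas-Kanade
        # is a standard multi-scale optical flow).
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=400,
                                      qualityLevel=0.01, minDistance=8)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        good = status.ravel() == 1
        # Robust global similarity transform: current -> previous frame.
        M, _ = cv2.estimateAffinePartial2D(nxt[good], pts[good],
                                           method=cv2.RANSAC)
        # Compose so that the current frame maps all the way to frame 0.
        M_acc = (np.vstack([M_acc, [0, 0, 1]]) @
                 np.vstack([M, [0, 0, 1]]))[:2]
        out.append(cv2.warpAffine(frame, M_acc, (w, h)))
        prev = gray
    return out
```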
There are two key ideas in the algorithm. First, we use a pyramid strategy to handle non-textured regions. For each pixel, we estimate its index (the frame in which it is best focused) at multiple scales. Due to the scale-space effect (Perona and Malik, 1990), the index may not be consistent across scales, especially in regions with weak texture, depth discontinuities, or object motion. This inconsistency is one of the fundamental difficulties in the algorithm's design, and we show a simple yet effective solution.

Second, at any given scale, we propose a novel approach to computing the index map. In the literature, it is common to first estimate an index map (or depth map) using a focus measure, and then use the index map to produce an all-in-focus image (Agarwala et al., 2004; Hasinoff and Kutulakos, 2009). In this paper, however, we take a different strategy: we first compute an all-in-focus image without knowing the index map, and then use the all-in-focus image to estimate the index map. We will show the advantages of this new strategy.
5.1 Space-time In-focus Image
Given a focal stack, we first compute a space-time in-focus image without the index map. The idea is inspired by the focal sweep EDOF technique. Kuthirummal et al. (2011) show that the mean of a focal stack preserves image details, and deconvolve the averaged image with a $(1/x)$-shaped integral point-spread-function (IPSF) to recover an all-in-focus EDOF image without knowing the depth map. This approach is further shown to be robust in regions with depth edges, occlusions, and even object motion. In Figure 7, we show the mean image of a space-time focal stack (a) and the EDOF image after deconvolution (b).

Although both (a) and (b) preserve most high-frequency information, the average image has low contrast (especially as the number of images increases), and the deconvolved EDOF image (b) is prone to image artifacts. In addition, deconvolution is computationally expensive, especially on mobile devices. In this paper, we instead compute a space-time in-focus image as a weighted sum of all images:

$$I(x) = \frac{\sum_i W_i(x) \cdot F_i(x)}{\sum_i W_i(x) + \epsilon}, \qquad (7)$$

where the weight $W_i(x)$ is defined as the variance of the Laplacian of the patch $P_i(x, d)$ of size $d$ centered at $x$ in the $i$th frame:

$$W_i(x) = \mathbb{V}(\Delta P_i(x, d)). \qquad (8)$$

With this strategy, severely blurred patches carry much less weight than sharper patches do, reducing the hazy effect seen in the average image in Figure 7 (a). As shown in (c), the weighted sum is sharp and has high contrast even without deconvolution. For comparison, Figure 7 also shows an EDOF image computed by a typical strategy (Agarwala et al., 2004; Kutulakos and Hasinoff, 2009).
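A compact sketch of Equations 7 and 8 in NumPy/SciPy terms, assuming grayscale frames; the local variance of the Laplacian is computed with box filters, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def in_focus_image(stack, d=7, eps=1e-8):
    """Weighted average of a focal stack (Eqs. 7 and 8).

    stack : (k, H, W) array;  d : patch size;
    eps   : the small constant in the denominator of Eq. 7.
    """
    stack = np.asarray(stack, dtype=float)
    num = np.zeros(stack.shape[1:])
    den = np.zeros(stack.shape[1:])
    for frame in stack:
        lap = laplace(frame)
        # W_i(x): local variance of the Laplacian patch (Eq. 8)
        w = uniform_filter(lap**2, size=d) - uniform_filter(lap, size=d)**2
        num += w * frame
        den += w
    return num / (den + eps)                      # I(x), Eq. 7
```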
5.2 Space-time In-focus Index Maps at Various Scales
For each pixel $(x, y)$, we look for the frame whose surrounding patch is most similar in its high frequencies to that of the computed in-focus image $I(x, y)$. The in-focus index map $M(x)$ is then estimated as:

$$M(x, y) = \arg\min_i S(F_i(x, y), I(x, y)), \qquad (9)$$

where $S$ measures the high-frequency dissimilarity between $F_i$ and $I$ at each pixel and is defined as

$$S(P, Q) = |\Delta(P - Q)| \otimes u(r), \qquad (10)$$

where $P$ and $Q$ denote the patches in $F_i$ and $I$, respectively, $\otimes$ is convolution, and $u(r)$ is a pillbox function of radius $r$.

We compute an index map at every level of the focal stack pyramid, obtaining $M^{(i)}, i = 1, 2, \ldots, k$, where $k$ is the total number of pyramid levels. At each level, the focal stack has half the spatial resolution (2 × 2 downsampling) of the level above. We use $d = 7$, $k = 7$, $r = 5$ in our implementation. Figure 8 (b) shows the index map pyramid.
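A single-scale sketch of Equations 9 and 10, to be run at each pyramid level; a box filter stands in for the pillbox support here, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def index_map(stack, in_focus, r=5):
    """Single-scale space-time in-focus index map (Eqs. 9 and 10).

    stack    : (k, H, W) array (one pyramid level of the focal stack)
    in_focus : (H, W) in-focus image I at the same level
    r        : radius of the pillbox support; a (2r+1)-wide box filter
               stands in for the pillbox u(r) in this sketch.
    """
    stack = np.asarray(stack, dtype=float)
    cost = np.empty_like(stack)
    for i, frame in enumerate(stack):
        resid = np.abs(laplace(frame - in_focus))   # |Laplacian(P - Q)|
        cost[i] = uniform_filter(resid, size=2 * r + 1)
    return np.argmin(cost, axis=0)                  # M(x, y), Eq. 9
```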
Fig. 7 Space-time in-focus images computed using different approaches, with close-ups. (a) The mean of all images in the stack; (b) the mean image deconvolved using an integral PSF; (c) the weighted average of all images in the stack; (d) patches from the best-focused frames.
Fig. 8 (a) A pyramid of space-time in-focus images; (b) a pyramid of space-time index maps; (c) a reliable index map computed from (b) using index consistency; (d) an over-segmentation of the full-resolution in-focus image; (e) our final index map computed from (c) and (d) by hole-filling; (f) an index map computed using a traditional algorithm, which uses difference-of-Gaussians as the focus measure (Agarwala et al., 2004; Hasinoff and Kutulakos, 2009) and Graph-cut for global optimization (Boykov and Kolmogorov, 2004).
5.3 Merging and Interpolating Index Maps
Due to scale-space effects and depth discontinuities, the index maps computed at different scales can differ significantly, as can be seen in Figure 8 (b). We propose constructing a reliable but sparse index map by accepting only indices that are consistent across all levels:

$$D_0(x) = \begin{cases} M_i(x), & \text{if } \max_i[M_i(x)] - \min_i[M_i(x)] < \tau \\ \emptyset, & \text{otherwise,} \end{cases} \qquad (11)$$
where $\tau$ is set to a small number to enforce consistency. One example is shown in Figure 8 (c); the pixels with no index assigned are shown in black. We observe that the index map is dense in textured regions, and sparse in non-textured regions and at depth boundaries. We then over-segment the in-focus image and interpolate the indices within each segment according to two simple rules (a sketch of this step follows below):

– If the segment has at least $m$ valid indices, interpolate by plane fitting.
– If the number of valid indices is less than $m$, use nearest-neighbor interpolation.
Figure 8 (d) shows a segmentation result obtained using Graph-cut, and Figure 8 (e) shows the index map after interpolation. We can see that the index map is sharp at depth boundaries and smooth in non-textured regions.
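The following sketch combines the consistency test (Equation 11) with the two interpolation rules, given per-scale index maps upsampled to full resolution and a precomputed segment label map; all names and the hole-marking convention are ours.

```python
import numpy as np

def merge_and_interpolate(maps, segments, tau=1, m=10):
    """Merge multi-scale index maps (Eq. 11), then fill holes per segment.

    maps     : (L, H, W) int array of per-scale index maps, resampled
               to full resolution
    segments : (H, W) array of over-segmentation labels
    tau      : consistency threshold;  m : min indices for plane fitting
    """
    maps = np.asarray(maps)
    consistent = maps.max(axis=0) - maps.min(axis=0) < tau
    D = np.where(consistent, maps[0], -1)           # -1 marks a hole
    for label in np.unique(segments):
        mask = segments == label
        holes = mask & (D < 0)
        if not holes.any():
            continue
        hy, hx = np.nonzero(holes)
        vy, vx = np.nonzero(mask & (D >= 0))
        if len(vy) >= m:
            # Rule 1: fit a plane t = a*x + b*y + c to the valid indices.
            A = np.column_stack([vx, vy, np.ones_like(vx)])
            coef, *_ = np.linalg.lstsq(A, D[vy, vx], rcond=None)
            D[hy, hx] = np.round(coef[0]*hx + coef[1]*hy + coef[2]).astype(int)
        elif len(vy) > 0:
            # Rule 2: nearest-neighbor interpolation inside the segment.
            for y, x in zip(hy, hx):
                j = np.argmin((vy - y)**2 + (vx - x)**2)
                D[y, x] = D[vy[j], vx[j]]
    return D
```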
With the estimated index map $M(x, y)$, we can perform image refocusing. In the refocusing viewer, for any pixel $(x, y)$ that a user taps, we transition the displayed image from the present image to the image indexed by $M(x, y)$. The transition is made smooth by sequentially displaying the images between the present index and $M(x, y)$. We have made our refocusing viewer available online at www.focalsweep.com.
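The viewer logic reduces to a few lines; this hypothetical sketch steps through the intermediate frames to animate the transition, with the display callback left to the host GUI.

```python
import time

def refocus_to(stack, index_map, current, x, y, display, fps=30):
    """Animate the transition from the current frame to the frame in
    which the tapped pixel (x, y) is best focused.

    stack : list of frames;  index_map : M(x, y) from the algorithm above;
    display : any callable that shows one frame (e.g., a GUI blit).
    """
    target = int(index_map[y][x])
    step = 1 if target >= current else -1
    for i in range(current, target + step, step):
        display(stack[i])                 # play intermediate frames
        time.sleep(1.0 / fps)             # pace the transition
    return target                         # the new current index
```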
6 Experiments
In all the experiments shown in this paper, we set $m = 10$, $\tau = 1$, $d = 7$, $r = 5$. We compare index maps computed using the proposed technique with those from a traditional depth estimation algorithm that maximizes a simple focus measure. There are various definitions of focus measures (Nayar and Nakagawa, 1990; Nayar et al., 1996; Subbarao and Tyan, 1998; Xiong and Shafer, 1993). We adopted the one used in the photomontage method (Agarwala et al., 2004), which defines the focus measure as simple local contrast according to a Difference-of-Gaussians filter, and we further polished the results using Graph-cut (Boykov and Kolmogorov, 2004). Graph-cut, as a global optimization technique, helps to fill the holes and smooth the index map, as shown in Figure 8 (f). Our result, shown in (e), is better in non-textured or specular regions and at depth discontinuities, which are important for image refocusing.
Figure 9 shows more space-time focal stacks that we captured using Prototype 1, together with the computed in-focus images and index maps. Since it is impossible to demonstrate interactive image refocusing in paper form, we only show the computed index maps and in-focus images here. In the index maps computed by the proposed algorithm, there are no obvious holes or artifacts, even in textureless regions, and the index maps are sharp at depth boundaries. To experience the interactive refocusing of these scenes, a refocusing viewer is available at www.focalsweep.com.

Fig. 9 More experimental results. Each row corresponds to one scene. From left to right: (a) and (b) are the first and last frames captured with the focal sweep, (c) are the computed space-time in-focus images, and (d) are the estimated space-time in-focus index maps. The resulting index maps are used for image refocusing, as demonstrated on our website www.focalsweep.com.
7 Discussion
In this paper, we presented a focal sweep imaging system that captures space-time focal stacks for image refocusing. The proposed camera sweeps focus at such a speed that the summed DOF of the captured focal stack efficiently covers the entire desired depth range. The major benefit of this design lies in the fact that the camera directly captures all the images required for image refocusing. By avoiding image synthesis and rendering (which are common in many other competing designs), this design provides users high-quality full-resolution images at every focus with minimal computational cost.
Unlike other image refocusing techniques, our technique performs space-time image refocusing, since each refocused image is captured at a different point in time. Users see objects move into (or out of) focus when they tap to refocus. The combination of refocusing and motion yields a unique user experience.
Due to object motion, a pixel that a user taps on might correspond to different objects at different focus layers (or time points). For example, a defocused object in motion often appears blended with the object behind it, and there are cases where it would be preferable to estimate the object's motion and refocus along the estimated motion trajectory. Resolving this ambiguity often requires a deeper understanding of user intentions. In this paper, however, we choose to stay simple in the face of the ambiguity: for each clicked pixel, we check the pixel at all focus layers and choose the sharpest one, without explicitly considering object motion. This simplicity not only improves the robustness of the refocusing algorithm, but also makes the refocusing results easier to predict, which helps users adapt to the particular refocusing behavior. There are other possible refocusing choices, as discussed in Section 3.2, which deal with the ambiguity in different manners; we leave them as future work.
We implemented two prototype focal sweep cameras, one using a voice coil to drive the sensor sweep and one using a linear actuator to drive the lens sweep. A more compact and promising implementation could make use of the auto-focus mechanism that commonly exists in many commercial lenses; to do so, one would need the capability to control and synchronize focus with image capture.
Acknowledgments

This research was funded in part by ONR Grant No. N00014-11-1-0285, ONR Grant No. N00014-08-1-0929, and DARPA Grant No. W911NF-10-1-0214.
References
Agarwala A, Dontcheva M, Agrawala M, Drucker S, Colburn A, Curless B, Salesin D, Cohen M (2004) Interactive digital photomontage. In: ACM Transactions on Graphics (TOG), ACM, vol 23, pp 294–302
Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9):1124–1137
Castro A, Ojeda-Castaneda J (2004) Asymmetric phase masks for extended depth of field. Applied Optics 43(17):3474–3479
Chaudhuri S, Rajagopalan A (1999) Depth from Defocus: A Real Aperture Imaging Approach. Springer Verlag
Cossairt O, Nayar S (2010) Spectral focal sweep: Extended depth of field from chromatic aberrations. In: International Conference on Computational Photography, pp 1–8
Cossairt O, Zhou C, Nayar S (2010) Diffusion coded photography for extended depth of field. In: SIGGRAPH, ACM, pp 1–10
Dowski E, Cathey W (1995) Extended depth of field through wave-front coding. Applied Optics 34(11):1859–1866
George N, Chi W (2003) Extended depth of field using a logarithmic asphere. Journal of Optics A: Pure and Applied Optics 5:S157
Georgeiv T, Zheng K, Curless B, Salesin D, Nayar S, Intwala C (2006) Spatio-angular resolution tradeoff in integral photography. In: Eurographics Symposium on Rendering
Georgiev T, Intwala C (2006) Light field camera design for integral view photography. Tech. rep., Adobe
Gortler S, Grzeszczuk R, Szeliski R, Cohen M (1996) The lumigraph. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, pp 43–54
Guichard F, Nguyen H, Tessieres R, Pyanet M, Tarchouna I, Cao F (2009) Extended depth-of-field using sharpness transport across color channels. Technical Paper, DXO Labs
Hasinoff S, Kutulakos K (2009) Light-efficient photography. IEEE Pattern Analysis and Machine Intelligence 1(1):1
Hasinoff S, Kutulakos K, Durand F, Freeman W (2009) Time-constrained photography. In: IEEE International Conference on Computer Vision, pp 333–340
Hausler G (1972) A method to increase the depth of focus by two step image processing. Optics Communications 6(1):38–42
Indebetouw G, Bai H (1984) Imaging with Fresnel zone pupil masks: extended depth of field. Applied Optics 23(23):4299–4302
Ives H (1930) Parallax panoramagrams made with a large diameter lens. Journal of the Optical Society of America A 20(6):332–340
Kuthirummal S, Nagahara H, Zhou C, Nayar S (2011) Flexible depth of field photography. PAMI 33(1):58–71
Kutulakos K, Hasinoff S (2009) Focal stack photography: High-performance photography with a conventional camera. In: Proc. 11th IAPR Conference on Machine Vision Applications, pp 332–337
Levin A, Fergus R, Durand F, Freeman W (2007) Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (TOG) 26(3):70
Levoy M, Hanrahan P (1996) Light field rendering. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, pp 31–42
Levoy M, Ng R, Adams A, Footer M, Horowitz M (2006) Light field microscopy. In: ACM Transactions on Graphics (TOG), ACM, vol 25, pp 924–934
Lippmann G (1908) La photographie integrale. Comptes-Rendus, Academie des Sciences (146):446–551
McNally J, Karpova T, Cooper J, Conchello J (1999) Three-dimensional imaging by deconvolution microscopy. Methods 19(3):373–385
Mouroulis P (2008) Depth of field extension with spherical optics. Optics Express 16(17):12995–13004
Nagahara H, Kuthirummal S, Zhou C, Nayar S (2008) Flexible depth of field photography. In: European Conference on Computer Vision
Nayar S, Nakagawa Y (1990) Shape from focus: An effective approach for rough surfaces. In: ICRA, IEEE, pp 218–225
Nayar S, Watanabe M, Noguchi M (1996) Real-time focus range sensor. PAMI 18(12):1186–1198
Ng R (2006) Digital light field photography. PhD thesis, Stanford University
Ng R, Levoy M, Bredif M, Duval G, Horowitz M, Hanrahan P (2005) Light field photography with a hand-held plenoptic camera. Stanford Computer Science Technical Report 2
Pentland A (1987) A new sense for depth of field. IEEE Pattern Analysis and Machine Intelligence (4):523–531
Perona P, Malik J (1990) Scale-space and edge detection using anisotropic diffusion. PAMI 12(7):629–639
Poon T, Motamedi M (1987) Optical/digital incoherent image processing for extended depth of field. Applied Optics 26(21):4612–4615
Rajagopalan A, Chaudhuri S (1997) Optimal selection of camera parameters for recovery of depth from defocused images. In: CVPR, IEEE, pp 219–224
Ren H, Wu S (2007) Variable-focus liquid lens. Optics Express 15(10):5931–5936
Ren H, Fox D, Anderson P, Wu B, Wu S (2006) Tunable-focus liquid lens controlled using a servo motor. Optics Express 14(18):8031–8036
Sibarita J (2005) Deconvolution microscopy. Microscopy Techniques, pp 1288–1291
Subbarao M, Tyan J (1998) Selecting the optimal focus measure for autofocusing and depth-from-focus. PAMI 20(8):864–870
Tao H, Sawhney H, Kumar R (2001) A global matching framework for stereo computation. In: ICCV, IEEE, vol 1, pp 532–539
Wilburn B, Joshi N, Vaish V, Talvala E, Antunez E, Barth A, Adams A, Horowitz M, Levoy M (2005) High performance imaging using large camera arrays. ACM Transactions on Graphics (TOG) 24(3):765–776
Xiong Y, Shafer S (1993) Depth from focusing and defocusing. In: CVPR, IEEE, pp 68–73
Yedidia J, Freeman W, Weiss Y (2001) Generalized belief propagation. Advances in Neural Information Processing Systems, pp 689–695
Zhou C, Lin S, Nayar S (2009) Coded aperture pairs for depth from defocus. In: IEEE International Conference on Computer Vision