Noname manuscript No. (will be inserted by the editor)
Focal Sweep Camera for Space-Time Refocusing
Changyin Zhou · Daniel Miau · Shree K. Nayar
Received: date / Accepted: date
Abstract In this paper, we present an imaging system that captures a sequence of images of a scene as focus is swept over a large depth range. We refer to this sequence as a space-time focal stack and propose algorithms for space-time image refocusing at sensor resolution: when a user clicks on a pixel, we show the user the image that was captured at the time when that pixel is best focused.
We first describe how to synchronize focus sweeping with image capture so that the summed depth of field of the focal stack efficiently covers the entire depth range without gaps or overlaps. We then propose an efficient algorithm for computing a space-time in-focus index map from a focal stack. The map indicates the time at which each pixel is best focused, and the algorithm is specifically designed to enable a seamless refocusing experience, even in textureless regions or at depth discontinuities. We implemented two prototype cameras to capture a large number of space-time focal stacks, and made an interactive refocusing viewer available online at www.focalsweep.com.
Keywords Focal Sweep · Focal Stack · Depth Recovery · Refocusing · Extended Depth of Field
Changyin Zhou, Department of Computer Science, Columbia University. E-mail: [email protected]
Daniel Miau, Department of Computer Science, Columbia University. E-mail: [email protected]
Shree K. Nayar, Department of Computer Science, Columbia University. E-mail: [email protected]
* Changyin Zhou is a software engineer at Google at the time of submission.
1 Introduction
Depth of field (DOF) is the range of scene depths that appear focused in an image. A conventional lens camera has a limited DOF due to a fundamental trade-off between DOF and signal-to-noise ratio (SNR). One can increase DOF by stopping down the lens aperture, but a small aperture limits the number of photons captured in a given time and therefore reduces SNR. Focal sweep is one of the extended depth of field (EDOF) techniques that overcome this limitation. In an EDOF focal sweep imaging system, focus is swept over a large depth range as an image is captured to produce a single all-in-focus image (Hausler, 1972; Nagahara et al., 2008). In this paper, we make use of focal sweep for a new application: space-time image refocusing.
We capture a sequence of images during the focal sweep. Concatenating all the captured images along the temporal dimension forms a 3-dimensional space-time volume. Figure 1 (a) shows an XYT 3D space-time focus volume of a synthetic scene of colored balls in motion. To reveal the 3D structure, Figure 1 (b) shows one 2D XT slice of the 3D volume, in which balls appear as double-cones. The apex of a double-cone marks the time when a ball is focused. This phenomenon has been observed and well studied in deconvolution microscopy (McNally et al., 1999). Unlike the traditional static focus volume, the space-time focus volume captures object motion as well. For objects in motion in both the x and y directions, these double-cones appear tilted, like the cyan cone shown in Figure 1 (b). By finding the apexes in the 3D volume (shown as the yellow band in Figure 1 (b)), we can obtain a space-time in-focus image, and the shape of the band reveals the 3D structure of the scene.
In practice, one can only capture a limited number of images due to constraints on time and camera capacity. The photographer must usually decide how many images to capture and where the sensor should be positioned for each capture so that every object in the depth range appears focused in at least one of the captured images. We show a capturing strategy that yields efficient and complete focus sampling of 3D focus volumes.
Fig. 1 Space-time focus volume. (a) A space-time focus volume of a synthetic scene of colored balls in motion. Objects move as the focus changes with time along the T dimension. (b) A 2D XT slice of the 3D volume, in which each small ball appears as a double-cone. The double-cones of moving balls are tilted. (c) Integrating the volume along the T dimension produces an EDOF image as captured by a typical focal sweep technique. Each object appears sharp in the EDOF image regardless of its depth.
Our major challenge in designing the algorithm was to compute an in-focus index map (or depth map) that enables a seamless refocusing experience. First, since users expect each tap to bring them to the image layer where the region boundary is well focused, even in texture-less regions, there cannot be any holes in the index map. Second, the index precision must be high enough (especially for textured regions or region boundaries) that the refocused layer appears perfectly focused. As we will show in Section 6, our algorithm design meets both of these challenges.
Most existing refocusing techniques (e.g., plenoptic cameras (Lippmann, 1908; Ng et al., 2005), camera arrays (Wilburn et al., 2005), and depth from defocus (Levin et al., 2007)) capture an image at a single moment. The proposed focal sweep camera differentiates itself from these techniques by capturing a space-time focal stack over a longer period, during which objects can be moving. While capturing an instant of time can be favorable in some situations, capturing a duration of time yields a unique and appealing user experience: the focus transition becomes naturally intertwined with object motion. In addition, our image refocusing is done at sensor resolution without any trade-off in image quality.
We implemented two prototype focal sweep cameras: one using a voice coil actuator to drive the sensor sweep and one using a linear actuator to drive the lens sweep. Both are able to synchronize the focal sweep with image capture so as to acquire focal stacks in a complete and efficient manner. A collection of focal sweep photographs has been captured using our prototypes, and users can refocus these images interactively at www.focalsweep.com.
2 Related Work

2.1 Focal Sweep and Focal Stack
A conventional lens camera has a finite depth of field. A variety of EDOF techniques have been proposed to extend depth of field over the past several decades (Castro and Ojeda-Castaneda, 2004; Cossairt et al., 2010; Dowski and Cathey, 1995; George and Chi, 2003; Guichard et al., 2009; Indebetouw and Bai, 1984; Mouroulis, 2008; Nagahara et al., 2008; Poon and Motamedi, 1987). Focal sweep is one such technique. A focal sweep EDOF camera captures a single image while its focus is quickly swept over a large range of depths. Hausler (1972) extended the DOF of microscopy by sweeping the specimen along the optical axis during exposure. Nagahara et al. (2008) extended DOF for consumer photography by sweeping the image sensor. They also show that the point-spread-function (PSF) of a focal sweep camera is depth invariant, allowing one to deconvolve a captured image with a single PSF to recover a sharp image without knowing the 3D structure of the scene. We build an imaging system similar to that of Nagahara et al. (2008), but use it to capture image stacks for image refocusing.
Several techniques have been proposed to capture a stack of images for extended depth of field and 3D reconstruction. In deconvolution microscopy, for example, a stack of images of a specimen is captured at different foci to form a 3D image (McNally et al., 1999; Sibarita, 2005). The 3D point-spread-functions in these 3D images are shown to be depth-invariant double-cones. By deconvolving with the 3D PSF, a sharp 3D image can be recovered. We observe a similar double-cone structure in image stacks captured using our own imaging system. Kuthirummal et al. (2011) integrate focal stacks into an EDOF image and recover all-in-focus images by deconvolution, as in focal sweep EDOF. Guichard et al. (2009) and Cossairt and Nayar (2010) make use of chromatic aberration to capture images at different foci in the three color channels with a single shot, and then combine the sharpness from all color channels to produce an all-in-focus image. Agarwala et al. (2004) propose using a global maximum contrast image objective to merge a focal stack into a single all-in-focus image.
Hasinoff et al. (2009) compare the optimality of various capture strategies for reducing optical blur in a comprehensive framework where both a sensor noise model and deblurring error are taken into account. Their analysis shows that focal stack photography has two performance advantages over one-shot photography in extending depth of field: 1) it allows one to capture a given DOF faster; 2) it achieves higher SNR in a given exposure time.
Hasinoff and Kutulakos (2009) consider the problem of minimizing the capture time for a scene with a given DOF and a given exposure level. This is closely related to the optimization problem in our technique, and similar analyses of camera DOF can be seen in both papers. While Hasinoff and Kutulakos (2009) emphasize the lens f-number and the number of images for focal stack capture, we optimize the speed of the focal sweep in synchronization with image exposure for a given lens f-number.
Further analysis by Kutulakos and Hasinoff (2009) shows that focal stack photography can not only capture a given DOF with reduced exposure time, but also achieve higher SNR with a restricted exposure time. Kutulakos and Hasinoff (2009) use an algorithm similar to that of Agarwala et al. (2004) to synthesize EDOF images from focal stacks, assuming that scenes are static. While their algorithms are optimized to produce artifact-free images with minimal blur, our algorithm is designed to yield a seamless space-time refocusing experience.
2.2 Light Field Camera and Image Refocusing
The concept of the light field has a long history. In the early 20th century, Lippmann (1908) and Ives (1930) proposed plenoptic camera designs to capture light fields. The idea of the light field resurfaced in the computer vision and graphics community in the late 1990s, when Levoy and Hanrahan (1996) and Gortler et al. (1996) described the representation of the 4D light field and showed how new views can be rendered using light field data. A stack of images with different foci can also be rendered from a light field and then used for image refocusing.
A number of light field cameras have been designed and built in recent years. Levoy et al. (2006) use a plenoptic camera to capture the light field of specimens and propose algorithms to compute a focal stack from a single light field image, which can be processed as in deconvolution microscopy to produce a sharp 3D volume. Ng et al. (2005) and Ng (2006) use the same plenoptic camera design and emphasize its application to image refocusing. Georgeiv et al. (2006) and Georgiev and Intwala (2006) show a number of variants of light field camera designs with different trade-offs between spatial and angular resolution. Light field cameras can also be built using camera arrays (Wilburn et al., 2005).
2.3 Depth from Focus or Defocus
Two other techniques closely related to ours are depth from focus and depth from defocus. Instead of capturing a complete focal stack, depth from focus or defocus captures only a few images with varying focus or aperture patterns and uses focus as a cue to compute depth maps of scenes (Chaudhuri and Rajagopalan, 1999; Nayar et al., 1996; Pentland, 1987; Rajagopalan and Chaudhuri, 1997; Zhou et al., 2009).
Focus measurement is a critical component of all depth from focus approaches, but several difficulties are associated with it. The scale-space effect (Perona and Malik, 1990) presents a major difficulty, since it can lead to depth ambiguity at different scales. In addition, image patches at depth discontinuities may cross multiple depth layers and make focus measures inaccurate. Furthermore, since a focus measure does not reveal anything about depth in non-textured regions, techniques such as plane fitting (Tao et al., 2001), graph-cut (Boykov and Kolmogorov, 2004), and belief propagation (Yedidia et al., 2001) have been employed to fill the resulting holes in the depth map.
3 Space-time Focus Volume, Focus Sampling, and Refocusing
Concatenating all the images captured during the focal sweep along the temporal dimension forms a 3D space-time volume that encodes more visual information about a scene than a single image. Figure 1 (a) shows a space-time focus volume of a synthetic scene of colored balls in motion, and (b) shows one 2D XT slice of the 3D volume, in which balls appear as double-cones. As mentioned in the introduction, the double-cones appear tilted for objects moving in the x and y directions; by finding the apexes in the 3D volume (shown as a yellow band in Figure 1 (b)), we can obtain a space-time in-focus image; and the shape of the band reveals the 3D structure of the scene. Motion along the optical axis is usually negligible in practice, since (according to the Thin Lens Law) a small sensor motion corresponds to a huge focus sweep on the object side.
3.1 Space-time Focal Stack and Focus Sampling
Within a given time budget, one can only capture a finite number of images during the focal sweep due to frame-rate limits and SNR considerations. The photographer must decide how many images to capture and where the sensor should be positioned for each capture so that every object in the depth range appears focused in at least one of the captured images.
Fig. 2 Efficient and complete focus sampling. (a) Left: a geometric illustration of depth of field; objects in the range $[Z_1, Z_2]$ appear focused when $v$ and $z$ satisfy the Thin Lens Law. Right: the Thin Lens Law shown as an orange line in the reciprocal domain; $\bar{Z}_1$ and $\bar{Z}_2$ can be easily located in the reciprocal domain (i.e., in diopters) by $|\bar{Z}_i - \bar{Z}| = \bar{u} \cdot c/N$. (b) For an efficient and complete focus sampling, the DOFs of consecutive sensor positions (e.g., $v_{i-1}$, $v_i$, $v_{i+1}$) must have no gap or overlap.
This is, in essence, a sampling problem over the 3D focus volume, and we argue that an ideal capture should satisfy two conditions:
– Completeness: the DOFs of all captured images should sum up to cover the entire desired depth range. If the desired depth range is $\mathrm{DOF}^*$, we have

  $\mathrm{DOF}_1 \cup \mathrm{DOF}_2 \cup \cdots \cup \mathrm{DOF}_n \supset \mathrm{DOF}^*$   (1)

– Efficiency: all DOFs should have no overlap, so that only a minimal number of images is required:

  $\mathrm{DOF}_1 \cap \mathrm{DOF}_2 \cap \cdots \cap \mathrm{DOF}_n = \emptyset$   (2)
Hasinoff and Kutulakos (2009) refer to an image sequence as a sequence with sequential DOFs if the end-point of one image's DOF is the start-point of the next image's DOF. The ideal capture sequence described above is a sequence with sequential DOFs that covers the entire desired depth range.
We start our DOF analysis from the Thin Lens Law, $1/f = 1/u + 1/z$, where $f$ is the focal length of the lens, $u$ is the sensor-lens distance, and $z$ is the object distance. As is common practice, we transform the equation to the reciprocal domain, where the Thin Lens Law becomes a linear equation: $\bar{f} = \bar{u} + \bar{z}$, with $\bar{x} = 1/x$. The reciprocal of the object distance, $\bar{z}$, is often referred to as the diopter. The depth of field, $[z_1, z_2]$, is the depth range where the blur radius is less than the circle of confusion, $c$. As is common in digital photography, the pixel size is used as the circle of confusion in this paper. The DOF in the reciprocal domain, $\bar{z}_1$ and $\bar{z}_2$, can be derived as:

$$\bar{z}_1 = \bar{z} + \bar{u} \cdot c/A, \qquad \bar{z}_2 = \bar{z} - \bar{u} \cdot c/A, \qquad (3)$$

where $A$ is the aperture diameter of the lens. Both the position and the range of the DOF change with the sensor position. Figure 2 (a) shows the geometry of the DOF for a sensor position $u$ on the left and illustrates the DOF in the reciprocal domain on the right. The yellow line in the figure represents the Thin Lens Law. For an arbitrary sensor position $u$, the extent of its DOF in the reciprocal domain is $\Delta = 2 \cdot \bar{u} \cdot c/A$ according to Equation 3.

For an efficient and complete focus sampling, we require that consecutive DOFs have no overlap and no gap, as shown in Figure 2 (b). From Equation 3, we derive:

$$|\bar{u}_i - \bar{u}_{i+1}| = (\bar{u}_i + \bar{u}_{i+1}) \cdot c/A. \qquad (4)$$

In consumer photography we have $z \gg u$ and therefore $u_i \approx f$. Approximating Equation 4 then gives:

$$\delta u \approx 2 \cdot c \cdot N, \qquad (5)$$

where $N = f/A$ is the f-number of the lens. Equation 5 shows that to achieve an efficient and complete focus sampling, we only need to move the sensor in constant steps, with a step size determined by the pixel size and the f-number. Notice that this is a constant step in the normal domain; in the reciprocal domain, the step is not constant, as shown in Figure 2 (b). In particular, for a fixed frame rate $P$, the sensor should be swept at a constant speed:

$$s = \frac{\delta u}{\delta t} = 2 \cdot c \cdot N \cdot P. \qquad (6)$$
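To make the capture planning concrete, the following sketch evaluates Equations 3–6 for an idealized thin lens. The helper name and defaults are ours, not from the paper's implementation; we plug in the pixel size c = 3.3 µm quoted in Section 4.2.

```python
import numpy as np

def focal_sweep_plan(f, N, c, P, z_min, z_max=np.inf):
    """Plan an efficient and complete focus sweep (Eqs. 3-6).

    f : focal length (m);  N : f-number;  c : circle of confusion,
    taken to be the pixel size (m);  P : frame rate (fps);
    [z_min, z_max] : desired depth range (m).
    """
    # Sensor positions focusing the two ends of the depth range
    # (Thin Lens Law: 1/f = 1/u + 1/z  =>  u = 1/(1/f - 1/z)).
    u_near = 1.0 / (1.0 / f - 1.0 / z_min)
    u_far = 1.0 / (1.0 / f - 1.0 / z_max)         # equals f when z_max = inf

    du = 2.0 * c * N                              # constant sensor step (Eq. 5)
    s = du * P                                    # constant sweep speed (Eq. 6)
    k = int(np.ceil((u_near - u_far) / du)) + 1   # number of images
    T = k / P                                     # overall capture time (s)
    return du, s, k, T

# Typical setting from Section 4.2: 9 mm F/1.4 lens, 120 fps, scene from
# 0.4 m to infinity. This yields roughly a 1.1 mm/s sweep and about
# 24 images in about 0.2 s, close to the numbers reported in Section 4.2.
du, s, k, T = focal_sweep_plan(f=9e-3, N=1.4, c=3.3e-6, P=120, z_min=0.4)
```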
In the case where the overall time budget is too short (or the sensor moves too slowly) to perform a complete focus sampling, deblurring must be used to recover the sharpness of objects in the DOF gaps. Hasinoff et al. (2009) propose a comprehensive framework for optimizing focus sampling in this case by taking the noise model, capture overhead, and the effect of deblurring into account. In this paper, we concentrate on the problem of how to sample focus in an efficient and complete manner for space-time image refocusing. By avoiding aggressive deblurring in the process, we save a large amount of computation and produce higher-quality refocusing results that are more natural and artifact-free.
3.2 Space-time In-focus Index Map and Refocusing
In a system where the speed of the focal sweep is much faster than the speed of an object's motion along the optical axis (z), the object appears in focus only once in the focus volume (or, equivalently, in only one image of the focal stack). Let $F_1, F_2, \ldots, F_k$ be the focal stack of $k$ images. For each pixel, we find the index of the frame where the pixel is best focused. The indices of all the best-focused pixels comprise a space-time in-focus index map.
In a dynamic scene, both the focus and the objects themselves are free to move, which leads to ambiguities in the definition of the space-time in-focus index map. Hence, there are different ways of defining it:

1. For each pixel $(x, y)$, look into a small tube at $(x, y)$ in the 3D XYT focal stack and find which layer appears the sharpest. If $N(x, y)$ is the sharpest layer (or time), then $N(x, y)$ is the in-focus index map for the focal sweep.
2. Explicitly consider object motion. At an arbitrary layer (or time) $t$, for any object at a spatial location $(x, y)$, we could track this object and find the layer (or time) $t'$ and spatial location $(x', y')$ at which it is best focused. In this case, the in-focus index map is $t' = N_1(x, y, t)$, which is a 3D index map.
3. Track all objects at location $(x, y)$ in the space-time focal stack to the layers where they are best focused, and then, among all the final layers, pick the layer closest to the present layer. In this case, the in-focus index map is $t' = N_2(x, y, t) = \arg\min_t |N_1(x, y, t) - t_0|$.
In this paper, we choose the first definition for simplicity. A refocusing viewer takes a user click $(x, y)$ as input, finds the target frame $t = N(x, y)$, and smoothly transitions the displayed image from the current frame to that frame. The other two definitions can yield different refocusing behaviors and user experiences. Ideally, the choice should be made based on the user's intention or preference when they click on a pixel in the viewer. We leave the study of the other two (or more) definitions and their impact on user experience to future work.
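As an illustration of the first definition, a naive per-pixel index map can be computed by maximizing a local sharpness measure along each temporal tube. This is only a baseline sketch with names of our choosing; the robust algorithm we actually use is described in Section 5.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def naive_index_map(stack, d=7):
    """Definition 1: per-pixel argmax of local sharpness over time.

    stack : (k, H, W) array, the focal stack in grayscale
    d     : side of the local window (the "tube" cross-section)
    """
    stack = np.asarray(stack, dtype=float)
    sharpness = np.empty_like(stack)
    for i, frame in enumerate(stack):
        lap = laplace(frame)                        # high-frequency response
        # Local variance of the Laplacian over a d x d window:
        sharpness[i] = uniform_filter(lap**2, size=d) \
                       - uniform_filter(lap, size=d)**2
    return np.argmax(sharpness, axis=0)             # N(x, y)
```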
With this definition, a space-time in-focus index map $N(x, y)$ is a mapping from a spatial location $(x, y)$ to a temporal point $t = N(x, y)$. Since the focus is swept over a depth range over time, each index is also related to a depth, although it might correspond to that depth at a different temporal point. Because of this correlation between depth and the focal sweep, users will observe objects moving as the refocus changes.
4 Focal Sweep Camera

4.1 Prototypes
Focal sweep can be implemented in multiple ways. One way is to directly sweep the image sensor, as we do in our analysis. A variety of actuators, such as voice coil motors, piezoelectric motors, ultrasonic transducers, and DC motors, can be used to translate the sensor in a prescribed manner during the capture duration. Another way is to sweep the camera lens; the auto-focus mechanism commonly built into many commercial lenses may be programmed to perform focal sweep photography. Liquid lenses (Ren and Wu, 2007; Ren et al., 2006), which focus at different distances when different voltages are applied, are yet another power-efficient way of performing focal sweep.
We built two prototype focal sweep cameras, as shown in Figure 3. Prototype 1, shown in Figure 3 (a), uses a Fujinon HF9HA-1B 9mm F/1.4 C-mount lens and a Point Grey Flea 3 camera with a maximum resolution of 1328 × 1048, whose sensor is driven by a voice coil actuator (BEI LA15-16-024). This setup is similar to the one used in (Nagahara et al., 2008) for extended depth of field. The sensor is tethered to a laptop via a USB 3.0 cable and synced with the motor start/stop signal. The voice coil motor and the motor controller are able to translate the sensor at a speed of 1.47 mm/s. The major advantage of this implementation is that all of the parts are off-the-shelf components. This first prototype demonstrates that a focal sweep camera can be built with minimal effort. A collection of focal sweep photographs captured by this prototype is shown on the website www.focalsweep.com.
Prototype 2, shown in Figure 3 (b), is a more compact design, in which the sensor is secured to a fixed structure and the lens is translated during the sensor's integration time. In this prototype we use a compact linear actuator instead of a voice coil motor, allowing us to reduce the camera's overall size. The same lens and camera as in Prototype 1 are used. During the integration time, the lens is translated from the near focus position to the far focus position; with this prototype, we are able to translate it at a top speed of 0.9 mm/s. The major advantages of this implementation are its compactness and its close resemblance to existing camera architectures.
4.2 Camera Settings
As in conventional photography, users first determine the frame rate $P$ and f-number $N$ for each scene according to the speed of object motion, the lighting conditions, and the desired amount of defocus in the captured images. Then, the ideal sensor sweep speed $s$ can be computed using Equation 6. This ideal focal sweep speed is independent of the camera focus and the distance range of the scene, which makes configuration friendly to users.

Although the sweep speed is independent of camera focus and scene depth range, the number of images, $k$, and the overall time to capture the image stack, $T$, depend strongly on both.
Fig. 3 Two focal sweep camera prototypes. (a) Prototype 1 drives the sensor sweep using a voice coil; (b) Prototype 2 drives the lens sweep using a linear actuator.
Fig. 4 For a given pixel size, frame rate, and f-number, the overall capture time and total image count are highly related to the focal length and the scene distance range. (a) The overall capture time $T$ versus focal length $f$; (b) the $f$–$k$ plot; (c) and (d) the plots of $T$ and $k$ with respect to the depth range (in diopters) for a camera with a 9mm lens. In each plot, the red spot indicates the most typical setting in our implementation. (Plot parameters: $N = 1.4$, $c = 2.2\,\mu\mathrm{m}$, $P = 120$ fps.)
Figure 4 (a) shows how the overall capture time $T$ varies with the camera focal length $f$ for a camera with $N = 1.4$, $c = 3.3\,\mu\mathrm{m}$, and $P = 120$ fps; $T$ increases in proportion to the square of $f$. Figure 4 (b) plots the total number of captured images, $k$, with respect to the focal length $f$ for the same camera. Again, $k$ is proportional to $f^2$.

In the most common scenarios, the desired scene ranges from a certain distance, $Z_{\min}$, to infinity. Figure 4 (c) and (d) plot the overall capture time $T$ and total image count $k$ with respect to the inverse of $Z_{\min}$ (in diopters). We can see that both are linear: the closer the foreground is to the camera, the longer the capture time. The x-axis is labeled in both diopters and distance for easy reference.
It can be noted from the figure that the required capture time and image count vary over a huge range across settings. The red dot in each plot indicates the typical setting in our implementation: we use a 9mm lens and our scenes' depths range from 0.4 m to infinity, so it takes us 0.2 s to capture 20 images. Our prototypes are also able to capture scenes with a smaller $Z_{\min}$, but the sweep then takes considerably longer to capture, as shown in Figure 4 (c). Figure 5 shows a sample space-time focal stack captured using our first prototype camera.
Fig. 6 An overall diagram illustrating the pipeline: (1) capture a space-time focal stack; (2) image stabilization; (3) compute the space-time in-focus image; (4) build pyramids of the image stack and the in-focus image; (5) compute space-time index maps at multiple scales; (6) merge the multi-scale maps into a reliable sparse map; (7) image over-segmentation; (8) index interpolation; (9) user interactive refocusing.
5 Algorithm
Figure 6 shows an overview of the proposed algorithm. After a stack of images, $\{F_i^0\}$, is captured, we first apply a typical multi-scale optical flow algorithm to estimate frame-to-frame global transformations due to hand shake, and then stabilize the image stack.
Fig. 5 A sample space-time focal stack captured using focal sweep camera Prototype 1. (a) A space-time focal stack of 25 images; (b) a 2D slice of the 3D stack; (c) the first frame of the stack, where the foreground is in focus; (d) the last frame of the stack, where the background is in focus. The capture frame rate is 120 fps. It took the focal sweep camera about 0.2 s to capture the whole sequence, allowing us to capture the motions of the persons.
The stabilized image stack $\{F_i\}$ is then used to compute a space-time in-focus image (Section 5.1) and index maps at various scales (Section 5.2). In Section 5.3, we describe a new approach to merging the multi-scale index maps into one high-quality index map.
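The paper describes the stabilization step only at a high level; a minimal sketch in the same spirit (sparse pyramidal optical flow plus a robust global similarity transform, using standard OpenCV calls) might look as follows. The function name and parameter values are ours, not the authors'.

```python
import cv2
import numpy as np

def stabilize_stack(frames):
    """Undo hand-shake by composing frame-to-frame global transforms.

    frames : list of HxWx3 uint8 images, the raw focal stack {F_i^0}.
    Returns the stabilized stack {F_i}, all aligned to frame 0.
    """
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    h, w = prev.shape
    out = [frames[0]]
    M_acc = np.eye(2, 3)                  # accumulated warp to frame 0
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Track corners from the previous frame (pyramidal Lucas-Kanade
        # is a standard multi-scale optical flow).
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=400,
                                      qualityLevel=0.01, minDistance=8)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        good = status.ravel() == 1
        # Robust global similarity transform: current -> previous frame.
        M, _ = cv2.estimateAffinePartial2D(nxt[good], pts[good],
                                           method=cv2.RANSAC)
        # Compose so that the current frame maps all the way to frame 0.
        M_acc = (np.vstack([M_acc, [0, 0, 1]]) @
                 np.vstack([M, [0, 0, 1]]))[:2]
        out.append(cv2.warpAffine(frame, M_acc, (w, h)))
        prev = gray
    return out
```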
There are two key ideas in the algorithm. First, we use a pyramid strategy to handle non-textured regions. For each pixel, we estimate its index (the frame in which it is best focused) at multiple scales. Due to the scale-space effect (Perona and Malik, 1990), the index may not be consistent across scales, especially in regions with weak texture, depth discontinuities, or object motion. This inconsistency is one of the fundamental difficulties in the algorithm's design, and we show a simple yet effective solution.

Second, at any given scale, we propose a novel approach to computing the index map. In the literature, it is common to first estimate an index map (or depth map) using a focus measure, and then use the index map to produce an all-in-focus image (Agarwala et al., 2004; Hasinoff and Kutulakos, 2009). In this paper, however, we take a different strategy: we first compute an all-in-focus image without knowing the index map, and then use the all-in-focus image to estimate the index map. We will show the advantages of this new strategy.
5.1 Space-time In-focus Image
Given a focal stack, we first compute a space-time in-focus image without the index map. The idea is inspired by the focal sweep EDOF technique. Kuthirummal et al. (2011) show that the mean of a focal stack preserves image details, and deconvolve the averaged image with a $(1/x)$-shaped integral point-spread-function (IPSF) to recover an all-in-focus EDOF image without knowing the depth map. This approach is further shown to be robust in regions with depth edges, occlusions, and even object motion. In Figure 7, we show the mean image of a space-time focal stack (a) and the EDOF image after deconvolution (b).

Although both (a) and (b) preserve most high-frequency information, the average image has low contrast (especially as the number of images increases), and the deconvolved EDOF image (b) is prone to image artifacts. In addition, deconvolution is computationally expensive, especially on mobile devices. In this paper, we instead compute a space-time in-focus image as a weighted sum of all images:

$$I(x) = \frac{\sum_i W_i(x) \cdot F_i(x)}{\sum_i W_i(x) + \epsilon}, \qquad (7)$$

where the weight $W_i(x)$ is defined as the variance of the Laplacian of the patch $P_i(x, d)$ of size $d$ centered at $x$ in the $i$th frame:

$$W_i(x) = \mathbb{V}(\Delta P_i(x, d)). \qquad (8)$$

With this strategy, severely blurred patches carry much less weight than sharper patches do, reducing the hazy effect seen in the average image in Figure 7 (a). As shown in (c), the weighted sum is sharp and has high contrast even without deconvolution. For comparison, Figure 7 also shows an EDOF image computed by a typical strategy (Agarwala et al., 2004; Kutulakos and Hasinoff, 2009).
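A compact sketch of Equations 7 and 8 in NumPy/SciPy terms, assuming grayscale frames; the local variance of the Laplacian is computed with box filters, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def in_focus_image(stack, d=7, eps=1e-8):
    """Weighted average of a focal stack (Eqs. 7 and 8).

    stack : (k, H, W) array;  d : patch size;
    eps   : the small constant in the denominator of Eq. 7.
    """
    stack = np.asarray(stack, dtype=float)
    num = np.zeros(stack.shape[1:])
    den = np.zeros(stack.shape[1:])
    for frame in stack:
        lap = laplace(frame)
        # W_i(x): local variance of the Laplacian patch (Eq. 8)
        w = uniform_filter(lap**2, size=d) - uniform_filter(lap, size=d)**2
        num += w * frame
        den += w
    return num / (den + eps)                      # I(x), Eq. 7
```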
5.2 Space-time In-focus Index Maps at Various Scales
For each pixel $(x, y)$, we look for the frame whose surrounding patch is most similar in its high frequencies to that of the computed in-focus image $I(x, y)$. The in-focus index map $M(x)$ is then estimated as:

$$M(x, y) = \arg\min_i S(F_i(x, y), I(x, y)), \qquad (9)$$

where $S$ measures the high-frequency dissimilarity between $F_i$ and $I$ at each pixel and is defined as

$$S(P, Q) = |\Delta(P - Q)| \otimes u(r), \qquad (10)$$

where $P$ and $Q$ denote the patches in $F_i$ and $I$, respectively, $\otimes$ is convolution, and $u(r)$ is a pillbox function of radius $r$.

We compute an index map at every level of the focal stack pyramid, obtaining $M^{(i)}, i = 1, 2, \ldots, k$, where $k$ is the total number of pyramid levels. At each level, the focal stack has half the spatial resolution (2 × 2 downsampling) of the level above. We use $d = 7$, $k = 7$, $r = 5$ in our implementation. Figure 8 (b) shows the index map pyramid.
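A single-scale sketch of Equations 9 and 10, to be run at each pyramid level; a box filter stands in for the pillbox support here, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def index_map(stack, in_focus, r=5):
    """Single-scale space-time in-focus index map (Eqs. 9 and 10).

    stack    : (k, H, W) array (one pyramid level of the focal stack)
    in_focus : (H, W) in-focus image I at the same level
    r        : radius of the pillbox support; a (2r+1)-wide box filter
               stands in for the pillbox u(r) in this sketch.
    """
    stack = np.asarray(stack, dtype=float)
    cost = np.empty_like(stack)
    for i, frame in enumerate(stack):
        resid = np.abs(laplace(frame - in_focus))   # |Laplacian(P - Q)|
        cost[i] = uniform_filter(resid, size=2 * r + 1)
    return np.argmin(cost, axis=0)                  # M(x, y), Eq. 9
```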
Fig. 7 Space-time in-focus images computed using different approaches, with close-ups. (a) The mean of all images in the stack; (b) the mean image deconvolved using an integral PSF; (c) the weighted average of all images in the stack; (d) patches from the best-focused frames.
Fig. 8 (a) A pyramid of space-time in-focus images; (b) a pyramid of space-time index maps; (c) a reliable index map computed from (b) using index consistency; (d) an over-segmentation of the full-resolution in-focus image; (e) our final index map computed from (c) and (d) by hole-filling; (f) an index map computed using a traditional algorithm, which uses difference-of-Gaussians as the focus measure (Agarwala et al., 2004; Hasinoff and Kutulakos, 2009) and Graph-cut for global optimization (Boykov and Kolmogorov, 2004).
5.3 Merging and Interpolating Index Maps
Due to scale-space effects and depth discontinuities, the index maps computed at different scales can differ significantly, as can be seen in Figure 8 (b). We propose constructing a reliable but sparse index map by accepting only indices that are consistent across all levels:

$$D_0(x) = \begin{cases} M_i(x), & \text{if } \max_i[M_i(x)] - \min_i[M_i(x)] < \tau \\ \emptyset, & \text{otherwise,} \end{cases} \qquad (11)$$
where $\tau$ is set to a small number to enforce consistency. One example is shown in Figure 8 (c); the pixels with no index assigned are shown in black. We observe that the index map is dense in textured regions, and sparse in non-textured regions and at depth boundaries. We then over-segment the in-focus image and interpolate the indices within each segment according to two simple rules (a sketch of this step follows below):

– If the segment has at least $m$ valid indices, interpolate by plane fitting.
– If the number of valid indices is less than $m$, use nearest-neighbor interpolation.
Figure 8 (d) shows a segmentation result obtained using Graph-cut, and Figure 8 (e) shows the index map after interpolation. We can see that the index map is sharp at depth boundaries and smooth in non-textured regions.
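The following sketch combines the consistency test (Equation 11) with the two interpolation rules, given per-scale index maps upsampled to full resolution and a precomputed segment label map; all names and the hole-marking convention are ours.

```python
import numpy as np

def merge_and_interpolate(maps, segments, tau=1, m=10):
    """Merge multi-scale index maps (Eq. 11), then fill holes per segment.

    maps     : (L, H, W) int array of per-scale index maps, resampled
               to full resolution
    segments : (H, W) array of over-segmentation labels
    tau      : consistency threshold;  m : min indices for plane fitting
    """
    maps = np.asarray(maps)
    consistent = maps.max(axis=0) - maps.min(axis=0) < tau
    D = np.where(consistent, maps[0], -1)           # -1 marks a hole
    for label in np.unique(segments):
        mask = segments == label
        holes = mask & (D < 0)
        if not holes.any():
            continue
        hy, hx = np.nonzero(holes)
        vy, vx = np.nonzero(mask & (D >= 0))
        if len(vy) >= m:
            # Rule 1: fit a plane t = a*x + b*y + c to the valid indices.
            A = np.column_stack([vx, vy, np.ones_like(vx)])
            coef, *_ = np.linalg.lstsq(A, D[vy, vx], rcond=None)
            D[hy, hx] = np.round(coef[0]*hx + coef[1]*hy + coef[2]).astype(int)
        elif len(vy) > 0:
            # Rule 2: nearest-neighbor interpolation inside the segment.
            for y, x in zip(hy, hx):
                j = np.argmin((vy - y)**2 + (vx - x)**2)
                D[y, x] = D[vy[j], vx[j]]
    return D
```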
With the estimated index map $M(x, y)$, we can perform image refocusing. In the refocusing viewer, for any pixel $(x, y)$ that a user taps, we transition the displayed image from the present image to the image indexed by $M(x, y)$. The transition is made smooth by sequentially displaying the images between the present index and $M(x, y)$. We have made our refocusing viewer available online at www.focalsweep.com.
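The viewer logic reduces to a few lines; this hypothetical sketch steps through the intermediate frames to animate the transition, with the display callback left to the host GUI.

```python
import time

def refocus_to(stack, index_map, current, x, y, display, fps=30):
    """Animate the transition from the current frame to the frame in
    which the tapped pixel (x, y) is best focused.

    stack : list of frames;  index_map : M(x, y) from the algorithm above;
    display : any callable that shows one frame (e.g., a GUI blit).
    """
    target = int(index_map[y][x])
    step = 1 if target >= current else -1
    for i in range(current, target + step, step):
        display(stack[i])                 # play intermediate frames
        time.sleep(1.0 / fps)             # pace the transition
    return target                         # the new current index
```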
6 Experiments
In all the experiments shown in this paper, we set $m = 10$, $\tau = 1$, $d = 7$, $r = 5$. We compare index maps computed using the proposed technique with those from a traditional depth estimation algorithm that maximizes a simple focus measure. There are various definitions of focus measures (Nayar and Nakagawa, 1990; Nayar et al., 1996; Subbarao and Tyan, 1998; Xiong and Shafer, 1993). We adopted the one used in the photomontage method (Agarwala et al., 2004), which defines the focus measure as simple local contrast according to a Difference-of-Gaussians filter, and we further polished the results using Graph-cut (Boykov and Kolmogorov, 2004). Graph-cut, as a global optimization technique, helps to fill the holes and smooth the index map, as shown in Figure 8 (f). Our result, shown in (e), is better in non-textured or specular regions and at depth discontinuities, which are important for image refocusing.
Figure 9 shows more space-time focal stacks that we captured using Prototype 1, together with the computed in-focus images and index maps. Since it is impossible to demonstrate interactive image refocusing in paper form, we only show the computed index maps and in-focus images here. In the index maps computed by the proposed algorithm, there are no obvious holes or artifacts, even in textureless regions, and the index maps are sharp at depth boundaries. To experience the interactive refocusing of these scenes, a refocusing viewer is available at www.focalsweep.com.

Fig. 9 More experimental results. Each row corresponds to one scene. From left to right: (a) and (b) are the first and last frames captured with the focal sweep, (c) are the computed space-time in-focus images, and (d) are the estimated space-time in-focus index maps. The resulting index maps are used for image refocusing, as demonstrated on our website www.focalsweep.com.
7 Discussion
In this paper, we presented a focal sweep imaging system that captures space-time focal stacks for image refocusing. The proposed camera sweeps focus at such a speed that the summed DOF of the captured focal stack efficiently covers the entire desired depth range. The major benefit of this design lies in the fact that the camera directly captures all the images required for image refocusing. By avoiding image synthesis and rendering (which are common in many other competing designs), this design provides users high-quality full-resolution images at every focus with minimal computational cost.
Unlike other image refocusing techniques, our technique performs space-time image refocusing, since each refocused image is captured at a different point in time. Users see objects move into (or out of) focus when they tap to refocus. The combination of refocusing and motion yields a unique user experience.
Due to object motion, a pixel that a user taps on might correspond to different objects at different focus layers (or time points). For example, a defocused object in motion often appears blended with the object behind it, and there are cases where it would be preferable to estimate the object's motion and refocus along the estimated motion trajectory. Resolving this ambiguity often requires a deeper understanding of user intentions. In this paper, however, we choose to stay simple in the face of the ambiguity: for each clicked pixel, we check the pixel at all focus layers and choose the sharpest one, without explicitly considering object motion. This simplicity not only improves the robustness of the refocusing algorithm, but also makes the refocusing results easier to predict, which helps users adapt to the particular refocusing behavior. There are other possible refocusing choices, as discussed in Section 3.2, which deal with the ambiguity in different manners; we leave them as future work.
We implemented two prototype focal sweep cameras, one using a voice coil to drive the sensor sweep and one using a linear actuator to drive the lens sweep. A more compact and promising implementation could make use of the auto-focus mechanism that commonly exists in many commercial lenses; to do so, one would need the capability to control and synchronize focus with image capture.
Acknowledgments

This research was funded in part by ONR Grant No. N00014-11-1-0285, ONR Grant No. N00014-08-1-0929, and DARPA Grant No. W911NF-10-1-0214.
References
Agarwala A, Dontcheva M, Agrawala M, Drucker S, Colburn A, Curless B, Salesin D, Cohen M (2004) Interactive digital photomontage. In: ACM Transactions on Graphics (TOG), ACM, vol 23, pp 294–302
Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9):1124–1137
Castro A, Ojeda-Castaneda J (2004) Asymmetric phase masks for extended depth of field. Applied Optics 43(17):3474–3479
Chaudhuri S, Rajagopalan A (1999) Depth from Defocus: A Real Aperture Imaging Approach. Springer Verlag
Cossairt O, Nayar S (2010) Spectral focal sweep: Extended depth of field from chromatic aberrations. In: International Conference on Computational Photography, pp 1–8
Cossairt O, Zhou C, Nayar S (2010) Diffusion coded photography for extended depth of field. In: SIGGRAPH, ACM, pp 1–10
Dowski E, Cathey W (1995) Extended depth of field through wave-front coding. Applied Optics 34(11):1859–1866
George N, Chi W (2003) Extended depth of field using a logarithmic asphere. Journal of Optics A: Pure and Applied Optics 5:S157
Georgeiv T, Zheng K, Curless B, Salesin D, Nayar S, Intwala C (2006) Spatio-angular resolution tradeoff in integral photography. In: Eurographics Symposium on Rendering
Georgiev T, Intwala C (2006) Light field camera design for integral view photography. Tech. rep., Adobe
Gortler S, Grzeszczuk R, Szeliski R, Cohen M (1996) The lumigraph. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, pp 43–54
Guichard F, Nguyen H, Tessieres R, Pyanet M, Tarchouna I, Cao F (2009) Extended depth-of-field using sharpness transport across color channels. Technical Paper, DXO Labs
Hasinoff S, Kutulakos K (2009) Light-efficient photography. IEEE Pattern Analysis and Machine Intelligence 1(1):1
Hasinoff S, Kutulakos K, Durand F, Freeman W (2009) Time-constrained photography. In: IEEE International Conference on Computer Vision, pp 333–340
Hausler G (1972) A method to increase the depth of focus by two step image processing. Optics Communications 6(1):38–42
Indebetouw G, Bai H (1984) Imaging with Fresnel zone pupil masks: extended depth of field. Applied Optics 23(23):4299–4302
Ives H (1930) Parallax panoramagrams made with a large diameter lens. Journal of the Optical Society of America A 20(6):332–340
Kuthirummal S, Nagahara H, Zhou C, Nayar S (2011) Flexible depth of field photography. PAMI 33(1):58–71
Kutulakos K, Hasinoff S (2009) Focal stack photography: High-performance photography with a conventional camera. In: Proc. 11th IAPR Conference on Machine Vision Applications, pp 332–337
Levin A, Fergus R, Durand F, Freeman W (2007) Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (TOG) 26(3):70
Levoy M, Hanrahan P (1996) Light field rendering. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, pp 31–42
Levoy M, Ng R, Adams A, Footer M, Horowitz M (2006) Light field microscopy. In: ACM Transactions on Graphics (TOG), ACM, vol 25, pp 924–934
Lippmann G (1908) La photographie integrale. Comptes-Rendus, Academie des Sciences (146):446–551
McNally J, Karpova T, Cooper J, Conchello J (1999) Three-dimensional imaging by deconvolution microscopy. Methods 19(3):373–385
Mouroulis P (2008) Depth of field extension with spherical optics. Optics Express 16(17):12995–13004
Nagahara H, Kuthirummal S, Zhou C, Nayar S (2008) Flexible depth of field photography. In: European Conference on Computer Vision
Nayar S, Nakagawa Y (1990) Shape from focus: An effective approach for rough surfaces. In: ICRA, IEEE, pp 218–225
Nayar S, Watanabe M, Noguchi M (1996) Real-time focus range sensor. PAMI 18(12):1186–1198
Ng R (2006) Digital light field photography. PhD thesis, Stanford University
Ng R, Levoy M, Bredif M, Duval G, Horowitz M, Hanrahan P (2005) Light field photography with a hand-held plenoptic camera. Stanford Computer Science Technical Report 2
Pentland A (1987) A new sense for depth of field. IEEE Pattern Analysis and Machine Intelligence (4):523–531
Perona P, Malik J (1990) Scale-space and edge detection using anisotropic diffusion. PAMI 12(7):629–639
Poon T, Motamedi M (1987) Optical/digital incoherent image processing for extended depth of field. Applied Optics 26(21):4612–4615
Rajagopalan A, Chaudhuri S (1997) Optimal selection of camera parameters for recovery of depth from defocused images. In: CVPR, IEEE, pp 219–224
Ren H, Wu S (2007) Variable-focus liquid lens. Optics Express 15(10):5931–5936
Ren H, Fox D, Anderson P, Wu B, Wu S (2006) Tunable-focus liquid lens controlled using a servo motor. Optics Express 14(18):8031–8036
Sibarita J (2005) Deconvolution microscopy. Microscopy Techniques, pp 1288–1291
Subbarao M, Tyan J (1998) Selecting the optimal focus measure for autofocusing and depth-from-focus. PAMI 20(8):864–870
Tao H, Sawhney H, Kumar R (2001) A global matching framework for stereo computation. In: ICCV, IEEE, vol 1, pp 532–539
Wilburn B, Joshi N, Vaish V, Talvala E, Antunez E, Barth A, Adams A, Horowitz M, Levoy M (2005) High performance imaging using large camera arrays. ACM Transactions on Graphics (TOG) 24(3):765–776
Xiong Y, Shafer S (1993) Depth from focusing and defocusing. In: CVPR, IEEE, pp 68–73
Yedidia J, Freeman W, Weiss Y (2001) Generalized belief propagation. Advances in Neural Information Processing Systems, pp 689–695
Zhou C, Lin S, Nayar S (2009) Coded aperture pairs for depth from defocus. In: IEEE International Conference on Computer Vision