

Optics & Laser Technology 33 (2001) 167–176
www.elsevier.com/locate/optlastec

A video-rate range sensor based on depth from defocus

Ovidiu Ghita ∗, Paul F. Whelan

Vision Systems Laboratory, School of Electronic Engineering, Dublin City University, Dublin 9, Ireland

Received 7 August 2000; accepted 30 January 2001

Abstract

Recovering the depth information derived from dynamic scenes implies real-time range estimation. This paper addresses the implementation of a bifocal range sensor which estimates the depth by measuring the relative blurring between two images captured with different focal settings. To recover the depth accurately even in cases when the scene is textureless, one possible solution is to project structured light onto the scene. As a consequence, a spatial frequency derived from the illumination pattern becomes evident in the scene's spectrum. The resulting algorithm involves only simple local operations, which makes it possible to compute the depth at a rate of 10 frames per second. The experimental results indicate that the accuracy of the proposed sensor compares well with that offered by other methods such as stereo and motion parallax, while avoiding the problems caused by occlusion and missing parts. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Range sensing; Image blurring model; Active illumination; Focus operator

1. Introduction

Depth information plays a key role in machine vision and has a strong relationship with the real world in robotic applications by allowing three-dimensional (3-D) scene interpretation. The 3-D information can be obtained using various techniques [1]. Among other approaches for 3-D estimation, depth from defocus (DFD) methods have recently attracted a great deal of interest. Initially developed by Krotkov [2] and Pentland [3], the DFD methods use the direct relationship between the depth, the camera parameters and the degree of blurring in several images (usually two, as in the case of the present implementation). In contrast with other techniques such as stereo or motion parallax, where solving the correspondence between different local features represents a difficult problem, DFD relies only on simple local operators. Also, the DFD techniques are not hampered by problems generated by occlusions or missing parts, since the images to be analysed are identical except for the depth of field.

Historically, the DFD techniques have evolved as a passive sensing strategy [4–6]. In this regard, Xiong and Shafer [7] propose a novel approach to determine dense and accurate depth structure using the maximal resemblance estimation.

∗ Corresponding author. Tel.: +353-1-704-5869; fax: +353-1-704-5508.

E-mail address: [email protected] (O. Ghita).

Subbarao and Surya [8] reformulate the problem as one of regularised deconvolution, where the depth is obtained by analysing the local information contained in a sequence of images acquired with different camera parameters. Later, Watanabe and Nayar [9] argued that the use of focus operators such as the Laplacian of Gaussian results in poor depth estimation. Thus, they developed a set of broadband rational operators to produce accurate and dense depth maps. However, if the scene under investigation has a weak texture or is textureless, the depth estimation achieved when passive DFD is employed is far from accurate. Fortunately, a solution to this problem is offered by active DFD, in which structured light is projected onto the scene. Consequently, a strong texture derived from the illumination pattern is forced onto the imaged surfaces and, as an immediate result, the spectrum will contain a dominant frequency. The use of active illumination was initially suggested by Pentland et al. [10], where the apparent blurring of a pattern generated by a slide projector is measured to obtain the depth information. Then, Nayar et al. [11] developed a symmetrical pattern organised as a rectangular grid which was optimised for a specific camera.

In this paper we describe the implementation of a bifocal sensor able to generate 256 × 256 depth maps at a rate of 10 frames per second. In the past, the performance of DFD range sensors was limited by the variation in image magnification between images acquired under different focal levels.



Fig. 1. The image formation process. The depth can be determined by measuring the level of blurring.

This forced researchers either to implement computationally intensive techniques such as image registration and warping [12] or to address this problem on an optical basis by using a telecentric lens [13]. Another difficult problem consists in choosing the illumination pattern. Nevertheless, an optimal pattern is difficult to manufacture, and in this paper we propose a simple solution to address the above-mentioned problems by employing image interpolation.

2. Depth from defocus

If the object to be imaged is placed in the focal plane, the image formed on the sensing element is sharp, since every point P from the object plane is refracted by the lens into a point p on the sensor plane. Alternatively, if the object is shifted from the focal plane, the points situated in the object plane are distributed over a patch on the sensing element. As a consequence, the image formed on the sensing element is blurred. From this observation, the distance from the sensor to each object point can be estimated by evaluating the degree of blurring, which is determined by the size of the patch formed on the sensing element.

This can be observed in Fig. 1 where the image formation process is illustrated. Thus the diameter of the patch (blur circle) d is of interest and can be easily determined by the use of similar triangles:

$$\frac{D}{2v} = \frac{d/2}{s - v} \;\Rightarrow\; d = Ds\left(\frac{1}{v} - \frac{1}{s}\right), \qquad (1)$$

where u is the object distance, v is the distance at which the focused image is formed, D is the aperture of the lens and s is the sensor distance. Since the parameter v can be expressed as a function of f and u, Eq. (1) becomes

$$\frac{1}{u} + \frac{1}{v} = \frac{1}{f} \;\Rightarrow\; d = Ds\left(\frac{1}{f} - \frac{1}{u} - \frac{1}{s}\right). \qquad (2)$$

Note that d can be positive or negative depending on whether the image plane is behind or in front of the focused image. Consequently, two object positions, one on either side of the focal plane, can produce the same level of blurring, and the resulting images are identical.

Fig. 2. Estimating the depth from two images.

To overcome this problem, we have to either constrain the sensor distance s to be always greater than the image distance v [3] or employ two images captured with different focal settings [5,6,9]. It is worth mentioning that for the former case the depth can be determined accurately and uniquely only for places in the image with known characteristics (e.g. sharp edges). The latter is not hampered by this restriction, and for the present approach a pair of images, i.e. the near and far focused images, separated by a known distance b, are employed to determine the blur circle.
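As an illustration of Eq. (2), the short Python sketch below computes the blur circle diameter from the lens parameters. It is not part of the original implementation; the numerical values in the example are arbitrary and only the formula itself comes from the paper.

```python
def blur_circle_diameter(u, s, f, D):
    """Blur circle diameter d from Eq. (2).

    u: object distance, s: sensor distance, f: focal length,
    D: lens aperture (all in the same length unit). The sign of d
    indicates whether the sensor lies behind or in front of the
    focused image.
    """
    return D * s * (1.0 / f - 1.0 / u - 1.0 / s)

# Example with arbitrary values (metres): a 60 mm lens at f/2.8,
# sensor plane 62 mm behind the lens, object at 0.86 m.
f = 0.060
D = f / 2.8
print(blur_circle_diameter(u=0.86, s=0.062, f=f, D=D))  # about -0.00083 m, i.e. -0.83 mm
```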

3. The blur model

The blurring effect can be seen as a convolution between the focused image and the blurring function. The blurring function is also referred to as the point spread function (PSF) and can be approximated by a 2-D Gaussian [3,8]:

$$h(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}, \qquad (3)$$

where σ is the standard deviation of the Gaussian. The standard deviation σ (also referred to as the spread parameter) is of interest because it indicates the local level of blurring contained in a defocused image. If the brightness is constant over a region of the image projected onto the sensing element, the transformation from the focused image into the defocused image is a linear shift-invariant operation. This represents the actual situation and, as a consequence, we can assume that the blur parameter σ is proportional to d.
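As a minimal sketch of this blur model (illustrative Python using NumPy/SciPy; the kernel truncation at three standard deviations is an assumption, not something specified in the paper), the PSF of Eq. (3) can be sampled and applied by convolution:

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_psf(sigma):
    """Sampled 2-D Gaussian PSF of Eq. (3), normalised to unit sum."""
    half = int(np.ceil(3 * sigma))            # assumed truncation at 3 sigma
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    h = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return h / h.sum()

def defocus(focused, sigma):
    """Model the defocused image as the focused image convolved with the PSF."""
    return convolve(focused.astype(float), gaussian_psf(sigma), mode='nearest')
```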

4. Estimating the depth of the scene

Since the PSF is a low-pass filter, its effect is a suppression of high frequencies.


Fig. 3. Normal illumination versus active illumination. (a) Image captured using normal illumination. (b) Image captured when active illumination is employed. (c) The Fourier spectrum of image (a). (d) The Fourier spectrum of image (b).

Therefore, to isolate the effect of blurring it is necessary to extract the high-frequency information derived from the scene. For this purpose, the near and far focused images are filtered with a 5 × 5 Laplacian operator, where the output of this operator gives an indication of the focus level. Fig. 2 illustrates the relation between the outputs of the focus operator and the depth estimation.
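A minimal sketch of this step is given below (illustrative Python). The exact 5 × 5 Laplacian coefficients are not listed in the paper, so the kernel here is an assumption; likewise, the ratio image is only the quantity that the sensor later maps to depth through its calibrated look-up table (see Section 7).

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

# An assumed 5x5 Laplacian-style focus operator (zero-sum high-pass kernel);
# the paper does not give its exact coefficients.
LAPLACIAN_5X5 = np.array([[ 0,  0, -1,  0,  0],
                          [ 0, -1, -2, -1,  0],
                          [-1, -2, 16, -2, -1],
                          [ 0, -1, -2, -1,  0],
                          [ 0,  0, -1,  0,  0]], dtype=float)

def focus_measure(image):
    """High-frequency (focus) content of an image: absolute Laplacian response."""
    return np.abs(convolve(image.astype(float), LAPLACIAN_5X5, mode='nearest'))

def defocus_ratio(near, far, eps=1e-6):
    """Per-pixel ratio of the focus measures of the near and far focused images."""
    g1 = gaussian_filter(focus_measure(near), sigma=1.0)   # smooth the noisy response
    g2 = gaussian_filter(focus_measure(far), sigma=1.0)
    return g1 / (g2 + eps)
```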

5. Active illumination

The high frequencies derived from the scene determine the accuracy of the depth estimation. If the scene has a weak texture or is textureless (like a blank sheet of plain paper), the depth recovery is very far from accurate. Consequently, the applicability of passive DFD is restricted to scenes with high textural information. To address this limitation, Pentland et al. [10] proposed to project a known pattern of light onto the scene. As a result, an artificial texture is forced onto the visible surfaces and the depth can be obtained by measuring the apparent blurring of the projected pattern. The illumination pattern was generated by a slide projector and selected in an arbitrary manner. In Fig. 3d the textural frequency derived from an illumination pattern organised as a striped grid can be observed.

Later, Noguchi and Nayar [14] developed a symmetrical pattern optimised for a specific camera. They use the assumption that the sensing element is organised as a rectangular grid. The optimisation procedure presented in the same paper consists of a detailed Fourier analysis, and the resulting model of illumination is a rectangular cell with uniform intensity which is repeated on a 2-D grid to obtain a periodic pattern. The resulting pattern is very dense and difficult to fabricate; in our testing we found that, by use of image interpolation, the problems caused by a sub-optimal pattern can be significantly alleviated. This issue will be presented in the next section.
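To illustrate why a striped pattern yields a dominant spatial frequency (as seen in Fig. 3d), the sketch below generates a synthetic vertical-stripe image and locates the strongest non-DC peak of its Fourier spectrum. The stripe period used here is an arbitrary choice, not the one projected by the actual sensor.

```python
import numpy as np

def striped_pattern(shape=(256, 256), period=8):
    """Synthetic square-wave illumination pattern with vertical stripes."""
    x = np.arange(shape[1])
    row = ((x // (period // 2)) % 2).astype(float)   # alternating bright/dark bands
    return np.tile(row, (shape[0], 1))

def dominant_frequency(image):
    """Frequency coordinates (cycles/pixel) of the strongest non-DC spectral peak."""
    spectrum = np.abs(np.fft.fft2(image))
    spectrum[0, 0] = 0.0                             # ignore the DC component
    fy, fx = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    return abs(np.fft.fftfreq(image.shape[0])[fy]), abs(np.fft.fftfreq(image.shape[1])[fx])

print(dominant_frequency(striped_pattern()))         # ~ (0.0, 0.125) for an 8-pixel period
```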


Fig. 4. (a, b) The near and far focused images after the application of the focus operator. (c) The resulting depth map.

6. Image interpolation

Since active illumination is employed, the depth estimation will exhibit the same striped pattern. Due to magnification changes between the near and far focused images, the stripes do not match perfectly. As a consequence, the depth estimation is unreliable, especially around the stripe borders. This can be observed in Fig. 4, where the depth estimation is not continuous. Also, note the errors caused by changes in magnification.

To compensate for this problem, Watanabe and Nayar [13] proposed the use of a telecentric lens. This solution is elegant and effective, but since the telecentric lens requires a small external aperture, the illumination source necessary to image the scene has to be very powerful. To avoid this complication we propose to fill in the dark regions by using image interpolation. Linear interpolation was found to be sufficient in the case where a dense (10 lines per mm) illumination pattern was used. The effect of image interpolation is depicted in Fig. 5, where the quality of the depth estimation is significantly improved.
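A minimal sketch of this idea is given below (illustrative Python). It assumes that the dark inter-stripe pixels can be identified by thresholding the focus-measure image against a fraction of its row maximum; the threshold value and the row-wise linear scheme are assumptions, since the paper only states that linear interpolation proved sufficient.

```python
import numpy as np

def interpolate_dark_regions(focus_img, rel_threshold=0.05):
    """Fill the dark inter-stripe pixels of a focus-measure image by
    row-wise linear interpolation between neighbouring bright pixels."""
    out = focus_img.astype(float).copy()
    for row in out:                                    # rows are views, edited in place
        peak = row.max()
        if peak <= 0:
            continue
        bright = row > rel_threshold * peak            # pixels lit by the stripes
        if bright.any() and not bright.all():
            cols = np.arange(row.size)
            row[~bright] = np.interp(cols[~bright], cols[bright], row[bright])
    return out
```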

7. Physical implementation

The aim of this implementation is to build a range sensor able to extract the depth information derived from dynamic scenes. Thus, a key issue is to capture the near and far focused images at the same time.


Fig. 5. (a, b) The effect of interpolation when this operation is applied to the images illustrated in Fig. 4(a, b). (c) The resulting depth map.

Fig. 6. The diagram of the bifocal range sensor.


Fig. 7. A picture of the developed range sensor.

For this purpose, two OFG VISIONplus frame grabbers were utilised. The scene is imaged using an AF MICRO NIKKOR 60 mm F 2.8 lens. Between the NIKKOR lens and the sensing elements a 22 mm beam splitter cube is placed. The sensing elements used for this implementation are two low-cost 256 × 256 VVL 1011C CMOS sensors. Nevertheless, the beam splitter introduced a supplementary distance between the CMOS sensors and the lens that images the scene. As a consequence, the images projected on the sensors' active surface will be significantly out of focus. To overcome this problem, the camera head was opened and the first sensor was set in contact with the beam splitter inside the case. The second sensor was positioned with a small gap (approximately 0.8 mm) from the beam splitter surface using a multi-axis translator. The distance between the lens and the beam splitter is about 1 mm. The diagram of the developed sensor is illustrated in Fig. 6, while Fig. 7 depicts the actual set-up.

The structured light is projected on the scene using an MP-1000 projector with an MGP-10 Moire grating (stripes with a density of 10 lines per mm). The lens attached to the projector is of the same type as that used to image the scene. Note that all the equipment required by this implementation is low cost.

The software is simple, since it includes only local operators. The flowchart illustrated in Fig. 8 describes the main operations required to compute a depth map of 256 × 256 resolution.

As we mentioned earlier, the focus operator is modelled by a 5 × 5 Laplacian operator. Since the Laplacian operator enhances the high-frequency noise, to compensate for this issue a smoothing 5 × 5 Gaussian operator is applied. The information obtained after image interpolation is used to determine the depth map. As illustrated in Fig. 2, the defocus function can be implemented by the ratio between ∇²g₁ and ∇²g₂.

Fig. 8. Data flow during the computation process.

Fig. 9. Ideal depth versus estimated depth.

For the sake of computational efficiency, the defocus function is implemented using a look-up table. Finally, the depth map is smoothed with a 5 × 5 Gaussian operator. The depth map is computed in 95 ms on a Pentium 133 with 32 MB RAM running Windows 98.
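A sketch of how such a look-up table might be built and applied is shown below (illustrative Python). The calibration samples that would populate the table are assumed to come from the procedure of Section 8 and must be sorted by increasing ratio; the paper does not specify the table resolution, so n_bins is an arbitrary choice.

```python
import numpy as np

def build_depth_lut(calib_ratios, calib_depths, n_bins=1024):
    """Quantised look-up table mapping the defocus ratio g1/g2 to depth.

    calib_ratios/calib_depths: ratios measured for a planar target placed at
    known distances (assumed sorted by increasing ratio)."""
    bins = np.linspace(calib_ratios.min(), calib_ratios.max(), n_bins)
    lut = np.interp(bins, calib_ratios, calib_depths)
    return bins, lut

def depth_from_ratio(ratio_img, bins, lut):
    """Look up the per-pixel depth instead of evaluating the defocus function
    directly, which keeps the cost per pixel low."""
    idx = np.clip(np.searchsorted(bins, ratio_img), 0, len(lut) - 1)
    return lut[idx]
```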

8. Camera calibration

As for any other range sensor, the calibration procedure represents an important operation. This sensor requires a two-step calibration procedure. The first step involves obtaining a precise alignment between the near and far focused sensing elements. To achieve this goal, the calibration is performed step by step using the multi-axis translator which is attached to one of the CMOS sensors. This procedure continues until the misregistrations between the near and far focused images are smaller than the errors caused by changes in magnification due to different focal settings. As suggested in [11], the second step involves a pixel-by-pixel calibration in order to compensate for the errors caused by the imperfections of the optical and sensing equipment.


Fig. 10. The near and far focused images for a scene which contains two planar objects situated at different distances from the sensor.

Fig. 11. The depth estimated for the scene illustrated in Fig. 10.

This operation consists of the following procedure: a planar target is placed perpendicular to the optical axis of the sensor at a precisely known distance. Then, the depth map is computed and the differences between the resulting depth values and the real distances are recorded in a table. Nevertheless, this procedure holds only if the errors introduced by the optical equipment are constant. Our experiments indicated that most of the errors are caused by image curvature, errors that are constant and easy to correct. Consequently, this pixel-by-pixel gain calibration is appropriate for this implementation.
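A sketch of what such a pixel-by-pixel correction might look like is given below (illustrative Python). The additive error model is an assumption; the paper only states that the differences between the measured and true distances are recorded in a table and that the underlying errors are constant.

```python
import numpy as np

def build_error_table(measured_depth, target_distance):
    """Per-pixel systematic error recorded with a planar target placed
    perpendicular to the optical axis at a precisely known distance."""
    return measured_depth - target_distance

def correct_depth(depth_map, error_table):
    """Subtract the stored per-pixel errors (e.g. those caused by image
    curvature) from a freshly computed depth map."""
    return depth_map - error_table
```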

9. Experiments and results

To evaluate the performance of the developed sensor, it was tested on several indoor scenes. Initially, the sensor was tested on simple targets like planar surfaces, then on scenes with a complex scenario.


Fig. 12. The near and far focused images for a scene which contains a slanted planar object.

Fig. 13. The depth estimated for the scene illustrated in Fig. 12.

Fig. 14. The near and far focused images for a scene which contains various textureless objects.

The accuracy and linearity are estimated when the sensor is placed at a distance of 86 cm from the baseline of the workspace.

Fig. 11 illustrates the depth recovery for two textured planar objects (see Fig. 10) situated at different distances in front of the sensor. Fig. 13 shows the depth map for the slanted planar object illustrated in Fig. 12.

Fig. 15 shows the depth recovery for a more complex scene defined by textureless objects, which is depicted in Fig. 14.


Fig. 15. The depth estimated for the scene illustrated in Fig. 14.

Fig. 16. The near and far focused images for a scene which contains various LEGO objects.

Fig. 17. The depth estimated for the scene illustrated in Fig. 16.


Fig. 17 illustrates the depth map for a scene which contains mildly specular LEGO objects with different shapes and a wide range of colours (see Fig. 16).

For scenes containing non-specular objects the worst-case error (see Fig. 9) is 3.4% of the distance from the sensor (roughly 2.9 cm at the 86 cm working distance used here). This accuracy is reported for both textured and textureless objects. When the scene contains objects with specular properties, the accuracy is affected in relation to the degree of specularity. These results indicate that the developed sensor may provide enough information for a large variety of visual applications, including object recognition and advanced inspection.

10. Conclusions

In this paper we described the implementation of a video-rate bifocal range sensor. Since the depth is estimated by measuring the relative blurring between two images captured with different focal settings, this approach, in contrast with methods such as stereo and motion parallax, is not hampered by problems such as detecting the correspondence between features contained in a sequence of images or missing parts. Also, it is worth mentioning that DFD techniques offer the possibility of obtaining real-time depth estimation at a low cost.

Active DFD is preferred in many applications because it can reliably estimate the depth even in cases when the scene is textureless. The changes in magnification between images captured with different focal settings limited the performance of this technique and forced researchers to resort to computationally expensive techniques such as image registration and warping. Nevertheless, such an implementation is not appropriate for real-time depth estimation; hence we propose to tackle this problem by employing image interpolation. Another advantage is the fact that this approach relaxes the demand of having an optimal illumination pattern.

The consistency between theory and experimental results has indicated that our implementation offers an attractive solution for estimating depth quickly and accurately.

References

[1] Jarvis R. A perspective on range-finding techniques for computer vision. IEEE Trans Pattern Anal Machine Intelligence 1983;5(2):122–39.
[2] Krotkov E. Focusing. Internat J Comput Vision 1987;1:223–37.
[3] Pentland A. A new sense for depth of field. IEEE Trans Pattern Anal Machine Intelligence 1987;9(4):523–31.
[4] Asada N, Fujiwara H, Matsuyama T. Seeing behind the scene: analysis of photometric properties of occluding edges by the reversed projection blurring model. IEEE Trans Pattern Anal Machine Intelligence 1998;20(2):157–66.
[5] Pentland A, Darrell T, Turk M, Huang W. A simple, real-time range camera. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1989. p. 256–61.
[6] Subbarao M. Parallel depth recovery by changing camera parameters. Proceedings of the IEEE Conference on Computer Vision, 1988. p. 149–55.
[7] Xiong Y, Shafer SA. Depth from focusing and defocusing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1993. p. 68–73.
[8] Subbarao M, Surya G. Depth from defocus: a spatial domain approach. Internat J Comput Vision 1994;13(3):271–94.
[9] Watanabe M, Nayar SK. Rational filters for passive depth from defocus. Technical Report CUCS-035-95, Department of Computer Science, Columbia University, New York, USA, September 1995.
[10] Pentland A, Scherock S, Darrell T, Girod B. Simple range cameras based on focal error. J Opt Soc Amer 1994;11(11):2925–35.
[11] Nayar SK, Watanabe M, Noguchi M. Real-time focus range sensor. Proceedings of the International Conference on Computer Vision, 1995. p. 995–1001.
[12] Darrell T, Wohn K. Depth from defocus using a pyramid architecture. Pattern Recognition Lett 1990;11(12):787–96.
[13] Watanabe M, Nayar SK. Telecentric optics for computational vision. Proceedings of the Image Understanding Workshop (IUW 96), Palm Springs, February 1996.
[14] Noguchi M, Nayar SK. Real-time focus range sensor. Proceedings of the International Conference on Pattern Recognition (ICPR 94), October 1994.