
International Journal of Computer Vision, Special Issue on Multicamera Stereo, Volume 47, Numbers 1/2/3. April-June 2002 pp. 131-147.


Generalized Multiple Baseline Stereo and Direct Virtual View Synthesis Using Range-Space Search, Match, and Render

Kim C. Ng¹, Mohan Trivedi², and Hiroshi Ishiguro³

¹AST – La Jolla Lab, STMicroelectronics, 4690 Executive Dr, San Diego CA 92121, USA
²Computer Vision & Robotics Research Lab, University of California–San Diego, CA 92093, USA
³Department of Computer and Communication Sciences, Wakayama University, Japan

Abstract A new “range-space” approach is described for the synergistic resolution of both the stereovision and the reflectance (visual) modeling problems simultaneously. This synergistic approach can be applied to arbitrary camera arrangements with different intrinsic and extrinsic parameters, image types, image resolutions, and image numbers. These images are analyzed in a step-wise manner to extract 3-D range measurements and also to render a customized perspective view. The entire process is fully automatic. An extensive and detailed experimental validation supports the basic feasibility and generality of the Range-Space Approach.

Keywords: Range-space approach; generalized multiple baseline stereo; direct virtual view synthesis; wide-baseline stereo; volumetric matching template; matching curve characteristics; error region; volume growing; virtual walkthrough; image-based rendering; omni-directional video.

1. Introduction

The Range-Space Approach requires three distinct steps of analyzing multiple input images: 1) Search, 2) Match, and 3) Render. This approach differs from approaches such as [1], where the entire 3-D model of the scene is recovered. Such an approach is very inefficient because the depth of each pixel in the source images must be recovered regardless of whether or not that pixel is visible from the synthesized virtual camera view. In the Range-Space Approach, only the depths of those pixels that are visible in the virtual camera views are recovered. Thus, a virtual view is directly synthesized without the intermediate step of 3-D model recovery.

From the standpoint of the range space (3-D space), stereovision and direct virtual view synthesis from multiple images are, in fact, the same problem. The main challenge lies in recovering the first "true" voxel (a true voxel is any voxel along the viewing ray that intersects a true physical surface) out of the many possible true and false voxels along a given virtual ray. When the first true voxel can be determined, both depth and color reflectance are recovered simultaneously for a virtual pixel. In the special case when a virtual view coincides with a real camera view, the color reflectance is known and only the depth needs to be recovered. The Range-Space Approach provides a robust and reliable solution to determine the first true voxel and thus recover both depth and color reflectance simultaneously. From the perspective of stereovision, this synergistic approach is a generalized multiple baseline stereo that can be applied to arbitrary camera arrangements with different intrinsic and extrinsic parameters, image types, image resolutions, and image numbers. From the perspective of visual modeling, it is a direct virtual view synthesis that does not require the intermediate step of 3-D model recovery.

In this work, solving the generalized multiple baseline stereo is equivalent to solving the direct virtual view synthesis problem. The challenges encountered in stereo vision have a straightforward and effective solution in the Range-Space Approach. This research addresses five major challenges of wide-baseline stereo: Window Cutoff, Scaling Effects, Foreshortening Effects, Specular Highlights, and Occlusions. Searching in the range space enables arbitrary wide-baseline multi-camera configurations and helps overcome the first three challenges. Cameras are selected through robust statistics to lessen the effects of Specular Highlights and Occlusions in the matching process.

The Range-Space Approach was introduced in [2]; its complete technical details were disclosed in [3][4]. View synthesis using the Range-Space Approach can also be integrated with object tracking for surveillance and monitoring applications [5]. Recently, [6] presented similar ideas to render a novel view with "unstructured" input images. The Range-Space Approach has achieved the first seven of the eight image-based rendering [7][8] goals mentioned in [6] synergistically and automatically, with reliability and robustness. [9] described a simpler version of the Range-Space Approach that warps pixels of an omni-directional image to a recovered depth plane, which fits exactly in the view of an intended virtual image, to give the sensation of a smooth virtual walkthrough. To generalize the light-field representation for efficient rendering, [10] warped the corresponding pixels of omni-directional images to the depth planes that connect every two omni-directional images; dense camera placement is needed to cover the continuous paths that intersect and form the image loops. The Range-Space Approach, by contrast, can synthesize virtual views anywhere within a sparse number of widely separated image views.

In the following three sections, we present the details of the Range-Space Search, Match, and Render stages. In Section 5, we show results of virtual view synthesis and smooth virtual walkthroughs. Although the implementation of this approach is demonstrated with omni-directional images (ODI), the discussion is general and applicable to any multi-camera system.

2. Range-Space Search

Figure 1 Illustration of the Range-Space Approach. [Diagram labels: virtual viewpoint, virtual pixel, and virtual focal length with axes $Y_V$, $Z_V$; world coordinate axes $X_W$, $Y_W$, $Z_W$; ODVS 1-4; initial ray length, next voxel, ray length, 3D back-projection, and pixel frustum.]

Figure 1 shows a view of four omni-directional vision sensors (ODVS) in a multiple baseline stereo configuration. The starting point for the range-space search is at the virtual viewpoint. For each pixel on that virtual image plane, we project a pixel frustum whose left, right, top, and bottom boundaries are aligned with the edges of the pixel and whose frustum tip is at the virtual viewpoint. The frustum extends outwards, away from the virtual camera and into the real-world environment. One such frustum is illustrated in the figure.
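As a concrete illustration of this back-projection, the following sketch computes the ray at the center of a pixel frustum. It assumes a pinhole virtual camera with focal length f, principal point c, world-to-camera rotation R, and camera center t; all names are illustrative, not from the paper.

```python
import numpy as np

def pixel_ray(u, v, f, c, R, t):
    """Origin and unit direction of the virtual ray through pixel (u, v).

    Assumes a pinhole virtual camera: focal length f (pixels), principal
    point c = (cx, cy), world-to-camera rotation R (3x3), camera center t.
    The frustum tip sits at t; the frustum itself is bounded by the four
    rays through the pixel's corners (u +/- 0.5, v +/- 0.5).
    """
    d_cam = np.array([u - c[0], v - c[1], f], dtype=float)  # camera coordinates
    d = R.T @ d_cam                      # rotate into world coordinates
    return t, d / np.linalg.norm(d)
```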

2.1. Search Length Determination

Searching in the range space poses different challenges than searching directly in the disparity/image space. In the image space, the search is bounded by the source image resolution: an epipolar line is discretized directly by the number of pixels along the line. In the range space, there are no physical grids that limit travel along the pixel frustum. To discretize a frustum, we have developed a way to use the input images' resolution to guide the search in the range space. The pixel sizes of the input images define the logical search interval in the range space. The search is done in such a way that each interval occupies only a single voxel. The shape and volume of each voxel vary. The voxel, when back-projected to the input images, covers at most one pixel area in every respective image. In addition, the overlapping volume of the search intervals is minimized to reduce repeated computation. Consequently, when moving from one voxel to the next in the range space, the corresponding movement in the image space is at most one pixel in every respective image and exactly one pixel in at least one of the images. That is something the classic plane-sweep and cubical voxel space [11] techniques cannot guarantee.

Searching in the range space has the following important benefits:
1) It gives us the maximum freedom to control the span of range to traverse.
2) It allows us to choose the depth resolution.
3) The image-rectifying process is embedded in the range-space search process.
4) The motion of each camera's pixel on the epipolar lines is directly controlled by the motion of the voxel in the range space, with respect to the arrangement and resolution of each individual camera.

Figure 2 Range-Space Search. The goal is to preserve the maximum depth resolution with respect to the camera arrangements and the image resolutions. In this example, ODVS 2, which contains the least visual information, decides where the next voxel is. [Diagram labels: virtual viewpoint $V_O$, virtual image plane, virtual rays, current voxel $V_{kl}$, next voxel $V_{k(l+1)}$, search interval ${}^1L_{kl}$, projected pixel location, angles $\rho$, $\gamma$, $\psi$, and mirror focal point $O_M$.]

Figure 2 explains the searching algorithm in the range space. The points $V_O$, $V_{kl}$, and $O_{Mi}$ are known; $V_{k(l+1)}$ is to be determined. $V_{kl}$ is the $l$-th voxel on the $k$-th virtual ray, and $O_{Mi}$ is the $i$-th mirror's focal point. The angle $\gamma$ can be calculated with the Law of Cosines. Based on the projected pixel location of $V_{kl}$ and the assumption of a ray cone, the pixel's angular resolution in an ODI can be determined as $\rho \approx \tan^{-1}(1/r)$, where $r$ is the radius from the ODI center to the projected pixel location. This angular resolution is a function of where the projected pixel lies in an ODI (in a rectilinear image the angular resolution is a fixed value). The angular resolution also defines the ray cone that intersects the two voxels $V_{kl}$ and $V_{k(l+1)}$: the borders of the two voxels lie on the perimeter of the ray cone, and the interval that connects the two voxels passes through the center of the ray cone. Finally, the search interval ${}^1L_{kl}$ (the 1-step search length from the $l$-th voxel to the $(l+1)$-th voxel on the $k$-th ray) of a camera is found using the Law of Sines. The search intervals of all cameras are compared, and the smallest one decides the next voxel location. Given a voxel $V_{kl}$, its corresponding pixels can be found in the ODIs $I = \{I_1, \ldots, I_N\}$ as $P^{kl} = \{P_1^{kl}, \ldots, P_N^{kl}\}$, where $P_1^{kl} \in I_1, \ldots, P_N^{kl} \in I_N$.¹ The newly determined next voxel becomes the current voxel in the succeeding search cycle. The search cycle repeats until none of the pixels moves any further as the search length approaches infinity; in other words, all of the voxels from there on are projected into the same set of pixels $P^{kl}$.
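A minimal sketch of this stepping rule follows, under one consistent reading of Figure 2: $\gamma$ is the Law-of-Cosines angle at $V_{kl}$ in the triangle $(V_O, V_{kl}, O_{Mi})$, and the Law of Sines in the triangle $(V_{kl}, V_{k(l+1)}, O_{Mi})$ gives ${}^1L_{kl} = \lVert O_{Mi} - V_{kl} \rVert \sin\rho / \sin(\gamma - \rho)$. The function names and data layout are assumptions for illustration.

```python
import numpy as np

def one_step_search_length(V_O, d, t, O_M, rho):
    """1-step search length contributed by one camera.

    V_O : virtual viewpoint (3,)
    d   : unit direction of the k-th virtual ray (3,)
    t   : current ray length, so the current voxel is V_kl = V_O + t * d
    O_M : the camera's (mirror) focal point O_Mi (3,)
    rho : pixel angular resolution at the projection of V_kl
    """
    V_kl = V_O + t * d
    c = O_M - V_kl                       # voxel -> camera focal point
    dist = np.linalg.norm(c)
    # Angle at V_kl in the triangle (V_O, V_kl, O_Mi); the dot product
    # with the backward ray direction -d is equivalent to applying the
    # Law of Cosines to the three known points.
    gamma = np.arccos(np.clip(np.dot(-d, c) / dist, -1.0, 1.0))
    if gamma <= rho:                     # projected pixel no longer moves
        return np.inf                    # search has effectively reached infinity
    # Law of Sines: the next voxel subtends the angle rho at the camera.
    return dist * np.sin(rho) / np.sin(gamma - rho)

def next_voxel_length(V_O, d, t, cameras):
    """Advance by the smallest per-camera interval: the deciding camera
    moves exactly one pixel, every other camera at most one."""
    steps = [one_step_search_length(V_O, d, t, O_M, np.arctan(1.0 / r))
             for O_M, r in cameras]      # r: radius of the projected pixel in the ODI
    return t + min(steps)
```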

The trajectory of the range-space search is sinusoidal when the virtual rays are projected onto the ODI (the virtual rays appear as straight epipolar lines when rectilinear images are used). In Figure 3, the sinusoidal curves have discontinuities caused by the non-matching regions; a non-matching region is detected when an image does not "see" a particular voxel on a virtual ray. It is now apparent that the range-space search overcomes the problem of Window Cutoff.

Figure 3 Trajectory of virtual rays in the ODI. [Label: rays discontinued by non-matching regions.]

¹ An ideal ODI represents the world in the spherical coordinate system. Every voxel should have a corresponding pixel in an ODI, provided that there is no occlusion.

2.2. Matching Template Derivation and Adjustments

In this section, we describe how to solve the Scaling and Foreshortening effects that are especially prominent in the wide-baseline omni-directional stereo configuration. Typical image-space matching techniques, which utilize a fixed rectangular template, are unable to handle these two effects. In reality, using a fixed image-space matching template for every camera implies that the physical objects change shape and size with respect to the cameras, which is physically incorrect. To solve these effects, the size and shape of the image-space matching template in each corresponding image have to be adjusted when verifying each subject voxel $V_{kl}$ along a virtual ray.

In the Range-Space Approach, we assume that every object has a spherical shape. When the assumed sphere is small relative to the object surface, the sphere approximates the object surface well. This spherical volume can therefore form the basis for our matching template. The range-space matching template is in effect a Volumetric Matching Template (see Figure 4). The back-projection of this volumetric matching template to the image space generates an image-space matching template of different shape and size in each respective image. The subject voxel is at the center of the matching templates.


Figure 4 Volumetric matching template.

To determine the first true voxel when searching along a virtual ray, the Range-Space Approach performs template matching at each $V_{kl}$. Since the matching of a single voxel/pixel can be noisy, the volumetric matching template needs to contain a larger volume than a single subject voxel. This volume is calculated based upon the corresponding pixel angular resolution in each camera and the distance of each camera center to the subject voxel. With this approach, the spherical volume at each $V_{kl}$ varies; however, the number of voxels which make up the volume is fixed. At each $V_{kl}$, the farthest camera, with the lowest angular resolution to the subject voxel, has the smallest image-space matching template; conversely, the closest camera, with the highest angular resolution, has an image-space matching template covering the largest image area. Figure 5 shows the image-space templates of four virtual voxels in each image using the methods discussed. Observe that the size and shape of the templates differ in each camera.

Figure 5 Templates of four virtual voxels in four different cameras.
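As a rough illustration of how the per-camera template size falls out of this rule, the sketch below converts the assumed sphere radius and the camera-to-voxel distance into a pixel extent. The names and the rounding policy are assumptions, not the paper's implementation.

```python
import math

def template_pixel_extent(sphere_radius, cam_to_voxel_dist, rho):
    """Approximate side length, in pixels, of an image-space template.

    The assumed sphere of radius sphere_radius centered on the subject
    voxel subtends an angle of roughly 2*atan(R/D) at a camera at
    distance D; dividing by that camera's local angular resolution rho
    converts the angle into pixels. Farther cameras or coarser pixels
    therefore get smaller templates, as described above.
    """
    angular_size = 2.0 * math.atan(sphere_radius / cam_to_voxel_dist)
    return max(1, int(round(angular_size / rho)))
```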

A similar concept of adjusting matching templates appeared in [12]: segmented patches were projected into the range space, and deformable template matching was then performed by a hypothesis-and-verification procedure. This method requires starting in the image space to segment the edge features; segmentation of objects in an image is a difficult and error-prone task due to noise. Regarding the Foreshortening Effect, compared to the work using the Local Spatial Frequency representation [13], the Range-Space Approach with the volumetric matching template seems more intuitive and direct.

3. Range-Space Match

Matching is recognized as a challenging research topic in machine vision. So far, no robust and reliable matching method has been demonstrated for complex, natural scenes. Methods such as coarse-to-fine or hierarchical methods [14] are sensitive to the choice of template sizes and to the incompleteness of edge segments. Although a larger template size ensures more stable performance, there is no guarantee that the disparity information obtained at the coarse scales is valid for generic image data: the disparity estimate might be wrong, might have a different value than at finer scales, or might not be present at all. Thus hierarchical approaches fail under various circumstances. Feature-based matching is scene-complexity dependent, and it is also difficult to extract stable, useful features. To complicate matters further, every feature detected in one image can potentially be matched with every feature of the same class in the other images; the number of possible matches explodes as the feature density increases.

In the field of machine vision, a common way of finding these corresponding pixels is to optimize a matching objective function. Nevertheless, such matching algorithms often find many different sets of corresponding pixels to be possible matches. These possible matches, when shown on a matching curve, appear as local minima. A definite depth is difficult to determine in the presence of these multiple local minima. The goal of an optimization algorithm is to reduce these local minima to a single local minimum or an obvious global minimum; it is rarely, however, successful.

Most matching algorithms pick the local minimum with the lowest error among the many local minima as the best match and discard the rest, even though that error may be only slightly better than the others. Some algorithms choose a sparse set of pixels with an apparent global minimum and interpolate the remaining pixels with techniques such as spline or Lagrange interpolation. These schemes can produce gross errors even when many cameras are employed [15]. In the real world, it is difficult to obtain a reliable global minimum; multiple local minima always exist. An apparent global minimum can also be deceiving, especially due to color homogeneity in the scene. Even when a global minimum exists, the estimated range can be far from its true 3-D location, because the lowest error can fall in various homogeneous locations, and the estimated range can vary widely from that of its neighboring pixels.

In short, a matching algorithm should not arbitrarily choose a local minimum (the one with the lowest error) as the best match; neither should a matching algorithm put its whole faith in a global minimum. Instead, any matching decision should be delayed until all the matching curves are collected for the entire view. These matching curves contain rich information that can be exploited for reliable and robust depth and color reflectance recovery.

3.1. Overview

To achieve robustness and reliability in matching, we modify the area-based matching approach using both color images and color-edge images. Each virtual pixel has two associated matching curves: one from the matching of color images (color match) and another from the matching of color-edge images (color-edge match). The color match and the color-edge match are two separate processes but with identical mechanisms to search, derive matching templates, and match. Both types of matching are processed in parallel at each $V_{kl}$. For the matching at each $V_{kl}$, robust statistics is used to select a subset of images out of the set of input images, because some of the cameras may be occluded or may have significant color deviation from the others caused by specular reflection or sensor noise. After the preliminary (first-pass) search and match steps are finished for the entire virtual image, the matching curves of the color match and the color-edge match are collected. In the second pass, matching attributes are derived from these sets of matching curves. The derived matching attributes of every virtual pixel are analyzed to locate initial confident seeds. An initial confident seed is the first true voxel of a virtual pixel that has an obvious global minimum of both the color match and the color-edge match estimated at the same 3-D location. The voxels of these confident seeds grow simultaneously into the volumes (see "error region" in the following subsection) of their less confident neighbors to help determine the first true voxel of those neighbors. This Volume Growing process is based on the characteristics of the matching curves together with the Continuity Constraint. There are various stages of Volume Growing, proceeding from the more confident types of matching curve characteristics down to the less confident types (for the detailed implementation of the Volume Growing stages, refer to [3]). Essentially, a matching curve with a single local minimum is more confident than a matching curve with multiple local minima. Iterations are carried out within each stage: the newly recovered first true voxel of a virtual pixel in one iteration becomes a new seed in the next. A stage stops when the recovered voxels cannot grow further into the volumes of their less confident neighbors under the Continuity Constraint and the matching curve characteristics specified in that stage. The virtual pixels whose first true voxel has not yet been determined are then examined in the subsequent stages. The final stage fills the occluded regions, or regions with high matching errors (high deviation in colors), using geometrical interpolation.

Note that the recovered depths from other depth cues, such as depth from focus and structure from motion, can also be used to provide initial confident seeds for Volume Growing. These depths, if reliable, can further strengthen the reliability of the Range-Space Approach. A few confident seeds are sufficient to trigger an avalanche effect of concurrent propagation that determines the first true voxels of their neighbors.

The following highlights the unique features of the range-space matching algorithms:
1) Volume Growing is used to propagate good matches based on the characteristics derived from the matching curves with the Continuity Constraint. Confident seeds are located using edge conformity; edgels that form the boundary closure of an object signal the possibility of geometrical discontinuity.
2) Volume Growing is used to integrate the matching results of other depth extraction cues.
3) Robust statistics is applied to handle specular highlights, occlusions, and the non-uniform numbers of cameras used for matching along a virtual ray due to non-matching regions.
4) A range-space match must recover both depth and color reflectance simultaneously. No reference camera/color can be assumed unless the virtual view coincides with an input real camera view.
5) Every virtual pixel is processed in parallel with identical mechanisms. Computation time is independent of the scene composition. The processing time varies linearly with the number of input cameras, the resolution of real and virtual images, and the locations of the real and virtual cameras.

3.2. Matching Attributes Derivation for Volume Growing

Figure 6 Illustrating an error region and where the matching attributes are derived. [Plot labels: average SSD (sum of squared differences) versus 1-step search length ${}^1L$; the error region starts at $\varepsilon_S$, has its local minimum at $\varepsilon_L$, and ends at $\varepsilon_E$; the threshold line and the "local minimum + threshold" line are marked; a candidate edge match produces a confident seed, while another minimum is a false edge match.]

The concept of an “error region” is used for deriving matching attributes in the Volume Growing process (refer to Figure 6). An error region starts at $\varepsilon_S$, where the error falls below the threshold line, and ends at $\varepsilon_E$, where the error rises above the threshold plus the local minimum error at $\varepsilon_L$. Eight attributes are derived from the $l$-th error region on the $k$-th ray: the depth $V_l^{k\varepsilon_L}$, the color $P_l^{k\varepsilon_L}$, the matching error $SSD_l^{k\varepsilon_L}$, the ray length $L_l^{k\varepsilon_L}$, the ray length at $\varepsilon_S$ ($L_l^{k\varepsilon_S}$), the ray length at $\varepsilon_E$ ($L_l^{k\varepsilon_E}$), the number of voxels within the error region ($U_l^{k\varepsilon}$), and finally the average SSD in the error region ($\overline{SSD}_l^{k\varepsilon}$). $V_l^{k\varepsilon_L}$, $P_l^{k\varepsilon_L}$, $SSD_l^{k\varepsilon_L}$, and $L_l^{k\varepsilon_L}$ are measured at the local minimum $\varepsilon_L$. All ray lengths are calculated from the viewpoint origin $V_O$. They are used as the bounding volume that enforces the Continuity Constraint in the Volume Growing process.
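A minimal sketch of deriving these attributes from one matching curve follows, using the start/end logic of Figure 6. The dictionary layout and function names are assumptions for illustration; the color attribute, which requires the images themselves, is omitted.

```python
import numpy as np

def error_regions(ssd, ray_lengths, threshold):
    """Extract error regions and their attributes from one matching curve.

    ssd        : average SSD at each voxel along a virtual ray
    ray_lengths: ray length of each voxel measured from the viewpoint V_O
    threshold  : global threshold on the matching error
    A region opens at eps_S when the error falls below the threshold and
    closes at eps_E when the error rises above the running local minimum
    plus the threshold.
    """
    ssd = np.asarray(ssd, dtype=float)
    regions, start = [], None
    for l, e in enumerate(ssd):
        if start is None:
            if e < threshold:                          # eps_S
                start = l
        elif e > ssd[start:l].min() + threshold:       # eps_E
            regions.append(_attributes(ssd, ray_lengths, start, l - 1))
            start = None
    if start is not None:                              # curve ended inside a region
        regions.append(_attributes(ssd, ray_lengths, start, len(ssd) - 1))
    return regions

def _attributes(ssd, ray_lengths, s, e):
    seg = ssd[s:e + 1]
    m = s + int(np.argmin(seg))                        # local minimum, eps_L
    return {"voxel": m,                                # depth index at eps_L
            "ssd": float(ssd[m]),                      # matching error at eps_L
            "length": ray_lengths[m],                  # ray length at eps_L
            "length_start": ray_lengths[s],            # ray length at eps_S
            "length_end": ray_lengths[e],              # ray length at eps_E
            "num_voxels": e - s + 1,                   # U, voxels in the region
            "avg_ssd": float(seg.mean())}              # average SSD in the region
```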

An error region covers a continuous region of voxels, which forms a volume. The ideal case is a matching curve with a single error region whose width is a single voxel. However, since window correlation is used for matching, the volume is always larger than one voxel. This volume is the uncertainty in the 3-D estimation. The volume/uncertainty is larger for homogeneous regions and smaller for highly textured regions. As the error region broadens, the difficulty of determining the first true voxel in that region increases.

The situation becomes even more difficult when multiple error regions (volumes) exist on a matching curve. Therefore, it is necessary to locate confident seeds to help guide those virtual pixels which have multiple error regions or broad error regions on their matching curves. A confident seed comes from a virtual pixel that has a single error region for both the color and color-edge matches, where the local minimum of the color match lies at the same location as the local minimum of the color-edge match. The voxel at this shared local minimum is regarded as an initial confident seed, and it is used to grow into the volumes (error regions) of its less confident neighbors. A seed can grow into a neighbor's volume only when its voxel is contained within that neighbor's volume (error region). This enforces the Continuity Constraint.

When using seeds with the concept of the Error Region, we do not pick the lowest error along the matching curve of a virtual ray as the best match; all the error regions on a matching curve are equally likely to contain a true voxel. An error region with the lowest local minimum is not granted any preference over the rest of the error regions on the same matching curve. The color match and the color-edge match are used to support the correctness of each other's global minimum. The color-edge match is insensitive to the thickness and incompleteness of the detected edges thanks to the verification support from the color match. The color match is likewise insensitive to the choice of template size, because the matching errors are not used directly to finalize a good match. Thus, no large (or coarse-to-fine) template is necessary.
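For concreteness, a sketch of this seed-growing rule is given below, building on the error-region attributes sketched earlier in this subsection. The data layout and the breadth-first frontier are illustrative assumptions; the staged, confidence-ordered iteration of [3] is not reproduced here.

```python
from collections import deque

def volume_growing(seeds, regions, neighbors8):
    """Grow confident seeds into the error regions of their neighbors.

    seeds     : dict pixel -> resolved ray length (initial confident seeds)
    regions   : dict pixel -> list of error-region attribute dicts
    neighbors8: function pixel -> its 8-neighbors in the virtual image
    A seed grows into a neighbor only when the seed's voxel lies inside
    one of the neighbor's error regions (the Continuity Constraint).
    """
    resolved = dict(seeds)
    frontier = deque(seeds)
    while frontier:
        p = frontier.popleft()
        depth = resolved[p]
        for q in neighbors8(p):
            if q in resolved:
                continue
            for r in regions.get(q, []):
                if r["length_start"] <= depth <= r["length_end"]:
                    resolved[q] = r["length"]   # that region's local minimum
                    frontier.append(q)          # the neighbor becomes a new seed
                    break
    return resolved
```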

3.3. Assumptions

For a desired virtual image of size $S$, there are $S$ virtual rays, $R = \{R_1, \ldots, R_S\}$. After the range-space searching process, each ray in $R$ is discretized into collinear voxels, $R_k = \{V_{k1}, \ldots, V_{k(l-1)}, V_{kl}\}$, where $k = 1, \ldots, S$ and $1 \le l < \infty$. Each virtual ray eventually intersects a physical surface; therefore, on each ray there exists at least one true voxel. This set of true voxels is denoted $R_k^T = \{V_{k1}^T, \ldots, V_{k(m-1)}^T, V_{km}^T\}$, where $R_k^T \subseteq R_k$ and $1 \le m \le l$.

Since the matching process includes noise, both true and false matches co-exist in $R_k$. Recall that multiples of the candidates, with errors $SSD_l^{k\varepsilon_L}$ (color match) and ${}^ESSD_l^{k\varepsilon_L}$ (color-edge match), are associated with any given virtual ray. With the concept of the Error Region, multiples of these true and false voxels are contained within the error regions. When both the color match and the color-edge match are used, the candidate set becomes $R_k^P = \{V^{k\varepsilon_0}, \ldots, V^{k\varepsilon_{n-1}}, V^{k\varepsilon_n};\ E^{k\varepsilon_0}, \ldots, E^{k\varepsilon_{p-1}}, E^{k\varepsilon_p}\}$, where $0 \le n \le l$ and $p \le n$. When all $SSD_{kl}$ of a virtual ray are above the threshold, no error region is found ($n = 0$); usually this happens because of occlusions or specular highlights.

To reliably extract the first true voxel out of $R_k^P$, we make the following assumptions:
1) The first true voxel, $V_{k1}^T$, is seen by the majority of the input cameras, so that its matching error is below the threshold. In short, the first true voxel is contained within an error region, i.e. $\{V_{k1}^T \mid V_{k1}^T \in R_k^P\} \cap R_k^T \neq \emptyset$.
2) Every object surface in view is accompanied by sharp edges. In particular, all the edges which form the closure of an object boundary can be detected in the input images.
3) The color-edge match is supported by the color match.
4) Every virtual ray that passes through an object border has a single local minimum (error region) for both the color match and the color-edge match.
5) The object surfaces are continuous; therefore, the distance of objects varies smoothly with viewing direction, except at the object borders (defined by edges). This is the Continuity Constraint.
6) The images and the matching process have negligible noise.

Claim: A perfect virtual view can be synthesized.

The information at an edge is more reliable, and it serves to locate an initial confident seed. Due to Assumption 3, for every $E^{k\varepsilon_b} \in \{E^{k\varepsilon_1}, \ldots, E^{k\varepsilon_p}\}$ from the color-edge match there is a corresponding voxel $V^{k\varepsilon_a} \in \{V^{k\varepsilon_1}, \ldots, V^{k\varepsilon_n}\}$ from the color match, with $\lVert V^{k\varepsilon_a} - V_O \rVert = \lVert E^{k\varepsilon_b} - V_O \rVert$, where $V_O$ is the viewpoint. Together with Assumption 4, the set for the edgels at the object borders becomes $R_k'^P = \{V^{k\varepsilon_{L1}} = E^{k\varepsilon_{L1}}\}$. Under these assumptions, the first true voxel of the edgels which form the closure of every object border can be recovered; thus, the initial confident seeds are found. With the Continuity Constraint, given a true voxel $V^{k\varepsilon_a}$ in $R_k^P$, there exists at least one other true voxel ${}^8V^{k\varepsilon_b}$ in the 8-neighborhood of $R_k$, such that ${}^8V^{k\varepsilon_b}$ is within the object borders and $\lVert {}^8V^{k\varepsilon_b} - V^{k\varepsilon_a} \rVert \le {}^1L^{k\varepsilon_a}$. The newly recovered voxel is turned into another seed at the next iteration. Eventually, a surface is grown within the supporting structures of the edges. The minimum number of surfaces is 1, while the maximum possible number of surfaces equals the number of rays having the set $R_k'^P = \{V^{k\varepsilon_{L1}} = E^{k\varepsilon_{L1}}\}$.

3.4. Demonstrations of the Searching and Matching Process

Figure 7 shows the range-space searching and matching process. The search paths in the windows show four different virtual pixels being processed simultaneously by four different threads in the developed software. Their searching and matching mechanisms are identical, as discussed.

The error plots show the matching errors of the third thread. Four input cameras are used in this particular example. $SSD_1^k$, $SSD_2^k$, $SSD_3^k$, and $SSD_4^k$ in the upper error plots are used to derive the curve $SSD^k$ in the lower error plots. Three color-match outliers are detected in the upper error plots; these outliers are not used to derive $SSD^k$. (The same description holds for the color-edge match's ${}^ESSD_i^k$ and ${}^ESSD^k$.)

The Y-axis of the error plots is normalized to 100. The value 100 for the color match is equivalent to a normalized pixel error of 64*64, and for the color-edge match to a normalized pixel error of 128*128; errors above 100 are cropped and not displayed. The threshold line is at 25, which corresponds to 32*32 for the color match and 64*64 for the color-edge match. When the error drops below the threshold line, its matching information is displayed in the message window. In this example, the color match has three local minima (three error regions), whereas the color-edge match has none. The lowest error of the color match is at the 25th voxel, where the ray length is 32 inches measured from the virtual viewpoint origin $V_O$.

Figure 7 The range-space searching and matching process. [GUI labels: search paths of four virtual pixels; matching templates (cameras 1-4) and close-up views of where the templates are extracted; error plots for the color match $SSD^k$ and the color-edge match ${}^ESSD^k$, with outliers, error regions, local minima, the lowest minimum, a minimum that has not yet passed the threshold, and the threshold line marked.]

The image-space matching templates of the four input cameras are arranged as numbered. The close-up views show where these templates are extracted; the four indicated points in the close-up views correspond to the top-left, top-right, bottom-left, and bottom-right corners of each image-space matching template. The volumetric template size is 7x7 voxels. The image-space matching templates shown in the graphical user interface are enlarged views.

Figure 8 shows a few snapshots of the Volume Growing process. The size of the synthesized image is 212x140. The top-left image is synthesized by picking the lowest error along each virtual ray. The red-coded region is due to the cropping of the original ODI, while the orange-coded regions are low-confidence regions; visual errors can be easily spotted. The final image is the result of Volume Growing after a total of 5 stages. The stages are indicated with numbers.

Figure 8 Snapshots of the Volume Growing process, from the initial seeds through five numbered stages. The final image (bottom-right) shows improvement over the preliminary image (top-left).

3.5. Differences between Matching in the Image Space and Matching in the Range Space

Marapane and Trivedi [14] grew regions in the image space to extract dense depth; the growing process stopped when the color of a region differed considerably. This kind of matching assumes that a region with uniform color is also uniform in depth. In reality, this assumption can be physically incorrect in many circumstances, and it is especially sensitive to changes in lighting conditions and to sensor noise. Additionally, the requirement of prior scene segmentation to extract depth is impractical, especially since the computational time increases with scene complexity. Lhuillier [16] also grew regions to acquire dense depth in the image space, using intensity thresholds to merge regions and to forbid growing into non-textured areas; nonetheless, the demonstrated virtual views were convincing. Some works have also shown robustness in image segmentation by integrating region growing and edge detection [17][18] using illumination contrast. The range-space match, on the other hand, performs this integration in the range space based on the characteristics of the matching curves with the Continuity Constraint, using edge conformity.

It is also noteworthy that the image-space match [19] is a special case of the range-space match: the case in which a virtual view coincides with one of the input views. The number of camera pairs for the image-space match is smaller, because the matching colors at a reference camera are known. In Figure 9, Camera 1 is assumed to be the reference camera.

Figure 9 The image-space match and the range-space match. No reference camera can be assumed in the range-space match. [Plots: SSD error versus disparity/baseline for the image-space match (pairs Camera1-Camera2, Camera1-Camera3, Camera1-Camera4), and SSD error versus 3D range for the range-space match (all six camera pairs), with segments broken by non-matching regions.]

In effect, the Space Carving technique [20] is also a special case of the Range-Space Approach. When the approximate location of the objects of interest in a scene is known, the volume that encloses the objects can be determined; this enclosed volume is a truncated window section of the matching curves in the Range-Space Approach. The Range-Space Approach is also more general in the sense that the voxels that make up the enclosed volume vary in size and shape with respect to the camera arrangements and the image resolutions, whereas Space Carving has many fixed-size cubical voxels that can be larger or smaller than necessary. Space Carving uses a single voxel for matching, so it requires many densely placed surrounding cameras to reduce matching noise; nevertheless, homogeneous background and foreground still cause noise ("floating voxels"). The Range-Space Approach, on the other hand, uses the Volumetric Matching Template (a group of voxels), robust statistics, and matching curve characteristics to induce best matches under the Continuity Constraint. The homogeneous regions have to conform to the 3D estimated at the edges; as a result, no floating voxels appear and no densely placed surrounding cameras are necessary.

4. Range-Space Render

Figure 10 A viewing angle (e.g. $\theta_1$, $\theta_2$) is the angle between a virtual ray and a physical ray that connects to a voxel. [Diagram: ODVS 1-4 with viewing angles marked.]

After the first true voxel is determined for a given virtual ray, both the depth $V^{k\varepsilon_a}$ and the color $P^{k\varepsilon_a}$ of the virtual pixel are recovered. The virtual pixel color is the weighted composite color of its corresponding pixels $P_i^{k\varepsilon_a}$, where $i = 1, \ldots, N_{CS}^{k\varepsilon_a}$ and $N_{CS}^{k\varepsilon_a}$ is the subset of the input cameras that are not matching outliers. The weights are a function of the viewing angles $\theta$ (refer to Figure 10) and the matching errors of the inliers. The composite virtual pixel color can be computed as

$$P^{k\varepsilon_a} = \left[ \sum_{i=1}^{N_{CS}^{k\varepsilon_a}} \Bigg( \sum_{\substack{j=1 \\ j \neq i}}^{N_{CS}^{k\varepsilon_a}} \theta_j \Bigg) \Bigg( \sum_{\substack{j=1 \\ j \neq i}}^{N_{CS}^{k\varepsilon_a}} SSD_j^{k\varepsilon_a} \Bigg) \right]^{-1} \sum_{i=1}^{N_{CS}^{k\varepsilon_a}} \Bigg( \sum_{\substack{j=1 \\ j \neq i}}^{N_{CS}^{k\varepsilon_a}} \theta_j \Bigg) \Bigg( \sum_{\substack{j=1 \\ j \neq i}}^{N_{CS}^{k\varepsilon_a}} SSD_j^{k\varepsilon_a} \Bigg) P_i^{k\varepsilon_a} \, .$$

With this weighted scheme, no single camera can fully dominate the composite color, even when the virtual ray coincides with a physical ray. Therefore, for the synthesized view to be clear and sharp, the estimated 3D must be accurate.

An alternative form of compositing the virtual pixel color can be expressed as

$$P^{k\varepsilon_a} = \left[ \sum_{i=1}^{N_{CS}^{k\varepsilon_a}} \prod_{\substack{j=1 \\ j \neq i}}^{N_{CS}^{k\varepsilon_a}} \theta_j \right]^{-1} \sum_{i=1}^{N_{CS}^{k\varepsilon_a}} \Bigg( \prod_{\substack{j=1 \\ j \neq i}}^{N_{CS}^{k\varepsilon_a}} \theta_j \Bigg) P_i^{k\varepsilon_a} \, .$$

This form does not include the matching error in its weighting function. When the virtual ray coincides with a physical ray, that camera dominates the composite color.
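A sketch of both weighting forms follows. Because the published equations do not survive cleanly in this copy, the weights below are reconstructed from the dominance properties stated in the text (leave-one-out sums of angles and errors for the first form, leave-one-out products of angles for the alternative form); read them as an assumption, not the paper's exact formula.

```python
import numpy as np

def composite_color(colors, thetas, ssds=None):
    """Blend inlier pixel colors for one virtual pixel.

    colors: (N, 3) corresponding inlier pixel colors P_i
    thetas: (N,) viewing angles between the virtual ray and each physical ray
    ssds  : (N,) inlier matching errors, or None for the alternative form

    With ssds given, camera i is weighted by the sums of the *other*
    cameras' angles and errors, so no single camera can fully dominate.
    With ssds omitted, camera i is weighted by the product of the other
    cameras' angles, so a camera whose physical ray coincides with the
    virtual ray (theta = 0) dominates the composite color completely.
    """
    colors = np.asarray(colors, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    n = len(thetas)
    if ssds is None:
        w = np.array([np.prod(np.delete(thetas, i)) for i in range(n)])
    else:
        ssds = np.asarray(ssds, dtype=float)
        w = np.array([np.delete(thetas, i).sum() * np.delete(ssds, i).sum()
                      for i in range(n)])
    return (w[:, None] * colors).sum(axis=0) / w.sum()
```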

4.1. Discussion on Virtual View Synthesis using the Range-Space Approach

In an ideal world we would either have extracted perfect 3D for the entire scene, or have sampled our input images at twice the Nyquist sampling rate with respect to the distances of the objects in the scene [21]. In reality, we nearly always face situations that lie far from these two idealized cases. As a matter of fact, virtual view synthesis is more fault tolerant than 3D extraction: a virtual pixel color can be synthesized correctly even though the recovered depth is not at its physically correct location. Therefore, we should consider virtual view synthesis as more of an image-based rendering problem than a 3D reconstruction/modeling problem. We can ask ourselves: Do we need a complete 3D model before generating a new view? Will a partial model be sufficient? Do we really need a model at all?

The Range-Space Approach can deal with the two idealized cases, as well as those significantly far from ideal, with consistent mechanisms regardless of the situation. The approach satisfies the three distinguishing requirements of true image-based rendering: 1) dealing with imperfectly extracted depth; 2) dealing with sparsely extracted 3D; 3) dealing with sparsely distributed cameras.

An important note regarding view synthesis: when we see a clear synthesized view, we can be quite sure that the underlying 3D of the virtual pixels is accurate, unless the color in the scene is rather homogeneous. In our experience, the ability to extract accurate 3D at the high-frequency edgels is sufficient to synthesize clear virtual views. When using Volume Growing, the more completely we can estimate accurate 3D on the edgels (particularly those edgels which form a closure of the object boundary), the more reliably we can synthesize a clear virtual view. This argument is valid when the virtual viewpoint is bounded within the video cluster; in other words, the virtual view is interpolated within, not extrapolated from, the video cluster. For interpolated viewpoints, the accuracy of 3D estimation in the homogeneous regions has little significance for the clarity of the synthesized view. The edgels work as the supporting structures for the homogeneous regions. Alternatively, the edgels work as the boundary conditions of the differential equations when the 3D estimation is treated as a variational problem, by means of a time-evolving surface governed by a PDE. Note that an incomplete 3D estimation at an object border can cause a "race condition" in the Volume Growing process: a depth-discontinued object becomes depth-connected with another object depending on which voxel first becomes available to its 8-neighbors. In other words, when a ray has multiple local minima, which of its local minima is selected depends upon which voxel is grown around the ray's 8-neighbors. As such, to be able to grow each depth-discontinued object, there must be at least one initial seed point detected for each depth-discontinued object in the matching process. To avoid race conditions in Volume Growing, we need as much 3D information along the borders of an object as possible.

This leads to a more reliable and clearer synthesized virtual view. Nevertheless, in our experiments with sparsely and imperfectly extracted 3D derived from sparsely distributed cameras, the Range-Space Approach is able to synthesize clear virtual views with high reliability and robustness.

5. Experimental Results

Figure 11 A scene for visual modeling.

Figure 12 Sensor layout. [Diagram: 36 cameras on a regular grid in a 258 in. by 128 in. room, with grid spacings of 7, 16, 20, and 31 inches marked; the highlighted cameras form the video cluster for the selected viewpoint.]

Virtual view synthesis experiments were performed inside a room (shown in Figure 11). The room measures 258 by 128 inches and is a box-like structure. The omni-directional images were captured using a single hyperboloidal mirror with a vertical field of view of about 270 degrees. The images were taken at regular intervals defined by the grids on the green cardboard (Figure 12). A total of 36 images are available to form various combinations of video clusters with many possible choices of baselines and image numbers. Camera 1 provides the reference camera coordinate frame, from which all the estimated 3D in the experiments is measured. The highlighted cameras are the ones that form a video cluster for the virtual viewpoints.

5.1. Virtual View Synthesis

Figure 13 Error comparison of synthesized views versus real views. [Plot: normalized pixel error (0-255) versus view index for the true Camera 22, Camera 22 without Growing, the eight input cameras (15, 16, 17, 21, 23, 27, 28, 29), and the uncorrelated Camera 1.]

For this experiment, we use Cameras 15, 16, 17, 21, 23, 27, 28, and 29 to synthesize views. The virtual viewpoint is at Camera 22, which is not an input camera. The synthesized panoramic view is compared with the views of Camera 22, the eight input cameras, and a camera that is situated far away and is uncorrelated. The view of the uncorrelated camera gives us a reference for comparison. The plot in Figure 13 clearly shows that the synthesized views have the lowest error when compared against the real views of Camera 22; none of the input cameras' views resembles the real views of Camera 22 as closely as the synthesized views do. Camera 1, the uncorrelated camera, shows no relation to the synthesized views; its error remains consistently above a normalized pixel error of 25. Before Volume Growing, the average view synthesis error is 7.43. After Volume Growing, this average falls by 1.21 to 6.22, an improvement of 16.35%.
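The error metric itself is not spelled out in this copy; as an assumption, the sketch below computes a normalized pixel error on the 0-255 scale as the mean absolute per-pixel difference, skipping the color-coded regions that are excluded from comparison.

```python
import numpy as np

def normalized_pixel_error(synthesized, real, valid_mask):
    """Mean absolute per-pixel difference on the 0-255 scale.

    valid_mask marks the pixels to compare; the red- (ODI cropping),
    orange- (low confidence), and gray-coded regions are left out, as
    in the Figure 13 and Figure 14 comparisons.
    """
    diff = np.abs(synthesized.astype(float) - real.astype(float))
    return float(diff[valid_mask].mean())
```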

Figure 14 shows real and synthesized views. Panorama 1 is the topmost view in the figure and Panorama 6 is the bottommost. Panoramas 1 to 3 are real; they are from Cameras 29, 15, and 22, respectively. The baseline measured from Camera 15 to Camera 29 is 51 inches, the longest within the video cluster. The synthesized Panoramas 4 and 5 are meant to resemble Panorama 3. Panorama 5 is the one before Volume Growing, and errors can be readily observed. Again, the red-coded regions in Panorama 5 are due to the cropping of the original ODI, and the orange-coded regions are occluded regions or regions with high color deviation. The regions color-coded red, orange, and gray are not compared. Panorama 6 shows the difference between Panorama 3 (real) and Panorama 4 (synthesized); the errors are mainly in the high-frequency regions. Figure 15 shows close-up views before and after Volume Growing. The dramatic improvement in view synthesis after Volume Growing is easily observed.

Figure 14 Real and synthesized panoramic views (Panoramas 1 to 6, top to bottom).

Figure 15 Synthesized views before (left) and after Volume Growing (right).

5.2. Demonstration of Smooth 3-D Virtual Walkthroughs

The smooth walk path is shown in Figure 16. The walk path is 168 inches long in total, proceeding from the bottom yellow point to the top point. Three video clusters are available; each cluster includes 10 cameras, and some cameras are shared by two video clusters. A total of 22 cameras were used for the entire path. When a viewpoint is within, or a little beyond, a cluster (based on the viewing direction and viewpoint location), that cluster is selected to synthesize views. From views 1 to 39 we used the first cluster, from views 40 to 77 the second cluster, and for the rest the third cluster. Ninety views were synthesized along this path. In this paper, we show the views at discrete sampling intervals of every 8th view in Figure 17; the view sequence runs from left to right and top to bottom. Figure 18 shows 12 consecutive smooth views, from the 75th view to the 86th view. The smoothness of the views is easily observed. The walkthrough includes simultaneous translation and rotation: the peacock on the mural becomes larger in view, and less of the chair is seen at the end of the walk.

Figure 16 Smooth walk path and video clusters.


Figure 17 Twelve discrete synthesized views sampled from the smooth virtual walkthrough.

Figure 18 Twelve consecutive synthesized views extracted from the 75th view to the 86th view.

6. Concluding Remarks

In this paper, we have introduced a range-space searching, matching, and rendering technique. This system allows viewers to walk through a dynamic environment with freely chosen viewpoints. When these views are synthesized from several wide-view ODIs, only the necessary 3D is derived. The approach can be applied to arbitrary camera arrangements, image types, image resolutions, and image numbers. The processing time varies linearly with the number of input cameras, the resolution of real and virtual images, and the locations of the real and virtual cameras. The range-space search overcomes the problems of scaling effects, foreshortening effects, and window cutoff, three of the five major research challenges of wide-baseline stereo. Cameras within a video cluster are chosen using robust statistics to handle occlusions and specular highlights. Volume Growing has a major effect in reducing most of the false matches. We have defined Error Regions, which are the basis for Volume Growing. Both color matches and color-edge matches are processed in parallel with identical mechanisms. The derived matching attributes are combined and analyzed in the range space to locate confident seeds, and these confident seeds are used to correct their 8-neighboring voxels under the Continuity Constraint. Low-confidence regions are filled via geometrical interpolation. The virtual pixel color is a function of the viewing angles and matching errors. The results show clear and reliable virtual view synthesis: an overall average error of 6.22 normalized pixel error and an overall average improvement of 16.35% over the results before Volume Growing. Viewers can actively explore the scene, and smooth walkthroughs are constructed by assembling sequences of synthesized views.

7. Acknowledgements

Our research was supported by the California Digital Media Innovation Program (DiMI) in partnership with Sony Electronics and Compaq Computers. We are pleased to acknowledge the assistance of our colleagues in the CVRR Laboratory, especially Mr. Rick Capella and Mr. Nils Lassiter, who helped in the design of the AVIARY testbed.

REFERENCES

[1] B. Heigl, R. Koch, M. Pollefeys, J. Denzler, and L. Van Gool, “Plenoptic Modeling and Rendering from Image Sequences Taken by Hand-Held Camera,” Proc. of DAGM, p. 94-101, 1999.

[2] K. C. Ng, M. Trivedi, and H. Ishiguro, “3D Ranging and Virtual View Generation using Omni-view Cameras,” Proc. of SPIE Multimedia Systems and Applications, vol. 3528, Boston, November 1998.

[3] K. C. Ng, 3D Visual Modeling and Virtual View Synthesis: A Synergetic, Range-Space Stereo Approach using Omni-Directional Images, Ph.D. Dissertation, University of California, San Diego, March 2000.

[4] K. C. Ng, M. Trivedi, and H. Ishiguro, “Range-Space Approach for Generalized Multiple Baseline Stereo and Direct Virtual View Synthesis,” IEEE Workshop on Stereo and Multi-Baseline Vision, Kauai, Hawaii, December 9–10, 2001.

[5] K. C. Ng, H. Ishiguro, M. Trivedi, and T. Sogo, “Monitoring Dynamically Changing Environments by Ubiquitous Vision System,” IEEE Workshop on Visual Surveillance, Fort Collins, Colorado, p.67-73, June 1999.

[6] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen, “Unstructured Lumigraph Rendering,” Proc. of Siggraph, Los Angeles, p.425-32, August 12-17, 2001.

[7] S. Kang, “A Survey of Image-based Rendering Techniques,” Proc. of SPIE, vol.3641, San Jose, California, p.2-16, January 1999.

[8] Z. Zhang, “Image-based Geometrically Correct Photorealistic Scene/Object Modeling: A Review,” Proc. Asian Conference on Computer Vision, p.279-88, 1998.

[9] K. C. Ng, H. Ishiguro, and M. Trivedi, “Multiple Omni-Directional Vision Sensors (ODVS) based Visual Modeling Approach,” Conference & Video Proc. of IEEE Visualization, San Francisco, California, October 1999.

[10] D. Aliaga and I. Carlbom, “Plenoptic Stitching: A Scalable Method for Reconstructing 3D Interactive Walkthroughs,” Proc. of Siggraph, Los Angeles, p.443-50, August 12-17, 2001.

[11] S. Seitz and C. Dyer, “Photorealistic Scene Reconstruction by Voxel Coloring,” Int. Journal of Computer Vision, vol. 35, (no. 2), p.151-73, 1999.

[12] Z. Wang and N. Ohnishi, “Deformable Template based Stereo,” Proc. of IEEE International Conference on Systems, Man and Cybernetics, vol.5, Vancouver, BC, Canada, p.3884-9, October 1995.

[13] M. Maimone and S. Shafer, “Modeling Foreshortening in Stereo Vision using Local Spatial Frequency,” Proc. of IEEE/RSJ Int. Conference on Intelligent Robots and Systems, vol.1, Pittsburgh, PA, p.519-24, August 1995.

[14] S. Marapane and M. Trivedi, “Multi-primitive Hierarchical (MPH) Stereo Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.16, (no.3), March 1994.

[15] P. Rander, P. Narayanan, and T. Kanade, “Recovery of Dynamic Scene Structure from Multiple Image Sequences,” Proc. of International Conference on Multisensor Fusion and Integration for Intelligent Systems, Washington D.C., p.305-12, December 1996.

[16] M. Lhuillier, “Efficient Dense Matching for Textured Scenes using Region Growing,” Proc. of British Machine Vision Conference, Southampton, UK, vol.2, p.700-9, September 1998.

[17] T. Pavlidis and Y. Liow, “Integrating Region Growing and Edge Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.12, (no.3), p.225-33, March 1990.

[18] J. Xuan, T. Adali, and Y. Wang, “Segmentation of Magnetic Resonance Brain Image: Integrating Region Growing and Edge Detection,” Proc. of International Conference on Image Processing, Washington, DC, vol.3, p.544-7, October 1995.

[19] M. Okutomi and T. Kanade, “A Multiple Baseline Stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.15, (no.4), p.353-63, April 1993.

[20] K. Kutulakos and S. Seitz, “A Theory of Shape by Space Carving,” International Journal of Computer Vision, vol. 38, (no. 3), p. 199-218, 2000.

[21] J. Chai, X. Tong, S. Chan, and H. Shum, “Plenoptic Sampling,” Proc. of Siggraph, p.307-18, July 2000.