
Interactive 3D Video for Multimedia Applications



Information Technology and Electrical Engineering - Devices and Systems, Materials and Technologies for the Future

Faculty of Electrical Engineering and Information Technology

Home / Index: http://www.db-thueringen.de/servlets/DocumentServlet?id=14089

54. IWK Internationales Wissenschaftliches Kolloquium

International Scientific Colloquium

07 - 10 September 2009 PROCEEDINGS


Imprint

Publisher: The Rector of Technische Universität Ilmenau, Univ.-Prof. Dr. rer. nat. habil. Dr. h. c. Prof. h. c. Peter Scharff

Editorial office: Referat Marketing, Andrea Schneider; Fakultät für Elektrotechnik und Informationstechnik, Univ.-Prof. Dr.-Ing. Frank Berger

Editorial deadline: 17 August 2009

Technical realization (USB flash edition): Institut für Medientechnik at TU Ilmenau, Dipl.-Ing. Christian Weigel, Dipl.-Ing. Helge Drumm

Technical realization (online edition): Universitätsbibliothek Ilmenau, Postfach 10 05 65, 98684 Ilmenau

Published by: Verlag ISLE, Betriebsstätte des ISLE e.V., Werner-von-Siemens-Str. 16, 98693 Ilmenau

© Technische Universität Ilmenau (Thür.) 2009

This publication and all contributions and figures contained in it are protected by copyright.

ISBN (USB flash edition): 978-3-938843-45-1
ISBN (printed edition of the abstracts): 978-3-938843-44-4

Home / Index: http://www.db-thueringen.de/servlets/DocumentServlet?id=14089


INTERACTIVE 3D VIDEO FOR MULTIMEDIA APPLICATIONS - CONCEPTS AND SYSTEM

Christian Weigel, Sebastian Schwarz, Torsten Korn, Martin Wallbohr

Institute for Media Technology, Ilmenau University of Technology, Postfach 10 05 65, 98684 Ilmenau, Germany

email: [email protected]

ABSTRACT

In this paper we present an approach for interactive free viewpoint video (also called 3D video objects) which, in its most simple form, only relies on two cameras to synthesize novel views. The algorithms we use for analysis and synthesis are implemented using a customizable high-performance software system. Due to a modular approach we can implement a number of multimedia applications. We show the possibilities of the system with an example where a 3D video object is embedded into a synthetic scene.

Index Terms— Free Viewpoint Video, 3D Video, 3DTV

1. INTRODUCTION

At the moment 3D video is a widely discussed topic in the research community. Besides stereoscopic presentation, which is already entering the movie and TV market, so-called free viewpoint video gains a lot of interest as well. The idea behind free viewpoint video is to record a scene with a number of cameras in order to get a depth representation of the scene which is used to synthesize novel views of it. A number of common extraction techniques for 3D objects have a computer graphics mesh as output [1] [2] [3]. These meshes are rendered using dynamic texturing with the RGB images from the cameras. To obtain good results with this method the input must consist of images from at least 8 cameras. In contrast, we only use pixel data to synthesize novel views of the scenery.

In [4] we presented an algorithm for rendering free viewpoint video that uses information from a stereo camera pair only. Compared to the model-based approaches we used an image-based approach. Although it is not sufficient for covering the whole viewing volume around the object, we could achieve good results using extrapolation based on the trifocal transfer first introduced in [5], while the effort for capturing the object was much lower compared to other techniques. In section 2 of this paper we review the preprocessing and synthesis steps. The extension of the algorithm to handle additional stereo pairs and the decision scheme determining which stereo pair is used to synthesize the virtual view is presented in section 3. The interactive viewing as well as the combination of the free viewpoint object with a synthetic scene is described in section 4, followed by some results given in section 5.

2. IMAGE-BASED FREE VIEWPOINT VIDEO FROM STEREO

Obviously, using the information from only two cameras is not sufficient to get full virtual view latitude around the object. But they already provide enough input data to achieve fair virtual view latitude at positions close to the stereo pair. For some applications this is an adequate compromise. To obtain the representation we need for interactive rendering from a stereo setup, the following steps are necessary: 1. stereo camera calibration, 2. image rectification, 3. disparity estimation.

The camera calibration step has two purposes: the estimation of the rectification matrices for a stereo image and the estimation of the external and internal camera parameters used for interactive rendering, respectively.

Having only a stereo pair as input, we need to arrange the cameras in a convergent setup to obtain more image information. In this setup the optical axes of the cameras approximately intersect at a point within the object that is recorded. In order to simplify the subsequent step of finding corresponding pixels across both images, they need to be rectified. Briefly summarized, the rectification transforms the images such that they appear to be recorded with cameras in parallel alignment. In the ideal case, after rectification all pixels in the left and right image that correspond to the same point in 3D space lie on the same line (y-coordinate) of the image.

In a last step we estimate, for all pixels, the offset between a pixel in the left image and the corresponding pixel in the right image, both being projections of the same 3D point. The estimation heavily depends on the algorithm used and on the quality of the rectification process. The disparity estimation algorithm we currently use is based on the dissimilarity measure described by Birchfield and Tomasi [6]. Finally, we filter the disparity image with a median filter to reduce outliers. We are aware that much more advanced algorithms for disparity estimation are available [7]. However, we noticed that most of them focus on baseline interpolation. While they perform very well for this purpose, these algorithms are not always the best choice for a synthesis based on trifocal transfer. Wrong disparities, especially in homogeneous image regions, do not always produce visible artifacts in baseline interpolation but can cause strong visual artifacts when using trifocal transfer. Nonetheless, it is possible to choose different algorithms for disparity estimation. The disparities are usually stored as 8 bit grayscale values, the so-called disparity map.

As input representation for interactive rendering we use the original left and right images and the rectified disparity map. Figure 1 illustrates all steps.

Fig. 1. Images a and b: left and right view captured by a convergent setup. Images c and d: rectified left and right view. The images are also horizontally shifted in order to always obtain positive disparity values. Image e: the disparity map from the left to the right image (gamma level increased for illustration purposes).
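Purely as an illustration of this preprocessing chain (calibration data, rectification, disparity estimation, median filtering), a sketch using OpenCV could look as follows. OpenCV's StereoSGBM matcher is only a stand-in for the Birchfield-Tomasi based estimator actually used, and the structure of the calibration data is an assumption.

```python
# Hypothetical preprocessing sketch (not the ReVOGS implementation):
# rectify a convergent stereo pair, estimate disparities, median-filter
# them and store the result as an 8 bit disparity map.
import cv2
import numpy as np

def preprocess_stereo_pair(img_left, img_right, calib):
    """`calib` is assumed to hold stereo calibration results:
    camera matrices K1/K2, distortions d1/d2, rotation R and
    translation T between the two cameras."""
    h, w = img_left.shape[:2]

    # 1. Rectification transforms from the calibration data.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(
        calib["K1"], calib["d1"], calib["K2"], calib["d2"], (w, h),
        calib["R"], calib["T"])
    maps_l = cv2.initUndistortRectifyMap(calib["K1"], calib["d1"], R1, P1,
                                         (w, h), cv2.CV_32FC1)
    maps_r = cv2.initUndistortRectifyMap(calib["K2"], calib["d2"], R2, P2,
                                         (w, h), cv2.CV_32FC1)
    rect_l = cv2.remap(img_left, maps_l[0], maps_l[1], cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_right, maps_r[0], maps_r[1], cv2.INTER_LINEAR)

    # 2. Disparity estimation on the rectified pair (StereoSGBM also
    #    builds on the Birchfield-Tomasi dissimilarity, but it is only
    #    a placeholder for the algorithm used in the paper).
    gray_l = cv2.cvtColor(rect_l, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(rect_r, cv2.COLOR_BGR2GRAY)
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                    blockSize=7)
    disparity = matcher.compute(gray_l, gray_r).astype(np.float32) / 16.0

    # 3. Median filter to suppress outliers, then store as 8 bit map.
    disparity = cv2.medianBlur(disparity, 5)
    disp_8bit = cv2.normalize(disparity, None, 0, 255,
                              cv2.NORM_MINMAX).astype(np.uint8)
    return rect_l, rect_r, disp_8bit
```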

A novel view is synthesized using the texture information of one reference view, one associated disparity map, the rotation and translation between the two reference cameras, and a specified position and orientation of the virtual camera in relation to the left camera. Using the method of the trifocal transfer, a novel view can be obtained by solving an overdetermined system of equations [5], which is called forward mapping. A detailed description of how we obtain the trifocal tensor can be found in [4].
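Leaving the tensor algebra of [4] and [5] aside, the forward mapping can be pictured as follows: each reference pixel with its disparity is transferred to a target position in the virtual view, and pixels that receive no assignment become holes (see the next paragraph). The sketch below assumes a hypothetical `transfer(x, y, d)` function encapsulating the trifocal transfer.

```python
# Generic forward-warping sketch; the actual trifocal transfer that
# yields the target positions is described in [4] and [5].
import numpy as np

def forward_map(reference, disparity, transfer):
    """Warp the reference view into the virtual view.

    `transfer(x, y, d)` is assumed to return the target pixel position
    for reference pixel (x, y) with disparity d.  Returns the warped
    image and a mask of pixels that received no assignment (holes)."""
    h, w = disparity.shape
    warped = np.zeros_like(reference)
    hole_mask = np.ones((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            tx, ty = transfer(x, y, disparity[y, x])
            tx, ty = int(round(tx)), int(round(ty))
            if 0 <= tx < w and 0 <= ty < h:
                warped[ty, tx] = reference[y, x]   # copy the texture sample
                hole_mask[ty, tx] = False          # mark pixel as assigned
    return warped, hole_mask
```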

The forward mapping causes a number of missing pixel assignments. One reason for this effect is occlusions in one of the reference images and thus missing disparities, which is a well known problem. The second reason is missing image information, e.g. when placing the virtual camera very close to the captured object. We solve this problem with a simple neighborhood filling algorithm which achieves good results at low computational complexity.
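A minimal sketch of such a neighborhood filling step, assuming the hole mask produced by the forward mapping above, could look like this; it is a simplified stand-in, not necessarily the exact filter used in ReVOGS.

```python
# Simplified neighborhood filling for unassigned pixels after forward
# mapping: each hole is replaced by the mean of its valid neighbors.
import numpy as np

def fill_holes(image, hole_mask, window=1):
    """Fill pixels marked in `hole_mask` with the mean of the valid
    pixels in a (2*window+1)^2 neighborhood."""
    filled = image.copy()
    h, w = hole_mask.shape
    ys, xs = np.nonzero(hole_mask)
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - window), min(h, y + window + 1)
        x0, x1 = max(0, x - window), min(w, x + window + 1)
        patch = image[y0:y1, x0:x1]
        valid = ~hole_mask[y0:y1, x0:x1]
        if valid.any():
            filled[y, x] = patch[valid].mean(axis=0)
    return filled
```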

3. ADDING STEREO PAIRS

The algorithm described in the previous section yields a respectable synthesis result. Yet the synthesis with just one reference view still has some flaws. The two major problems to be solved are occlusions and the latitude of the virtual view.

There are many solutions for implementing additional reference views to solve these problems, e.g. using the quadrifocal tensor [8], [9] or sampling density maps [10]. In this section we present a new method combining some of those results for a high-quality real-time synthesis.

Basically, only two reference views are needed to generate a virtual view. We found a rotation of 10 degrees on the x-axis to be the best offset between the reference views. This relatively small distance results in very few forward-mapping errors and barely any partial occlusions, but still gives a comparatively good result off the baseline. We call this setup a stereo pair.

By using several stereo pairs in one scene we can dramatically increase the virtual view latitude. But combining the image information from all stereo pairs would require a number of view syntheses performed in parallel. With today's computers this is too time consuming to achieve real-time rates. Therefore we only use one pair as source for one virtual view at a time. The decision about which stereo pair S_i we use is primarily based on the Euclidean distance ν between the camera pair position S_pair = [x_pair, y_pair, z_pair]^T and the virtual view position S_virt = [x_virt, y_virt, z_virt]^T. The closest stereo pair (S_a for virtual views A and B in the example case shown in Fig. 2) mostly delivers the best result.

\nu_{pair} = \sqrt{(x_{virt} - x_{pair})^2 + (y_{virt} - y_{pair})^2 + (z_{virt} - z_{pair})^2} \quad (1)

In some situations this might not be true. Therefore we introduce an "online" quality check algorithm based on the sampling density. We calculate the displacement ρ between the morphed pixel p'_i = [x'_i, y'_i]^T and its origin p_i = [x_i, y_i]^T.

\rho(S_i) = \frac{1}{N} \sum_{i=0}^{N} \sqrt{(x'_i - x_i)^2 + (y'_i - y_i)^2} \quad (2)

N is the number of pixels per frame. The average displacement ρ is a measure of the quality of the view synthesis result [10]. Within a given radius σ (blue circle in Fig. 2) around the stereo pair S_a, this pair is considered "perfect" and no decision is required. Outside this threshold ReVOGS performs a trifocal transfer for S_a and the next closest pair S_b and decides based on the sampling displacement ρ.

S_{act} = \begin{cases} S_a & \forall\, S_{virt} \mid \nu(S_a) \le \sigma \\ S_a & \forall\, S_{virt} \mid \nu(S_a) > \sigma \text{ and } \rho(S_a) \le \rho(S_b) \\ S_b & \forall\, S_{virt} \mid \nu(S_a) > \sigma \text{ and } \rho(S_a) > \rho(S_b) \end{cases} \quad (3)

Once this decision is made, only one transfer step is necessary unless the virtual view jumps or moves away from the actual stereo pair; then a new decision is made. By reducing the necessary transfer steps to a minimum the system obtains its real-time qualities. Fig. 2 shows a simple camera setup with two stereo pairs.

Fig. 2. Camera setup example with two stereo pairs. Only virtual view C requires a quality decision. View A is within σ(S_a) and view B is within σ(S_a) and σ(S_b) but closer to S_a.
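As a compact sketch of this decision scheme, assuming each stereo pair exposes its position and that a trifocal transfer routine reports the mapped pixel positions together with their origins, the logic of equations (1) to (3) could be written as follows (function and attribute names are illustrative, not the actual ReVOGS interfaces):

```python
# Illustrative stereo pair selection following equations (1)-(3).
import numpy as np

def select_stereo_pair(pairs, virt_pos, sigma, trifocal_transfer):
    """Choose the stereo pair used to synthesize the virtual view.

    `pairs` holds at least two objects with a `position` 3-vector,
    `virt_pos` is the virtual view position, `sigma` the "perfect"
    radius, and `trifocal_transfer(pair, virt_pos)` is assumed to
    return the mapped pixel positions and their origins."""
    # Eq. (1): Euclidean distance between virtual view and pair position.
    def nu(pair):
        return float(np.linalg.norm(virt_pos - pair.position))

    ordered = sorted(pairs, key=nu)
    s_a, s_b = ordered[0], ordered[1]

    # Inside the radius sigma the closest pair is considered "perfect".
    if nu(s_a) <= sigma:
        return s_a

    # Eq. (2): average per-pixel sampling displacement rho.
    def rho(pair):
        mapped, origin = trifocal_transfer(pair, virt_pos)
        return float(np.mean(np.linalg.norm(mapped - origin, axis=1)))

    # Eq. (3): keep S_a unless S_b samples the virtual view more densely.
    return s_a if rho(s_a) <= rho(s_b) else s_b
```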

The stereo pair concept leads to a highly scalable, yet still real-time FVV system with very few restrictions. The setup between the individual stereo pairs is not constrained and can be modified to match any application.

4. INTERACTIVE RENDERING

The Realistic Video Object Generation System, briefly ReVOGS, is a modularly constructed imaging software for view synthesis and quality assessment of free viewpoint video (i.e. 3D video objects, 3DVO). The system uses specific image processing modules which are combined in a workflow specified by an XML file. The workflows, each of which constitutes a different algorithm chain, can be intuitively adapted via a GUI or at the text level. In our first implementation as described in section 2 we implemented a virtual camera path with different viewing angles of the 3DVO by manually editing an XML file. This limits the path to be fixed during viewing. Since rendering on the CPU is already fast enough, we have developed a new control module providing interactive rendering of 3DVOs. Additionally, we combined the 3DVO with a synthetic OpenGL scene which can be explored by an interactively moving camera. We will briefly explain the two viewing modes in the following sections.
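The XML workflow format itself is not specified in this paper. Purely to illustrate the idea of an XML-defined module chain, a hypothetical workflow description and a minimal loader might look like this; the schema, element names and module names are invented and not taken from ReVOGS.

```python
# Illustrative only: a made-up workflow description in the spirit of the
# XML-driven module chains described above, plus a minimal loader.
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<workflow name="observe-mode">
  <module type="StereoSource"      params="pair=0"/>
  <module type="TrifocalTransfer"  params="hole_filling=neighborhood"/>
  <module type="InteractiveViewer" params="mode=observe"/>
</workflow>
"""

def load_workflow(xml_text, registry):
    """Build a list of module instances from a workflow description.

    `registry` maps module type names to constructors that take a
    parameter string."""
    root = ET.fromstring(xml_text)
    chain = []
    for node in root.findall("module"):
        ctor = registry[node.get("type")]
        chain.append(ctor(node.get("params", "")))
    return chain
```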

4.1. Object observation

We call the basic display mode "observe mode". This mode directly displays the result of the trifocal transfer, which is a physically correct view of the 3DVO. We implemented an interactive control module to provide an intuitive environment. This can be used, for example, for quality assessment and error analysis. The virtual camera can be moved in each direction and rotated around the 3DVO via mouse and keyboard input. According to the number of stereo pairs in use, and to prevent loss of orientation, we set adjustable borders which restrict the rotation angles around the 3DVO. Within these borders the user can view the 3DVO from any point of view.
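A minimal sketch of such a restricted orbit control, with clamped azimuth and elevation borders, could look as follows; the limits and input bindings are illustrative, not the actual ReVOGS controls.

```python
# Illustrative orbit control with adjustable angle borders.
import numpy as np

class OrbitControl:
    """Rotate a virtual camera around the 3DVO with clamped angles."""

    def __init__(self, az_limits=(-45.0, 45.0), el_limits=(-20.0, 20.0)):
        self.azimuth = 0.0      # rotation around the vertical axis (deg)
        self.elevation = 0.0    # rotation around the horizontal axis (deg)
        self.az_limits = az_limits
        self.el_limits = el_limits

    def on_mouse_drag(self, dx, dy, sensitivity=0.2):
        # Keep the view inside the adjustable borders so the user never
        # leaves the region covered by the available stereo pairs.
        self.azimuth = float(np.clip(self.azimuth + sensitivity * dx,
                                     *self.az_limits))
        self.elevation = float(np.clip(self.elevation + sensitivity * dy,
                                       *self.el_limits))
        return self.azimuth, self.elevation
```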

4.2. Embedding into synthetic scenes

Many of today's multimedia applications such as games or virtual meeting rooms rely on synthetic computer graphics scenes. Although the level of realism of interactive computer graphics has increased dramatically in the last few years, it is still clearly visually distinguishable from captured natural scenes. Our aim is to increase the level of realism of synthetic scenes by adding 3DVOs to them. At first glance, using an image-based rendering approach as described in the previous sections seems to have disadvantages compared to a model-based approach. In the model-based approach the scene representation (i.e. the mesh) is already the same as for the synthetic scene, thus combining the two is an easy task. However, we still find that such models often look "computer graphics-like". Also, the usage of a mesh-based representation is not as flexible as our method in terms of the number of cameras needed to obtain the representation.

We found a simple but convincing way to integrate views synthesized by the trifocal transfer into the synthetic scene. We call this "scene mode" within our system. When the pixel positions of the virtual view have been sampled we simply project the result as a texture onto a billboard. The billboard is nothing more than a simple plane in the OpenGL 3D space that always faces towards the viewer.
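The orientation of such a billboard can be derived from the current camera position with standard look-at math; the following sketch, assuming positions given as numpy 3-vectors, is generic billboard geometry and not code from ReVOGS.

```python
# Generic billboard orientation: turn the plane's normal towards the camera.
import numpy as np

def billboard_rotation(billboard_pos, camera_pos, world_up=(0.0, 1.0, 0.0)):
    """Rotation matrix whose columns are the billboard's local axes,
    with the plane normal pointing towards the viewer (assumes the
    camera is not directly above or below the billboard)."""
    z = np.asarray(camera_pos, float) - np.asarray(billboard_pos, float)
    z = z / np.linalg.norm(z)                # normal towards the viewer
    x = np.cross(world_up, z)
    x = x / np.linalg.norm(x)                # horizontal plane axis
    y = np.cross(z, x)                       # vertical plane axis
    return np.stack([x, y, z], axis=1)
```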

The synthesis algorithm also provides an alpha mask as output, which we also apply to the billboard to make only the part of the plane visible that is covered by the object. The texture of the billboard is updated with the output of the synthesis algorithm each time the viewing position changes, thus providing the impression of moving around the object.

One problem raised by this method has not yet been discussed. The perspective projection rendering of the synthetic scene causes the billboard and its texture to be scaled in size when moving the virtual camera towards or away from it. Since the virtual camera position is also an input to the synthesis algorithm, the texture would get scaled twice: once by the scene projection and once by the synthesis algorithm. Fortunately, we can easily avoid this effect by passing only the rotational components from the virtual camera to the synthesis algorithm while discarding any translation. This can be done without changing the virtual geometry of the object since both the synthesis algorithm and the synthetic scene rendering use the same projection model, which is a simple pin-hole camera. Only the hole filling technique described in section 2 (when applied to holes caused by upscaling) is substituted by the interpolation method provided by the graphics hardware.
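The idea of handing only the rotational part of the virtual camera pose to the synthesis step can be sketched as follows, assuming a 4x4 pose matrix; the function names are ours, not ReVOGS's.

```python
# Strip the translation from the virtual camera pose so the synthesized
# texture is not scaled a second time on top of the scene projection.
import numpy as np

def rotation_only_pose(camera_pose):
    """Return a copy of a 4x4 camera pose with its translation removed.

    The synthetic scene rendering already applies the perspective
    scaling of the billboard, so the synthesis algorithm only needs the
    orientation of the virtual camera."""
    pose = np.asarray(camera_pose, dtype=np.float64).copy()
    pose[:3, 3] = 0.0          # drop the translational component
    return pose

# Usage idea: update the billboard texture on each camera move with
# synthesize_view(rotation_only_pose(virtual_camera_pose)), where
# synthesize_view stands for the trifocal-transfer renderer.
```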

Although it is not completely physically correct, using the billboard to display the 3DVO yields a very good 3D impression in most use cases. Certainly there are cases where the method does not work. For example, synthetic objects cannot be partially covered (e.g. by an arm) by the 3DVO when it is very close to them. But for most of our use cases such close interactions are not required and the 3D impression is kept.

Finally, we implemented an interactive control similar to the one in the "observe mode". The user has the possibility to fly through the scene and to examine all objects, comparable to modern computer games.

5. RESULTS

The procedure described in the previous sections is quite simple but provides satisfactory results. We can easily change the viewing latitude by adding or removing the data of additional stereo pairs. The synthetic scene is easily exchangeable, the implementation of the billboard is anything but difficult, and this simplification is barely noticeable. The viewing modes demonstrate a sample application for different use cases like virtual 3D worlds or subjective quality assessment for free viewpoint video. Fig. 3 shows some screenshots. Although only some simple synthetic primitives are used for demonstration purposes, it gives a good impression of how 3DVOs and synthetic scenes can work together. ReVOGS renders the virtual view obtained from SD television input data (resolution 768x576) at 10 frames per second using a non-optimized brute-force implementation of the view synthesis algorithm on a standard PC (Pentium 4, 3.2 GHz, 2 GiB RAM).

Fig. 3. The images show from left to right: the 3DVO in the observer mode (a, b); the 3DVO embedded in two synthetic scenes (c-f).

6. REFERENCES

[1] Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel, "Free-viewpoint video of human actors," ACM Trans. Graph., 2003.

[2] A. Smolic, K. Muller, P. Merkle, T. Rein, M. Kautzner, P. Eisert, and T. Wiegand, "Free viewpoint video extraction, representation, coding, and rendering," in ICIP, 2004, pp. 3287-3290.

[3] E. de Aguiar, C. Stoll, N. Ahmed, and H.-P. Seidel, "Performance capture from sparse multi-view video," in Proc. of ACM SIGGRAPH 2008, 2008.

[4] C. Weigel and L. Kreibich, "Advanced 3D video object synthesis based on trilinear tensors," in Proc. ISCE, 2006.

[5] S. Avidan and A. Shashua, "Novel View Synthesis by Cascading Trilinear Tensors," IEEE Transactions on Visualization and Computer Graphics, vol. 4, no. 4, pp. 293-306, 1998.

[6] S. Birchfield and C. Tomasi, "A pixel dissimilarity measure that is insensitive to image sampling," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, pp. 401-406, 1998.

[7] D. Scharstein and R. Szeliski, "Middlebury Stereo Evaluation Web Page," 2009.

[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2nd edition, 2004.

[9] R. Hartley and F. Schaffalitzky, "Reconstruction from projections using Grassmann tensors," in Proceedings ECCV, Prague, 2004.

[10] E. Cooke and N. O'Connor, "Multiple image view synthesis for free viewpoint," in Proceedings ICIP, Genoa, Italy, 2005.