
General Bandwidth Reduction Approaches for Immersive LHRD Videoconferencing

Malte Willert∗, Stephan Ohl†, Oliver Staadt‡
University Rostock

ABSTRACT

Systems for immersive videoconferencing utilizing LHRDs face the problem of generating large amounts of data. A solution to reduce these data is essential to realize such systems. We present different possible approaches to do this and give an overview of where they are applicable. This provides the reader with a first guide for design decisions when implementing such systems.

Index Terms: Computer Graphics [I.3.7]: Three-Dimensional Graphics and Realism—Virtual Reality

1 INTRODUCTION

In recent years, the demand for systems that allow remote collaboration and immersive videoconferencing for geographically distributed users has increased steadily. Today, these systems deliver a video image corresponding approximately to HDTV resolution (e.g., [2]). For lifesize collaborative work and free viewpoint teleimmersive systems, however, there is a need for much higher resolution. Those systems record the users with sparse or dense camera arrays. Therefore, these systems face the problem of limitations with regard to available bandwidth and data throughput. Our system described in [6], which uses a large high-resolution display wall (LHRD), for example, will generate more data per second than Gigabit Ethernet can transport. In this poster, we show how the acquired data could be reduced dynamically and losslessly. We do this by comparing different strategies. We give an overview of how others cope with this problem and discuss further research ideas. The main contribution of our work is an overview and evaluation of strategies to reduce the overall amount of data.

2 PROBLEM ANALYSIS

A free viewpoint lifesize teleimmersive system, like the one introduced in [6], typically consists of multiple nodes for acquisition, processing and rendering. An array of cameras and depth sensors is connected to a number of acquisition nodes. Furthermore, the local and remote users’ head positions are tracked for view-dependent rendering.

The generated data need to be transmitted and processed before being rendered at the remote site. Using the formulas from [6] for an approximate requirements estimation, we need a minimum of 12 cameras to be integrated into the wall (full-view distance 40 cm; two camera rows with 82 cm spacing; six camera columns with 70 cm spacing; upper row camera tilt ≈ 42°). As mentioned in previous work, using this setup we need to acquire the images at 800 × 600 pixels to render in an appropriate resolution. This means that at a desired interactive framerate of 25 FPS the system will generate 137.4 MiB/s of raw data. After demosaicing we end up with a total of 414 MiB/s.
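For a quick sanity check, these figures can be reproduced with a back-of-the-envelope calculation. The following sketch is ours, assuming 8-bit Bayer-pattern raw frames and 24-bit RGB after demosaicing:

    # Back-of-the-envelope bandwidth estimate for the camera array.
    # Assumptions (ours): 8-bit Bayer raw, 24-bit RGB after demosaicing.
    NUM_CAMERAS = 12
    WIDTH, HEIGHT = 800, 600
    FPS = 25
    MIB = 1024 ** 2

    raw_bps = NUM_CAMERAS * WIDTH * HEIGHT * FPS * 1   # 1 byte/pixel (Bayer)
    rgb_bps = NUM_CAMERAS * WIDTH * HEIGHT * FPS * 3   # 3 bytes/pixel (RGB)

    print(f"raw:        {raw_bps / MIB:.1f} MiB/s")    # ~137.3 MiB/s
    print(f"demosaiced: {rgb_bps / MIB:.1f} MiB/s")    # ~412 MiB/s (paper: 414 MiB/s)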

∗e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]

Table 1: Possible Architectures

Type   Back Channel   Rendering Method   Rendering Site
A      Yes            Image Based        Display Site
B      Yes            Image Based        Local Site
C      Yes            Model Based        Display Site
D      Yes            Model Based        Local Site
E      No             Model Based        Display Site
F      No             Image Based        Display Site

This enormous amount of generated data cannot be processed and transmitted as-is and therefore needs to be reduced.

3 OUR APPROACH

A naive idea would be to use standard compression algorithms like real-time H.264 coding of the recorded video streams. This is not applicable in our case since we want to reduce the amount of data to be processed as well (see Section 2). Instead we want to take advantage of the observation that just a small part of the recorded scene is dynamic content, and that we know the positions of the local and remote users.

These observations lead to the discussion of the following approaches: changing camera resolution, dynamic camera selection, frustum selection, and change detection.

We will analyze these approaches in terms of possible data reduction and their applicability for different system architectures. For this analysis we use a model user with a height of 210 cm and a width of 70 cm. His arm and leg span is considered to be 200 cm. In [3] the authors give a detailed overview of possible system architectures. The architecture decisions that we refer to are presented in Table 1. Back Channel means we know the remote user’s viewing frustum.

3.1 Resolution Adaptation

Compared to our LHRD, we acquire images at moderate resolutions (55 million display pixels ≫ 800 × 600 camera pixels). This is possible due to the combination of multiple cameras and the introduction of a virtual space between the two users. In [6] we have shown that in some cases the camera resolution needs to be higher than in others. In general, it would be possible to record at the highest possible resolution and lower it dynamically. This has the disadvantage of generating more local data, but we would benefit in situations where both users stand far away from the display. Reducing the already moderate camera resolution would only make sense in the case of a large low-resolution display at the remote site. We do not consider that asymmetric setting.

3.2 Dynamic Camera Selection

First consider a system of design Type F or E. By utilizing the position of the tracked user we can calculate the cameras the user could possibly be seen by. If the user is as close to the screen as possible he will be seen by at least one camera column. At twice the full-view distance the user will be seen by two camera columns. Every time the distance increases again by the full-view distance, the user can be seen by an additional camera column. In the worst case, spreading his arms, parts of the user can be seen by six cameras already when standing at full-view distance. This becomes worse as he steps further back. Even if the user moves to the leftmost or rightmost possible position, he could reach out his arms and would still be visible to at least four cameras. Since we only know the user’s head position we always have to consider the worst case. Utilizing this approach at full-view distance, six cameras deliver a total of 207 MiB/s. At a distance of two meters the user could possibly be seen by all cameras. A larger benefit can be gained only under very special user positions uncommon for conferencing and communication. We can conclude that this approach is nearly useless for systems of Type F and E in terms of data reduction. Systems of Type A can reduce the active cameras if the rendering algorithm does not need all cameras for the actual view, resulting in less image data to transmit. Systems of Type B and D can reduce the locally processed and transmitted data. Type C can possibly reduce the data structure updates to be transmitted.

Figure 1: Illustration of camera and frustum selection; selected frustums for the distant user in shaded grey, selected cameras for the close user in white.
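To make the worst-case selection rule concrete, the following sketch (ours; the horizontal field of view and camera positions are illustrative assumptions, not values from the paper) tests which camera columns could possibly see the user:

    import math

    # Illustrative dynamic camera selection (assumptions ours): camera
    # columns 70 cm apart as in Section 2, user modelled as a worst-case
    # horizontal extent of +/- 100 cm (200 cm arm span) around the head.
    COLUMN_SPACING = 0.70                       # m
    COLUMN_X = [i * COLUMN_SPACING for i in range(6)]
    HALF_SPAN = 1.00                            # m, half the arm span
    H_FOV = math.radians(60.0)                  # assumed horizontal FOV

    def visible_columns(head_x: float, distance: float) -> list:
        """Indices of camera columns that could possibly see the user."""
        half_cover = distance * math.tan(H_FOV / 2)   # half-width seen at this depth
        lo, hi = head_x - HALF_SPAN, head_x + HALF_SPAN
        return [i for i, cx in enumerate(COLUMN_X)
                if lo <= cx + half_cover and hi >= cx - half_cover]

    # Close to the wall few columns are active; at 2 m all of them are.
    print(visible_columns(head_x=1.75, distance=0.40))   # [1, 2, 3, 4]
    print(visible_columns(head_x=1.75, distance=2.00))   # all six columns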

3.3 Dynamic Frustum Selection

Dynamic camera selection fails if the user steps backward because she becomes visible in almost all cameras. A simple, yet powerful solution to overcome this limitation is to consider only parts of each camera’s frustum. Again we make use of the tracked local user’s position. We define a spatial region R around the user which describes her maximal possible extent in space. The enclosing axis-aligned bounding box of the projection of R defines our dynamically selected view frustum. A simple R would be a box.

Now, if the user steps backward, the approach scales naturally. She becomes visible in many cameras, but only a smaller part of each frustum, a rectangular region of the camera image, is considered (see Figure 1). In the extreme case, where a user stands close to the wall, dynamic frustum selection basically reduces to camera selection. This approach is usable for systems of Type A and F to reduce the data to be transmitted, and Type C, D and E can reduce the data that have to be processed.
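A minimal sketch of this projection step, assuming an ideal pinhole camera (the focal length and principal point below are illustrative values, not from the paper): project the corners of R and keep the enclosing pixel rectangle.

    from itertools import product

    # Dynamic frustum selection sketch (assumptions ours): project the
    # corners of the user's bounding box R into the camera image and keep
    # the enclosing axis-aligned rectangle of pixels.
    FX = FY = 800.0          # assumed focal length in pixels
    CX, CY = 400.0, 300.0    # assumed principal point (800 x 600 image)
    W, H = 800, 600

    def selected_rect(box_min, box_max):
        """Enclosing pixel rectangle of R, given in camera coordinates (z > 0)."""
        us, vs = [], []
        for x, y, z in product(*zip(box_min, box_max)):  # 8 corners of R
            us.append(FX * x / z + CX)                   # pinhole projection
            vs.append(FY * y / z + CY)
        # Clamp to the image; only this rectangle is processed and transmitted.
        u0, u1 = max(0, int(min(us))), min(W, int(max(us)) + 1)
        v0, v1 = max(0, int(min(vs))), min(H, int(max(vs)) + 1)
        return u0, v0, u1, v1

    # Box R of 70 cm width, 210 cm height, 70 cm depth, 2 m from the camera.
    print(selected_rect((-0.35, -1.05, 1.65), (0.35, 1.05, 2.35)))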

3.4 Change Detection

With methods like the ones presented in [1], the changed image parts can be detected. Then only the changed image parts need to be processed and transmitted. Using our user model from above, we can approximate the changed area with a rectangle of the user’s size. Movements of the user’s arms do not need to be considered because they just change the geometry but not the number of pixels. With the formulas from [6] we can now estimate the number of pixels the user will occupy. If the user is as close to the screen as possible (full-view distance) he will have the largest pixel count. In this worst case the user will be 1200 pixels high and 799 pixels wide, so the total number of occupied pixels is 1200 × 799 = 958,800, corresponding to 2.81 MiB. We should therefore be able to reduce the amount of data generated from 414 MiB/s to an upper limit of about 70.25 MiB/s, a reduction to 17% of the original data. This approach can be utilized by systems of Type A, C, E and F to reduce the amount of data to be transmitted, and by systems of Type B and D to reduce the amount of data to be processed.
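As an illustration, a blockwise variant of such change detection (our sketch, not the method of [1]; block size and threshold are arbitrary choices) could look as follows:

    import numpy as np

    # Blockwise change detection sketch (assumptions ours): compare each
    # 16 x 16 block of the current frame against a static background and
    # keep only blocks whose mean absolute difference exceeds a threshold.
    BLOCK = 16
    THRESHOLD = 8.0   # arbitrary per-pixel intensity threshold

    def changed_blocks(frame: np.ndarray, background: np.ndarray) -> list:
        """Return (row, col, block) for blocks that differ from the background."""
        h, w = frame.shape[:2]
        out = []
        for y in range(0, h - h % BLOCK, BLOCK):
            for x in range(0, w - w % BLOCK, BLOCK):
                cur = frame[y:y + BLOCK, x:x + BLOCK].astype(np.float32)
                bg = background[y:y + BLOCK, x:x + BLOCK].astype(np.float32)
                if np.abs(cur - bg).mean() > THRESHOLD:
                    out.append((y, x, frame[y:y + BLOCK, x:x + BLOCK]))
        return out   # only these blocks are processed and transmitted

    # Toy example: a user-sized patch appears in an otherwise static frame.
    bg = np.zeros((600, 800), dtype=np.uint8)
    cur = bg.copy()
    cur[100:400, 300:500] = 255
    print(len(changed_blocks(cur, bg)), "of", (600 // 16) * (800 // 16), "blocks changed")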

4 RELATED WORK

In [3] the authors make an extensive analysis of different system architectures and give a detailed discussion of streaming in telepresence environments. Furthermore, a strategy for systems of Type C to compress the data is presented. An interesting discussion on how to reduce the amount of data to be transmitted using camera selection is presented in [7]; a successful implementation is presented and evaluated. In [4] an approach for model-driven data compression usable for systems of Type C is presented. In [5] the authors give a brief overview of their findings considering camera selection.

5 CONCLUSION

We have discussed a number of general data reduction approaches for an immersive LHRD videoconferencing system. These simple approaches were discussed separately. We plan to implement a combination of these approaches for the Extended Window Metaphor for LHRDs [6] to evaluate its performance. One promising cascade, for instance, would be first to apply dynamic frustum selection, second to select a resolution level and third to do a blockwise change detection.
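A rough sketch of how such a cascade could be wired together (ours; the three stages are simplified stand-ins for Sections 3.1–3.4, not the final design):

    import numpy as np

    def crop_frustum(frame, rect):
        """Stage 1, frustum selection: keep the rectangle (u0, v0, u1, v1)."""
        u0, v0, u1, v1 = rect
        return frame[v0:v1, u0:u1]

    def downscale(frame, factor):
        """Stage 2, resolution level: naive integer subsampling."""
        return frame[::factor, ::factor]

    def changed_blocks(frame, background, block=16, threshold=8.0):
        """Stage 3: blockwise change detection as in Section 3.4."""
        h, w = frame.shape[:2]
        diff = np.abs(frame.astype(np.float32) - background.astype(np.float32))
        return [(y, x) for y in range(0, h - block + 1, block)
                       for x in range(0, w - block + 1, block)
                       if diff[y:y + block, x:x + block].mean() > threshold]

    # Wire the stages together for one frame.
    frame = np.random.randint(0, 256, (600, 800), dtype=np.uint8)
    background = np.zeros_like(frame)
    rect = (300, 100, 500, 400)          # from dynamic frustum selection
    roi = downscale(crop_frustum(frame, rect), 2)
    roi_bg = downscale(crop_frustum(background, rect), 2)
    print(len(changed_blocks(roi, roi_bg)), "blocks to transmit")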

Frustum selection and simple change detection algorithms are low-latency operations. In this paper we limited ourselves to data reduction, but a discussion of latency is of equal importance and will be part of future research.

ACKNOWLEDGEMENTS

This work was supported by the EFRE fund of the European Community. We thank the anonymous reviewers for their valuable contribution.

REFERENCES

[1] A. Griesser, S. D. Roeck, A. Neubeck, and L. V. Gool. GPU-based foreground-background segmentation using an extended colinearity criterion. In Proceedings of Vision, Modeling, and Visualization (VMV) 2005, pages 319–326. IOS Press, November 2005.

[2] M. Kuechler and A. Kunz. HoloPort - a device for simultaneous video and data conferencing featuring gaze awareness. In VR ’06: Proc. of the IEEE Conference on Virtual Reality, pages 81–88, Washington, DC, USA, 2006. IEEE Computer Society.

[3] E. Lamboray, S. Würmlin, and M. H. Gross. Data streaming in telepresence environments. IEEE Trans. Vis. Comput. Graph., 11(6):637–648, 2005.

[4] J.-M. Lien, G. Kurillo, and R. Bajcsy. Multi-camera tele-immersion system with real-time model driven data compression. The Visual Computer, 26(1):3–15, 2010.

[5] A. Maimone and H. Fuchs. A first look at a telepresence system with room-sized real-time 3D capture and life-sized tracked display wall. In Proceedings of ICAT 2011, to appear, November 2011.

[6] M. Willert, S. Ohl, A. Lehmann, and O. G. Staadt. The extended window metaphor for large high-resolution displays. In EGVE/EuroVR/VEC, pages 69–76, 2010.

[7] Z. Yang, W. Wu, K. Nahrstedt, G. Kurillo, and R. Bajcsy. Enabling multi-party 3D tele-immersive environments with ViewCast. TOMCCAP, 6(2), 2010.
