
International Conference on Artificial Reality and Telexistence
Eurographics Symposium on Virtual Environments (2018)
G. Bruder, S. Cobb, and S. Yoshimoto (Editors)
DOI: 10.2312/egve.20181320

Compression Of 16K Video For Mobile VR Playback Over 4K Streams.

I. Vazquez and S. Cutchin
{ikervazquezlopez}{stevencutchin}@boisestate.edu

Boise State University
1910 University Drive, Boise ID, USA.

Abstract
Mobile virtual reality headset devices are currently constrained to playing back 4K video streams for hardware, network, and performance reasons. This strongly limits the quality of 360° videos over 4K streams, which in turn translates to insufficient resolution for virtual reality video playback. Spherical stereo virtual reality videos can currently be captured at 8K and 16K resolutions, with 8K being the minimal resolution for an acceptable quality video playback experience. In this paper, we present a novel technique that uses object tracking to compress 16K spherical stereo videos captured by a still camera into a format that can be streamed over 4K channels while maintaining the 16K video resolution for typical video captures.

CCS Concepts
• Computing methodologies → Computer graphics; Image compression;

1. Introduction

Mobile Virtual Reality (VR) devices are currently constrained to playing back 4K video streams because of hardware, network, and performance reasons. Virtual Reality requires real-time video playback in order to maintain the user's immersive experience. Playback with lag, buffering, or jitter breaks the user's emotional engagement and the realism of the VR experience. Most existing mobile VR devices have hardware restrictions that impede the decoding of videos at resolutions higher than 4K [BPS18]. Unfortunately, 4K video streams provide insufficient resolution for quality VR video playback experiences [RAAT∗17]. This occurs for a few reasons. First, 4K VR video is spherical, and at most 1/4 of the full frame is presented to the end user at a time. This reduces the apparent horizontal resolution to roughly 1024 pixels (1K) at any given moment. Second, VR devices place the display very close to the human eye, which yields a strong screen door effect; this exacerbates the impact of the reduction to 1K, producing particularly large and fat pixels. Third, since VR playback is stereoscopic, the 4K video stream is typically shared between the left and right images. This leads to a resolution of roughly 512 pixels (0.5K) for each eye, or a halving of the frame rate while maintaining 1K frames. Given the screen door effect, VR video must match the full device resolution at any given moment for a pleasant viewing experience. This requires a resolution of 2960×1440 per frame for an effective experience. As such, it is necessary to stream video at an effective resolution of 8K or 16K for a quality VR video experience.

Standard compression algorithms such as MPEG [MWS06] achieve high compression rates while maintaining good quality for general videos. However, for the reasons above, MPEG-compressed 4K VR video streaming is insufficient. Spherical stereo VR videos can currently be captured at 8K and 16K resolutions, with 8K being the minimal resolution for an acceptable quality playback experience; but, as previously stated, mobile VR headsets cannot handle these resolutions.

To address the mobile VR headset streaming constraint, we propose a solution that segments the videos into multiple distinct objects based on object identification and tracking, with the background identified as its own independent object. Identified objects are then compressed as separate video streams using standard MPEG compression and streamed interleaved to a receiving playback headset. For this early work, we limited the problem to still camera videos, where the background remains still and objects move relative to the camera. Objects may begin and stop moving at any time in the video sequence; new objects may appear, old objects may disappear, and objects may cross over the paths of other objects. In our problem space, background pixels do not change their color or intensity, and therefore streaming them repeatedly is inefficient. Segmenting and streaming only the pixels that belong to objects in motion reduces the overall required bandwidth relative to standard MPEG compression applied to entire frames. As a direct result, in our test cases the video quality increases with a simultaneous reduction in overall stream bandwidth.



In the rest of this paper, we present our initial testing harness that generates individual videos for each of the segmented objects contained in the original 16K video, as well as our initial quality metrics and compression comparisons.

The presented approach reduces the required amount of data for streaming and increases the video playback quality compared to standard methods such as MPEG, in some cases achieving compression improvements of ninety percent.

Our implemented technique depends upon the existence of a dedicated GPU on the playback device with support for hardware texture mapping with programmable u,v coordinates and polygonal geometry.

2. Related work

Current work on VR video streaming focuses on Field Of View (FOV) streaming, where interaction between the content provider and the remote user is required. In [SSHS16], the authors present a FOV tile-based approach for High Efficiency Video Coding (HEVC) on head-mounted displays and state the constraints of FOV streaming, such as the encoding overhead when each user's FOV is encoded. Following this methodology, Bassbouss et al. [BPS18] generate 16K content and use server-side transformations to stream the current FOV to the end user. This content is used in [BSB17], where they pre-render relevant video FOVs to ensure efficient streaming of high quality videos. One limitation of this approach is that the FOV is not a rectangular area in the source equirectangular video; [RAAT∗17] therefore uses an improved tiling method to stream the content, which addresses this.

To evaluate the quality of the streamed videos, most of the works reviewed [BPS18, BSB17] use the Peak Signal-to-Noise Ratio (PSNR) [HTG12] metric due to its standardization. However, PSNR shows some problems when dealing with non-rectangular FOV content; because of that, the authors in [XHYV17] present a new evaluation framework that includes and extends a Spherical PSNR [YLG15] metric to an Area-Weighted PSNR (AW-PSNR). For each possible FOV in a VR video, the respective PSNR score is computed, and AW-PSNR then weights the score by the area that the FOV captures from the equirectangular VR video frame.

3. Object based compression

In still camera VR videos, pixels belonging to moving objects have color and intensity changes, while background pixels typically remain relatively unchanged. Streaming background pixels that do not change is clearly inefficient and wastes bandwidth. In order to avoid sending unchanged pixels, we segment the video frames into individual objects that have some motion in the video, track them, and generate an individual video for each object. Each of these videos captures a single moving object through the entire video, and the size of the video is directly related to the pixel size of the captured object. By applying this technique, we get a set of smaller videos capturing each of the objects and their surrounding area, together with their respective motion information.

We identify moving objects using a simple thresholded pixel color/intensity method. When an object moves, pixels that belong to the object have different color and intensity values with respect to the previous frame. Similar to [TOB∗97], and as shown in Figure 1b, we compute image differences from two contiguous video frames to detect which pixels are part of an object. By setting a threshold on this operation, we can discard noise in videos that could lead to incorrect object segmentation. We empirically verified that a threshold value of 20 performs well for motion detection, removing most of the noise.
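A minimal sketch of this frame-differencing step, assuming Python with OpenCV and NumPy (the paper does not specify an implementation; function and variable names are illustrative):

```python
import cv2

MOTION_THRESHOLD = 20  # the empirically chosen value reported in the paper

def motion_mask(prev_frame, curr_frame, threshold=MOTION_THRESHOLD):
    """Return a binary mask of pixels that changed between two contiguous frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, curr_gray)
    # Pixels whose intensity changed by more than the threshold count as motion;
    # smaller changes are treated as sensor/compression noise and discarded.
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask
```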

We traverse all frames of the video, identifying and numbering every object that appears.

There exist cases where simple thresholding is not enough to separate objects from noise, and further processing is required to discard it. We use image processing morphological operations such as erosion, dilation, opening, and closing to generate object blobs and, at the same time, remove noisy pixels. Depending on the texture composition of the object, our motion detection method fails to detect objects of certain types (e.g. objects with smooth surfaces, such as in the video used for testing, Figures 1a and 1e). As seen in Figure 1b, only the boundaries of the ball are detected as motion pixels and the object is split in two. By applying a combination of image morphological operations, we connect pixels from disjoint regions that are part of a single object (Figure 1c). From this, for every video frame, we generate blobs for each of the objects to segment them from the background.
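The blob cleanup could look like the following sketch, again assuming OpenCV (the kernel size is an illustrative assumption, not a value from the paper):

```python
import cv2
import numpy as np

def clean_blobs(mask, kernel_size=5):
    """Close gaps inside objects and remove isolated noise pixels."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    # Closing (dilation then erosion) connects disjoint regions of one object,
    # e.g. the two halves of the ball detected only along its boundary.
    closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Opening (erosion then dilation) removes small isolated noise blobs.
    return cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
```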

Blobs indicate where moving objects are in the frame but do not provide any information about which pixels belong to which object. Connected Components (CC) analysis assigns the same unique id to pixels belonging to a single tracked object. Using those ids, we segment the pixels and generate an individual binary mask per connected region. These masks locate objects in each of the frames. Objects have diverse shapes and sizes; instead of coding each shape individually as in [TOB∗97], we use a bounding box approach. Based on the generated masks, as seen in Figure 1d, we compute the bounding box of each object at each frame. The size of an object can change over time, and thus the size of its bounding box can also change.
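With OpenCV, the labeling and per-object bounding boxes fall out of one call; a sketch:

```python
import cv2

def object_boxes(mask):
    """Label connected regions and return one bounding box (x, y, w, h) per object."""
    num_labels, labels, stats, _centroids = cv2.connectedComponentsWithStats(mask)
    boxes = []
    for label in range(1, num_labels):  # label 0 is the background
        boxes.append((stats[label, cv2.CC_STAT_LEFT],
                      stats[label, cv2.CC_STAT_TOP],
                      stats[label, cv2.CC_STAT_WIDTH],
                      stats[label, cv2.CC_STAT_HEIGHT]))
    return boxes
```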

To address this, we fix the size of an object's bounding box to the maximum size its bounding box reaches across all frames. In order to do that, we first track each object's bounding box by searching for the closest one in the previous frame, and then assign the unique id of that object. The tracking algorithm at this time is quite naive, and we are working on more sophisticated solutions based on existing techniques from the literature. The initial naive technique, however, shows promising compression and quality results.
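One possible reading of this naive tracker, sketched under the assumption that "closest" means smallest centroid distance and that matching is done greedily (neither detail is specified in the paper):

```python
import math
from itertools import count

_next_id = count()  # global id generator for newly appearing objects

def track_boxes(prev_tracks, boxes, max_dist=200.0):
    """Assign ids to current boxes by matching each to the nearest box in the
    previous frame; unmatched boxes start new tracks.
    prev_tracks: dict id -> (x, y, w, h); boxes: list of (x, y, w, h)."""
    def centroid(box):
        x, y, w, h = box
        return (x + w / 2.0, y + h / 2.0)

    tracks, unclaimed = {}, dict(prev_tracks)
    for box in boxes:
        cx, cy = centroid(box)
        best_id, best_dist = None, max_dist
        for obj_id, prev_box in unclaimed.items():
            px, py = centroid(prev_box)
            dist = math.hypot(cx - px, cy - py)
            if dist < best_dist:
                best_id, best_dist = obj_id, dist
        if best_id is None:
            best_id = next(_next_id)   # a new object entered the scene
        else:
            del unclaimed[best_id]     # each previous box matches at most once
        tracks[best_id] = box
    return tracks
```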

We collect the initial position and all movement data for every object across all frames. Based on this movement information, the bounding box, and the pixels of each object, we generate an individual video of the changing data within each object. This produces, for each object, a video in which its shape, color, and surroundings are captured for every frame of the original video.
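Writing one fixed-size crop video per object might look like the following sketch (the codec choice is an illustrative assumption, and boundary clamping is omitted):

```python
import cv2

def write_object_video(frames, track, box_size, path, fps=30):
    """Crop an object's fixed-size bounding box from every frame it appears in
    and encode the crops as a standard video.
    frames: list of full frames; track: dict frame_index -> (x, y) top-left corner."""
    w, h = box_size  # the maximum bounding box size across all frames
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for idx, (x, y) in sorted(track.items()):
        writer.write(frames[idx][y:y + h, x:x + w])
    writer.release()
```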

Figure 1: Pipeline of the object based compression method. (a) Input video. (b) Consecutive frame difference. (c) Mask generation. (d) Bounding box generation and tracking. (e) Object video.

Object videos capture a small area of the input video, and when these videos are packed together with the tracking information for streaming purposes, the background pixels are missing. We infer the video background by keeping track of pixel values that do not belong to objects. Once we process the entire video, each pixel has a set of candidate values from individual frames that can qualify as the background value. We use a voting technique to determine the final value of pixels that are part of the background image. This background selection method will fail if a moving object sits still throughout the majority of the video. We are developing an extended tracking and masking technique to deal with this case and others.
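A simple reading of the voting step, assuming per-pixel median voting over the frames where a pixel is not covered by any object (the paper does not specify the vote rule, so the median here is an assumption):

```python
import numpy as np

def vote_background(frames, object_masks):
    """Estimate the static background. For each pixel, candidate values are
    samples from frames where no object covered that pixel; the median of the
    candidates wins the vote. (Memory-heavy for 16K video; a sketch only.)
    frames: list of HxWx3 uint8 arrays; object_masks: list of HxW bool arrays."""
    stack = np.stack(frames).astype(np.float32)    # T x H x W x 3
    covered = np.stack(object_masks)[..., None]    # T x H x W x 1
    candidates = np.ma.masked_array(
        stack, mask=np.broadcast_to(covered, stack.shape))
    background = np.ma.median(candidates, axis=0).filled(0)
    return background.astype(np.uint8)
```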

The last step of the method involves aggregating all the computed components of the video (the background image, the individual object videos, and the tracking information), which are packed and ready to stream to the remote user. This format allows different types of streaming, which we discuss as future work in Section 8.

4. Video reconstruction

To reconstruct the original video from the compressed format, we use the streamed object videos and their motion tracking information. In order to re-generate the original video frames, we use the video background inferred from the background voting process; the voted background image becomes the background of each frame. We know in which video frames an object appears and, making use of the object's tracking information, we replace the corresponding background region with frames of the object video. This simple compositing technique reconstructs the original video.
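The compositing reduces to pasting each object crop onto a copy of the voted background; a sketch under the same illustrative data layout as before:

```python
def reconstruct_frame(background, object_frames, tracks, frame_idx):
    """Rebuild one output frame: start from the voted background, then paste
    every object crop present in this frame at its tracked position.
    object_frames: dict id -> dict frame_index -> HxWx3 crop;
    tracks: dict id -> dict frame_index -> (x, y) top-left corner."""
    frame = background.copy()
    for obj_id, positions in tracks.items():
        if frame_idx in positions:
            x, y = positions[frame_idx]
            crop = object_frames[obj_id][frame_idx]
            h, w = crop.shape[:2]
            frame[y:y + h, x:x + w] = crop
    return frame
```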

5. Testing environment

For this initial method, we limited the scope of the video compression problem and created ideal-world synthetic videos using still cameras and constantly moving objects. Still camera videos have stationary backgrounds with small or no changes. We assumed that lighting conditions do not have smooth transitions, because our moving object detection method does not consider small color/intensity changes: if a pixel's value changes smoothly over time, motion detection fails to detect it and we do not create a video for that object.

Tracked objects that stop for a period of time are not detected due to their lack of movement. Therefore, we only consider objects that move constantly around the camera, and we generated synthetic videos at 16K, 8K, 4K, and 2K frame sizes.

To evaluate video quality, existing works [MWS06] use the standard Peak Signal-to-Noise Ratio (PSNR) metric [HTG12]. We use this metric to compare the resulting quality of the MPEG compression algorithm and our initial methodology. We compressed raw videos with both, and then compared the reconstructed video quality, i.e. the PSNR score, and the compressed file size of both videos.

Video | MPEG size (MB) | Our size (MB) | MPEG PSNR (dB) | Our PSNR (dB)
16K   | 180            | 82            | 31.62          | 77.45
2K    | 22.5           | 5.2           | 30.33          | 75.83

Table 1: Video size and PSNR score comparison.
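For reference, PSNR over 8-bit frames is computed as in this short NumPy sketch:

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two same-sized 8-bit frames."""
    err = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```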

6. Results

Our early testing results demonstrate an improved compression rate compared against direct MPEG compression, although the initial tests are currently limited. Additionally, our compression ratio for the 16K files is not as good as for the 2K files, which appears counter-intuitive given the core method.

The PSNR values for our method were notably superior to those of the MPEG compression methodology, with PSNR results being similar for both testing sizes. There may be an opportunity for improvement in our compositing method that would raise our PSNR scores overall, particularly for the 16K version.

7. Conclusions

In this paper, we presented an initial method that shows promising results for 16K video compression and streaming. We use naive methods to get initial results which need improvement, but they establish a simple modular framework that supports extension via plugins for the core elements. This should enable rapid experimentation with alternate choices for improved performance and enhanced capabilities. The achieved file sizes and PSNR scores demonstrate that, compared to standard MPEG, our object based compression approach could be used for streaming 16K videos using less bandwidth while maintaining high visual quality.

8. Future work

This initial compression method works well for a set of test cases with application to VR viewing. However, a long list of improvements is necessary for it to be useful in a general manner. The following items are planned for improving its function.

Replacing our initial naive methods with machine learning and better rule-based solutions will increase the overall video quality and the PSNR score compared to traditional methods such as MPEG.

Extending our ideal world from Section 5 to all varieties of moving objects: objects can stop, start moving, or stop and restart their movement, and we have to take those cases into consideration when generating the compressed object videos.

It is crucial for video quality to identify still objects that move at some point of the video; otherwise, due to the naive movement detection method, they will not be displayed while they remain stationary.

Additionally, in the longer term, creating methods to deal with a moving camera and integration with MPEG streams would allow the technique to function as a drop-in for all video compression systems.

Finally, our object compression approach can utilize a sprite-based playback system where videos are directly projected onto sprites in front of the background. This would allow us to study replacing the bounding box objects with more complex polygonal objects, potentially increasing our compression ratio.

Packing frames of individual objects into a single frame and transmitting their u and v coordinates permits direct GPU-based video reconstruction and playback.
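A minimal sketch of such packing, assuming a simple left-to-right shelf layout into one atlas frame (the layout strategy and all names here are illustrative, not from the paper):

```python
import numpy as np

def pack_atlas(crops, atlas_size=(3840, 2160)):
    """Pack per-object crops for one time step into a single atlas frame and
    return normalized (u, v, w, h) rectangles a GPU can sample from.
    Assumes all crops fit in the atlas. crops: dict id -> HxWx3 uint8 array."""
    aw, ah = atlas_size
    atlas = np.zeros((ah, aw, 3), np.uint8)
    uv_rects, x, y, shelf_h = {}, 0, 0, 0
    for obj_id, crop in crops.items():
        h, w = crop.shape[:2]
        if x + w > aw:                  # start a new shelf below the current one
            x, y, shelf_h = 0, y + shelf_h, 0
        atlas[y:y + h, x:x + w] = crop
        uv_rects[obj_id] = (x / aw, y / ah, w / aw, h / ah)
        x, shelf_h = x + w, max(shelf_h, h)
    return atlas, uv_rects
```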

If this long list of problems can be overcome, the proposed method could be utilized as a general technique for video compression without limitation of video content.

References

[BPS18] BASSBOUSS L., PHAM S., STEGLICH S.: Streaming and playback of 16K 360° videos on the web. In Communications Conference (MENACOMM), IEEE Middle East and North Africa (2018), IEEE, pp. 1–5.

[BSB17] BASSBOUSS L., STEGLICH S., BRAUN S.: Towards a high efficient 360° video processing and streaming solution in a multiscreen environment. In Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on (2017), IEEE, pp. 417–422.

[HTG12] HUYNH-THU Q., GHANBARI M.: The accuracy of PSNR in predicting video quality for different video scenes and frame rates. Telecommunication Systems 49, 1 (2012), 35–48.

[MWS06] MARPE D., WIEGAND T., SULLIVAN G. J.: The H.264/MPEG-4 advanced video coding standard and its applications. IEEE Communications Magazine 44, 8 (2006), 134–143.

[RAAT∗17] RONDAO ALFACE P., AERTS M., TYTGAT D., LIEVENS S., STEVENS C., VERZIJP N., MACQ J.-F.: 16K cinematic VR streaming. In Proceedings of the 2017 ACM on Multimedia Conference (2017), ACM, pp. 1105–1112.

[SSHS16] SKUPIN R., SANCHEZ Y., HELLGE C., SCHIERL T.: Tile based HEVC video for head mounted displays. In Multimedia (ISM), 2016 IEEE International Symposium on (2016), IEEE, pp. 399–400.

[TOB∗97] TALLURI R., OEHLER K., BARNETT T., COURTNEY J. D., DAS A., LIAO J.: A robust, scalable, object-based video compression technique for very low bit-rate coding. IEEE Transactions on Circuits and Systems for Video Technology 7, 1 (1997), 221–233.

[XHYV17] XIU X., HE Y., YE Y., VISHWANATH B.: An evaluation framework for 360-degree video compression. In Visual Communications and Image Processing (VCIP), 2017 IEEE (2017), IEEE, pp. 1–4.

[YLG15] YU M., LAKSHMAN H., GIROD B.: A framework to evaluate omnidirectional video coding schemes. In Mixed and Augmented Reality (ISMAR), 2015 IEEE International Symposium on (2015), IEEE, pp. 31–36.
