
Compression-Aware Digital Pan/Tilt/Zoom

Mina Makar, Aditya Mavlankar and Bernd Girod
Information Systems Laboratory, Department of Electrical Engineering, Stanford University

{mamakar, maditya, bgirod}@stanford.edu

Abstract: We consider a video transmission system that supports digital pan/tilt/zoom by cropping the region-of-interest (RoI) chosen by the user and encoding it before transmission. With a static camera, the motion in the transmitted video results mostly from changing the RoI. We propose an efficient technique for cropping the RoI in a way that yields a low-energy residual after motion compensation performed by a video encoder such as H.264/AVC. Experimental results indicate that the proposed technique can achieve a significant rate reduction of about 70% compared to compression-oblivious digital pan/tilt/zoom with a negligible increase in system complexity.

1. INTRODUCTION

The availability of high-spatial-resolution digital video is on the rise. This is fueled by the increase in spatial resolution of sensors and growing capacities of storage devices. Also, algorithms for stitching a comprehensive high-resolution view from multiple cameras are available. Examples of such high-resolution video mosaicing techniques are [1–3].

The limited spatial resolution of displays and/or the limited transmission bit-rate often rule out the delivery of the high-resolution video. In this case, a downsampled or cropped version of the video might be sent instead. We propose a video transmission system that enables digital pan/tilt/zoom during the streaming session by cropping out and encoding the region-of-interest (RoI) chosen by the user. A schematic of this system is shown in Fig. 1. The video is downsampled to create a thumbnail video that serves as an overview for the user. The user can control the RoI interactively while watching the streamed video. The transmitter crops and/or resamples the relevant RoI from the original high-definition (HD) resolution, encodes it and streams it to the receiver. The system also allows the user to go back and replay an RoI from the previously recorded video sequence.

Prior efforts [4–7] can be applied to interactive spatial browsing of high-resolution videos. Some work [4, 5] relies on JPEG 2000 [8] to provide spatial random access. Using JPEG 2000 independently on each frame cannot exploit the correlation among successive frames, resulting in poor rate-distortion (R-D) performance. This is a severe shortcoming, especially when the video has low motion. When using a video coding standard, the main challenge is to provide spatial random access. In our system, we crop the RoI from the raw video data and can employ any standard video coding scheme to compress the resulting RoI sequence efficiently. Since the cropped RoI is smaller than the original recorded frame, a lower-resolution encoder can be used to encode it.

As one possible application scenario, consider surveillance. Surveillance video is usually recorded with a static camera and comprises little motion. However, the cropped RoI sequence entails high motion due to the pan/tilt/zoom operations. Unfortunately, this results in high bit-rates for regions that might have changed little in the recorded scene. In this paper, we develop an efficient technique to crop and resample the RoI in a way that reduces the coding bit-rate. The cropping technique handles panning, zooming in and zooming out each in a different, compression-aware manner. The developed technique requires no modifications of the standard H.264 codec we use and entails negligible complexity.

The rest of the paper is organized as follows. Section 2 explains the proposed compression-aware technique for pan and tilt operations. In Section 3, we analyze the zoom-in and zoom-out operations and show how they can be performed in a compression-aware manner. Finally, in Section 4, we present experimental results obtained for a control experiment on a static image and also results for two representative RoI sequences cropped from two HD videos.

Fig. 1. RoI video transmission system with interactive digital pan/tilt/zoom.

2. PAN AND TILT OPERATIONS

Fig. 2 presents the cropping operation of the RoI. The HD frame dimensions are W x H, the RoI dimensions are w x h and the zoom factor is a. According to the RoI selection parameters obtained from the user, the HD frame is resized to size aW x aH. Following this, the appropriate RoI is cropped from the resized frame. This is equivalent to cropping an area with dimensions (w/a) x (h/a) from the HD frame and resizing this area by the factor a in order to fit the RoI dimensions. Let the RoI offset in the resized frame be (Δx, Δy), then the offset of the corresponding area in the HD frame is (Δx /a, Δy /a).
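The equivalence between the two views of the cropping operation can be made concrete with a small sketch. The `Rect` type and the function name `roi_source_rect` are hypothetical helpers introduced only for illustration; the example offset and zoom factor are arbitrary values, not taken from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def roi_source_rect(dx: float, dy: float, w: int, h: int, a: float) -> Rect:
    """Map an RoI of size w x h at offset (dx, dy) in the resized frame
    back to the area it covers in the HD frame."""
    if a <= 0:
        raise ValueError("zoom factor a must be positive")
    # Offsets and dimensions both scale by 1/a when going back to the HD frame.
    return Rect(x=dx / a, y=dy / a, width=w / a, height=h / a)


# Illustrative values only: a 480 x 240 RoI with zoom factor a = 0.5.
rect = roi_source_rect(dx=100, dy=60, w=480, h=240, a=0.5)
print(rect)
```

With a = 0.5, the 480 x 240 RoI covers a 960 x 480 area of the HD frame, which is then resized by a to fit the RoI dimensions.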

We assume that the camera is static and the video content has very low motion. We exploit the static nature of the scene and prepare the RoI in a compression-aware manner to improve the R-D performance.

In this section, we consider the pan and tilt operations, thus assuming that the zoom factor a is constant. We aim to mitigate two potential inefficiencies. First, even when the scene is static, there is still a nonzero difference between consecutive video frames due to camera-noise. The resulting prediction residual increases the bit-rate, especially when coding at high quality. Static portions should ideally be represented by a frame copy, thus resulting in much lower bit-rates. Second, imagine that the RoI is cropped at an arbitrary location. During motion estimation, the video encoder will choose a motion vector up to a certain resolution (quarter pixel in the H.264 standard [9]) with specific filters for sub-pixel interpolation defined in the standard. When the user pans or tilts the RoI, the motion compensation employed by the video encoder might turn out to be less than perfect, although the video contents appear unchanged for most of the scene.

Noise reduction techniques can address the first issue, while the second issue can be addressed by carefully choosing the shifts associated with the pans or tilts. A simple and effective denoising scheme applies a non-linear soft-coring technique on each frame individually (Fig. 3). The image f(x,y) is filtered to obtain a lowpass component fL(x,y) and a highpass component fH(x,y). We assume that small values in fH(x,y) are due to noise whereas large values in fH(x,y) represent edges and fine details. The highpass component passes through the non-linear filter given in (1) so that the values corresponding to noise are suppressed while the values corresponding to edges are unmodified. Finally, the output of the non-linear filter α(fH(x,y)) is added back to the lowpass component to obtain the denoised video frame g(x,y). The non-linear filter is given by

$$\alpha(f_H(x,y)) = m \, f_H(x,y) \left(1 - e^{-\left(|f_H(x,y)|/\tau\right)^{\gamma}}\right), \qquad (1)$$

where m controls the sharpness of the resulting image; τ controls the soft threshold for suppressing noise; and γ controls the smoothness of transition to the linear region of the input-output characteristic.
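The soft-coring chain can be sketched as follows. This is a minimal sketch under stated assumptions: the coring characteristic is taken to have the form α(v) = m·v·(1 − exp(−(|v|/τ)^γ)), the lowpass filter is a 3 x 3 moving average chosen purely for illustration (the text does not specify the filter), and the function names are hypothetical. The default parameter values m = 1, γ = 3 and τ = 15 are those reported in Section 4.2.

```python
import numpy as np


def box_lowpass(frame: np.ndarray, radius: int = 1) -> np.ndarray:
    """Separable moving-average lowpass (an assumed stand-in filter)."""
    size = 2 * radius + 1
    kernel = np.ones(size) / size
    padded = np.pad(frame.astype(np.float64), radius, mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def soft_core(f_h: np.ndarray, m: float = 1.0, tau: float = 15.0,
              gamma: float = 3.0) -> np.ndarray:
    """Suppress small highpass values (noise), pass large ones (edges)."""
    return m * f_h * (1.0 - np.exp(-(np.abs(f_h) / tau) ** gamma))


def denoise(frame: np.ndarray, m: float = 1.0, tau: float = 15.0,
            gamma: float = 3.0) -> np.ndarray:
    """g = f_L + alpha(f_H), applied to a single frame."""
    f_l = box_lowpass(frame)
    f_h = frame.astype(np.float64) - f_l
    return f_l + soft_core(f_h, m, tau, gamma)


# Synthetic check: a flat gray frame with mild additive noise.
rng = np.random.default_rng(0)
noisy = 128.0 + rng.normal(0.0, 2.0, size=(32, 32))
denoised = denoise(noisy)
print(noisy.std(), denoised.std())
```

Because the noise amplitude here is well below τ, the highpass component is almost entirely suppressed and the output is close to the lowpass component, while a highpass value much larger than τ would pass essentially unchanged.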

The second inefficiency is due to the finite resolution of the motion vectors and non-ideal spatial interpolation for sub-pixel positions. This inefficiency can be avoided by restricting the RoI offset locations to integer pixels. If Δx and Δy are restricted to integers (see Fig. 2), then integer motion vectors will yield an efficient prediction. We noticed that this granularity of movement is sufficient for the pan and tilt operations. Fig. 4 presents a block diagram of the proposed preprocessing steps. To further demonstrate this idea, an illustrative example in 1-D is shown in Fig. 5.

Pixels from Δx to Δx+n are cropped from the resized frame. These pixels correspond to locations from Δx /a to (Δx+n)/a in the HD frame. Bicubic interpolation is used to calculate the values at these locations from the actual pixel values in the HD frame. In Fig. 5, the actual HD frame pixels are shown using red circles and numbered from X to X+N. Assume that in the next frame, the user pans the RoI to the right by k pixels where k << n. This requires cropping a line that starts at Δx+k and ends at Δx+n+k. The values acquired from the HD frame are at locations (Δx+k)/a to (Δx+n+k)/a. Since most of these locations are the same as before panning, the video encoder will infer a horizontal motion vector with k pixels shift and little or no motion compensation residual for all pixels except the newly included k pixels.
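The 1-D argument can be checked numerically. The sketch below samples one HD line at the positions (Δx+i)/a and compares an integer pan with a fractional one. Linear interpolation (`np.interp`) is used as a simple stand-in for the bicubic interpolation mentioned in the text, and the line contents, offsets and function name are illustrative assumptions.

```python
import numpy as np


def crop_line(hd_line: np.ndarray, dx: float, n: int, a: float) -> np.ndarray:
    """Sample RoI pixels dx..dx+n of the resized line from one HD line."""
    positions = (dx + np.arange(n + 1)) / a  # locations in HD coordinates
    # Linear interpolation stands in for the bicubic filter used in the text.
    return np.interp(positions, np.arange(len(hd_line)), hd_line)


rng = np.random.default_rng(0)
hd_line = rng.uniform(0, 255, size=200)   # synthetic HD line
a, n, k = 480 / 700, 64, 3                # zoom factor, RoI length, pan in pixels

before = crop_line(hd_line, dx=20, n=n, a=a)
after_int = crop_line(hd_line, dx=20 + k, n=n, a=a)         # integer pan
after_frac = crop_line(hd_line, dx=20 + k + 0.4, n=n, a=a)  # fractional pan

# Integer pan: the overlapping pixels are an exact shifted copy.
print(np.array_equal(after_int[:-k], before[k:]))
# Fractional pan: the overlapping pixels differ, leaving a residual.
print(np.abs(after_frac[:-k] - before[k:]).max())
```

For the integer pan the sampling locations in the HD frame coincide with those of the previous frame, so an integer motion vector predicts the overlap without residual. The fractional pan resamples at new locations, which no integer shift can reproduce exactly.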

3. ZOOM OPERATION

In this section, a compression-aware method for performing zoom operations is developed. A zoom operation is characterized by a change in the zoom factor a. It is performed in steps over a number of frames so that the user does not experience jerkiness in the displayed video. If we directly encode frames during a zoom operation, a large prediction residual may result because of the change in the RoI contents. Zoom-in implies an increase in a, whereas zoom-out implies a decrease in a. Zoom-in and zoom-out have different characteristics since in zoom-out, new areas are acquired from the HD frame, which is not the case for zoom-in. Thus, the two operations are treated separately in the following subsections.

Fig. 2. Resizing the HD frame as an intermediate step before cropping the RoI.

3.1. Zoom-in

Assume that zoom-in is performed over q steps. In each step a is multiplied by the factor zi > 1 and the total zoom change is zi^q. Note that q is not known beforehand and depends on the user input. We exploit the fact that during zoom-in, the area to be displayed during a given step is always a part of the area that was displayed in the previous step.

The zoom factor is not modified before encoding. The received RoI is magnified after decoding and the area required for RoI display is cropped at the receiver. For example, if we consider the first frame in the zoom-in operation, the zoom factor for cropping is not changed to a·zi. Instead, the HD frame is resized by a and the cropped portion includes an excess margin around the new RoI; i.e., a larger portion of the scene than required for the display is sent. At the receiver, the portion corresponding to the RoI is magnified by zi to make it w x h pixels in size. This yields a large bit-rate reduction since for static sequences we can encode most of the picture by copying the previous frame. After a few steps, the quality of the displayed RoI degrades because the magnification ratio increases and a blurry version of the required area is produced. At this point, the sender changes the zoom factor to the current value and crops the input frame with the correct zoom factor. The same procedure is repeated until the final zoom factor a·zi^q is reached. In Section 4, we explain how we set the threshold for changing the zoom factor prior to encoding.
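The zoom-in procedure can be sketched as a per-step schedule of the zoom factor used at the sender and the magnification applied at the receiver. This is a simplified sketch: the function name is hypothetical, q is passed in although in practice it depends on the user input, and the default threshold of 1.25 is the value reported in Section 4.1. The example values a = 0.5 and zi = 1.1 are illustrative only.

```python
def zoom_in_schedule(a: float, z_i: float, q: int, threshold: float = 1.25):
    """Return per-step (encoder_zoom, receiver_magnification) pairs."""
    if z_i <= 1.0:
        raise ValueError("zoom-in requires z_i > 1")
    encoder_zoom = a  # zoom factor currently used for cropping at the sender
    schedule = []
    for step in range(1, q + 1):
        target = a * z_i ** step             # zoom factor the user should see
        magnification = target / encoder_zoom
        if magnification > threshold:
            # Quality would degrade: acquire the frame at the current zoom factor.
            encoder_zoom = target
            magnification = 1.0
        schedule.append((encoder_zoom, magnification))
    return schedule


for encoder_zoom, magnification in zoom_in_schedule(a=0.5, z_i=1.1, q=5):
    print(f"encoder zoom {encoder_zoom:.4f}, receiver magnification {magnification:.2f}")
```

In this example the sender keeps a = 0.5 for the first two steps, switches to the current zoom factor at the third step when the magnification would exceed the threshold, and then holds that value for the remaining steps.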

3.2. Zoom-out

Since a new portion of the scene is acquired in each step, we propose a different technique for zoom-out. Assume that zoom-out is performed over q steps and in each step a is multiplied by the factor zo < 1. Instead of changing the zoom factor to a·zo in the first frame of a zoom-out operation, the HD frame is still resized by a. A (w/zo) x (h/zo) area is cropped from the resized frame and every 17th horizontal and vertical line in the cropped area is dropped until the remaining part has dimensions w x h. Dropping every 17th line ensures that the resulting macroblocks (MBs) will have good matches in the previous frame with integer motion vectors. If a smaller zo is desired, we can drop every 9th line or every 5th line to match 8 x 8 or 4 x 4 motion compensation block sizes.

The next HD frame is resized by the correct zoom factor, which is a·zo^2. Line dropping is only performed for every other frame until the final zoom factor a·zo^q is reached. Dropping lines over a number of successive frames before changing the zoom factor does not improve the performance. Dropping lines at the same location over successive frames causes a noticeable artifact, and changing the location where we drop lines causes misalignment between MB edges in successive frames, which prevents efficient motion compensation.
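The line-dropping step can be sketched as below. The sketch assumes zo = 16/17, the ratio that results from removing one line out of every 17, so that a 480 x 240 RoI is obtained from a 510 x 255 cropped area. The function name and the synthetic test area are illustrative assumptions.

```python
import numpy as np


def drop_every_nth_line(area: np.ndarray, n: int = 17) -> np.ndarray:
    """Remove every n-th row and every n-th column (the n-th, 2n-th, ... lines)."""
    rows = np.arange(area.shape[0])
    cols = np.arange(area.shape[1])
    keep_rows = rows[(rows + 1) % n != 0]
    keep_cols = cols[(cols + 1) % n != 0]
    return area[np.ix_(keep_rows, keep_cols)]


# A (w/zo) x (h/zo) area for w x h = 480 x 240 and zo = 16/17 (assumed ratio).
area = np.arange(255 * 510).reshape(255, 510)  # rows x columns, synthetic content
roi = drop_every_nth_line(area, n=17)
print(roi.shape)
```

Each run of 16 retained lines is contiguous in the cropped area, so every 16 x 16 block of the result is an unscaled copy of a 16 x 16 region of the frame resized by a, which is why an integer motion vector can match it.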

4. EXPERIMENTAL RESULTS

Experiments are performed with two 1920 x 1080 video sequences Panel Discussion 1 and Panel Discussion 2. Both are low-motion videos recorded with a static camera. Panel Discussion 2 shows a slide projected on the wall in the background so a large portion of the sequence has nearly no motion. The first frame of Panel Discussion 2 and an example RoI are shown in Fig. 1. The RoI dimensions are 480 x 240.

4.1. Control Experiments with Static Input

We first conduct a control experiment where the scene is completely static. The goal here is to gauge the potential of the proposed improvements when motion compensation only needs to account for movement due to RoI change. We use a frozen image which is the first frame of Panel Discussion 1. This image is shown in Fig. 6. All pan, tilt and zoom operations are synthetic and, unlike for the later experiments, do not correspond to an actual user input. Since we work on a static image, the non-linear noise removal is unnecessary. Thus, the results serve as an upper bound for the gain that can be achieved from compression-aware cropping and zooming in the absence of camera-noise and motion of objects in the video.

Fig. 3. Denoising with non-linear soft-coring technique.

Fig. 4. Proposed chain of compression-aware preprocessing.

Fig. 5. Illustrative example for RoI interpolation when RoI selection is restricted to integer shifts.

To test the benefits of compression-aware pan and tilt, we fix a = 480/700 and extract six RoI sequences of 100 frames each. They correspond to horizontal, vertical and diagonal movements with low and high speeds. Referring to Fig. 6, the thick borders indicate the start of all sequences and the thin borders indicate the end of the high-speed sequences. Low-speed sequences end halfway along the path of the high-speed ones. For comparison, each RoI sequence is first extracted without the compression-aware restrictions on the cropping locations. We refer to this as the compression-oblivious method. For compression-aware pan/tilt, these locations are rounded to the nearest even integer position horizontally and vertically. Since the chroma components are usually downsampled by a factor of 2 from the original resolution, restricting the cropping to even integer locations allows the chroma components to be cropped at integer locations as well as the luma component. These RoI sequences are compressed with H.264/AVC using the JM 15.1 reference software [10]. The R-D curve averaged over the six sequences is plotted in Fig. 7. We obtain up to 85% rate reduction over the oblivious method at high rates. At low rates, the prediction residuals for both the compression-aware and the compression-oblivious sequences are small due to severe quantization.
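The even-integer restriction on the cropping locations can be sketched as follows. The function names are hypothetical, ties are rounded upward as an arbitrary choice, and the chroma mapping assumes chroma planes subsampled by 2 in each direction.

```python
import math


def snap_to_even(offset: float) -> int:
    """Round a requested RoI offset to the nearest even integer position."""
    return 2 * math.floor(offset / 2 + 0.5)


def chroma_offset(luma_offset: int) -> int:
    """Offset in a chroma plane subsampled by 2 (assumes an even luma offset)."""
    if luma_offset % 2 != 0:
        raise ValueError("luma offset must be even")
    return luma_offset // 2


requested = (101.3, 58.7)  # illustrative user-selected offset (dx, dy)
luma = tuple(snap_to_even(v) for v in requested)
chroma = tuple(chroma_offset(v) for v in luma)
print(luma, chroma)
```

Because the luma offset is even, the corresponding chroma offset is an integer as well, so a pan maps to an integer shift in all three color planes.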

Compression-aware zoom-out is tested by resizing the HD frame with values from 1 to 0.39 over 16 frames and cropping the w x h center part from each frame. Again, this corresponds to the compression-oblivious method for performing zoom-out. The same RoI sequence is regenerated by applying the technique proposed in Section 3.2 and both sequences are encoded using H.264/AVC. Fig. 8 shows the R-D curve for the zoom-out operation. It indicates that the compression-aware zoom-out outperforms the oblivious method and achieves a bit-rate reduction of 35% at high rates. Since our method is only applied every other frame, the gain is not as high as compression-aware pan and tilt.

For the zoom-in operation, the transmitted sequence is different from the sequence that is finally displayed. Nevertheless, our goal is to minimize the degradation in compression-aware zoom-in that results from the magnification process that is performed at the receiver after decoding the RoI. We conduct the following experiment to determine the threshold of the magnification ratio at which we decide to acquire an input frame at the current zoom factor. A 480 x 240 area is cropped from the center of the HD frame in Fig. 6 and the cropped area is resized by different ratios to a smaller region and then magnified again to its original size. The PSNR of the result compared to the original cropped area indicates the quality of the magnification process. We found that at a magnification ratio of 1.25, the PSNR is about 39.8 dB and the magnification starts to produce noticeable artifacts thereafter. Thus, we set the value of 1.25 as the threshold for changing the zoom factor at the transmitter.
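The shrink-then-magnify measurement can be sketched in 1-D. This is only an illustration of the procedure: it assumes 8-bit samples (peak value 255), uses linear interpolation in place of the image resizing filter, and operates on a single line instead of the 480 x 240 area, so it does not reproduce the 39.8 dB figure. All function names are hypothetical.

```python
import numpy as np


def psnr(reference, test, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


def resize_line(line: np.ndarray, new_len: int) -> np.ndarray:
    """Resample a line to new_len samples (linear interpolation stand-in)."""
    positions = np.linspace(0, len(line) - 1, new_len)
    return np.interp(positions, np.arange(len(line)), line)


def round_trip_psnr(line: np.ndarray, ratio: float) -> float:
    """Shrink by `ratio`, magnify back to the original length, measure PSNR."""
    small = resize_line(line, int(round(len(line) / ratio)))
    restored = resize_line(small, len(line))
    return psnr(line, restored)


# Synthetic textured line; a real experiment would use the cropped image area.
x = np.arange(480)
line = 128 + 60 * np.sin(x / 7.0) + 20 * np.sin(x / 1.7)
for ratio in (1.1, 1.25, 1.5, 2.0):
    print(ratio, round(round_trip_psnr(line, ratio), 1))
```

Sweeping the ratio and looking for the point where the PSNR drops below an acceptable level mirrors how a threshold such as 1.25 could be selected.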

4.2. Rate-Distortion Performance for RoI Video

We compress the RoI video sequences extracted from the sequences Panel Discussion 1 and Panel Discussion 2. These RoI sequences were recorded while interactively viewing the videos. Each RoI sequence has 600 frames. Pan and tilt changes last for roughly 160 frames and zoom changes last for roughly 20 frames in each sequence. We first denoise the frames by soft-coring with the values m = 1, γ = 3 and τ = 15. These values ensure that the image quality is not degraded by denoising. The RoI sequences are then cropped using both compression-oblivious and compression-aware techniques. The sequences are encoded using the JM software with a GOP size of 30. The R-D curves for the two sequences are presented in Figs. 9 and 10. The PSNR is evaluated by comparing the encoder input and the decoder output for the compression-oblivious and compression-aware methods independently.

The compression-aware approach achieves around 70% rate reduction over the oblivious approach at high rates. We also plot the R-D curves for the sequences extracted after denoising without applying compression-aware cropping and zooming. A large percentage of the gain comes from denoising. This is because pan/tilt/zoom changes are present in only 30% of the frames. Having lower motion content, Panel Discussion 2 is encoded at about half the rate of Panel Discussion 1, yet the gain associated with the compression-aware technique is similar for both sequences.

We compare the number of bits required for compression-aware and compression-oblivious techniques on denoised frames for Panel Discussion 2 at a PSNR of 42 dB. Fig. 11(a) represents the bit-rate trace for a pan/tilt operation. The compression-aware technique yields rate reduction over all associated frames except the I frames (frame 181 in the figure). In Fig. 11(b), the trace for a zoom-in operation is plotted. The compression-aware technique results in lower bit-rates for frames with no change in the zoom factor. When a frame is acquired at the current zoom factor (frame 326), a sudden increase in rate is observed due to the large zoom change in comparison to the slowly changing zoom factor for the oblivious technique. Fig. 11(c) shows the trace for a zoom-out operation. The rate reduction for the compression-aware technique is observed every other frame, i.e., only when line dropping is performed.

Fig. 6. First frame in Panel Discussion 1. The borders indicate the start and end of RoI sequences used for testing pan and tilt operations. Solid red borders are used for horizontal movement, dashed yellow borders for vertical movement and dotted blue borders for diagonal movement.

5. CONCLUSIONS

We present a video transmission system with digital pan/tilt/zoom for viewing arbitrary portions of a high-spatial-resolution video. We show that compression-aware pan/tilt/zoom can reduce the transmission bit-rate substantially. Our method involves the reduction of camera-noise followed by cropping the RoI video sequence in a way that results in a low-energy prediction residual after motion compensation.

The proposed technique requires only simple signal resampling operations. Experimental results indicate that we can achieve about 4.5 dB gain at the same bit-rate or reduce the bit-rate by about 70% for the same video quality.

6. REFERENCES

[1] “Halo: Video Conferencing Product by Hewlett-Packard.” Website: http://www.hp.com/halo/index.html.

[2] C. Fehn et al., “Creation of High-Resolution Video Panoramas of Sport Events,” Proc. Eighth IEEE International Symposium on Multimedia ISM'06, San Diego, CA, pp. 291–298, Dec. 2006.

[3] “Dodeca 2360 system: High-resolution 360º video by Immersive Media.” Website: http://www.immersivemedia.com/#105.

[4] D. Taubman and R. Rosenbaum, “Rate-Distortion Optimized Interactive Browsing of JPEG2000 Images,” in Proc. IEEE International Conference on Image Processing (ICIP), Barcelona, Spain, vol. 3, pp. 765–768, Sept. 2003.

[5] F.-O. Devaux et al., “Remote interactive browsing of video surveillance content based on JPEG 2000,” submitted to IEEE Transactions on Circuits and Systems for Video Technology, July 2008, unpublished.

[6] A. Mavlankar, P. Baccichet, D. Varodayan, and B. Girod, “Optimal Slice Size for Streaming Regions of High Resolution Video with Virtual Pan/Tilt/Zoom Functionality,” Proc. of 15th European Signal Processing Conference (EUSIPCO), Poznan, Poland, pp. 1275–1279, Sept. 2007.

[7] S. Heymann et al., “Representation, Coding and Interactive Rendering of High-Resolution Panoramic Images and Video Using MPEG-4,” in Proc. Panoramic Photogrammetry Workshop (PPW), Berlin, Germany, Feb. 2005.

[8] “ISO/IEC 15444-1:2000, Information Technology: JPEG 2000 Image Coding System,” 2002.

[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, July 2003.

[10] “H.264/MPEG-4 AVC Reference Software Manual,” in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-AD010, Jan. 2009.

Fig. 7. R-D curve for pan and tilt experiment. [Plot: PSNR-Y (dB) versus rate (kbps); curves: Compression-aware, Compression-oblivious; annotation: 85%.]

Fig. 8. R-D curve for zoom-out experiment. [Plot: PSNR-Y (dB) versus rate (kbps); curves: Compression-aware, Compression-oblivious; annotation: 35%.]

Fig. 9. R-D curve for Panel Discussion 1 RoI sequence. [Plot: PSNR-Y (dB) versus rate (kbps); curves: Compression-aware; Oblivious, Denoised; Oblivious, Original; annotation: 62%.]

Fig. 10. R-D curve for Panel Discussion 2 RoI sequence. [Plot: PSNR-Y (dB) versus rate (kbps); curves: Compression-aware; Oblivious, Denoised; Oblivious, Original; annotation: 81%.]

Fig. 11. Comparison of bit-rate traces for different operations. (a) Pan and tilt (b) Zoom-in (c) Zoom-out. [Plots: kbits per frame versus frame number; curves: Compression-aware; Oblivious, Denoised.]
