
CONTENT-ADAPTIVE TEMPORAL CONSISTENCY ENHANCEMENT FOR DEPTH VIDEO

Huanqiang Zeng and Kai-Kuang Ma

School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798. Email: [email protected]; [email protected]

ABSTRACT

The video plus depth format, which is composed of the texture video and the depth video, has been widely used for free viewpoint TV. However, temporal inconsistency is often encountered in the depth video due to the error incurred in the estimation of the depth values. This will inevitably deteriorate the coding efficiency of the depth video and the visual quality of the synthesized view. To address this problem, a content-adaptive temporal consistency enhancement (CTCE) algorithm for the depth video is proposed in this paper, which consists of two sequential stages: 1) classification of stationary and non-stationary regions based on the texture video, and 2) adaptive temporal consistency filtering on the depth video. The result of the first stage is used to steer the second stage so that the filtering process is conducted in an adaptive manner. Extensive experimental results have shown that the proposed CTCE algorithm can effectively mitigate the temporal inconsistency in the original depth video and consequently improve the coding efficiency of the depth video and the visual quality of the synthesized view.

Index Terms— Depth video, temporal consistency enhancement, virtual view synthesis, free viewpoint TV

1. INTRODUCTION

With the rapid development of camera, display, and network communication techniques, free viewpoint TV (FTV) has been quickly emerging, as it provides customers with the freedom to choose different viewpoints of the observed scene [1]. Due to its superior interactivity and realism, FTV is expected to be widely exploited in various domains, such as home entertainment, education, and the medical field.

One of the major challenges for FTV is how to store and transmit the large amount of multi-view data required to synthesize a virtual view of good quality at a chosen viewpoint. For this, an efficient data representation format for multi-view video, called video plus depth (refer to Fig. 1 for an illustration), has been standardized by MPEG [2] and widely used. It allows fewer viewpoints' data to be transmitted at the sender side, while the desired virtual views are synthesized at the receiver side from the decoded texture video and depth video using the depth-image-based rendering (DIBR) technique [3]. It is well recognized that the quality of the depth video is instrumental to this process; for that, a graph-cut-based depth estimation reference software (DERS) [4] has been developed and distributed by the MPEG FTV standardization body to generate the depth video. However, the depth estimation algorithm used in DERS is conducted frame by frame independently, without considering the temporal correlation inherent in video. Consequently, the resulting depth video could suffer from temporal inconsistency. For example, video objects that are actually situated in the same depth plane could be estimated with different depth values across


Fig. 1. The first frame of View 4 of sequence “Newspaper” is presented here as an illustration of the video plus depth format, which consists of the texture video (left) and its associated depth video (right). Note that the depth range of each view is normalized to the full dynamic range [0, 255] based on the maximum Zfar and the minimum Znear distances from the camera for the corresponding 3D points.

adjacent temporal frames. As expected, this temporal inconsistency will cause annoying flickering artifacts and thus impair the subjective quality of the synthesized view. Moreover, more temporal inconsistency means less temporal correlation, which will inevitably reduce the coding efficiency of the depth video. Therefore, mitigating temporal inconsistency in the depth video is an effective way to improve the coding efficiency of the depth video and the visual quality of the synthesized view.

Several temporal consistency enhancement algorithms can be found in [5]-[7]. Larsen et al. [5] enforced temporal consistency by exploiting a temporal belief propagation function on a graph composed of seven nodes, namely, the current pixel, its four spatially-connected pixels, and its two temporally corresponding pixels identified by using optical flow, one from the previous frame and the other from the next frame. Lee et al. [6] imposed a temporal constraint on the energy function of the graph-cut-based DERS [4]. This temporal constraint is established by utilizing the depth-value difference measured between the current pixel and its temporally corresponding pixel in the previous frame, identified by using motion estimation. Consequently, the undesired temporal variations of the estimated depth values can be reduced. Min et al. [7] proposed a weighted mode filtering method to increase the resolution, suppress the noise, and enforce the temporal consistency of depth video. In this method, the temporal consistency is improved by exploiting a patch similarity measurement based on the current pixel and its temporally corresponding pixel identified by using optical flow.

In this paper, an efficient content-adaptive temporal consistency enhancement (CTCE) algorithm for the depth video is proposed, which can be applied as a post-processing technique to any depth estimation algorithm to facilitate follow-up image processing tasks. In our approach, the stationary and non-stationary regions are first identified. The obtained result is then used to guide the second stage


so that the temporal consistency filtering process is conducted in an adaptive manner. Experimental results have clearly shown that significant improvements in the coding efficiency of the depth video and the visual quality of the synthesized view are achieved.

The rest of this paper is organized as follows. The proposed CTCE algorithm is presented in detail in Section 2. Extensive simulation results are documented and discussed in Section 3. Finally, conclusions are drawn in Section 4.

2. PROPOSED CONTENT-ADAPTIVE TEMPORAL CONSISTENCY ENHANCEMENT (CTCE) ALGORITHM

The proposed CTCE algorithm consists of two sequential stages, described in the following two sub-sections, respectively. The stationary and non-stationary regions are identified in the first stage so that the follow-up temporal consistency filtering process can be applied to these two kinds of regions adaptively in the second stage.

2.1. Stationary and Non-stationary Region Detection

Without a scene cut or any camera movement, the depth values across adjacent frames of the depth video are expected to be constant or consistent in stationary regions, and any depth variation should only be incurred by, and present in, non-stationary regions, which normally contain moving objects. Therefore, it is essential to discriminate between these two types of regions as a first step, in order to provide adaptation for the temporal consistency filtering in the next step.

Considering that the depth video and its corresponding texture video should have similar image segments, the stationary and non-stationary regions can be detected simply based on the color difference computed at each pixel position across adjacent frames of the texture video. Here, the bidirectional color difference D_n(p) incurred at each pixel p of the current frame, with respect to the same pixel position in the previous frame and in the next frame, is obtained by summing up all the differences; that is,

D_n(p) = \sum_{i=n}^{n+1} \big( |Y_i(p) - Y_{i-1}(p)| + |U_i(p) - U_{i-1}(p)| + |V_i(p) - V_{i-1}(p)| \big)    (1)

where p = (x, y) is the spatial coordinate, n is the index of the current frame, and Y_i(p), U_i(p), and V_i(p) represent the Y, U, and V component values at pixel p of the i-th frame of the texture video, respectively.

To detect the stationary regions, the color difference D_n(p) computed at each pixel position is compared with a pre-set threshold T = 30, which is empirically determined from extensive simulation experiments. If D_n(p) < T, pixel p is classified as lying in a stationary region; otherwise, pixel p is considered to lie in a non-stationary region.
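As an illustration of this first stage, the following NumPy sketch computes the bidirectional color difference of Eq. (1) and applies the threshold T = 30. It is a minimal example under the assumption that each texture frame is an 8-bit YUV image stored as an H×W×3 array; the function and variable names are ours, not from any released implementation.

```python
import numpy as np

def stationary_mask(yuv_prev, yuv_cur, yuv_next, threshold=30):
    """Classify each pixel of the current texture frame as stationary or not.

    Each input is an (H, W, 3) array holding the Y, U, V components of one
    texture frame.  Following Eq. (1), the bidirectional color difference
    D_n(p) sums the absolute Y, U, and V differences of the current frame
    against both the previous and the next frame at the same pixel position.
    Returns a boolean (H, W) mask that is True where D_n(p) < threshold,
    i.e. where the pixel is classified as stationary.
    """
    f_prev = yuv_prev.astype(np.int32)
    f_cur = yuv_cur.astype(np.int32)
    f_next = yuv_next.astype(np.int32)
    # |current - previous| summed over the Y, U, V components ...
    diff_back = np.abs(f_cur - f_prev).sum(axis=2)
    # ... plus |next - current| summed over the Y, U, V components.
    diff_fwd = np.abs(f_next - f_cur).sum(axis=2)
    d_n = diff_back + diff_fwd
    return d_n < threshold  # True: stationary, False: non-stationary
```

In the second stage, this boolean mask steers the choice of depth frame in Eq. (6).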

2.2. Temporal Consistency Filter

Unlike the texture video, the depth video reflects the depth information of video objects and usually consists of large, segment-wise smooth regions with distinct boundaries. On the one hand, these smooth regions are advantageous for the coding efficiency of the depth video. On the other hand, it is desirable to preserve the sharp boundaries of video objects, as they are crucial to the visual quality of the synthesized view. Therefore, our goal is to design a temporal consistency filter that reduces the temporal inconsistency of the depth video while satisfying these two requirements simultaneously. Furthermore, the texture video is exploited in this work to further benefit the temporal consistency enhancement. This is because similar image segments can be found in the texture video and its depth video, which is especially beneficial for preserving the boundary sharpness of video objects. Motivated by the above analysis, a temporal consistency filter is proposed in this paper as follows.

For each pixel in the depth video under filtering, the proposed temporal consistency filter's output is a weighted average of its neighboring pixels. Three types of weights are considered, one based on spatial closeness and two on pixel-value similarity, following the observation that the closer and more similar the neighboring pixels are to the current pixel, the higher the correlation they have with it. For the former, the filter measures the spatial closeness between the current pixel and its neighboring pixel by their pixel distance. For the latter, the filter considers the pixel-value similarity in the texture video and in the depth video, respectively. Lastly, the filtering process is further adjusted according to the content of the video in terms of stationary or non-stationary regions. Such adaptation effectively enforces the temporal consistency in stationary regions while maintaining the depth variation in non-stationary regions. With the above discussion, the proposed temporal consistency filter is consolidated and formulated as

\hat{d}_n(p) = \frac{\sum_{q \in \varepsilon} W_C(q) \cdot W_S^T(q) \cdot W_S^D(q) \cdot d_n(q)}{\sum_{q \in \varepsilon} W_C(q) \cdot W_S^T(q) \cdot W_S^D(q)}    (2)

where p denotes the current pixel under filtering, q denotes one of its neighboring pixels within the neighborhood ε (in this work, a 10 × 10 window is used), d_n(q) denotes the original depth value of pixel q in the current frame, and \hat{d}_n(p) denotes the resulting filtered depth value of the current pixel p in the current frame. The weights W_C(q), W_S^T(q), and W_S^D(q) are generated through a Gaussian function. W_C(q) measures the closeness in terms of the Euclidean distance between pixels p and q, while W_S^T(q) and W_S^D(q) measure the similarity between p and q in the texture video (denoted by T in the superscript) and in the depth video (denoted by D in the superscript), respectively; they are individually defined as

W_C(q) = \exp\!\left( -\frac{\|p - q\|_2^2}{2\sigma_C^2} \right)    (3)

W_S^T(q) = \exp\!\left( -\frac{(Y_n(p) - Y_n(q))^2}{2\sigma_S^2} \right)    (4)

W_S^D(q) = \exp\!\left( -\frac{(Z(p) - Z(q))^2}{2\sigma_S^2} \right)    (5)

where σ_C = 3 and σ_S = 0.1 are the standard deviations, which are empirically determined through extensive experiments, Y_n(·) denotes the Y-component value (i.e., the luminance) of the current frame of the texture video, and Z(·) denotes the depth value of the depth video. Finally, depending on whether the current pixel p under filtering lies in a stationary region or in a non-stationary region, which has been decided in the previous stage as described in Section 2.1, the depth value Z(·) is adaptively chosen as follows:

Z(\cdot) = \begin{cases} d_{n-1}(\cdot), & \text{if } p \text{ falls in a stationary region;} \\ d_n(\cdot), & \text{otherwise} \end{cases}    (6)
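For concreteness, the filtering stage of Eqs. (2)-(6) can be sketched as below. This is an illustrative, unoptimized implementation rather than the authors' code: it assumes 8-bit inputs normalized to [0, 1] so that σ_S = 0.1 is meaningful (the value scale is not stated in the text), it uses the original previous depth frame as d_{n-1}, and all function and variable names are ours.

```python
import numpy as np

def ctce_filter(depth_cur, depth_prev, luma_cur, stationary,
                window=10, sigma_c=3.0, sigma_s=0.1):
    """Content-adaptive temporal consistency filtering of one depth frame.

    depth_cur, depth_prev : (H, W) 8-bit depth maps d_n and d_{n-1}
    luma_cur              : (H, W) 8-bit Y component of the current texture frame
    stationary            : (H, W) boolean mask from the first stage (Section 2.1)
    Inputs are normalized to [0, 1] here so that sigma_s = 0.1 is meaningful
    (an assumption, since the value scale is not stated in the paper).
    """
    d_n = depth_cur.astype(np.float64) / 255.0
    d_prev = depth_prev.astype(np.float64) / 255.0
    y_n = luma_cur.astype(np.float64) / 255.0
    h, w = d_n.shape
    r = window // 2
    out = np.empty_like(d_n)

    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - r), min(h, y + r)
            x0, x1 = max(0, x - r), min(w, x + r)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Eq. (3): spatial closeness weight from the Euclidean pixel distance.
            w_c = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2.0 * sigma_c ** 2))
            # Eq. (4): similarity weight in the texture (luminance) domain.
            w_t = np.exp(-(y_n[y0:y1, x0:x1] - y_n[y, x]) ** 2 / (2.0 * sigma_s ** 2))
            # Eqs. (5)-(6): similarity weight in the depth domain; for a stationary
            # pixel the previous depth frame d_{n-1} is used, otherwise d_n.
            z = d_prev if stationary[y, x] else d_n
            w_d = np.exp(-(z[y0:y1, x0:x1] - z[y, x]) ** 2 / (2.0 * sigma_s ** 2))
            # Eq. (2): normalized weighted average of the current depth values.
            weights = w_c * w_t * w_d
            out[y, x] = np.sum(weights * d_n[y0:y1, x0:x1]) / np.sum(weights)

    return np.clip(out * 255.0, 0.0, 255.0).astype(depth_cur.dtype)
```

Combined with the first stage, one depth frame would be processed as ctce_filter(depth[n], depth[n-1], luma[n], stationary_mask(yuv[n-1], yuv[n], yuv[n+1])).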


[Fig. 2 block diagram, sender side to receiver side: the multi-view texture video, together with the depth video produced by depth estimation (DERS 5.0) and enhanced by the proposed CTCE depth enhancement method, is encoded by MVC encoders, transmitted, decoded by MVC decoders, and fed to view synthesis (VSRS 3.5) to produce the synthesized video.]

Fig. 2. The FTV system with incorporation of the proposed CTCE algorithm for mitigating temporal inconsistency.

Table 1. The multi-view video sequences used in our experiments

Sequences          Resolution   Original views   Synthesized view   Frames
Newspaper          1024×768     4, 6             5                  100
Champagne Tower    1280×960     41, 43           42                 100
Alt Moabit         1024×768     9, 11            10                 100
Book Arrival       1024×768     9, 11            10                 100

3. EXPERIMENTAL RESULTS AND DISCUSSION

3.1. Coding Efficiency Evaluation

To evaluate the performance of the proposed CTCE algorithm, it has been incorporated into the FTV system as shown in Fig. 2. In our experiments, the original depth video is generated by using DERS [4]. Both the texture video and the depth video are independently encoded using the multi-view video coding (MVC) reference software, joint multi-view video coding (JMVC 8.5), developed by JVT [8]. The reconstructed texture video and depth video are then used as the input to synthesize the virtual view by using the view synthesis reference software (VSRS 3.5) [9].

To measure the gain in coding efficiency contributed by the proposed CTCE algorithm, experiments have been conducted on the original depth video as well as the enhanced depth video using the MVC reference software, JMVC 8.5. The test conditions of JMVC are set as follows. Each test sequence shown in Table 1 and Fig. 3 [10] is encoded with a GOP length of 16 and quantization parameters of 24, 28, 32, and 36, respectively. Rate-distortion optimization and CABAC entropy coding are enabled. The search range of both motion estimation and disparity estimation is set to ±64.

Since the depth video is used to synthesize the virtual view rather than being displayed, the quality of the synthesized view should be considered in the performance evaluation. Therefore, the PSNR refers to that of the synthesized view, which is calculated by comparing the synthesized view with the ground truth (i.e., the original texture video captured at the same viewpoint), while the bit rate refers to that of the depth video. Table 2 compares the performance of two temporal consistency enhancement algorithms, the proposed CTCE algorithm and the method presented in [6]. Note that the rate-distortion gain of each method is measured using BDPSNR and BDBR [11], respectively, with respect to that of the original depth video without applying any temporal consistency enhancement method.
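As a reference for how these measures are obtained, the sketch below follows the standard Bjontegaard procedure of [11]: cubic fits to the four rate-distortion points per curve, integrated over their overlapping interval. It is a generic implementation given here for reproducibility, not code from the paper, and the argument names are ours; negative BDBR values, as in Table 2, indicate bit-rate savings.

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR gain (dB) of the test RD curve over the reference [11]."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    # Cubic fit of PSNR as a function of log10(bit rate) for each curve.
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    # Integrate both fits over the common (overlapping) bit-rate interval.
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bit-rate change (%) of the test RD curve over the reference [11]."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    psnr_ref, psnr_test = np.asarray(psnr_ref), np.asarray(psnr_test)
    # Cubic fit of log10(bit rate) as a function of PSNR for each curve.
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(psnr_ref.min(), psnr_test.min())
    hi = min(psnr_ref.max(), psnr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0
```

For example, bd_psnr(depth_bitrates_orig, synth_psnrs_orig, depth_bitrates_ctce, synth_psnrs_ctce), with the four QP points per curve, reports a positive value when the enhanced depth video yields a higher synthesized-view PSNR at the same depth bit rate.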

From Table 2, it can be seen that the proposed CTCE algorithm achieves, on average, a 0.89 dB PSNR improvement and a 38.39% bit rate saving compared with the outcomes resulting from the original depth video. The experimental results further indicate that the proposed CTCE algorithm consistently outperforms Lee et al. [6], by a 0.40 dB PSNR gain and a 16.01% bit rate reduction on average.

Fig. 3. Illustration of the multi-view video sequences (first frame): (a) Newspaper (View 4), (b) Champagne Tower (View 41), (c) Alt Moabit (View 9), and (d) Book Arrival (View 9).

Table 2. Experimental results of two temporal consistency enhancement methods: (A) Lee et al. [6], and (B) our proposed CTCE algorithm.

Sequences          Method   BDPSNR (dB)   BDBR (%)
Newspaper          (A)      +0.21         -18.18
                   (B)      +0.56         -39.24
Champagne Tower    (A)      +0.41         -17.12
                   (B)      +1.23         -42.76
Alt Moabit         (A)      +0.54         -21.84
                   (B)      +0.76         -31.29
Book Arrival       (A)      +0.79         -32.37
                   (B)      +1.03         -40.18
Average            (A)      +0.49         -22.38
                   (B)      +0.89         -38.39

3.2. Visual Quality Evaluation

To show the effect on visual quality, for each multi-view video sequence shown in Table 1 and Fig. 3, each frame of the virtual view is synthesized from its two neighboring texture-video frames and their associated depth-video frames using VSRS 3.5 [9]. The subjective quality comparisons for two multi-view video sequences, “Newspaper” and “Book Arrival,” are shown in Fig. 4 and Fig. 5, respectively, for illustration. One can see that the original depth video presents a noticeable temporal inconsistency across adjacent frames, leading to a synthesized view with inferior quality. After applying the proposed CTCE algorithm, the content of adjacent frames in the enhanced depth video and the corresponding synthesized view looks much more consistent. Moreover, the annoying flickering artifacts, especially those along the video objects' boundaries in the synthesized view, are effectively suppressed.

[Fig. 4 layout: the 40th, 41st, and 42nd frames of the View 4 depth video, the View 6 depth video, and the View 5 synthesized view, shown for (a) the original video and (b) the enhanced video.]

Fig. 4. An enlarged portion of the depth video and the synthesized view of sequence “Newspaper”: (a) Original video; (b) Enhanced video by the proposed CTCE algorithm.

[Fig. 5 layout: the 9th, 10th, and 11th frames of the View 9 depth video, the View 11 depth video, and the View 10 synthesized view, shown for (a) the original video and (b) the enhanced video.]

Fig. 5. An enlarged portion of the depth video and the synthesized view of sequence “Book Arrival”: (a) Original video; (b) Enhanced video by the proposed CTCE algorithm.

4. CONCLUSION

In this paper, a novel content-adaptive temporal consistency enhancement (CTCE) algorithm is proposed for processing the depth video in order to improve its coding efficiency and the visual quality of the synthesized view. In our approach, the stationary and non-stationary regions are first detected and classified so that the follow-up temporal consistency filtering can be conducted on these two types of regions with different treatments. Extensive simulation results have clearly justified the effectiveness of the proposed CTCE algorithm, which not only achieves a significant improvement in the coding efficiency of the depth video but also produces distinctly improved visual quality of the synthesized view.

5. REFERENCES

[1] M. Tanimoto, M. P. Tehrani, T. Fujii, and T. Yendo, “Free-viewpoint TV,” IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 67-76, January 2011.

[2] A. Bourge and C. Fehn, “White Paper on ISO/IEC 23002-3 Auxiliary Video Data Representations,” ISO/IEC JTC1/SC29/WG11, MPEG N8039, April 2006.

[3] C. Fehn, “Depth-Image-Based Rendering (DIBR), Compression and Transmission for a New Approach on 3D-TV,” Proceedings of SPIE Stereoscopic Displays and Virtual Reality Systems XI, pp. 93-104, January 2004.

[4] M. Tanimoto, T. Fujii, M. P. Tehrani, and M. Wildeboer, “Depth Estimation Reference Software (DERS) 5.0,” ISO/IEC JTC1/SC29/WG11, MPEG M16605, June 2009.

[5] E. S. Larsen, P. Mordohai, M. Pollefeys, and H. Fuchs, “Temporally Consistent Reconstruction from Multiple Video Streams Using Enhanced Belief Propagation,” International Conference on Computer Vision (ICCV), pp. 1-8, October 2007.

[6] S. B. Lee and Y. S. Ho, “Temporally Consistent Depth Map Estimation Using Motion Estimation for 3DTV,” International Workshop on Advanced Image Technology, pp. 149(1-6), January 2010.

[7] D. Min, J. Lu, and M. N. Do, “Depth Video Enhancement Based on Weighted Mode Filtering,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1176-1190, March 2012.

[8] Joint Video Team, Multi-view Video Coding Reference Software—Joint Multi-view Video Coding (JMVC 8.5). [Online]. Available: garcon.ient.rwth-aachen.de, March 2011.

[9] M. Tanimoto, T. Fujii, and K. Suzuki, “View Synthesis Algorithm in View Synthesis Reference Software 2.0 (VSRS2.0),” ISO/IEC JTC1/SC29/WG11, MPEG M16090, February 2009.

[10] Y. S. Ho, E. K. Lee, and C. Lee, “Multi-view Video Test Sequence and Camera Parameters,” ISO/IEC JTC1/SC29/WG11, MPEG M15419, April 2008.

[11] G. Bjontegaard, “Calculation of Average PSNR Differences between RD-Curves,” Document VCEG-M33, VCEG 13th Meeting, April 2001.
