
Temporal Visual Masking for HEVC/H.265

Perceptual Optimization

Velibor Adzic, Hari Kalva, Borko Furht

Computer & Electrical Engineering and Computer Science

Florida Atlantic University

Boca Raton, Florida, USA

{vadzic, hkalva, bfurht}@fau.edu

Abstract—We present a method that employs human visual system characteristics for perceptually lossless optimization of video coding. The algorithm is robust and applicable to any modern hybrid coder. We evaluate it using the most recent standard (HEVC/H.265) in order to show potential benefits for future applications. Experiments show savings of up to 8% in bitrate without any loss in quality, as confirmed by two sets of subjective experiments using the DSCQS and ACJ methodologies.

Keywords: HEVC, H.265, perceptual video, HVS, subjective quality

I. INTRODUCTION

There are at least two motivating factors for the work presented in this paper. On one hand, high demand for video content requires improvements in compression that cannot always be achieved with state-of-the-art coders; on the other hand, solving this problem provides an opportunity for an interdisciplinary approach to problem solving. For more than a century, psychovisual research has provided models of numerous phenomena observed in the human visual system (HVS). Some of these models have been implemented in modern image and video coding algorithms, but many more are left unexplored. One such phenomenon is visual masking, originally reported at the end of the 19th and the beginning of the 20th century [1, 2]. Visual masking can be exhibited in either the spatial or the temporal domain. While researchers have considered spatial masking for image coding and perceptual extensions, temporal visual masking has received much less attention from the engineering community. It wasn't until 1989 that temporal masking was considered in light of its potential benefits for video coding [3].

The rest of the paper is organized as follows. In Section II we describe temporal visual masking, and a review of related work is presented in Section III. In Sections IV and V we describe our model for temporal masking and its implementation, together with results from experiments. Finally, in Section VI we present conclusions and open problems for future research.

II. TEMPORAL VISUAL MASKING

Temporal visual masking occurs when two competing visual stimuli are presented one after the other and one of the stimuli gets masked. Although there is no consensus on the biological explanation of this phenomenon, one theory is that the higher-level processing of visual information (the target) is interrupted by the sudden presentation of another stimulus (the mask).

There are two basic kinds of temporal visual masking – forward and backward. The terms designate the order in which the stimuli are presented: in forward masking the mask precedes the target, while in backward masking the target is presented before the mask. More details about the history and recent findings in backward visual masking can be found in [4]. We focus on backward masking, as it seems intuitive that backward masking allows more impairments to be introduced into the video sequence without propagation. If we consider frames of the video sequence as targets and mask, we can conclude that a certain number of frames before the mask frame can be impaired without the observer noticing.

III. RELATED WORK

As stated in the introduction, work on the possible effects of temporal visual masking on the presentation of video sequences was first introduced in [3]. In that paper Girod investigated forward masking and its effects on frames just after a scene change. In [5] the authors further explored the possibility of increased compression after an abrupt stimulus change. They indicated that extra compression can be achieved because of the lower visibility of high spatial frequencies in the first 100 milliseconds of a new scene. Another application of forward masking to video coding is presented in [6]. The authors reported the possibility of "hiding" quantization artifacts in the first frame of a new scene, but also found that the artifacts are visible starting from the second frame, allowing only moderate savings in bitrate.

Backward masking was explored to a lesser extent. In [7] and [8] the authors investigated the effects of frame dropping (or freezing) for a certain amount of time just before and after the scene change. They found that the visibility of repeated frames peaked 30-100 milliseconds after the scene cut. While [7] claimed forward masking is stronger, [8] claimed backward masking is more dominant. Our prior work confirmed that backward masking is much stronger, and we also showed that frame freezing is not an effective way of exploiting temporal masking [9]. While the work in the reviewed papers introduces temporal visual masking into video coding, none of the authors established a precise correlation between the parameters of the psychophysical effects and video coding parameters. None of the previous studies applied and evaluated temporal masking for bitrate reduction in video coding. Moreover, most of the findings were preliminary and not fully implemented.


IV. MODEL FOR BACKWARD MASKING

In order to utilize the backward masking phenomenon for video coding, it is necessary to develop a model that connects parameters and results from psychophysical experiments with the parameters of video coding. In this particular case, the temporal masking phenomenon suggests a high tolerance for visual impairments just before a significant inter-frame change. A particularly good candidate is a scene cut, where all the elements needed for visual masking are present: the first frame of a new scene acts as a mask on several frames at the end of the previous scene, which become the target. Since we consider only the quantization impairments as the target, and not the whole frame, the mask in this case has much higher energy and relatively larger size. This can be classified as a particular mode of backward masking called "pattern masking" [4]. Although numerous studies explore pattern masking from the psychophysical perspective [9-11], these studies only report results obtained from experiments and do not provide an encompassing model.

The model that the authors developed earlier is used as a basis [12]. It is improved with more precise parameters based on empirical results. A logistic function fits the results from strong pattern masking, which follow a monotonically increasing curve of target visibility as the duration of the stimulus onset asynchrony increases. We are interested in the amount of quantization that can be introduced without being noticed by observers. Since the quantization parameter (QP) directly determines the quantization levels in most encoders, we represent the model with the formula:

∆QP = s + (∆QPmax − s) / (1 + e^(L − 2.5·F/k))        (1)
In this formula, ∆QP is the increase of the QP parameter, ∆QPmax is the difference between the maximum QP value allowed by the encoder and the QP value used by a standard encoder for that frame, s is the initial value of the QP increase (starting offset), F is the sequence number of a frame in the ramp, L is the logistic parameter, and k is a normalizing coefficient for different numbers of frames in the ramp (the number of frames divided by 5). Examples of ramp quantization functions for the cases of 5 and 10 frames, with an initial QP value of 30 and a starting QP offset of 1, are presented in Fig. 1. The ramp can be extended to any number of frames by adjusting the parameter k. Increasing the ramp length increases bitrate savings but begins to introduce perceptible distortion. In all of our experiments we use a ramp of 5 frames only.

The model is implemented in the reference encoder for the HEVC/H.265 standard (HM version 10.1) [13]. All ramps are set to 5 frames before each scene cut. The QP increase for each frame in the ramp is determined using sampled values of the continuous function in (1).
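As an illustration of how the sampled ∆QP values can be obtained, the following Python sketch evaluates (1) for each frame of the ramp. It is not the actual HM patch; the values chosen here for L and s are only illustrative assumptions (the paper does not state the exact values), and the maximum QP of 51 is the limit defined by the HEVC standard.

    import math

    def qp_ramp_delta(frame_idx, ramp_len, base_qp, qp_max=51, s=1.0, logistic_l=5.0):
        """Sampled Delta-QP from (1) for frame `frame_idx` (1..ramp_len) of the ramp.

        base_qp    -- QP that the standard encoder would use for this frame
        qp_max     -- maximum QP allowed by the encoder (51 for HEVC)
        s          -- starting QP offset of the ramp
        logistic_l -- logistic parameter L (illustrative value, not from the paper)
        """
        k = ramp_len / 5.0                      # normalizing coefficient (frames / 5)
        dqp_max = qp_max - base_qp              # headroom above the standard QP
        dqp = s + (dqp_max - s) / (1.0 + math.exp(logistic_l - 2.5 * frame_idx / k))
        return int(round(dqp))                  # encoder QP offsets are integers

    # Example: a 5-frame ramp before a scene cut with base QP 30 (cf. Fig. 1)
    ramp = [qp_ramp_delta(f, 5, 30) for f in range(1, 6)]
    print(ramp)   # Delta-QP grows monotonically toward the cut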

V. EXPERIMENTS AND RESULTS

Because we use perceptual optimization of video coding and claim visually lossless results, the only way to test our claims is through subjective testing. Our goal is to conduct two sets of experiments and include control cases. First, we directly compare original sequences with ones that are modified by QP ramps. Second, we introduce control sequences in which exactly the same ramp is inserted not just before a scene change, but at some other, random position. Our claim is that subjects will be able to perceive those ramps, because there is no backward masking effect on the whole frame if it is not very close to the end of a scene.

All three classes of sequences – original (HM), backward masked (BM) and random masked (RM) – are tested in two sets of experiments: first using the adjectival categorical judgment (ACJ) method and then using the double-stimulus continuous quality-scale (DSCQS) method, both as specified in ITU-R BT.500 [14]. All recommendations from the BT.500 specification were followed in order to obtain valid results. The sequences used in the experiments are listed in Table I. We use high-definition clips of popular video sequences chosen based on YouTube popularity. All videos were obtained from the YouTube source at the highest provided quality (720p and 1080p) and transcoded into HEVC bitstreams.

TABLE I. CHARACTERISTICS OF THE TEST SEQUENCES

Class   Sequences   Resolution   Frame rate (fps)   Duration

Movie   2           1280x720p    25                 10s
Music   2           1280x720p    50                 10s
Promo   2           1280x720p    30                 10s
Movie   2           1920x1080p   25                 10s
Music   2           1920x1080p   50                 10s
Promo   2           1920x1080p   30                 10s

Figure 1. Examples of quantization ramps for cases of 5 frames (left) and 10 frames (right). Y-axis represents ∆QP values.


This selection is reasonable because our primary goal is to reduce the bitrate of video content distributed over the Internet, which according to some recent reports accounts for the majority of overall Internet traffic. Moreover, most of the standard test sequences do not resemble popular videos and usually have no scene changes, or only one. Since we conduct a paired comparison and not a direct impairment-scale comparison, the source videos need not come from the original camera source.

For the coded sequences we want to match the quality settings from recent standardization documents. This is why we coded the HM bitstreams with the default low-delay settings at three different QP levels: 20, 30 and 40. For the BM and RM sequences we use exactly the same settings, with the addition of quantization ramps at the signaled positions. This ensures that all parts of the video sequences are identical except for the region of frames where the ramps are inserted. We expect the effects of backward masking to hold for other profiles of HEVC as well.
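A minimal sketch of how the ramp positions for the BM and RM conditions could be derived, assuming the scene-cut frame indices are already known (signaled), is given below. The helper names are hypothetical and only illustrate the placement rule: a BM ramp occupies the frames immediately preceding a cut, while an RM control ramp of the same length is placed at a random position away from any cut.

    import random

    RAMP_LEN = 5

    def bm_ramp_positions(scene_cuts):
        """Frames to impair for the BM condition: the RAMP_LEN frames before each cut."""
        return [list(range(cut - RAMP_LEN, cut)) for cut in scene_cuts if cut >= RAMP_LEN]

    def rm_ramp_positions(scene_cuts, num_frames, margin=30, seed=0):
        """Control (RM) ramps of the same length, placed away from every scene cut.

        Assumes at least one valid position exists; `margin` keeps the control
        ramp well clear of any cut so no masking effect can occur.
        """
        rng = random.Random(seed)
        positions = []
        for _ in scene_cuts:                       # one control ramp per cut (illustrative)
            while True:
                start = rng.randrange(0, num_frames - RAMP_LEN)
                if all(abs(start - cut) > margin for cut in scene_cuts):
                    positions.append(list(range(start, start + RAMP_LEN)))
                    break
        return positions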

The videos were presented to 15 subjects (12 male and 3 female, all aged between 23 and 45). Three subjects are categorized as video experts, but the analysis of results could not distinguish their ratings from those of the other subjects. All subjects have normal or corrected-to-normal visual acuity. In order to validate the results, several control cases were inserted randomly into the test playlist. They contain either pairs of identical video clips or video clips of the same sequence at an extreme quality difference. All subjects passed the control.

The results of the ACJ tests are presented in Table II. In the ACJ methodology, sequences are presented in randomized pairs, after which the subject is asked to assign a score to the sequence that is second in order.

TABLE II. RESULTS OF THE ACJ TESTS FOR ALL CONDITIONS.

Results for pairs designated as (first) – (second) in ACJ order

           BM-HM    BM-HM    BM-HM    RM-HM      RM-BM
           QP20     QP30     QP40     all QPs    QP20

Mean       -0.11    -0.056   -0.12     0.74       1.89
Median      0        0        0        1          2

The score is determined as a quality measure of a sequence compared to the first sequence and ranges from -3, interpreted as "much worse", to +3, interpreted as "much better". As can be seen, subjects did not notice any statistically significant difference between the HM and BM sequences. However, there is a significant difference in quality ratings between both the HM and BM sequences as compared to the RM sequences. This is especially noticeable for the sequences of high quality (QP 20), because ramp impairments tend to "pop out" more. The backward masking effect ensured that even at this quality level, well-placed ramps are not perceivable at all.
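As an illustration of the kind of check behind the "no statistically significant difference" statement (the paper does not specify the exact test, so this is only an assumed approach with placeholder scores, not the experimental data), the per-subject ACJ scores for one BM-HM condition can be tested against a zero mean:

    from scipy import stats

    # Hypothetical per-subject ACJ scores (-3..+3) for one BM-HM condition
    scores = [0, -1, 0, 0, 1, 0, -1, 0, 0, 0, -1, 0, 1, 0, -1]

    t_stat, p_value = stats.ttest_1samp(scores, popmean=0.0)
    if p_value > 0.05:
        print("no statistically significant difference from 0 (p = %.3f)" % p_value)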

The results of the DSCQS test, which underwent ANOVA analysis and were calculated for a 95% confidence interval, are presented in Fig. 2. Because of the limited space we present results for selected classes of sequences. Results were similar across all classes of sequences, without any outliers that would indicate that impairments were noticed in the BM sequences. Moreover, we additionally extracted and analyzed pairs of DSCQS comparisons between HM and BM sequences and could not find any statistically significant difference in the scores. It is clear from the presented results that the RM sequences again underperformed compared to the HM and BM sequences at the same QP level, which validates the claim that the strong backward masking effect occludes the impairments in the BM sequences.
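A minimal sketch of the corresponding DSCQS analysis, assuming raw MOS ratings per condition are available (the arrays below are placeholders, not the actual experimental data), follows:

    import numpy as np
    from scipy import stats

    # Placeholder MOS ratings for one sequence at one QP level, per condition
    mos_hm = np.array([72, 68, 75, 70, 71, 69, 74, 73, 70, 68, 72, 71, 69, 70, 73])
    mos_bm = np.array([71, 69, 74, 70, 72, 68, 73, 72, 71, 67, 71, 72, 68, 70, 72])
    mos_rm = np.array([61, 58, 65, 60, 62, 57, 63, 64, 60, 59, 62, 61, 58, 60, 63])

    # One-way ANOVA across the three conditions
    f_stat, p_value = stats.f_oneway(mos_hm, mos_bm, mos_rm)

    # Mean and 95% confidence interval per condition (as plotted in Fig. 2)
    for name, mos in [("HM", mos_hm), ("BM", mos_bm), ("RM", mos_rm)]:
        ci = stats.t.interval(0.95, len(mos) - 1, loc=mos.mean(), scale=stats.sem(mos))
        print("%s: mean %.1f, 95%% CI (%.1f, %.1f)" % (name, mos.mean(), ci[0], ci[1]))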

Finally, in Table III we present the bitrate savings achieved for the BM sequences as compared to the HM sequences.

TABLE III. AVERAGE BITRATE SAVINGS FOR 720P SEQUENCES.

Sequence class   QP value   Average HM bitrate   Average BM bitrate   % saving

Movie            40           105.804 kbps         100.379 kbps        5.12
Movie            30           472.280 kbps         434.155 kbps        8.07
Movie            20          1714.681 kbps        1562.588 kbps        8.87
Music            40           351.039 kbps         338.920 kbps        3.45
Music            30          1174.536 kbps        1109.972 kbps        5.50
Music            20          3439.195 kbps        3234.087 kbps        5.96
Promo            40           198.684 kbps         195.321 kbps        1.69
Promo            30           785.924 kbps         754.569 kbps        3.99
Promo            20          2883.528 kbps        2761.289 kbps        4.24
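For clarity, the saving percentages in Table III are simply the relative bitrate reduction of BM with respect to HM; for example, for the Movie class at QP 30: (472.280 − 434.155) / 472.280 ≈ 8.07%. A one-line check in Python:

    hm, bm = 472.280, 434.155          # average bitrates in kbps (Movie class, QP 30)
    print("%.2f%% saving" % (100.0 * (hm - bm) / hm))   # -> 8.07% saving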

Figure 2. Average MOS scores derived from DSCQS tests (vertical axis) for all sequences in the classes Promo and Movie.


This shows the potential of backward masking for perceptually lossless optimization of HEVC/H.265 coding. Almost the same savings are achieved by introducing quantization ramps for other profiles of HM, such as "random access", but without visually lossless confirmation. The savings depend significantly on the number of scene cuts in the sequences. The video sequences used in the experiments have on average 3 cuts, which corresponds to an average shot length of about 3.5 seconds. This is very close to the average shot length in modern movies [15]. Moreover, the results are very close to those reported previously for AVC/H.264 sequences by the same authors [9].

VI. CONCLUSIONS

We presented a model of backward temporal visual masking and implemented it for video coding optimization. A set of subjective tests produced results that validate the assumption of visually lossless optimization. Although the model is implemented here in the HEVC/H.265 reference encoder, it should be straightforward to apply quantization ramps in other modern video encoders. We expect the bitrate savings to hold when the proposed approach is applied to different encoders, since the model only assumes frame-level coding and a quantization increase based on the quantization parameter alone. The saved bits can be redistributed to other parts of the video sequence if the bitrate constraints are not critical. There are aspects of the model that can be further improved by using empirical data from psychophysical experiments. Currently we use conservative thresholds in the model, obtained from previous experiments with strict limits. However, the number of frames in the ramp can be increased for sequences that have higher energy differences between frames at the scene boundaries. Moreover, backward masking can be applied to parts of frames that change significantly even within a scene. The thresholds for smaller regions of frames can be established with more relaxed conditions than for whole frames at the scene boundaries, because visual acuity decreases for smaller areas of the visual field.

Other aspects of temporal masking and related psychovisual phenomena at the higher (attentional) levels should be further explored. Exploiting correlations between the psychovisual domain and the video coding domain has brought significant improvements to modern algorithms. Many unexplored aspects of the human visual system leave room for future interdisciplinary research that can introduce further improvements.

REFERENCES

[1] C.S. Sherrington, "On the reciprocal action in the retina as studied by means of some rotating discs," J. Physiology 21, pp. 33–54, 1897.

[2] W. McDougall, "The sensations excited by a single momentary stimulation of the eye," Brit J. Psychol 1, pp. 78–113, 1904.

[3] B. Girod, "The information theoretical significance of spatial and temporal masking in video signals," Proc. SPIE Int. Conf. Human Vision, Visual Processing, and Digital Display, vol. 1077, pp. 178-187, 1989.

[4] B.G. Breitmeyer and H. Ogmen, "Recent models and findings in visual backward masking: A comparison, review, and update," Percept Psychophys 62, pp. 1572–1595, 2000.

[5] Q. Hu, S.A. Klein and T. Carney, "Masking of high-spatial-frequency information after a scene cut," Society for Information Display 93 Digest, no. 24, 1993, pp. 521-523.

[6] W.J. Tam, L.B. Stelmach, L. Wang, D. Lauzon and P. Gray, “Visual masking at video scene cuts,” Proc. SPIE Human Vision, Visual Processing and Digital Display, vol. 2411, 1995, pp. 111–119.

[7] R.R. Pastrana-Vidal, J.-C. Gicquel, C. Colomes and H. Cherifi, “Temporal Masking Effect on Dropped Frames at Video Scene Cuts,” Proc. SPIE Human Vision and Electronic Imaging IX, vol. 5292, 2004, pp. 194-201.

[8] Q. Huynh-Thu and M. Ghanbari, "Asymmetrical temporal masking near video scene change," in Proc. 15th IEEE International Conference on Image Processing (ICIP), pp. 2568-2571, 2008.

[9] V. Adzic, H. Kalva and B. Furht, "Exploring visual temporal masking for video compression," In IEEE International Conference on Consumer Electronics (ICCE), pp. 590-591, 2013.

[10] T.J. Spencer and R. Shuntich, "Evidence for an interruption theory of backward masking," Journal of Experimental Psychology 85, no. 2, pp. 198-203, 1970.

[11] J. W. Rieger, C. Braun, H. H. Bülthoff and K. R. Gegenfurtner. "The dynamics of visual pattern masking in natural scene processing: A magnetoencephalography study," Journal of Vision 5, no. 3, pp. 275-286, 2005.

[12] V. Adzic, "What you see is what you should get," In Proceedings of the 20th ACM international conference on Multimedia, pp. 1441-1444, 2012.

[13] G.J. Sullivan, J. Ohm, W.J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol.22, no.12, pp.1649-1668, 2012.

[14] Recommendation ITU-R BT.500-13, “Methodology for the subjective assessment of the quality of television pictures,” 01/2012.

[15] D. Bordwell, The Way Hollywood Tells It: Story and Style in Modern Movies, Univ. of California Press, 2006.

Figure 3. Examples of original sequence (HM) frames and masked (BM) frames with significant impairment that were not consciously perceived due to masking (as confirmed by subjective scores). Frames are presented in pairs: HM (left), BM (right).
