MOBILE3DTV project has received funding from the European Community’s ICT programme
in the context of the Seventh Framework Programme (FP7/2007-2011) under grant
agreement n° 216503. This document reflects only the authors’ views and the Community
or other project partners are not liable for any use that may be made of the information
contained therein.
MOBILE3DTV
Project No. 216503
Adaptation and optimization of coding algorithms for mobile 3DTV
Philipp Merkle, Heribert Brust, Kristina Dix, Yongzhe Wang, Aljoscha Smolic
Abstract: As a starting point and reference for further research, this deliverable specifies and evaluates available coding standards for 3D video under the specific conditions of the Mobile3DTV project. The following stereo video formats and codecs are specified and evaluated:
• H.264/AVC simulcast,
• H.264 Stereo SEI message,
• H.264/MVC,
• MPEG-C Part 3 using H.264 for both video and depth,
• H.264 auxiliary picture syntax for video plus depth.
Significant coding gains can be achieved with hierarchical B pictures for temporal prediction, but the gain differs widely between individual sequences. On the other hand, hierarchical B pictures also increase complexity and memory requirements. Inter-view prediction leads to a significant reduction of bitrate for some sequences, whereas in other cases the gain is negligible. We achieved up to 35% bitrate savings from inter-view prediction compared to stereo simulcast. Inter-view prediction, whether as H.264 SEI or MVC, does not add substantial complexity. A representation as video plus depth is an interesting alternative for 3D video. It allows the stereo rendering to be adjusted at the decoder and the 3D impression to be optimally adapted to any given display. However, this extended functionality comes at the cost of increased complexity. MPEG-C Part 3 is suitable for encoding video plus depth data. Good depth quality is essential for good overall quality.
Keywords: 3D video coding, MVC, H.264/AVC, MPEG-C Part 3
MOBILE3DTV D2.2
Executive Summary
The ultimate goal of research in WP2 is to develop the best possible 3D video representation and coding for the specific application of transmission over DVB-H and mobile terminals. To this end, different alternative approaches will be developed, implemented, optimized and compared. This will include approaches beyond the current state of the art as defined in available standards, such as mixed resolution stereo video coding.
As a starting point and reference for further research, this deliverable specifies and evaluates available coding standards for 3D video under the specific conditions of the Mobile3DTV project. Specifically, the optimizations target the demonstrator device to be used in the project, i.e. how to use the different 3D video codecs in this specific context.
The following 3D video formats and codecs are specified and evaluated:
H.264/AVC simulcast,
H.264 Stereo SEI message,
H.264/MVC,
MPEG-C Part 3 using H.264 for both video and depth,
H.264 auxiliary picture syntax for video plus depth.
Evaluation is done by simulations. The same test data are used in all experiments: professional stereo sequences in 16:9 format (matching the display of the demonstrator), kindly provided by KUK Filmproduktion. In total, the 4 sequences Horse, Car, Hands, and Snail, spanning a range of different types of content and complexity, are used, formatted to 480x270. The material is coded at different bitrates using optimum settings for each of the codecs. Quality is evaluated by means of PSNR over bitrate and informal subjective expert viewing.
As for any type of video coding, the same amount of raw input data leads to very different RD-performance. The required bitrate for achieving acceptable quality strongly depends on the properties of the sequence content, especially temporal variation and complexity of the scene. The coding gain from inter-view prediction (Stereo SEI & MVC) varies largely.
Significant coding gains can be achieved with hierarchical B pictures for temporal prediction. The gain from using hierarchical B pictures differs largely for individual sequences, depending on factors like scene content complexity and temporal variation. On the other hand hierarchical B pictures also mean increased complexity and memory requirements. It remains to be studied how far this can be implemented on a mobile terminal.
Inter-view prediction leads to a significant reduction of bitrate for some sequences, whereas in other cases the gain is negligible. In our experiments we achieved up to 35% bitrate savings from inter-view prediction compared to stereo simulcast. Inter-view prediction, whether performed as H.264 SEI or MVC, does not add substantial complexity. However, a standard-conforming implementation of MVC requires that the decoder supports the H.264 High Profile, since MVC extends it. A standard-conforming implementation of the Stereo SEI Message requires that the decoder supports the H.264 interlaced coding tools.
A representation as video plus depth is an interesting alternative for 3D video. It allows the stereo rendering to be adjusted at the decoder and the 3D impression to be optimally adapted to any given display. However, this extended functionality comes at the cost of increased complexity, since one output view has to be rendered at the terminal device. Depth estimation is necessary on the sender side, which is an inherently error-prone task. Nevertheless, it has been shown that good general quality is achievable with the video plus depth approach. Please refer to D2.3 “Report on generation of video plus depth data base”.
MPEG-C Part 3 is suitable for encoding video plus depth data. Different bitrate ratios (and thereby qualities) for video and depth can be adjusted. It was found that at a given bitrate the QP for depth should be chosen lower than the QP for color to achieve the best overall results, e.g. C30, D24 (QP 30 for color, QP 24 for depth). It can be concluded that good depth quality is essential for good overall quality. The numerical bitrate ratio between color and depth may vary widely depending on the sequence. We have found ratios between 1:1 and 6:1. In most cases a substantial portion of the bitrate has to be spent on depth to achieve the best overall quality.
A detailed comparison of “video plus video” approaches (simulcast, H.264 SEI, MVC) and video
plus depth (MPEG-C Part 3) is still to be done. This will be done in close collaboration with
WP4, which will perform formal subjective tests about these issues. These experiments will also
include mixed resolution stereo video coding, as an extension of available 3D video formats.
Table of Contents
1. Introduction
H.264 Simulcast
1.1. Specification
1.2. Simulation
2. H.264 Stereo SEI Message
2.1. Specification
2.2. Simulation
3. Multiview Video Coding (MVC)
3.1. Specification
3.2. Simulation
4. MPEG-C Part 3
4.1. Specification
Syntax
Supplemental information message syntax
Supplemental information payload syntax
4.2. Simulation
5. H.264 Auxiliary Picture Syntax for video plus depth
5.1. Specification
5.2. Simulation
6. Comparative Analysis
7. Conclusions
1. Introduction
The ultimate goal of research in WP2 is to develop the best possible 3D video representation and coding for the specific application of transmission over DVB-H and mobile terminals. To this end, different alternative approaches will be developed, implemented, optimized and compared. This will include approaches beyond the current state of the art as defined in available standards, such as mixed resolution stereo video coding.
As a starting point and reference for further research, this deliverable specifies and evaluates available coding standards for stereo video under the specific conditions of the Mobile3DTV project. Specifically, the optimizations target the demonstrator device to be used in the project, i.e. how to use the different stereo video codecs in this specific context.
The following stereo video formats and codecs are specified and evaluated:
H.264/AVC simulcast,
H.264 Stereo SEI message,
H.264/MVC,
MPEG-C Part 3 using H.264 for both video and depth,
H.264 auxiliary picture syntax for video plus depth.
Evaluation is done by simulations. The same test data are used in all experiments: professional stereo sequences in 16:9 format (matching the display of the demonstrator), kindly provided by KUK Filmproduktion. In total, the 4 sequences Horse, Car, Hands, and Snail, spanning a range of different types of content and complexity, are used, formatted to 480x270. The material is coded at different bitrates using optimum settings for each of the codecs. Quality is evaluated by means of PSNR over bitrate and informal subjective expert viewing.
Finally, an initial comparison is done from all simulation results of different formats and codecs.
H.264 Simulcast
Figure 1: Schematic block diagram for H.264 Simulcast coding with stereo video format data
As depicted in the overview diagram, H.264 Simulcast uses the stereo video format, consisting of the two input video sequences Video 1 and Video 2 for the left and right view of the stereo pair. The codec used for H.264 Simulcast is H.264/AVC, which is applied to each of the two input sequences independently, resulting in two encoded bit- or transport-streams BS/TS. After transmission over the Channel the two streams are decoded independently, resulting in the distorted sequences of the stereo pair Video 1 and Video 2.
1.1. Specification
According to the H.264/MPEG-4 AVC standard [1], “H.264 Simulcast” is specified as the individual application of an H.264/AVC conforming coder to several video sequences in a generic way.
[1] ITU-T Recommendation H.264, “Advanced video coding for generic audiovisual services”, November 2007.
H.264/MPEG4-AVC is the latest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). H.264/MPEG4-AVC has recently become the most widely accepted video coding standard and covers all common video applications ranging from mobile services and videoconferencing to IPTV, HDTV, and HD video storage.
1.2. Simulation
Test data
The simulations for H.264 Simulcast have been carried out with the following test data sets:
Producer KUK
Sequences Car Hands Horse Snail
Length [frames] 235 251 140 189
Framerate [frames/second] 30
Resolution [pixel] 480 x 270
Data Format VL + VR
Setup
The simulations for H.264 Simulcast have been carried out with the following coding settings:
Coder Implementation JM 14.2
Standard H.264/AVC
Quantization Parameter 24, 30, 36, 42
GOP Size 16 (hierarchical B pictures), 1 (no B pictures)
Intra Period 16
Search Range 32
Symbol Mode CABAC
Besides these settings typical configurations for H.264/AVC have been used.
Results
For H.264 Simulcast simulations the left and right view are encoded, transmitted and decoded independently. We achieved the following results:
Car GOP 1 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 951,66 41,03 1079,74 40,75 400,83 38,48 439,78 38,08
30 335,50 36,99 372,00 36,65 164,19 35,12 178,66 34,75
36 124,16 33,74 134,21 33,41 68,06 32,37 73,20 32,00
42 52,32 30,98 55,84 30,67 27,76 29,93 29,54 29,59
Hands GOP 1 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 2873,82 41,78 2498,30 42,19 1565,86 37,40 1350,52 37,94
30 1586,37 36,94 1368,25 37,49 664,89 32,05 580,85 32,65
36 733,07 32,33 641,97 32,97 235,62 28,14 215,00 28,71
42 255,48 28,12 236,52 28,73 78,28 25,34 74,31 25,82
Horse GOP 1 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 1677,38 37,87 1683,29 37,75 736,45 36,61 749,10 36,46
30 564,73 32,75 566,33 32,61 367,76 32,65 374,37 32,55
36 196,96 28,81 192,89 28,69 151,08 28,73 151,67 28,61
42 70,58 26,13 68,11 26,10 52,28 25,92 50,55 25,88
Snail GOP 1 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 230,04 45,06 221,54 45,11 142,33 44,87 140,22 44,82
30 103,67 40,93 101,08 41,05 79,58 40,97 77,51 41,02
36 51,94 37,16 50,75 37,33 42,24 37,11 41,23 37,22
42 28,74 33,13 27,77 33,34 23,09 33,07 22,18 33,24
Table 1: RD results for H.264 Simulcast coding simulations with stereo video format data
Figure 2: RD-comparison for H.264 Simulcast coding simulations: total bitrate for both views vs. average PSNR relative to the original sequences
These simulation results clearly indicate that the overall RD-performance for temporal prediction with hierarchical B pictures (GOP 16) is better than for simulcast coding without hierarchical B pictures (GOP 1). The gains achievable for the individual sequences differ widely, depending on factors such as scene content complexity and temporal variation. At the same quality, between almost zero and up to 50% of the bitrate can be saved with hierarchical B pictures (GOP 16). In addition, the RD-performance of the left and right view is approximately the same when they are compressed under equal conditions.
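A bitrate saving "for the same quality" can be estimated from the tabulated RD points by interpolation. The following is a minimal sketch, not the evaluation procedure used for this deliverable: it takes the Car values from Table 1 (decimal commas written as points), forms the total bitrate of both views and the average PSNR as in Figure 2, and interpolates log(rate) linearly over PSNR. The function names and the choice of piecewise-linear interpolation are assumptions made for illustration.

```python
import math

# Rows per QP (24, 30, 36, 42): (rate VL, PSNR VL, rate VR, PSNR VR),
# copied from Table 1, sequence Car. Rates in kbps, PSNR in dB.
CAR_GOP1 = [
    (951.66, 41.03, 1079.74, 40.75),
    (335.50, 36.99, 372.00, 36.65),
    (124.16, 33.74, 134.21, 33.41),
    (52.32, 30.98, 55.84, 30.67),
]
CAR_GOP16 = [
    (400.83, 38.48, 439.78, 38.08),
    (164.19, 35.12, 178.66, 34.75),
    (68.06, 32.37, 73.20, 32.00),
    (27.76, 29.93, 29.54, 29.59),
]


def rd_points(rows):
    """Return (total rate of both views, mean PSNR) pairs, sorted by PSNR."""
    points = [(rl + rr, (pl + pr) / 2.0) for rl, pl, rr, pr in rows]
    return sorted(points, key=lambda point: point[1])


def rate_at_psnr(points, psnr):
    """Interpolate log(rate) linearly in PSNR between neighbouring RD points."""
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        if p0 <= psnr <= p1:
            t = (psnr - p0) / (p1 - p0)
            return math.exp(math.log(r0) + t * (math.log(r1) - math.log(r0)))
    raise ValueError("PSNR lies outside the measured range")


def bitrate_saving(reference_rows, test_rows, psnr):
    """Fraction of bitrate saved by the test configuration at equal PSNR."""
    reference = rate_at_psnr(rd_points(reference_rows), psnr)
    test = rate_at_psnr(rd_points(test_rows), psnr)
    return 1.0 - test / reference


if __name__ == "__main__":
    saving = bitrate_saving(CAR_GOP1, CAR_GOP16, 35.0)
    print(f"Car, 35 dB: {100.0 * saving:.1f}% saved with GOP 16")
```

The comparison is only meaningful inside the PSNR range covered by both curves, which is why the sketch raises an error instead of extrapolating.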
Informal subjective expert viewing has been carried out for the H.264 Simulcast simulation results on a stereoscopic display. It confirmed the objective RD results: for the same bitrate, a higher quality is achieved by using hierarchical B pictures (GOP 16) for temporal prediction or, conversely, a lower bitrate is necessary to achieve the same subjective quality. Without hierarchical B pictures (GOP 1) the sequences are temporally more unsteady. For medium bitrates a tolerable to mostly acceptable subjective quality was observed.
2. H.264 Stereo SEI Message
Figure 3: Schematic block diagram for H.264 Stereo SEI Message coding with stereo video format data
As depicted in the overview diagram, H.264 Stereo SEI Message uses the stereo video format, consisting of the two input video sequences Video 1 and Video 2 for the left and right view of the stereo pair. For H.264 Stereo SEI Message compression these two sequences are interlaced line-by-line into one sequence, where the top field contains Video 1 and the bottom field Video 2. The codec used for H.264 Stereo SEI Message is H.264/AVC, which is applied to the interlaced sequence, resulting in one encoded bit- or transport-stream BS/TS. After transmission over the Channel this stream is decoded, resulting in the distorted interlaced sequence. For output this sequence is deinterlaced to the stereo pair Video 1 and Video 2.
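The line-by-line interlacing and the deinterlacing at the output can be sketched as follows. This is an illustrative sketch only: frames are modelled as plain lists of rows, the function names are hypothetical, and the top field is taken to be the even-numbered rows (counting from 0) of the double-height frame.

```python
def interleave_views(video1_frame, video2_frame):
    """Build one double-height frame from two views.

    Rows of Video 1 become the top field (rows 0, 2, 4, ...),
    rows of Video 2 become the bottom field (rows 1, 3, 5, ...).
    """
    if len(video1_frame) != len(video2_frame):
        raise ValueError("both views must have the same number of rows")
    interlaced = []
    for top_row, bottom_row in zip(video1_frame, video2_frame):
        interlaced.append(top_row)
        interlaced.append(bottom_row)
    return interlaced


def deinterleave_views(interlaced_frame):
    """Split a double-height frame back into (Video 1, Video 2)."""
    return interlaced_frame[0::2], interlaced_frame[1::2]


if __name__ == "__main__":
    # Two dummy 480x270 luma frames, filled with constant sample values.
    left = [[16] * 480 for _ in range(270)]
    right = [[235] * 480 for _ in range(270)]
    frame = interleave_views(left, right)
    print(len(frame), "rows of", len(frame[0]), "samples")
```

With 480 x 270 input views this yields the 480 x 540 frames that are fed to the field-coding encoder.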
2.1. Specification
According to the H.264/AVC standard [2], the “H.264 Stereo SEI Message” is specified as follows:
Supplemental Enhancement Information
SEI (supplemental enhancement information) messages assist in processes related to decoding, display or other purposes. However, SEI messages are not required for constructing the luma or chroma samples by the decoding process. Conforming decoders are not required to process this information for output order conformance to H.264/AVC.
SEI payload syntax
sei_payload( payloadType, payloadSize ) { C Descriptor
if( payloadType == 0 )
buffering_period( payloadSize ) 5
else if( payloadType == 1 )
pic_timing( payloadSize ) 5
…
else if( payloadType == 21 )
stereo_video_info( payloadSize ) 5
…
}
Stereo video information SEI message syntax
stereo_video_info( payloadSize ) { C Descriptor
field_views_flag 5 u(1)
if( field_views_flag )
top_field_is_left_view_flag 5 u(1)
else {
current_frame_is_left_view_flag 5 u(1)
next_frame_is_second_view_flag 5 u(1)
}
left_view_self_contained_flag 5 u(1)
right_view_self_contained_flag 5 u(1)
}
[2] ITU-T Recommendation H.264, “Advanced video coding for generic audiovisual services”, November 2007.
Stereo video information SEI message semantics
This SEI message provides the decoder with an indication that the entire coded video sequence consists of pairs of pictures forming stereo-view content.
The stereo video information SEI message shall not be present in any access unit of a coded video sequence unless a stereo video information SEI message is present in the first access unit of the coded video sequence.
field_views_flag equal to 1 indicates that all pictures in the current coded video sequence are fields and all fields of a particular parity are considered a left view and all fields of the opposite parity are considered a right view for stereo-view content. field_views_flag equal to 0 indicates that all pictures in the current coded video sequence are frames and alternating frames in output order represent a view of a stereo view. The value of field_views_flag shall be the same in all stereo video information SEI messages within a coded video sequence.
When the stereo video information SEI message is present and field_views_flag is equal to 1, the left view and right view of a stereo video pair shall be coded as a complementary field pair, the display time of the first field of the field pair in output order should be delayed to coincide with the display time of the second field of the field pair in output order, and the spatial locations of the samples in each individual field should be interpreted for display purposes as representing complete pictures as shown in Figure 4 (top) rather than as spatially-distinct fields within a frame as shown in Figure 4 (bottom).
Figure 4: Nominal vertical and horizontal sampling locations of 4:2:0 samples in a frame (top) and in top and bottom fields (bottom)
top_field_is_left_view_flag equal to 1 indicates that the top fields in the coded video sequence represent a left view and the bottom fields in the coded video sequence represent a right view. top_field_is_left_view_flag equal to 0 indicates that the bottom fields in the coded video sequence represent a left view and the top fields in the coded video sequence represent a right view. When present, the value of top_field_is_left_view_flag shall be the same in all stereo video information SEI messages within a coded video sequence.
current_frame_is_left_view_flag equal to 1 indicates that the current picture is the left view of a stereo-view pair. current_frame_is_left_view_flag equal to 0 indicates that the current picture is the right view of a stereo-view pair.
next_frame_is_second_view_flag equal to 1 indicates that the current picture and the next picture in output order form a stereo-view pair, and the display time of the current picture should be delayed to coincide with the display time of the next picture in output order. next_frame_is_second_view_flag equal to 0 indicates that the current picture and the previous picture in output order form a stereo-view pair, and the display time of the current picture should not be delayed for purposes of stereo-view pairing.
left_view_self_contained_flag equal to 1 indicates that no inter prediction operations within the decoding process for the left-view pictures of the coded video sequence refer to reference pictures that are right-view pictures. left_view_self_contained_flag equal to 0 indicates that some inter prediction operations within the decoding process for the left-view pictures of the coded video sequence may or may not refer to reference pictures that are right-view pictures. Within a coded video sequence, the value of left_view_self_contained_flag in all stereo video information SEI messages shall be the same.
right_view_self_contained_flag equal to 1 indicates that no inter prediction operations within the decoding process for the right-view pictures of the coded video sequence refer to reference pictures that are left-view pictures. right_view_self_contained_flag equal to 0 indicates that some inter prediction operations within the decoding process for the right-view pictures of the coded video sequence may or may not refer to reference pictures that are left-view pictures. Within a coded video sequence, the value of right_view_self_contained_flag in all stereo video information SEI messages shall be the same.
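The syntax and semantics above can be illustrated with a small parser. The sketch below is an assumption-laden example rather than decoder code: it expects the raw payload bytes of a stereo_video_info message (payloadType 21), already extracted from the SEI NAL unit with emulation prevention bytes removed, and reads each u(1) element most significant bit first. The function name and the dictionary output are hypothetical.

```python
def parse_stereo_video_info(payload: bytes) -> dict:
    """Read the u(1) flags of stereo_video_info in syntax order."""
    # Bit iterator over the payload, most significant bit of each byte first.
    bits = ((byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1))

    def u1() -> int:
        try:
            return next(bits)
        except StopIteration:
            raise ValueError("payload too short for stereo_video_info") from None

    info = {"field_views_flag": u1()}
    if info["field_views_flag"]:
        # Field-based stereo: one flag tells which parity is the left view.
        info["top_field_is_left_view_flag"] = u1()
    else:
        # Frame-based stereo: alternating frames carry the two views.
        info["current_frame_is_left_view_flag"] = u1()
        info["next_frame_is_second_view_flag"] = u1()
    info["left_view_self_contained_flag"] = u1()
    info["right_view_self_contained_flag"] = u1()
    return info


if __name__ == "__main__":
    # 0xE0 = 1110 0000: field views, top field is left view,
    # left view self-contained, right view not self-contained.
    print(parse_stereo_video_info(bytes([0xE0])))
```

The example payload corresponds to the field-based case with Video 1 in the top field, where the right view may reference the left view but not vice versa.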
2.2. Simulation
Test data
The simulations for H.264 Stereo SEI Message have been carried out with the following test data sets:
Producer KUK
Sequences Car Hands Horse Snail
Length [frames] 235 251 140 189
Framerate [frames/second] 30
Resolution [pixel] 480 x 540
Data Format VLVR Interlaced
Setup
The simulations for H.264 Stereo SEI Message have been carried out with the following coding settings:
Coder Implementation JM 14.2
Standard H.264/AVC
Field Coding Enabled
Quantization Parameter 24, 30, 36, 42
GOP Size 16 (hierarchical B pictures), 1 (no B pictures)
Intra Period 16
Search Range 32
Symbol Mode CABAC
Besides these settings typical configurations for H.264/AVC have been used.
Results
For H.264 Stereo SEI Message simulations the left and right view sequences are interlaced into one sequence of double height and encoded, transmitted and decoded as one stream consisting of top and bottom field. We achieved the following results:
Car GOP 1 GOP 16
VLVR (all) VLVR (intra bottom) VLVR (all) VLVR (intra bottom)
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 1675,31 40,87 1801,99 40,92 648,85 38,21 775,48 38,28
30 565,04 36,82 631,96 36,90 252,82 34,85 321,03 34,93
36 213,04 33,54 243,39 33,69 106,76 32,16 135,87 32,19
42 91,56 30,83 105,18 30,94 45,13 29,79 56,41 29,75
Hands GOP 1 GOP 16
VLVR (all) VLVR (intra bottom) VLVR (all) VLVR (intra bottom)
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 5222,53 41,93 5297,24 41,98 2812,86 37,64 2864,36 37,66
30 2819,99 37,15 2872,02 37,22 1161,12 32,29 1197,42 32,31
36 1271,41 32,57 1305,07 32,65 401,86 28,35 425,28 28,38
42 440,65 28,32 461,94 28,41 132,58 25,56 147,53 25,59
Horse GOP 1 GOP 16
VLVR (all) VLVR (intra bottom) VLVR (all) VLVR (intra bottom)
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 2931,68 37,76 3374,14 37,85 1046,62 36,23 1473,03 36,55
30 865,59 32,49 1109,81 32,68 469,69 32,30 729,65 32,57
36 276,03 28,54 372,31 28,74 193,24 28,56 296,06 28,61
42 103,82 26,03 135,12 26,15 69,66 25,92 101,20 25,90
Snail GOP 1 GOP 16
VLVR (all) VLVR (intra bottom) VLVR (all) VLVR (intra bottom)
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 375,32 44,90 460,23 45,08 199,35 44,58 283,55 44,85
30 156,38 40,63 205,65 40,98 107,41 40,82 157,13 40,98
36 77,21 36,90 103,33 37,28 57,57 37,10 83,71 37,14
42 43,96 33,03 57,60 33,26 32,48 33,18 45,56 33,10
Table 2: RD results for H.264 Stereo SEI Message coding simulations with stereo video format data
Figure 5: RD-comparison for H.264 Stereo SEI Message coding simulations: total bitrate for both views vs. average PSNR relative to the original sequences
These simulation results clearly indicate that the overall RD-performance for temporal prediction with hierarchical B pictures (GOP 16) is better than for H.264 Stereo SEI Message coding without hierarchical B pictures (GOP 1). The gains achievable for the individual sequences differ widely, depending on factors such as scene content complexity and temporal variation. At the same quality, between almost zero and up to 60% of the bitrate can be saved with hierarchical B pictures (GOP 16). In addition, the RD-performance of the left and right view is the same, since they are compressed jointly. Moreover, the effect of intra coded pictures in the bottom field has been tested: inter prediction for the bottom field is disabled whenever the top field is intra coded. This results in a noticeable decline of the RD-performance.
Informal subjective expert viewing has been carried out for the H.264 Stereo SEI Message simulation results on a stereoscopic display. It confirmed the objective RD results: for the same bitrate, a higher quality is achieved by using hierarchical B pictures (GOP 16) for temporal prediction or, conversely, a lower bitrate is necessary to achieve the same subjective quality. Without hierarchical B pictures (GOP 1) the sequences are temporally more unsteady. For medium bitrates a mostly acceptable subjective quality was observed.
3. Multiview Video Coding (MVC)
Figure 6: Schematic block diagram for H.264 Multiview Video Coding with stereo video format data
As depicted in the overview diagram, H.264 MVC uses the stereo video format, consisting of the two input video sequences Video 1 and Video 2 for the left and right view of the stereo pair. The codec used for H.264 Multiview Video Coding is H.264/MVC, which is applied to both sequences simultaneously for inter-view predictive coding, resulting in two dependent encoded bit-streams BS that may contain the camera parameters as auxiliary information. For transmission these two bit-streams are interleaved frame-by-frame in the multiplexer MUX, resulting in one MVC transport-stream TS. After transmission over the Channel this stream is decoded (and thereby demultiplexed), resulting in the distorted sequences of the stereo pair Video 1 and Video 2.
3.1. Specification
According to the H.264/MVC standard [3], “Multiview Video Coding” is specified as an extension to the family of H.264 standards. MVC extends the single-view concepts of H.264/AVC so that a current picture in the coding process can have temporal as well as inter-view reference pictures for motion-compensated prediction. It also includes a number of new techniques for improved coding efficiency, reduced decoding complexity, and new functionalities for multiview operations. MVC takes advantage of some of the interfaces and transport mechanisms introduced for the scalable video coding (SVC) extension of H.264/AVC. New requirements for 3D video related to interface, transport of the MVC bitstreams, and MVC decoder resource management have led to new features adopted for MVC, including marking of reference pictures, support for efficient view switching, structuring of the bitstream, and signaling of view scalability supplemental enhancement information (SEI) and parallel decoding SEI.
Figure 7: MVC coding scheme with stereo video format data: inter-view prediction (red arrows) combined with hierarchical B pictures for temporal prediction (black arrows)
The figure above shows how H.264/MVC is applied to stereo video format data. Hierarchical B pictures for temporal prediction are used in combination with additional inter-view reference pictures for the second of the two stereo views.
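The combination of temporal and inter-view references can be made concrete with a short sketch. It is a simplified model under stated assumptions, not the JMVM configuration: a dyadic hierarchy with GOP size 16 is assumed, key pictures (every 16th picture, matching the intra period of 16) get no temporal reference, and every picture of the right view additionally references the left-view picture of the same time instant. The function name and the (view, picture index) notation are hypothetical.

```python
def reference_pictures(view, poc, gop_size=16):
    """List the (view, picture index) references of one picture.

    view is "left" (base view) or "right" (second view); poc is the
    picture index in display order. gop_size must be a power of two.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    if view not in ("left", "right"):
        raise ValueError("view must be 'left' or 'right'")

    references = []
    position = poc % gop_size
    if position != 0:
        # Hierarchical B picture: the distance to its two temporal
        # references is the largest power of two dividing the position.
        step = position & -position
        references.append((view, poc - step))
        references.append((view, poc + step))
    if view == "right":
        # Inter-view reference: same time instant of the base view.
        references.append(("left", poc))
    return references


if __name__ == "__main__":
    for index in (0, 8, 4, 12, 1):
        print(index, reference_pictures("right", index))
```

The sketch ignores the end of a sequence, where the future temporal reference may not exist and an encoder has to fall back to a shorter structure.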
3.2. Simulation
Test data
The simulations for H.264 MVC have been carried out with the following test data sets:
Producer KUK
Sequences Car Hands Horse Snail
Length [frames] 235 251 140 189
Framerate [frames/second] 30
Resolution [pixel] 480 x 272
Data Format VL + VR
[3] ISO/IEC JTC1/SC29/WG11, “Text of ISO/IEC 14496-10:200X/FDAM 1 Multiview Video Coding”, Doc. N9978, Hannover, Germany, July 2008.
Setup
The simulations for H.264 MVC have been carried out with the following coding settings:
Coder Implementation JMVM 7.0
Standard H.264/MVC
Inter-view Prediction enabled (IP prediction structure)
Quantization Parameter 24, 30, 36, 42
GOP Size 16 (hierarchical B pictures), 2 (only one consecutive B picture)
Intra Period 16
Search Range 96
Symbol Mode CABAC
Besides these settings, typical configurations for H.264/MVC have been used.
Results
For H.264 MVC simulations the left and right view are encoded, transmitted and decoded dependently. We achieved the following results:
Car GOP 2 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 805,69 40,64 624,56 40,32 390,02 38,37 263,80 37,84
30 309,44 36,88 223,35 36,51 171,92 35,46 103,39 34,85
36 113,41 33,59 83,42 33,16 72,84 32,65 44,02 32,07
42 45,15 30,83 35,99 30,40 29,17 30,06 18,41 29,48
Hands GOP 2 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 2320,38 39,75 1860,98 40,26 1446,13 36,56 1118,55 37,21
30 1235,47 35,28 953,30 35,88 678,19 32,32 512,49 33,02
36 564,13 31,19 416,30 31,77 263,10 28,59 198,86 29,22
42 210,22 27,59 149,28 28,04 87,43 25,57 64,83 25,98
Horse GOP 2 GOP 16
VL VR VL VR
QP Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB] Rate [kbps] PSNR [dB]
24 1283,52 37,13 1014,70 37,14 661,67 36,16 394,98 35,95
30 482,43 32,47 309,21 32,38 329,65 32,38 167,73 32,18
36 176,41 28,69 89,98 28,33 133,39 28,55 54,54 28,05
42 62,43 25,98 32,18 25,71 48,04 25,78 19,51 25,36
Snail GOP 1, VL         GOP 1, VR         GOP 16, VL        GOP 16, VR
QP    Rate     PSNR     Rate     PSNR     Rate     PSNR     Rate     PSNR
      [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]
24    206,88   44,92    151,19   45,01    136,44   44,49    81,83    44,46
30    94,22    40,75    59,22    40,79    77,01    40,70    41,25    40,62
36    46,66    36,97    26,98    36,77    41,02    36,77    19,90    36,47
42    27,05    33,22    15,49    33,11    22,85    32,93    10,32    32,74
Table 3: RD results for H.264 MVC coding simulations with stereo video format data
Figure 8: RD-comparison for H.264 MVC simulations: total bitrate for both views vs. average PSNR relative to the original sequences
These simulation results clearly indicate that the overall RD-performance for temporal prediction using hierarchical B pictures (GOP 16) is better than for H.264 MVC coding with an IBPBP… prediction structure (GOP 2). The gains that can be achieved for the individual sequences differ largely, depending on factors like scene content complexity and temporal variation. For the same quality, between almost zero and up to 50% of the bitrate can be saved with hierarchical B pictures (GOP 16). Additional simulations with equal coding conditions, except that inter-view prediction is disabled, identify the gain that is achieved by MVC compared to simulcast coding. The results show that inter-view prediction leads to a bitrate reduction for the right view. The gains that can be achieved for the individual sequences differ largely, so that for the same quality between almost zero and up to 25% of the total bitrate can be saved with MVC.
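A bitrate saving "for the same quality" can be estimated from measured RD points by interpolation. The sketch below is only an illustration and not necessarily the procedure used for the figures above: it assumes piecewise-linear interpolation of log10(rate) over PSNR, uses the Car rows of Table 3 (total rate of both views, mean PSNR of VL and VR), and all function and variable names are hypothetical.

```python
import math


def rd_points(view_a, view_b):
    """Combine per-view (rate, PSNR) pairs into (total rate, average PSNR)."""
    return [(a[0] + b[0], (a[1] + b[1]) / 2) for a, b in zip(view_a, view_b)]


def rate_at_psnr(points, psnr):
    """Interpolate log10(rate) linearly over PSNR between neighbouring RD points."""
    pts = sorted(points, key=lambda p: p[1])
    for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
        if q0 <= psnr <= q1:
            t = (psnr - q0) / (q1 - q0)
            return 10 ** (math.log10(r0) + t * (math.log10(r1) - math.log10(r0)))
    raise ValueError("PSNR outside the measured range")


def bitrate_saving(reference, test, psnr):
    """Fraction of the reference bitrate saved by `test` at equal PSNR."""
    return 1.0 - rate_at_psnr(test, psnr) / rate_at_psnr(reference, psnr)


# Car, Table 3, QP 24/30/36/42: first column group (non-hierarchical structure).
car_short_gop = rd_points(
    [(805.69, 40.64), (309.44, 36.88), (113.41, 33.59), (45.15, 30.83)],  # VL
    [(624.56, 40.32), (223.35, 36.51), (83.42, 33.16), (35.99, 30.40)],   # VR
)
# Car, Table 3: GOP 16 (hierarchical B pictures).
car_gop16 = rd_points(
    [(390.02, 38.37), (171.92, 35.46), (72.84, 32.65), (29.17, 30.06)],   # VL
    [(263.80, 37.84), (103.39, 34.85), (44.02, 32.07), (18.41, 29.48)],   # VR
)

saving = bitrate_saving(car_short_gop, car_gop16, psnr=35.0)
print(f"Car at 35 dB: {100 * saving:.1f}% lower total bitrate with GOP 16")
```

The comparison is only defined where both curves overlap in PSNR, so the target quality has to be chosen inside the range covered by both sets of points.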
Informal subjective expert viewing has been carried out for the MVC simulation results on a stereoscopic display. This led to the conclusion that the objective RD results are confirmed, as for the same bitrate a higher quality is achieved by using hierarchical B pictures (GOP 16) for temporal prediction, or in return a lower bitrate is necessary to achieve the same subjective quality. With an IBPBP… prediction structure (GOP 2) the sequences are temporally more unsteady. For medium bitrates a tolerable to mostly acceptable subjective quality was observed. An additional comparison between MVC and Simulcast results (using JMVM with and without inter-view prediction) confirms the objective RD results as well, leading to the conclusion that MVC requires a lower bitrate than simulcast coding to achieve the same objective and subjective quality.
4. MPEG-C Part 3
Figure 9: Schematic block diagram for MPEG-C Part 3 coding with video plus depth format data
As depicted in the overview diagram, MPEG-C Part 3 uses the video plus depth format, consisting of the input video sequence Video and the associated depth information Depth for one of the two views of a stereo pair. The codec used for MPEG-C Part 3 is H.264/AVC, which is applied to each of the two input sequences independently, resulting in two encoded bit-streams BS. For transmission these two bit-streams are interleaved frame-by-frame in the multiplexer MUX, resulting in one transport-stream TS that may contain additional depth map properties as auxiliary information. After transmission over the Channel the demultiplexer DEMUX separates this stream into the two individually coded streams. These two streams are decoded independently, resulting in the distorted Video sequence and the distorted Depth sequence for one of the two views of a stereo pair.
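The frame-by-frame interleaving in MUX and the separation in DEMUX can be pictured with the following Python sketch. It only models the ordering of coded frames; packetization and signalling of a real transport stream are left out, and the stream identifiers and function names are hypothetical.

```python
from typing import List, Tuple

VIDEO, DEPTH = 0, 1  # hypothetical stream identifiers


def mux(video_frames: List[bytes], depth_frames: List[bytes]) -> List[Tuple[int, bytes]]:
    """Interleave two coded streams frame by frame: V0, D0, V1, D1, ..."""
    if len(video_frames) != len(depth_frames):
        raise ValueError("video and depth must have the same number of frames")
    stream = []
    for v, d in zip(video_frames, depth_frames):
        stream.append((VIDEO, v))
        stream.append((DEPTH, d))
    return stream


def demux(stream: List[Tuple[int, bytes]]) -> Tuple[List[bytes], List[bytes]]:
    """Split the interleaved stream back into the two coded streams."""
    video = [payload for sid, payload in stream if sid == VIDEO]
    depth = [payload for sid, payload in stream if sid == DEPTH]
    return video, depth


coded_video = [b"v0", b"v1", b"v2"]
coded_depth = [b"d0", b"d1", b"d2"]
assert demux(mux(coded_video, coded_depth)) == (coded_video, coded_depth)
```

Because both streams are coded independently, the demultiplexed streams can be handed to two separate H.264/AVC decoder instances.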
4.1. Specification
ISO/IEC 23002-3:2007⁴ defines auxiliary video streams as data coded as video sequences and supplementing a primary video sequence. Depth maps and parallax maps are the first specified types of auxiliary video streams, relating to stereoscopic-view video content. In this context, ISO/IEC 23002-3:2007 specifies syntax and semantics for conveying information describing the interpretation of auxiliary video streams.
Syntax for such information is specified in ISO/IEC 23002-3:2007 as a stream of data referred to as a supplemental information (SI) message stream. Provisions for extensibility have been included, so that additional types of data can be defined in future extensions of the current SI message stream syntax by ISO/IEC.
An SI message stream can contain several concatenated SI messages, hence conveying various types of information. The auxiliary video SI (AVSI) is the only currently-defined type of SI (other than reserved SI message types that are reserved for future specification by ISO/IEC and are to be ignored by decoders if present). An AVSI message characterizes the interpretation of an auxiliary video sequence that accompanies a primary video sequence. For instance, an AVSI can indicate that the auxiliary video represents depth map information, and can provide parameters for the proper interpretation of the auxiliary video as such depth information. The means for identifying the primary video stream and the auxiliary video stream to which these messages pertain is a system-level issue that is outside the scope of ISO/IEC 23002-3:2007.
Although the auxiliary video SI is the only type of SI that is currently specified in ISO/IEC 23002-3:2007, the SI message format has been defined in a generic fashion so that it can potentially be used for purposes other than aiding in the interpretation of auxiliary video sequences. Any kind of data could potentially be carried in the SI message format.
According to the standard specification, “MPEG-C Part 3” is specified as follows:
Auxiliary Video Stream
An auxiliary video stream is a coded representation of an auxiliary video and should be accompanied by a Supplemental Information (SI) RBSP containing at least one Auxiliary Video Supplemental Information (AVSI) message. If more than one AVSI message is present in the SI RBSP, then the first one shall be taken into account and the other ones shall be discarded. The sample values m of an auxiliary video picture shall be interpreted according to the payload type payloadType of the AVSI message. The following table lists the valid AVSI payload types, the corresponding type of auxiliary video and the number of channels.
payloadType    Type of auxiliary video    Number of channels
0x10           Depth map                  1
0x11           Parallax map               1

4 ISO/IEC JTC1/SC29/WG11, “ISO/IEC CD 23002-3: Representation of auxiliary video and supplemental information”, Doc. N8259, Klagenfurt, Austria, July 2007.
The Supplemental Information (SI) RBSP is not part of the Auxiliary Video stream, and shall be conveyed by means that are beyond the scope of this International Standard.
The primary video and the auxiliary video might be spatially and/or temporally misaligned due to:
- interlaced/progressive mismatch,
- different spatial resolutions,
- different temporal resolutions.
Although the re-sampling process is voluntarily left open, the minimal constraints specified now should be met to ensure a correct matching of the primary and auxiliary samples.
Field/frame alignment is provided through the syntax elements aux_is_one_field, aux_is_bottom_field and aux_is_interlaced, which are part of AVSI messages.
Spatial alignment is provided through the two syntax elements position_offset_h and position_offset_v, which are part of AVSI messages.
The temporal synchronization between the primary and the auxiliary videos shall be conveyed by means beyond the scope of this Specification.
Supplemental Information (SI)
Syntax
si_rbsp( NumBytesInSI ) {                          Descriptor
    NumBytesInRBSP = 0
    while( NumBytesInRBSP < NumBytesInSI )
        si_message( )
}
Supplemental information message syntax
si_message( ) {                                    Descriptor
    payloadType = 0
    while( next_bits( 8 ) == 0xFF ) {
        ff_byte /* equal to 0xFF */                f(8)
        NumBytesInRBSP++
        payloadType += 255
    }
    last_payload_type_byte                         u(8)
    NumBytesInRBSP++
    payloadType += last_payload_type_byte
    payloadSize = 0
    while( next_bits( 8 ) == 0xFF ) {
        ff_byte /* equal to 0xFF */                f(8)
        NumBytesInRBSP++
        payloadSize += 255
    }
    last_payload_size_byte                         u(8)
    NumBytesInRBSP++
    payloadSize += last_payload_size_byte
    si_payload( payloadType, payloadSize )
    NumBytesInRBSP += payloadSize
}
Supplemental information payload syntax
si_payload( payloadType, payloadSize ) {           Descriptor
    is_avsi = FALSE
    if( payloadType == 0x10 || payloadType == 0x11 ) {
        is_avsi = TRUE
        generic_params()
    }
    if( payloadType == 0x10 )
        depth_params()
    else if( payloadType == 0x11 )
        parallax_params()
    else
        reserved_si_message( payloadSize )
}
Depth map parameters syntax
depth_params( ) {                                  Descriptor
    nkfar                                          u(8)
    nknear                                         u(8)
}
Parallax map parameters syntax
parallax_params( ) {                               Descriptor
    parallax_zero                                  u(16)
    parallax_scale                                 u(16)
    dref                                           u(16)
    wref                                           u(16)
}
Generic parameters syntax
generic_params( ) {                                Descriptor
    aux_is_one_field                               f(1)
    if( aux_is_one_field ) {
        aux_is_bottom_field                        f(1)
    }
    else {
        aux_is_interlaced                          f(1)
    }
    reserved_generic_bits                          f(6)
    position_offset_h                              u(8)
    position_offset_v                              u(8)
}
Reserved SI message syntax
reserved_si_message( payloadSize ) {               Descriptor
    for( i = 0; i < payloadSize; i++ )
        reserved_si_byte                           b(8)
}
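To make the syntax tables above more concrete, the following Python sketch parses an SI RBSP given as a byte string. It follows the tables as reproduced here and is not a conformance-tested parser. It assumes that bit fields are packed most significant bit first and that u(16) values are big-endian; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

AVSI_TYPES = (0x10, 0x11)  # depth map, parallax map


@dataclass
class SiMessage:
    payload_type: int
    payload_size: int
    generic: Optional[dict] = None
    params: Optional[dict] = None
    reserved: bytes = b""


def _read_ff_coded(data: bytes, pos: int):
    """Read a value coded as a run of 0xFF bytes (255 each) plus one last byte."""
    value = 0
    while data[pos] == 0xFF:
        value += 255
        pos += 1
    return value + data[pos], pos + 1


def _parse_payload(ptype: int, payload: bytes) -> SiMessage:
    msg = SiMessage(ptype, len(payload))
    if ptype not in AVSI_TYPES:
        msg.reserved = bytes(payload)  # reserved_si_message: bytes kept uninterpreted
        return msg
    first = payload[0]  # generic_params: 1 + 1 + 6 bits, assumed MSB first
    one_field = bool(first >> 7)
    second_flag = bool((first >> 6) & 1)
    msg.generic = {
        "aux_is_one_field": one_field,
        "aux_is_bottom_field": second_flag if one_field else None,
        "aux_is_interlaced": None if one_field else second_flag,
        "position_offset_h": payload[1],
        "position_offset_v": payload[2],
    }
    body = payload[3:]
    if ptype == 0x10:
        msg.params = {"nkfar": body[0], "nknear": body[1]}
    else:
        names = ("parallax_zero", "parallax_scale", "dref", "wref")
        msg.params = {n: int.from_bytes(body[2 * i:2 * i + 2], "big")
                      for i, n in enumerate(names)}
    return msg


def parse_si_rbsp(data: bytes) -> List[SiMessage]:
    """Parse concatenated SI messages until the RBSP is exhausted."""
    messages, pos = [], 0
    while pos < len(data):
        ptype, pos = _read_ff_coded(data, pos)
        psize, pos = _read_ff_coded(data, pos)
        messages.append(_parse_payload(ptype, data[pos:pos + psize]))
        pos += psize
    return messages


def first_avsi(messages: List[SiMessage]) -> Optional[SiMessage]:
    """Only the first AVSI message is taken into account; later ones are discarded."""
    return next((m for m in messages if m.payload_type in AVSI_TYPES), None)


# Depth-map AVSI (type 0x10, 5 payload bytes) followed by a reserved message (type 256).
example = bytes([0x10, 0x05, 0x00, 0x00, 0x00, 0x80, 0x80,
                 0xFF, 0x01, 0x02, 0xAA, 0xBB])
parsed = parse_si_rbsp(example)
```

The example stream is made up for illustration; the values of nkfar and nknear carry no particular meaning here.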
4.2. Simulation
Test data
The simulations for MPEG-C Part 3 have been carried out with the following test data sets:
Producer                     KUK
Sequences                    Car      Hands    Horse    Snail
Length [frames]              235      251      140      189
Framerate [frames/second]    30
Resolution [pixel]           480 x 270
Data Format                  VL + DL
Setup
The simulations for MPEG-C Part 3 have been carried out with the following coding settings:
Coder Implementation         JM 14.2
Standard                     H.264/AVC
Quantization Parameter       24, 30, 36, 42
GOP Size                     16 (hierarchical B pictures)
                             1 (no B pictures)
Intra Period                 16
Search Range                 32
Symbol Mode                  CABAC
Besides these settings typical configurations for H.264/AVC have been used.
Results
For MPEG-C Part 3 simulations the left view video and depth are encoded, transmitted and decoded independently. We achieved the following results:
Car   GOP 1, VL         GOP 1, DL         GOP 16, VL        GOP 16, DL
QP    Rate     PSNR     Rate     PSNR     Rate     PSNR     Rate     PSNR
      [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]
24    951,66   41,03    353,40   45,17    400,83   38,48    130,29   42,47
30    335,50   36,99    118,83   42,48    164,19   35,12    41,85    40,00
36    124,16   33,74    39,35    39,64    68,06    32,37    15,92    37,96
42    52,32    30,98    14,38    36,56    27,76    29,93    7,99     36,28
Hands GOP 1, VL         GOP 1, DL         GOP 16, VL        GOP 16, DL
QP    Rate     PSNR     Rate     PSNR     Rate     PSNR     Rate     PSNR
      [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]
24    2873,82  41,78    929,08   43,85    1565,86  37,40    374,22   39,89
30    1586,37  36,94    370,00   40,23    664,89   32,05    129,64   36,45
36    733,07   32,33    131,29   36,89    235,62   28,14    48,31    33,70
42    255,48   28,12    44,98    33,80    78,28    25,34    17,93    31,18
Horse GOP 1, VL         GOP 1, DL         GOP 16, VL        GOP 16, DL
QP    Rate     PSNR     Rate     PSNR     Rate     PSNR     Rate     PSNR
      [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]
24    1677,38  37,87    94,21    48,39    736,45   36,61    41,48    46,79
30    564,73   32,75    35,64    45,49    367,76   32,65    18,24    44,64
36    196,96   28,81    15,98    42,07    151,08   28,73    10,09    42,15
42    70,58    26,13    8,81     38,19    52,28    25,92    7,22     39,82
Snail GOP 1, VL         GOP 1, DL         GOP 16, VL        GOP 16, DL
QP    Rate     PSNR     Rate     PSNR     Rate     PSNR     Rate     PSNR
      [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]     [kbps]   [dB]
24    230,04   45,06    72,00    48,65    142,33   44,87    34,77    47,36
30    103,67   40,93    27,81    45,76    79,58    40,97    17,45    45,08
36    51,94    37,16    13,91    42,08    42,24    37,11    11,13    41,98
42    28,74    33,13    9,09     38,66    23,09    33,07    7,63     38,09
Table 4: RD results for MPEG-C Part 3 coding simulations with video plus depth format data
Figure 10: RD-comparison for MPEG-C Part 3 coding simulations using the same QP for VL and DL: total bitrate for video plus depth vs. average PSNR relative to the original VL sequence and the VR sequence rendered from original VL+DL, respectively
These simulation results clearly indicate that the overall RD-performance for temporal prediction using hierarchical B pictures (GOP 16) is better than for video plus depth coding without hierarchical B pictures (GOP 1). Note that for the evaluation of MPEG-C Part 3 simulations the right view has been rendered from video plus depth of the left view. Accordingly, the PSNR of the right view was calculated between the right view rendering results from compressed and original video plus depth data of the left view. The gains that can be achieved for the individual sequences differ largely, depending on factors like scene content and depth complexity as well as temporal variation. For the same quality, between almost zero and up to 50% of the bitrate can be saved with hierarchical B pictures (GOP 16).
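The PSNR measurement described above compares two rendered pictures rather than an original and a decoded one. A minimal Python sketch of the underlying formula for 8-bit luma samples follows; the rendering step itself is not shown, and the function name and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], test: Sequence[int], peak: int = 255) -> float:
    """PSNR in dB between two equally sized sample sequences (8-bit by default)."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical pictures
    return 10.0 * math.log10(peak * peak / mse)


# reference: luma of VR rendered from the original VL+DL
# test:      luma of VR rendered from the decoded VL+DL
rendered_from_original = [100, 120, 140, 160]
rendered_from_decoded = [101, 119, 142, 158]
print(f"{psnr(rendered_from_original, rendered_from_decoded):.2f} dB")
```

For a sequence, such per-picture values would typically be averaged over all frames. With this definition, rendering artifacts that are already present in the view rendered from the original data do not lower the measured PSNR.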
Since the video and the depth sequences are coded individually with MPEG-C Part 3, different
bitrate ratios (and thereby qualities) for left view video and left view depth can be combined. The
influence of such combinations on the RD-performance of the rendered right view has been
evaluated. The following figure shows the results for all 16 possible combinations of color and
depth quality in our experiments using 4 qualities (i.e. QP settings) for each. The curves
combine points of constant color quality (C24, C30, …) and points of constant depth quality
(D24, D30, …). Apparently curves of constant color quality are steeper. In most cases
increasing depth bitrate has a stronger influence on overall quality than increasing color bitrate.
At a certain bitrate the QP for depth should be chosen lower (i.e. better quality) than the QP for
color to achieve best overall results, e.g. C30, D24. It can be concluded that good depth quality
is essential for good overall quality.
The bitrate ratio between color and depth in such an optimum point may vary largely depending on the sequence. If we select D24 and C30 we get the following ratios (color : depth) for GOP 1 from the tables above:
Car: 335,50 : 353,40 ≈ 1 : 1
Hands: 1586,37 : 929,08 ≈ 1.7 : 1
Horse: 564,73 : 94,21 ≈ 6 : 1
Snail: 103,67 : 72,00 ≈ 1.4 : 1
For best overall quality a substantial portion of the bitrate has to be spent for depth in most cases.
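The ratios can be recomputed directly from Table 4. The short Python sketch below uses the GOP 1 bitrates for color at QP 30 and depth at QP 24; the helper names are hypothetical.

```python
# GOP 1 bitrates in kbps from Table 4: (color VL at QP 30, depth DL at QP 24).
rates = {
    "Car": (335.50, 353.40),
    "Hands": (1586.37, 929.08),
    "Horse": (564.73, 94.21),
    "Snail": (103.67, 72.00),
}


def color_to_depth_ratio(color_kbps: float, depth_kbps: float) -> float:
    """Color bitrate per unit of depth bitrate."""
    return color_kbps / depth_kbps


def depth_share(color_kbps: float, depth_kbps: float) -> float:
    """Fraction of the total (V+D) bitrate that is spent on depth."""
    return depth_kbps / (color_kbps + depth_kbps)


for name, (color, depth) in rates.items():
    print(f"{name}: {color_to_depth_ratio(color, depth):.2f} : 1, "
          f"depth share {100 * depth_share(color, depth):.0f}%")
```

Expressing the result as a depth share of the total bitrate makes it easier to compare sequences with very different absolute bitrates.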
[Plot: Car, GOP1. Horizontal axis: Total bitrate (V+D) [kbps], 0 to 1400; vertical axis: Y-PSNR [dB], 31 to 43; curves C24, C30, C36, C42, D24, D30, D36, D42.]
[Plot: Hands, GOP1. Horizontal axis: Total bitrate (V+D) [kbps], 0 to 4000; vertical axis: Y-PSNR [dB], 27 to 37; curves C24, C30, C36, C42, D24, D30, D36, D42.]
[Plot: Horse, GOP1. Horizontal axis: Total bitrate (V+D) [kbps], 0 to 2000; vertical axis: Y-PSNR [dB], 27 to 37; curves C24, C30, C36, C42, D24, D30, D36, D42.]
[Plot: Snail, GOP1. Horizontal axis: Total bitrate (V+D) [kbps], 0 to 350; vertical axis: Y-PSNR [dB], 33 to 45; curves C24, C30, C36, C42, D24, D30, D36, D42.]
Figure 11: RD details for the rendered right view VR for MPEG-C Part 3 coding simulations: total bitrate for video plus depth vs. average PSNR relative to the VR sequence rendered from original VL+DL; all combinations of color and depth quality; curves of constant color quality and curves of constant depth quality
Informal subjective expert viewing has been carried out for the MPEG-C Part 3 simulation results on a stereoscopic display using the results with same QP for video and depth. This led to the conclusion that the objective RD results are confirmed, as for the same bitrate a higher quality is achieved by using hierarchical B pictures (GOP 16) for temporal prediction, or in return a lower bitrate is necessary to achieve the same subjective quality. Without hierarchical B pictures (GOP 1) the sequences are temporally more unsteady.
5. H.264 Auxiliary Picture Syntax for video plus depth
Figure 12: Schematic block diagram for H.264 Auxiliary Picture Syntax coding with video plus depth format data
As depicted in the overview diagram, H.264 Auxiliary Picture Syntax uses the video plus depth format, consisting of the input video sequences Video and the associated depth information Depth for one of the two views of a stereo pair. The codec used for H.264 Auxiliary Picture Syntax is H.264/AVC, which is applied to both sequences simultaneously but independently (with Video being the primary coded picture and Depth the auxiliary coded picture), resulting in one encoded bit- or transport-stream BS/TS. After transmission over the Channel this stream is decoded, again simultaneously but independently for primary and auxiliary coded pictures, resulting in the distorted Video sequence and the distorted Depth sequence for one of the two views of a stereo pair.
5.1. Specification
In addition to basic coding tools, the H.264/AVC standard enables sending extra supplemental information along with the compressed video data. This often takes a form called "supplemental enhancement information" (SEI) or "video usability information" (VUI) in the standard. SEI data is specified in a backward-compatible way, so that new types of supplemental information can even be used with profiles of the standard that were specified before those types were defined. The first version of the standard includes the definition of a variety of such SEI data, which we will not specifically review herein. Instead we focus only on the new types of backward-compatible supplemental and auxiliary data that are defined in the FRExt amendment. One of these new types of data is auxiliary pictures, which are extra monochrome pictures sent along with the main video stream that can be used for such purposes as alpha blend compositing (specified as a different category of data than SEI).
Definitions
Primary coded picture: The coded representation of a picture to be used by the decoding process for a bitstream conforming to H.264/AVC. The primary coded picture contains all macroblocks of the picture. The only pictures that have a normative effect on the decoding process are primary coded pictures.
Auxiliary coded picture: A picture that supplements the primary coded picture that may be used in combination with other data not specified by H.264/AVC in the display process. An auxiliary coded picture has the same syntactic and semantic restrictions as a monochrome redundant coded picture. An auxiliary coded picture must contain the same number of macroblocks as the primary coded picture. Auxiliary coded pictures have no normative effect on the decoding process.
Decoding
The decoding of auxiliary coded pictures is not required for conformance with H.264/AVC.
The (optional) decoding process for the decoding of auxiliary coded pictures is the same as if the auxiliary coded pictures were primary coded pictures in a separate coded video stream (with some minor constraints).
The syntax of each coded slice of an auxiliary coded picture shall obey the same constraints as a coded slice of a redundant picture, with the following differences of constraints.
– If the primary coded picture is an IDR picture, the auxiliary coded slice syntax shall correspond to that of a slice of an IDR picture;
– Otherwise (the primary coded picture is not an IDR picture), the auxiliary coded slice syntax shall correspond to that of a slice of a non-IDR picture.
– The slices of an auxiliary coded picture (when present) shall contain all macroblocks corresponding to those of the primary coded picture.
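The constraints listed above can be expressed as a simple consistency check between a primary coded picture and its auxiliary coded picture. The Python sketch below is a hypothetical illustration with made-up data structures. It is not part of the JM software and does not cover all constraints of H.264/AVC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodedPicture:  # hypothetical container for the properties checked below
    is_idr: bool
    num_macroblocks: int
    is_monochrome: bool = False


def macroblocks(width: int, height: int, mb_size: int = 16) -> int:
    """Number of 16x16 macroblocks needed to cover a picture."""
    return -(-width // mb_size) * -(-height // mb_size)  # ceiling division


def aux_picture_violations(primary: CodedPicture, aux: CodedPicture) -> List[str]:
    """Return a list of violated constraints (empty if none were found)."""
    problems = []
    if aux.is_idr != primary.is_idr:
        problems.append("IDR status of auxiliary and primary picture differs")
    if aux.num_macroblocks != primary.num_macroblocks:
        problems.append("auxiliary picture does not cover all macroblocks")
    if not aux.is_monochrome:
        problems.append("auxiliary picture is not monochrome")
    return problems


n_mb = macroblocks(480, 270)  # resolution of the test sequences
video = CodedPicture(is_idr=True, num_macroblocks=n_mb)
depth = CodedPicture(is_idr=True, num_macroblocks=n_mb, is_monochrome=True)
print(aux_picture_violations(video, depth))
```

A depth map fits the auxiliary picture syntax naturally because it consists of a single sample channel.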
5.2. Simulation
Test data
The simulations for H.264 Auxiliary Picture Syntax have been carried out with the following test data sets:
Producer                     KUK
Sequences                    Car      Hands    Horse    Snail
Length [frames]              235      251      140      189
Framerate [frames/second]    30
Resolution [pixel]           480 x 270
Data Format                  VL + DL
Setup
The simulations for H.264 Auxiliary Picture Syntax have been carried out with the following coding settings:
Coder Implementation         JM 14.2
Standard                     H.264/AVC
Quantization Parameter       24, 30, 36, 42
GOP Size                     16 (hierarchical B pictures)
                             1 (no B pictures)
Intra Period                 16
Search Range                 32
Symbol Mode                  CABAC
Besides these settings typical configurations for H.264/AVC have been used.
Results
For H.264 Auxiliary Picture Syntax simulations the left view video and depth are encoded, transmitted and decoded independently. Based on the same simulations as for MPEG-C Part 3, we achieved the following results:
Figure 13: RD-comparison for H.264 Auxiliary Picture Syntax coding simulations: total bitrate for video plus depth vs. average PSNR relative to the original VL sequence and the VR sequence rendered from original VL+DL, respectively
As already described in section 4.2, these simulation results clearly indicate that the overall RD-performance for temporal prediction using hierarchical B pictures (GOP 16) is better than for simulcast coding without hierarchical B pictures (GOP 1). Note that, as in section 4.2, the PSNR of the right view was calculated between the right view rendering results from compressed and original video plus depth data of the left view. The gains that can be achieved for the individual sequences differ largely, depending on factors like scene content and depth complexity as well as temporal variation. For the same quality, between almost zero and up to 50% of the bitrate can be saved with hierarchical B pictures (GOP 16). In contrast to MPEG-C Part 3, no variations of different video and depth qualities are possible with H.264 Auxiliary Picture Syntax. Therefore the different contribution of the left and the rendered right view to the average PSNR is analyzed here. For high bitrates the left view achieves a higher quality than the rendered right view, while for low bitrates the opposite can be observed in some cases. The differences between the two views are mostly small, but especially for high bitrates differences of several dB are possible.
Informal subjective expert viewing has been carried out for the H.264 Auxiliary Picture Syntax simulation results on a stereoscopic display. Due to the equivalent simulations the conclusions for the subjective quality evaluation are the same as for MPEG-C Part 3 simulations (see section 4.2 for details).
6. Comparative Analysis
Figure 14: RD-comparison for the three different simulations on stereo video coding approaches: total bitrate for both views vs. average PSNR relative to the original sequences
Objective results in terms of RD-performance have been compared for the three different simulations on stereo video coding approaches (without depth), namely H.264 Simulcast, H.264 Stereo SEI Message and H.264/MVC. This comparison clearly indicates that the overall RD-performance of both Stereo SEI Message and MVC coding is better than for Simulcast coding. The results for Stereo SEI Message and MVC cannot be compared directly, because different coding conditions had to be used for the simulations. However, since the corresponding simulcast coding experiments with JM (in section 2.2) and with JMVM (in section 3.2) achieve very similar results, it seems reasonable to conclude that Stereo SEI performs better than MVC in some cases. Comparison of the RD-performance gains between Simulcast and Stereo SEI (using equal coding conditions) shows that for the same quality up to 35% of the total bitrate can be saved with inter-view prediction. However, in some cases, as for the Hands sequence, the gain is negligible.
Informal subjective expert viewing has been carried out for the simulation results of the three different approaches for coding of stereo video format data on a stereoscopic display. This led to the conclusion that the objective RD results are confirmed, as for the same bitrate a lower quality is achieved by Simulcast coding than by Stereo SEI or MVC coding. In return, for these two approaches a lower bitrate is necessary to achieve the same subjective quality as simulcast.
From a complexity point of view all three approaches are comparable. They use the same basic operations. Using hierarchical B pictures with a GOP of 16 certainly means a tremendous increase of memory requirements and delay. A standard-conforming implementation of MVC requires that the decoder supports the H.264 High Profile, since MVC extends it. A standard-conforming implementation of the Stereo SEI Message requires that the decoder supports the H.264 interlaced tools.
Figure 15: RD-comparison for simulcast coding with stereo video and video plus depth format data: total bitrate vs. average PSNR for both views
Objective results in terms of RD-performance have been compared for the simulcast simulations with stereo video and video plus depth format data. However, the comparison between stereo video and video plus depth coding results has to take into account that for stereo video the PSNR of left and right view is calculated between original and decoded pictures, while for video plus depth the PSNR of the right view is calculated between the right view rendered from compressed and from original video plus depth data of the left view. Taking this difference into account, the results indicate that the overall RD-performance of video plus depth is better than for stereo video with simulcast coding. For both formats the RD-performance for temporal prediction using hierarchical B pictures (GOP 16) is better than for simulcast coding without hierarchical B pictures (GOP 1).
Initial and very informal subjective expert viewing has been carried out to compare simulcast with video plus depth coding. However, these comparisons were done using sequences with equal PSNR and not equal bitrate, since no data were available yet for equal bitrate. Note that the PSNR values are calculated differently, as stated above. In these tests simulcast stereo coding showed better subjective results than video plus depth, but the bitrate was higher as well. Therefore it is not possible to draw conclusions at this point.
A detailed comparison of the different approaches is the main work in WP2 in the coming months. Detailed subjective testing is already planned in collaboration with WP4, which will also answer the questions remaining open here at this point.
7. Conclusions
This deliverable investigated available 3D video representation formats and coding standards for mobile applications. Simulations were carried out with realistic coding settings, e.g. intra period of 16 for random access and error robustness. A typical set of test sequences was used targeting the display to be used in the demonstrator and covering different types of content (high and low scene content complexity, high and low temporal variation). As for any type of video coding, the same amount of raw input data leads to very different RD-performance. The required bitrate for achieving acceptable quality strongly depends on the properties of the sequence content, especially temporal variation and complexity of the scene. The coding gain from inter-view prediction (Stereo SEI & MVC) varies largely.
Significant coding gains can be achieved with hierarchical B pictures for temporal prediction. Not using hierarchical B pictures results not only in considerably higher bitrates for the same objective quality, but also in worse subjective quality. However, the gain from using hierarchical B pictures differs largely for individual sequences, depending on factors like scene content complexity and temporal variation. On the other hand, hierarchical B pictures also mean increased complexity and memory requirements. It remains to be studied how far this can be implemented on a mobile terminal.
Savings from GOP 16 vs. GOP 1 for 3D video coding:
H.264 Simulcast: up to 50% bitrate saving
H.264 Stereo SEI: up to 60% bitrate saving
MVC: up to 50% bitrate saving
MPEG-C Part 3: up to 50% bitrate saving
Inter-view prediction leads to a significant reduction of bitrate for some sequences. However, in some cases the gain is negligible. In our experiments we achieved up to 35% bitrate savings from inter-view prediction compared to stereo simulcast. Inter-view prediction, whether performed as H.264 SEI or MVC, does not add substantial complexity. It uses the same basic operations. However, a standard-conforming implementation of MVC requires that the decoder supports the H.264 High Profile, since MVC extends it. A standard-conforming implementation of the Stereo SEI Message requires that the decoder supports the H.264 interlaced tools.
A representation as video plus depth is an interesting alternative for 3D video. It allows the stereo rendering to be adjusted at the decoder and the 3D impression to be optimally adapted for any given display. However, this extended functionality comes at the cost of increased complexity, since rendering of one output view has to be done at the terminal device. Depth estimation is necessary on the sender side, which is an inherently error-prone task. Nevertheless, it has been shown that good general quality is achievable by the video plus depth approach. Please refer to D2.3 “Report on generation of video plus depth data base”.
MPEG-C Part 3 is suitable for encoding of video plus depth data. Different bitrate ratios (and
thereby qualities) for video and depth can be adjusted. It was found that at a certain bitrate the
QP for depth should be chosen lower than the QP for color to achieve best overall results, e.g.
C30, D24. It can be concluded that good depth quality is essential for good overall quality. The
numerical bitrate ratio between color and depth may vary largely depending on the sequence.
We have found ratios between 1:1 and 6:1. In most cases for best overall quality a substantial
portion of the bitrate has to be spent for depth.
A detailed comparison of “video plus video” approaches (simulcast, H.264 SEI, MVC) and video
plus depth (MPEG-C Part 3) is still to be done. This will be done in close collaboration with
WP4, which will perform formal subjective tests about these issues. These experiments will also
include mixed resolution stereo video coding, as an extension of available 3D video formats.
Mobile 3DTV Content Delivery Optimization over DVB-H System
MOBILE3DTV - Mobile 3DTV Content Delivery Optimization over DVB-H System - is a three-year project which started in January 2008. The project is partly funded by the European Union 7th RTD Framework Programme in the context of the Information & Communication Technology (ICT) Cooperation Theme.
The main objective of MOBILE3DTV is to demonstrate the viability of the new technology of mobile 3DTV. The project develops a technology demonstration system for the creation and coding of 3D video content, its delivery over DVB-H and display on a mobile device, equipped with an auto-stereoscopic display.
The MOBILE3DTV consortium is formed by three universities, a public research institute and two SMEs from Finland, Germany, Turkey, and Bulgaria. Partners span diverse yet complementary expertise in the areas of 3D content creation and coding, error resilient transmission, user studies, visual quality enhancement and project management.
For further information about the project, please visit www.mobile3dtv.eu.
Tuotekehitys Oy Tamlink, FINLAND: Project coordinator
Tampereen Teknillinen Yliopisto, FINLAND: Visual quality enhancement, Scientific coordinator
Fraunhofer Gesellschaft zur Förderung der angewandten Forschung e.V, GERMANY: Stereo video content creation and coding
Middle East Technical University, TURKEY: Error resilient transmission
Technische Universität Ilmenau, GERMANY: Design and execution of subjective tests
MM Solutions Ltd., BULGARIA: Design of prototype terminal device