
Video Codec Basics


Video compression algorithms ("codecs") help reduce storage and bandwidth requirements while delivering good visual quality. Among the most widely used video codecs are MPEG-2, which is used for DVD video, and MPEG-4, which is widely used for streaming and mobile devices. Video data is traditionally represented as a stream of images, called frames, as shown in the following figure. These frames are displayed to the user at a constant rate, called the frame rate (frames per second). A commonly used frame rate is 30.

If you observe the above sequence of frames, you will notice a large amount of similarity between consecutive frames. This similarity is redundant information in terms of video storage. Video compression is achieved by eliminating this redundant information from the sequence of incoming frames, and there are four important types of redundancy.

Temporal Redundancy: correlation between consecutive frames in a video sequence. The following figure illustrates the correlation between consecutive frames. If we subtract Frame-1 from the other frames, the remaining amount of data is significantly less. This is inter-frame redundancy removal, since we are subtracting one frame from another.

Spatial Redundancy: correlation between adjacent pixels within a frame.

Color Spectral Redundancy: the human eye's increased sensitivity to small differences in brightness, and decreased sensitivity to small differences in color, is taken advantage of here.

Psycho-visual Redundancy: the human visual system is less sensitive to detailed texture in the image.

Embedded developers and researchers should have a good understanding of the various algorithms that remove these redundancies.

Resolution

The choice of resolution mainly depends upon the processing power available in the given equipment. The higher the resolution, the more processing power is required, and hence the more energy is required. In plain words, the smaller the equipment, the lower the resolution supported. The common resolutions, QCIF, CIF, SD and HD, are detailed below.

Teleconference: resolution of 176 x 144, known as QCIF. Generally 5-10 frames per second (fps) is used. The uncompressed size at this resolution and frame rate is around 1.6-3 Mbps. After compression with codecs such as MPEG-2/H.263, the expected size is around 32-64 Kbps.

Multimedia: resolution of 352 x 288, known as CIF. Generally 30 fps is used. Uncompressed size: 36 Mbps. Expected compressed size: 200-300 Kbps.

Standard Definition: resolution of 720 x 486, known as SD. Generally 30 fps is used. Uncompressed size: 168 Mbps. Expected compressed size: 4-6 Mbps.

High Definition: resolution of 1920 x 1080, known as HD. Generally 30 fps is used. Uncompressed size: 1.2 Gbps. Expected compressed size: 20 Mbps.

Digital Cinema: resolution of 4096 x 2160, known as DC. Generally 24 fps is used. Uncompressed size: 7.6 Gbps. Expected compressed size: 100 Mbps.
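As a quick sanity check on the figures above, the uncompressed bitrate follows directly from resolution, frame rate, chroma sampling and bit depth. The short sketch below is an illustration only; the samples-per-pixel factors and 8-bit depth are assumptions chosen to roughly reproduce the QCIF, CIF and SD numbers quoted above.

```python
# Rough uncompressed-bitrate calculator (illustrative sketch only).
# samples_per_pixel: 1.5 for 4:2:0 chroma sampling, 2.0 for 4:2:2; 8-bit samples assumed.

def uncompressed_bitrate_mbps(width, height, fps, samples_per_pixel=1.5, bit_depth=8):
    bits_per_frame = width * height * samples_per_pixel * bit_depth
    return bits_per_frame * fps / 1e6  # megabits per second

for name, w, h, fps, spp in [
    ("QCIF (10 fps, 4:2:0)", 176, 144, 10, 1.5),
    ("CIF  (30 fps, 4:2:0)", 352, 288, 30, 1.5),
    ("SD   (30 fps, 4:2:2)", 720, 486, 30, 2.0),
]:
    print(f"{name}: {uncompressed_bitrate_mbps(w, h, fps, spp):.1f} Mbps uncompressed")
```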

H.264: Advanced Video Codec

An H.264 encoder is considerably more efficient, and also considerably more complex, than MPEG-2, and it adds many new features. The high compression efficiency of the H.264 standard is achieved by combining a number of encoding algorithms. This video codec tutorial will also try to cover the relevant parts of H.264 and its implementation aspects.

In this block diagram we can see five processing blocks, each one removing one of the redundancies explained before. For the removal of temporal redundancy we have the temporal model, which analyses the incoming frames against previous frames to identify the similarities. In the block diagram, the temporal model has two inputs: the incoming video and the previous frames stored inside the memory of the codec.

Figure: Basic Video Encoder

The spatial model removes the redundancies within a frame without comparing it with other frames. This is similar to JPEG or GIF compression. For any given frame we execute either temporal or spatial compression. The Discrete Cosine Transform, along with the quantization module, removes the psycho-visual redundancy. The entropy coder removes the redundancies in the final bit-stream, for example recurring patterns in the bit-stream.

Algorithms in a Video Codec

A video codec is a cocktail of algorithms, combined in the proper fashion to achieve compression. The significant algorithms/tools used in the encoder are discussed here.

1) Intra Frame
Intra-prediction utilizes spatial correlation in each frame to reduce the amount of data necessary to represent the picture. The intra frame is essentially the first frame to be encoded, but with a lower amount of compression.

Macroblocks
Let us consider a small video frame of size 352 x 288 pixels. For ease of processing, the frame is broken into smaller chunks. For many codecs, square blocks of 16 x 16 pixels are used, as shown in the figure.

The choice of a 16 x 16 block size is a good compromise between the number of unique pieces of information coded and the amount of commonality between pixels within a block. These blocks are termed macroblocks. Macroblocks are the basic building blocks of the standard, on which the encoding/decoding process is carried out.

The macroblocks can be further broken into smaller pieces of 8 x 8 and 4 x 4 pixels. While most of the processing happens on 16 x 16 pixels, the smaller blocks are used wherever finer precision is required.

Figure: Macroblock, Block
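As a concrete illustration of this partitioning, the tiny sketch below walks a CIF frame in raster order, one 16 x 16 macroblock at a time; the helper name and the frame dimensions are purely illustrative.

```python
# Partition a frame into 16x16 macroblocks (illustrative sketch).
MB_SIZE = 16

def macroblocks(width, height):
    """Yield the top-left corner of every 16x16 macroblock in raster order."""
    for y in range(0, height, MB_SIZE):
        for x in range(0, width, MB_SIZE):
            yield x, y

# A 352x288 CIF frame holds (352/16) * (288/16) = 22 * 18 = 396 macroblocks.
print(sum(1 for _ in macroblocks(352, 288)))  # -> 396
```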

YUV

The human eye senses colour and brightness with different sets of sensors, and the brightness sensors are comparatively more sensitive than the colour sensors. Image compression algorithms take advantage of this phenomenon. The compression algorithms first transform the image from RGB to the luminance/chrominance (Y-Cb-Cr) colour space. Here Y, called luma, represents the brightness/grayscale, and Cb and Cr are the two colour components, representing the extent to which the colour deviates from gray toward blue and red, respectively. Since the human visual system is more sensitive to luma than to chroma, each chroma component is stored with one quarter the number of samples of the luma component. This is done by downsampling by a factor of two in both the horizontal and vertical dimensions. This is called 4:2:0 sampling, with 8 bits of precision per sample. The most widely used YUV format is 4:2:0.
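A minimal sketch of this colour conversion and 4:2:0 downsampling, assuming the common BT.601 full-range coefficients and plain 2 x 2 averaging for the chroma subsample; real codecs use filtered subsampling and fixed-point arithmetic.

```python
import numpy as np

def rgb_to_ycbcr_420(rgb):
    """rgb: HxWx3 float array in [0, 255]. Returns full-res Y and quarter-res Cb, Cr."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # BT.601 full-range conversion (assumed here for simplicity).
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0

    def subsample(c):
        # 4:2:0 subsampling: average each 2x2 block of chroma samples.
        h, w = c.shape
        return c.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    return y, subsample(cb), subsample(cr)

frame = np.random.randint(0, 256, (288, 352, 3)).astype(float)
y, cb, cr = rgb_to_ycbcr_420(frame)
print(y.shape, cb.shape, cr.shape)  # (288, 352) (144, 176) (144, 176)
```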

Frame Prediction

There are two important kinds of frames in video coding: I (intra) frames and P (predicted) frames. P-frames contribute significantly towards the high compression ratios. Depending on the type of prediction, one more frame prediction method is also widely used: B (bidirectional) frames. Intra frames do not refer to other frames, making them suitable as key frames; they are, essentially, self-contained compressed images. Consider the following as a film strip.

Intra Frame (I-Frame): the first frame is coded independently.

Predicted Frame (P-Frame): the second frame is coded using only the differences from the first frame.

Predicted Frame (P-Frame): the third frame is coded using only the differences from the second frame.

Predicted Frame (P-Frame): the fourth frame is coded using only the differences from the third frame.

In this film strip, the first frame is an I-frame. The following frames are predicted from the first frame; hence these frames are predicted frames, known as P-frames.

I-frame
The intra frame is essentially the first frame to be encoded, but with a lower amount of compression. This frame is also known as a key frame, because the following frames are encoded using the information available from this frame. Intra-prediction utilizes spatial correlation in each frame to reduce the amount of data necessary to represent the picture. An intra frame is more or less similar to image compression such as JPEG or GIF; it is coded without any dependencies on other frames. It is the type of frame in which a complete image is stored in the data stream, so an intra frame can be decoded on its own without referring to other frames. These frames are formed using intra-prediction methods, which are discussed below.

H.264 view: H.264 performs intra-prediction on two different block sizes: 16 x 16 (the entire macroblock) and 4 x 4. 16 x 16 prediction is generally chosen for areas of the picture that are smooth. 4 x 4 prediction, on the other hand, is useful for predicting more detailed sections of the frame. In the following picture, some of the locations where 16 x 16 and 4 x 4 prediction would be used are pointed out.

The general idea is to predict a block, whether a 4 x 4 or a 16 x 16 block, based on surrounding pixels, using the mode that results in a prediction that most closely resembles the actual pixels in that block.
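To make that idea concrete, the sketch below implements three of the simplest 4 x 4 predictors (vertical, horizontal and DC) and selects the one with the lowest sum of absolute differences against the actual block. It is only an illustration: real H.264 encoders evaluate all nine 4 x 4 modes and also account for the cost of signalling the chosen mode.

```python
import numpy as np

def predict_4x4(block, top, left):
    """block: 4x4 actual pixels; top: 4 pixels above; left: 4 pixels to the left.
    Returns (mode_name, prediction) for the best of three simple intra modes."""
    candidates = {
        "vertical":   np.tile(top, (4, 1)),                 # copy the row above downwards
        "horizontal": np.tile(left.reshape(4, 1), (1, 4)),  # copy the column at the left rightwards
        "dc":         np.full((4, 4), (top.sum() + left.sum() + 4) // 8),  # mean of the neighbours
    }
    costs = {m: np.abs(block - p).sum() for m, p in candidates.items()}
    best = min(costs, key=costs.get)
    return best, candidates[best]

block = np.array([[10, 10, 11, 12]] * 4)   # a smooth block with a mild vertical structure
top   = np.array([10, 10, 11, 12])
left  = np.array([10, 10, 10, 10])
mode, pred = predict_4x4(block, top, left)
print(mode)  # "vertical": copying the row above reproduces this block exactly
```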

P-frame (Inter prediction)
P-frames are predicted using the previous P- or I-frame. In the film strip above, frames 2 to 4 are predicted frames.

Inter frames are encoded from the second incoming frame onwards. This type of frame is responsible for most of the reduction in the size of the video stream, which is achieved by extracting only the motion information from the frames.

Motion estimation
Motion estimation algorithms are used in the encoding of inter frames. H.264 encoding supports sub-pixel resolution for motion vectors, meaning that the reference block is actually calculated by interpolating inside a block of real pixels. The motion vectors for luma blocks are expressed at quarter-pixel resolution, and for chroma blocks the accuracy can be one-eighth pixel.

B-frame
B-frames are bidirectionally predicted frames. As the name suggests, B-frames rely on the frames preceding and following them. B-frames contain only the data that have changed from the preceding frame or that differ from the data in the very next frame. In the following figure, frames 2 and 3 are B-frames.

B-frames are interesting for two reasons. First, they have slightly better prediction. Second, and more important, they do not impact the quality of the following frames, so they can be coded with lower quality without degrading the whole sequence. Since B-frames depend on both past and future pictures, the decoder has to be fed the future I- and P-frames before being able to decode them.

In conclusion: I-frames are the least compressible but don't require other video frames to decode. P-frames can use data from previous I- or P-frames and are more compressible than I-frames. B-frames can use both previous and following frames for data reference, achieving the highest amount of compression.

More details on intra prediction

Intra-prediction modes
There are nine 4 x 4 prediction modes, shown in the following figure, and four 16 x 16 modes. The four 16 x 16 modes are similar to modes 0, 1, 2, and a combination of modes 3 and 8, of the 4 x 4 modes.

2) Motion Estimation
The fundamental concept in video compression is to store only the incremental changes between frames. The difference between two frames is extracted by the motion estimation tool. Here a whole frame is reduced to many sets of motion vectors.

3) Motion Compensation
Motion compensation decodes the image that was encoded by motion estimation. The image is reconstructed from the received motion vectors and the reference frame.
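A brute-force, integer-pixel block-matching sketch of motion estimation: for one 16 x 16 block it scans a small search window in the reference frame and keeps the displacement with the lowest sum of absolute differences (SAD). Real encoders use fast search patterns and the sub-pixel interpolation mentioned earlier; all names here are illustrative.

```python
import numpy as np

def full_search(cur, ref, bx, by, block=16, search=8):
    """Return (dx, dy, sad) of the best match for the block at (bx, by) in `cur`
    inside a +/-`search` window of the reference frame `ref` (integer-pel only)."""
    target = cur[by:by + block, bx:bx + block].astype(int)
    best = (0, 0, np.inf)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + block > ref.shape[1] or y + block > ref.shape[0]:
                continue  # candidate block falls outside the reference frame
            sad = int(np.abs(target - ref[y:y + block, x:x + block].astype(int)).sum())
            if sad < best[2]:
                best = (dx, dy, sad)
    return best

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, (2, 3), axis=(0, 1))   # simulate a shift of 3 px right, 2 px down
print(full_search(cur, ref, 16, 16))      # best offset (-3, -2) with SAD 0: the block came from 3 px left, 2 px up
```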

4) Transformation
A transform is used to compress the image data in inter frames and intra frames. The most commonly used transforms are the Discrete Cosine Transform (DCT) and the wavelet transform. The codec calculates a DCT on each 4 x 4 block of pixels in a frame.

5) Quantization
The quantization stage reduces the amount of information by dividing each coefficient by a particular number, reducing the quantity of possible values that coefficient could have. Because this makes the values fall into a narrower range, it allows the entropy coding to express the values more compactly. (A simplified transform-and-quantize sketch follows this list.)

6) De-blocking filter
Loop filtering is mandatory in the encoder; it identifies a blocking situation depending on two threshold factors (alpha and beta). A lot of the codec's efficiency is due to the loop filter. The strength of the filter depends on intra/inter coding, differential vectors and the quantization level. Up to 40% of the total processing power may be required by this kind of filter. Filtering the reference frames prior to using them in prediction can significantly improve the objective and perceptual quality, especially at low or medium bitrates.

7) Entropy Coder
This algorithm is a lossless encoding tool, i.e. the encoded stream can be decoded without any loss. H.264 deploys an enhanced VLC of two types: 1) context-adaptive variable-length coding (CAVLC) and 2) context-adaptive binary arithmetic coding (CABAC). With knowledge of the probabilities of syntax elements in a given context, syntax elements in the video stream can be losslessly compressed.

8) Network Abstraction Layer
All the compressed data is packetized in a network-friendly format by the NAL unit. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and bitstream delivery is identical, except that each NAL unit can be preceded by a start code prefix in a bitstream-oriented transport layer.

9) Rate Distortion Optimization / Rate Control
The size of the compressed bitstream varies depending upon the content of the frames. For example, a slow-moving movie will generate very little compressed data, whereas a fast-moving movie will generate significantly more compressed data, for the same resolution and fps. This characteristic may not be acceptable in many situations. The rate control mechanism keeps the output bitrate within the requirement.
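The simplified transform-and-quantize sketch promised above: it applies the H.264 4 x 4 core integer transform (an integer approximation of the DCT) followed by a single uniform quantizer step. The real standard folds normalisation factors into QP-dependent tables, so treat the step size and the plain division here as assumptions.

```python
import numpy as np

# H.264 4x4 forward core transform matrix (an integer approximation of a DCT).
CF = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def transform_and_quantize(residual, step):
    """residual: 4x4 integer prediction residual. Returns quantized coefficients."""
    coeff = CF @ residual @ CF.T                 # 2-D integer transform
    return np.round(coeff / step).astype(int)    # uniform quantizer (simplified)

residual = np.array([[5, 4, 4, 5],
                     [4, 3, 3, 4],
                     [4, 3, 3, 4],
                     [5, 4, 4, 5]])
print(transform_and_quantize(residual, step=8))
# Most high-frequency coefficients quantize to zero, which is exactly what the
# entropy coder later exploits.
```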


Figure: Rate Control

Rate control is achieved by adding a buffer at the output of the VLC and feeding back a rate control signal from the buffer. This feedback controls the strength of the quantization coefficients in the quantizer. If we need constant bitrate (CBR) video, then the buffer output bitrate must be constant. That means its input bitrate must be controlled so as to avoid overflow or underflow. The common way of controlling the bitrate is to monitor buffer fullness and then feed this information back to the quantizer. Usually the step size of the quantizer is adjusted to keep the buffer around the midpoint, or half full.
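A toy version of that feedback loop, assuming a simple proportional controller: when the buffer is more than half full the quantizer step is increased, otherwise it is relaxed. The gain, limits and drain rate below are arbitrary illustrative values, not taken from any real rate-control algorithm.

```python
def update_qstep(qstep, buffer_fullness, buffer_size, gain=0.1,
                 qstep_min=1.0, qstep_max=100.0):
    """Simple proportional rate control: keep the output buffer around half full.
    buffer_fullness and buffer_size are in bits; returns the new quantizer step size."""
    error = buffer_fullness / buffer_size - 0.5   # positive -> buffer too full
    qstep *= 1.0 + gain * error                   # coarser quantization drains the buffer
    return min(max(qstep, qstep_min), qstep_max)

qstep, fullness = 10.0, 0
for frame_bits in [400_000, 350_000, 300_000, 250_000]:   # hypothetical per-frame sizes
    fullness = max(0, fullness + frame_bits - 250_000)     # 250 kbit drained per frame (CBR channel)
    qstep = update_qstep(qstep, fullness, buffer_size=1_000_000)
    print(f"buffer={fullness:>7} bits  qstep={qstep:.2f}")
```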

The top-level block diagram of an H.264 encoder is shown in the figure.

Figure: H.264 Encoder Block Diagram

The encoding operation consists of a forward encoding path and an inverse decoding path. The forward encoding path (red lines) predicts each MB using inter or intra prediction, and transforms and quantizes (TQ) the residual. It then forwards the result to the entropy encoder and forms output packets in the Network Abstraction Layer (NAL). This article concentrates more on the encoder side, since the encoder contains most of the components of the decoder as well, except the entropy decoder. The inverse path (blue lines) involves the reconstruction of the MB from the previously transformed data by applying the inverse transform and quantization (ITQ) and the deblocking filter.

Deinterlacing

Deinterlacing is defined as the conversion of an interlaced image into a progressive scan image.

Most of the newer technology display devices have some type of deinterlacer built into them, but just as in scaling, how well it is performed is critical to the image quality you see. Video comes to your display in two forms: video from a video camera and video produced from film. Both present their own unique challenges for a deinterlacer.

Video originally from a video source (anything shot by a video camera rather than on film) is recorded as individual fields. (Remember, a field equals one half of a frame.) In NTSC, these fields consist of 240 lines of information, or half the resolution needed for a full frame. The problem is that these two fields, one with the odd lines of the frame and one with the even lines of the frame, are not actually recorded at the same time. If everything is motionless, there isn't a problem with simply taking the odd field and adding it to the even field to make up one full progressive frame of information; everything would look great. The problem lies with motion. If there is motion between the time the odd field is captured and when the even field is captured by the camera, you can't simply add the two fields together to create a frame. When these fields are played back in interlaced form, one after the other, the difference in fields isn't noticeable because they are not shown at the same time. However, if you were to simply add the fields together to form a progressive scan image, you would get something that looks like this:

Because this car is moving, just adding the two fields together won't work. The resulting jagged edges seen above are a sure sign of poor deinterlacing and are called "jaggies". A good deinterlacer solves this by comparing the separate fields, field one versus field two. In areas of high motion, it interpolates (averages) the two areas to create that portion of the progressive frame, while at the same time it combines only the areas that are not in motion. This process is called motion adaptive deinterlacing. The resulting image is smoothed out as follows:
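A per-pixel sketch of motion adaptive deinterlacing under strong simplifying assumptions: the even field of the previous frame is used as the motion reference, moving pixels are interpolated from the odd lines above and below ("bob"), and static pixels are woven straight from the current even field. The threshold and array layout are illustrative choices, not how any particular deinterlacing chip works.

```python
import numpy as np

def motion_adaptive_deinterlace(odd_field, even_field, prev_even_field, threshold=10):
    """odd_field/even_field: the two fields of the current frame (H/2 x W each).
    prev_even_field: the even field of the previous frame, used to detect motion."""
    h2, w = odd_field.shape
    frame = np.zeros((h2 * 2, w))
    frame[0::2] = odd_field                      # odd lines come straight from the odd field
    motion = np.abs(even_field.astype(int) - prev_even_field.astype(int)) > threshold
    # Interpolate the even lines from the odd lines above and below (simple "bob").
    below = np.roll(odd_field, -1, axis=0)
    interpolated = (odd_field.astype(int) + below.astype(int)) // 2
    frame[1::2] = np.where(motion, interpolated, even_field)   # weave where the image is static
    return frame

odd  = np.random.randint(0, 256, (120, 320))
even = np.random.randint(0, 256, (120, 320))
print(motion_adaptive_deinterlace(odd, even, even).shape)   # (240, 320): fully static, so it weaves
```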

You may now be tempted to say, "well, that was easy", but hold on. We now have a new situation to consider. As we've mentioned before, NTSC video might have originally been converted from film. Film is, by nature, already progressive scan (a full frame), but is captured at 24 frames per second, while video is captured at 30 frames per second (60 fields per second) in an interlaced format. This means that there has to be some creativity involved in converting the progressive film into interlaced video, due to the timing difference. Here's how it works.

Every frame of film has to be split into fields; two fields per frame are needed for video. The first film frame is used for the first three fields, or frame-and-a-half, of video. The next film frame is used to make the next two fields of video. This continues at a three-field, two-field rate, as shown in the figure above. Obviously, certain video frames don't add up (they come from two separate frames of film), but remember that this is for display on an interlaced television. Because you never actually see a complete frame on an interlaced television, your eyes can't see that the frames might not match up, much in the same way that motion doesn't match up in video. This process for converting film to interlaced video is called 3:2 pulldown. There is a problem, though. You cannot use the same deinterlacing techniques here as we used for video. What happens if we change scenes from frame A of film to frame B of film? The second frame of video would have information from two completely different scenes! You can't simply look at the two scenes and add them together or figure out an average. You actually have to reverse the 3:2 pulldown process. Here's a diagram of how that is done:

Looking at the diagram above, you can see that the deinterlacer first finds the original two interlaced fields that made up the first frame of film and combines them. It then displays the first full frame of film as the first three frames of progressive scan video. It does the same thing with the second frame of film, but displays it as two frames of progressive scan video. The next film frame is displayed three times again, and so on.
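The 3:2 cadence and its reversal can be written down as a small mapping from film frames to video fields. The sketch below is only a schematic of that bookkeeping (real inverse telecine must also detect the cadence and handle edits), and the field-parity naming is an illustrative simplification.

```python
def pulldown_32(film_frames):
    """Map 24 fps film frames to 60i fields using the repeating 3-2 cadence.
    Returns a list of (film_frame_index, field_parity) tuples."""
    fields = []
    for i, _ in enumerate(film_frames):
        repeats = 3 if i % 2 == 0 else 2          # frame A gets 3 fields, B gets 2, C gets 3, ...
        for _ in range(repeats):
            parity = "top" if len(fields) % 2 == 0 else "bottom"
            fields.append((i, parity))
    return fields

def inverse_pulldown(fields):
    """Recover the original film frame order by keeping one entry per film frame."""
    seen, frames = set(), []
    for frame_index, _ in fields:
        if frame_index not in seen:
            seen.add(frame_index)
            frames.append(frame_index)
    return frames

fields = pulldown_32(["A", "B", "C", "D"])
print(len(fields))                 # 4 film frames -> 10 video fields (2.5 fields per frame on average)
print(inverse_pulldown(fields))    # [0, 1, 2, 3]: the film frames recovered in order
```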

This works because progressive video displays 60 full frames of video per second instead of 60 fields per second. The end result is a very smooth image without any deinterlacing problems. The slight downside is that, because it is displaying at a rate of three frames, two frames, three frames, the video has a slight "judder" to it, although at 60 frames per second it is almost indistinguishable. In order to create a great progressive scan image, the deinterlacer must efficiently perform deinterlacing of both video and film. The other thing the deinterlacer must excel at is knowing when it is looking at film-based material and when it is looking at video. If it can't do that well, then everything else is rather moot, because the deinterlacer might try to apply film-type deinterlacing to video (which simply wouldn't work) or video-type deinterlacing to film (which again would exhibit serious problems). When purchasing a display device for home theater usage, make sure that the deinterlacer is of the best quality. Company names to look for would be Faroudja and Silicon Image.