
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 1, FEBRUARY 1999

Wavelet Based Rate Scalable Video Compression
Ke Shen, Member, IEEE, and Edward J. Delp, Fellow, IEEE

Abstract—In this paper, we present a new wavelet based rate scalable video compression algorithm. We will refer to this new technique as the Scalable Adaptive Motion Compensated Wavelet (SAMCoW) algorithm. SAMCoW uses motion compensation to reduce temporal redundancy. The prediction error frames and the intracoded frames are encoded using an approach similar to the embedded zerotree wavelet (EZW) coder. An adaptive motion compensation (AMC) scheme is described to address error propagation problems. We show that, using our AMC scheme, the quality of the decoded video can be maintained at various data rates. We also describe an EZW approach that exploits the interdependency between color components in the luminance/chrominance color space. We show that, in addition to providing a wide range of rate scalability, our encoder achieves performance comparable to the more traditional hybrid video coders, such as MPEG-1 and H.263. Furthermore, our coding scheme allows the data rate to be dynamically changed during decoding, which is very appealing for network-oriented applications.

Index Terms—Motion compensation, rate scalable, video compression, wavelet transform.

I. INTRODUCTION

Many applications require that digital video be delivered over computer networks. The available bandwidth of most computer networks almost always poses a problem when video is delivered. A user may request a video sequence with a specific quality. However, the variety of requests and the diversity of the traffic on the network may make it difficult for a video server to predict, at the time the video is encoded and stored on the server, the video quality and data rate it will be able to provide to a particular user at a given time. One solution to this problem is to compress and store a video sequence at different data rates. The server will then deliver the requested video at the proper rate given network loading and the specific user request. This approach requires more resources to be used on the server in terms of disk space and management overhead. Therefore, scalability, the capability of decoding a compressed sequence at different data rates, has become a very important issue in video coding. Scalable video coding has applications in digital libraries, video database systems, video streaming, videotelephony, and multicast of television (including HDTV).

The term "scalability" used here includes data rate scalability, spatial resolution scalability, temporal resolution scalability, and computational scalability. The MPEG-2 video

Manuscript received October 31, 1997; revised May 30, 1998. This work was supported by grants from the AT&T Foundation and the Rockwell Foundation. This paper was recommended by Associate Editor S. Panchanathan.

The authors are with the Video and Image Processing Laboratory (VIPER), School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-1285 USA.

Publisher Item Identifier S 1051-8215(99)01231-8.

compression standard incorporated several scalable modes, including signal-to-noise ratio (SNR) scalability, spatial scalability, and temporal scalability [1], [2]. However, these modes are layered instead of being continuously scalable. Continuous rate scalability provides the capability of arbitrarily selecting the data rate within the scalable range. It is very flexible, and allows the video server to tightly couple the available network bandwidth and the data rate of the video being delivered.

A specific coding strategy known as embedded rate scalable coding is well suited for continuous rate scalable applications [3]. In embedded coding, all of the compressed data are embedded in a single bit stream, and can be decoded at different data rates. In image compression, this is very similar to progressive transmission. The decompression algorithm receives the compressed data from the beginning of the bit stream up to a point where a chosen data rate requirement is achieved. A decompressed image at that data rate can then be reconstructed, and the visual quality corresponding to this data rate can be obtained. Thus, to achieve the best performance, the bits that convey the most important information need to be embedded at the beginning of the compressed bit stream. For video compression, the situation is more complicated since a video sequence contains multiple images. Instead of sending the initial portion of the bit stream to the decoder, the sender needs to selectively provide the decoder with portions of the bit stream corresponding to different frames or sections of frames of the video sequence. These selected portions of the compressed data meet the data rate requirement, and can then be decoded by the decoder. This approach can be used if the position of the bits corresponding to each frame or each section of frames can be identified.

In this paper, we propose a new continuous rate scalable hybrid video compression algorithm using the wavelet transform. We will refer to this new technique as the Scalable Adaptive Motion Compensated Wavelet (SAMCoW) algorithm. SAMCoW uses motion compensation to reduce temporal redundancy. The prediction error frames (PEF's) and the intracoded frames (I frames) are encoded using an approach similar to embedded zerotree wavelet (EZW) coding [3], which provides continuous rate scalability. The novelty of this algorithm is that it uses an adaptive motion compensation (AMC) scheme to eliminate quality decay even at low data rates. A new modified zerotree wavelet image compression scheme that exploits the interdependence between the color components in a frame is also described. The nature of SAMCoW allows the decoding data rate to be dynamically changed to meet network loading. Experimental results show that SAMCoW has a wide range of scalability. For medium data rate (CIF images, 30 frames/s) applications, a scalable range of 1–6 Mbits/s can be achieved. The performance is comparable to that of



MPEG-1 at fixed data rates. For low bit-rate (QCIF images, 10–15 frames/s) applications, the data rate can be scaled from 20 to 256 Kbits/s.

Fig. 1. One level of the wavelet transform.

In Section II, we provide an overview of wavelet based embedded rate scalable coding and the motivation for using the motion compensated scheme in our new scalable algorithm. In Section III, we describe our new adaptive motion compensation (AMC) scheme and SAMCoW. In Section IV, we provide implementation details of the SAMCoW algorithm. Simulation results are presented in Section V.

II. RATE SCALABLE CODING

A. Rate Scalable Image Coding

Rate scalable image compression, or progressive transmission of images, has been extensively investigated [4]–[6]. Reviews on this subject can be found in [7] and [8]. Different transforms, such as the Laplacian pyramid [4], the discrete cosine transform (DCT) [6], and the wavelet transform [3], [9], have been used for progressive transmission.

Shapiro introduced the concept of embedded rate scalable coding using the wavelet transform and spatial orientation trees (SOT's) [3]. Since then, variations of the algorithm have been proposed [9]–[11]. These algorithms have attracted much attention due to their superb performance, and are candidates for the baseline algorithms used in JPEG2000 and MPEG-4. In this section, we provide a brief overview of several wavelet based embedded rate scalable algorithms.

A wavelet transform corresponds to two sets of analysis/synthesis digital filters $(g, h)$, where $g$ is a high-pass filter and $h$ is a low-pass filter. By using the filters $g$ and $h$, an image can be decomposed into four bands. Subsampling is used to translate the subbands to a baseband image. This is the first level of the wavelet transform (Fig. 1). The operations can be repeated on the low–low (LL) band. Thus, a typical 2-D discrete wavelet transform used in image processing will generate the hierarchical pyramidal structure shown in Fig. 2. The inverse wavelet transform is obtained by reversing the transform process, replacing the analysis filters with the synthesis filters, and using upsampling (Fig. 3). The wavelet transform can decorrelate the image pixel values, and results in frequency and spatial orientation separation. The transform coefficients in each band exhibit unique statistical properties that can be used for encoding the image.
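To make the decomposition concrete, the following Python sketch computes one level of a 2-D wavelet transform using the simple Haar filter pair for $h$ and $g$; this is only an illustration, since the paper itself uses the 9-7 biorthogonal filters of [33].

```python
import numpy as np

def dwt2_haar(img):
    # One level of a 2-D wavelet transform using the Haar filter pair
    # (low-pass = scaled average, high-pass = scaled difference).
    # Assumes even image dimensions. Returns the LL, LH, HL, and HH
    # subbands, each half the size of the input.
    x = img.astype(float)
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)    # filter/subsample rows
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)  # then columns
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return ll, lh, hl, hh

# Repeating dwt2_haar on the LL band produces the pyramid of Fig. 2.
```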

For image compression, quantizers can be designed specifically for each band. The quantized coefficients can then be binary coded using either Huffman coding or arithmetic coding [12]–[14].

Fig. 2. Pyramid structure of a wavelet decomposed image. Three levels of the wavelet decomposition are shown.

Fig. 3. One level of the inverse wavelet transform.

In embedded coding, a key issue is to embed the more important information at the beginning of the bit stream. From a rate–distortion point of view, one wants to first quantize the wavelet coefficients that cause larger distortion in the decompressed image. Let the wavelet transform be $C = W(I)$, where $I$ is the collection of image pixels and $C$ is the collection of wavelet transform coefficients. The reconstructed image is obtained by the inverse transform $\tilde{I} = W^{-1}(\tilde{C})$, where $\tilde{C}$ is the collection of quantized transform coefficients. The distortion introduced in the image is $D = \sum d(\tilde{I}, I)$, where $d(\cdot, \cdot)$ is the distortion metric and the summation is over the entire image. The greatest distortion reduction can be achieved if the transform coefficient with the largest magnitude is quantized and encoded without distortion. Furthermore, to strategically distribute the bits such that the decoded image will look "natural," progressive refinement or bit-plane coding is used. Hence, in the coding procedure, multiple passes through the data are made. Let $M$ be the largest magnitude in $C$. In the first pass, those transform coefficients with magnitudes greater than $M/2$ are considered significant and are quantized to a value of $3M/4$. The rest are quantized to 0. In the second pass, those coefficients that have been quantized to 0 but have magnitudes between $M/4$ and $M/2$ are considered significant and are quantized to $3M/8$. Again, the rest are quantized to zero. Also, those coefficients found significant in an earlier pass are refined to one more level of precision, i.e., $5M/8$ or $7M/8$. This process can be repeated until the data rate meets the requirement or the quantization step is small enough. Thus, we can achieve the largest distortion reduction with the smallest


number of bits, while the coded information is distributed across the image.
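The following Python sketch illustrates the successive-approximation passes described above, with the significance threshold halving each pass and the reconstruction levels $3M/4$, $3M/8$, $5M/8$, $7M/8$, and so on. It is a minimal model of the magnitude quantization only; the position (zerotree) coding that makes EZW efficient is omitted.

```python
import numpy as np

def successive_approximation(coeffs, num_passes):
    # Bit-plane (successive refinement) quantization of the wavelet
    # coefficients, as described above. Position coding is omitted;
    # only the reconstructed magnitudes are modeled.
    m = np.abs(coeffs).max()                 # M, the largest magnitude
    recon = np.zeros_like(coeffs, dtype=float)
    threshold = m / 2.0
    for _ in range(num_passes):
        newly = (np.abs(coeffs) > threshold) & (recon == 0)
        # Newly significant coefficients are quantized to the middle of
        # [T, 2T): 3M/4 in the first pass, 3M/8 in the second, ...
        recon[newly] = np.sign(coeffs[newly]) * 1.5 * threshold
        # Coefficients already significant gain one bit of precision,
        # e.g., 3M/4 is refined to 5M/8 or 7M/8 in the second pass.
        old = (recon != 0) & ~newly
        recon[old] += np.sign(coeffs[old] - recon[old]) * threshold / 2.0
        threshold /= 2.0
    return recon
```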

To make this strategy work, however, we need to encode the position information of the wavelet coefficients along with the magnitude information. It is critical that the positions of the significant coefficients be encoded efficiently. One could scan the image in a given order that is known to both the encoder and decoder. This is the approach used in JPEG with the "zigzag" scanning. A coefficient is encoded as 0 if it is insignificant or as 1 if it is significant relative to the threshold. However, the majority of the transform coefficients are insignificant when compared to the threshold, especially when the threshold is high. These coefficients will be quantized to zero, which will not reduce the distortion even though we still have to use at least one symbol to code them. Using more bits to encode the insignificant coefficients results in lower efficiency.

It has been observed experimentally that coefficients which are quantized to zero at a certain pass have structural similarity across the wavelet subbands in the same spatial orientation. Thus, spatial orientation trees (SOT's) can be used to quantize large areas of insignificant coefficients efficiently (e.g., the zerotree in [3]).

The EZW algorithm proposed by Shapiro [3] and the SPIHT technique proposed by Said and Pearlman [9] use slightly different SOT's (shown in Fig. 4). The major difference between these two algorithms lies in the fact that they use different strategies to scan the transformed pixels. The SOT used by Said and Pearlman [9] is more efficient than Shapiro's [3].
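As a rough illustration, the sketch below enumerates descendants under the common convention that a coefficient at position $(i, j)$ of the pyramid has four children at $(2i, 2j)$, $(2i, 2j+1)$, $(2i+1, 2j)$, and $(2i+1, 2j+1)$; the special LL-band rules that distinguish Fig. 4(a) from Fig. 4(b) are not modeled.

```python
def children(i, j, n_rows, n_cols):
    # Children of coefficient (i, j) in a spatial orientation tree,
    # using the convention that the four children sit at the same
    # spatial location one level finer. Coefficients in the highest
    # frequency bands (the finer half of the pyramid) have no children.
    if 2 * i >= n_rows or 2 * j >= n_cols:
        return []
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1),
            (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def descendants(i, j, n_rows, n_cols):
    # All descendants of (i, j): the set a single zerotree symbol covers.
    stack, found = children(i, j, n_rows, n_cols), []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children(node[0], node[1], n_rows, n_cols))
    return found
```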

B. Scalable Video Coding

One could achieve continuous rate scalability in a video coder by using a rate scalable still image compression algorithm, such as [3], [6], [9], to encode each video frame. This is known as the "intraframe coding" approach. We used Shapiro's algorithm [3] to encode each frame of the football sequence. The rate–distortion performance is shown in Fig. 5. A visually acceptable decoded sequence, comparable to MPEG-1, is obtained only when the data rate is larger than 2.5 Mbits/s for a CIF (352 × 240) sequence. This low performance is due to the fact that the temporal redundancy in the video sequence is not exploited. Taubman and Zakhor proposed an embedded scalable video compression algorithm using 3-D subband coding [15]. One drawback of their scheme is that the 3-D subband algorithm cannot exploit the temporal correlation of the video sequence very efficiently, especially when there is a great deal of motion. Also, since 3-D subband decomposition requires multiple frames to be processed at the same time, more memory is needed for both the encoder and the decoder, which results in delay. Other approaches to 3-D subband video coding are presented in [16] and [17].

Motion compensation is very effective in reducing temporal redundancy, and is commonly used in video coding. A motion compensated hybrid video compression algorithm usually consists of two major parts: the generation and compression of the motion vector (MV) fields, and the compression of the I frames and prediction error frames.

Fig. 4. Diagrams of the parent–descendant relationships in the spatial orientation trees. (a) Shapiro's algorithm. Notice that the pixel in the LL band has three children. Other pixels, except for those in the highest frequency bands, have four children. (b) Said and Pearlman's algorithm. One pixel in the LL band (noted with "*") does not have a child. Other pixels, except for those in the highest frequency bands, have four children.

Fig. 5. Average PSNR of the EZW encoded football sequence (I frames only) at different data rates (30 frames/s).

Motion compensation is usually block based, i.e., the current image is divided into blocks, and each block is matched with a reference frame. The best matched block of pixels from the reference frame is then used in the current block. The prediction error frame (PEF) is obtained by taking the difference between the current


Fig. 6. Block diagram of a generalized hybrid video codec for predictively coded frames. A feedback loop is used in the encoder. Adaptive motion compensation is not used.

frame and the motion predicted frame. PEF's are usually encoded using either block-based transforms, such as the DCT [8], or nonblock-based coding, such as subband coding or the wavelet transform. The DCT is used in the MPEG and H.263 algorithms [1], [18], [19]. A major problem with a block-based transform coding algorithm is the existence of visually unpleasant block artifacts, especially at low data rates. This problem can be eliminated by using the wavelet transform, which is usually obtained over the entire image. The wavelet transform has been used in video coding for the compression of motion predicted error frames [20], [21]. However, these algorithms are not scalable. If we use wavelet based rate scalable algorithms to compress the I frames and PEF's, rate scalable video compression can be achieved. Recently, a wavelet based rate scalable video coding algorithm was proposed by Wang and Ghanbari [22]. In their scheme, the motion compensation is done in the wavelet transform domain. However, in the wavelet transform domain, spatial shifting results in phase shifting; hence, motion compensation does not work well, and may cause motion tracking errors in high-frequency bands. Pearlman [23], [24] has extended SPIHT to a three-dimensional SOT for use in video compression.

III. A NEW APPROACH: SAMCoW

A. Adaptive Motion Compensation

One of the problems of any rate scalable compression algorithm is the inability of the codec to maintain a constant visual quality at any data rate. Often, the distortion of a decoded video sequence varies from frame to frame. Since a video sequence is usually decoded at 25 or 30 frames/s (or 5–15 frames/s for low data rate applications), the distortion of each frame may not be discerned as accurately as when individual frames are examined, due to temporal masking. Yet, the distortion of each frame contributes to the overall perception of the video sequence. When the quality of successive frames decreases for a relatively long time, a viewer will notice the change. This increase in distortion, sometimes referred to as "drift," may be perceived as an increase in fuzziness and/or blockiness in the scene. This phenomenon can occur due to artifact propagation, which is very common when motion compensated prediction is used. This can be more serious with a rate scalable compression technique.

Motion vector fields are generated by matching the current frame with its reference frame. After the motion vector field $V$ is obtained for the current frame $f$, the predicted frame $f_p$ is generated by rearranging the pixels in the reference frame $f_r$ relative to $V$. We denote this operation by $f_p = \mathrm{MC}(V, f_r)$, where $f_p$ is the predicted frame and $f_r$ is the reference frame. The prediction error frame is obtained by taking the difference between the current frame and the predicted frame, $e = f - f_p$. At the decoder, the predicted frame is obtained by using the decoded motion vector field and the decoded reference frame, $\tilde{f}_p = \mathrm{MC}(\tilde{V}, \tilde{f}_r)$. The decoded frame is then obtained by adding the $\tilde{f}_p$ to the decoded PEF $\tilde{e}$, that is, $\tilde{f} = \tilde{f}_p + \tilde{e}$.


Fig. 7. Block diagram of the proposed codec for predictively coded frames. Adaptive motion compensation is used.

Usually, the motion field is losslessly encoded. By maintaining the same reference frame at the encoder and the decoder, i.e., $\tilde{f}_r = f_r$, we have $\tilde{f}_p = f_p$. This results in the decoded PEF $\tilde{e}$ being the only source of distortion in $\tilde{f}$. Thus, one can achieve better performance if the encoder and decoder use the same reference frame. For a fixed rate codec, this is usually achieved by using a prediction feedback loop in the encoder so that a decoded frame is used as the reference frame (Fig. 6). This procedure is commonly used in MPEG or H.263. However, in our scalable codec, the decoded frames have different distortions at different data rates. Hence, it is impossible for the encoder to generate the exact reference frames used by the decoder at all possible data rates. One solution is to have the encoder locked to a fixed data rate (usually the highest data rate) and let the decoder run freely, as in Fig. 6. The codec will work exactly as the nonscalable codec when decoding at the highest data rate. However, when the decoder is decoding at a low data rate, the quality of the decoded reference frames at the decoder will deviate from that at the encoder. Hence, both the motion prediction and the decoding of the PEF's contribute to the increase in distortion of the decoded video sequence. This distortion also propagates from one frame to the next within a group of pictures (GOP). If the size of a GOP is large, the increase in distortion can be unacceptable.

Fig. 8. Diagram of the parent–descendant relationships in the SAMCoW algorithm. This tree is developed on the basis of the tree structure in Shapiro's algorithm. The YUV color space is used.

To maintain video quality, we need to keep the reference frames the same at both the encoder and the decoder. This can be achieved by adding a feedback loop in the decoder (Fig. 7), such that the decoded reference frames at both the encoder and decoder are locked to the same data rate, namely the lowest data rate. We denote this scheme as adaptive motion compensation (AMC) [25], [26]. We assume that the target data rate is within the range $[R_{\min}, R_{\max}]$, and that the bits required to encode the motion vector fields have data rate $R_{mv}$, where $R_{mv} < R_{\min}$. At the encoder, since $R_{\min}$ is known, the embedded bit stream can always be decoded at rate $R_{\min} - R_{mv}$, which is then added to the predicted frame to generate the reference frame. At the decoder, the embedded bit stream is decoded at two data rates: the targeted data rate $R_t$ and the fixed data rate $R_{\min} - R_{mv}$. The frame decoded at rate $R_{\min} - R_{mv}$ is added to the predicted frame to generate the reference frame, which is exactly the same as the reference frame used in the encoder. The frame decoded at rate $R_t$ is added to the predicted frame to generate the final decoded frame. This way, the reference frames at the encoder and the decoder are identical, which leaves the decoded PEF as the only source of distortion. Hence, error propagation is eliminated.
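The following toy Python sketch illustrates the AMC idea, assuming a stand-in decode_at_rate() that keeps the n largest "coefficients" in place of true embedded decoding, and omitting motion (the prediction is simply the previous reference). The point is only the structure: the reference is always rebuilt at the lowest rate on both sides, while the displayed frame uses the requested rate.

```python
import numpy as np

def decode_at_rate(coeffs, n_kept):
    # Toy stand-in for embedded decoding: keep the n_kept largest
    # magnitude "coefficients" and zero the rest. A real codec would
    # instead truncate an EZW bit stream at the desired rate.
    out = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-n_kept:]
    out[idx] = coeffs[idx]
    return out

def encode_sequence(frames, n_low):
    # Encoder side: the reference is always rebuilt at the LOWEST rate,
    # so any decoder can reproduce it exactly.
    ref, pefs = np.zeros_like(frames[0]), []
    for f in frames:
        pef = f - ref                       # prediction error (motion omitted)
        pefs.append(pef)
        ref = ref + decode_at_rate(pef, n_low)
    return pefs

def decode_sequence(pefs, n_low, n_target):
    ref, decoded = np.zeros_like(pefs[0]), []
    for pef in pefs:
        decoded.append(ref + decode_at_rate(pef, n_target))  # displayed frame
        ref = ref + decode_at_rate(pef, n_low)               # shared reference
    return decoded

rng = np.random.default_rng(0)
frames = [rng.normal(size=64) for _ in range(10)]
decoded = decode_sequence(encode_sequence(frames, n_low=8), n_low=8, n_target=32)
```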


B. Embedded Coding of Color Images

Many wavelet based rate scalable algorithms, such as EZW [3] and SPIHT [9], can be used for the encoding of I frames and PEF's. However, these algorithms were developed for gray-scale images. To code a color image, the color components are treated as three individual gray-scale images, and the same coding scheme is used for each component. The interdependence between the color components is not exploited. To exploit the interdependence between color components, the algorithm may also be used on the decorrelated color components generated by a linear transform. In Said and Pearlman's algorithm [9], the Karhunen–Loève (KL) transform is used [27]. The KL transform is optimal in the sense that the transform coefficients are uncorrelated. The KL transform, however, is image dependent, i.e., the transform matrix needs to be obtained for each image and transmitted along with the coded image.

The red–green–blue (RGB) color space is commonly used because it is compatible with the mechanism of color display devices. Other color spaces are also used; among these are the luminance and chrominance (YC) spaces, which are popular in video/television applications. A YC space, e.g., YUV, YIQ, or YCrCb, consists of a luminance component and two chrominance (color difference) components. The YC spaces are popular because the luminance signal can be used to generate a gray-scale image, which is compatible with monochrome systems, and the three color components have little correlation, which facilitates the encoding and/or modulation of the signal [28], [29].

Although the three components in a YC space are uncorrelated, they are not independent. Experimental evidence has shown that at the spatial locations where the chrominance signals have large transitions, the luminance signal also has large transitions [30], [31]. Transitions in an image usually correspond to wavelet coefficients with large magnitudes in high-frequency bands. Thus, if a transform coefficient in a high-frequency band of the luminance signal has small magnitude, the transform coefficients of the chrominance components at the corresponding spatial location and frequency band should also have small magnitude [22], [32]. In embedded zerotree coding, if a zerotree occurs in the luminance component, a zerotree at the same location in the chrominance components is highly likely to occur. This interdependence of the transform coefficients between the color components is incorporated into SAMCoW.

In our algorithm, the YUV space is used. The algorithm is similar to Shapiro's algorithm [3]. The SOT is described as follows. The original SOT structure in Shapiro's algorithm is used for the three color components. Each chrominance node is also a child node of the luminance node at the same location in the wavelet pyramid. Thus, each chrominance node has two parent nodes: one is of the same chrominance component in a lower frequency band, and the other is of the luminance component in the same frequency band. A diagram of the SOT is shown in Fig. 8.

Fig. 9. PSNR of each frame within a GOP of the football sequence at different data rates. Solid lines: AMC; dashed lines: non-AMC; data rates in Kbits/s (from top to bottom): 6000, 5000, 3000, 1500, 500.

In our algorithm, the coding strategy is similar to Shapiro's algorithm, consisting of dominant passes and subordinate passes. The symbols used in the dominant pass are positive significant, negative significant, isolated zero, and zerotree. In the dominant pass, the luminance component is scanned first. For each luminance pixel, all descendants, including those of the luminance component and those of the chrominance components, are examined, and appropriate symbols are assigned. The zerotree symbol is assigned if the current coefficient and its descendants in the luminance and chrominance components are all insignificant. The two chrominance components are alternately scanned after the luminance component is scanned. The coefficients in the chrominance components that have already been encoded as part of a zerotree while scanning the luminance component are not examined. The subordinate pass is essentially the same as that in Shapiro's algorithm.
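A minimal sketch of the dominant-pass decision for one coefficient is given below; the symbol names and the function interface are illustrative, and the caller is assumed to supply the cross-component descendant list defined by the SOT of Fig. 8.

```python
def dominant_symbol(value, threshold, descendant_values):
    # Dominant-pass symbol for one luminance coefficient. Per the SOT of
    # Fig. 8, descendant_values must include the luminance descendants
    # AND the co-located chrominance coefficients, so a single zerotree
    # symbol can cover all three color components.
    if abs(value) > threshold:
        return "POS" if value > 0 else "NEG"   # positive/negative significant
    if all(abs(d) <= threshold for d in descendant_values):
        return "ZTR"                           # zerotree root
    return "IZ"                                # isolated zero
```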

IV. IMPLEMENTATION OF SAMCoW

The discrete wavelet transform was implemented using the biorthogonal wavelet basis from [33] (the 9-7 tap filter bank). Four to six levels of wavelet decomposition were used, depending on the image size.

The video sequences used in our experiments use the YUV color space with the chrominance components downsampled to 4:1:1. Motion compensation is implemented using macroblocks, i.e., 16 × 16 for the Y component and 8 × 8 for the U and V components, respectively. The search range is ±15 luminance pixels in both the horizontal and vertical directions. Motion vectors are restricted to integer precision. The spatially corresponding blocks in the Y, U, and V components share the same motion vector. One problem with block-based motion compensation is that it introduces blockiness in the prediction error images. The blocky edges cannot be efficiently coded using the wavelet transform, and may introduce unpleasant ringing effects. To reduce the blockiness in the prediction error images, overlapped block motion compensation is used for the Y component [34], [20], [19]. Let $B_{i,j}$ be the $i$th row and $j$th column macroblock of the luminance image, and let $v_{i,j}$ be its motion vector. The predicted pixel


Fig. 10. Frame 35 (P frame) of the football sequence, decoded at different data rates using SAMCoW (CIF, 30 frames/s). (a) Original. (b) 6 Mbps. (c) 4 Mbps. (d) 2 Mbps. (e) 1.5 Mbps. (f) 1 Mbps.

values for $B_{i,j}$ are the weighted sum of the predictions obtained with the motion vectors of the current block and of its neighboring blocks. The weighting values for the current block are given by a matrix $H_c$, the weighting values for the top block by a matrix $H_t$, and the weighting values for the left block by a matrix $H_l$. The weighting values for the bottom and right blocks are obtained by flipping $H_t$ vertically and $H_l$ horizontally, respectively. Obviously, at each pixel position the five weighting values sum to one, which is the necessary condition for overlapped motion compensation. The motion vectors are differentially coded. The prediction of the motion vector for the current macroblock is obtained by taking the median of the motion vectors of the left, the top, and the top-right adjacent macroblocks. The difference between the current motion vector and the predicted motion vector is entropy coded.
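The median prediction can be sketched as follows, assuming (as in H.263-style predictors) that the median is taken componentwise over the three neighboring motion vectors.

```python
def predict_motion_vector(left, top, top_right):
    # Median prediction from the left, top, and top-right neighbors,
    # taken componentwise. Each argument is an (mvx, mvy) pair; the
    # residual (current - predicted) is what gets entropy coded.
    def median3(a, b, c):
        return sorted((a, b, c))[1]
    return (median3(left[0], top[0], top_right[0]),
            median3(left[1], top[1], top_right[1]))
```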

In our experiments, the GOP size is 100 or 150 frames, with the first frame of a GOP being intracoded.


Fig. 11. Comparison of the performance of SAMCoW and Taubman and Zakhor's algorithm. Dashed lines: SAMCoW; solid lines: Taubman and Zakhor's algorithm. The sequences are decoded at 6, 4, 2, 1.5, and 1 Mbits/s, which, respectively, correspond to the lines from top to bottom.

To maintain the video quality of a GOP, the intracoded frames need to be encoded with relatively more bits. We encode an intracoded frame using six to ten times the number of bits used for each predictively coded frame. In our experiments, no bidirectionally predictive-coded frames (B frames) are used. However, the nature of our algorithm does not preclude the use of B frames.

The embedded bit stream is arranged as follows. The necessary header information, such as the resolution of the sequence and the number of levels of the wavelet transform, is embedded at the beginning of the sequence. In each GOP, the I frame is coded first using our rate scalable coder. For each predictively coded frame, the motion vectors are differentially coded first. The PEF is then compressed using our rate scalable algorithm. After the bits of each frame are sent, an end-of-frame (EOF) symbol is transmitted. The decoder can then decode the sequence without prior knowledge of the data rate. Therefore, the data rate can be changed dynamically in the process of decoding.
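The framing can be sketched as follows; the two-byte EOF marker value and the reader interface are illustrative assumptions, not the actual SAMCoW syntax, and the sketch assumes the marker pattern is escaped inside frame payloads.

```python
EOF_MARKER = b"\xff\xfe"   # illustrative marker value, not the real syntax

def read_frame_payload(stream, max_bytes):
    # Read one frame's embedded data up to the EOF marker, then keep only
    # the first max_bytes to meet the currently requested data rate.
    # Because every frame ends with the marker, the rate can be changed
    # from frame to frame while decoding.
    buf = bytearray()
    while not buf.endswith(EOF_MARKER):
        byte = stream.read(1)
        if not byte:                         # end of the whole bit stream
            return bytes(buf[:max_bytes])
        buf += byte
    return bytes(buf[:-len(EOF_MARKER)][:max_bytes])
```

A decoder would call read_frame_payload() once per frame, changing max_bytes whenever the network loading changes.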

Fig. 12. Comparison of the performance of SAMCoW and MPEG-1. Dashed lines: SAMCoW; solid lines: MPEG-1. The sequences are decoded at 6, 4, 2, 1.5, and 1 Mbits/s, which, respectively, correspond to the lines from top to bottom.

TABLE I. PSNR OF CIF SEQUENCES, AVERAGED OVER A GOP (30 FRAMES/s)


Fig. 13. Frame 35 (P frame) of the football sequence (CIF, 30 frames/s). The data rate is 1.5 Mbits/s. (a) Original. (b) SAMCoW. (c) MPEG-1. (d) Taubman and Zakhor.

V. EXPERIMENTAL RESULTS AND DISCUSSION

Throughout this paper, we use the term "visual quality" of a video sequence (or an image) to mean the fidelity, or closeness, of the decoded video sequence (or image) to the original when perceived by a viewer. We believe that there does not exist an easily computable metric that will accurately predict how a human observer will perceive a decompressed video sequence. In this paper, we will use the peak signal-to-noise ratio (PSNR) based on mean-square error as our "quality" measure. We feel that this measure, while unsatisfactory, does track "quality" in some sense. The PSNR of a color component $C$ is obtained by $\mathrm{PSNR}_C = 10 \log_{10}(255^2 / \mathrm{MSE}_C)$, where $\mathrm{MSE}_C$ is the mean-square error of $C$. When necessary, the overall or combined PSNR is obtained by pooling the mean-square errors of the three color components.
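For reference, the PSNR computation can be sketched as follows; the 4:1:1-weighted pooling in combined_psnr() is an assumption for the "combined" measure, not necessarily the exact pooling used in the paper.

```python
import numpy as np

def psnr(original, decoded):
    # PSNR of one 8-bit color component: 10 * log10(255^2 / MSE).
    mse = np.mean((original.astype(float) - decoded.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def combined_psnr(orig_yuv, dec_yuv):
    # Combined PSNR from the pooled MSE of (Y, U, V). The 4:1:1 weights
    # below reflect the relative sample counts under 4:1:1 subsampling;
    # the exact pooling used in the paper is an assumption here.
    mses = [np.mean((o.astype(float) - d.astype(float)) ** 2)
            for o, d in zip(orig_yuv, dec_yuv)]
    pooled = (4.0 * mses[0] + mses[1] + mses[2]) / 6.0
    return 10.0 * np.log10(255.0 ** 2 / pooled)
```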

The effectiveness of using AMC is shown in Fig. 9. From the figure, we can see that the non-AMC algorithm works better at the highest data rate, to which the encoder feedback loop is locked. However, at any other data rate, the PSNR performance of the non-AMC algorithm declines very rapidly, while the error propagation is eliminated in the AMC algorithm. With AMC, data rate scalability can be achieved and video quality can be kept relatively constant even at a low data rate. One should note that the AMC scheme can be incorporated into any motion compensated rate scalable algorithm, no matter what kind of transform is used for the encoding of the I frames and PEF's.

In our experiments, two types of video sequences are used. One type is a CIF (352 × 240) sequence at 30 frames/s. The other is a QCIF (176 × 144) sequence at 10 or 15 frames/s.¹

The CIF sequences are decompressed using SAMCoW at data rates of 1, 1.5, 2, 4, and 6 Mbits/s. A representative frame decoded at the above rates is shown in Fig. 10. At 6 Mbits/s, the distortion is imperceptible. The decoded video has an acceptable quality when the data rate is 1 Mbits/s. We used Taubman and Zakhor's algorithm [15] and MPEG-1 to encode/decode the same sequences at the above data rates.²

¹ The original sequences along with the decoded sequences using SAMCoW are available at ftp://skynet.ecn.purdue.edu/pub/dist/delp/samcow.

² Taubman and Zakhor's software was obtained from the authors.


Fig. 14. Frame 78 (P frame) of the Akiyo sequence (left image: SAMCoW; right image: H.263) and frame 35 (P frame) of the Foreman sequence (left image: SAMCoW; right image: H.263), decoded at different data rates (QCIF, 10 frames/s). (a) 256 Kbps. (b) 128 Kbps. (c) 64 Kbps. (d) 32 Kbps. (e) 20 Kbps.

Since MPEG-1 is not scalable, the sequences were specifically compressed and decompressed at each of the above data rates. The overall PSNR's of each frame in a GOP are shown in Figs. 11 and 12. The computational rate–distortion in terms of average PSNR over a GOP is shown in Table I. The data indicate that SAMCoW has very comparable performance to the other methods tested. A comparison of decoded image quality using SAMCoW, Taubman and Zakhor's algorithm, and MPEG-1 is shown in Fig. 13. We can see that SAMCoW outperforms Taubman and Zakhor's algorithm, visually and in terms of PSNR.


Fig. 15. Comparison of the performance of SAMCoW and H.263 (QCIF at 15 frames/s). Dashed lines: SAMCoW; solid lines: H.263. The sequences are decoded at 256, 128, 64, 32, and 20 Kbits/s, which, respectively, correspond to the lines from top to bottom.

Even though SAMCoW does not perform as well as MPEG-1 in terms of PSNR, subjective experiments have shown that our algorithm produces decoded video with visual quality comparable to MPEG-1 at every tested data rate.

The QCIF sequences are compressed and decompressed using SAMCoW at data rates of 20, 32, 64, 128, and 256 Kbits/s. The same set of sequences is compressed using the H.263 algorithm at the above data rates.³ Decoded images using SAMCoW at different data rates, along with those using H.263, are shown in Fig. 14. The overall PSNR's of each frame in a GOP are shown in Figs. 15 and 16. The computational rate–distortion in terms of average PSNR over a GOP is shown in Tables II and III. Our subjective experiments have shown that at data rates greater than 32 Kbits/s, SAMCoW performs similarly to H.263. Below 32 Kbits/s, when sequences with high motion are used, such as the Foreman sequence,

³ The H.263 software was obtained from ftp://bonde.nta.no/pub/tmn/software.

Fig. 16. Comparison of the performance of SAMCoW and H.263 (QCIF at 10 frames/s). Dashed lines: SAMCoW; solid lines: H.263. The sequences are decoded at 256, 128, 64, 32, and 20 Kbits/s, which, respectively, correspond to the lines from top to bottom.

TABLE II. PSNR OF QCIF SEQUENCES, AVERAGED OVER A GOP (15 FRAMES/s)

our algorithm is visually inferior to H.263. This is partially due to the fact that the algorithm cannot treat "active" and "quiet" regions differently, other than through the zerotree coding. At low data rates, a large proportion of the wavelet coefficients is quantized to zero and, hence, a large number of the bits are used to code zerotree roots, which does not contribute to distortion reduction. In contrast, H.263, using a block-based transform, is able to selectively allocate bits to different regions with different types of activity. It should be emphasized that the scalable nature of SAMCoW makes it very attractive in many low bit rate applications, e.g., streaming video on the Internet. Furthermore, the decoding data rate can be dynamically changed.

TABLE III. PSNR OF QCIF SEQUENCES, AVERAGED OVER A GOP (10 FRAMES/s)

VI. SUMMARY

In this paper, we proposed a hybrid video compression algorithm, SAMCoW, that provides continuous rate scalability. The novelty of our algorithm includes the following. First, an adaptive motion compensation scheme is used, which keeps the reference frames used in motion prediction at both the encoder and decoder identical at any data rate. Thus, error propagation can be eliminated, even at a low data rate. Second, we introduced a spatial orientation tree in our modified zerotree algorithm that uses not only the frequency bands, but also the color channels, to scan the wavelet coefficients. The interdependence between different color components in YC spaces is exploited. Our experimental results show that SAMCoW outperforms Taubman and Zakhor's 3-D subband rate scalable algorithm. In addition, our algorithm has a wide range of rate scalability. For medium to high data rate applications, it has performance comparable to the nonscalable MPEG-1 and MPEG-2 algorithms. Furthermore, it can be used for low bit rate applications with performance similar to H.263. The nature of SAMCoW allows the decoding data rate to be dynamically changed. Therefore, the algorithm is appealing for many network-oriented applications because it is able to adapt to the network loading.

REFERENCES

[1] ISO/IEC 13818-2, Generic Coding of Moving Pictures and Associated Audio Information, MPEG (Moving Pictures Expert Group), International Organization for Standardization, 1994 (MPEG-2 Video).

[2] B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2. New York: International Thomson, 1997.

[3] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445–3462, Dec. 1993.

[4] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun., vol. COM-31, pp. 532–540, Apr. 1983.

[5] H. M. Dreizen, "Content-driven progressive transmission of gray level images," IEEE Trans. Commun., vol. COM-35, pp. 289–296, Mar. 1987.

[6] Y. Huang, H. M. Dreizen, and N. P. Galatsanos, "Prioritized DCT for compression and progressive transmission of images," IEEE Trans. Image Processing, vol. 1, pp. 477–487, Oct. 1992.

[7] K. H. Tzou, "Progressive image transmission: A review and comparison," Opt. Eng., vol. 26, pp. 581–589, 1987.

[8] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, and Applications. New York: Academic, 1990.

[9] A. Said and W. A. Pearlman, "A new, fast, and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243–250, June 1996.

[10] B. Yazici, M. L. Comer, R. L. Kashyap, and E. J. Delp, "A tree structured Bayesian scalar quantizer for wavelet based image compression," in Proc. 1994 IEEE Int. Conf. Image Processing, vol. III, Austin, TX, Nov. 1994, pp. 339–342.

[11] C. S. Barreto and G. Mendonça, "Enhanced zerotree wavelet transform image coding exploiting similarities inside subbands," in Proc. IEEE Int. Conf. Image Processing, vol. II, Lausanne, Switzerland, Sept. 1996, pp. 549–552.

[12] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proc. IRE, vol. 40, pp. 1098–1101, Sept. 1952.

[13] I. Witten, R. Neal, and J. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, pp. 520–540, 1987.

[14] M. Nelson and J. Gailly, The Data Compression Book. M&T Books, 1996.

[15] D. Taubman and A. Zakhor, "Multirate 3-D subband coding of video," IEEE Trans. Image Processing, vol. 3, pp. 572–588, Sept. 1994.

[16] C. I. Podilchuk, N. S. Jayant, and N. Farvardin, "Three dimensional subband coding of video," IEEE Trans. Image Processing, vol. 4, pp. 125–138, Feb. 1995.

[17] Y. Chen and W. A. Pearlman, "Three dimensional subband coding of video using the zero-tree method," in Proc. SPIE Conf. Visual Commun. Image Processing, San Jose, CA, Mar. 1996.

[18] ISO/IEC 11172-2, Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s, MPEG (Moving Pictures Expert Group), International Organization for Standardization, 1993 (MPEG-1 Video).

[19] ITU-T, ITU-T Recommendation H.263: Video Coding for Low Bitrate Communication, International Telecommunication Union, 1996.

[20] M. Ohta and S. Nogaki, "Hybrid picture coding with wavelet transform and overlapped motion-compensated interframe prediction coding," IEEE Trans. Signal Processing, vol. 41, pp. 3416–3424, Dec. 1993.

[21] S. A. Martucci, I. Sodagar, T. Chiang, and Y.-Q. Zhang, "A zerotree wavelet video coder," IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 109–118, Feb. 1997.

[22] Q. Wang and M. Ghanbari, "Scalable coding of very high resolution video using the virtual zerotree," IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 719–727, Oct. 1997.

[23] B. J. Kim and W. A. Pearlman, "Low-delay embedded 3-D wavelet color video coding with SPIHT," in Proc. SPIE Conf. Visual Commun. Image Processing, San Jose, CA, Jan. 1998.

[24] B. J. Kim and W. A. Pearlman, "An embedded wavelet video coder using three dimensional set partitioning in hierarchical trees (SPIHT)," in Proc. IEEE Data Compression Conf., Snowbird, UT, Mar. 1997.

[25] M. L. Comer, K. Shen, and E. J. Delp, "Rate-scalable video coding using a zerotree wavelet approach," in Proc. 9th Image and Multidimensional Digital Signal Processing Workshop, Belize City, Belize, Mar. 1996, pp. 162–163.

[26] K. Shen and E. J. Delp, "A control scheme for a data rate scalable video codec," in Proc. IEEE Int. Conf. Image Processing, vol. II, Lausanne, Switzerland, Sept. 1996, pp. 69–72.

[27] H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Educ. Psychol., vol. 24, pp. 417–441, 498–520, 1933.

[28] C. B. Rubinstein and J. O. Limb, "Statistical dependence between components of a differentially quantized color signal," IEEE Trans. Commun. Technol., vol. COM-20, pp. 890–899, Oct. 1972.

[29] P. Pirsch and L. Stenger, "Statistical analysis and coding of color video signals," Acta Electronica, vol. 19, no. 4, pp. 277–287, 1976.

[30] A. N. Netravali and C. B. Rubinstein, "Luminance adaptive coding of chrominance signals," IEEE Trans. Commun., vol. COM-27, pp. 703–710, Apr. 1979.


[31] J. O. Limb and C. B. Rubinstein, "Plateau coding of the chrominance component of color picture signals," IEEE Trans. Commun., vol. COM-22, pp. 812–820, June 1974.

[32] K. Shen and E. J. Delp, "Color image compression using an embedded rate scalable approach," in Proc. IEEE Int. Conf. Image Processing, Santa Barbara, CA, Oct. 1997.

[33] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Trans. Image Processing, vol. 1, pp. 205–220, Apr. 1992.

[34] H. Watanabe and S. Singhal, "Windowed motion compensation," in Proc. SPIE Conf. Visual Commun. Image Processing, Boston, MA, Nov. 1991, pp. 582–589.

Ke Shen (S'93–M'97) was born in Shanghai, China. He received the B.S. degree in electronics engineering from Shanghai Jiao-Tong University, China, in 1990, the M.S. degree in biomedical engineering from the University of Texas Southwestern Medical Center at Dallas and the University of Texas at Arlington (jointly) in 1992, and the Ph.D. degree in electrical engineering from Purdue University in 1997.

From 1990 to 1991, he worked as a Research Assistant at the UT Southwestern Medical Center at Dallas, and from 1992 to 1997 he worked as a Research Assistant at the Video and Image Processing Laboratory, Purdue University. Since 1997, he has been with DiviCom Inc., Milpitas, CA. His research interests include image/video processing and compression, multimedia systems, and parallel processing.

Dr. Shen is a member of Eta Kappa Nu.

Edward J. Delp (S'70–M'79–SM'86–F'97) was born in Cincinnati, OH. He received the B.S.E.E. (cum laude) and M.S. degrees from the University of Cincinnati and the Ph.D. degree from Purdue University.

From 1980 to 1984, he was with the Department of Electrical and Computer Engineering at The University of Michigan, Ann Arbor, MI. Since August 1984, he has been with the School of Electrical Engineering at Purdue University, West Lafayette, IN, where he is a Professor of Electrical Engineering. His research interests include image and video compression, medical imaging, parallel processing, multimedia systems, ill-posed inverse problems in computational vision, nonlinear filtering using mathematical morphology, and communication and information theory.

Dr. Delp is a member of Tau Beta Pi, Eta Kappa Nu, Phi Kappa Phi, Sigma Xi, ACM, and the Pattern Recognition Society. He is a Fellow of the SPIE and a Fellow of the IS&T. In 1997, he was elected Chair of the Image and Multidimensional Signal Processing Technical Committee of the IEEE Signal Processing Society. From 1994 to 1998, he was Vice-President for Publications of IS&T. He was the General Co-Chair of the 1997 Visual Communications and Image Processing Conference (VCIP). He was Program Chair of the IEEE Signal Processing Society Ninth IMDSP Workshop, March 1996. From 1984 to 1991, he was a member of the editorial board of the International Journal of Cardiac Imaging. From 1991 to 1993, he was an Associate Editor of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. Since 1992, he has been a member of the editorial board of Pattern Recognition. In the fall of 1994, Dr. Delp was appointed Associate Editor of the Journal of Electronic Imaging. From 1996 to 1998, he was Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING. In 1990, he received the Honeywell Award and, in 1992, the D. D. Ewing Award, both for excellence in teaching.