
VLSI Architectures for Video Compression—A Survey PETER PIRSCH, SENIOR MEMBER, IEEE, NICOLAS DEMASSIEUX, AND WINFRIED GEHRKE

Invited Paper

The paper presents an overview of architectures for VLSI implementations of video compression schemes as specified by standardization committees of the ITU and ISO. VLSI implementation strategies are discussed and split into function specific and programmable architectures. As examples for the function oriented approach, alternative architectures for DCT and block matching will be evaluated. Also dedicated decoder chips are included. Programmable video signal processors are classified and specified as homogeneous and heterogeneous processor architectures. Architectures are presented for reported design examples from the literature. Heterogeneous processors outperform homogeneous processors because of adaptation to the requirements of special subtasks by dedicated modules. The majority of heterogeneous processors incorporate dedicated modules for high performance subtasks of high regularity such as DCT and block matching. By normalization to a fictive 1.0 µm CMOS process, typical linear relationships between silicon area and throughput rate have been determined for the different architectural styles. This relationship indicates a figure of merit for silicon efficiency.

I. INTRODUCTION

Visual communications is a rapidly evolving field for telecommunications, computer and media industries. The progress in this field is supported by the availability of digital transmission channels and digital storage media. In addition to the available narrowband ISDN, new digital transmission channels such as broadband ISDN, digital satellite channels and digital terrestrial TV broadcasting channels will be introduced shortly. Moreover, personal computers and workstations have become important platforms for multimedia interactive applications. Communication based applications include ISDN videophone, videoconference systems, digital broadcast TV/HDTV and remote surveillance. Storage based audiovisual applications include training, education, entertainment, advertising, video mail and document annotation.

Manuscript received June 30, 1994; revised October 4, 1994.
P. Pirsch and W. Gehrke are with the University of Hannover, 30167 Hannover, Germany.
N. Demassieux is with the École Nationale Supérieure des Télécommunications, 75643 Paris, France.
IEEE Log Number 9406979.

Essential for the introduction of new communication services is low cost. Recent developments in the technology of digital storage media, image displays, desktop computing and communication networks have made digital video and audio economically viable for the envisaged applications. In order to reduce transmission and storage cost, bit rate compression schemes are employed. Bit rate reduction can be achieved by source coding schemes such as predictive coding, transform coding, subband coding and interpolative coding [2], [5], [6]. To achieve the requirements on bit rate reduction and remaining picture quality, advanced coding schemes such as combinations of the basic schemes are required.

To facilitate worldwide interchange of digitally encoded audiovisual data there is a demand for international standards for the coding methods and transmission formats. International standardization committees have been working on the specification of several compression algorithms. The Joint Photographic Experts Group (JPEG) of the International Standards Organization (ISO) has specified an algorithm for compression of still images [7]. The ITU (formerly CCITT) proposed the H.261 standard for video telephony and video conference [1]. The Motion Pictures Experts Group (MPEG) of ISO has completed its first standard, MPEG-1, which will be used for interactive video and provides a picture quality comparable to VCR quality [3]. MPEG made substantial progress for the second phase of standards, MPEG-2, which will provide audiovisual quality of both broadcast TV and HDTV. Because of the wide field of applications, MPEG-2 is a family of standards with different profiles and levels [4].

The envisaged mass application of the discussed video communication services calls for coding equipment of low manufacturing cost and small size. The source coding schemes recommended by the international standardization groups are very sophisticated in order to achieve high bit rate reduction under the constraint of the highest possible picture quality. These coding schemes will result in hardware systems of high complexity. Thus cost effective implementation of these high complexity systems relies on


0018-9219/95$04.00 © 1995 IEEE

PROCEEDINGS OF THE IEEE, VOL. 83, NO. 2, FEBRUARY 1995



Table 1 Image Formats

Video Format   Luminance Frame Size   Frame Rate   Video Source Rate   Frame Store
HDTV           1920*1250              50 Hz        1.9 Gb/s            19 Mb
CCIR 601       720*576                25 Hz        166 Mb/s            3.3 Mb
CIF            352*288                30 Hz        36 Mb/s             0.8 Mb
QCIF           176*144                30 Hz        9 Mb/s              0.2 Mb

VLSI. Recent advances in VLSI technology relieve the hardware problems, but the high demands of video coding require special architectural approaches adapted to the video signal processing schemes. This paper provides characteristics of application specific integrated components for video coding and discusses the applied architectural structures.

Before examining VLSI architectures, Section II will briefly describe the main schemes for video coding. Then Section III will provide hardware related characteristics of these algorithms and general strategies for mapping algorithms onto architectures. Implementation alternatives can be split into function specific and programmable. The first offers high silicon efficiency while the second offers high flexibility. Section IV will focus on function specific (hardwired) architectures and discuss in particular examples for implementations of DCT and block matching. Based on reported design examples, programmable architectures will be presented in Section V.

II. VIDEO COMPRESSION SCHEMES

A. Image Compression Standards

Representation of images according to the information content is performed by source coding schemes. Bitrate reduction is possible because of redundant and irrelevant information in the video signal. Any information which can be extracted using the statistical dependencies between the picture elements (pels) is redundant and can be removed. Any information below a specific picture quality threshold is not relevant to the receiver and need not be transmitted. In case of a human observer, the visual properties of the human eye determine irrelevant information.

Redundancy reduction is performed by transformation of images to other representations with reduced statistical dependencies. This can be performed by decorrelation with transform methods such as discrete cosine transform (DCT) or predictive coding. In most cases irrelevancy reduction is achieved by quantization adapted to visual properties. In the literature a large variety of image compression schemes are reported. The interested reader is referred to [2], [5], [6].

Specific image coding schemes are dependent on the application. The requirements on picture quality and the characteristics of communication channels and storage media have a strong influence on the applied scheme. As an example, TV distribution has a preference for high picture quality whereas video phone has a preference for world wide

Table 2 Video Coding Standards

Standard   Typical Application                       Typical Image Format         Coded Bit Rate
JPEG       Photo-CD, Photovideotext                  Any size, 8 b/pel            0.25 ... 2.25 b/pel
H.261      Video phone, video conference             QCIF, CIF, 10 Hz ... 30 Hz   p x 64 kb/s, 1 <= p <= 30
MPEG-1     CD-ROM, CD-I, computer applications       SIF, 25 Hz / 30 Hz           1.2 Mb/s
MPEG-2     Video distribution, video contribution    CCIR 601                     4 ... 9 Mb/s

communications with standardized low bit rate channels. International committees have been working for several years on worldwide image coding standards. According to the requirements of the applications, different image formats and coding schemes have been specified. The major characteristics of image formats and coding standards are listed in Table 1 and Table 2. These tables are not complete; they should just give some indication of image formats, source rates, application fields and coded bit rates. A large range for the source rates of digitized video and coded bit rates can be recognized. There are coding schemes for still pictures (JPEG), noninterlaced video sequences (H.261, MPEG-1) and interlaced video sequences (MPEG-2). The coding schemes have a large variety of parameters in order to cover the requirements of a wide range of applications. In particular, the MPEG-2 standard is a generic approach with different profiles and levels. In the following, the basics of the JPEG and MPEG coding schemes will be explained.

B. Still Picture Coding

The ISO JPEG specified coding schemes for still picture coding [7]. The basic scheme is adaptive DCT coding, which will be further treated. In addition, a hierarchical scheme for progressive transmission and a lossless coding scheme (without quantization effects) have been defined.

In case of adaptive DCT coding, the image is divided into blocks of fixed size (typically 8 x 8 pels). Each block is then transformed into a frequency space. The transform most commonly used for this operation is the two-dimensional (2D) DCT [9]. The DCT is a real-valued transform similar to the Discrete Fourier Transform. The DCT transforms a block of image data to the same number of coefficients, where each coefficient is assigned a basis image with different frequency content. Coefficients with high index number are assigned to basis images with high frequencies. The properties of the DCT are:

1) the coefficients are highly decorrelated, and
2) most of the information content is in the low-frequency coefficients.

These properties support both the reduction of redundant and irrelevant information. Because the sensitivity of the human eye is reduced for high spatial frequencies, transform coefficients with high index number can be



Fig. 1. Baseline JPEG coder-decoder.

quantized more coarsely. If the amplitudes are below the smallest quantizer threshold they will be set to zero. A specified average transmission bitrate can be adjusted by the quantizer characteristic, but this will also affect the picture quality. For transmission, the 2D array of coefficients is reordered into a 1D stream by following a zigzag route starting from the DC coefficient. In this manner, the coefficients are roughly arranged in order of ascending frequency. A run length coding (RLC) allows the long chains of “0’s” that occur in this stream to be represented efficiently. Finally, a variable length coding (VLC) is used. This method, also called entropy coding, is based on the unequal probability of the different data of the stream and further reduces the transmission rate.

The coded video data are then multiplexed with some service information and the resulting bit-stream is transmitted. The decoder performs the reverse operations: inverse VLC, inverse quantization and inverse DCT. The block diagram of a JPEG encoder-decoder is depicted in Fig. 1.
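The transform and reordering steps just described can be sketched in a few lines of Python. The direct (non-fast) DCT form, the absence of a quantizer stage, and all function names below are illustrative assumptions, not taken from the JPEG standard.

```python
# Sketch of the baseline JPEG path described above: 2D DCT of an 8x8
# block, zigzag reordering into a 1D stream, and run-length coding of
# the zero runs. The direct O(N^4) transform is for clarity only;
# practical coders use fast separable DCT algorithms.
import math

N = 8

def dct_2d(block):
    """Direct 2D DCT-II of an N x N block of pel values."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def zigzag(coeffs):
    """Reorder the 2D coefficient array into a 1D stream of ascending frequency."""
    # Within each anti-diagonal u+v the scan direction alternates.
    order = sorted(((u, v) for u in range(N) for v in range(N)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [coeffs[u][v] for u, v in order]

def run_length(stream):
    """Encode (run_of_zeros, value) pairs -- the RLC step before entropy coding."""
    pairs, run = [], 0
    for v in stream:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((run, 0))  # end-of-block marker
    return pairs
```

For a flat block, all energy lands in the DC coefficient, so the zigzag stream is one value followed by a long zero run that the RLC collapses into a single pair.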

C. Coding of Image Sequences

The JPEG coding scheme presented above could in principle also be used for coding of image sequences, sometimes described as motion JPEG. This intraframe coding is not very efficient because the redundancy between successive frames is not exploited. The redundancy between succeeding frames can be reduced by predictive coding. The simplest predictive coding is differential interframe coding, where the difference between a current pel of the present frame and the corresponding pel of the previous frame is quantized, coded and transmitted. To perform interframe prediction a frame memory is required. Higher efficiency than the simple differential interframe coding can be achieved by a combination of DCT and interframe prediction. Hereby the interframe difference is, similar to JPEG, DCT coded and transmitted. This kind of scheme is often described as hybrid coding. In order to have the same prediction at both the receiver and transmitter, the decoder must always be incorporated into the coder. This results in a special feedback structure at the transmitter which avoids coder-decoder divergence.

Variable word length coding results in a variable bit rate which depends on image content, scene changes, etc. Transmission of the coded information over a constant rate


Fig. 2. Motion estimation by block matching within a search area of (N+2w)^2 pels.

channel requires a FIFO buffer at the output to smooth the data rate. The average video rate has to be adjusted to the constant channel rate. This is performed by quantizer control according to the buffer content. If the buffer is nearly full, the quantization is made more severe and thus the coded bitrate is reduced. Conversely, if the buffer is nearly empty, the quantization is relaxed.
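One way to picture the buffer-driven control described above is a toy mapping from buffer occupancy to quantizer step. The buffer size, the step range, and the linear mapping are all hypothetical; real coders use standard-specific rate control rules.

```python
# Hedged sketch of FIFO-buffer quantizer control: the quantizer step
# grows as the buffer fills (coarser quantization, fewer bits) and
# relaxes as it drains. All constants are illustrative.
BUFFER_SIZE = 10000   # bits, hypothetical FIFO capacity
QP_MIN, QP_MAX = 1, 31

def quantizer_step(buffer_fill):
    """Map buffer occupancy (in bits) to a quantizer step size."""
    frac = min(max(buffer_fill / BUFFER_SIZE, 0.0), 1.0)  # clamp to [0, 1]
    return round(QP_MIN + frac * (QP_MAX - QP_MIN))
```

A nearly empty buffer yields the finest step, a nearly full buffer the coarsest, keeping the average coded rate matched to the constant channel rate.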

Prediction from frame-to-frame can be further improved if the motion of objects is taken into account. For simplification of implementation, motion estimation is generally performed by stepwise translation of objects. Spatiotemporal differential algorithms can be used to determine the displacement of objects on a pel-by-pel basis [38], [28]. For rigid objects the displacement is computable for the entire object with reduced mathematical complexity. However, the object shapes have to be determined. Because of difficulties in segmenting moving objects and the large overhead for specification of boundary lines, most video coding schemes apply a simple block matching scheme. In this case, for rectangular blocks of N x N pels, one displacement vector is determined [36]. In MPEG and H.261 the block size is 16 x 16 pels.

For each reference block in the current frame, the block matching algorithm identifies the block with the best match in the previous frame. The offset between both blocks specifies the displacement vector for motion compensated prediction. In most cases the mean absolute difference between all pels of the reference block and the pels of the block in the previous frame is used as matching criterion. The search area can be limited to a maximum displacement w if the maximum motion of objects is assumed to be limited (Fig. 2). The basic scheme is full search, where all (2w + 1)^2 candidate blocks in the search area are investigated. Because of the high expense of this exhaustive search, special search strategies with a reduced number of candidate blocks have been proposed [2], [32], [40]. Also hierarchical block matching schemes have been proposed for reduction of computational complexity by spatial decimation of pels for the matching criterion [26].
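A minimal full-search sketch of this matching criterion might look as follows. The frame representation as nested lists and the function name are illustrative, not taken from any codec.

```python
# Full-search block matching as described above: for an n x n reference
# block at (bx, by) in the current frame, all (2w+1)^2 candidate
# positions in the previous frame are scored with the sum of absolute
# differences (equivalent to minimizing the mean absolute difference).
def block_match(cur, prev, bx, by, n, w):
    """Return (dx, dy) minimizing the absolute-difference criterion over +/-w."""
    h, width = len(prev), len(prev[0])
    best, best_vec = float("inf"), (0, 0)
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            if not (0 <= by + dy and by + dy + n <= h
                    and 0 <= bx + dx and bx + dx + n <= width):
                continue  # candidate block falls outside the frame
            sad = sum(abs(cur[by + j][bx + i] - prev[by + dy + j][bx + dx + i])
                      for j in range(n) for i in range(n))
            if sad < best:
                best, best_vec = sad, (dx, dy)
    return best_vec
```

The inner sum over n*n pels, repeated for every candidate, is what makes exhaustive search so expensive and motivates the reduced and hierarchical search strategies cited above.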

A block diagram of a video encoder and decoder considering motion compensated prediction, DCT coding of the prediction error, variable length coding and quantization control by the buffer content is shown in Fig. 3. This diagram indicates motion estimation based on the original




Fig. 3. MPEG encoder-decoder.

input data. In general, for reduction of hardware complexity, the reconstructed image from the frame memory in the feedback loop is used for motion estimation. The block diagram in Fig. 3 specifies the basic structure of the H.261, MPEG-1, and MPEG-2 encoder-decoder [1], [3], [4]. The H.261 codec has in addition a special loop filter for reduction of quantization effects on prediction.

In digital storage media, special modes are required that differ from those for communication systems. These functions are random access, high speed search, and reverse playback. For picture quality improvements, motion compensated frame interpolation instead of frame dropping should be applied. The MPEG coding schemes consider this by a special predictive coding strategy. The coding starts with a frame which is not differentially coded: it is called an intra frame (I). Then prediction is performed for coding one frame out of every M frames. This allows a series of predicted frames (P) to be computed, while “skipping” M-1 frames between coded frames. Finally, the “skipped” frames are coded in either a forward prediction mode, backward prediction mode, or bidirectional prediction mode. These frames are called bidirectionally interpolated (B) frames. The most efficient prediction mode, in terms of bitrate, is determined by the encoder and its descriptor is associated with the coded data. Thus the decoder can perform the necessary operations in order to reconstruct the image sequence. Fig. 4 shows the temporal sequence of coding. The main difference between MPEG-1 and MPEG-2 is that MPEG-1 has been optimized for noninterlaced (progressive) format while MPEG-2 is a generic standard for both interlaced and progressive formats. Thus MPEG-2 includes more sophisticated prediction schemes considering field based modes.
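The I/P/B scheme above implies that frames must be transmitted out of display order, since a B frame needs both its past and future anchor frames decoded first. The following sketch of that reordering assumes a fixed anchor spacing M and ignores open-GOP and field-mode complications.

```python
# Sketch of MPEG frame reordering: with M = 3 the display order
# I B B P B B P ... becomes the coding/transmission order
# I P B B P B B ..., because each pair of B frames is deferred until
# the anchor (I or P) frame they predict from has been coded.
def coding_order(display, m):
    """Reorder display-order frame labels into coding order (anchor spacing m)."""
    out, pending_b = [], []
    for i, label in enumerate(display):
        if i % m == 0:               # anchor frame (I or P)
            out.append(label)
            out.extend(pending_b)    # B frames that waited on this anchor
            pending_b = []
        else:
            pending_b.append(label)  # B frame, defer until next anchor
    return out + pending_b           # simplification: trailing B frames flushed
```

This reordering is also why the decoder needs two anchor-frame stores before it can interpolate a B frame.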

To conclude about MPEG encoding, it is important to indicate that the MPEG standards only specify the syntax and the semantics of an MPEG bit-stream [4]. The bit-stream semantics refers to the decoding process, but encoders are not specified per se in the standard. An MPEG

Fig. 4. MPEG temporal predictive coding: forward prediction for P frames; forward, backward, and bidirectional prediction for B frames.

compliant encoder (resp. decoder) is an encoder (resp. decoder) that can produce (resp. read) an MPEG bitstream. This allows for enhancements in the performance of the encoders and decoders, by improving the algorithms, by developing suitable architectures and by exploiting advanced technologies.

III. FROM ALGORITHMS TO VLSI ARCHITECTURES

The envisaged mass application of the discussed video communication services calls for coding equipment of low manufacturing cost and small size. Manufacturing costs are dominated by the number of integrated circuits (chips), chip packaging and silicon area per chip. The number of chips can be kept small by using a large silicon area per chip; on the other hand, because of the defect density, production of very large area chips is not economic. Thus, high complexity systems in general require several chips for implementation. The cost for chip packaging is related to the number of pads. For this reason the system partitioning into chips has to consider the interconnects. By application of advanced semiconductor technologies, high complexity systems can be implemented with a moderate number of chips. The goal is to achieve a VLSI implementation with the smallest silicon area for a specified source rate. As listed in Table 1, the source rate of digital video signals has a large range according to the kind of application. It is obvious that the hardware structures for low source rates such as the QCIF format with 9 Mb/s are different from those for very high source rates such as HDTV with 1.9 Gb/s.

The required silicon area for VLSI implementation of algorithms is related to the required resources such as logic gates, memory, and the interconnect between the modules. The amount of logic depends on the concurrency of operations. A figure of merit for the required concurrency can be derived by

N_con,op = R_s · N_op,pel · T_op    (1)

with R_s as the source rate in pels per time unit, N_op,pel as the number of operations per pel and T_op as the average time for performing an operation. The number of operations per pel is an average value derived by counting all operations required for performing the coding scheme. The video coding schemes are periodically defined over a basic interval. In case of hybrid coding schemes as applied in MPEG, this interval is a macro block. Almost all tasks of the coding



scheme are defined on a macro block of 16 x 16 luminance pels and the associated chrominance pels. For this reason the counting has to be performed over one macro block. The number of operations depends on the specific algorithms. For example, block matching based on exhaustive search requires many more operations than reduced search strategies. The number of operations of a complete coding scheme (encoder and decoder) is in the order of 200 when applying block matching with 2D log search and a fast DCT algorithm. With presently available technologies, T_op is in the order of 20 ns. From this it follows that the number of concurrent operations ranges from 5 to 1000, depending on the video format.
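Equation (1) can be checked against the quoted range with a short calculation. The chroma factors used below (1.5 for 4:2:0 sampling, 2.0 for 4:2:2) are assumptions introduced here to turn luminance frame sizes into total pel rates consistent with the source rates of Table 1.

```python
# Worked instance of (1): N_con,op = R_s * N_op,pel * T_op, using the
# figures quoted in the text (about 200 operations per pel for the
# complete encoder/decoder scheme, T_op about 20 ns).
N_OP_PEL = 200   # operations per pel, complete coding scheme
T_OP = 20e-9     # average time per operation, seconds

def concurrent_ops(width, height, frame_rate, chroma_factor):
    """Required number of concurrent operations for a given video format."""
    r_s = width * height * frame_rate * chroma_factor  # source rate in pels/s
    return r_s * N_OP_PEL * T_OP

# QCIF at 30 Hz (4:2:0 assumed) yields about 5 concurrent operations;
# HDTV at 50 Hz (4:2:2 assumed) yields close to 1000 -- the range
# stated in the text.
```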

The required interconnects between the operative part and the memory depends highly on the access rate. Considering one large external memory for storing video data and intermediate results the number of parallel bus lines for connecting the operative part and the memory becomes approximately

N_bus = R_s · N_acc,pel · T_acc    (2)

where N_acc,pel specifies the mean number of accesses per pel and T_acc the memory access time. Because of 1- and 2-operand operations, N_acc,pel is larger than N_op,pel. For simplicity let us assume that N_acc,pel · T_acc is of the same order as N_op,pel · T_op. From this it follows that N_bus is of the same order as the concurrency figure of merit determined above. Thus the number of bus lines becomes very large. Taking into consideration that the access rate is mainly influenced by multiple accesses of image source data and intermediate results, the access rate to an external memory can be essentially reduced by assigning a local memory to the operative part. The size of the local memory depends on the specific access structure of the coding algorithm. Because of the periodicity over the macro block, the local memory is in the order of a macro block size.

Concurrency of operations can be achieved by architectures with parallel processing or pipelining. Since algorithms can be specified in a hierarchical manner, parallel processing and pipelining can also be defined hierarchically. Algorithms can be specified by tasks as given in Fig. 3. Accordingly, required tasks are block matching, prediction error determination, DCT, quantization, variable length coding and the associated inverse operations. As indicated in Fig. 3 the tasks have to be performed in a specific sequence. Assigning each task directly to a special processor and considering the specific processing sequence, a cascade of processors will be derived which can easily operate in pipeline by placing between the processors a memory of appropriate size. This kind of mapping the tasks onto the processors will be referred to as task distribution. By matching the architecture of the processor individually to a specific task, function specific architectures are derived. In this case the processor hardware can be utilized very efficiently by adapting the circuit architecture to the specific algorithm of the task. In particular, tasks like DCT, inverse DCT, and block matching have high computational requirements. These


computational requirements could be determined in terms of concurrency in operations by (1). These tasks are very regular with a predefined sequence of operations and data access. By refinement of the algorithm down to basic operations, techniques known from the literature can be applied for mapping regular algorithms onto architectures with extensive use of parallel processing and pipelining. Several alternatives for a function specific implementation of these tasks will be discussed in a following section.

By projection of the tasks in the direction of the processing sequence (see Fig. 5), each processor has to carry out all defined tasks in the specified sequence. According to this sequence the processor has to perform different operations. Because of the large variety in operations and data access, a programmable processor would be appropriate. Since the results of one task are required for the succeeding task, a local memory is required for storage of intermediate results. The required concurrency can be achieved by parallel operation of processors whereby a subsection of the image is assigned to each processor. Hereby the fact is exploited that image segments can be processed almost independently. The smallest segment for independent processing is, in case of the MPEG coding, a macro block. This kind of mapping tasks onto processors will be described as data distribution and is the basis of homogeneous programmable multiprocessors. In order to increase the silicon efficiency of programmable processors, the operational part has to be matched to the tasks. This can be achieved by special operational parts or by additional coprocessors for specific functions. This will result in heterogeneous programmable multiprocessors. More details on programmable multiprocessors will be presented in a following section.

A. Assessment of Architecture Alternatives

Applying the discussed mapping strategies leads to a wide variety of architectural solutions for the implementation of video compression schemes. For a comparison of these architecture alternatives an assessment measure considering architectural efficiency has to be introduced. In general, architectural efficiency can be defined as the ratio of performance and cost. Typically, the performance of an architecture can be equated with the reciprocal of the achieved effective processing time for one sample (1/T_p). The determination of cost for a specific architecture is more problematic, since the implementation costs are influenced by a wide range of interacting parameters, like silicon area, design style, architectural complexity, semiconductor process, pin count, etc. For a simplified first approach, cost can be expressed by the required silicon area for the implementation of a specific architecture. This leads to the well known AT-product for architecture assessment. In this case efficiency is defined by [94], [95]:

E = 1 / (A_si · T_p)    (3)

The throughput rate R_T of an architecture is proportional to the inverse of the effective processing time for one



pa6DICITON ERROR t--l ~ . Q I J " i l l Z A l l O N

VL ENCODING

Fig. 5. Functional space of the hybrid coder and mapping to multiprocessor systems exploiting data distribution and task distribution.

sample T_P:

R_T = 1 / T_P.    (4)

For an architecture with a specified fixed efficiency it follows from (3) that the silicon area is proportional to the through-put rate.

A_si = alpha_T * R_T.    (5)

The relation above is obvious for operational parts where the increase of through-put requires an increase of concurrency by parallel implementation of basic processing units.

Since the required silicon area and the processing time for an implementation of a specific application depend on the applied semiconductor technology, a realistic architectural assessment has to consider the gains provided by the progress in semiconductor technology. A sensible way to achieve a realistic assessment is the normalization of the architecture parameters to a reference technology. In the following sections we assume a reference process with a grid length lambda_0 = 1.0 um. Assuming the "constant field" model, the processing time of a given process with grid length lambda scales as lambda/lambda_0 [99]. This model supposes a linear scaling of voltages and doping concentrations. In the "constant voltage" model of scaling of MOS transistors, the gate delay scales as (lambda/lambda_0)^2. These two models do not take into account the long distance interconnection delay (which basically does not scale [93]), the short channel effects, and the limitations due to power dissipation. Reality is not well described by the traditional models of scaling (the supply voltage, at least for function specific architectures reported up to now, has not been reduced linearly with lambda, and chip speed has to some extent been limited by power issues). Thus it is reasonable to use an empirical measure of processor speed versus technology. Reference [93] gives a comprehensive list of microprocessors, with their clock frequency and their gate length lambda. From these data, it can be estimated that the cycle time of these processors, which results from a combination of all the factors mentioned above, scales approximately as (lambda/lambda_0)^1.6. Thus for comparison of architecture alternatives we use in this paper a normalization by (lambda/lambda_0)^2 for the silicon area and (lambda_0/lambda)^1.6 for the through-put.

A_si = A_si,0 * (lambda/lambda_0)^2,    R_T = R_T,0 * (lambda_0/lambda)^1.6    (6)

where the index 0 is used for the system with reference grid length lambda_0. Combining (6) with (5), the silicon area becomes

A_si = alpha_T,0 * (lambda/lambda_0)^3.6 * R_T.    (7)

Many processing units are composed of an operational part and a memory part which is not affected by the through-put rate. In case of the DCT this is the transposition memory; for block matching it is the memory for the reference block and the search area. The silicon area of a memory part is proportional to the memory capacity C_M. Taking a memory part into consideration, (7) has to be modified to

As; = ~ T , O . RT . (i) 3'6.

As a result it follows that for a given technology there is a linear relation between silicon area and through-put rate. Down scaling affects the silicon area of the operational part by a power of 3.6, but that of the memory part by a power of 2. Because of the dependency between through-put rate R_T and computational rate R_C, a relationship similar to (8) can be specified for the silicon area as a function of the computational rate.
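In practice the normalization discussed above can be applied directly to reported design data. A minimal sketch follows; the 0.8 um design figures are invented purely for illustration:

```python
# Sketch of the technology normalization of (6)-(8): silicon area scales as
# (lambda/lambda0)^2 and through-put as (lambda0/lambda)^1.6.

LAMBDA0 = 1.0  # reference grid length, micrometers

def normalize(area_mm2, throughput, lam):
    """Normalize a design's silicon area and through-put to the 1.0 um process."""
    area_norm = area_mm2 * (LAMBDA0 / lam) ** 2      # area grows when scaled up
    rate_norm = throughput * (lam / LAMBDA0) ** 1.6  # speed drops when scaled up
    return area_norm, rate_norm

# Hypothetical 0.8 um design: 64 mm^2 of silicon running at 40 Mpel/s.
area0, rate0 = normalize(64.0, 40.0, 0.8)
alpha_t0 = area0 / rate0  # figure of merit alpha_T,0 in mm^2 per Mpel/s
```

With both quantities referred to lambda_0, designs fabricated in different processes can be placed on a single A_si versus R_T plot, as is done in Figs. 10 and 20.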

It should be noted that the criterion (3) neglects several essential properties of a specific implementation, e.g., design style (full-custom, semi-custom), power dissipation, semiconductor yield, and architecture flexibility. For this reason, several extensions to the AT-product have been proposed [96]-[98].

IV. FUNCTION SPECIFIC ARCHITECTURES

Considering available technologies and the computational requirements of video encoders and decoders, the use of dedicated - or function specific - implementations is often mandatory. For high volume consumer

PIRSCH et al.: VLSI ARCHITECTURES FOR VIDEO COMPRESSION-A SURVEY 225


VMA: Vector Merging Adder; TM: Transposition Memory

Fig. 6. Separated DCT implementation according to [23].

products, the silicon area optimization brought by dedicated architectures, compared to programmable architectures, leads to lower production cost.

In this section, we develop two examples of function specific architectures: the DCT computation and the motion estimation. In each case, we describe the algorithms, review different architectures, and compare their performance in terms of speed and silicon area. Finally, a panorama of previously reported function specific chip-sets is established.

A. DCT

The discrete cosine transform (DCT) is a real-valued frequency transform similar to the discrete Fourier transform (DFT). The DCT was first proposed by Ahmed et al. [9]. A recent book [85] provides an extensive introduction and an in-depth analysis of the properties, the various algorithms, and the applications of the DCT. Our goal in this chapter is to focus on the implementation issues of the DCT. When applied to an image block of size L x L, the two-dimensional DCT (2D-DCT) can be expressed as follows:

Y_k,l = (2/L) * c_k * c_l * sum_{i=0}^{L-1} sum_{j=0}^{L-1} x_i,j * cos[(2i+1)k*pi / 2L] * cos[(2j+1)l*pi / 2L]    (9)

for k, l = 0, 1, ..., L-1, where c_k = 1/sqrt(2) for k = 0 and c_k = 1 otherwise (and likewise for c_l), and with

(i, j)   coordinates of the pixels in the initial block;
(k, l)   coordinates of the coefficients in the transformed block;
x_i,j    value of the pixel in the initial block;
Y_k,l    value of the coefficient in the transformed block.

Computing a 2D DCT of length L directly from (9) requires L^4 multiplications and additions. The implementation [21] is an example of a 64-tap filter that has roughly the complexity of a 2D DCT in the direct form.

Fig. 7. B. G. Lee's fast flowgraph for the 8-point DCT [14].

An important property for the VLSI implementation of the 2D-DCT is its separability. It is possible to calculate it by performing successively L one-dimensional DCT's (1D-DCT's) on the rows and then L 1D-DCT's on the resulting columns. An L-point 1D-DCT can be expressed as follows:

Y_k = sqrt(2/L) * c_k * sum_{i=0}^{L-1} x_i * cos[(2i+1)k*pi / 2L]    for k = 0, 1, ..., L-1    (10)

where c_k = 1/sqrt(2) for k = 0 and c_k = 1 otherwise.

Computing separate horizontal and vertical 1D-DCT's of length L directly from (10) requires 2L^3 multiplications and additions. Most implementations exploit this property for reducing the computational cost compared to a direct 2D implementation. Several realizations applying a direct implementation of the separable DCT have been reported [23], [13], [18].
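The complexity gap can be illustrated numerically. The following sketch (orthonormal scale factors assumed, as in (9) and (10)) computes the 2D-DCT both in the O(L^4) direct form and in the separable row-column form; both yield the same coefficients:

```python
import math

def dct1d(x):
    """L-point 1D DCT following (10)."""
    L = len(x)
    y = []
    for k in range(L):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        s = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                for i in range(L))
        y.append(math.sqrt(2.0 / L) * ck * s)
    return y

def dct2d_separable(block):
    """Row-column 2D DCT: L 1D-DCT's on the rows, then L on the columns."""
    rows = [dct1d(r) for r in block]
    cols = [dct1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dct2d_direct(block):
    """Direct 2D DCT from (9): O(L^4) multiply-accumulates."""
    L = len(block)
    out = [[0.0] * L for _ in range(L)]
    for k in range(L):
        for l in range(L):
            ck = 1 / math.sqrt(2) if k == 0 else 1.0
            cl = 1 / math.sqrt(2) if l == 0 else 1.0
            s = sum(block[i][j]
                    * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                    * math.cos((2 * j + 1) * l * math.pi / (2 * L))
                    for i in range(L) for j in range(L))
            out[k][l] = (2.0 / L) * ck * cl * s
    return out

blk = [[(3 * i + 5 * j) % 11 for j in range(8)] for i in range(8)]
fast = dct2d_separable(blk)
```

For L = 8 the direct form needs 4096 multiply-accumulates per block against 1024 (2L^3) for the row-column form.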

As an example, the DCT implementation according to [23] is depicted in Fig. 6. This architecture is based on two one-dimensional processing arrays. Since this architecture is based on a pipelined multiplier/accumulator implementation in carry-save technique, vector merging adders are located at the output of each array. The results of the 1D-DCT have to be reordered for the second 1D-DCT stage. For this purpose a transposition memory is used. Since both 1D processor arrays require identical DCT coefficients, these coefficients are stored in a common ROM.

Moving from a mathematical definition to an algorithm which minimizes the number of calculations required is a problem of particular interest in the case of transforms like the DCT. The 1D-DCT can also be expressed by the matrix-vector product

[Y] = [C] * [X]    (11)

where [C] is an L x L matrix and [X] and [Y] L-point input and output vectors. As an example, with theta = pi/16,


Fig. 8. Architecture of an M-input inner product using distributed arithmetic.

the 8-point DCT matrix can be computed as denoted in (12).

As a first step, it is well known that, for even L, an L-point DCT can be decomposed into two (L/2) x (L/2) matrix-vector products. Thus the computation cost is reduced. As an example, an 8-point DCT can be computed in two separate parts, as shown in (13) and (14).

More generally, the matrices in (13) and (14) can be decomposed into a number of simpler matrices, the composition of which can be expressed as a flowgraph. Many fast algorithms have been proposed. Fig. 7 illustrates the flowgraph of B. G. Lee's algorithm, which is commonly used [14]. Several implementations using fast flowgraphs have been reported [10], [12].
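The even/odd decomposition can be checked numerically. The sketch below uses the un-normalized DCT kernel (i.e., without the c_k scale factors) and splits an 8-point DCT into a butterfly stage followed by two 4 x 4 matrix-vector products, one producing the even and one the odd coefficients:

```python
import math

def dct_unnorm(x):
    """Un-normalized L-point DCT: Y_k = sum_i x_i cos((2i+1)k*pi/(2L))."""
    L = len(x)
    return [sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * L))
                for i in range(L))
            for k in range(L)]

def dct8_even_odd(x):
    """8-point DCT via a butterfly and two 4x4 products (cf. (13), (14))."""
    L = len(x)
    u = [x[i] + x[L - 1 - i] for i in range(L // 2)]  # sums -> even coefficients
    v = [x[i] - x[L - 1 - i] for i in range(L // 2)]  # differences -> odd
    even = dct_unnorm(u)  # a 4-point DCT, i.e., one 4x4 matrix-vector product
    odd = [sum(v[i] * math.cos((2 * i + 1) * (2 * k + 1) * math.pi / (2 * L))
               for i in range(L // 2))
           for k in range(L // 2)]
    y = [0.0] * L
    y[0::2] = even
    y[1::2] = odd
    return y
```

Each 4 x 4 product costs 16 multiplications, so the two halves need 32 instead of the 64 of the full 8 x 8 product, at the price of 8 extra additions/subtractions in the butterfly.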

Another approach which has been extensively used is based on the technique of distributed arithmetic. Distributed arithmetic is an efficient way to compute the DCT totally or partially as scalar products. To illustrate the approach, let us compute a scalar product between two length-M vectors C and X:

Y = sum_{i=0}^{M-1} c_i * x_i    with    x_i = -x_i,0 + sum_{j=1}^{B-1} x_i,j * 2^{-j}    (15)

where {c_i} are N-bit constants and {x_i} are coded in B bits in 2's complement. Then (15) can be rewritten as

Y = -C_0 + sum_{j=1}^{B-1} C_j * 2^{-j}    (16)

with

C_j = sum_{i=0}^{M-1} c_i * x_i,j.    (17)

The change of summing order in i and in j characterizes the distributed arithmetic scheme, in which the initial multiplications are distributed to another computation pattern. Since the term C_j has only 2^M possible values (which depend on the x_i,j values), it is possible to store these 2^M possible values in a ROM. An input set of M bits {x_0,j, x_1,j, x_2,j, ..., x_M-1,j} is used as an address, allowing to retrieve the C_j values. These intermediate results are accumulated in B clock cycles for producing one Y value. Fig. 8 shows a typical architecture for the computation of an M-input inner product. The inverter and the MUX are used for inverting the final output of the ROM in order to account for the negative weight of C_0.
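A bit-level sketch of this scheme follows; the coefficient values are made up, inputs are quantized to B-bit two's-complement fractions by a small helper, and the 2^M possible values of C_j are precomputed in a "ROM" table that is accumulated over B cycles with the sign-bit term subtracted, as in (16):

```python
M, B = 4, 8                    # vector length and word length
c = [0.25, -0.5, 0.75, 0.125]  # fixed coefficients (example values only)

# ROM: for each M-bit address, the sum of the coefficients whose bit is set.
rom = [sum(c[i] for i in range(M) if addr & (1 << i)) for addr in range(2 ** M)]

def da_scalar_product(x_bits):
    """x_bits[i][j] is bit j of input i; bit 0 is the sign bit."""
    acc = 0.0
    for j in range(B - 1, 1 - 1, -1):
        if j == 0:
            break
        addr = sum(x_bits[i][j] << i for i in range(M))
        acc += rom[addr] * 2.0 ** (-j)        # accumulate C_j * 2^-j
    addr = sum(x_bits[i][0] << i for i in range(M))
    acc -= rom[addr]                           # subtract C_0 (the MUX/inverter)
    return acc

def to_bits(v):
    """Quantize v in [-1, 1) to B-bit two's complement, as a bit list."""
    q = max(-2 ** (B - 1), min(2 ** (B - 1) - 1, round(v * 2 ** (B - 1))))
    u = q & (2 ** B - 1)                       # two's-complement encoding
    return [(u >> (B - 1 - j)) & 1 for j in range(B)]

x = [0.5, -0.25, 0.625, -0.875]
y = da_scalar_product([to_bits(v) for v in x])  # equals sum(c_i * x_i)
```

For M = 4 the ROM holds only 16 words, and the multiplications have disappeared entirely, which is what Table 3 reflects for the D.A. approaches.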

Fig. 9 illustrates two typical uses of distributed arithmetic for computing a DCT. Fig. 9(a) implements the scalar products described by the matrix of (12). Fig. 9(b) takes advantage of a first stage of additions and subtractions and the scalar products described by the matrices of (13) and (14).

The implementation of a 2D DCT requires 2L 1D-DCT's of length L. The distributed arithmetic requires only additions and ROM storage. Several implementations using distributed arithmetic have been reported, either in the "pure" form [15], [19] or in "mixed" D.A./flowgraph form [11], [20], [21].

Many other algorithms have been proposed for computing the DCT: vector-radix algorithms, Winograd-DCT, rotations, CORDIC, prime factor algorithms, convolutional structures, polynomial transforms. A classification of these approaches can be found in [15]. Since, to our knowledge, none of them has yet been implemented, these algorithms are outside the scope of this paper.

1) Architecture Comparison: The algorithms presented above differ in several respects: the computation requirements (number of multiplications and additions) and the storage requirements (size of the ROM for the distributed arithmetic, size of the RAM for the transposition memory). Table 3 details these requirements.

Table 3 provides a first idea of the relative complexity required for implementing the different proposed algorithms. Unfortunately, several factors blur the panorama:


Table 3 Computation and Storage Requirements for Different DCT Algorithms (L x L DCT). Multiplications, additions, and ROM space are expressed in 16-b words.

Per L x L block:

  Algorithm            Multiplications   Additions        ROM words    RAM words (transposition)
  Direct computation   L^4               L^4              -            -
  With separability    2L^3              2L^3             -            L^2
  Fast algorithm       ~L^2 log2 L       ~2L^2 log2 L     -            L^2
  Pure D.A.            -                 32L^2            2*L*2^L      L^2
  Mixed D.A./F.G.      -                 34L^2            L*2^(L/2)    L^2

Per pixel:

  Direct computation   L^2               L^2              -            -
  With separability    2L                2L               -            L^2
  Fast algorithm       log2 L            2 log2 L         -            L^2
  Pure D.A.            -                 32               2*L*2^L      L^2
  Mixed D.A./F.G.      -                 34               L*2^(L/2)    L^2

- The architectural effort: for implementing each of these algorithms, the designer can use a number of different architectural approaches (pipelining, bit-serial computation, hardware multiplexing, bus sharing, ...).
- The through-put constraint: depending on the target through-put, some architectures can be more interesting than others. As an example, it is more difficult to accelerate ROM-based designs: the through-put of the distributed arithmetic is limited by the access time of a ROM.
- The regularity of the algorithm: irregular architectures tend to generate a wiring overhead. In this respect, the distributed arithmetic and the direct implementations of scalar products are generally more regular than the flowgraph based architectures.
- The technology: depending on the relative cost of the storage (ROM) and the arithmetic operations (full adders), the choice between flowgraph or scalar product based architectures and distributed arithmetic-based architectures can be modified.

Due to the complexity of the design space, no general assessment can be made. Nevertheless, we have established a comparison of several previously reported designs, which are presented in Table 4. The resulting performances for several dedicated DCT implementations are plotted in Fig. 10. The gray line in Fig. 10 indicates an empirical measure of the A_si = f(R_T) function for a function specific implementation of the DCT.

The through-put rate R_T in Mpel/s can be transferred to the computational rate R_C with N_op,pel. For an 8 x 8 separable 2D DCT based on matrix vector multiplication, this results, for a 1.0 um CMOS process, in

alpha_C,0 ~ 15 mm^2/GOPS.    (18)

It should be noted that in the figure of merit above, multiplications and additions are counted equally. When considering the real expense of multiplications as a multiple of additions, the figure of merit can be modified to

alpha_C,0 ~ 2 mm^2/GADDS    (19)

with GADDS as giga additions per second.

The specific implementations differ in architectural efficiency. Architectures located below the gray line achieve a higher efficiency. The characteristic linear relation in Fig. 10 describes an average efficiency for an implementation, which is derived by curve fitting of the results of different architectures and different designs. For this reason the expected offset according to (8) for the required transposition memory does not show up.

B. Block Matching

In Section II motion estimation has been presented as a technique for the improvement of coding efficiency. Several techniques for motion estimation have been proposed in the past, e.g., [25], [27], [28], [37], [38]. Today, the most important technique for motion estimation is block matching, introduced by [32]. Block matching is based on the matching of blocks between the current and a reference frame. This can be done by a full (exhaustive) search within


a search window, but several other approaches have been reported in order to reduce the computation requirements by using an "intelligent" or "directed" search [26], [29], [30], [34], [36], [39], [40].

1) Exhaustive Search Block Matching: Applying an exhaustive search block matching algorithm, a block of size N x N of the current image (reference block, denoted X) is matched with all the blocks located within a search window (candidate blocks, denoted Y). The maximum displacement will be denoted by w. The matching criterion generally consists in computing the mean absolute difference (MAD) between the blocks. Let x(i, j) be the pixels of the reference block and y(i, j) the pixels of the candidate block. The matching distance (or distortion) D is computed according to (20). The indexes m and n indicate the position of the candidate block within the search window. The distortion D is computed for all the (2w+1)^2 possible positions of the candidate block within the search window, and the block corresponding to the minimum distortion is used for prediction. The position of this block within the search window is represented by the motion vector v, see (21).

D(m, n) = sum_{i=0}^{N-1} sum_{j=0}^{N-1} |x(i, j) - y(i+m, j+n)|    (20)

D_min = min_{-w <= m,n <= +w} D(m, n),    v = (m, n)|_{D_min}.    (21)

The algorithm for computing a motion vector can be expressed sequentially as follows.


The operations involved for computing D(m, n) and D_min are associative. Thus the order for exploring the index spaces (i, j) and (m, n) is arbitrary, and the block matching algorithm can be described by several different dependence graphs. As an example, Figs. 11 and 12 show possible dependence graphs (DG's) for w = 1 and N = 4. In these figures, AD denotes an absolute difference and an addition; M denotes a minimum value computation. Since the index space encompasses four dimensions, the dependence graphs can be presented as a hierarchy of two DG's, each one of dimension 2: all the nodes M of the left graph embed the computation of D(m, n) (right graph).

Transforming these DG’s into practical architectures is performed with the usual operations: mapping of DG’s onto lower dimension systolic arrays, index projection, time scheduling, graph folding. [43] and [48] propose an exten- sive analysis of possible systolic and array architectures for full search block matching.

A direct mapping of the dependence graphs of Fig. 11 is possible. The dependence graph for computing D(m, n) is directly mapped onto a 2D array of processing elements (PE's), while the dependence graph for computing v(X, Y) is mapped into time (Fig. 13). In other words, block matching is performed by a sequential exploration of the search area, while the computation of each distortion is performed in parallel. Each of the AD nodes of the DG is implemented by an AD processing element (AD-PE). The AD-PE stores the value of x(i, j) and receives the value of y(m+i, n+j) corresponding to the current position of the reference block in the search window. It performs the subtraction and the absolute value computation, and adds the result to the partial result coming from the upper PE (Fig. 14). The partial results are added along the columns, and a linear array of adders performs the horizontal summation of the row sums and computes D(m, n). For each position (m, n) of the reference block, the M-PE checks if the distortion D(m, n) is smaller than the previous smallest distortion value and, in this case, updates the register which keeps this value.

To transform this naive architecture into a realistic implementation, two problems must be solved: 1) a reduction of the cycle time, and 2) the I/O management.

1) The architecture of Fig. 13 implicitly supposes that the computation of D(m, n) can be done combinatorially in one cycle time. While this is theoretically possible, the resulting cycle time would be very large and would increase as 2N. Thus a pipeline scheme is generally added.

2) This architecture also supposes that each of the AD-PE's receives a new value of y(m+i, n+j) at each clock cycle.

Since transmitting the N^2 values from an external memory is clearly impossible, advantage must be taken of the fact that these values belong to the search window. A portion of the search window of size N * (2w + N) is stored in the circuit, in a 2D bank of shift registers able to shift in the up, down, and right directions. Each of the AD-PE's owns one of these registers and can, at each cycle, obtain the value of y(m+i, n+j) that it needs. To update this register bank, a new column of 2w + N pixels of the search area is serially entered into the circuit and inserted in the bank of registers. Additionally, a mechanism must be provided for loading a new reference block with a low I/O overhead: a double buffering of x(i, j) is required, with the pixels x'(i, j) of a new reference block serially loaded during the computation of the current reference block (Fig. 15).

This architecture corresponds to the architecture type-2 of [43], to architecture AB2 of [48], and also to the design proposed by [47]. Reference [45] uses the same architecture, with a bit-serial implementation of the PE's. The design of [50] is also based on this architecture, combined with a time multiplex of the PE's, which halves the number of physical processing elements.

The dependence graph for computing D(m, n) can also be mapped onto a 1D array of processing elements, which computes in parallel the partial distortions along one row. This 1D array can sequentially compute D(m, n) in N cycles. The dependence graph for computing v(X, Y) is


DMIN = MAXVALUE
VMIN = (0, 0)
for m = -w to +w
    for n = -w to +w
        D(m,n) = 0
        for i = 1 to N
            for j = 1 to N
                D(m,n) = D(m,n) + |x(i,j) - y(i+m, j+n)|
            endfor
        endfor
        if D(m,n) < DMIN then
            DMIN = D(m,n)
            VMIN = (m, n)
        endif
    endfor
endfor
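A runnable equivalent of this loop nest can be sketched as follows (frames are plain 2D lists; candidate blocks falling outside the reference frame are skipped, an edge case the pseudocode leaves implicit):

```python
# Exhaustive-search block matching following (20) and (21).

def full_search(cur, ref, bx, by, N, w):
    """Return (motion vector, minimum distortion) for the NxN block at (bx, by)."""
    H, W = len(ref), len(ref[0])
    d_min, v_min = float("inf"), (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            # skip candidate blocks that fall outside the reference frame
            if not (0 <= by + m and by + m + N <= H
                    and 0 <= bx + n and bx + n + N <= W):
                continue
            d = sum(abs(cur[by + i][bx + j] - ref[by + m + i][bx + n + j])
                    for i in range(N) for j in range(N))
            if d < d_min:
                d_min, v_min = d, (m, n)
    return v_min, d_min

# Synthetic frames: cur is ref shifted down by 2 and right by 1 pixel,
# so the best match for the block at (8, 8) is the vector (-2, -1).
ref = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [[ref[y - 2][x - 1] if y >= 2 and x >= 1 else 0 for x in range(16)]
       for y in range(16)]
v, d = full_search(cur, ref, 8, 8, 4, 3)   # v == (-2, -1), d == 0
```

The two outer loops are what the architectures above map into time or onto a (2w+1)^2 PE array; the two inner loops are what the AD-PE arrays compute in parallel.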



Fig. 9. Architecture of an 8-point 1D DCT using distributed arithmetic. (a) "Pure D.A.": 8 scalar products of 8 points. (b) "Mixed D.A.": first stage of flowgraph decomposition followed by 2 times 4 scalar products of 4 points.

mapped into time (Fig. 16). This architecture corresponds to architecture AB1 of [48].

Another mapping of the dependence graphs of Fig. 11 is possible. The dependence graph for computing v(X, Y) is

Fig. 10. Normalized silicon area and through-put (Mpel/s) for dedicated DCT circuits.


Fig. 11. Dependence graphs of the block matching algorithm. The computation of v(X, Y) and D ( m , n) are performed by 2D linear DG’s.


Fig. 12. Other possible DG‘s for the computation of D(m, n). Due to the associativity of the operations, similar DG‘s could be used for the computation of D(m, n).

mapped onto a 2D array of (2w+1)^2 processing elements, while the dependence graph for computing D(m, n) is mapped into time (Fig. 17).

Each processing element working in parallel keeps track of a particular distortion computation and sequentially



Fig. 13. Principle of the 2D block based architecture.


Fig. 14. Internal architecture of AD-PE and M-PE.

explores the reference block. At each cycle, one PE receives a different value of y(m+i, n+j), and all the PE's receive the value of one pixel of the reference block, which is broadcast to the array. After N^2 cycles, each of the (2w+1)^2 processing elements holds one value of D(m, n) corresponding to a particular displacement (m, n). These values can be stored in a slave register, which allows a new computation to start while the minimum distortion value is computed. One way to do this is to find the minimum value along each column by downshifting the D(m, n) in the M-PE's and then find the final minimum value by left-shifting the resulting D(m, n) in the M-PE's. This process takes (2w+1)^2 cycles, which, for commonly used values of w and N, is smaller than the N^2 cycles available to perform



Fig. 15. Practical implementation of the 2D block based archi- tecture.

I I‘Search *z I t=s lt=8 I I I WIndOW

0

+1

7inu allocation Alchitecnur

Fig. 16. 1D block based architecture.

this computation. This architecture must also be extended, in a manner similar to the extension described in Fig. 15, by storing a part of the search area of size w * (2w + N), in order to reduce the I/O's.

This architecture corresponds to the architecture type-1 of [43]. [42] uses the same architecture, combined with a



Fig. 17. 2D search area based architecture.


Fig. 18. 2D architecture of [42].

time multiplex of the PE's, which halves the number of physical processing elements.

A 1D search-area based architecture can also be designed, where an array of (2w+1) processing elements computes in N^2 cycles the distortions D(m, n) corresponding to one line (resp. column) of possible motion vectors. This process is repeated sequentially 2w+1 times for computing all the distortions. References [51], [18], [86] have implemented such 1D arrays of 16 processing elements (Fig. 19).

Fig. 19. 1D architecture of [51].

2) Block Matching Algorithms with Reduced Computational Complexity: Exhaustive search block matching requires high computational power. For example, for CCIR signals and w = 15, a processing power of approximately 30 giga operations per second (GOPS) is required.

To reduce the computational complexity required for block matching, two strategies can be applied: 1) decrease of the number of candidate blocks; 2) decrease of the pels per block by subsampling of the image data.

Typically, 1) is implemented by search strategies in successive steps. As an example, a modified scheme according to the original proposal of [36] will be discussed. In this scheme the best match v_{s-1} of the previous step s-1 is improved in the present step s by comparison with displacements of magnitude Delta_s. The displacement vector v_s for each step s is calculated according to

D_s(m_s, n_s) = sum_{i=0}^{N-1} sum_{j=0}^{N-1} |x(i, j) - y(i + m_{s-1} + p*Delta_s, j + n_{s-1} + q*Delta_s)|    with p, q in {-1, 0, 1}.    (22)

Delta_s depends on the maximum displacement w and the number of search steps N_s. Typically, when w = 2^k - 1, N_s is set to k = log2(w+1) and Delta_s = 2^{k-s}. For example, for w = 15, four steps with Delta_s = 8, 4, 2, 1 are performed. This strategy reduces the number of candidate blocks from (2w+1)^2 in case of exhaustive search to 1 + 8*log2(w+1); e.g., for w = 15 the number of candidate blocks is reduced from 961 to 33, which leads to a reduction of processing power by a factor of 29. For large block sizes N, the number of operations for the match can be further
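A sketch of this successive-step strategy follows (plain 2D lists for the frames; the test frame is a synthetic vertical shift chosen so the first step locks onto the exact match):

```python
# Step search after the scheme of [36]: starting from v_0 = (0, 0), each step s
# tests the eight neighbours of the current best vector at distance Delta_s,
# with Delta_s = 8, 4, 2, 1 for w = 15: 33 candidates instead of 961.

def sad(cur, ref, bx, by, N, m, n):
    """Distortion (20) for displacement (m, n); infinite outside the frame."""
    H, W = len(ref), len(ref[0])
    if not (0 <= by + m and by + m + N <= H
            and 0 <= bx + n and bx + n + N <= W):
        return float("inf")
    return sum(abs(cur[by + i][bx + j] - ref[by + m + i][bx + n + j])
               for i in range(N) for j in range(N))

def step_search(cur, ref, bx, by, N, w):
    """Search steps s = 1..k with Delta_s = 2^(k-s), k = log2(w+1)."""
    center = (0, 0)
    best_d = sad(cur, ref, bx, by, N, 0, 0)
    delta = (w + 1) // 2                   # Delta_1 = 8 for w = 15
    while delta >= 1:
        best = center
        for p in (-1, 0, 1):
            for q in (-1, 0, 1):
                m, n = center[0] + p * delta, center[1] + q * delta
                d = sad(cur, ref, bx, by, N, m, n)
                if d < best_d:
                    best_d, best = d, (m, n)
        center = best                      # move to the best candidate
        delta //= 2                        # halve the step size
    return center, best_d

# Frame shifted down by exactly 8 pixels: the first step finds the zero-
# distortion vector (-8, 0), and the later steps cannot improve on it.
ref = [[48 * y + x for x in range(48)] for y in range(48)]
cur = [[ref[y - 8][x] if y >= 8 else 0 for x in range(48)] for y in range(48)]
v, d = step_search(cur, ref, 16, 16, 8, 15)   # v == (-8, 0), d == 0
```

Unlike the exhaustive search, this greedy refinement is not guaranteed to find the global minimum; it trades some matching quality for the roughly 29-fold reduction in processing power quoted above.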


Table 5 Comparison of Reported Motion Estimation Circuits

  Architecture   # Trans.   Throughput (MMAD/s)   Chip Size (mm^2)   Core Size (mm^2)   Comments
  BM             52 151     183                   70                 54                 Bit serial
  BM             27 000     400                   64                 51
  BM             540 000    2300                  86                 68
  BM             405 000    960                   98                 79                 Half pel
  BM             140 000    5300                  125                104
  BM                        648                   187                161                Configurable DCT/MVT
  Hie                       5200                  66                 51                 Hierarchical block matching

reduced by combining the search strategy with subsampling in the first steps. Architectures for block matching based on hierarchical search strategies are presented in [48], [44], [35], [33], [24], [31].

3) Architecture Comparison: Table 5 is a summary of previously reported implementations.

Fig. 20 shows normalized computational rate versus normalized chip area for block matching circuits.

The gray line in Fig. 20 indicates an empirical measure of the A_si = f(R_C) function for block matching based motion estimation. Since one MAD operation consists of three basic ALU operations (SUB, ABS, ADD), for a 1.0 um CMOS process we can derive from this figure that

alpha_C,0 ~ 1.9 mm^2/GOPS    (23)
alpha_M,0 * C_M ~ 30 mm^2.    (24)


Fig. 20. Normalized silicon area and computational rate for ded- icated motion estimation architectures.

The first term of this expression indicates that the block matching algorithm requires a large storage area (storage of parts of the actual and previous frame), which cannot be reduced even when the through-put is reduced. The second term corresponds to the traditional constant AT-product, which represents the dependency on computational through-put. The second term has about the same magnitude as the value determined for the DCT in GADDS, because the three types of operations for the matching require approximately the expense of additions.

C. Dedicated Implementations of the Complete Hybrid Coding Scheme

The previous sections have shown different architectures and implementations of basic functions for video encoding and decoding schemes (DCT, motion estimation). Implementations of other basic functions such as variable length coding/decoding have been proposed [52], [53], [54]. In the first generation of circuits, each of these basic functions was implemented in one chip, and a chipset was necessary for creating a system for MPEG video encoding or decoding [62], [73], [76], [84], [86], [87], [92]. In the following, three examples of dedicated chips for the implementation of a complete hybrid coding scheme are shortly discussed.


[86] presents a chipset consisting of seven dedicated chips, which have been designed in a 1.5 um CMOS technology and 1.0 um CMOS technology, respectively. The motion estimation circuit performs an exhaustive search block matching with a maximum displacement of +/-N/2 and a reference block of size N x N, where N is either 8 or 16. A DCT circuit performs forward and inverse discrete cosine transforms. A quantization processor has been designed for quantization and inverse quantization. VLC encoding/decoding and BCH encoding/decoding are implemented in two separate chips. An interframe processor performs tasks like loop filtering and inter/intra decision. For converting the blockwise output of the DCT circuit into a sequential data stream for quantization, and for converting the sequential output of the inverse quantization into data blocks for the IDCT, a data reordering chip is required. The implementation of a complete hybrid codec according to the H.261 standard requires 10 chips.

Another example for a dedicated implementation of a complete hybrid coding scheme has been presented in [84]. This architecture aims at a single-chip MPEG-1 video decoder. The chip consists of four main modules: a decoder unit for variable length decoding, an IDCT unit, a motion compensation unit, and an instruction unit which controls



Fig. 21. MPEG-Decoder according to [84].

the functional units and the external DRAM frame memory. For interfacing, a code bus interface, a video bus interface, and a memory bus interface have been integrated on the processor die. A single chip provides sufficient processing power for realtime decoding of SIF signals (352 x 240 pels, 30 Hz frame rate) at a maximum channel data rate of up to 6 Mb/s. CCIR resolution can be achieved by use of an on-chip interpolation unit.

Reference [62] presents another dedicated single-chip MPEG-1 decoder. Decoding of the MPEG bitstream is performed by a VLD unit, an inverse quantization unit, an IDCT unit and an interpolation unit. Video, channel, and frame memory interfacing is provided by four I/O units.

V. PROGRAMMABLE ARCHITECTURES

In contrast to function oriented approaches with limited functionality, programmable architectures enable the processing of different tasks under software control. The particular advantage of programmable architectures is their increased flexibility. Changes of architectural requirements, e.g., due to changes of algorithms or an extension of the aimed application field, can be handled by software changes. Thus a generally cost-intensive redesign of the hardware can be avoided. Moreover, since programmable architectures cover a wider range of applications, they can be used for low-volume applications, where the design of function specific VLSI chips is not an economical solution.

On the other hand, programmable architectures require a higher expense for design and manufacturing, since additional hardware for program control is required. Moreover, programmable architectures require software development


Fig. 22. MPEG-Decoder according to [62].

for the envisaged application. Although several vendors provide development tools, including high level language compilers, the expense for software development must not be neglected and has to be considered when deciding which type of architecture, function oriented or programmable, is to be used for a specific application.

Image processing, and especially video coding applications, often require real-time processing of the image data. To achieve this goal, parallelization strategies have to be employed. The two basic alternative parallelization strategies, data distribution and task distribution, have been discussed in Section III.

A variety of programmable and dedicated VLSI architectures for video coding applications have been presented in the past [56]-[61], [63]-[66], [68], [69], [71], [72], [74], [75], [77]-[83], [85], [86], [88]-[91]. In the following section, examples of programmable architectures suitable for real-time video compression are presented. These examples clarify the wide range of architectural alternatives for the VLSI implementation of processors for video coding applications.

A. Strategies to Increase Performance

Several programmable architectures for video coding applications have been proposed during the last years. Some of these architectures are designed especially for video coding applications, whereas others aim at a wider range of applications, e.g., desktop document image processing.

The main problem that has to be solved by a VLSI implementation is to support the high computational power required for video coding applications. For a coarse classification, three ways to cope with this problem can be distinguished:

Increase of clock frequency: Due to the close interaction of clock frequency and computational power, one way to increase processing power is to raise the processor clock frequency by intensive pipelining. The ideal linear speedup is limited by two effects. First, due to pipeline hazards, a linear speedup cannot be achieved for algorithms based on a data dependent control flow. Second, the data access rate to an external data memory is limited. This might be compensated by an increase of on-chip data memory, which provides a high data access rate. Nevertheless, this increase of on-chip memory is limited by the semiconductor yield.

Parallel data paths: This approach exploits data distribution for the increase of computational power. Several parallel data paths are implemented on one processor die, which leads in the ideal case to a linear increase of supported computational power. Generally, this strategy is limited by the maximum degree of data parallelism supported by the envisaged algorithms. Moreover, the number of parallel data paths is limited by the semiconductor process, since an increase of silicon area leads to a decrease of hardware yield.

Coprocessor concept: Coprocessors are known from general processor designs and are often used for specific tasks, e.g., floating point operations. The idea of adaptation to specific tasks to increase computational power without an increase of the required semiconductor area has been applied by several designs. The main disadvantage of this approach is the decrease of flexibility with increasing adaptation.

Most architectures exploit more or less all of the named architectural strategies, which leads to a wide range of different architectural approaches. The following examples give an overview of the design space of programmable VLSI components for video coding applications.
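As a toy illustration of the data-distribution strategy above (not taken from the paper), the following Python sketch splits a frame into row stripes and lets each hypothetical data path apply the same operation to its own stripe; the function names are invented for illustration.

```python
def split_into_stripes(frame, num_paths):
    """Distribute the rows of a frame as evenly as possible over
    num_paths parallel data paths (data distribution)."""
    base, extra = divmod(len(frame), num_paths)
    stripes, start = [], 0
    for p in range(num_paths):
        size = base + (1 if p < extra else 0)
        stripes.append(frame[start:start + size])
        start += size
    return stripes

def process_parallel(frame, num_paths, op):
    """Model of SIMD-style processing: every 'data path' runs the same
    operation op on its own stripe; the results are merged in order."""
    out = []
    for stripe in split_into_stripes(frame, num_paths):
        out.extend([[op(pel) for pel in row] for row in stripe])
    return out
```

In the ideal case each of the num_paths stripes is processed concurrently, giving the linear speedup mentioned above; in this sequential model the stripes are merely processed one after another.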

B. Increase of Clock Frequency

The throughput rate depends nearly linearly on the clock frequency of the processor. Thus, one way to achieve higher computational power is an increase of clock frequency. The maximum clock frequency achieved by a specific processor implementation is limited by the maximum delay between two successive register stages, which is determined by the technology dependent gate delay. Since power consumption increases proportionally to clock frequency, an additional limiting factor might be the power consumption of the processor chip. To increase the processor clock frequency, the number of gate stages between two successive registers has to be decreased by an implementation of additional register stages. This strategy is illustrated in Fig. 23. The logic block F with delay TF is subdivided into two logical blocks F1 and F2 with delays TF1 and

Fig. 23. Increase of throughput rate by pipelining.

Fig. 24. VSP3 architecture [71]. (AGU: Address Generation Unit; PAU: Pipelined Arithmetic Unit; PCU: Pipelined Convolver Unit; HIU: Host Interface Unit; TCU: Timing Control Unit; SCU: Sequence Control Unit.)

TF2, respectively. Due to the employment of an additional register stage, the maximum delay between two register stages is reduced to max(TF1, TF2). In the ideal case TF1 and TF2 equal TF/2 and the clock frequency can be doubled.
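The relation between stage delays and achievable clock rate described above can be captured in a few lines (a sketch under the idealized assumption that register setup and propagation overhead is negligible):

```python
def max_clock_mhz(stage_delays_ns):
    """The clock period is set by the slowest logic stage between two
    successive registers, i.e., by max(TF1, TF2, ...)."""
    return 1000.0 / max(stage_delays_ns)

# Splitting a 20 ns block F into two balanced 10 ns stages doubles the
# clock frequency, as in Fig. 23; an unbalanced split such as 14 ns and
# 6 ns gains less because max(TF1, TF2) dominates the clock period.
```

For example, max_clock_mhz([20.0]) gives 50 MHz, a balanced split max_clock_mhz([10.0, 10.0]) gives 100 MHz, while the unbalanced split max_clock_mhz([14.0, 6.0]) only reaches about 71 MHz.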

The increase of clock frequency by pipelining increases the latency of the circuit. For algorithms which require a data dependent control flow this fact might limit the performance gain. Additionally, increased arithmetic processing power leads to an increased data access rate. Generally, the required data access rate cannot be provided by external memories. The gap between the provided external and the required internal data access rate increases for processor architectures with high clock frequency. To provide the high data access rate, the amount of internal memory, which provides a low access time, has to be increased for high-performance signal processors. Moreover, it is not feasible



to apply pipelining to speed up on-chip memory. Thus the minimum memory access time is another limiting factor for the maximum degree of pipelining. Finally, speed optimization is a time-consuming task of the design process, which has to be performed for every new technology generation.

Examples of video processors with high clock frequency are the S-VSP [67] and the VSP3 [71]. Due to intensive pipelining an internal clock frequency of up to 300 MHz can be achieved. The VSP3 consists of two parallel data paths, the Pipelined Arithmetic Logic Unit (PAU) and the Pipelined Convolution Unit (PCU). The relatively large on-chip data memory of size 114 kb is split into seven blocks: six data memories and one FIFO memory for external data exchange. Each of the six data memories is provided with an address generation unit (AGU), which provides the addressing modes "block," "DCT," and "zigzag." Controlling is performed by a Sequence Control Unit (SCU) which involves a 1024 x 32-b instruction memory. A Host Interface Unit (HIU) and a Timing Control Unit (TCU) for the derivation of the internal clock frequency are integrated onto the VSP3 core.

The Pipelined Arithmetic Logic Unit consists of two 16-b shifters, a 16-b ALU, a 20-b accumulator, a 16-b barrel shifter, a 16-b minimum/maximum value detector, and a PAU controller. Due to the employed pipeline structure the frequently required |ai - bi| operation can be performed within a single clock period. In combination with the ALU, the minimum/maximum detector provides a motion vector estimation for one macroblock based on a three-step search with an 8 x 8 reference area and a 45 x 45 search area within 9.6 μs.
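The three-step search mentioned above can be sketched in software as follows. This is a behavioral model, not the VSP3 hardware: the block position, block size, and the sum-of-absolute-differences cost built from |ai - bi| terms are the only elements carried over from the text; everything else is illustrative.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences |ai - bi| over an n x n block."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total

def three_step_search(cur, ref, bx, by, n=8, step=4):
    """Three-step search: test 9 candidate displacements around the
    current best, halve the step, and repeat (steps 4, 2, 1)."""
    h, w = len(ref), len(ref[0])
    mx = my = 0  # best displacement found so far
    while step >= 1:
        best = bmx = bmy = None
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                rx, ry = bx + mx + dx, by + my + dy
                if 0 <= rx <= w - n and 0 <= ry <= h - n:
                    cost = sad(cur, ref, bx, by, rx, ry, n)
                    if best is None or cost < best:
                        best, bmx, bmy = cost, mx + dx, my + dy
        mx, my = bmx, bmy
        step //= 2
    return mx, my
```

Note the hardware-relevant property: only 3 x 9 = 27 candidate blocks are evaluated instead of the hundreds an exhaustive search would need, at the price of possibly missing the global SAD minimum.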

The Pipelined Convolver Unit consists of two 16-b shifters, a 16-b delay element to delay input data for one clock cycle, a 16-b x 16-b convolver, a 24-b adder, a 3-port 8-word x 24-b x 2-plane register file, a 16-b barrel shifter, a 16-b limiter for clipping input data, and a PCU control unit. To achieve a high throughput rate for the computation intensive tasks DCT and IDCT, respectively, the first butterfly stage of fast DCT algorithms can be exploited. This leads to a 1.88 times faster DCT implementation compared to the conventional one.

The entire VSP3 core consists of 1.27 million transistors, implemented in a 0.5 μm BiCMOS technology on a 16.5 x 17.0-mm2 die. The VSP3 performs the processing of the CCITT H.261 tasks (neglecting variable length coding) for one macroblock in 45 μs. Since realtime processing of 30 Hz CIF signals requires a processing time of less than 85 μs for one macroblock, an H.261 coder can be implemented based on one VSP3.
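Real-time CIF-30 Hz processing leaves a fixed time budget per macroblock, which can be checked with a little arithmetic (a sketch, independent of any particular processor):

```python
# CIF: 352 x 288 pels -> 22 x 18 = 396 macroblocks of 16 x 16 pels each.
macroblocks_per_frame = (352 // 16) * (288 // 16)
frames_per_second = 30

# Available processing time per macroblock for real-time operation,
# in microseconds: one second divided over all macroblocks per second.
budget_us = 1e6 / (macroblocks_per_frame * frames_per_second)
```

This evaluates to roughly 84 microseconds per macroblock for a CIF-30 Hz stream.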

C. Parallel Data Paths

In Section V-B pipelining has been presented as a strategy for processing power enhancement. Applying pipelining leads to a subdivision of a logic operation into sub-operations, which are processed in parallel with increased processing speed. Since the sub-operations are processed on different units, pipelining can be referred to as a specific implementation of task distribution. As discussed in Section III, an alternative to task distribution is the distribution of data among functional units. Applying this strategy leads to an implementation of parallel data paths. Typically, each data path is connected to an on-chip memory which provides the access to the distributed image segments.

Generally, two types of controlling strategies for parallel data paths can be distinguished. An MIMD concept provides a private control unit for each data path, whereas SIMD based controlling provides a single common controller for all parallel data paths. Compared to SIMD, the advantages of MIMD are greater flexibility and higher performance for complex algorithms with highly data dependent control flow. On the other hand, MIMD requires a significantly increased silicon area. Additionally, the access rate to the program memory is increased, since several controllers have to be provided with program data. Moreover, a software-based synchronization of the data paths is more complex. In the case of an SIMD concept, synchronization is performed implicitly by the hardware.

Since hybrid coding schemes require a large amount of processing power for tasks with a data independent control flow, a single control unit for the parallel data paths provides sufficient processor performance. The controlling strategy nevertheless has to support the execution of algorithms which require a data dependent control flow, e.g., quantization. A simple concept for the implementation of a data dependent control flow is to disable the execution of instructions depending on the local data path status. In this case the data path utilization might be significantly decreased, since several of the parallel data paths idle while others perform the processing of image data. An alternative is a hierarchical controlling concept. In this case each data path is provided with a small local control unit with limited functionality, and the global controller initiates the execution of control sequences by the local data path controllers. To reduce the required chip area for this controlling concept, the local controller can be reduced to a small instruction memory. Addressing of this memory is performed by the global control unit.
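The instruction-disabling concept described above can be modeled in a few lines (names and the example operation are invented for illustration): each SIMD lane executes the common instruction only where its local status bit allows it, otherwise the lane idles and its datum passes through unchanged.

```python
def simd_step(lanes, predicate, op):
    """One SIMD instruction under predication: op is applied only in
    lanes whose predicate bit is set; disabled lanes keep their value."""
    return [op(v) if p else v for v, p in zip(lanes, predicate)]

# A data dependent control flow expressed as a per-lane predicate,
# e.g., coarser requantization of large coefficients only:
coeffs = [3, -40, 7, 25]
mask = [abs(c) > 8 for c in coeffs]
out = simd_step(coeffs, mask, lambda c: c // 4)
```

The utilization penalty mentioned in the text is visible here: lanes whose predicate is false do no useful work during this instruction.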

An example of a video processor based on parallel identical data paths with a hierarchical controlling concept is the IDSP [91]. The IDSP processor includes four pipelined data processing units (DPU0-DPU3), three parallel I/O ports (PIO0-PIO2), one 8 x 16-b register file, five dual-ported memory blocks of size 512 x 16-b each, an address generation unit for the data memories, and a program sequencer with a 512 x 32-b instruction memory and a 32 x 32-b boot ROM.

The data processing units consist of a three-stage pipeline structure based on a 16-b ALU, a 16-b x 16-b multiplier, and a 24-b accumulator. This data path structure is well suited for L1 and L2 norm calculations and convolution-like algorithms. The four parallel data paths support a peak computational power of 300 MOPS at a typical clock frequency of 25 MHz. The data required for parallel processing are supplied by four cache memories (CM0-CM3) and a work memory (WM). Address generation for these



Fig. 25. IDSP architecture [91].

memories is performed by an address generation unit (AU) which supports address sequences like block scan, bit reverse, and butterfly. The three parallel I/O units contain a data I/O port, a 20-b address generation unit, and a DMA control processor (DMAC).

A typical VLIW controlling of the four data paths and the other functional units of the IDSP would require an instruction word length of several hundred bits. To achieve a reduction of the instruction word width without a significant decrease of flexibility, the IDSP applies a hierarchical instruction decoding strategy: The data flow, i.e., VBUS connections, DPU paths, initial addresses, and DMAC parameters, is controlled by so-called "SET-UP commands," which perform the basic configuration of the processor. An "EXEC command" controls the operation of CM0-CM3, WM, and PIO0-PIO2, and also points to addresses in the local program memories of each data path. In this way, the 160-b microinstructions can be encoded with a 32-b program word length.
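The two-level decoding idea can be modeled roughly as follows; the field widths and layout below are invented for illustration and do not reflect the actual IDSP instruction format. SET-UP commands preload configurations and local programs once; each 32-b EXEC word then only selects among them instead of spelling out a wide microinstruction.

```python
# Preloaded by hypothetical SET-UP commands: small local program
# memories keyed by an index, standing in for the per-data-path stores.
local_programs = {0: "mac-loop", 1: "dct-pass"}

def decode_exec(word):
    """Unpack a packed 32-b EXEC word into per-unit selections.
    Illustrative field layout: bits [31:16] memory configuration,
    [15:8] I/O configuration, [7:0] local-program index."""
    program = local_programs[word & 0xFF]
    io_cfg = (word >> 8) & 0xFF
    mem_cfg = (word >> 16) & 0xFFFF
    return program, io_cfg, mem_cfg
```

The compression comes from indirection: a few index bits select a previously loaded wide configuration, just as the 32-b EXEC word stands in for a 160-b microinstruction.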

The IDSP integrates 910 000 transistors on a 15.2 x 15.2 mm2 die using a 0.8 μm BiCMOS technology. For a full-CIF H.261 video codec four IDSPs are required.

D. Coprocessor Concept

Most programmable architectures for video processing applications achieve an increase of processing power by an adaptation of the architecture to the algorithmic requirements. A feasible approach is the combination of a flexible programmable processor module with one or more adapted modules. This approach leads to an increase of processing



power for specific algorithms and a significant decrease of required silicon area. The decrease of silicon area is caused by two effects. First, the implementation of the required arithmetic operations can be optimized. Second, dedicated modules require significantly less hardware expense for module controlling, e.g., for program memory.

Typically, computation intensive tasks, e.g., DCT, block matching, or variable length coding, are candidates for an adapted or even dedicated implementation. Besides the adaptation to one specific task, mapping several different tasks onto one adapted processor module might be advantageous. For example, mapping successive tasks, like DCT, quantization, inverse quantization, and IDCT, onto the same module reduces the internal communication overhead.

Coprocessor architectures which are based on highly adapted coprocessors achieve high computational power on a small chip area. The main disadvantage of these architectures is the limited flexibility. Changes of the envisaged applications might lead to an unbalanced utilization of the processor modules and therefore to a limitation of the effective processing power of the chip.

Applying the coprocessor concept opens up a variety of feasible architecture approaches, which differ in achievable processing power and flexibility of the architecture. In the following, several architectures are presented, which clarify the wide variety of sensible approaches for video compression based on a coprocessor concept. Most of these architectures aim at an efficient implementation of hybrid coding schemes. As a consequence, these architectures are based on highly adapted coprocessors. An example of a more flexible architecture is the TMS320C80 [68],



Fig. 26. Vision processor architecture [58].

which combines parallel data paths with a coprocessor concept based on flexible processors.

A coprocessor architecture for video coding applications based on two different chip types is presented in [70], [58]. The chip set consists of a vision controller (VC) which can be combined with up to four vision processors (VP). The number of vision processors depends on the required computational power. For example, one VC and one VP are required for the realization of a H.261 encoder for CIF-15 Hz signals.

The VP has a 64-b parallel architecture, including fast multiply/accumulate circuitry, sixteen 8-b ALUs, a shifter and tree adder for motion estimation, and a RISC core with ALU and a 32-word register file. Data transfer is performed by a DMA port. A function oriented run-length codec supports the encoding and decoding of run/level tokens in parallel to the data transfer. The VP applies a hierarchical controlling strategy: the VP is controlled by a sequencer in combination with microcode, and the sequencer itself is controlled by external macro commands, which are fed into a command queue.

The Vision Controller has been designed as a master processor for one or more VPs. The VC's architecture is a combination of a flexible programmable RISC core controller and function oriented units for variable length encoding and decoding and for pre/postprocessing, e.g., up and down conversion between the CIF and CCIR video formats. A memory controller enables the employment of DRAM memories for external data storage. Communication with a host processor is supported by an 8-b host interface.

A superset architecture of the vision controller/vision processor combination, the VCP, for video encoding and decoding according to H.261, MPEG-1, MPEG-2, and JPEG has been announced. This chip supports the Next Profile at Low Level and Main Profile at Main Level of the MPEG-2 standard.

A chip set for video coding has been proposed in [85]. This chip set consists of four devices: two encoder options (the AVP1300E and AVP1400E), the AVP1400D decoder, and the AVP1400C system controller. The AVP1300E has been designed for H.261 and MPEG-1 frame based encoding. Full MPEG-1 encoding (I-frame, P-frame, and B-frame) is supported by the AVP1400E. In the following, the architecture of the encoder chips is presented in more detail.

Fig. 27. Vision controller overview [58].

The AVP1300E combines function oriented modules, mask programmable modules, and user programmable modules. It contains a dedicated motion estimator for exhaustive search block matching with a search area of +/- 15 pels. The variable length encoder (VLE) unit contains a 16-b ALU, a 144 x 16-b register array, a 128 x 16-b coefficient RAM, and a 1024 x 20-b table ROM. Program instructions for the VLE unit are stored in a 1024 x 25-b ROM. Special instructions for conditional switching, run-length coding, and variable-to-fixed-length conversion are supported. The remaining tasks of the encoder loop, i.e., DCT/IDCT, quantization, and inverse quantization, are performed in two modules, called SIMD processor and quantization processor (QP). The SIMD processor consists of six parallel 16-b processors, each with an ALU and a multiplier-accumulator unit. Program information for this module is again stored in a ROM memory. The QP's instructions are stored in a 1024 x 28-b RAM. This module contains a 16-b ALU, a multiplier, and a register file of size 144 x 16-b. Data communication with external DRAMs is supported by a memory management unit (MMAFC). Additionally, the processor scheduling is performed by a global controller.

Due to the adaptation of the architecture to specific tasks of the hybrid coding scheme, a single chip of size 132 mm2 (in a 0.9 μm CMOS technology) supports the encoding of CIF-30 Hz video signals according to the H.261 standard, including the computation intensive exhaustive search motion estimation strategy. An overview of the complete chip set is given in [85].

The VDSP2 [56] is another typical example of the combination of function oriented and programmable units within a single chip. The processor consists of a programmable DSP core, a VLC/VLD circuit, and a dedicated DCT/IDCT unit. Data access to external frame memories is supported by a DRAM controller which supports DMA transfer. A serial/parallel I/O unit is available for the transmission and reception of data and control signals.

The DSP core contains four parallel vector processing units, which execute parallel vector operations in an SIMD mode. The data paths consist of a multiplier, an accumulator,





Fig. 28. AVP encoder architecture [85].

Fig. 29. VDSP2 architecture [56].

a shifter, an enhanced ALU, and memories. The data path has been adapted to the requirements of typical video coding schemes. Thus, only one clock cycle is required for the execution of operations like quantization or clipping.

The VDSP2 has been designed using a 0.5 μm CMOS technology. It consists of approximately 2 500 000 transistors and can be clocked at frequencies up to 100 MHz. In this case two VDSP2 chips and an additional dedicated motion estimation chip perform the encoding of video data according to the MPEG-2 standard. One VDSP2 is required for decoding.

The AxPe640V [65] is another typical example of the coprocessor approach. To provide high flexibility for a broad range of video processing algorithms, the two processor modules are fully user programmable. A scalar RISC core supports the processing of tasks with data dependent control flow, whereas the typically more computation intensive low level tasks with data independent control flow can be executed by a parallel SIMD module.

The RISC core functions as a master processor for global control and for processing of tasks like variable length encoding and quantization. To improve the performance for typical video coding schemes, the data path of the RISC core has been adapted to the requirements of quantization and variable length coding by an extension of the basic instruction set. A program RAM of size 1024 x 32-b is placed on-chip and can be loaded from an external EPROM during startup. The SIMD oriented arithmetic processing unit (APU) contains four parallel 8-b data paths with a


Fig. 30. AxPe640V overview [65].

Fig. 31. Architecture of the HVC processor [66].

subtracter-complementer-multiplier pipeline. The data path can be configured as two 16-b pipelines. In both cases, the intermediate results of the arithmetic pipelines are fed into a multi-operand accumulator with shift/limit circuitry. The results of the APU can be stored in the internal local memory or read out to the external data output bus.

Since both the RISC core and the APU include a private program RAM and address generation units, these processor modules are able to work in parallel on different tasks. This MIMD-like concept enables the execution of two tasks in parallel, e.g., DCT and quantization.

The AxPe640V is currently available in a 66 MHz version, designed in a 0.8 μm CMOS technology. A QCIF-10 Hz H.261 codec can be realized with a single chip. To achieve higher computational power, several AxPe640V chips can be combined into a multiprocessor system. For example, three AxPe640V chips are required for an implementation of a CIF-10 Hz codec.

A more adapted coprocessor architecture for hybrid coding applications is the HVC processor [66]. The aim of this architecture is to provide a cost efficient video codec for low bitrate applications. Sophisticated coding schemes, e.g., MPEG, are supported by the combination of several processors into a multiprocessor system.

Besides the circuitry for video, channel, and external memory interfacing, the architecture of the HVC consists of a dedicated block matching module for motion vector estimation within a maximum range of +/- 15 pels, a block-level coprocessor (BLC), and a RISC core for complex tasks, e.g., controlling of the quantizer stepsize.

The BLC module consists of two parallel data paths, which have been adapted to the requirements of tasks like DCT/IDCT, filtering, and quantization. Each data path includes four ALUs, a multiplier, and a shift/limit unit. The architecture enables the execution of fast DCT (FDCT)




Fig. 32. TMS320C80 (MVP) [68].

algorithms to achieve a higher performance for this computation intensive task. Compared to the conventional matrix-vector approach, FDCT algorithms require an increased word width for intermediate results to reach a desired arithmetic accuracy. Due to this fact the BLC contains 18-b data paths to conform to the CCITT requirements for the IDCT accuracy. The instruction set of the BLC has been extended to support the computational requirements of tasks like quantization or inverse quantization. In particular, conditional ALU operations, e.g., "subtract if one operand is negative and add otherwise," have been added to the instruction set. Moreover, adaptive quantization is supported by the data path, where one pipeline of the data path processes the DCT coefficients and the other parallel pipeline is used for threshold updating. The control module (CM) is based on a RISC core with ALU and multiplier pipelines. To improve the performance of this processor module, the RISC has been designed to execute two instructions in parallel.
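The conditional ALU operation quoted above ("subtract if one operand is negative and add otherwise") is the kind of primitive that lets a quantizer treat positive and negative DCT coefficients symmetrically in a single instruction. A sketch of one plausible use follows; the quantizer details are illustrative, not the BLC's actual algorithm.

```python
def cond_add_sub(coeff, offset):
    """Conditional ALU op: subtract the offset if the coefficient is
    negative, add it otherwise."""
    return coeff - offset if coeff < 0 else coeff + offset

def quantize(coeff, step):
    """Symmetric quantization using the conditional op for sign-aware
    rounding, so +c and -c always map to +q and -q (illustrative)."""
    r = cond_add_sub(coeff, step // 2)
    return r // step if r >= 0 else -((-r) // step)
```

Without the sign-aware offset, plain floor division would round positive and negative coefficients asymmetrically.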

The HVC architecture has been developed for an implementation based on a 0.6 μm CMOS process. Achieving a clock frequency of about 80 MHz, this architecture enables the implementation of a single chip CIF-30 Hz H.261 codec. The chip size has been estimated at 80 mm2.

The TMS320C80 (MVP) aims at a wide range of image processing applications, including video coding, image generation, and document image processing [68]. Due to the variable requirements of these application fields, the MVP supports a high degree of flexibility. Since the architecture is not adapted to specific tasks, this processor cannot be referred to as a coprocessor architecture in a strict sense.

The MVP consists of four parallel processors (PP) and one master processor (MP). The processors are connected to 50 kbyte of on-chip data memory via a global crossbar interconnection network. A DMA controller provides the


data transfer to an external data memory, and video I/O is supported by an on-chip video interface.

The master processor is a general-purpose RISC processor with an integral IEEE-compatible floating-point unit (FPU). The processor has a 32-b instruction word width and can load or store 8-, 16-, 32-, and 64-b data sizes. The master processor includes a 32 x 32-b general purpose register file. It is intended to operate as the main supervisor and distributor of tasks within the chip and is also responsible for the communication with external processors. Due to the integrated FPU, the master processor will also perform tasks like audio signal processing and 3D graphics transformation.

The architecture of the parallel processors has been designed to perform typical DSP algorithms, e.g., filtering and DCT, and to support bit and pixel manipulations for graphics applications. The parallel processors contain two address units, a program flow control unit, and a data unit with a 32-b ALU, a 16 x 16-b multiplier, and a barrel rotator.

The MVP has been designed using a 0.5 μm CMOS technology. Due to the supported flexibility, about four million transistors on a chip area of 324 mm2 are required. A computational power of 2 GOPS is supported. A single MVP is able to encode CIF-30 Hz video signals according to the MPEG-1 standard.

The examples presented above clarify the wide range of architectural approaches for a VLSI implementation of video coding schemes. The applied strategies are influenced by several demands, especially the desired flexibility of the architecture and the maximum cost for realization and manufacturing. Due to the high computational requirements of real time video coding, most of the presented architectures apply a coprocessor concept with flexible programmable modules in combination with modules which are more




Table 6. Overview of programmable architectures for video coding applications. (Columns: architecture, technology in μm, chip size in mm2, number of chips required for real-time processing of a CIF-30 Hz H.261 codec, and comments. The tabulated technologies range from 0.5 μm to 1.0 μm and the chip sizes from 71 mm2 for the VP up to 324 mm2 for the MVP; the AVP entry lists 132 mm2 for the coder and 112 mm2 for the decoder.)

or less adapted to specific tasks of the hybrid coding scheme. An overview of programmable architectures for video coding applications is given in Table 6.

The normalized AT-criterion can be applied for an architecture comparison of programmable implementations. For this purpose, a typical application has to be selected as a benchmark for the performance of the different architectures. Since most of the references provide information on the performance for an H.261 codec, this application is used for the architecture comparison. The comparison for an implementation of an H.261 codec is depicted in Fig. 33.

Applying the AT-criterion leads to two architecture classes:

Flexible programmable architectures: These processors provide moderate to high flexibility. The architectures are based on coprocessor concepts as well as parallel data paths and deeply pipelined designs with high clock frequency.

Adapted programmable architectures: These architectures achieve an increased efficiency by adaptation of the architecture to the specific requirements of video coding applications. All of these architectures provide dedicated modules for several tasks of the hybrid coding scheme, e.g., motion estimation or variable length coding.

[57] has been designed for MPEG-2 applications. For this comparison the processing power of this architecture has been scaled by the picture format ratio CIF30 HdCCIR601-25 Hz. Since the increased computational complexity of the MPEG-2 standard is neglected, this architecture is expected to achieve a slightly higher performance.

242

# Chips required for realtime processing of a CIF-30 HZ H.261 cod%

2

1

3

2 vc + 3 VP

6

1

5

1

1

4

4

Comments

wlo motion estimation. This chip has been designed for

applications

frame rate 15 Hz

VC's chip size has not been

published

MPEG-2

estimated chipsize

# Chips estimated

Fig. 33 shows that adapted processor designs can achieve an efficiency gain (in terms of the AT-criterion) by a factor of about 6 to 7 compared to a more general architecture. Assuming approximately 200 operations per pel for the implementation of an H.261 codec, this comparison leads to

αC,flex ≈ 100 mm²/GOPS    (25)

for flexible programmable architectures and

αC,adapt ≈ 15 mm²/GOPS    (26)

for adapted programmable architectures.

The figure of merit for the adapted programmable architectures corresponds to that achieved for the DCT (18). It should be noted, however, that different types of operations and different operation mixes underlie these numbers.
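As a cross-check, the arithmetic behind these figures of merit can be sketched in a few lines (a minimal sketch: the 352 x 288 CIF dimensions are standard, while the 200 operations/pel and the two α values are the estimates quoted above):

```python
# Estimate normalized silicon area for an H.261 CIF-30 Hz codec
# from the figures of merit alpha_C (mm^2 per GOPS).

CIF_W, CIF_H = 352, 288   # CIF luminance resolution
FRAME_RATE = 30           # frames per second
OPS_PER_PEL = 200         # approximate operations per pel (from the text)

# Required computational rate: ~0.61 GOPS
gops = CIF_W * CIF_H * FRAME_RATE * OPS_PER_PEL / 1e9

ALPHA_FLEX = 100   # mm^2/GOPS, flexible programmable architectures, (25)
ALPHA_ADAPT = 15   # mm^2/GOPS, adapted programmable architectures, (26)

area_flex = ALPHA_FLEX * gops     # ~61 mm^2 of normalized silicon
area_adapt = ALPHA_ADAPT * gops   # ~9 mm^2 of normalized silicon

print(round(gops, 2), round(area_flex, 1), round(area_adapt, 1))
print(round(ALPHA_FLEX / ALPHA_ADAPT, 1))   # efficiency gain factor, ~6.7
```

The ratio of the two α values reproduces the factor of about 6 to 7 read from Fig. 33.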

VI. CONCLUSION

Video compression applications have been growing in

significance for several years. Consequently, standards have been developed and adopted. There is a large demand for efficient VLSI architectures for implementing these standards. This paper has attempted to give a survey of today's compression schemes, focusing on VLSI implementation aspects. For a coarse classification, the architectural approaches can be subdivided into two classes: dedicated (function-specific) architectures and programmable architectures.

Dedicated architectures enable efficient implementations of specific tasks. Optimization of architectures for dedicated

PROCEEDINGS OF THE IEEE, VOL. 83, NO. 2, FEBRUARY 1995


[Figure: silicon area (mm², scale 200 to 1400) plotted over frame rate (5 to 30 Hz) for the flexible programmable and adapted programmable architecture classes.]

Fig. 33. Normalized silicon area and throughput (frame rate) for programmable architectures for an H.261 codec.

modules is possible by selecting the most appropriate one out of a set of possible architectures. Alternative architectures result from the fact that algorithms can be formulated differently by utilizing algebraic theorems, linearity, and symmetry. For video compression applications, this dedicated approach is mainly applied to the implementation of computation-intensive tasks, namely block matching and DCT/IDCT. Moreover, several dedicated architectures have been proposed for the implementation of complete hybrid coding schemes. The main disadvantage of dedicated architectures is their lack of flexibility. Thus a variety of programmable architectures for video compression applications have been presented in the past. Programmable architectures provide significantly higher flexibility compared to dedicated approaches, since, for example, modifications of the envisaged applications require software changes instead of a more cost-intensive hardware redesign. Generally, this increased flexibility leads to a decreased architectural efficiency. Thus several architectures are based on a combination of adapted, not necessarily dedicated, modules (typically for motion estimation or DCT/IDCT) and programmable modules for the remaining tasks of the hybrid coding scheme, such as quantization, coder control, etc. The implementation efficiency of reported designs can be compared by normalization to a common semiconductor process. The major results of reported designs have been normalized to a fictive 1.0 µm CMOS process. The normalized results fit very well with a linear relationship between silicon area and throughput rate or computational rate. Figures of merit for the silicon area



have been determined for the DCT, block matching, general programmable video signal processor architectures, and adapted programmable architectures. By adapting programmable architectures to specific tasks or by incorporating dedicated modules, efficiency can be improved by a factor of about 6 to 7.
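The normalization itself can be sketched as follows (a minimal sketch, assuming the common first-order linear-shrink model in which die area scales with the square of the feature size; the 0.8 µm / 144 mm² sample values are illustrative, taken from Table 6):

```python
# Normalize a reported chip area to a fictive 1.0 um CMOS process,
# assuming area scales with the square of the feature size (linear shrink).

REF_FEATURE_UM = 1.0   # fictive reference process

def normalized_area(area_mm2: float, feature_um: float) -> float:
    """Scale a reported die area to the 1.0 um reference process."""
    return area_mm2 * (REF_FEATURE_UM / feature_um) ** 2

# Example: a 144 mm^2 die in a 0.8 um process corresponds to
# 144 * (1.0/0.8)^2 = 225 mm^2 of normalized silicon.
print(normalized_area(144, 0.8))   # ~225 mm^2
```

Dividing such normalized areas by the sustained computational rate yields the mm²/GOPS figures of merit compared above.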

This paper focused on recent hybrid coding schemes. Besides this hybrid coding approach, several video compression schemes are being discussed for future applications, in particular object-based coding strategies [100], [101]. These object-based coding strategies rely on more complex tasks, e.g., labeling, segmentation, contour tracking, etc. It follows that more general-purpose video signal processors with high flexibility will be required to fulfill the specific requirements of future video coding schemes.

REFERENCES

[1] CCITT Study Group XV: Recommendation H.261, Video Codec for Audiovisual Services at p x 64 kbit/s, Rep. R37, Geneva, July 1990.

[2] H. G. Musmann, P. Pirsch, and H.-J. Grallert, "Advances in picture coding," Proc. IEEE, vol. 73, no. 4, pp. 523-548, 1985.

[3] ISO-IEC JTC1/SC2/WG11 MPEG-90/176 Rev. 2, 1990.

[4] ISO-IEC JTC1/SC29/WG11 MPEG-93/225, 1993.

[5] A. N. Netravali and B. C. Haskell, Digital Pictures, Representation and Compression. New York: Plenum, 1988.

[6] Tzou Kou-Hu, "Video coding techniques: An overview," in VLSI Implementations for Image Communications. Amsterdam: Elsevier, 1993, pp. 1-47.

[7] G. K. Wallace, "The JPEG still picture compression standard," Comm. ACM, vol. 34, no. 4, pp. 30-44, Apr. 1991.

[8] K. K. Chau, I. F. Wang, and C. K. Eldridge, "VLSI implementation of a 2D-DCT in a compiler," Proc. IEEE ICASSP, pp. 1233-1236, Toronto, Canada, 1991.

[9] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, pp. 88-93, Jan. 1974.

[10] A. Artieri, E. Macoviak, F. Jutand, and N. Demassieux, "A VLSI one chip for real time two-dimensional discrete cosine transform," Proc. IEEE Int. Symp. on Circuits and Systems, Helsinki, 1988.

[11] J. C. Carlach, P. Penard, and J. L. Sicre, "TCAD: A 27 MHz 8 x 8 discrete cosine transform chip," Proc. Int. Conf. on Acoustics Speech and Signal Process., vol. 2.3, 1989.

[12] P. C. Jain, W. Schlenk, and M. Riegel, "VLSI implementation of two-dimensional DCT processor in real-time for video codec," IEEE Trans. Consumer Electron., vol. 38, Aug. 1992.

[13] S. P. Kim and D. K. Pan, "Highly modular and concurrent 2-D DCT chip," Proc. of IEEE Int. Symp. on Circ. and Systems, 1992.

[14] B. G. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoust., Speech and Signal Process., vol. ASSP-32, pp. 1243-1245, Dec. 1984.

[15] M. Matsui et al., "200 MHz video compression macrocells using low swing differential logic," Proc. Int. Solid State Circ. Conf., 1994.

[16] Z. J. Mou and F. Jutand, "A high-speed low-cost DCT architecture for HDTV applications," Proc. IEEE ICASSP, pp. 1153-1156, Toronto, Canada, 1991.

[17] P. Ruetz, P. Tong, D. Bailey, D. Luthi, and P. Ang, "A high-performance full-motion video compression chip set," IEEE Trans. Circuits and Syst. for Video Technol., vol. 2, June 1992.

[18] P. A. Ruetz and P. Tong, "A 160-Mpixel/s IDCT processor for HDTV," IEEE MICRO, vol. 12, no. 5, pp. 28-32, Oct. 1992.

[19] M. Sheu, J. Lee, J. Wang, A. Suen, and L. Liu, "A high throughput-rate architecture for 8 x 8 2-D DCT," Proc. Int. Conf. on Acoust. Speech and Signal Process., pp. 1587-1590, 1993.

[20] D. Slawecki and W. Li, "DCT/IDCT processor design for high data rate image coding," IEEE Trans. Circuits and Syst. for Video Technol., vol. 2, June 1992.


[21] C. Stearns, D. Luthi, P. Ruetz, and P. Ang, "A reconfigurable 64-tap transversal filter," Proc. IEEE Custom Integ. Circ. Conf., 1988.

[22] M. T. Sun, T. C. Chen, and A. M. Gottlieb, "VLSI implementation of a 16 x 16 discrete cosine transform," IEEE Trans. Circ. and Syst., vol. 36, Apr. 1989.

[23] U. Totzek, F. Matthiesen, S. Wohlleben, and T. G. Noll, "CMOS VLSI implementation of the 2D-DCT with linear processor arrays," Proc. Int. Conf. on Acoust. Speech and Signal Process., vol. 3.3, 1990.

[24] Baek et al., "A fast array architecture for block matching algorithm," Proc. IEEE Int. Symp. on Circ. and Syst., vol. 4, pp. 211-214, 1994.

[25] J. Biemond, L. Looijenga, and D. E. Boekee, "A pel-recursive Wiener-based displacement estimation algorithm for interframe image coding applications," Proc. SPIE Visual Comm. and Image Proc. II, vol. 845, pp. 424-431, 1987.

[26] M. Bierling, "Displacement estimation by hierarchical block-matching," Proc. SPIE Visual Comm. and Image Proc., vol. 1001, pp. 942-951, 1988.

[27] C. Cafforio and F. Rocca, "Methods for measuring small displacements of television images," IEEE Trans. Inform. Theory, vol. IT-22, pp. 573-579, Sept. 1976.

[28] -, "The differential method for image scene analysis," in Image Processing and Dynamic Scene Analysis, T. S. Huang, Ed. New York: Springer-Verlag, 1983, pp. 104-124.

[29] K. H. Chow and M. L. Liou, "Genetic motion search for video compression," Proc. IEEE Visual Sig. Proc. and Comm., Melbourne, Australia, pp. 167-170, Sept. 1993.

[30] M. Ghanbari, "The cross-search algorithm for motion estimation," IEEE Trans. Commun., vol. 38, pp. 950-953, July 1990.

[31] Gupta et al., "VLSI architecture for hierarchical block matching," Proc. IEEE Int. Symp. on Circuits and Systems, vol. 4, pp. 215-218, 1994.

[32] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun., vol. COM-29, pp. 1799-1808, Dec. 1981.

[33] Jong et al., "Parallel architectures of 3-step search block-matching algorithms for video coding," Proc. IEEE Int. Symp. on Circ. and Syst., vol. 3, pp. 209-212, 1994.

[34] S. Kappagantula and K. R. Rao, "Motion compensated interframe image prediction," IEEE Trans. Commun., vol. COM-33, pp. 1011-1015, Sept. 1985.

[35] F m et al., "A pipelined systolic array architecture for the hierarchical block-matching algorithm," Proc. IEEE Int. Symp. on Circ. and Syst., vol. 3, pp. 221-224, 1994.

[36] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing," Proc. Nat. Telecom. Conf., New Orleans, pp. G5.3.1-G5.3.5, Nov./Dec. 1981.

[37] J. O. Limb and J. A. Murphy, "Measuring the speed of moving objects from television images," IEEE Trans. Commun., vol. COM-23, pp. 474-478, Apr. 1975.

[38] A. N. Netravali and J. D. Robbins, "Motion-compensated television coding: Part I," Bell Syst. Tech. J., vol. 58, pp. 631-670, Mar. 1979.

[39] A. Puri, H. M. Hang, and D. L. Schilling, "An efficient block-matching algorithm for motion compensated coding," Proc. IEEE ICASSP, pp. 25.4.1-25.4.4, 1987.

[40] R. Srinivasan and K. R. Rao, "Predictive coding based on efficient motion estimation," IEEE Trans. Commun., vol. COM-33, pp. 888-896, Aug. 1985.

[41] D. R. Walker and K. R. Rao, "Improved pel-recursive motion estimation," IEEE Trans. Commun., vol. COM-32, pp. 950-953, July 1990.

[42] O. Colavin, A. Artieri, J. F. Naviner, and R. Pacalet, "A dedicated circuit for real-time motion estimation," EuroASIC, 1991.

[43] De Vos et al., "Parametrizable VLSI architectures for full-search block-matching algorithms," IEEE Trans. Circ. and Syst., vol. 36, Oct. 1989.

[44] De Vos, "VLSI-architectures for the hierarchical block matching algorithm for HDTV applications," SPIE Visual Commun. and Image Proc. '90, vol. 1360, pp. 398-409.

[45] Dianysian et al., "Bit-serial architecture for real-time motion compensation," Proc. SPIE Visual Commun. and Image Proc., 1988.

[46] Hervigo et al., "A multiprocessor architecture for HDTV motion




estimation system," IEEE Trans. Consum. Electron., vol. 38, Aug. 1992.

[47] C. Hsieh et al., "VLSI architecture for block-matching motion estimation algorithm," IEEE Trans. Circ. and Syst. for Video Technol., vol. 2, June 1992.

[48] Komarek et al., "Array architectures for block matching algorithms," IEEE Trans. Circ. and Syst., vol. 36, Oct. 1989.

[49] Y. Tokuno et al., "A motion video compression LSI with distributed arithmetic architecture," Proc. IEEE Custom Integ. Circ. Conf., 1993.

[50] Uramoto et al., "A half-pel precision motion estimation processor for NTSC-resolution video," Proc. IEEE Custom Integ. Circ. Conf., 1993.

[51] Yang et al., "A family of VLSI designs for the motion compensation block-matching algorithms," IEEE Trans. Circ. and Syst., vol. 36, Oct. 1989.

[52] S. F. Chang and D. Messerschmitt, "Designing high throughput VLSI decoder, Part I: Concurrent VLSI architectures," IEEE Trans. Circuits and Syst. for Video Technol., vol. 2, June 1992.

[53] H. D. Lin and D. Messerschmitt, "Designing high throughput VLSI decoder, Part II: Parallel decoding methods," IEEE Trans. Circ. and Syst. for Video Technol., vol. 2, June 1992.

[54] M.-T. Sun, K.-M. Yang, and K.-H. Tzou, "High-speed programmable ICs for decoding of variable-length codes," SPIE Applications of Digital Image Processing XII, vol. 1153, pp. 28-39, 1989.

[55] B. Ackland, "The role of VLSI in multimedia," IEEE J. Solid-State Circ., vol. 29, pp. 1886-1893, Dec. 1992.

[56] T. Akari et al., "Video DSP architecture for MPEG2 codec," Proc. ICASSP '94, vol. 2, pp. 417-420, IEEE Press, 1994.

[57] K. Aono, M. Toyokura, A. Othani, H. Kodama, and K. Okamoto, "A video digital signal processor with a vector-pipeline architecture," IEEE J. Solid-State Circ., vol. 27, pp. 1886-1893, Dec. 1992.

[58] D. Bailey, M. Cressa, D. Neubauer, H. K. J. Rainnie, and C.-S. Wang, "Programmable vision processor/controller," IEEE MICRO, vol. 12, pp. 33-39, Oct. 1992.

[59] S. Bose, "A single chip multistandard video codec," Proc. IEEE Hot Chips V, Stanford, CA, Aug. 1993.

[60] D. Brinthaupt, L. Letham, et al., "A video decoder for H.261 video teleconferencing and MPEG stored interactive video applications," Proc. IEEE Int. Solid State Circ. Conf., pp. 25-27, 1993.

[61] T. Demura et al., "A single-chip MPEG2 video decoder LSI," Proc. IEEE Int. Solid State Circ. Conf., pp. 72-73, 1994.

[62] T. Fautier, "VLSI implementation of MPEG decoders," Int. Symp. on Circ. and Syst., 1994.

[63] H. Fujiwara et al., "An all-ASIC implementation of a low bit-rate video codec," IEEE Trans. Circ. and Syst. for Video Technol., vol. 2, pp. 123-134, June 1992.

[64] T. Fukushima, "A survey of image processing LSIs in Japan," IEEE 10th Int. Conf. on Patt. Recog., Atlantic City, NJ, pp. 394-401, June 1990.

[65] K. Gaedke, H. Jeschke, and P. Pirsch, "A VLSI-based MIMD architecture of a multiprocessor system for real-time video processing applications," J. VLSI Signal Proc., vol. 5, pp. 159-169, Apr. 1993.

[66] W. Gehrke, R. Hoffer, and P. Pirsch, "A hierarchical multiprocessor architecture based on heterogeneous processors for video coding applications," Proc. ICASSP '94, vol. 2, IEEE Press, 1994.

[67] J. Goto et al., "250-MHz BiCMOS super-high-speed video signal processor (S-VSP) ULSI," IEEE J. Solid-State Circ., vol. 26, no. 12, pp. 1876-1884, 1991.

[68] K. Guttag, "The multiprocessor video processor, MVP," Proc. IEEE Hot Chips V, Stanford, CA, Aug. 1993.

[69] C. M. Huizer et al., "A programmable 1400 MOPS video signal processor," IEEE Custom Integ. Circ. Conf., San Diego, CA, May 1989.

[70] IIT Vision Controller, IIT Vision Processor, data sheets, Integrated Information Technology, 1992.

[71] T. Inoue et al., "A 300-MHz BiCMOS video signal processor," IEEE J. Solid-State Circ., vol. 28, Dec. 1993.

[72] K. Konstantinides and V. Bhaskaran, "Monolithic architectures for image processing and compression," IEEE Comp. Graphics and Applications, vol. 12, pp. 75-86, Nov. 1992.

[73] LSI Logic, 64700 series (JPEG: L64735, L64745, L64755; MPEG: L64715, L64720, L64730, L64740, L64750, L64760), data sheet.

[74] T. Micke, D. Muller, and R. Heiß, "ISDN-Bildtelefon auf der Grundlage eines Array-Prozessor-IC" [ISDN videophone based on an array-processor IC], Mikroelektronik, Springer-Verlag, vol. 5, no. 3, pp. 116-119, May/June 1991 (in German).

[75] T. Minami et al., "A 300-MOPS video signal processor with a parallel architecture," IEEE J. Solid-State Circ., vol. 26, pp. 1868-1875, Dec. 1991.

[76] E. Morimatsu et al., "Development of a VLSI chip set for H.261/MPEG-1 video codec," Proc. SPIE Conf., vol. 2094, 1993.

[77] Motorola, "MPEG full motion video decoder," Product Preview, Motorola Ltd., England, 1992.

[78] S.-I. Nakagawa et al., "A 24-b 50-ns digital image signal processor," IEEE J. Solid-State Circ., vol. 25, pp. 1484-1493, Dec. 1990.

[79] T. Nishitani, "Parallel video signal processor configuration based on overlap-save technique and its LSI processor element: VISP," in Journal of VLSI Signal Processing, vol. 1. Amsterdam: Kluwer, 1989, pp. 25-34.

[80] U. Nishii et al., "A 1000 MIPS BiCMOS microprocessor with superscalar architecture," Proc. IEEE Int. Solid State Circ. Conf., pp. 114-115, 1992.

[81] GEC Plessey, "VP 2611 video compression source coder," Preliminary Data, Cheney Manor, Swindon, UK, Mar. 1992.

[82] -, "VP 2615 video reconstruction processor," Preliminary Data, Cheney Manor, Swindon, UK, Mar. 1992.

[83] G. Privat and E. Petajan, "Processing hardware for real-time video coding," IEEE MICRO, vol. 12, no. 5, pp. 9-12, Oct. 1992.

[84] S. C. Purcell and D. Galbi, "C-Cube MPEG video processor," SPIE, Image Processing and Interchange, vol. 1659, 1992.

[85] S. K. Rao, M. H. Matthew, et al., "A real-time P*64/MPEG video encoder chip," Proc. IEEE Int. Solid State Circ. Conf., pp. 32-35, 1993.

[86] P. Ruetz, P. Tong, D. Bailey, D. A. Luthi, and P. H. Ang, "A high-performance full-motion video compression chip set," IEEE Trans. Circ. and Syst. for Video Technol., vol. 2, pp. 111-122, June 1992.

[87] STV3200, STI3208, STI3220 Data Sheet, SGS-Thomson.

[88] S. Sutardja, J. Fandrianto, B. Martin, H. Rainnie, and C.-S. Wang, "A 50 MHz vision processor," IEEE 1991 Custom Integrated Circ. Conf., pp. 12.3.1-12.3.3, 1991.

[89] I. Tamitani et al., "An encoder/decoder chip set for the MPEG video standard," Proc. IEEE Int. Conf. on Acoust. Speech and Sign. Proc., pp. V-661-V-664, 1992.

[90] M. Toyokura et al., "A video DSP with macroblock-level-pipeline and a SIMD type vector-pipeline architecture for MPEG2 CODEC," Proc. IEEE Int. Solid State Circuits Conf., pp. 74-75, 1994.

[91] H. Yamauchi et al., "Architecture and implementation of a highly parallel single chip video DSP," IEEE Trans. Circuits and Syst. for Video Technol., vol. 2, pp. 207-220, June 1992.

[92] ZR36020, ZR36031 Data Sheet, ZORAN.

[93] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI. Reading, MA: Addison-Wesley, 1987.

[94] R. Jain, A. C. Parker, and N. Park, "Predicting system-level area and delay for pipelined and nonpipelined designs," IEEE Trans. Comp.-Aided Design, vol. 11, pp. 955-965, Aug. 1992.

[95] H. Jeschke, K. Gaedke, and P. Pirsch, "Multiprocessor performance for real-time processing of video coding applications," IEEE Trans. Circuits and Syst. for Video Technol., vol. 2, pp. 221-230, June 1992.

[96] H. Jeschke and P. Pirsch, "Performancemodellierung von Multiprozessoranordnungen für die Echtzeitvideosignalverarbeitung" [Performance modeling of multiprocessor arrangements for real-time video signal processing], Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, Universität Hannover, Deutsche Forschungsgemeinschaft, Forschungsvorhaben Pi 169/4-2, 1994.

[97] P. Pirsch, W. Gehrke, and R. Hoffer, "A hierarchical multiprocessor architecture for video coding applications," Int. Symp. on Circ. and Syst., vol. 5, 1993.

[98] T. G. Noll and E. De Man, "Pushing the performance limits due to power dissipation of future ULSI chips," Proc. ISCAS '92, IEEE Press, pp. 1652-1655, 1992.

[99] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1988.

[100] P. Gerken, "Object-based analysis-synthesis coding of image sequences at very low bit rates," to be published in IEEE Trans. Circ. and Syst. for Video Technol., special issue on Very Low Bitrate Video Coding.

[101] H. G. Musmann, M. Hötter, and J. Ostermann, "Object-oriented analysis-synthesis coding of moving images," Signal Processing: Image Commun., vol. 1, no. 2, pp. 117-138, Oct. 1989.

Peter Pirsch (Senior Member, IEEE) received the Ing. grad. degree from the engineering college in Hannover, Hannover, Germany, in 1966, and the Dipl.-Ing. and Dr.-Ing. degrees from the University of Hannover, in 1973 and 1979, respectively, all in electrical engineering.

From 1966 to 1973 he was employed by Telefunken, Hannover, working in the Television Department. In 1973 he became a Research Assistant at the Department of Electrical Engineering, University of Hannover, and a Senior Engineer in 1978. During 1979-1980 and during the summer of 1981 he worked at the Visual Communication Research Department of Bell Laboratories, Holmdel, NJ. From 1983 to 1986 he was Department Head of Digital Signal Processing at the SEL research center, Stuttgart. Since 1987 he has been Professor in the Department of Electrical Engineering at the University of Hannover. His present research includes VLSI implementations for image processing applications and the mapping of image processing algorithms onto array architectures. He is the author or coauthor of more than 100 papers.

Dr. Pirsch was the recipient of the NTG paper prize award in 1982. He presently serves as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.

Nicolas Demassieux was born in 1961. In 1983, he received the engineering degree in telecommunications from École Nationale Supérieure des Télécommunications, Paris.

He joined the staff of École Nationale Supérieure des Télécommunications, where he started to develop research and teaching in the emerging field of VLSI design. He is currently the head of the Department of Electronics, which has activities in systems and integrated circuits (digital and analog), CAD tools, and device modelling. His research covers VLSI and system architecture modelling, optimization, and synthesis, with applications for real-time image and signal processing. In this field, he holds four patents, has been in charge of several VLSI designs, including the world's first real-time 16 x 16 DCT chip, and has managed a number of industrial contracts.

Winfried Gehrke received the Dipl.-Ing. degree in electrical engineering from the University of Hannover, Germany, in 1990.

Since then he has been a Research Assistant at the Laboratory for Information Technology of the Department of Electrical Engineering, University of Hannover. His current research interests are parallel VLSI architectures for video compression and image processing.



