VLSI Design of DCT/IDCT Processor Video verification of the chip has been carried out. The timing simulations

  • View

  • Download

Embed Size (px)

Text of VLSI Design of DCT/IDCT Processor Video verification of the chip has been carried out. The timing...


    VLSI Design of High-speed Time-Recursive 2-D DCT/IDCT Processor for Video Applications

    Vishnu Srinivasan, Member, IEEE, and K. J. Ray Liu, Senior Member, IEEE

    Abstract-In this paper we present a full-custom VLSI de- sign of high-speed 2-D DCTRDCT processor based on the new class of time-recursive algorithms and architectures which has never been implemented to demonstrate its performance. We show that the VLSI implementation of this class of DCT/IDCT algorithms can easily meet the high-speed requirements of high- definition television (HDTV) due to its modularity, regularity, local connectivity, and scalability. Our design of the 8 x 8 DCTnDCT can operate at 50 MHz (or have a 50 MSamplesk throughput) based on a very conservative estimate under 1.2 p CMOS technology. In comparison to the existing designs, our approach offers many advantages that can be further explored for even higher performance.

    I. INTRODUCTION DVANCES in various aspects of digital technology have A made possible many applications of digital video such

    as high-definition television (HDTV), teleconferencing, and multimedia&ommunications. These applications require high- speed transmission of vast amounts of video data. Most video standards such as HDTV video coding, H.261, JPEG, and MPEG use discrete cosine transform (DCT) as a standard transform coding scheme [ 11-[4]. The DCT is, however, very computationally intensive. To realize high-speed and cost- effective DCT for video coding, one needs efficient VLSI implementations so that the high throughput requirements can be matched. There has been considerable research in efficient mapping of these algorithms to practical and feasible VLSI implementations in the recent past [5]-[9]. These have, however, employed irregular butterfly structures with global communications resulting in complex layout, timing, and reliability concems which severely limit the operating speed and expandability in VLSI implementations.

    Recently a new class of transform coding architec- tures based on time-recursive approach has been proposed [lo]-[12]. The complexity of this class of parallel architectures is low, e.g., only 4N - 4 multipliers are needed for computing the 2-D DCT. To perform inverse DCT (IDCT), the computational structure is the same with only an additional multiplier needed [12]. Thus, the DCT and IDCT can be

    Manuscript received November 9, 1994; revised February 23, 1995 and October 5, 1995. This paper was recommended by Associate Editor T. Nishitani. This work was supported in part by NSF Grant MIP9457397, ONR Grant NO001 4-93- 10566 and Maryland Industrial Partnership MIPS/Micro- Star Grant.

    V. Srinivasan is with the Crystal Semiconductor Corporation, Austin, TX 78741 USA.

    K. J. R. Liu is with the Electrical Engineering Department and Institute for System Research, University of Maryland at College Park, College Park, MD 20742 USA.

    Publisher Item Identifier S lOSl-8215(96)01322-5.

    naturally combined and implemented together. This class of architectures has excellent scalability, i.e., the transform size N can be made any integer by adding or deleting computational modules [lo]-[ 121. In addition, these are highly parallel, modular, regular, fully-pipelined, and locally- connected. Thus it is a very good candidate for high-speed video applications. Also, the architecture is very suitable for real-time applications as the time-recursive concept has been exploited to eliminate the waiting time for data to arrive. From the VLSI implementation point of view, the need for global communication is eliminated as the parallel computational IIR structures are decoupled into independent modules.

    In this paper we present a novel VLSI implementation for the time-recursive 2-D DCTDDCT processor. This class of time-recursive parallel architectures has never been designed and implemented to prove its superior properties-our goal here is to show its performance under full-custom VLSI implementation and to make comparisons with other existing VLSI designs based on different algorithms. The chip design has been carefully optimized based on appropriate choice of wordlength and device elements to meet the required signal- to-noise ratio, the design of distributed arithmetic ROM units, and transformation and redistribution of clocking and pipelined stages to maximize the throughput. Complete functionality verification of the chip has been carried out. The timing simulations of the 8 x 8 2-D DCT/IDCT chip shows that it will operate at a system clock rate of 50 MHz corresponding to a data throughput of 50 MSamples/s under 1.2 p CMOS technology which implies that it can perform DCTODCT under the HDTV requirements.

    The paper is organized in the following manner. We give a brief summary of the algorithm and architecture in Section 11. The finite wordlength and architectural considerations are given in Section 111. Section IV discusses the hardware de- sign and the VLSI implementation aspects. A comparison to existing DCTDDCT VLSI design is presented in Section V, followed by the conclusion in Section VI.

    11. ALGORITHM AND ARCHITECTURE The time-recursive two-dimensional DCT for a N x N

    image block {z (m,n) :m = t , t + I,...,t + N - l ; n = 0 ,1 , . . . ,N - 1) is defined [ l l ] , [12] as

    t+N-l N - l 2 N XC(k I, t ) = --C(k)C(I) 4 m , n) m=t n=O

    (1) %-[2(m - t ) + l ] k n(2n + 1)1

    2N cos

    2N ' cos

    10.5-8215/96$05.00 0 1996 IEEE

  • where

    Mdtbiier I Forward DCT if k = 0, Inverse DCT I, otherwise. M I

    M2 The time index t in X, (k , t ) denotes that the transform starts from x( t ) . The 2-D IDCT is defined in a similar manner

    -2 cos (G) ' -2 cos ( (2nz+N1)* ) (-l)kC(k)J+OS (S) J j$ cos ( 2n+l ;', N-l)?r

    DCTYIDCT is a very computationally intensive operation. To be able to use this technique for high-throughput applications such as HDTV coding, an efficient VLSI implementation is essential. In the past, several fast algorithms have been mapped onto VLSI chips, but they are not particularly well adapted for VLSI where regularity, modularity, timing, layout complexity, and area are of more concern. The IIR algorithm [I21 for the computation of the DCT is a direct 2-D method and does not require transposition, unlike more traditional row- column algorithms. The structure is derived by considering the transform operation to be a filter which transforms the serial input data into their transform coefficients.

    The 1-D DCT and its inverse for a N block input data, starting from z ( t ) and ending with z ( t + N - 1) are defined as


    X C ( k , t + N - 1) = C ( k ) 6 4 n ) n=t

    .cos [ ( n - t + $ $ ] ,

    k = O , l , . . . , N - l , (3)

    x,(n,t + N - 1) = k=t

    . cos [ ( n + ;) 1 where C( k ) is as defined in (1). To derive the transfer function for the forward 1-D DCT, consider the one-sided 2 transfonn of the left-hand side (L.H.S.) of (3):


    Z(L.H.S.) = X,(k, t + N - l)Yt = X, ,k (z )P- ' (5) t=O

    Likewise, the one-sided Z transform of the right-hand side (R.H.S.) in (4) can be written as:


    M3 I Not applicable 1 - 1)

    Further manipulations on (5) and (6) show the transfer function for (3) to be (7).


    Similarly, taking the one-sided 2 transform on (4), the transfer function of the I-D inverse DCT transform can be shown in (81, at the bottom of the page.

    If we are, however, interested in only the N-block trans- form, all z - ~ , where k > N terms drop off, and (7) and (8) can be simplified to:

    H&) = ( - l ) k C ( k ) fi - cos (&)

    In an actual implementation this is achieved by resetting all registers every N clock-cycles. The signal flow graph (SFG) shown in Fig. 1 implements (9) or (IO), i.e., the forward and the inverse I-D DCT depending on the multiplier coefficients and the modifications to the SFG as indicated by the dashed lines. The multiplier coefficients are essentially derived from (9) and (lo). Depending on whether the Forward or the Inverse transform is being computed, M1, M2, and M 3 are defined as shown in Table I.

    The kernel shown in Fig. 1 computes a single DCT channel coefficient based on the multiplier coefficient encoded in that particular filter. N such parallel modules (each with the appropriate multiplier coefficient corresponding to k ) form


    - ----) M(O)

    c - M(2) ~

    Shift Second Register

    - s Dimension Array ----1


    A + -

    Fig. 1. IIR Structure for DCT/IDCT computation.


    (even) (even)

    M(N-2) =ll First Dimension



    11 Precision I Size I Delay ROM (one coeff.) 11 12 x 12 I 1217X x 1532X I 15 nS

    - - M(1) c - M(3)

    + ___ -

    Shift - Second Register

    (odd) (odd)

    - - s Dimension Array -1 -

    A - - + M(N-1) -

    a filter bank that computes the N coefficients of the 1-D transform. Every N cycles, the 1-D transform coefficients for a new data set are computed in parallel by the N filter bank modules. These 1-D transform coefficients are then fed through the circu