Transcript

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 42, NO. 5, MAY 1995

VLSI Architectures for the Discrete Wavelet Transform

Mohan Vishwanath, Member, IEEE, Robert Michael Owens, and Mary Jane Irwin, Fellow, IEEE

Abstract- A class of VLSI architectures based on linear systolic arrays, for computing the 1-D Discrete Wavelet Transform (DWT), is presented. The various architectures of this class differ only in the design of their routing networks, which could be systolic, semisystolic, or RAM-based. These architectures compute the Recursive Pyramid Algorithm, which is a reformulation of Mallat's pyramid algorithm for the DWT. The DWT is computed in real time (running DWT), using just N_w(J-1) cells of storage, where N_w is the length of the filter and J is the number of octaves. They are ideally suited for single-chip implementation due to their practical I/O rate, small storage, and regularity. The N-point 1-D DWT is computed in 2N cycles. The period can be reduced to N cycles by using N_w extra MACs. Our architectures are shown to be optimal in both computation time and area. A utilization of 100% is achieved for the linear array. Extensions of our architecture for computing the M-band DWT are discussed. Also, two architectures for computing the 2-D DWT (separable case) are discussed. One of these architectures, based on a combination of systolic and parallel filters, computes the N²-point 2-D DWT, in real time, in N² + N cycles, using 2NN_w cells of storage.

I. INTRODUCTION

IN THE LAST few years there has been a great amount of interest in wavelet transforms, especially after the discovery of the Discrete Wavelet Transform (DWT) by Mallat [8], [9]. The DWT [8], [9] can be viewed as a multiresolution decomposition of a signal. This means that it decomposes a signal into its components in different frequency bands (to be specific, in octave bands). The Inverse DWT (IDWT) does exactly the opposite, i.e., it reconstructs a signal from its octave band components. The applications of this transform (and its slight variants) are numerous, ranging from image and speech compression to solving partial differential equations [8], [2], [6], [20]. In this paper we study the feasibility of implementing the DWT (both 1-D and 2-D) in VLSI, and we propose architectures, based on linear systolic arrays, for computing the DWT in VLSI. All the architectures (except one) are based on the Recursive Pyramid Algorithm (RPA) [16]. The RPA is a reformulation of the pyramid algorithm discovered by Mallat

Manuscript received October 9, 1992; revised February 17, 1994. Preliminary versions of parts of this paper were presented at ASAP'92 and the IEEE VLSI Signal Processing Workshop, 1992. This paper was recommended by Associate Editor K. Yao.

M. Vishwanath is with the Computer Science Lab, Xerox Palo Alto Research Center, Palo Alto, CA 94304 USA.

R. M. Owens and M. J. Irwin are with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 USA.

IEEE Log Number 9410949.

[8] and is highly amenable to VLSI implementation. We show that there is a strong link between the RPA and linear systolic arrays. These architectures can be extended to handle most other QMF filter bank trees [19], [18]. The area and time complexities of the architectures described in this paper are shown to be optimal. These architectures are highly flexible and can be easily scaled to handle filters of any size (and they are independent of the input size). These architectures can also be used for computing the M-band DWT. We also show that 100% utilization of the linear systolic array (filter) is always possible for the DWT.

In the 2-D case the dependence on the input size (the smaller of the two dimensions) cannot be eliminated because of the limited I/O rate and the row scan (raster scan) or column scan input format. In this paper we only consider the separable case of the 2-D DWT. The architectures for the 2-D DWT rely, to a large extent, on the 1-D architecture.

Previous related work, definitions, and the complexity results are presented next. The RPA is briefly introduced in Section III. The 1-D architectures are presented in Section IV, while the architectures for the 2-D DWT are presented in Section V. Performance figures and comparisons (with each other and with the architectures described in [1] and [5]) are presented in Section VI.

II. PRELIMINARIES

Very little work has been done in mapping the DWT into VLSI. The first architecture for computing the DWT was designed by Knowles [5]. This architecture was not well suited for VLSI, since it used large multiplexors for routing the intermediate results. Later, Lewis and Knowles [7] designed an architecture for computing the 2-D DWT. A major drawback of this architecture is that it is heavily dependent on the properties of a specific wavelet, namely, the Daubechies 4-tap wavelet. In fact it needs no multipliers when used with the Daubechies 4-tap wavelet, but it is not an architecture which would work efficiently with any other wavelet. Aware, Inc. has come out with a chip called the Wavelet Transform Processor (WTP) [1]. It essentially consists of a 4-tap filter (in this case, 4 multiply-accumulate cells) and some external memory and control, and has no special features that take advantage of the structure of the DWT. It relies heavily on software for computing the DWT. Recently, Parhi and Nishitani have proposed folded architectures and digit-serial architectures for the 1-D DWT [11]. These architectures do not easily scale with the filter size and the number of octaves computed.


Fig. 1. The DWT filter bank.

A. The Wavelet Transform

The Wavelet Transform (WT) of a signal x(t) is given by

    W(u, s) = (1/√s) ∫ x(t) h((t - u)/s) dt

where h(t) is the wavelet function. The Wavelet Transform of a sequence x(i) (a sampled version of the continuous signal x(t)), discretized on a grid whose samples are arbitrarily spaced both in time (b) and scale (a) [12], is given by

    W(b, a) = (1/√a) Σ_i x(i) h((i - b)/a),  i = b, ..., b + N_w·a - 1        (1)

where N is the number of input samples, N_w is the size of the support of the basic wavelet h, and h is obtained by sampling h(t). Also, a is of the form a = c·a_0^m, a_0 > 1, c is a constant, and m is an integer. The number of distinct m considered is J; in other words, J is the total number of scales. At each scale k, a = c·a_0^k, and the number of samples in the time dimension is B_k, where B_k ≤ N. Thus the properties of the wavelet transform are heavily dependent on the properties of the basic wavelet. All the architectures that we have developed in this paper are independent of the wavelet function and are hence flexible. In general there are two special cases of the WT, the Discrete Wavelet Transform (DWT) and the Continuous Wavelet Transform (CWT). In this paper we have only considered the former.

DWT: The DWT can be viewed as the multiresolution decomposition of a sequence [8]. It takes a length-N sequence x(n) and generates an output sequence of length N. The output is the multiresolution representation of x(n). It has N/2 values at the highest resolution, N/4 values at the next resolution, and so on. Let N = 2^P and let the number of frequencies, or resolutions, be J. (Since we are only considering octaves, J ≤ P.) The structure of the DWT is due to the dyadic nature of its time-scale grid; the points on the grid that we are concerned with are such that B_k = N/2^k, a_0 = 2, and a = 2^k, k ∈ {0, 1, ..., J - 1}. The DWT filter bank structure is shown in Fig. 1.
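To make the filter bank recursion of Fig. 1 concrete, the following minimal Python sketch performs the filter-and-decimate pyramid and reports the band sizes N/2, N/4, ..., N/2^J. It is an illustration added here, not the authors' implementation; the Haar filter pair and the circular convolution are assumptions made only to keep the example self-contained.

    # Illustrative sketch of the 1-D DWT pyramid (Fig. 1).
    # Assumptions: circular convolution and example Haar filters;
    # the paper's filters w(n), h(n) are a design choice.
    def convolve_decimate(x, f):
        n = len(x)
        # FIR filtering (circular) followed by decimation by 2
        y = [sum(f[m] * x[(i - m) % n] for m in range(len(f))) for i in range(n)]
        return y[::2]

    def dwt(x, lp, hp, J):
        bands = []
        for _ in range(J):
            bands.append(convolve_decimate(x, hp))  # highpass band of this octave
            x = convolve_decimate(x, lp)            # lowpass output feeds next octave
        bands.append(x)                             # final lowpass residue
        return bands

    lp, hp = [0.7071, 0.7071], [0.7071, -0.7071]    # Haar, for illustration only
    print([len(b) for b in dwt(list(range(16)), lp, hp, 4)])   # [8, 4, 2, 1, 1]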

CWT: In its most general form, the CWT is defined by (1). In other words, the CWT is defined as a WT with no decimation at any scale and at any desired frequency resolution. Thus B_k = N at all the J scales. The CWT takes a length-N input sequence and produces a length-N output sequence at each scale. The most commonly used version of the CWT is one [12] where the frequency spacing (resolution) is logarithmic (octaves), as in the case of the DWT. Thus a length-N input sequence produces a length-N output sequence at each of the J scales, where J ≤ log N.


B. Lower Bounds

In this section we present lower bounds for computing the wavelet transforms. The bounds have been derived in [15] and [17]. These bounds are for single-chip implementations and are derived under the following practical spatial restrictions on the I/O protocol [3].

1) Unilocal: Each input/output bit is available at only one pad.

2) Place-determinate: I/O data are available at prespecified (instance-independent) places.

3) Word-local: For any cut l partitioning the chip, only a constant number of input (output) words have some bit entering (exiting) the chip on each side of l. That is, except for maybe a small number of inputs (outputs), all the k bits of an input (output) enter (exit) the chip on either the left or the right side of the partition.

The results outlined below are for the 1-D case. These results can be extended directly to the 2-D case. All these bounds hold under the assumption that J ≤ ⌊log₂(N/N_w)⌋ + 1. Bounds have been derived for the case when J does not satisfy this condition [15]. Let a_0 = 2 and N = 2^P. Then the area A and time T satisfy the following lower bounds.

For the 1-D DWT: AT² = Ω(J²N_w²k²).
For the 1-D DWT under the word-serial model: A = Ω(JN_w k) and T ≥ N. (The word-serial model is one in which, at any time instant, at most one input (output) word has some, but not all, of its bits already read (written).)
For the 1-D CWT: AT² = Ω(N²N_w²k²).
For the 1-D CWT under the word-serial model: A = Ω(NN_w k) and T ≥ NJ.

Note that under the word-serial model, the pipeline period for an N-input, N-output function is lower bounded by N. Also, under this model, the computation delay is lower bounded by the pipeline period. Lower bounds for the 2-D case can be obtained [15] by simply replacing N by N² and N_w by N_w² in the bounds for the 1-D case (this is in contrast to the 1-D and 2-D DFT case, where a different technique had to be used to obtain the bounds for the 2-D DFT). However, for the 2-D case, the word-serial model does not place any restriction on the order of the inputs (as long as they are input in a word-serial manner). This means that the inputs could be available in a predetermined but arbitrary order. This is not the case in practical imaging systems, where the digital image, say, is available in a raster scan format. Lower bounds with this additional constraint (that is, in addition to the word-serial constraint) have not been derived, but it is conjectured that for the 2-D DWT the area is bounded by A = Ω(NN_w k) and the time is bounded by T ≥ N². The systolic array based architectures for the 1-D and 2-D DWT (described in the next few sections) satisfy the word-serial model, and are optimal, both in terms of area and time.
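As a worked reading of the word-serial bounds, the short sketch below (added here; the parameter values are arbitrary examples, not from the paper) evaluates A = Ω(JN_w k) and T ≥ N for a representative configuration. Constant factors are omitted, since the bounds are asymptotic.

    import math

    # Example parameters (ours): input length, filter taps, precision.
    N, Nw, k = 1024, 6, 16
    J = math.floor(math.log2(N / Nw)) + 1   # deepest J covered by the bounds
    A_bound = J * Nw * k                    # A = Omega(J * Nw * k) bits
    T_bound = N                             # T >= N cycles
    print(J, A_bound, T_bound)              # 8 768 1024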

III. THE RECURSIVE PYRAMID ALGORITHM

Consider an N-point DWT, N = 2^P. We are interested in J octaves, J ≤ log N.

Fig. 2. Dyadic sampling grid for the DWT.

Firstly, we present the classical pyramid algorithm for the DWT [8]:

begin{Direct Pyramid}
  input: x[1..N]
  for (j = 1 to J)
    Do the stage j filtering using the output of stage (j-1) as input
end{Direct Pyramid}

A direct implementation of the above algorithm either requires J filters (too expensive) operating in a pipelined fashion, or it requires one filter and O(N) memory, which is again too expensive for a 1-D running DWT.

The Recursive Pyramid Algorithm (RPA) [16] is a reformulation of the classical pyramid algorithm for the DWT which allows computation of the DWT in a real-time (running) fashion using just N_w(J - 1) cells of storage. It consists of rearranging the order of the N outputs such that an output is scheduled at the 'earliest' instance that it can be scheduled. The 'earliest' instance is decided based on a strict precedence relation, i.e., if the 'earliest' instance of the ith octave clashes with that of the (i + 1)th octave, then the ith octave output is scheduled. A simple way of obtaining this output schedule is to consider the sampling grid for the DWT output, shown in Fig. 2. Now push (up or down) all the horizontal lines of samples until they form a single line. The order of the outputs obtained in this manner gives us the output schedule. The pseudocode for the RPA is shown in Fig. 3. The output schedule generated by the RPA for N = 16 and J = 4 is shown below, where y_i(n) is the lowpass output of the ith octave:

y1(1), y2(1), y1(2), y3(1), y1(3), y2(2), y1(4), y4(1),
y1(5), y2(3), y1(6), y3(2), y1(7), y2(4), y1(8), ...        (2)

The highpass output schedule is exactly the same.
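The schedule (2) can be generated mechanically: slot i of the schedule carries a first octave output when i is odd, and otherwise defers to slot i/2 of the next octave. The following Python sketch (an illustration added here) reproduces (2) for N = 16.

    # Slot i of the RPA schedule: octave 1 when i is odd, else recurse.
    def rpa_slot(i, j=1):
        if i % 2 == 1:
            return (j, (i + 1) // 2)        # (octave, output index)
        return rpa_slot(i // 2, j + 1)

    print([rpa_slot(i) for i in range(1, 16)])
    # [(1,1), (2,1), (1,2), (3,1), (1,3), (2,2), (1,4), (4,1),
    #  (1,5), (2,3), (1,6), (3,2), (1,7), (2,4), (1,8)]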

The following are the main advantages of this algorithm. Since each output of the jth octave is scheduled at the 'earliest' instance, only the latest N_w outputs of the (j - 1)th octave need to be stored. Thus a total of at most N_w(J - 1) words of storage is required. Note that it is N_w(J - 1) and not N_w·J, since the last octave's output need not be stored. Due to its structure it is highly amenable to systolic and pipelined approaches.

begin{Recursive Pyramid}
  input: X(i) = x(i), i in [1, N]      /* N is a power of 2. */
                                       /* For ease of notation, the first N  */
                                       /* values of X are assumed same as x. */
  for (i = 1 to N-1)                   /* Once for each output */
    rdwt(i, 1)
end{Recursive Pyramid}

rdwt(i, j)
begin{rdwt}
  if (i is odd)
    k = (i+1)/2
    /* Compute output number k of octave j. This is computed using   */
    /* the last L outputs of octave (j-1). Note that the kth o/p of  */
    /* octave j is y(k+(N-(N/2**(j-1)))), where 2**j is 2 to the     */
    /* power j. The o/ps of the lowpass filter are the X's.          */
    sumL = 0
    sumH = 0
    for (m = k down to (k-L+1))
      sumL = sumL + X(m+(N-(N/2**(j-2)))) * w[k-m];
      sumH = sumH + X(m+(N-(N/2**(j-2)))) * h[k-m];
    X(k+(N-(N/2**(j-1)))) = sumL;      /* Lowpass output  */
    y(k+(N-(N/2**(j-1)))) = sumH;      /* Highpass output */
  else                                 /* Recursion to determine correct octave */
    rdwt(i/2, j+1)
end{rdwt}

Fig. 3. The recursive pyramid algorithm.

It can be combined with the short-length FIR filtering algorithms to obtain a reduction in the number of operations (multiplications and additions) [16]. This is comparable to that obtained by computing the classical pyramid algorithm using the short-length FIR filtering algorithms [10]. Since the filters used in the DWT are generally short, this technique needs fewer operations than the FFT approach [12]. It produces the output in an order which is ideal for many applications, like subband coding and transmultiplexers.

In the next section we show how the parallelism inherent in this algorithm is exploited.

IV. ARCHITECTURES FOR COMPUTING THE DWT

In this section we present three architectures for computing the DWT. All three are based upon linear systolic arrays and implement the RPA, but use different ways (systolic, semisystolic, and memory based) of routing the N_w(J - 1) stored words.

As shown in Fig. 1 and as explained in the second section, the DWT can be implemented as a sequence of convolutions performed on exponentially decreasing (in size) input sets. In fact there are at most log N stages, i.e., octaves, for an N-point input sequence. The input to each stage comes from the previous stage, except for the first stage, which is fed the input sequence. Of the N outputs, N - 1 terms are output from the highpass filters, while the remaining 1 term is the output of the lowermost lowpass filter (see Fig. 1). But since the bottleneck is the lower leg, i.e., the lowpass filter branch, we mainly consider the lowpass filtering branch. The recurrence relations, in space-time, for the DWT (lower branch) are shown below. Here x(n) is the input sequence, y_j(n) is the output of the jth octave, and w_m is the lowpass filter.


Input Equations:

    W_{j,n,0} = 0,          1 ≤ j ≤ log N, 1 ≤ n ≤ N/2^j
    W_{0,n,N_w} = x(n),     1 ≤ n ≤ N.

Computation Equation:

    W_{j,n,m} = W_{j,n,m-1} + W_{j-1,2n-m+1,N_w} · w_m,
                1 ≤ j ≤ log N, 1 ≤ n ≤ N/2^j, 1 ≤ m ≤ N_w.        (3)

Output Equations:

    y_1(n) = W_{1,n,N_w},            1 ≤ n ≤ N/2
    y_2(n) = W_{2,n,N_w},            1 ≤ n ≤ N/4
    ...
    y_{log N}(n) = W_{log N,n,N_w},  n = 1.        (4)

Fig. 4. Overall architecture.

The general approach we take is to map the first

dimension, j, of the linear recurrences described above, onto time. Now the remaining part is just a convolution operation followed by a decimation. This is computed in the standard manner on a linear systolic array, i.e., the second dimension, namely n, is mapped onto time and the third dimension, namely m, is mapped onto the cells of the array. The decimation by 2 is easily handled by inserting zeros in the appropriate slots of the output schedule. We now present three architectures for computing the 1-D DWT using the RPA. These architectures have identical features and differ only in the implementation of their routing network. These architectures can handle most QMF filter bank trees, like the Laplacian pyramid and subband coding.
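For reference, the recurrences (3) and (4) can also be evaluated directly in software. The Python sketch below (added here as an illustration; treating out-of-range terms as zero is our simplifying boundary assumption) makes the three dimensions explicit before they are mapped onto time and cells.

    import math

    # Direct evaluation of (3)-(4): j indexes octaves, n indexes output
    # samples, m indexes filter taps.
    def dwt_recurrence(x, w):
        N, Nw = len(x), len(w)
        J = int(math.log2(N))
        prev = {n: x[n - 1] for n in range(1, N + 1)}   # octave 0 = input
        out = {}
        for j in range(1, J + 1):
            cur = {}
            for n in range(1, N // 2**j + 1):
                acc = 0.0                                # W[j][n][0] = 0
                for m in range(1, Nw + 1):
                    # W[j][n][m] = W[j][n][m-1] + W[j-1][2n-m+1][Nw] * w[m]
                    acc += prev.get(2 * n - m + 1, 0.0) * w[m - 1]
                cur[n] = acc
            out[j] = cur
            prev = cur
        return out

    print(dwt_recurrence([1.0] * 8, [0.5, 0.5]))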

A. Systolic Architecture

We now describe an efficient (in both area and time) and practical systolic architecture for computing the DWT. Our architecture can be used to compute the DWT in a near-optimal manner (Area = O(N_w kJ) and delay = period = 2N). Note that achieving this delay/period is trivial if we are allowed Area = O(Nk), but it is impractical to allow the size of the design to scale linearly with the size of the quasi-infinite input sequence, especially since this architecture will also be used as a module in the 2-D architecture. Also note that N_w ≪ N and J ≤ log N, and for most practical applications J is between 3 and 6. The architecture is easily scalable and is particularly well suited for subband coding (using the DWT). It can easily handle M-band extensions of the DWT and can achieve 100% utilization of the linear systolic array filter. It can be easily modified to compute the Laplacian pyramid.

Our architecture basically implements the RPA and is designed to meet the following goals:

1) Systolic;
2) I/O rate bounded as O(k), where k is the precision;
3) Period and delay of O(N), where N is the input length;
4) Area that is independent of the input length;
5) Scalable with filter length and number of octaves.

Given the second goal, the lower bound on the period and delay is obviously O(N), i.e., the third goal hits the lower bound. The overall architecture is shown in Fig. 4. The DWT is computed as follows on our architecture. The lowpass filtering (lower leg) in Fig. 1, i.e., the sequence of filtering by

w(n) and decimating, is computed in F2, while the highpass filtering is done in F1. These filters are built using a linear systolic array. The routing network is systolic and basically consists of a mesh of registers. It has J - 1 rows and N_w columns, where J is the number of octaves (≤ log N) and N_w is the size of the filter. The first and last columns of the routing network consist of special cells which do the actual routing function. The output port is just a simplified version of the first column of the routing network, and during the analysis phase it produces an ith octave output every 4 × 2^(i-1) clock cycles. Assuming minimum area for the multiply-accumulate cells of the filter (each cell needing an area of O(k) [13]), the total area is bounded as O(N_w kJ) (since there are N_w(J - 1) cells in the routing network and each has a capacity of k bits). Also, the architecture can be easily scaled with the filter size by stripping off the R_r-cell column and cascading to the right (and putting the R_r-cell column back as the rightmost column). It can also be scaled with the number of octaves by cascading the routing network vertically.

If one looks carefully at (3), then one can see that the output schedule of the RPA essentially consists of interspersing the computation of the various octaves in between the first octave computations. The basic idea behind our architecture is to interleave the computation of the different octaves by computing the largest convolution (the first octave) in the normal manner on the linear systolic array and interspersing the computation of the remaining octaves in between the first octave computations (thus it implements the RPA). Only the first octave is computed in the conventional systolic manner on the linear array; all the other octaves are computed in an unconventional manner. Given the practical I/O rate constraint, it produces the outputs in optimal time. The interspersing of the computations of the various octaves is managed by the routing network. Our algorithm is general in the sense that it can be mapped to a number of other architectures depending on the needs. For example, in the next sections, we show how the systolic routing network can be replaced by a semisystolic one and also how it can be replaced by address generators and a RAM. The detailed working of the architecture is presented next.

We know that the RPA schedules an output as soon as it can, but we have to show that this schedule can be implemented correctly on a linear array. This can be proved as follows: consider any y_j(k), and let us assume that at some point in time


y_j(k) is at the ith cell of the array. Thus, at the input side of this cell, y_{j-1}(2k - i) should be available at this point in time. If the output schedule is formed as described above, then clearly y_{j-1}(2k - i) has been scheduled at least 2i + 1 places before y_j(k), i.e., it is output at least 4i + 1 clock cycles before y_j(k) (because, as we show below, not more than 2 outputs are produced every four clock cycles). Thus it is always possible to make this value available at the ith cell.

Fig. 5. Snapshots of the lowpass branch.

Snapshots of the lowpass filter are shown in Fig. 5 (it is assumed to be FIR of length 4). Each cell of the filter is a MAC (multiply-accumulate) cell with slight modifications, as explained below. During the odd clock cycles each cell of the filter takes an input from the stream of x's, while during the even cycles it takes input from the routing network. Thus, while on one hand the filter is computing the first octave (only the first octave depends directly on x(n)), on the other hand it is computing the other octaves interspersed between the first octave computation, where the interspersing is as per the output schedule of the RPA. The inputs to the lowpass and the highpass filters are the same. Therefore the whole DWT is computed in 2N steps (since the first octave takes exactly 4 × N/2 = 2N steps). It is clear that the routing network is the heart of the above strategy, and we elaborate on its design below.
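A small sketch (added here) of the resulting output timing: an ith octave output appears once every 4 × 2^(i-1) cycles (up to the small offsets within each 4-cycle block), so every octave finishes by cycle 2N.

    # Nominal output times per octave; offsets within each 4-cycle
    # block are ignored in this illustration.
    def output_times(N, J):
        return {i: [4 * 2**(i - 1) * n for n in range(1, N // 2**i + 1)]
                for i in range(1, J + 1)}

    t = output_times(16, 4)
    print([t[i][-1] for i in (1, 2, 3, 4)])   # [32, 32, 32, 32] = 2N for all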

1) The Routing Network: The job of the routing network is to provide the filter cells with the right values during the even cycles (for computing all the octaves other than the first one). These


octaves are computed in an unconventional manner on the systolic array. When an output variable enters the filter (at the right corner), the input it needs at that cell, say t, is provided by the routing network. At the next cycle it (the output variable) moves left, and the input it needs at that cell, say u, is again provided by the routing network, and so on until it reaches the leftmost cell and is output to either the output port (highpass filter) or the routing network (lowpass filter). Unlike in regular systolic convolution, the inputs t and u do not move along with the input stream in the direction opposite to the output stream (i.e., to the right). This has to be done because, if we try to schedule all the inputs in the conventional manner, then the number of cycles needed to compute J octaves (with N inputs) becomes proportional to NJ (it is easy to see this if one considers a filter of length > 4). In this case, the storage needs also increase to O(Nk).

Fig. 6. The routing network.

We first consider a systolic version of the network. As shown in Fig. 6, it consists of an N_w × (J - 1) mesh of registers. Each of the J - 1 rows of registers is associated with an octave. The clocking of these registers is controlled by the R_r-cells, and they are loaded (just the two leftmost registers) by the R_l-cells. The loading and the shifting occur out of phase, with the loading being the first phase. Since there are 2 outputs for every 4 clocks of the filter, the R_l-cells are clocked twice for every 4 clocks. The R_r-cells are in lock-step with the rightmost corner of the filter and are clocked once for every 4 clocks of the filter. This is because there is a nonfirst octave output scheduled only once every 4 clocks. The input/output relations of the R_r, R_l, and S_ij (storage) cells are shown in Fig. 7. One can view the R_l-cells as routing the inputs to the network and the R_r-cells as scheduling the outputs from the network. Each one of the columns of registers is clocked once every 4 clock cycles. The ith column of the network gets clkup one cycle before the (i - 1)th column (this includes the R_r-cell column and excludes the R_l-cell column). This is required for skewing the inputs to the filter (see Fig. 5). Each S_ij-cell contains 2 registers; one register holds the current jth value of the ith octave, while the other register holds the upward shift value. When clkup is true, an S-cell will release the contents of the former register to the upward stream only if sup (shift up) is also true, and it does this by copying the contents of the first register onto the second, i.e., it shifts up and also retains the value. sup is propagated from right to left in a row; note

Fig. 7. Cells of the routing network. The thick lines are k-b wide, while the thin lines are 1-b wide.

that this is consistent with the lag of clkup mentioned earlier. The signal sri, which controls the right shift of a cell, follows two cycles behind sup and is true for 2 cycles, i.e., a cell shifts its contents up and then right twice. The rightward shifting happens in an incremental manner, i.e., sri is propagated from right to left two cycles behind sup. This right shifting by two is done because an ith octave output is produced for every two (i - 1)th octave outputs. Therefore, if the filter length is N_w, then after an ith octave output is produced, the first N_w - 2 values of the (i - 1)th octave become its last N_w - 2 values. Hence the right shift by two.
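The row update can be pictured with a short sketch (added here): one routing-network row holds the N_w latest (i-1)th octave outputs; on each ith octave output, the two oldest values are shifted out on the right and the two newest values are loaded on the left, as done by the R_l cells.

    from collections import deque

    # One row of the routing network, newest values on the left.
    def update_row(row, newest_two, Nw):
        row.pop(); row.pop()                 # right shift by two (drop oldest)
        row.appendleft(newest_two[1])        # second cell is loaded first
        row.appendleft(newest_two[0])
        assert len(row) == Nw
        return row

    row = deque([4, 3, 2, 1])                # Nw = 4 latest outputs
    print(update_row(row, (6, 5), 4))        # deque([6, 5, 4, 3])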

An ith octave output is scheduled once every 4 × 2^(i-1) cycles; therefore the initiation of sup for the (i - 1)th row is done once every 2^(i-1) clocks by the R_r cell which is connected to it (remember that one clock of the R_r cell is equivalent to 4 clocks of the filter). This is controlled by a token of length (at most) log log N that it contains. Each R_l-cell also contains a similar token, which it uses to decide if the current output belongs to its particular octave. The R_l-cells have to route the output y_i(k) of the filter to the ith row. This is easier than it seems; for example, a first octave output is produced every 4 clock cycles, a second octave output is produced every 8 clock cycles, etc. Thus the R_l-cells just shift the tokens right every clock cycle, and the R_l-cell with a zero will be the destination row. As shown in Fig. 6, each R_l-cell is connected to two cells of its row. It alternately loads these cells, with the second cell being loaded first. As explained earlier, these will be the two latest values of that octave, the other N_w - 2 being the shifted values. An R_r-cell (or R_l-cell) consists of a log log N-bit comparator (or decrementer), two log log N-bit registers, and about 10 gates of glue logic. For all practical purposes log log N ≤ 4.

A point worth noting is that the leftmost cells of the second and third rows of the network are slightly different in that, as soon as they are loaded, they store this value and shift it up, too. This is needed since the number of clock cycles it takes a value to propagate up from the ith row to the filter is 4i. And, according to our output schedule, the minimum number of clock cycles between the production of an ith octave output and the production of an (i+1)th octave output, which depends


on the former, is 4 × 2^(i-2). Thus, for proper operation of the network we require that 4i ≤ 4 × 2^(i-2). This is obviously not true for i = 1, 2, 3. But i = 1 is a special case, since the sup signal reaches the leftmost cell of the first row just in time.

Fig. 8. Semisystolic routing network.

B. Semisystolic Architecture

We now consider an alternative to the routing network.

Instead of having the network systolic in both the horizontal and the vertical directions, we allow global connections in the vertical direction. This increases the wiring complexity but simplifies the design of the various cells. The semisystolic routing network is shown in Fig. 8. The global connections in the vertical direction eliminate the need for the signal clkup and also the need for two registers per cell. When sup reaches a cell, the cell just enables its vertical output, thus placing its contents on the vertical line. Note that it also needs to retain this value. Also note that at any point in time only one of the output cells of a column has its vertical output enabled.

This architecture has the same properties as the systolic one (it even retains the scalability property). This design is not semisystolic in the conventional sense and seems to be the most practical of the three designs.

C. Memory Based Architecture

Another possible design is to replace the network with

address generators and an N_w(J - 1) × k-bit RAM. The interaction between the routing network and the linear array is such that at most ⌈N_w/4⌉ values have to be fed to the linear array by the routing network during any given clock cycle. This is because there is exactly 1 nonfirst octave output scheduled for every 4 outputs. During any cycle, at most 1 output is stored into the routing network. Thus a RAM that replaces the routing network will need ⌈N_w/4⌉ read ports and 1 write port. Since typically N_w ≤ 32, the number of read ports is ≤ 8. Also, typically J ≤ 6, thus N_w(J - 1) ≤ 160. Therefore at most 8 address lines are required. No contention logic is required, since there is only one write port and it operates out of phase with the read ports. The architecture is shown in Fig. 9.

Fig. 9. Memory based architecture.

We now describe how the address generation is done. The incrementer S has a J-bit (J ≤ log N) register, which is initialized to zero and incremented once every 4 cycles (starting from the second cycle). Let the bits of this register be numbered J, J - 1, ..., 1, where J is the most significant bit. Since the


outputs are scheduled according to the RPA, the position of the first (least significant) 1 in the register, say i, will indicate that an (i + 1)th octave output has been scheduled. Thus the ith octave needs to be input. The position of the first 1 can be found in log J bit-steps (trivial compared to the multiplication time) using exactly J 1-b comparators. This position is then fed to an address decoder, which uses it as the row address. The column address always starts at 1 and goes up to N_w. When an address decoder is assigned a row address, it reads out the whole row (N_w words) over the next N_w cycles. Thus each address decoder consists of a log N_w-bit incrementer. The rest of the circuit is clear from the diagram. The major disadvantages of this design are its nonscalability and the wiring complexity due to the multiplexors and the number of read ports.
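The octave-selection rule is the classic 'ruler sequence'. A minimal sketch (added here) of the rule applied to successive counter values:

    # Position of the least significant 1 bit of the counter selects
    # which octave's stored row to read next (feeding octave i+1).
    def octave_to_read(counter):
        return (counter & -counter).bit_length()

    print([octave_to_read(c) for c in range(1, 16)])
    # [1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1] - the RPA interleaving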

Fig. 10. The M-band DWT.

D. M-Band Extensions, Utilization, and Period

Up to this point we have been considering 2-band implementations of the DWT. It has been shown that the design of the wavelet filter (QMF) is generally made easier by using an M-band extension [14] of the DWT. More compact signal representations have also been quoted as a reason for considering the M-band DWT. The M-band DWT is shown in Fig. 10. A study of the feasibility of implementing this on a linear array leads us to a general form for the utilization of the


array. Note that the equivalent of octave in the M-band case is a stage and band refers to one of the M filters at each stage.

The number of outputs per band at stage (or 'octave') i of the M-band DWT is given by n_i = N/M^i, where n_0 = N = M^P. The total number of outputs per band over all the stages is given by

    R = Σ_{i=1}^{P} N/M^i = (N - 1)/(M - 1).

Note that since N = M^P, R will always be a positive integer for P ≥ 1 and M ≥ 2. Due to the decimation by M, every output has N_w - M inputs in common with the previous output of the same stage (and same band). Thus if M ≥ N_w there is no overlap.

In our implementation of the RPA on a linear array, we compute the first stage in a systolic manner, and this determines the number of cycles needed to compute all the stages in that band. The number of outputs for a band in the first stage is N/M. In a systolic output schedule, with decimation by M, the gap between two outputs is 2M. Thus the total number of cycles needed to compute a band in the first stage is

    T = 2M × (N/M - 1) + 2 = 2(N - M + 1).

Since the outputs of the other octaves (stages) are interspersed between the computation of the first octave, the gap between any 2 outputs of the ith octave is 2M^i. Thus the R_l and R_r cells will have to be modified to take this into account.

In our architectures (so far), each band (lowpass/highpass in the 2-band case) is computed on a separate linear array, over all the stages. Thus the utilization of each linear array is given by

R 1 N-+m T 2 ( M - 1)’

Utilization = lim - =

Thus the utilization of the linear array is 50% for architectures described in the previous sections.

The utilization can be improved very easily. At all the stages of an M-band decomposition, there are M filters. Also, we know that in a systolic output schedule, with decimation by M, the gap between two outputs of the first stage is 2M. And in our implementation of the RPA, there is exactly one nonfirst stage output scheduled in the cycle immediately after a first stage output. These two facts, combined with the fact that all the M filters at any stage share the same inputs, imply that we can compute all the M bands on just one linear array. There is no penalty to pay with respect to the computation time. The complexity of each multiply-accumulate cell is increased by adding (M - 1) + 2 extra registers and an M-bit counter. The (M - 1) extra registers are needed to hold the filter terms for the other M - 1 filters. Two more registers are needed to hold the current inputs from the input stream and from the routing network. Now the utilization of the linear array is given by

    Utilization = lim_{N→∞} (N - 1)M / (2N(M - 1)) = M/(2(M - 1)).


Thus the utilization is always greater than 50%. For the most common case, i.e., M = 2, the utilization is 100%. Note that this implies that both the highpass filter and the lowpass filter in the 2-band case can be incorporated into the same linear array, thereby achieving 100% utilization. Previously, for the 2-band case, in each block of 4 outputs of the lowpass filter, the first one was a first octave output, while the second one was a nonfirst octave output, and the other 2 outputs were zero. Similarly for the highpass filter. Under the combined filter, the first 2 outputs of a block of 4 will be the lowpass outputs, while the other 2 outputs will be the corresponding highpass outputs.
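A quick numeric check (added here) of the two utilization expressions:

    def utilization_separate(M):    # one band per linear array
        return 1.0 / (2 * (M - 1))

    def utilization_combined(M):    # all M bands share one array
        return M / (2.0 * (M - 1))

    for M in (2, 4, 8):
        print(M, utilization_separate(M), utilization_combined(M))
    # M = 2: 50% separate, 100% combined; M = 4: 66.7% combined.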

A clear disadvantage of the architectures described so far is the input sampling rate (of 1 input every 2 cycles) and consequently a period of 2N. This is clearly due to the use of the classical systolic convolution for computing the first octave outputs. A simple way to rectify this is to add another row to the routing network and feed the input directly into this row. This gives us a sampling rate of 1 input every cycle and thus a period of N cycles. The problem with this approach is that now we need another array (of size N_w) of MACs for computing the highpass outputs. Thus, by adding N_w MACs and N_w registers to the 1-D architecture, we can get a period of N and 100% utilization.

Fig. 11. The 2-D DWT pyramid.

V. ARCHITECTURES FOR THE 2-D DWT

In this section we present two architectures for computing

the 2-D DWT. One of these architectures is strongly linked to the 1-D architectures discussed earlier. We also show that a blocked implementation is impractical. We begin with the definition of the DWT in two dimensions. We only consider the separable case in this paper.

Mallat showed in [8] and [9] that if Φ(z1, z2) = φ(z1)φ′(z2) and if φ(z1), φ′(z2) are scaling functions in one dimension, then Φ(z1, z2) is a scaling function in two dimensions. Under these conditions the 2-D DWT is computed as shown in Fig. 11, where the filters h(n) and w(n) are the same as the 1-D filters (shown in Fig. 1) and are constructed from the scaling functions and the corresponding wavelets. This is the pyramid algorithm in two dimensions. From a direct algorithm point of view, the separable 2-D DWT can be


expressed succinctly as follows:

    Y = M X M^T        (5)

where M is the 1-D "DWT matrix," X is the N × N input matrix, and Y is the N × N output matrix. Since the DWT is a linear transform, it can always be written in the form of a matrix-vector multiplication. The "DWT matrix" corresponds to the matrix whose application to a vector produces the DWT of the vector. Thus the 2-D DWT consists of computing the 1-D DWT of each of the N columns of X and then computing the 1-D DWT of each of the N resulting rows. Each one of these 1-D DWTs can be computed using the pyramid algorithm in one dimension. This leads us to the first architecture. In this section a 1-D DWT module refers to the combined module presented in Section IV-D, i.e., with the highpass and lowpass filters combined onto one linear array.
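A minimal Python sketch (added here) of (5): apply a 1-D transform to every column of X, then to every row of the result. The one-octave Haar stage is a stand-in for any full 1-D DWT routine, not the paper's module.

    def haar_stage(x):              # one octave only, length preserving
        lo = [(x[2*i] + x[2*i + 1]) / 2 for i in range(len(x) // 2)]
        hi = [(x[2*i] - x[2*i + 1]) / 2 for i in range(len(x) // 2)]
        return lo + hi

    def dwt2d(X, dwt1d):
        cols = [dwt1d([row[j] for row in X]) for j in range(len(X[0]))]
        Xc = [list(r) for r in zip(*cols)]   # transpose back to rows
        return [dwt1d(row) for row in Xc]

    X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    for row in dwt2d(X, haar_stage):
        print(row)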

Fig. 12. The direct implementation of the 2-D DWT.

A. Direct Approach

A straightforward implementation of the 2-D DWT, which uses N² memory cells, is shown in Fig. 12. It essentially consists of a 1-D DWT module which is used repeatedly, in the manner prescribed in (5), to compute the 2-D DWT. The address generators are needed to do an in-memory transpose, which is clearly a part of the computation in (5). Consider using the 1-D DWT architecture described in the previous sections as the 1-D DWT module in Fig. 12. The number of clock cycles needed to compute the N column DWTs is 2N × N = 2N²; similarly, 2N² cycles are required to compute the N row DWTs. Thus the 2-D DWT is computed in 4N² clock cycles. This is clearly optimal (within a constant factor, 4) under the limited I/O model, i.e., an I/O rate of O(k), where k is the precision. The input to the circuit can be presented either in column-major or in row-major order. The area requirements are obviously dominated by the N²k² bits of storage.

Clearly the strength of this architecture lies in its simplicity. It relies only on the 1-D DWT module for its efficiency. But the fact that it needs N²k² bits of storage implies that a single-chip implementation will be a very difficult task for N ≥ 256 and N_w ≥ 6. Another disadvantage is that there is a latency of 2N² cycles before the first output is produced. This cannot be tolerated by many applications.

B. Block Filtering Approach

A well-known technique to reduce the amount of first-level (fastest) memory required is block filtering. We now show that 2-D block filtering schemes [4], which are very useful for

filtering stored 2-D data, are not very useful for the 2-D DWT (even when the input data is stored).

Consider the 2-D DWT as shown in Fig. 11. Since each one of the stages does a 2-D filtering operation, we can try using a block filtering technique in 2-D [4]. This is essentially an extension of 1-D block filtering techniques. This would allow us to compute the 2-D filtering operation using only O(B²k²) bits of storage, where N ≫ B ≥ N_w and B² is the block size. This is a major savings in the storage required. But there are two major problems with this approach. Firstly, it requires the input to be presented in a nonuniform manner, i.e., not in row-major or column-major order, but in some overlapped-block order. This is a big disadvantage, since most images (the most common 2-D data) are available in row-scan or column-scan form. But there could be situations, for example, while dealing with images already stored in memory (or on disk), when an arbitrary input pattern is possible. Hence this seems like a good alternative for the 2-D DWT. But, as we explain below, this is not true.

The savings considered in the previous paragraph are for 2-D filtering. Each stage of the 2-D DWT involves a 2-D filtering operation. Hence, to use the block filtering approach, each stage needs to access the outputs of the previous stage in arbitrary order. The input, which is stored in memory or on disk, is used only by the first stage. The subsequent stages use the output generated by the previous stage as input; thus we need to store these intermediate outputs. Keeping in mind that the filter size is N_w and that the rate and number of inputs (and outputs) at each stage is (1/4)th that of the previous stage, it is clear that the amount of storage needed (over and above that used to store the input) is O(NN_w k) bits. This dependence on N, along with the nonuniform access requirements, makes block filtering schemes unattractive for the 2-D DWT. This is especially true because the architecture developed in the next section uses the same amount of storage, has a latency of just 1 cycle before the first output is produced, and takes its inputs in row-major or column-major form.

C. Systolic-Parallel Architecture

This architecture uses a systolic filter (linear array) to do the DWT in one dimension, while the other dimension is handled by a parallel filter, i.e., one that needs all the N_w inputs at the same time. It uses the same concept as the 1-D architectures, namely, interspersing the computation of the various octaves with the computation of the first octave. Note that the term octave is not strictly correct, since the decimation is by 4 at each stage; but we use it loosely throughout this section. The architecture is shown in Fig. 13.

The systolic 1-D DWT modules are similar to the 1-D DWT architectures described in the previous sections. The only difference is that the routing network gets its inputs from the parallel filter, rather than from its associated systolic filter. This is in accordance with the way the 2-D pyramid algorithm works, as shown in Fig. 11; namely, the input to the ith octave row (column) filter comes from the column (row) filter of the (i - 1)th octave. The parallel filter could be just N_w multipliers followed by a log N_w-level adder tree.

Fig. 13. The systolic-parallel architecture.

We make the following observations about the operation of the 2-D pyramid algorithm. The number of inputs to the first octave row filter is N × N, i.e., N rows, each with N elements. The number of outputs produced by this filter is N × N/2, i.e., N rows with N/2 elements each. The number of inputs to the second octave row filter is N/2 rows with N/2 elements each; the number of outputs is N/2 rows with N/4 elements per row. Thus the sum of the number of elements produced by the row filters of all the octaves, per row of the input X, is N. The column filters of each octave have characteristics similar to the row filters, except that they take columns as inputs. This implies that in order for a stage of the pyramid to operate in a pipelined fashion, both within the stage and with its predecessor and successor stages, the row filters must produce their outputs in column form and the column filters must produce their outputs in row form. This can only be achieved by using some type of blocking/data converters.

We now consider the mapping between Fig. 11 and Fig. 13, i.e., the mapping of the 2-D DWT pyramid onto our architecture. All the blocks marked A in Fig. 11 are computed on the systolic DWT modules using the RPA. Two modules are used in Fig. 13 to facilitate the decimation by two of stages B and C. All the B and C blocks are computed on the parallel filters. In reality the schedule is dictated by S1 and S2 (and hence by the 1-D RPA), and thus there is a large amount of flexibility involved in the implementation of P1 and P2. For every block of 4 outputs produced by both S1 and S2, the first 2 constitute the output of w(n) of A, while the next two constitute the output of h(n) of A. For each one of these outputs, P1 and P2 compute 2 outputs, corresponding to w(n) and h(n) of B or C. Note that the decimation by 2 for B and C is taken care of at the end of every 2N cycles by shifting the holding and block cells (this is explained below). The blocking used by our architecture to achieve maximum pipelining is as follows: at any point in time, the N_w latest rows of outputs (all the octaves) from the row filter are stored. This requires 2NN_w cells of storage (as explained earlier, the sum of the number of outputs over all octaves per row of the input is N, and we are combining both the highpass and lowpass outputs). The two column filters can now operate on these 2N columns, each of height N_w, to produce 4 rows, corresponding to the 4 outputs of B and C. The row corresponding to the output of w(n) of C is fed as input to one of the row filters (to be precise, to the routing network of that row filter). Hence, one of the row filters (S2) produces only first octave outputs, while the other filter produces all the octaves. This is due to the decimation by two along rows.

The operation of the architecture is as follows.

1) Start filtering (decimated by 2) of the ith and (i + 1)th rows on S1 and S2, respectively. The inputs for the S1 routing network will come from P1.

2) The outputs of these two filters (S1 and S2) are fed into the holding cells. The holding cells shift their contents into the block cells once every 2N clock cycles, i.e., once every 2N clock cycles, the holding and block cells shift right by two. Since these cells are fed by S1 and S2, the values in them are stored in the same order as the output schedule described for the 1-D DWT (RPA).

3) The filters P1 and P2 compute 4 rows over 2N cycles; one row (from the set of 4) is to be used by the routing network of S1. The other 3 rows are the outputs of this octave. They produce the outputs in the same order as the row filters, and this is exactly the order required by the routing network (and subsequently by the row filters).

S1 and S2 operate in a lock-step manner. Due to the decimation by two in B and C, only S1 needs to compute all the (row) octaves, while S2 only needs to compute the first octave. We use two of them to accommodate the decimation by two in the column filters. The holding cells are simple latches, and during any clock cycle only one cell in a column has its input enabled. This means that T and U have to drive only two inputs during any clock cycle: a latch input and a filter cell input. The order in which the inputs to the latches are enabled is simple. After the right shift by two (i.e., when it starts processing the next 2 rows of the input), the ith output of the filter (S1 or S2) is stored in the ith cell. The cell numbering begins from the bottom, i.e., the bottommost cell is loaded first. This ordering can be achieved in a simple manner by passing a bit token upward. The block cells form a simple shift register network. Each cell is made up of two registers. During the right shift operation, which happens once every 2N cycles, both the registers of a cell are made equal. During regular operation one of the registers shifts downward while the other retains its contents. As with the semisystolic routing network, the use of 2 registers in each block cell can be eliminated by using global lines in the vertical direction.

We now deal with the boundary conditions, i.e., the computation of the first row. The first output row of all the octaves depends only on the first row of the previous octave. Thus, at the beginning of the computation, only one row should be fed to the circuit, namely, the first row of the input should be fed to S2. After 2N cycles, the second and third rows are fed to S1 and S2, respectively, and so on (as described above).

The 2-D DWT is computed by this architecture in N² + N cycles with a latency of 1 cycle, and it requires area A = O(NN_w k). More exact figures are presented in the next section.
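A quick numeric check (added here) of this cycle count:

    # N/2 row pairs, each processed in 2N cycles, plus a 2-cycle shift
    # of the holding and block cells after each pair.
    def cycles_2d(N):
        return (N // 2) * (2 * N) + (N // 2) * 2    # = N*N + N

    for N in (8, 256, 512):
        print(N, cycles_2d(N), N * N + N)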

TABLE I
COMPARISON OF VARIOUS ARCHITECTURES. THE PERIOD AND DELAY OF THE NEW ARCHITECTURES CAN BE MADE N BY USING N_w EXTRA MACs.

Architecture     | Area                                 | Delay       | Period
New (all 3)      | O(N_w k log N)                       | 2N          | 2N
Aware's WTP      | O(Nk)                                | O(N log N)  | O(N log N)
Knowles's Arch.  | O(N_w k log N (log N_w + log log N)) | O(N)        | O(N)

VI. PERFORMANCE AND COMPARISONS

For this section we make the practical assumption that the input and output rates are limited to one value per cycle (input or output), whose precision is k bits. We also assume that all the log_b N octaves are being computed, i.e., J = log_b N, with b = 2 for the 1-D case and b = 4 for the 2-D case. A synchronous model of computation is assumed, and thus a cycle is taken to be the time required for the slowest component of the circuit to compute its output. In this case a cycle is the time required to compute a k-bit multiplication. Period is defined as the minimum time between the first input of one computation and the first input of the next computation, i.e., it is a measure of the 'pipelinability' of the circuit. The delay is defined as the time between the first input and the last output. The comparisons and performance figures are presented together. The various architectures are compared both asymptotically (which takes wiring complexity into account) and also by component counts. We compare the 1-D architectures first.

All three 1-D architectures considered by us have a delay and period of 2N. This is because the number of cycles to compute the first octave is 4 × N/2, and the computation of all the other octaves is interspersed among the first octave computation. Consider the area required by each MAC cell: since each cell is essentially a k-bit multiply-accumulate unit, the minimal area is O(k) [13]. There are N_w MAC cells. But the area required is bounded, asymptotically, by the area required by the routing network and is given by A = O(N_w k log N) (note that in practice the area requirements are dominated by the multiply-accumulate cells and not by the routing network). Thus we have AT² = O(N²N_w k log N), which is optimal (see Section II).

The architecture described in [5], which we call Knowles's architecture, has a period and delay of O(N), while it has an area requirement of A = O(N_w k log N(log N_w + log log N)); therefore AT² = O(N²N_w k log N(log N_w + log log N)). It is not valid to compare this with the lower bound derived by us, since Knowles's architecture is not systolic. The WTP released by Aware, Inc. has a period and delay of O(N log N), while it requires an area of O(Nk) (for storing the intermediate octave values); thus AT² = O(N³k log² N). This is far from optimal. This comparison is made because, during the computation phase, the WTP behaves in a systolic manner.

All three architectures considered by us are easily cascadable to allow larger filter sizes. The WTP can be easily cascaded to allow larger filters, while it does not seem possible to cascade Knowles's architecture. Similarly, the architecture described in [11] cannot be scaled easily, since it uses register allocation techniques to minimize the number of storage elements for a given input size, filter size, and number of octaves to be computed. These comparisons are shown in Table I.

TABLE II
COMPONENT COUNTS OF VARIOUS ARCHITECTURES

Architecture | MAC Cells | Multipliers | Adders | Latches | Wiring

As shown in Section V-A, the direct implementation of the 2-D DWT has a period and delay of 4N² and a latency of 2N² cycles before the first output. The area required is dominated by the N²k² bits of storage required. For the systolic-parallel implementation, the number of cycles required to compute an N²-point 2-D DWT is given by T = (N/2 × 2N) + (N/2 × 2) = N² + N, where N/2 × 2N cycles are required by the row filters to handle the N rows of the input (with the other octaves interspersed) and N/2 × 2 cycles are needed to shift the holding cells and block cells to the right by 2, once every 2N cycles. The latency is 1 cycle, since the row filter (S2) puts its first output on line T at the end of the first cycle, and P1 takes this and computes the first output at the end of the second cycle. Since there are 2NN_w k bits of storage in the block cells, the area required is asymptotically bounded by A = O(NN_w k).

The number of MAC cells, multipliers, adders, and latches required by the various architectures is shown in Table II. The multipliers and adders columns contain the number of multipliers and adders excluding the ones in the MAC cells. We assume a brute-force implementation of the parallel filter for the systolic-parallel architecture. The column on wiring complexity is based only on the number of nonlocal wires required by that circuit. It does not take into account the loading on these long wires. For example, the semisystolic version of the routing network has long wires, but the loading on these wires is just one latch (input) and one MAC cell (output). On the other hand, the global lines in the memory based architecture (for example, the address lines of the multiported memory) are loaded much more heavily.

VII. CONCLUSION

A class of architectures, based on linear systolic arrays, for computing the 1-D Discrete Wavelet Transform (DWT) using the Recursive Pyramid Algorithm (RPA) was presented. Our 1-D architectures are optimal in computation time and area. The main features of the 1-D architectures include systolic computation, scalability (except for the memory-based architecture), practical I/O requirements, an area that is independent of the input sequence length (Area = O(N_w kJ)), and optimal period and delay (= 2N). A utilization of 100% is achieved for the linear array while computing the DWT. At the expense of doubling the number of MACs and adding N_w registers, a period and delay of N can be achieved while maintaining the utilization at 100%. The semisystolic routing network seems to be particularly attractive, due to its low component count and simple design. Since the Laplacian pyramid and certain other subband coding applications have the same structure as the DWT, they can be computed efficiently on our architectures. With a few simple modifications the inverse DWT can also be computed.

Page 12: VLSI architectures for the discrete wavelet transform


The circuits are easily cascadable; thus these architectures can be laid out easily. The I/O rate required is low (1 or 2 inputs and outputs every cycle) and very little storage is required (N_w(J-1)k bits), so most of the area is taken up by the MAC cells. Hence, for N_w <= 12, single-chip implementations are possible with current technology. A point worth noting is that, for the purpose of deriving the bounds, the area required by the multipliers was taken to be as low as possible (as described in [13]). But in most high-performance situations some area will be sacrificed to get faster multipliers.
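For concreteness (the numbers here are illustrative choices of ours, not values from the paper): with N_w = 12, J = 8, and k = 16,

N_w(J-1)k = 12 \times 7 \times 16 = 1344 \text{ bits},

i.e., about 170 bytes of on-chip storage.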

Extensions of our architecture for computing the M-band DWT were discussed. It is clear that as the number of bands increases, the utilization of the linear array decreases. For example, the utilization is 66.7% for the 4-band case, but it is always greater than 50%. Two architectures for computing the 2-D DWT (separable case) were discussed. The Systolic-Parallel architecture is very efficient, since it computes the N^2-point 2-D DWT, in real time, in N^2 + N cycles, using 2NN_w cells of storage. Also, it has a 100% utilization of the 1-D DWT modules. Note that the 2-D DWT involves a decimation by 4 at each stage and is thus a 4-band structure. But the 100% utilization is achievable (as opposed to 66.7%) since we are considering separable transforms, and the Systolic-Parallel architecture achieves a decimation by 4 by doing a decimation by 2 along both the column and the row dimensions. Another point worth noting is that we have assumed the filters to be of the same size in both dimensions. This is not necessary for the correct operation of the architecture, but in practice the filters in both dimensions are kept the same size to ensure a uniform treatment of the images.
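The utilization figures quoted here are consistent with a simple work argument, sketched below (our reconstruction from the quoted numbers, not a formula stated in the paper): in the limit of many octaves, an M-band tree produces N*M/(M-1) filter outputs per N input samples, and assuming the same 2N-cycle period these are spread over 2N cycles.

def mband_utilization(M):
    # Total work N*M/(M-1) outputs over a 2N-cycle period -> M / (2*(M-1)).
    return M / (2 * (M - 1))

print(mband_utilization(2))   # 1.0     -> 100% for the ordinary 2-band DWT
print(mband_utilization(4))   # 0.666.. -> the 66.7% quoted for the 4-band case
# As M grows this tends to 0.5, matching "always greater than 50%".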

REFERENCES

[1] Aware, Inc., Aware Wavelet Transform Processor (WTP) Preliminary, Cambridge, MA, 1991.
[2] G. Beylkin, R. Coifman, and V. Rokhlin, Fast Wavelet Transforms and Numerical Algorithms I. New Haven, CT: Yale Univ., 1989 (preprint).
[3] S. Hornick and M. Sarrafzadeh, "On problem transformability in VLSI," Algorithmica, vol. 2, pp. 97-111, 1987.
[4] B. R. Hunt, "Block-mode digital filtering of pictures," Mathematical Biosciences, vol. 11, pp. 343-354, 1971.
[5] G. Knowles, "VLSI architecture for the discrete wavelet transform," Electron. Lett., vol. 26, no. 15, pp. 1184-1185, July 1990.
[6] R. Kronland-Martinet, J. Morlet, and A. Grossmann, "Analysis of sound patterns through wavelet transforms," Int. J. Pattern Recognit. Artificial Intell., vol. 1, no. 2, pp. 273-302, 1987.
[7] A. S. Lewis and G. Knowles, "VLSI architecture for 2-D Daubechies wavelet transform without multipliers," Electron. Lett., vol. 27, no. 2, pp. 171-173, Jan. 1991.
[8] S. Mallat, "Multifrequency channel decompositions of images and wavelet models," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 2091-2110, Dec. 1989.
[9] S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 7, pp. 674-693, July 1989.
[10] Z. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Trans. Signal Process., vol. 39, no. 6, pp. 1322-1332, June 1991.
[11] K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," IEEE Trans. VLSI Syst., vol. 1, no. 2, June 1993.
[12] O. Rioul and P. Duhamel, "Fast algorithms for wavelet transforms," IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 569-586, Mar. 1992.
[13] C. D. Thompson, "Fourier transforms in VLSI," IEEE Trans. Comput., vol. C-32, no. 11, pp. 1047-1057, Nov. 1983.
[14] M. Vetterli, "Wavelets and filter banks for discrete time signal processing," in Wavelets and Their Applications, R. Coifman et al., Eds. Jones and Bartlett, 1991.
[15] M. Vishwanath, "Time-frequency distributions: Complexity, algorithms and architectures," Ph.D. dissertation, Dept. of Computer Science, Pennsylvania State University, University Park, May 1993.
[16] M. Vishwanath, "The recursive pyramid algorithm for the discrete wavelet transform," IEEE Trans. Signal Process., vol. 42, no. 3, pp. 673-677, Mar. 1994.
[17] M. Vishwanath, R. M. Owens, and M. J. Irwin, "The computational complexity of time-frequency distributions," in Proc. Sixth SP Workshop Statistical Signal & Array Process., Oct. 1992, pp. 444-446.
[18] M. Vishwanath, R. M. Owens, and M. J. Irwin, "Discrete wavelet transforms in VLSI," in Proc. Int. Conf. Applicat. Specific Array Processors, Aug. 1992, pp. 218-229.
[19] M. Vishwanath, R. M. Owens, and M. J. Irwin, "An efficient systolic architecture for QMF filter bank trees," in Proc. 1992 IEEE Workshop VLSI Signal Process., Oct. 1992, pp. 175-184.
[20] W. R. Zettler, J. Huffman, and D. C. P. Linden, "Application of compactly supported wavelets to image compression," in SPIE/SPSE Symp. Electron. Imaging Sci. Technol., no. 1244, Feb. 1990, pp. 150-160.

Mohan Vishwanath (S'92-M'93) was born in New Delhi, India, on April 18, 1967. He received the B.E. (Hons.) degree in computer engineering from the University of Bombay in 1988 and the Ph.D. degree in computer science from The Pennsylvania State University in 1993.

Since May 1993 he has been a member of the research staff of the Computer Science Lab at the Xerox Palo Alto Research Center. His current research interests are VLSI signal processing, video coding, and hardware prototyping methodologies.

Robert Michael Owens received the M.S. (1977) degree in computer science from the Virginia Polytechnic Institute and State University, Blacksburg, and the Ph.D. (1980) degree in computer science from Pennsylvania State University, University Park.

He is presently with Penn State as an Associate Professor of Computer Science and Engineering. Before he joined Penn State, he was with IBM and the Naval Surface Weapons Center. His research interests include computer architecture, massively parallel computing, VLSI architectures, and the CAD tools associated with their implementations.

Dr. Owens is on the IEEE Signal Processing Technical Committee on VLSI and will be serving as a Program Cochair of the next ASAP Conference. He is the author of UREP, a computer communication package which has been distributed to hundreds of locations worldwide. He has authored over 100 scholarly works.

Mary Jane Irwin (S'74-M'77-SM'89-F'94) received the M.S. (1975) and Ph.D. (1977) degrees in computer science from the University of Illinois at Urbana-Champaign.

She is currently with Pennsylvania State University, University Park, as a Professor of Computer Science and Engineering. She is the Principal Investigator of a Small Scale Institutional Infrastructures grant from NSF. Her primary research interests include computer architecture, the design of application-specific VLSI processors, high-speed computer arithmetic, and VLSI CAD tools.

Dr. Irwin is on the executive committees of the Design Automation Conference and the Supercomputing Conference, is on the editorial board of the Journal of VLSI Signal Processing and the IEEE TRANSACTIONS ON COMPUTERS, and is an elected member of the Computing Research Board, the IEEE Computer Society Board of Governors, and the ACM Council. She has authored over 125 scholarly works.

