IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 42, NO. 5, MAY 1995
VLSI Architectures for the Discrete Wavelet Transform
Mohan Vishwanath, Member, IEEE, Robert Michael Owens, and Mary Jane Irwin, Fellow, IEEE
Abstract- A class of VLSI architectures based on linear systolic arrays, for computing the 1-D Discrete Wavelet Transform (DWT), is presented. The various architectures of this class differ only in the design of their routing networks, which could be systolic, semisystolic, or RAM-based. These architectures compute the Recursive Pyramid Algorithm, which is a reformulation of Mallat's pyramid algorithm for the DWT. The DWT is computed in real time (running DWT), using just $N_w(J-1)$ cells of storage, where $N_w$ is the length of the filter and $J$ is the number of octaves. They are ideally suited for single-chip implementation due to their practical I/O rate, small storage, and regularity. The $N$-point 1-D DWT is computed in $2N$ cycles. The period can be reduced to $N$ cycles by using $N_w$ extra MACs. Our architectures are shown to be optimal in both computation time and area. A utilization of 100% is achieved for the linear array. Extensions of our architecture for computing the M-band DWT are discussed. Also, two architectures for computing the 2-D DWT (separable case) are discussed. One of these architectures, based on a combination of systolic and parallel filters, computes the $N^2$-point 2-D DWT, in real time, in $N^2 + N$ cycles, using $2NN_w$ cells of storage.
IN THE LAST few years there has been a great amount of interest in wavelet transforms, especially after the discovery of the Discrete Wavelet Transform (DWT) by Mallat. The DWT can be viewed as a multiresolution decomposition of a signal. This means that it decomposes a signal into its components in different frequency bands (to be specific, in octave bands). The Inverse DWT (IDWT) does exactly the opposite, i.e., it reconstructs a signal from its octave band components. The applications of this transform (and its slight variants) are numerous, ranging from image and speech compression to solving partial differential equations [8]. In this paper we study the feasibility of implementing the DWT (both 1-D and 2-D) in VLSI, and we propose architectures, based on linear systolic arrays, for computing the DWT in VLSI. All the architectures (except one) are based on the Recursive Pyramid Algorithm (RPA). The RPA is a reformulation of the pyramid algorithm discovered by Mallat
Manuscript received October 9, 1992; revised February 17, 1994. Preliminary versions of parts of this paper were presented at ASAP'92 and the IEEE VLSI Signal Processing Workshop, 1992. This paper was recommended by Associate Editor K. Yao.
M. Vishwanath is with the Computer Science Lab, Xerox Palo Alto Research Center, Palo Alto, CA 94304 USA.
R. M. Owens and M. J. Irwin are with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 USA.
IEEE Log Number 9410949.
and is highly amenable to VLSI implementation. We show that there is a strong link between the RPA and linear systolic arrays. These architectures can be extended to handle most other QMF filter bank trees. The area and time complexities of the architectures described in this paper are shown to be optimal. These architectures are highly flexible: they can be easily scaled to handle filters of any size, and they are independent of the input size. They can also be used for computing the M-band DWT. We also show that 100% utilization of the linear systolic array (filter) is always possible for the DWT.
In the 2-D case the dependence on the input size (the smaller of the two dimensions) cannot be eliminated because of the limited I/O rate and the row scan (raster scan) or column scan input format. In this paper we only consider the separable case of the 2-D DWT. The architectures for the 2-D DWT rely, to a large extent, on the 1-D architecture.
Previous related work, definitions, and the complexity results are presented next. The RPA is briefly introduced in Section III. The 1-D architectures are presented in Section IV, while the architectures for the 2-D DWT are presented in Section V. Performance figures and comparisons (with each other and with the architectures described in [1] and [5]) are presented in Section VI.
Very little work has been done in mapping the DWT into VLSI. The first architecture for computing the DWT was designed by Knowles. This architecture was not well suited for VLSI since it used large multiplexors for routing the intermediate results. Later, Lewis and Knowles designed an architecture for computing the 2-D DWT. A major drawback of this architecture is that it is heavily dependent on the properties of a specific wavelet, namely, the Daubechies 4-tap wavelet. In fact, it needs no multipliers when used with the Daubechies 4-tap wavelet, but it would not work efficiently with any other wavelet. Aware, Inc. has released a chip called the Wavelet Transform Processor (WTP) [1]. It essentially consists of a 4-tap filter (in this case, 4 multiply-accumulate cells) together with some external memory and control, and it has no special features that take advantage of the structure of the DWT. It relies heavily on software for computing the DWT. Recently, Parhi and Nishitani have proposed folded architectures and digit-serial architectures for the 1-D DWT [11]. These architectures do not easily scale with the filter size and the number of octaves computed.
1057-7130/95$04.00 © 1995 IEEE
Fig. 1. The DWT filter bank.
A. The Wavelet Transform
The Wavelet Transform (WT) of a signal $x(t)$ is given by
$$W(u, s) = \frac{1}{\sqrt{s}} \int x(t)\, h\!\left(\frac{t-u}{s}\right) dt$$
where $h(t)$ is the wavelet function. The Wavelet Transform of a sequence $x(i)$ (a sampled version of the continuous signal $x(t)$), discretized on a grid whose samples are arbitrarily spaced in both time ($b$) and scale ($a$), is given by
$$W(b, a) = \frac{1}{\sqrt{a}} \sum_{i=b}^{aN_w+b-1} x(i)\, h\!\left(\frac{i-b}{a}\right)$$
where $N$ is the number of input samples, $N_w$ is the size of the support of the basic wavelet $h$, and $h$ is obtained by sampling $h(t)$. Also, $a$ is of the form $a = c a_0^m$, where $a_0 > 1$, $c$ is a constant, and $m$ is an integer. The number of distinct $m$ considered is $J$; in other words, $J$ is the total number of scales. At each scale $k$, $a = c a_0^k$, and the number of samples in the time dimension is $B_k$, where $B_k \le N$. Thus the properties of the wavelet transform are heavily dependent on the properties of the basic wavelet. All the architectures that we have developed in this paper are independent of the wavelet function and are hence flexible. In general there are two special cases of the WT, the Discrete Wavelet Transform (DWT) and the Continuous Wavelet Transform (CWT). In this paper we have only considered the former.
DWT: The DWT can be viewed as the multiresolution decomposition of a sequence. It takes a length $N$ sequence, $x(n)$, and generates an output sequence of length $N$. The output is the multiresolution representation of $x(n)$. It has $N/2$ values at the highest resolution, $N/4$ values at the next resolution, and so on. Let $N = 2^P$ and let the number of frequencies, or resolutions, be $J$. (Since we are only considering octaves, $J \le P$.) The structure of the DWT is due to the dyadic nature of its time-scale grid: the points on the grid that we are concerned with are those for which $a_0 = 2$, so that the scale $a$ doubles and the number of samples $B_k$ halves from one octave to the next, for $k \in \{0, 1, \ldots, J-1\}$. The DWT filter bank structure is shown in Fig. 1.
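The filter-and-decimate structure described above can be sketched in a few lines of Python. This is only an illustration of the direct pyramid computation, not of the RPA schedule or of the systolic architectures proposed in this paper. The function name `pyramid_dwt`, the periodic extension at the signal boundary, and the choice of the Haar filter pair in the usage lines are all assumptions made for the example.

```python
import math

def pyramid_dwt(x, lowpass, highpass, octaves):
    """Direct pyramid DWT sketch: filter, then keep every second output.

    Returns (details, approx), where details[j] holds the high-pass
    outputs of octave j + 1 and approx is the final low-pass residue.
    Boundary handling is periodic (an assumption of this sketch).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("len(x) must be a power of two")
    if octaves < 1 or 2 ** octaves > n:
        raise ValueError("need 1 <= octaves <= log2(len(x))")
    approx = list(x)
    details = []
    for _ in range(octaves):
        m = len(approx)
        next_approx, detail = [], []
        for i in range(0, m, 2):  # step of 2 implements the decimation
            next_approx.append(
                sum(c * approx[(i + t) % m] for t, c in enumerate(lowpass)))
            detail.append(
                sum(c * approx[(i + t) % m] for t, c in enumerate(highpass)))
        details.append(detail)
        approx = next_approx
    return details, approx

# Usage with the 2-tap Haar pair (chosen only because it is easy to check).
s = 1 / math.sqrt(2)
details, approx = pyramid_dwt([1, 2, 3, 4, 5, 6, 7, 8], [s, s], [s, -s], 3)
print([len(d) for d in details])  # octave sizes: [4, 2, 1]
```

With $N = 8$ and $J = 3$ the octaves hold 4, 2, and 1 values; together with the single remaining low-pass value this gives $N$ outputs in total, matching the length-$N$ output described above.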
CWT: In its most general form, the CWT is defined by equation 1. In other words, the CWT is defined as a WT with no decimation at any scale and at any desired frequency resolution. Thus $B_k = N$ at all the $J$ scales. The CWT takes a length $N$ input sequence and produces a length $N$ output sequence at each scale. The most commonly used version of the CWT is one [12] where the frequency spacing (resolution) is logarithmic (octaves), as in the case of the DWT. Thus a length $N$ input sequence produces a length $N$ output sequence at each of the $J$ scales, where $J < \log N$.
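The difference in output volume between the two transforms can be made concrete with a small counting sketch. The function names are hypothetical, and octaves are indexed here from 1 so that the first entry corresponds to the highest resolution.

```python
def dwt_output_sizes(n, octaves):
    """Samples per octave for the DWT: N/2, N/4, ... (decimation by 2)."""
    if octaves < 1 or n % (2 ** octaves) != 0:
        raise ValueError("n must be divisible by 2**octaves")
    return [n >> j for j in range(1, octaves + 1)]

def cwt_output_sizes(n, octaves):
    """Samples per scale for the octave-spaced CWT: N at every scale."""
    if octaves < 1:
        raise ValueError("octaves must be positive")
    return [n] * octaves

print(dwt_output_sizes(16, 3))  # [8, 4, 2]
print(cwt_output_sizes(16, 3))  # [16, 16, 16]
```

The DWT output over all octaves stays below $N$ values, while the CWT produces $JN$ values, which is why the two cases place different demands on storage and I/O.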
B. Lower Bounds
In this section we present lower bounds for computing the Wavelet Transforms. The bounds have been derived in [15]. These bounds are for single-chip implementations and are derived under the following practical spatial restrictions on the I/O protocol.
1) Unilocal: Each input/output bit is available at only one pad.
2) Place-determinate: I/O data are available at prespecified (instance-independent) places.
3) Word-local: For any cut $l$ partitioning the chip, only a constant number of input (output) words have some bit entering (exiting) the chip on each side of $l$. That is, except for possibly a small number of inputs (outputs), all the $k$ bits of each input (output) enter (exit) the chip on either the left or the right side of the partition.
The results outlined below are for the 1-D case. These results can be extended directly to the 2-D case. All these bounds hold under the assumption that $J \le \lfloor \log_2 (N/N_w) \rfloor + 1$. Bounds have also been derived for the case when $J$ does not satisfy this condition. Let $a_0 = 2$ and $N = 2^P$. Then the area $A$ and time $T$ satisfy the following lower bounds.
For the 1-D DWT, $AT^2 = \Omega(J^2 N_w^2 k^2)$. For the 1-D DWT under the word-serial model, $A = \Omega(J N_w k)$ and $T \ge N$. (The word-serial model is one in which, at any time instant, at most one input (output) word has some, but not all, of its bits already read (written).)
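As a rough numerical illustration, the sketch below evaluates the condition on $J$ and the dominant terms of the bounds. It assumes the condition reads $J \le \lfloor \log_2(N/N_w) \rfloor + 1$ and that the bounds scale as $J^2 N_w^2 k^2$ for $AT^2$, $J N_w k$ for $A$, and $N$ for $T$; constant factors are ignored, and the function names are hypothetical.

```python
def max_octaves(n, n_w):
    """Largest J with J <= floor(log2(n / n_w)) + 1, in integer arithmetic."""
    if n_w < 1 or n < n_w:
        raise ValueError("need 1 <= n_w <= n")
    # For q >= 1, floor(log2(q)) + 1 equals q.bit_length(), and
    # floor(log2(n / n_w)) equals floor(log2(n // n_w)).
    return (n // n_w).bit_length()

def lower_bound_terms(n, n_w, octaves, k):
    """Dominant terms of the bounds (constants dropped; k = bits per word)."""
    if octaves > max_octaves(n, n_w):
        raise ValueError("octaves exceeds the range assumed by the bounds")
    area = octaves * n_w * k
    return {"AT2": area ** 2, "A_word_serial": area, "T_word_serial": n}

print(max_octaves(1024, 4))               # 9
print(lower_bound_terms(1024, 4, 3, 16))
```

Under these assumptions, the $2N$-cycle period and $N_w(J-1)$ storage cells quoted in the abstract are within constant factors of the word-serial terms computed here.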