
IEEE Transactions on VLSI Systems Vol. 4, No. 4, pp. 421-433, Dec 1996


VLSI Implementation of Discrete Wavelet Transform

A. Grzeszczak, M. K. Mandal, S. Panchanathan, Member, IEEE, and T. Yeap, Member, IEEE

Visual Computing and Communications Laboratory
Department of Electrical Engineering
University of Ottawa, Ottawa, Canada K1N 6N5
Tel. (613) 562-5800 x.6219, Fax (613) 562-5175

email: [email protected]

Abstract - This paper presents a VLSI implementation of the discrete wavelet transform (DWT). The architecture is simple, modular, and cascadable for computation of one- or multi-dimensional DWT. It comprises four basic units: input delay, filter, register bank, and control unit. The proposed architecture is systolic in nature and performs both high-pass and low-pass coefficient calculations with only one set of multipliers. In addition, it requires only a small on-chip interface circuit for interconnection to a standard communication bus. A detailed analysis of the effect of finite precision of data and wavelet filter coefficients on the accuracy of the DWT coefficients is presented. The architecture has been simulated in VLSI and has a hardware utilization efficiency of 87.5%. Being systolic in nature, the architecture can compute DWT at a data rate of $N \times 10^6$ samples/sec corresponding to a clock speed of $N$ MHz.

1. Introduction

In the last decade, there has been an enormous increase in the applications of wavelets in various scientific disciplines [1]-[5]. One of the main contributions of wavelet theory is to relate the discrete time filterbank with the theory of continuous time function space. Typical applications of wavelets include signal processing [6], [7], image processing [8]-[10], numerical analysis [11], statistics [12], bio-medicine [13], etc. Wavelet transform offers a wide variety of useful features, in contrast to other transforms, such as Fourier transform or cosine transform. Some of these are:

• Adaptive time-frequency windows

• Lower aliasing distortion for signal processing applications

• Computational complexity of $O(N)$, where $N$ is the number of data samples

• Inherent scalability

• Efficient VLSI implementation

Since DWT requires intensive computations, several architectural solutions using special purpose parallel processors have been proposed [15]-[20] in order to meet the real time requirement in many applications. The solutions include parallel filter architecture, SIMD linear array architecture, SIMD multigrid architecture [17], [19], 2-D block based architecture [20], and the AWARE's wavelet transform processor (WTP) [16]. The first three architectures, namely the parallel filter architecture, SIMD linear array architecture and the SIMD multigrid architecture, are special purpose parallel processors that implement the high level abstraction of the pyramid algorithm. The 2-D block based architecture is a VLSI implementation that uses four multiply and accumulate (MAC) units to execute the forward and inverse transforms. It requires a small on-chip memory and implements 2-D wavelet transform directly without data transposition. However, this feature can be a drawback in certain applications. In addition, the block based architecture may introduce block boundary effects that degrade the visual quality.

The AWARE's WTP is capable of computing forward and inverse wavelet transforms for 1-D input data using a maximum of six filter coefficients. It can be cascaded to execute transforms using higher order filters. The WTP has been clocked at speeds of 30 MHz and offers 16-bit precision on input and output data. The DWT computation is executed in a synchronous pipeline fashion and is under complete user control. However, the AWARE's WTP is a complex design requiring extensive user control. Programming such a device is therefore tedious, difficult, and time consuming.

There is a clear need for designing and implementing a DWT chipset that explores the potential of DWT, particularly in the area of decomposition algorithm and hardware implementation, and that operates in a turnkey fashion. Here, the user is required to input only the data stream and the high-pass and low-pass filter coefficients.


This paper presents a design and VLSI implementation of an efficient systolic array architecture for computing DWT [21]. The proposed VLSI architecture computes both highpass and lowpass frequency coefficients in the same clock cycle and thus has an efficient hardware utilization. The design is simple, modular, and cascadable for computation of 1-D or 2-D data streams of fairly arbitrary size. The proposed architecture requires a small on-chip interface circuitry for purposes of interconnection to a standard communication bus.

The paper is organized as follows. A brief introduction to discrete wavelet transform is presented in section 2. The effect of finite precision in DWT computation is presented in section 3. The VLSI systolic array architecture for computing DWT is presented in section 4, followed by the conclusions in section 5.

2. Discrete Wavelet Transform

This section presents a brief introduction to discrete wavelet transform (DWT). DWT represents an arbitrary square integrable function as a superposition of a family of basis functions called wavelets. A family of wavelet basis functions can be generated by translating and dilating the mother wavelet corresponding to the family. The DWT coefficients can be obtained by taking the inner product between the input signal and the wavelet functions. Since the basis functions are translated and dilated versions of each other, a simpler algorithm, known as Mallat's tree algorithm or pyramid algorithm, has been proposed in [2]. In this algorithm, the DWT coefficients of one stage can be calculated from the DWT coefficients of the previous stage, which is expressed as follows:

$$W_L(n, j) = \sum_m W_L(m, j-1)\, h(m - 2n) \tag{1a}$$

$$W_H(n, j) = \sum_m W_L(m, j-1)\, g(m - 2n) \tag{1b}$$

where $W_L(p, q)$ is the $p$-th scaling coefficient at the $q$-th stage, $W_H(p, q)$ is the $p$-th wavelet coefficient at the $q$-th stage, and $h(n)$ and $g(n)$ are the dilation coefficients corresponding to the scaling and wavelet functions, respectively.

For computing the DWT coefficients of the discrete-time data, it is assumed that the input data represents the DWT coefficients of a high resolution stage. Eq. 1 can then be used for obtaining DWT coefficients of subsequent stages. In practice, this decomposition is performed only for a few stages. We note that the dilation coefficients $h(n)$ represent a lowpass filter (LPF), whereas the corresponding $g(n)$ represent a highpass filter (HPF). Hence, DWT extracts information from the signal at different scales. The first level of wavelet decomposition extracts the details of the signal (high frequency components) while the second and all subsequent wavelet decompositions extract progressively coarser information (lower frequency components). A schematic of three stage DWT decomposition is shown in Fig. 1.
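The recursion of Eq. 1a-1b can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the periodic extension at the signal boundary, and the use of Haar filters for the hand-checkable example are assumptions, not choices stated in the paper.

```python
import math

def analysis_stage(w_prev, h, g):
    """One stage of Eq. 1a-1b: returns (lowpass, highpass), each of half length."""
    n_in = len(w_prev)
    low, high = [], []
    for n in range(n_in // 2):
        acc_l = acc_h = 0.0
        for k in range(len(h)):          # k stands for (m - 2n) in Eq. 1
            m = (2 * n + k) % n_in       # periodic extension (assumption)
            acc_l += w_prev[m] * h[k]
            acc_h += w_prev[m] * g[k]
        low.append(acc_l)
        high.append(acc_h)
    return low, high

def dwt(x, h, g, stages):
    """Dyadic tree: apply analysis_stage repeatedly to the lowpass branch."""
    details, low = [], list(x)
    for _ in range(stages):
        low, high = analysis_stage(low, h, g)
        details.append(high)
    return low, details

# Haar filters, chosen only because the result is easy to verify by hand.
S = 1 / math.sqrt(2)
HAAR_H, HAAR_G = [S, S], [S, -S]
low, details = dwt([1, 1, 2, 2], HAAR_H, HAAR_G, stages=2)
```

Only the lowpass output is fed to the next stage, which is why each stage processes half as many samples as the previous one.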

In order to reconstruct the original data, the DWT coefficients are upsampled and passed through another set of lowpass and highpass filters, which is expressed as:

$$W_L(n, j) = \sum_k W_L(k, j+1)\, h'(n - 2k) + \sum_l W_H(l, j+1)\, g'(n - 2l) \tag{2}$$

where $h'(n)$ and $g'(n)$ are respectively the lowpass and highpass synthesis filters corresponding to the mother wavelet. It is observed from Eq. 2 that the $j$-th level DWT coefficients can be obtained from the $(j+1)$-th level DWT coefficients.
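A single synthesis stage following Eq. 2 might look as follows. The function name and the periodic extension are assumptions made for illustration, and the example uses Haar filters, for which the synthesis filters coincide with the analysis filters.

```python
import math

def synthesis_stage(low, high, h_syn, g_syn):
    """One stage of Eq. 2: merge lowpass/highpass bands into the finer level."""
    n_out = 2 * len(low)
    out = [0.0] * n_out
    for k in range(len(low)):
        for t in range(len(h_syn)):        # t stands for (n - 2k) in Eq. 2
            n = (2 * k + t) % n_out        # periodic extension (assumption)
            out[n] += low[k] * h_syn[t] + high[k] * g_syn[t]
    return out

# Haar example: these bands are the one-stage decomposition of [1, 1, 2, 2].
S = 1 / math.sqrt(2)
x = synthesis_stage([2 * S, 4 * S], [0.0, 0.0], [S, S], [S, -S])
```

The upsampling by two appears as the index `2 * k`: each coarse coefficient contributes to a block of fine samples starting at an even position.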

Compactly supported wavelets are generally used in various applications. Table 1 lists a few orthonormal wavelet filter coefficients ($h(n)$) popular in compression applications [6]. These wavelets have the property of having the maximum number of vanishing moments for a given order, and are known as Daubechies wavelets. The entries in columns 2, 3, and 5 provide the filter coefficients with the minimum phase, whereas the entries in columns 4 and 6 provide the filter coefficients with the least asymmetric phase.

The 2-D DWT is usually calculated using a separable approach [2]. To start with, the 1-D DWT computation is performed on each row. This is followed by a matrix transposition operation [28]. Next, the 1-D DWT is executed on each row of the transposed data, i.e., on each column of the row-transformed image. Hence, 2-D DWT can be implemented in a straightforward manner by inserting a matrix transposer between two 1-D DWT modules. Fig. 2 shows a 3-level wavelet decomposition of an image. In the first level of decomposition, one lowpass subimage ($LL_2$) and three orientation selective highpass subimages ($LH_2$, $HL_2$, $HH_2$) are created. In the second level of decomposition, the lowpass subimage is further decomposed into one lowpass and three highpass subimages ($LH_3$, $HL_3$, $HH_3$). This process is repeated on the lowpass subimage to derive the higher level decompositions. In other words, DWT decomposes an image into a pyramid structure of subimages with various resolutions corresponding to the different scales. The inverse wavelet transform is calculated in the reverse manner, i.e., starting from the lowest resolution subimages, the higher resolution images are calculated recursively. We note that nonseparable wavelets have also been proposed in the literature. However, they are not widely used because of their complexity.
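The row transform, transposition, and second row transform described above can be sketched as follows. The helper names, the periodic boundary handling, and the Haar filters in the example are assumptions for illustration only.

```python
import math

def dwt1d_stage(x, h, g):
    """One 1-D analysis stage; returns lowpass half followed by highpass half."""
    n_in = len(x)
    low, high = [], []
    for n in range(n_in // 2):
        idx = [(2 * n + k) % n_in for k in range(len(h))]  # periodic (assumption)
        low.append(sum(x[m] * c for m, c in zip(idx, h)))
        high.append(sum(x[m] * c for m, c in zip(idx, g)))
    return low + high

def transpose(a):
    return [list(col) for col in zip(*a)]

def dwt2d_stage(image, h, g):
    """Separable 2-D stage: 1-D DWT on rows, transpose, 1-D DWT on rows again."""
    rows_done = [dwt1d_stage(r, h, g) for r in image]
    cols_done = [dwt1d_stage(r, h, g) for r in transpose(rows_done)]
    return transpose(cols_done)   # transpose back to the original orientation

S = 1 / math.sqrt(2)
flat = [[1.0] * 4 for _ in range(4)]          # constant test image
out = dwt2d_stage(flat, [S, S], [S, -S])
```

For a constant image all the energy ends up in the top-left (lowpass) quadrant and the three highpass quadrants are zero, which is a quick sanity check of the subband layout.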

2.1 Computational Complexity of the DWT

It is observed from Eq. 1a-1b that the complexity of each stage of wavelet decomposition is linear in the number of input samples, where the constant factor depends on the length of the filter. We note that for a dyadic wavelet decomposition, the number of input samples decreases by 50% at each subsequent stage of decomposition. For wavelet order $L$ and $J$ decomposition stages, the computational complexity for a 1-D $N$-point sequence is

$$C_{dyadic} = \left(N + \frac{N}{2} + \frac{N}{4} + \cdots + \frac{N}{2^{J-1}}\right) \cdot 2L \ \text{FLOP} = 4\left(1 - 2^{-J}\right) N L \ \text{FLOP} \tag{3}$$

where FLOP denotes floating point operations and usually refers to multiplications and additions. We note that a simple polyphase decomposition has been assumed in the above calculation. The complexity can be further reduced using more sophisticated algorithms, such as FFT or fast running FIR filtering [14]. However, these algorithms need complex control circuitry for hardware implementation and have hence not been considered in the proposed architecture.

In many applications, a regular tree, instead of a dyadic tree, might be more appropriate. The computational complexity at each stage of a regular tree is $2NL$ FLOP. Hence, the total complexity for a $J$ level decomposition is:

$$C_{regular} = 2JNL \ \text{FLOP} \tag{4}$$

The complexity of an irregular tree, or a wavelet packet algorithm, is upper bounded by $C_{regular}$.
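As a quick numerical check of Eq. 3 and Eq. 4, the stage-by-stage sum can be compared with the closed form. The function names and the example values ($N = 512$, $L = 6$, $J = 3$) are illustrative assumptions.

```python
def dyadic_flops_sum(n, l, j):
    """Left-hand form of Eq. 3: 2L operations per input sample, halving per stage."""
    return sum(2 * l * (n / 2 ** s) for s in range(j))

def dyadic_flops_closed(n, l, j):
    """Closed form of Eq. 3."""
    return 4 * (1 - 2 ** -j) * n * l

def regular_flops(n, l, j):
    """Eq. 4: every stage of a regular tree processes N samples."""
    return 2 * j * n * l

example = (dyadic_flops_sum(512, 6, 3),
           dyadic_flops_closed(512, 6, 3),
           regular_flops(512, 6, 3))
```

For these values the dyadic tree needs 10752 operations against 18432 for the regular tree, and the dyadic cost stays below $4NL$ however many stages are added.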

2.2 Data Dependencies within DWT

The wavelet decomposition of a 1-D input signal for three stages is shown in Fig. 1. The transfer functions of the sixth order highpass ($g(n)$) and lowpass ($h(n)$) FIR filters can be expressed as follows:

$$H(z) = g(0) + g(1)z^{-1} + g(2)z^{-2} + g(3)z^{-3} + g(4)z^{-4} + g(5)z^{-5} \tag{5a}$$

$$L(z) = h(0) + h(1)z^{-1} + h(2)z^{-2} + h(3)z^{-3} + h(4)z^{-4} + h(5)z^{-5} \tag{5b}$$

For clarity, the intermediate and final DWT coefficients in Fig. 1 are denoted by a, b, c, d, e, f and g. The DWT computation is complex because of the data dependencies at different octaves. Eq. 6a-6n show the relationship among a, b, c, d, e, f and g. We note that the data dependencies at various octaves are represented by identical symbols used in more than one octave equation.

1st octave:

$$
\begin{aligned}
b(0) &= g(0)a(0) + g(1)a(-1) + g(2)a(-2) + g(3)a(-3) + g(4)a(-4) + g(5)a(-5) &\quad& \text{(6a)} \\
b(2) &= g(0)a(2) + g(1)a(1) + g(2)a(0) + g(3)a(-1) + g(4)a(-2) + g(5)a(-3) && \text{(6b)} \\
b(4) &= g(0)a(4) + g(1)a(3) + g(2)a(2) + g(3)a(1) + g(4)a(0) + g(5)a(-1) && \text{(6c)} \\
b(6) &= g(0)a(6) + g(1)a(5) + g(2)a(4) + g(3)a(3) + g(4)a(2) + g(5)a(1) && \text{(6d)} \\
c(0) &= h(0)a(0) + h(1)a(-1) + h(2)a(-2) + h(3)a(-3) + h(4)a(-4) + h(5)a(-5) && \text{(6e)} \\
c(2) &= h(0)a(2) + h(1)a(1) + h(2)a(0) + h(3)a(-1) + h(4)a(-2) + h(5)a(-3) && \text{(6f)} \\
c(4) &= h(0)a(4) + h(1)a(3) + h(2)a(2) + h(3)a(1) + h(4)a(0) + h(5)a(-1) && \text{(6g)} \\
c(6) &= h(0)a(6) + h(1)a(5) + h(2)a(4) + h(3)a(3) + h(4)a(2) + h(5)a(1) && \text{(6h)}
\end{aligned}
$$

2nd octave:

$$
\begin{aligned}
d(0) &= g(0)c(0) + g(1)c(-2) + g(2)c(-4) + g(3)c(-6) + g(4)c(-8) + g(5)c(-10) &\quad& \text{(6i)} \\
d(4) &= g(0)c(4) + g(1)c(2) + g(2)c(0) + g(3)c(-2) + g(4)c(-4) + g(5)c(-6) && \text{(6j)} \\
e(0) &= h(0)c(0) + h(1)c(-2) + h(2)c(-4) + h(3)c(-6) + h(4)c(-8) + h(5)c(-10) && \text{(6k)} \\
e(4) &= h(0)c(4) + h(1)c(2) + h(2)c(0) + h(3)c(-2) + h(4)c(-4) + h(5)c(-6) && \text{(6l)}
\end{aligned}
$$

3rd octave:

$$
\begin{aligned}
f(0) &= g(0)e(0) + g(1)e(-4) + g(2)e(-8) + g(3)e(-12) + g(4)e(-16) + g(5)e(-20) &\quad& \text{(6m)} \\
g(0) &= h(0)e(0) + h(1)e(-4) + h(2)e(-8) + h(3)e(-12) + h(4)e(-16) + h(5)e(-20) && \text{(6n)}
\end{aligned}
$$

As shown in Fig. 1 and Eq. 6, several intermediate results (c, e) are first computed, and then used to calculate multiple output samples. We note that the intermediate results must be stored and made available for further processing at specific time instants.
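The index pattern in Eq. 6 is regular: in octave $o$ the outputs sit at multiples of $2^o$ and each output reads six operands spaced $2^{o-1}$ apart. A small sketch makes the dependencies explicit; the function names are hypothetical helpers, not part of the paper.

```python
def output_indices(octave, n_points=8):
    """Time indices at which an octave produces outputs (N = 8 in Eq. 6)."""
    return list(range(0, n_points, 2 ** octave))

def operand_indices(n, octave, taps=6):
    """Indices of the previous-octave samples needed for the output at index n."""
    stride = 2 ** (octave - 1)
    return [n - k * stride for k in range(taps)]

# Operands of d(4) and e(4) in Eq. 6j and 6l: c(4), c(2), c(0), c(-2), c(-4), c(-6)
second_octave_operands = operand_indices(4, octave=2)
```

Negative indices refer to samples from the previous period, which is why intermediate results such as c and e must be kept in registers until every output that depends on them has been computed.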

3. Finite Precision Effect

The accuracy of the DWT coefficients depends on the precision of both the input data and the DWT coefficients. In this section, the performance of a finite precision wavelet transformer is evaluated. We note that a multistage DWT is calculated recursively. Therefore, in addition to the wavelet decomposition stage, extra space is required to store the intermediate coefficients. Hence, the overall performance depends significantly on the precision of the intermediate DWT coefficients. It is assumed that both the intermediate and final DWT coefficients are represented with the same precision, and that their dynamic range is symmetric about zero. Let us assume that the precision of the various data is as follows:

i) Input data: $i$ bits
ii) Intermediate wavelet coefficients: $j$ bits
iii) Wavelet filter coefficients ($h(n)$): $m$ bits

Generally, the dynamic range of the DWT coefficients is greater than the dynamic range of the input data, and hence $j$ should be greater than $i$. In our implementation, we multiply the input data by $2^{j-i}$ to bring them to $j$-bit precision. We then calculate the first stage DWT and scale down the DWT coefficients to $j$ bits. Subsequent stages of DWT decomposition are executed in the same manner.

We note from Table 1 that the maximum absolute value of a filter coefficient is between 0.5 and 1. Therefore, to represent the filter coefficients in $m$ bits precision, we multiply all the coefficients by $2^{m-1}$ and round to the nearest integer value. Instead of rounding, we may sometimes have to select either $\lfloor h(n) \rfloor$ or $\lceil h(n) \rceil$ so that $\sum h(n)$ is closest to $\sqrt{2}$ (for orthonormality) and $\sum g(n)$ is closest to zero. In this case, the filter coefficients range from $-2^{m-1}$ to $(2^{m-1} - 1)$.
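The plain rounding variant of this quantization can be sketched as below. The function name is hypothetical, the 4-tap Daubechies values are standard published coefficients used here only as sample input, and the floor/ceiling selection that preserves the coefficient sums is not implemented.

```python
def quantize_filter(coeffs, m):
    """Scale coefficients by 2^(m-1), round, and clamp to the m-bit signed range."""
    scale = 1 << (m - 1)
    lo, hi = -scale, scale - 1
    # Note: Python's round() resolves exact .5 ties to the even integer.
    return [max(lo, min(hi, round(c * scale))) for c in coeffs]

# 4-tap Daubechies lowpass coefficients (sample input, 10 decimal places).
D4 = [0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226]
q8 = quantize_filter(D4, 8)
```

With $m = 8$ the integer coefficients sum to 181, close to $\sqrt{2} \cdot 2^{7} \approx 181.02$, so in this particular case plain rounding already keeps the lowpass sum near its ideal value.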


DWT coefficients can be recursively calculated using Eq. (1a) and (1b), where $W$ represents wavelet coefficients of a certain stage and $h$, $g$ represent the corresponding filter coefficients. The precision of $W$ and $h$ are $j$ and $m$ bits, respectively.

To execute the operation in Eq. 1, we need a multiplier of $j \times m$ bits. After the multiplication is performed, each term on the right hand side of the equation will have a precision of $m + j - 1$ bits. We note that the dynamic range of the DWT coefficients will increase because of the sum of several terms. The rate of increase will be upper bounded by $\sum_n |h(n)|$, which is listed in Table 1 for a few wavelets. Table 2 shows the maximum dynamic range of DWT coefficients at various stages. We note that in most cases, the value of $z$ in Table 2 is less than 2. Hence, for a 1-D transform, the dynamic range will increase by at most a factor of 2. In other words, the accumulator should have at least $m + j$ bits precision. Since the precision of the intermediate storage is $j$ bits, all the coefficients are scaled down to $j$ bits by dividing them by $2^m$. We note that the input data was scaled up initially by $2^{j-i}$. After the first stage of decomposition, the coefficients will be effectively $2^{j-i-1}$ times greater than the ideal coefficients. To derive the ideal DWT coefficients, the resulting DWT coefficients should be scaled down by the scaling factors provided in Table 3.
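The scaling chain described above (input scaled up by $2^{j-i}$, integer multiply-accumulate, result divided by $2^m$) can be traced with a small integer sketch. The function name, the use of a right shift for the division, and the two-tap example with coefficient 1448 (the value of $1/\sqrt{2}$ scaled by $2^{11}$ and rounded) are assumptions made for illustration.

```python
import math

def fixed_point_coefficient(samples, q_coeffs, i_bits, j_bits, m_bits):
    """One output of Eq. 1 in integer arithmetic.

    samples  : i-bit integer inputs aligned with the filter taps
    q_coeffs : filter coefficients already scaled by 2^(m-1) and rounded
    """
    scaled = [s << (j_bits - i_bits) for s in samples]    # bring input to j bits
    acc = sum(x * c for x, c in zip(scaled, q_coeffs))     # needs about m + j bits
    return acc >> m_bits                                   # divide by 2^m (floors)

result = fixed_point_coefficient([100, 100], [1448, 1448], 8, 12, 12)
ideal = 100 * math.sqrt(2)            # floating point value of the same output
expected_gain = 2 ** (12 - 8 - 1)     # 2^(j-i-1), as stated in the text
```

Here `result` is 1131 while `ideal * expected_gain` is about 1131.4, which illustrates why the first-stage output is $2^{j-i-1}$ times the ideal coefficient.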

The inverse transform is calculated using Eq. 2. The wavelet coefficients are upsampled and passed through a set of LPF and HPF synthesis filters (which are different from but related to the analysis filters) and the filter outputs are added for reconstruction. We note that there is a difference between the forward and inverse DWT with respect to the precision of the coefficients. The dynamic range of the forward wavelet coefficients increases with the tree depth because of the summation operation in Eq. 1a-1b. However, in the case of the inverse transform, the dynamic range of the coefficients will decrease due to the perfect reconstruction property of the wavelet transform. The inverse transform of the $k$-th stage wavelet coefficients results in the $(k-1)$-th stage lowpass coefficients, which have a lower dynamic range. Hence, while performing the inverse DWT, the individual terms of the convolution sum have $m + j - 1$ bits precision. However, after the summation, the resulting coefficients have $m + j - 2$ bits precision. To ensure that the coefficients have $j$ bits precision for intermediate storage, all the coefficients are divided by $2^{m-2}$. When all the stages have been computed, the resulting output data have $j$ bits precision. To obtain the original data, the output data should be divided by $2^{j-m}$.

To determine the accuracy of the finite precision wavelet transformer, 1-D step (shown in Fig. 3a) and sinusoidal (shown in Fig. 4a) input data with 8 bits precision were decomposed for 3 stages using the Daubechies 8 tap wavelet. The differences between the DWT coefficients obtained with 12 bits precision and the ideal coefficients are shown in Fig. 3 and 4. For comparison purposes, the ideal coefficients are also shown along with the error coefficients. It is observed that the error coefficients due to finite precision are small when the data is uniform in nature (Fig. 3). However, when the input data changes rapidly, the value of the error coefficients becomes high (near the step discontinuity and throughout the sinusoidal signal).

A useful measure of the accuracy of DWT coefficients is the signal to noise ratio (SNR). Here, the signal is the floating point DWT coefficient and the noise is the difference between the floating point and finite precision coefficients. Fig. 5 provides the performance variation for a 1-D signal with respect to the precision of the filter coefficients, with the DWT coefficients fixed at 12 bits. The data were decomposed with the Daubechies 8 tap wavelet for 3 stages. It is observed that a performance of 50-70 dB SNR can be obtained with 12 bits precision of both DWT coefficients and filter coefficients. Two test images, Lena and Mandrill (2-D), were decomposed using the Daubechies 8 tap wavelet for 3 stages. The performance with respect to various precisions of DWT and filter coefficients is detailed in Fig. 6. It is observed that with 12 bits precision, an SNR of 40-50 dB can be achieved. We note that in the 2-D case more stages (row and column) of DWT are involved, resulting in a decrease in the SNR (compared to the 1-D case).
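The SNR measure described above can be computed as follows. This sketch assumes the usual power-ratio definition in decibels, which the text does not spell out; the function name and the sample values are illustrative.

```python
import math

def snr_db(reference, approx):
    """SNR in dB: reference = floating point coefficients, approx = finite precision."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    if noise == 0:
        return float("inf")          # identical sequences: no quantization noise
    return 10 * math.log10(signal / noise)

example_snr = snr_db([1.0, 1.0], [1.1, 0.9])   # 10% error on each coefficient
```

An error of one tenth of the signal amplitude on every coefficient corresponds to 20 dB, so the 50-70 dB figures reported for 12 bits imply relative errors of roughly 0.03% to 0.3%.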

4. The Proposed Systolic Array Architecture

A digit serial wavelet architecture using the pyramid algorithm was outlined in [15]. The proposed systolic array (DWT-SA) architecture is an improvement over the above architecture. Here, only one set of multipliers and adders has been employed, in contrast to the two parallel computational units employed in [15]. The multiplier and adder set performs all necessary computations to generate all highpass and lowpass coefficients. The DWT-SA architecture does not use any external or internal memory modules to store the intermediate results and therefore avoids the delays caused by memory access and memory refresh timing. In addition, since a set of registers controlled by a global clock is employed, the control circuitry does not need to take the intermediate products in and out of the memory. This results in a simple and efficient systolic implementation for 1-D DWT computation.


In order to compute separable 2-D DWT, two modules (one for row transform and another for column transform) of the proposed architecture can be used along with a transposer. The details of the schematic and the design of the transposer can be found in [21], [28]. We note that the proposed architecture is a complete setup for computing forward DWT. The inverse DWT can be calculated by replacing the decimator with an interpolator.

4.1 DWT-SA Architecture

The design of DWT-SA is based on a computation schedule derived from Eq. 6a-6n, which are the result of applying the pyramid algorithm for eight data points ($N = 8$) to the six tap filter. We note that Eq. 1a and 1b represent the highpass and lowpass components of the six tap FIR filter.

The proposed DWT-SA architecture is shown in Fig. 7. It comprises four basic units: Input Delay, Filter, Register Bank, and Control. The following sections present the design of each unit. First, we present the design of the Filter Unit and its subcomponent, the Filter Cell. The design of the Storage Units is then discussed, followed by the description of the Control Unit.

4.2 Filter Unit (FU)

The Filter Unit (FU) proposed for this architecture is a six-tap non-recursive FIR digital filter whose transfer functions for the highpass and lowpass components are shown in Eq. 1, where $g(0)$ to $g(5)$ and $h(0)$ to $h(5)$ are the coefficients of the HPF and LPF, respectively.

Computation of any DWT coefficient can be executed by employing a multiply and accumulate method where partial products are computed separately and subsequently added. This feature makes possible systolic implementation of DWT. The latency of each filter stage is 1 time unit (TU). Since partial components of more than one DWT coefficient are being computed at any given time, the latency of the filter once the pipeline has been filled is also 1 TU. The systolic architecture of a six tap filter is shown in Fig. 8. Here, partial results (one per cell) are computed and subsequently passed in a systolic manner from one cell to the adjacent cell.

Filter Cell (FC)

Eq. 1a-1b show that computations of the highpass and lowpass DWT coefficients at specific time instants are identical except for different values of the LPF and HPF filter coefficients. By introducing additional control circuitry, computations of both highpass and lowpass DWT coefficients can be executed using the same hardware in one clock cycle. The highpass coefficient calculation is performed during the first half of the clock cycle whereas the lowpass coefficient calculation is performed during the second half. Subsequently, the partial results are passed synchronously in a systolic manner from one cell to the adjacent cell. The proposed filter cell therefore consists of only one multiplier, one adder, and two registers to store the high-pass and the low-pass coefficients, respectively, as shown in Fig. 9.
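A behavioural sketch of such a cell chain is given below. It is not the circuit of Fig. 8 and Fig. 9, which are not reproduced here: the class and function names, the transposed-form data flow (the current sample offered to every cell), and the ordering of coefficients along the chain are all assumptions chosen so that the chain reproduces the convolution sums of Eq. 6.

```python
class FilterCell:
    """One multiplier and one adder shared by two partial-result registers."""

    def __init__(self, g_coeff, h_coeff):
        self.g, self.h = g_coeff, h_coeff
        self.high = 0   # register holding the highpass partial result
        self.low = 0    # register holding the lowpass partial result

    def clock(self, x, high_in, low_in):
        self.high = high_in + self.g * x   # first half-cycle: highpass term
        self.low = low_in + self.h * x     # second half-cycle: lowpass term

def run_filter(samples, g, h):
    """Feed samples through a chain of cells; returns (highpass, lowpass) per clock."""
    taps = len(g)
    cells = [FilterCell(g[taps - 1 - i], h[taps - 1 - i]) for i in range(taps)]
    outputs = []
    for x in samples:
        before = [(c.high, c.low) for c in cells]     # register values before the edge
        for i, cell in enumerate(cells):
            high_in, low_in = before[i - 1] if i > 0 else (0, 0)
            cell.clock(x, high_in, low_in)
        outputs.append((cells[-1].high, cells[-1].low))
    return outputs

# Two-tap toy filters: highpass = x[n] - x[n-1], lowpass = x[n] + x[n-1].
trace = run_filter([1, 2, 3], g=[1, -1], h=[1, 1])
```

Once the chain is filled, one highpass and one lowpass result emerge per clock, consistent with the 1 TU latency stated for the Filter Unit.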

In order to meet the real time requirements of applications such as video compression, a fast multiplier design is required. For this purpose, a high speed Booth multiplier [24] is used in the filter cell. The Booth multiplier uses a powerful algorithm for signed-number multiplication which treats both positive and negative numbers uniformly and is significantly faster than an array multiplier due to the reduced number of add stages required [24], [25]. The additional speedup is a result of using only half the number of adder stages. The full adders and half adders employed inside the multiplier are variants of those described in [26]. Both designs have been modified to reduce the carry out delay, which is critical in achieving the fastest possible multiplication.
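The halving of the adder stages is characteristic of radix-4 (modified) Booth recoding, in which the multiplier is recoded into half as many signed digits. The sketch below assumes that variant, which the text does not name explicitly; the function names are hypothetical and the code models the arithmetic only, not the gate-level design.

```python
def booth_radix4_digits(y, bits):
    """Recode a signed `bits`-wide multiplier into bits/2 digits in {-2, ..., 2}."""
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    pattern = y & ((1 << bits) - 1)      # two's complement bit pattern of y
    digits, prev = [], 0                 # prev: bit to the right of the current pair
    for pos in range(0, bits, 2):
        b0 = (pattern >> pos) & 1
        b1 = (pattern >> (pos + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits

def booth_multiply(x, y, bits):
    """Sum the bits/2 partial products, each weighted by a power of 4."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, bits)))

digits_of_minus3 = booth_radix4_digits(-3, 4)   # -3 = 1 * 1 + (-1) * 4
```

A 12-bit multiplier therefore produces 6 partial products instead of 12, and negative operands need no separate sign handling.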

4.3 Storage Units

Two storage units are used in the proposed architecture: Input Delay and Register Bank. The data registers used in these storage units have been constructed from standard Master-Slave, Edge Triggered, D-type flip-flops [26], [27]. The following presents the structure of each storage unit.

Input Delay Unit (ID)

Eq. 1a and 1b show that the value of a computed filter coefficient depends on the present as well as the five previous data samples (the negative time indexes in Eq. 1a-1b correspond to the reference starting time unit 0). It is therefore required that the present and the past five input data values be held in registers and be retrievable by the FU and the CU. Therefore, five data registers are connected serially in a chain, as shown in Fig. 7, and at every clock cycle each register passes its contents to its right neighbor, so that only the five most recent past values are retained.

Register Bank Unit (RB)

Several registers are required for storage of the intermediate partial results. Analysis of Eq. 1a and 1b justifies the requirement for the register bank; however, it does not explain its size. It will be shown in the next section that 26 data registers connected serially are required to implement RB.

4.4 Control Unit (CU)

One of the most important aspects of the DWT-SA architecture is its potential for real time operation. The proposed DWT-SA architecture computes $N$ coefficients in $N$ clock cycles and achieves real time operation by executing computations of higher octave coefficients in between the first octave coefficient computations. The first octave computations are scheduled every $N/4$ clock cycles, while the second and third octaves are scheduled every $N/2$ and every $N$ clock cycles, respectively.

There are several approaches for scheduling the octave computations. In the DWT-SA architecture, a schedule based on filter latency of 1 TU is proposed to meet the real time requirements in some applications. The computations are scheduled at the earliest possible clock cycle, and computed output samples are available one clock cycle after they have been scheduled, as shown in Table 4. The delay is minimized through the pipeline, facilitating real time operation. For example, the computation of d(0) can only be executed after the calculation of c(0) has been completed. The calculation of c(0) scheduled for cycle 1 is completed in cycle 2 due to the filter latency of 1 TU. d(0) is therefore scheduled for computation in a later cycle, i.e. cycle 4.

The schedule presented in Table 4 is periodic with period N, and the hardware is not utilized in cycle kN+2 where k is a non-negative integer. The computation schedule in Table 4 corresponds to a high hardware utilization of 87.5% (i.e. 7/8).
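The periodic schedule and its utilization figure can be checked with a few lines of code. The sketch below encodes Table 4 for N = 8; the function name `scheduled_ops` is an illustrative choice, not part of the original design.

```python
def scheduled_ops(cycle, period=8):
    """Return (high_pass, low_pass) labels for a 1-based cycle, or None when idle."""
    slot = (cycle - 1) % period + 1
    if slot % 2 == 1:            # cycles 1, 3, 5, 7: first octave
        n = slot - 1
        return (f"b({n})", f"c({n})")
    if slot in (4, 8):           # second octave
        n = 0 if slot == 4 else 4
        return (f"d({n})", f"e({n})")
    if slot == 6:                # third octave
        return ("f(0)", "g(0)")
    return None                  # slot 2: the hardware is idle


busy = sum(scheduled_ops(c) is not None for c in range(1, 9))
utilization = busy / 8
print(utilization)
```

Seven of the eight slots in each period carry a computation, which reproduces the 7/8 utilization quoted above.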

4.4.1 Register Allocation

The next step in designing the DWT-SA architecture is the design of the Control Unit (CU) and the Register Bank (RB). The two components synchronize the availability of operands. There are two schemes which can be employed for this purpose, namely the Forward Register Allocation (FRA) and the Forward-Backward Register Allocation (FBRA). The FRA method uses a set of registers which are allocated to intermediate data on a first come, first served basis. It does not reassign a register to another operand once its contents have been accessed. The FBRA scheme is similar, except that once the stored operand has been used, the register is reallocated to another operand. The FRA method is simpler, requires less control circuitry and permits easy adaptation of the architecture for coefficient calculation of more than 3 octaves. It results, however, in less efficient register utilization.

In either scheme, the coefficient computations are periodic and hence each register containing a specific variable will be reserved for the same variable in the next period. The construction of the register allocation tables for both FRA and FBRA is presented below.

FRA Register Allocation

To demonstrate the construction of the register allocation table using the FRA approach, we will consider the case of computing the coefficients c(0) and c(2).

As shown in Table 4, coefficient c(0) is computed in cycle 1, whereas coefficient c(2) is scheduled for computation in cycle 3. These two computed coefficients, along with four others, c(4), c(6), c(8) and c(10), are needed for the computation of coefficient e(0) (see Eq. 6k). According to Table 4, the e(0) computation is scheduled for cycle 4. However, the six coefficients c(0), c(2), c(4), c(6), c(8), c(10) will not be available until cycle 12, and therefore the calculation of e(0) has to be scheduled for cycle 12 (i.e., 4 + N). Similarly, coefficient e(4) is computed not in cycle 8, but in cycle 16 (i.e., 8 + N). The number of registers needed becomes apparent once one complete frame of computations has been scheduled.

Systematic application of the described method yields a complete schedule of computation for all intermediate and final coefficients, as shown in Table 5. Due to its size, the table has been divided into two parts.

In the FRA register allocation approach, where data moves systolically in one direction only, it is possible to increase the number of DWT decomposition octaves by placing additional registers in series after register R26. The new registers hold the intermediate coefficients needed for the computation of the next octave decomposition. Hardware utilization of the higher octave decomposition registers is inversely proportional to the order of computed coefficients.

FBRA Register Allocation

Table 5 shows that not all registers in the FRA register allocation scheme are used at every time instant. In fact, close examination of Table 5 suggests that for the first to second octave calculations, registers R1 to R11 are used 87.5% of the time, whereas for the second to third octave calculation, registers R12 to R26 are used 25% of the time.

Table 6 shows a complete register allocation table for the DWT-SA using the FBRA approach. It is observed that a higher register utilization could be achieved by applying the reallocation scheme, where empty registers are reallocated to other variables once the original variables held in them are no longer valid. This approach, which in the ideal case would make use of every register at each time instance, has a negative impact on the complexity of the architecture: it requires complex control circuitry for data reallocation and results in a less modular architecture compared to the FRA approach. Therefore, the FRA scheme, instead of FBRA, has been employed in the proposed architecture.

4.4.2 Activity Periods

Tables 5 and 6 show that the last pair of coefficients, i.e. f(0) and g(0), for the first group of eight samples is available at time instance 38. This implies that coefficient computations are overlapped for 5 periods. Table 5 also shows the time periods when the intermediate coefficients must remain valid in order to produce the subsequent higher octaves of coefficients. For example, consider the case of a first octave coefficient c(0). It remains active from the time instant it has been computed to the time of calculation of the next higher octave of coefficients, i.e., from time instance 1 to time instance 12 in the present configuration. Similarly, e(0), a second octave coefficient, remains active until the third octave coefficients (in which it is used) are computed, i.e. from time instance 12 to time instance 38. All the intermediate results and the associated periods of activity are listed in Table 7.

The number of registers required in this architecture is directly proportional to the number of levels of DWT decomposition, and is calculated during the construction of the timetable of computations. For the DWT-SA architecture, which computes three octaves of DWT decomposition and employs the FRA register allocation method, the top row of Table 5 indicates that the number of registers is 26.

We note that since no variable in Table 5 has a negative time index, a periodic interpretation of that table is required. Consider, for example, the variable c(-2) in the computation of d(0) and e(0) in cycle 4. The periodic interpretation of Table 5 implies that the register which holds the variable c(-2) in cycle 4 also holds the variable c(-2+8) in clock cycle 12 (4+8). Table 5 shows that c(6) is held in register R5.
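The FRA allocation can be modeled compactly under one assumption read off Table 5: a low-pass value computed in cycle t sits in register Rk in cycle t + k, with first octave values occupying only R1 to R11 and second octave values entering the chain from cycle 12 onward. The function names below are hypothetical, and the sketch covers only the 38 cycles tabulated.

```python
def computed_at(t):
    """Label of the low-pass value entering the register chain at cycle t, if any."""
    if t < 1:
        return None
    if t % 2 == 1:
        return f"c({(t - 1) % 8})"     # first octave: cycles 1, 3, 5, 7, ...
    if t >= 12 and t % 8 == 4:
        return "e(0)"                  # second octave, delayed by one period
    if t >= 16 and t % 8 == 0:
        return "e(4)"
    return None


def fra_register(cycle, k):
    """Content of register Rk (k = 1..26) at the given cycle, or None if empty."""
    label = computed_at(cycle - k)
    if label is None:
        return None
    if label.startswith("c") and k > 11:
        return None                    # c values are only needed in R1..R11
    return label


print(fra_register(12, 5))
```

Under this model, register R5 in cycle 12 holds the value computed in cycle 7, i.e. c(6), which agrees with the periodic reading of Table 5 given above.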

4.4.3 Complete Design of CU

The complete design of the Control Unit for the DWT-SA architecture is shown in Fig. 10. It schedules the computation of each DWT coefficient as shown in Table 4.

The CU is a switch that directs data from the Input Delay (ID) or the Register Bank (RB) to the Filter Unit (FU). The CU is a modular switch with a number of subcomponents equal to the number of taps in the FU. The CU multiplexes data from the ID every second cycle, and from the RB in cycles 4, 6, and 8. In cycle 2, the CU remains idle, i.e. it does not allow any passage of data. Proper timing, synchronization, as well as enabling and disabling of the CU, is ensured by the global CLK signal.

4.5 Timing Considerations

We have examined the design of each component in the proposed DWT-SA architecture. The timing considerations of the architecture are discussed below with respect to Fig. 7.

The number of switching inputs in each control subcell is equal to the number of octave computations. The first octave computations are scheduled every second clock cycle and hence the corresponding switch input is labeled 2k, where k is any non-negative integer. Moreover, its inputs are supplied directly by the ID. Second octave computations are executed in clock cycles 4 and 8, which is reflected by the label 4k+4. The third octave computations are scheduled in clock cycle 6, or 8k+6. Both second and third octave computations use partial results from previous octave computations and therefore use inputs from the RB. Table 5 determines which register is used as output.
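The switch selection described above can be sketched as a small decision function. This is an illustrative model only: the function names are hypothetical, and the first octave is placed on the odd cycles 1, 3, 5, 7 as listed in Table 4, while the second and third octaves follow the labels 4k+4 and 8k+6.

```python
def octave_for_cycle(cycle):
    """Octave computed in a 1-based clock cycle, or None when the CU is idle."""
    if cycle % 8 == 6:
        return 3        # third octave, label 8k+6
    if cycle % 4 == 0:
        return 2        # second octave, label 4k+4
    if cycle % 2 == 1:
        return 1        # first octave, odd cycles as in Table 4
    return None         # cycles 2, 10, 18, ...: no data is passed


def operand_source(cycle):
    """Unit that supplies the filter operands in the given cycle."""
    octave = octave_for_cycle(cycle)
    if octave is None:
        return None
    return "ID" if octave == 1 else "RB"


print([operand_source(c) for c in range(1, 9)])
```

Only the first octave reads from the input delay; the higher octaves reuse partial results held in the register bank.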

The delay of the DWT-SA architecture consists of the latency period necessary to fill up the filter for the first time, in addition to the number of clock cycles through the registers as described in Table 4. First results are thus produced 43 (i.e., 5+38) clock cycles after the first input sample has entered the pipeline. Subsequent coefficients are available at the output of the pipeline every 8 clock cycles. The DWT coefficients are output from the final filter stage.

4.6 Simulation Results

The proposed DWT-SA architecture has been fully simulated in order to validate its functionality [21]. First, analog simulations on each small cell, and later digital simulations on the larger blocks and the final chip, have been performed. The analog simulations were executed using the Hspice simulation tool, running under the Opus 4.2 design platform. Once the gate level analog simulations of all subcircuits were completed, digital simulations were performed on groups of subcircuits forming a more complex functional circuit. Larger blocks of circuits were assembled progressively and verified until the entire DWT-SA architecture was simulated. The digital simulator used was the Verilog logic simulator running under Opus 4.2. Process parameters used were those of 1.2 µm technology.

The dimension of a single DWT-SA module with 8 bit wide data and filter coefficients will be approximately 10 mm × 7 mm, with approximately 300,000 transistors on the chip. The power dissipation of the architecture using a 5 volt CMOS process at 20 MHz will be approximately 500 mW to 1 W. The 16-bit architecture would require 4 times more transistors and 4 times more silicon area, and will consume about 4 times more power. However, the simulation results in Section 3 show that it is sufficient to have an architecture with 12-bit precision. In this case, the required number of transistors and power dissipation will be approximately 2.25 times those required by the 8-bit architecture.
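The quoted factors of 4 and 2.25 equal (16/8)^2 and (12/8)^2, which suggests that cost grows with the square of the word length. The sketch below works through that arithmetic; the quadratic model is an inference from the quoted numbers rather than a statement from the design itself, and the function name is hypothetical.

```python
BASE_BITS = 8
BASE_TRANSISTORS = 300_000   # approximate figure quoted for the 8-bit module


def relative_cost(bits, base_bits=BASE_BITS):
    """Cost multiplier relative to the 8-bit design, assuming quadratic growth."""
    return (bits / base_bits) ** 2


for bits in (8, 12, 16):
    factor = relative_cost(bits)
    print(f"{bits}-bit: x{factor:.2f}, about {factor * BASE_TRANSISTORS:,.0f} transistors")
```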

One of the most popular applications of DWT is video processing. With a frame rate of 30 frames/sec, the video processor should process a complete frame in less than 33 ms. It has been found that the proposed architecture can execute the DWT computations on a monochrome 512 × 512 frame in 13 ms, with 1.2 µm technology and a 20 MHz clock rate. The computation on a color frame will take about 39 ms, and hence cannot be executed in real time. However, an additional 35% speedup can be expected if 0.8 µm technology or below is employed. This speedup will enable the architecture to perform real time DWT computations for color video sequences.
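These frame times follow from a simple throughput estimate, sketched below under stated assumptions: one sample is processed per clock cycle, a color frame carries three components, and the 35% speedup is read as a 1.35 times higher sample rate. The function name is illustrative.

```python
def frame_time_ms(width, height, clock_hz, channels=1):
    """Time to stream one frame through the pipeline at one sample per cycle."""
    return width * height * channels / clock_hz * 1e3


FRAME_BUDGET_MS = 1000 / 30                          # 30 frames/sec

mono = frame_time_ms(512, 512, 20e6)                 # roughly 13 ms
color = frame_time_ms(512, 512, 20e6, channels=3)    # roughly 39 ms
color_fast = color / 1.35                            # assumed meaning of "35% speedup"

print(mono, color, color_fast, FRAME_BUDGET_MS)
```

With these assumptions the monochrome frame fits comfortably within the 33 ms budget, the color frame exceeds it, and the faster process brings the color frame back under it.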

5. Conclusion

A systolic VLSI architecture for computing one dimensional DWT in real time has been presented. The architecture is simple, modular, cascadable, and has been implemented in VLSI. The implementation employs only one multiplier per filter cell, and hence results in a considerably smaller chip area. As implemented in a 1.2 µm technology, and running at 20 MHz, this architecture achieves real-time DWT computation for a monochrome 512 × 512 video input.

Acknowledgment

This research was funded by the Microelectronics Network (Micronet) under the Network Centers of Excellence (NCE) program of the Government of Canada.

References

[1] I. Daubechies, “Orthonormal bases of compactly supported wavelets,” Comm. Pure Appl. Math., Vol. 41, pp. 906-966, 1988.

[2] S. G. Mallat, “A theory of multiresolution signal decomposition: the wavelet representation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989.

[3] M. Vetterli and C. Herley, “Wavelets and filter banks: theory and design,” IEEE Trans. on Signal Processing, Vol. 40, No. 9, pp. 2207-2232, 1992.

[4] R. A. Gopinath, Wavelets and Filter Banks - New Results and Applications, Ph.D. Dissertation, Rice University, Houston, Texas, 1993.

[5] Y. Meyer, Wavelets: Algorithms and Applications, SIAM, Philadelphia, 1993.

[6] A. N. Akansu and R. A. Haddad, Multiresolution Signal Decomposition: Transform, Subbands and Wavelets, Academic Press Inc., 1992.

[7] O. Rioul and M. Vetterli, “Wavelets and signal processing,” IEEE Signal Processing Magazine, pp. 14-38, Oct. 1991.

[8] R. A. Devore, B. Jawerth and B. J. Lucier, “Image compression through wavelet coding,” IEEE Trans. on Information Theory, Vol. 38, No. 2, pp. 719-746, March 1992.

[9] M. Unser, “Texture classification and segmentation using wavelet frames,” IEEE Trans. on Image Processing, Vol. 4, No. 11, pp. 1549-1560, Nov. 1995.

[10] S. G. Mallat, “Multifrequency channel decompositions of images and wavelet models,” IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 37, No. 12, pp. 2091-2110, Sept. 1989.

[11] G. Beylkin, R. R. Coifman and V. Rokhlin, Wavelets in Numerical Analysis; in Wavelets and Their Applications, pp. 181-210, Jones and Bartlett, 1992.

[12] M. A. Stoksik, R. G. Lane and D. T. Nguyen, “Accurate synthesis of fractional Brownian motion using wavelets,” Electronics Letters, Vol. 30, No. 5, pp. 383-384, March 1994.

[13] L. Senhadji, G. Carrault, and J. J. Bellanger, “Interictal EEG spike detection: A new framework based on wavelet transform,” Proc. of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pp. 548-551, Philadelphia, Oct. 1994.

[14] O. Rioul and P. Duhamel, “Fast algorithms for discrete and continuous wavelet transforms,” IEEE Trans. on Information Theory, Vol. 38, No. 2, March 1992.

[15] K. K. Parhi and T. Nishitani, “VLSI architectures for discrete wavelet transforms,” IEEE Trans. on VLSI Systems, pp. 191-202, June 1993.

[16] Aware Wavelet Transform Processor (WTP) Preliminary, Aware Inc., Cambridge, MA.

[17] M. Vishwanath, “Discrete wavelet transform in VLSI,” Proc. IEEE Int. Conf. Appl. Specific Array Processors, pp. 218-229, 1992.

[18] K. K. Parhi, “Video data format converters using minimum number of registers,” IEEE Trans. on Circuits and Systems for Video Tech., Vol. 38, pp. 255-267, June 1992.

[19] C. Chakrabarti, M. Vishwanath and R. M. Owens, “Architectures for wavelet transforms,” Proc. IEEE VLSI Signal Processing Workshop, pp. 507-515, 1993.

[20] Y. Kang, “Low-power design of wavelet processors,” Proc. of SPIE, Vol. 2308, pp. 1800-1806, 1993.

[21] A. Grzeszczak, VLSI Architecture for Discrete Wavelet Transform, M.A.Sc. thesis, Department of Electrical Engineering, University of Ottawa, Canada, 1995.

[22] G. Knowles, “VLSI architecture for the discrete wavelet transform,” Electronics Letters, Vol. 26, No. 15, pp. 1184-1185, Jul. 1990.

[23] M. Vishwanath, “The recursive pyramid algorithm for the discrete wavelet transform,” IEEE Trans. on Signal Processing, March 1994.

[24] A. D. Booth, “A signed binary multiplication technique,” Quarterly Journal of Mechanics and Applied Mathematics, Vol. 4, pp. 236-240, 1951.

[25] G. J. Hekstra, Multiplier Architectures for VLSI Implementation, Technical Report No. 90-104, Delft University of Technology, Netherlands, Nov. 1990.

[26] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, June 1988.

[27] J. Cavanagh, Digital Computer Arithmetic, McGraw-Hill, 1984.

[28] S. Panchanathan, “Universal architecture for matrix transposition,” IEE Proceedings-E, Vol. 139, No. 5, Sept. 1992.

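[Figure 1: block diagram. The input a (N samples) passes through three cascaded stages of high-pass (H1, H2, H3) and low-pass (L1, L2, L3) filters, each followed by downsampling by 2. The outputs are labeled b, c, d, e, f and g, with N/2 b and c samples, N/4 e samples, and N/8 f and g samples.]

Figure 1. Three stage DWT decomposition using pyramid algorithm.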

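[Figure 2: (a) an image divided into four sub-images labeled LL, LH, HL and HH; (b) a three level decomposition in which the LL sub-image is repeatedly divided, with the horizontal axis marked x, x/2, x/4, x/8 and the vertical axis marked y, y/2, y/4, y/8.]

Figure 2. (a) Wavelet transform decomposition of an image into 4 sub-images. (b) A three level image decomposition.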

[Figure 3: five plots, panels (a) to (e). Panel (a) plots Input Data against Samples; panels (b) to (e) plot DWT Coefficients against Samples, each with two curves labeled Original and 200*Error.]

Figure 3. Ideal DWT coefficients and errors due to 12 bit precision of coefficients. The data were decomposed with the Daubechies 8 tap wavelet for 3 stages. The dotted lines represent errors due to 12 bit coefficients. a) The step input, b) stage 3 lowpass coefficients, c) stage 3 highpass coefficients, d) stage 2 highpass coefficients, e) stage 1 highpass coefficients.

[Figure 4: five plots, panels (a) to (e). Panel (a) plots Input Data against Samples; panels (b) to (e) plot DWT Coefficients against Samples, each with an Original curve and an error curve (legends: 100*Error, Error, Error, 5*error).]

Figure 4. Ideal DWT coefficients and errors due to 12 bit precision of coefficients. The data were decomposed with the Daubechies 8 tap wavelet for 3 stages. The dotted lines represent errors due to 12 bit coefficients. a) The sinusoidal input, b) stage 3 lowpass coefficients, c) stage 3 highpass coefficients, d) stage 2 highpass coefficients, e) stage 1 highpass coefficients.

[Figure 5: plot of SNR (in dB), from 40 to 65, against Filter Coefficient Precision (in bits), from 8 to 14, with two curves labeled Sine Input and Step Input.]

Figure 5. Performance variation with respect to filter coefficient precision.

[Figure 6: two plots, each with curves labeled Lena and Mandrill. (a) SNR (in dB), from 38 to 45, against Filter Coefficient Precision (in bits), from 9 to 15. (b) SNR (in dB), from 20 to 60, against Wavelet Coefficient Precision (in bits), from 9 to 14.]

Figure 6. Performance variation with respect to a) filter coefficient precision and b) DWT coefficient precision.

[Figure 7: block diagram. INPUT feeds the INPUT DELAY (registers D1 to D5). The CONTROL UNIT (MUX) has switch inputs labeled 2K, 4K+4 and 8K+6 and drives the FILTER (cells F1 to F6), which produces the OUTPUT. The REGISTER BANK shows registers R1 to R11, R14, R18, R22 and R26.]

Figure 7. The proposed DWT-SA architecture.

[Figure 8: a 6 TAP FILTER with taps numbered 0 to 5, fed by I(nT), I(nT-1), I(nT-2), I(nT-3), I(nT-4) and I(nT-5), with CLK and SAMPLE INPUT signals and an OUT port.]

Figure 8. Systolic operation of the six tap filter.

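[Figure 9: a filter cell in which a MUX selects between the H and L coefficients, followed by a multiplier and an adder. Inputs: Sample and the partial result from the previous stage. Output: to the next stage.]

Figure 9. Proposed filter cell.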

[Figure 10: six 4:1 MUX blocks. Inputs from the ID are labeled I, Iz-1, Iz-2, Iz-3, Iz-4 and Iz-5; further inputs come from the RB; a Clk Control signal drives the multiplexers; the outputs go to the Filter Unit.]

Figure 10. The Control Unit (CU) block diagram.

Table 1. Filter coefficients of Daubechies wavelets.

Coefficient   Daub-6      Daub-8      Daub-8(LA)^1  Daub-10     Daub-10(LA)
h(0)           0.332671    0.230378   -0.075766      0.160102    0.027333
h(1)           0.806892    0.714847   -0.029636      0.603829    0.029519
h(2)           0.459878    0.630881    0.497619      0.724309   -0.039134
h(3)          -0.135011   -0.027984    0.803739      0.138428    0.199398
h(4)          -0.085441   -0.187035    0.297858     -0.242295    0.723408
h(5)           0.035226    0.030841   -0.099220     -0.032245    0.633979
h(6)                       0.032883   -0.012604      0.077571    0.016602
h(7)                      -0.010597    0.032223     -0.006241   -0.175328
h(8)                                                -0.012581   -0.021102
h(9)                                                 0.003336    0.019539
Σ|h(n)|        1.855118    1.865446    1.848663      2.000938    1.885342
max{|h(n)|}    0.806892    0.714847    0.803739      0.724309    0.723408

^1 LA refers to filters with least asymmetric phase
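The summary rows of Table 1 can be reproduced from the listed coefficients. The sketch below does so for the Daub-6 column, reading the sum row as a sum of absolute values (an interpretation that matches the tabulated figure to the printed precision); the variable names are illustrative.

```python
# Daub-6 low-pass coefficients h(0)..h(5), copied from Table 1.
daub6 = [0.332671, 0.806892, 0.459878, -0.135011, -0.085441, 0.035226]

abs_sum = sum(abs(h) for h in daub6)     # compare with the sum row of Table 1
abs_max = max(abs(h) for h in daub6)     # compare with the max row of Table 1
plain_sum = sum(daub6)                   # close to sqrt(2) for an orthonormal filter

print(round(abs_sum, 6), abs_max, round(plain_sum, 6))
```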

Table 2: Dynamic range of DWT coefficients for various stages (z = Σ|h(n)|).

Data               1-D                  2-D
input data         (-d, d)              (-d, d)
1st level coeff.   (-d*z, d*z)          (-d*z^2, d*z^2)
2nd level coeff.   (-d*z^2, d*z^2)      (-d*z^4, d*z^4)
3rd level coeff.   (-d*z^3, d*z^3)      (-d*z^6, d*z^6)
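The pattern in Table 2 is that each decomposition level multiplies the bound by z in the 1-D case and by z^2 in the 2-D case. A minimal sketch of that rule follows; the function name and the example input range d = 128 are illustrative assumptions.

```python
def dynamic_range(d, z, level, dims=1):
    """Bound on coefficient magnitude after `level` stages, following Table 2.

    level=0 gives the input range; dims is 1 for 1-D data and 2 for 2-D data.
    """
    bound = d * z ** (level * dims)
    return (-bound, bound)


Z_DAUB6 = 1.855118   # sum of |h(n)| for Daub-6, taken from Table 1

# Example: third level, 1-D, for inputs in (-128, 128).
low, high = dynamic_range(128, Z_DAUB6, level=3)
print(low, high)
```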

Table 3: Scaling factors: to obtain the ideal coefficients, the j bit DWT coefficients should be divided by the scaling factor.

                   1-D          2-D
1st level coeff.   2^(j-i-1)    2^(j-i-2)
2nd level coeff.   2^(j-i-2)    2^(j-i-4)
3rd level coeff.   2^(j-i-3)    2^(j-i-6)

Table 4. Schedule for one complete set of computations.

Init Cycle   High-pass   Low-pass
1            b(0)        c(0)
2            -           -
3            b(2)        c(2)
4            d(0)        e(0)
5            b(4)        c(4)
6            f(0)        g(0)
7            b(6)        c(6)
8            d(4)        e(4)

Table 5. FRA Register Allocation

Cycle Com  R1   R2   R3   R4   R5   R6   R7   R8   R9   R10  R11  R12  R13
1     c(0) -    -    -    -    -    -    -    -    -    -    -    -    -
2     -    c(0) -    -    -    -    -    -    -    -    -    -    -    -
3     c(2) -    c(0) -    -    -    -    -    -    -    -    -    -    -
4     e(0) c(2) -    c(0) -    -    -    -    -    -    -    -    -    -
5     c(4) -    c(2) -    c(0) -    -    -    -    -    -    -    -    -
6     g(0) c(4) -    c(2) -    c(0) -    -    -    -    -    -    -    -
7     c(6) -    c(4) -    c(2) -    c(0) -    -    -    -    -    -    -
8     e(4) c(6) -    c(4) -    c(2) -    c(0) -    -    -    -    -    -
9     c(0) -    c(6) -    c(4) -    c(2) -    c(0) -    -    -    -    -
10    -    c(0) -    c(6) -    c(4) -    c(2) -    c(0) -    -    -    -
11    c(2) -    c(0) -    c(6) -    c(4) -    c(2) -    c(0) -    -    -
12    e(0) c(2) -    c(0) -    c(6) -    c(4) -    c(2) -    c(0) -    -
13    c(4) e(0) c(2) -    c(0) -    c(6) -    c(4) -    c(2) -    -    -
14    g(0) c(4) e(0) c(2) -    c(0) -    c(6) -    c(4) -    c(2) -    -
15    c(6) -    c(4) e(0) c(2) -    c(0) -    c(6) -    c(4) -    -    -
16    e(4) c(6) -    c(4) e(0) c(2) -    c(0) -    c(6) -    c(4) -    -
17    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) -    c(6) -    -    -
18    -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) -    c(6) -    -
19    c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) -    -    -
20    e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) -    -
21    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    -    -
22    g(0) c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    -
23    c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) -    -
24    e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) -
25    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    -    e(0)
26    -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    -
27    c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) -    -
28    e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) -
29    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    -    e(4)
30    g(0) c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    -
31    c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) -    -
32    e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) -
33    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    -    e(0)
34    -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) c(6) -    -
35    c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) -    -
36    e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    c(0) e(4) -
37    c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    -    e(4)
38    g(0) c(4) e(0) c(2) -    c(0) e(4) c(6) -    c(4) e(0) c(2) -    -

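Table 5. FRA Register Allocation (continued).

Cycle Com  R14  R15  R16  R17  R18  R19  R20  R21  R22  R23  R24  R25  R26
1     c(0) -    -    -    -    -    -    -    -    -    -    -    -    -
2     -    -    -    -    -    -    -    -    -    -    -    -    -
3     c(2) -    -    -    -    -    -    -    -    -    -    -    -    -
4     e(0) -    -    -    -    -    -    -    -    -    -    -    -    -
5     c(4) -    -    -    -    -    -    -    -    -    -    -    -    -
6     g(0) -    -    -    -    -    -    -    -    -    -    -    -    -
7     c(6) -    -    -    -    -    -    -    -    -    -    -    -    -
8     e(4) -    -    -    -    -    -    -    -    -    -    -    -    -
9     c(0) -    -    -    -    -    -    -    -    -    -    -    -    -
10    -    -    -    -    -    -    -    -    -    -    -    -    -
11    c(2) -    -    -    -    -    -    -    -    -    -    -    -    -
12    e(0) -    -    -    -    -    -    -    -    -    -    -    -    -
13    c(4) -    -    -    -    -    -    -    -    -    -    -    -    -
14    g(0) -    -    -    -    -    -    -    -    -    -    -    -    -
15    c(6) -    -    -    -    -    -    -    -    -    -    -    -    -
16    e(4) -    -    -    -    -    -    -    -    -    -    -    -    -
17    c(0) -    -    -    -    -    -    -    -    -    -    -    -    -
18    -    -    -    -    -    -    -    -    -    -    -    -    -
19    c(2) -    -    -    -    -    -    -    -    -    -    -    -    -
20    e(0) -    -    -    -    -    -    -    -    -    -    -    -    -
21    c(4) -    -    -    -    -    -    -    -    -    -    -    -    -
22    g(0) -    -    -    -    -    -    -    -    -    -    -    -    -
23    c(6) -    -    -    -    -    -    -    -    -    -    -    -    -
24    e(4) -    -    -    -    -    -    -    -    -    -    -    -    -
25    c(0) -    -    -    -    -    -    -    -    -    -    -    -    -
26    -    e(0) -    -    -    -    -    -    -    -    -    -    -    -
27    c(2) -    e(0) -    -    -    -    -    -    -    -    -    -    -
28    e(0) -    -    e(0) -    -    -    -    -    -    -    -    -    -
29    c(4) -    -    -    e(0) -    -    -    -    -    -    -    -    -
30    g(0) e(4) -    -    -    e(0) -    -    -    -    -    -    -    -
31    c(6) -    e(4) -    -    -    e(0) -    -    -    -    -    -    -
32    e(4) -    -    e(4) -    -    -    e(0) -    -    -    -    -    -
33    c(0) -    -    -    e(4) -    -    -    e(0) -    -    -    -    -
34    -    e(0) -    -    -    e(4) -    -    -    e(0) -    -    -    -
35    c(2) -    e(0) -    -    -    e(4) -    -    -    e(0) -    -    -
36    e(0) -    -    e(0) -    -    -    e(4) -    -    -    e(0) -    -
37    c(4) -    -    -    e(0) -    -    -    e(4) -    -    -    e(0) -
38    g(0) e(4) -    -    -    e(0) -    -    -    e(4) -    -    -    e(0)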

Table 6. FBRA Register Allocation.

Cycle  Com   Occupied registers among R1 to R14, in order of increasing index
1      c(0)
2      -     c(0)
3      c(2)  c(0)
4      e(0)  c(2) c(0)
5      c(4)  c(2) c(0)
6      g(0)  c(4) c(2) c(0)
7      c(6)  c(4) c(2) c(0)
8      e(4)  c(6) c(4) c(2) c(0)
9      c(0)  c(6) c(4) c(2) c(0)
10     -     c(0) c(6) c(4) c(2) c(0)
11     c(2)  c(0) c(6) c(4) c(2) c(0)
12     e(0)  c(2) c(0) c(6) c(4) c(2) c(0)
13     c(4)  e(0) c(2) c(0) c(6) c(4) c(2)
14     g(0)  c(4) e(0) c(2) c(0) c(6) c(4) c(2)
15     c(6)  c(4) e(0) c(2) c(0) c(6) c(4)
16     e(4)  c(6) c(4) e(0) c(2) c(0) c(6) c(4)
17     c(0)  e(4) c(6) c(4) e(0) c(2) c(0) c(6)
18     -     c(0) e(4) c(6) c(4) e(0) c(2) c(0) c(6)
19     c(2)  c(0) e(4) c(6) c(4) e(0) c(2) c(0)
20     e(0)  c(2) c(0) e(4) c(6) c(4) e(0) c(2) c(0)
21     c(4)  e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2)
22     g(0)  c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2)
23     c(6)  c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0)
24     e(4)  c(6) c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0)
25     c(0)  e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(0)
26     -     c(0) e(4) c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(0)
27     c(2)  c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4)
28     e(0)  c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4)
29     c(4)  e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) e(4)
30     g(0)  c(4) e(0) c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) e(4)
31     c(6)  c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) e(0)
32     e(4)  c(6) c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) e(0)
33     c(0)  e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) c(6) e(4) e(0)
34     -     c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) c(6) e(4) e(0) e(0)
35     c(2)  c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) e(0) e(4) e(0)
36     e(0)  c(2) c(0) e(4) c(6) c(4) e(0) c(2) e(0) c(0) e(4) e(0) e(4)
37     c(4)  e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) c(2) e(0) e(0) e(4)
38     g(0)  c(4) e(0) c(2) c(0) e(4) c(6) e(4) c(4) e(0) c(2) e(0) e(0) e(4)

Table 7. Activity periods for intermediate results

Sample   Available at cycle   Life period
c(0)     1                    1 to 12
c(2)     3                    3 to 14
c(4)     5                    5 to 16
c(6)     7                    7 to 18
e(0)     12                   12 to 18
e(4)     16                   16 to 38