

A Pipeline VLSI Architecture for High-Speed Computation of the 1-D Discrete Wavelet Transform

Chengjun Zhang, Chunyan Wang, Senior Member, IEEE, and M. Omair Ahmad, Fellow, IEEE

Abstract—In this paper, a scheme for the design of a high-speed pipeline VLSI architecture for the computation of the 1-D discrete wavelet transform (DWT) is proposed. The main focus of the scheme is on reducing the number and period of clock cycles for the DWT computation with little or no overhead on the hardware resources by maximizing the inter- and intrastage parallelisms of the pipeline. The interstage parallelism is enhanced by optimally mapping the computational load associated with the various DWT decomposition levels to the stages of the pipeline and by synchronizing their operations. The intrastage parallelism is enhanced by decomposing the filtering operation equally into two subtasks that can be performed independently in parallel and by optimally organizing the bitwise operations for performing each subtask so that the delay of the critical data path from a partial-product bit to a bit of the output sample for the filtering operation is minimized. It is shown that an architecture designed based on the proposed scheme requires a smaller number of clock cycles compared to that of the architectures employing comparable hardware resources. In fact, the requirement on the hardware resources of the architecture designed by using the proposed scheme also gets improved due to a smaller number of registers that need to be employed. Based on the proposed scheme, a specific example of designing an architecture for the DWT computation is considered. In order to assess the feasibility and the efficiency of the proposed scheme, the architecture thus designed is simulated and implemented on a field-programmable gate-array board. It is seen that the simulation and implementation results conform to the stated goals of the proposed scheme, thus making the scheme a viable approach for designing a practical and realizable architecture for real-time DWT computation.

Index Terms—Discrete wavelet transform (DWT), DWT computation, field-programmable gate-array (FPGA) implementation, inter- and intrastage parallelisms, multiresolution filtering, parallel architecture, pipeline architecture, real-time processing, VLSI architecture.

I. INTRODUCTION

SINCE the development of the theory for the computation of the discrete wavelet transform (DWT) by Mallat [1] in 1989, the DWT has been increasingly used in many different areas of science and engineering, mainly because of the multiresolution decomposition property of the transformed signals.

Manuscript received August 12, 2009; revised January 18, 2010; accepted February 23, 2010. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and in part by the Regroupement Stratégique en Microélectronique du Québec (ReSMiQ). This paper was recommended by Associate Editor Y. Lian.

The authors are with the Center for Signal Processing and Communications, Department of Electrical and Computer Engineering, Concordia University, Montreal, QC H3G 1M8, Canada (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSI.2010.2046974

The DWT is computationally intensive because of the multiple levels of decomposition involved in its computation. It is therefore a challenging problem to design an efficient VLSI architecture to implement the DWT computation for real-time applications, particularly those requiring the processing of high-frequency or broadband signals [2]–[4].

Many architectures have been proposed in order to provide high-speed and area-efficient implementations of the DWT computation [5]–[8]. In [9]–[11], the polyphase matrix of a wavelet filter is decomposed into a sequence of alternating upper and lower triangular matrices and a diagonal matrix to obtain the so-called lifting-based architectures with low hardware complexity. However, such architectures have a long critical path, which reduces the processing rate of the input samples. On the other hand, the problem of a low processing rate is not acute in the architectures that use convolution low- and high-pass filtering operations to compute the DWT [12]–[19]. These convolution-based architectures can be categorized as single- or multistage pipeline architectures. The architectures proposed in [12]–[16] are single-stage architectures in which the DWT computation is performed using a recursive pyramid algorithm (RPA) [20], which results in a reduced memory space requirement for the architectures. Lewis and Knowles [12] have designed a simple single-stage VLSI architecture to implement the computation of the DWT without multipliers. Chakrabarti and Vishwanath [13] have proposed a single-stage SIMD architecture that is aimed at reducing the computation time. Grzesczak et al. [14] have proposed a single-stage systolic array architecture with reduced hardware resources. Cheng and Parhi [15] have proposed a high-speed single-stage architecture based on hardware-efficient parallel finite-impulse response (FIR) filter structures for the DWT computation. The architectures proposed in [17]–[19] are multistage architectures in which the tasks of the various decomposition levels of the DWT computation are distributed to a number of pipeline stages. A high-speed multistage pipeline architecture, with one stage carrying out the computation of one decomposition level of the DWT, has been proposed by Marino et al. [17]. Park [18] has proposed a pipeline architecture using scalable data recorder units, each requiring a small amount of hardware resources. Masud and McCanny [19] have proposed a method for the design of an efficient, modular, and scalable pipeline architecture by using reusable silicon intellectual property (IP) cores for the DWT computation. The pipeline architectures have the advantages of requiring a small memory space and a short computing time and are suitable for real-time computations. However, these architectures have some inherent characteristics that have not yet been fully exploited in the schemes for their design. The


computational performance of such architectures could be further improved, provided that the design of the pipeline makes use of inter- and intrastage parallelisms to the maximum extent possible, synchronizes the operations of the stages optimally, and utilizes the available hardware resources judiciously.

In this paper, a scheme for the design of a pipeline architecture for a fast computation of the DWT is developed. The goal of fast computation is achieved by minimizing the number and period of clock cycles. The main idea used for minimizing these two parameters is to distribute the task of the DWT computation optimally among the stages of the pipeline and to maximize the inter- and intrastage parallelisms of the pipeline.

This paper is organized as follows. In Section II, a matrix formulation for the DWT computation is presented. In Section III, a study is conducted to determine the number of stages required to optimally map the task of the DWT computation onto the stages of the pipeline. Based on this study, in Section IV, a scheme for the design of a pipeline architecture is developed. In Section V, the performance of the pipeline architecture for the DWT computation using the proposed design scheme is assessed and compared with that of other existing architectures. A specific example of designing an architecture for the DWT computation is also considered, and the resulting architecture is simulated and implemented on a field-programmable gate-array (FPGA) board in order to demonstrate the realizability and validity of the proposed scheme. Section VI summarizes this paper by highlighting the salient features of the proposed design scheme and the resulting pipeline architectures.

II. FORMULATION OF THE DWT COMPUTATION

A. Matrix Formulation

The 1-D DWT of a signal is computed by performing the filtering operation repeatedly, i.e., first on the input data and then on the output data, each time after decimating it by a factor of two, for the successive decomposition levels. The filtering operation uses a quadrature mirror filter bank with low- and high-pass filters to decompose the signal into low- and high-pass subband signals, respectively. The transform can be expressed using a matrix formulation in order to provide a better insight into the underlying operations of the DWT and to facilitate the proposed scheme for the design of the architecture for its computation.

Let the input signal be denoted as $X = [x(0)\; x(1)\;\cdots\; x(N-1)]^{T}$, where $N$, the number of samples in the input signal, is chosen to be $2^{J}$, with $J$ being an integer. Assume that $h(i)$ and $g(i)$, $i = 0, 1, \ldots, L-1$, are the coefficients of the $L$-tap low- and high-pass filters, respectively. Then, by expressing the transform matrices for the low- and high-pass computations at the $j$th level of decomposition as $\mathbf{H}_j$ and $\mathbf{G}_j$, given by (1a) and (1b), respectively, where both $\mathbf{H}_j$ and $\mathbf{G}_j$ have a size of $(N/2^{j}) \times (N/2^{j-1})$, the outputs of the transform at the $j$th level can be computed from the following:

$\mathbf{A}_j = \mathbf{H}_j\,\mathbf{A}_{j-1}, \qquad \mathbf{D}_j = \mathbf{G}_j\,\mathbf{A}_{j-1}, \qquad j = 1, 2, \ldots, J$  (2)

where $\mathbf{A}_j$ and $\mathbf{D}_j$ represent the column vectors of size $N/2^{j}$ and consist of the low- and high-pass output samples, respectively, at the decomposition level $j$, with $\mathbf{A}_0 = X$. It is clear from (1a) and (1b) that the lengths of the filters and the size of the input samples control the number of nonzero entries of the matrices involved, which, in turn, determines the complexity of the DWT computation. If the decomposed signals are required to be reassembled into the original form without loss of information, the low- and high-pass filters must satisfy the perfect reconstruction condition given by

$g(i) = (-1)^{i}\, h(L-1-i), \qquad i = 0, 1, \ldots, L-1.$  (3)

A border extension of the input signal becomes necessary for the processing of the samples on or near the border of a finite-length signal. There are generally three ways by which the border can be extended in a DWT computation, namely, zero padding, symmetric padding, and periodic padding [21]. Even though, from the point of view of hardware cost, zero padding is the least expensive, periodic padding is the most commonly used method for border extension, since it allows a precise recovery of the original signal at or near the border. This method extends the original sequence by appending it with its first $L-2$ samples as

$x(N+i) = x(i), \qquad i = 0, 1, \ldots, L-3.$  (4)

(1a), (1b): $\mathbf{H}_j$ and $\mathbf{G}_j$ are banded matrices whose rows contain the low-pass coefficients $h(i)$ and the high-pass coefficients $g(i)$, respectively, with the entries in successive rows shifted to the right by two positions.
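As an illustration of the matrix formulation of (1a), (1b), and (2), together with the periodic padding of (4), the following is a minimal NumPy sketch. The explicit matrix layout and the filter orientation are assumptions consistent with the description above (rows shifted right by two positions, padding with the first $L-2$ samples); the filter values and function names are illustrative only, not taken from the paper.

```python
import numpy as np

def transform_matrices(h, g, n_in):
    """Build level-j transform matrices H_j and G_j for an n_in-sample input
    (n_in = N / 2**(j-1)).  Each matrix has n_in/2 rows; the filter taps in
    successive rows are shifted right by two columns, and the extra L-2
    columns accommodate the periodic padding of (4)."""
    L = len(h)
    n_out = n_in // 2
    H = np.zeros((n_out, n_in + L - 2))
    G = np.zeros((n_out, n_in + L - 2))
    for m in range(n_out):
        H[m, 2 * m: 2 * m + L] = h
        G[m, 2 * m: 2 * m + L] = g
    return H, G

def dwt_level(a_prev, h, g):
    """One decomposition level, as in (2): A_j = H_j A_{j-1}, D_j = G_j A_{j-1}."""
    L = len(h)
    a_pad = np.concatenate([a_prev, a_prev[:L - 2]])   # periodic padding, as in (4)
    H, G = transform_matrices(h, g, len(a_prev))
    return H @ a_pad, G @ a_pad

# usage: a 3-level DWT of a 16-sample signal with an illustrative 4-tap filter pair
h = np.array([0.4830, 0.8365, 0.2241, -0.1294])                # low-pass taps
g = np.array([(-1) ** i * c for i, c in enumerate(h[::-1])])   # high-pass taps, one common QMF convention as in (3)
a = np.arange(16, dtype=float)
for _ in range(3):
    a, d = dwt_level(a, h, g)
```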


Thus, in order to operate on the padded input sequence, the transform matrices $\mathbf{H}_j$ and $\mathbf{G}_j$ have to be modified by appending each of them with $L-2$ additional columns. The elements of the appended columns in a row of a modified transform matrix assume a zero value if all the filter coefficients already appear in the corresponding row of (1a) or (1b). Otherwise, the elements in the row are made to assume the missing values of the filter coefficients so that all the coefficients appear in that row of the modified transform matrix.

B. Reformulation of (2)

It is seen from (1a) and (1b) that, due to the decimation-by-two requirement of the DWT, the entries in the successive rows of matrices $\mathbf{H}_j$ and $\mathbf{G}_j$ and, therefore, in their modified versions, are shifted to the right by two positions. This property can be utilized to decompose the arithmetic operations in (2) into two parts so that the operations in one part can be performed simultaneously with those of the other one. For this purpose, we now decompose each of the modified transform matrices $\mathbf{H}_j$ and $\mathbf{G}_j$ by separating the even- and odd-numbered columns of each matrix into two submatrices. The resulting submatrices, taking into account the perfect reconstruction condition specified by (3), can be expressed as (5a)–(5d), in which the entries in the successive rows are shifted to the right by only one position. With this decomposition of the transform matrices, the DWT computation, as given by (2), can be reformulated as

$\mathbf{A}_j = \mathbf{H}_j^{e}\,\mathbf{A}_{j-1}^{e} + \mathbf{H}_j^{o}\,\mathbf{A}_{j-1}^{o}, \qquad \mathbf{D}_j = \mathbf{G}_j^{e}\,\mathbf{A}_{j-1}^{e} + \mathbf{G}_j^{o}\,\mathbf{A}_{j-1}^{o}$  (6)

where $\mathbf{A}_{j-1}^{e}$ and $\mathbf{A}_{j-1}^{o}$ are the two subvectors consisting of the even- and odd-numbered samples, respectively, of the padded vector of $\mathbf{A}_{j-1}$, and $\mathbf{H}_j^{e}$, $\mathbf{H}_j^{o}$, $\mathbf{G}_j^{e}$, and $\mathbf{G}_j^{o}$ are the submatrices of (5a)–(5d).

It is seen from (6) that the operations in each of the two terms are identical, and also, they can be performed independently in parallel. Furthermore, in view of the structures of the decomposed transform matrices, as given by (5a)–(5d), the filtering operation can be carried out by employing the conventional clocking mechanism used for implementing digital systems.
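The even/odd reformulation of (6) can be illustrated by extending the previous sketch; the submatrix and subvector names are again illustrative, and the split simply separates the even- and odd-numbered columns and samples.

```python
def dwt_level_even_odd(a_prev, h, g):
    """One decomposition level computed as in (6): the padded input is split
    into even- and odd-numbered samples, each half is processed by the
    corresponding submatrix of (5a)-(5d), and the partial results are summed.
    Reuses transform_matrices() from the sketch above."""
    L = len(h)
    a_pad = np.concatenate([a_prev, a_prev[:L - 2]])
    H, G = transform_matrices(h, g, len(a_prev))
    He, Ho = H[:, 0::2], H[:, 1::2]     # even- and odd-numbered columns
    Ge, Go = G[:, 0::2], G[:, 1::2]
    ae, ao = a_pad[0::2], a_pad[1::2]   # even- and odd-numbered samples
    # the two terms of (6) are independent and can be evaluated in parallel
    return He @ ae + Ho @ ao, Ge @ ae + Go @ ao
```

For any input, this returns the same result as the single-matrix computation of (2), while the two matrix-vector products can be evaluated concurrently.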

III. CHOICE OF A PIPELINE FOR THE 1-D DWT COMPUTATION

In a pipeline structure for the DWT computation, multiple stages are used to carry out the computations of the various decomposition levels of the transform. Thus, the computation corresponding to each decomposition level needs to be mapped to a stage or stages of the pipeline. In order to maximize the hardware utilization of a pipeline, the hardware resource of a stage should be proportional to the amount of the computation assigned to the stage. Since the amount of computations in successive decomposition levels of the transform gets reduced by a factor of two, two scenarios can be used for the distribution of the computations to the stages of a pipeline. In the first scenario, the decomposition levels are assigned to the stages so as to equalize the computations carried out by each stage, i.e., the hardware requirements of all the stages are kept the same. In the second scenario, the computations of the successive decomposition levels are assigned to the successive stages of a pipeline on a one-level-to-one-stage basis. Thus, in this case, the hardware requirement of the stages gets reduced by a factor of two

(5a)–(5d): the even- and odd-numbered-column submatrices of $\mathbf{H}_j$ and $\mathbf{G}_j$, in which the entries in successive rows are shifted to the right by only one position.


Fig. 1. Stage-equalized pipeline structure.

Fig. 2. One-to-one mapped pipeline structure with $I$ ($I < J$) stages.

as they perform the computations corresponding to higher level decompositions.

Fig. 1 shows a stage-equalized pipeline structure, in which the computations of all the $J$ decomposition levels are distributed equally among the stages. The process of stage equalization can be accomplished by dividing the task of a given level of decomposition equally into smaller subtasks and assigning each such subtask to a single stage, and/or by combining the tasks of more than one consecutive level of decomposition into a single task and assigning it to a single stage. Note that, generally, a division of the task would be required for low levels of decomposition and a combination of the tasks for high levels of decomposition.

In a one-to-one mapped structure, the computations of the $J$ decomposition levels are distributed exactly among $J$ stages, one level to one stage. In practical applications, a structure with fewer than $J$ stages, say $I$ stages, is used for the computation of a $J$-level DWT, as shown in Fig. 2. In this structure, the computations of the first $I-1$ levels are carried out by the stages $1, 2, \ldots, I-1$, respectively, and those of the last $J-I+1$ levels are performed recursively by the $I$th stage. The amount of hardware resources of a stage is one-half of that of its preceding one, except for the $I$th stage, which has the same size as that of the preceding stage.

The structures of Figs. 1 and 2 can be used to perform the

computations of multiple levels of decomposition. The computation of each level is performed as an $L$-tap FIR filtering operation by summing the products of the input samples and the filter coefficients, as described by (2). Generally, one multiplication-and-accumulation (MAC) cell is used to carry out one multiplication of an input sample by a coefficient, followed by one accumulation operation. In order to perform an uninterrupted $L$-tap filtering operation with easy control, one can thus use a network of $L$ basic units of such a MAC cell. Since all the decomposition levels perform $L$-tap filtering operations, it would be desirable that each decomposition level performs its filtering operation using this same type of MAC-cell network. However, in the context of the one-to-one mapped pipeline structure of Fig. 2, in which the requirement is that the hardware resource should get reduced by a factor of two from one stage to the next, the use of the same MAC-cell network for all the stages would not be possible, unless the pipeline has only two stages. In other words, the first stage performs the level-1 computation, and the second stage performs the computations corresponding to all the remaining levels recursively. In the context of a stage-equalized pipeline structure of Fig. 1, where the requirement is that all the stages should have the same hardware resource, the same MAC-cell network can be used easily for all the stages. However, in this case, the same amount of computations cannot be

Fig. 3. Pipeline structure with two stages.

assigned to all the stages that are based on the same MAC-cell network, unless, again, there are only two stages in the pipeline.

In a situation of a pipeline of more than two stages, each based on a network of MAC cells, one cannot achieve a resource-efficient architecture. Thus, for either pipeline structure, i.e., the one-to-one mapped or the stage-equalized one, a two-stage pipeline would be the best choice in terms of hardware efficiency and from the standpoint of design and implementation simplicity. Note that the two-stage version of either pipeline structure is the same, and it is shown in Fig. 3. An additional advantage of the two-stage pipeline is the design flexibility of a MAC-cell network in which the multiplication and accumulation operations can be furnished together by using logic gates. These logic gates could be arranged into more efficient arrays yielding a shorter propagation delay for the MAC-cell network. Based on the previous discussion, it seems logical to use the two-stage pipeline structure of Fig. 3 for the design and implementation of an architecture for the 1-D DWT computation. The next section is concerned specifically with a detailed design of the architecture.

IV. ARCHITECTURE DESIGN

In the previous section, we advocated a two-stage pipeline structure for the computation of the 1-D DWT. The structure, whose development is constrained by the nature of the DWT computation, is capable of optimizing the use of hardware resources. In this two-stage structure, stage 2 operates on the data produced by stage 1 and on those produced by itself, and therefore, the operations of the two stages need to be synchronized in the best possible manner [22]. In this section, we present the design of the proposed two-stage pipeline architecture, focusing on data synchronization, the details of the various components comprising the stages, and the inter- and intrastage data flow.

A. Synchronization of Stages

In order to develop a suitable synchronization scheme, consider the timing diagram for the relative operations of the two stages shown in Fig. 4, where $T_1$ and $T_2$ are the times taken individually by stages 1 and 2, respectively, to carry out their operations; $T_\alpha$ and $T_\gamma$ are the time spans during which stage 1 or 2 alone is operational; and $T_\beta$ is the overlapped time span for the two stages. Our objective is to minimize the total time $T = T_\alpha + T_\beta + T_\gamma$. Since the operation of stage 1 is independent of that of stage 2, it can continue its operation continuously until the computation of all the samples of decomposition level 1 is done. In Fig. 4, the slots shown for stage 1 correspond to the $N/2$ samples of decomposition level 1 that it has to compute. The presence of continuous slots indicates that stage 1 can continue its operation uninterruptedly without having any idle slot. Thus, the minimal possible value for $T_1$ is equal to $(N/2)\,t_c$, where $t_c$ is the time


TABLE I. NUMBERS AND SAMPLES IN THE LOWEST BOUND

required to compute one output sample. If $T_1 = (N/2)\,t_c$ and we assume that the DWT operation has to be carried out for all the $J$ levels, then the number of samples that stage 2 has to compute is $N/2 - 1$. Thus, the lowest bound for $T_2$ is $(N/2 - 1)\,t_c$. Now, by choosing a value of $T_2$ equal to its lowest bound, if one can show that stage 2 does not have any idle slot during $T_\beta$, then, indeed, not only will $T$ be minimized, but it will also achieve its lowest bound. Now, we will show that, for the proposed architecture, this is indeed possible.

Let us first determine the lowest bound on $T_\gamma$. Since the last sample of level 1, as produced by stage 1, becomes available only at the end of $T_1$, a sample at a higher level that depends on this last sample, directly or indirectly, could not possibly be computed during the time span $T_\beta$ and has therefore to be computed during $T_\gamma$. Assume the following: 1) during $T_\gamma$, we compute only those samples of levels 2 and higher which could not possibly be computed during $T_\beta$, and 2) the other output samples necessary for computing those samples have already been computed during $T_\beta$. The lowest bound on $T_\gamma$ is then $n_\gamma t_c$, where $n_\gamma$ denotes the number of such samples. Therefore, in order to compute this bound, we need to determine the value of $n_\gamma$. The last sample of level 1 is computed at the end of $T_1$. A certain number of output samples at level 2 depend on this sample; their number and indices are expressed in terms of $\lceil x \rceil$ and $\lfloor x \rfloor$, which represent the smallest integer larger than or equal to $x$ and the largest integer less than or equal to $x$, respectively. Next, at level 3, there are output samples that depend indirectly on the last level-1 sample, and, similarly, we can determine the numbers and indices of the samples at the other levels that depend indirectly on it. Table I gives the listing of the numbers and samples of levels 2 to $J$ that depend on the last level-1 sample. After adding the expressions in the third column of this table and some manipulation, it can be shown that the value of $n_\gamma$ can be obtained, as given by (7).

Fig. 4. Timing diagram for the operations of two stages.

In Fig. 4, $T_\gamma$ is chosen to be $n_\gamma t_c$. Next, we explore the possibility of developing a synchronization scheme for computing all the output samples in the context of Fig. 4 with the objective that stage 2 does not create any idle slots. In developing such a scheme, one has to take into consideration the requirement of the underlying filtering operation of the wavelet computation. This filtering operation imposes the constraint that the first output sample at level $j$ cannot be computed until $L$ samples at level $j-1$ have already been computed, and each of the subsequent samples at level $j$ cannot be computed unless two new samples at level $j-1$ have already been computed. Note that this requirement of the filtering operation imposes a constraint on the operation of stage 2 only, since stage 1 operates sequentially and unilaterally to compute the level-1 output samples only. Under this constraint, we now give the three steps of the synchronization scheme that govern the computation of the output samples at the various decomposition levels by stages 1 and 2.

Step 1) Stage 1 operates continuously to compute the level-1 output samples sequentially.

Step 2) Stage 2 starts the computation of level-2 samples beginning at the time slot $n_\gamma + 2$.

Step 3)

a) When stage 2 is computing an output sample at the lowest incomplete level: after completing the computation of the present sample at this level, stage 2 moves on to the computation of a sample at the lowest higher level if the data required for the computation of this sample have become available; otherwise, stage 2 continues with the computation of the next sample at the present level.

b) When stage 2 is computing an output sample at a level other than the lowest incomplete level: after completing the computation of the present sample, stage 2 moves its operation to the lowest incomplete level.

The rationale behind Step 3a) is that moving the operation of stage 2 to a higher level allows more data from level 1, as produced by stage 1, to become available, since the availability of the output samples of level 1 is crucial for the computation of the samples at higher levels. On the other hand, the rationale behind

(7)


Fig. 5. Synchronization scheme for a 128-point $(N = 128)$ DWT computation using a length-4 $(L = 4)$ FIR filter.

Step 3b) is that there are always more samples to be computed at lower levels than at higher levels, and therefore, more time needs to be spent in computing lower level samples.

The nature of the filtering operation, coupled with the decimation by a factor of two, requires that, in order for stage 2 to compute a level-2 sample at a given slot, stage 2 needs $L$ level-1 samples computed by stage 1 at earlier slots, of which the samples produced at the last two of these slots must not have been previously used for the computation of level-2 samples. If stage 2 can meet this requirement during the entire time span $T_\beta$, then it can continue its operation uninterruptedly without creating an idle slot. We will now show that, based on the steps presented before, stage 2 would indeed be able to meet this requirement. For this purpose, consider an algorithm, namely, Algorithm 1, which synchronizes the operation of stage 2 during the time span $T_\beta$. In this algorithm, we have made use of two counters, namely, $c_2$ and $c_h$. The counters $c_2$ and $c_h$ represent the total number of samples having been computed at level 2 and at the levels higher than 2, respectively, at a particular instant of stage-2 operation. Note that, at the time that stage 2 starts its operation, stage 1 has already produced $n_\gamma + 1$ level-1 samples. Since a length-$L$ filtering operation requires $L$ input samples and $n_\gamma + 1 \ge L$, stage 2 can not only start the computation of level-2 samples but also continue the computation of the succeeding level-2 samples at least for some time. Since the computation of each level-2 sample makes use of two new level-1 samples during the time in which only one level-1 sample is produced by stage 1, the number of available level-1 samples is reduced by one after the computation of each level-2 sample. However, since stage 2, following Step 3) of the synchronization scheme, is allowed to compute the samples at levels higher than 2 without making use of the samples from level 1, the reservoir of level-1 samples is increased by one after the computation of one such higher level sample. Therefore, at a particular time, there are $n_\gamma + 1 - c_2 + c_h$ level-1 samples available to be used by stage 2 for the computation of the succeeding level-2 samples. Since $c_2$ increases faster than $c_h$, $c_2 - c_h$ reaches its maximum value at the time slot just before the end of time span $T_\beta$. At this time slot

(8a)

(8b)

Thus, using (7), (8a), and (8b), the lowest bound on the number of available level-1 samples during time span $T_\beta$ is calculated as given by (9). Since, in practice, the filter length is such that $L \ll N$, this inequality can be simplified as given by (10), with separate expressions for the cases of $L$ even and $L$ odd.

Thus, the lowest bound on the number of available level-1 samples is greater than or equal to two. Therefore, during the entire course of time span $T_\beta$, there will always exist a sufficient number of samples available to stage 2 for it to continue its level-2 computation in the framework of Algorithm 1. In other words, during time span $T_\beta$, stage 2 would never have to cease its operation for the lack of availability of at least two new level-1 samples, i.e., the block in Algorithm 1 that introduces a unit delay will never be used during the execution of the algorithm.

Algorithm 1: Synchronizing the operation of stage 2 during $T_\beta$

Initialize $c_2 = 0$, $c_h = 0$
While the time span $T_\beta$ has not elapsed
    If (at least two new samples are available from level 1) then
        Compute a new sample at level 2 and increment $c_2$
        If (enough data are available from the lowest incomplete level higher than 2) then
            Compute a new sample at that level and increment $c_h$
        End if
    Else
        Unit delay
    End if
End while
End algorithm
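The following is a small behavioral model, in Python, of the scheduling described by Steps 1) to 3); it is a sketch under stated assumptions rather than the actual control logic. The data-availability rule (a length-$L$ filter needs $L$ samples of the previous level for its first output, two new samples for each further output, and periodic padding at the end of a level) is an interpretation of the text, and the function name and counters are illustrative.

```python
def simulate_schedule(N=128, L=4, start_slot=15):
    """Behavioral model of the two-stage schedule (Steps 1 to 3).

    Stage 1 produces one level-1 output sample per slot.  Stage 2, from
    start_slot onward, produces one output sample per slot at a level chosen
    by the scheme: it alternates between the lowest incomplete level and,
    whenever its data are ready, the lowest higher level.  Returns the slot
    in which the last sample is produced and the number of idle slots
    stage 2 encountered."""
    J = N.bit_length() - 1                         # number of decomposition levels
    total = {j: N >> j for j in range(1, J + 1)}   # output samples per level
    done = {j: 0 for j in range(1, J + 1)}

    def ready(j):
        """Can the next sample of level j be computed from level j-1 data?"""
        if done[j] >= total[j]:
            return False
        need = min(L + 2 * done[j], total[j - 1])  # first needs L, then 2 new; padding at the end
        return done[j - 1] >= need

    slot, idle, at_lowest = 0, 0, True
    while done[J] < total[J] and slot < 4 * N:
        slot += 1
        if slot >= start_slot:                     # stage 2 is active
            lowest = min(j for j in range(2, J + 1) if done[j] < total[j])
            if at_lowest:                          # Step 3a: try a higher level first
                higher = [j for j in range(lowest + 1, J + 1) if ready(j)]
                pick = higher[0] if higher else (lowest if ready(lowest) else None)
            else:                                  # Step 3b: return to the lowest level
                pick = lowest if ready(lowest) else None
            if pick is None:
                idle += 1
            else:
                done[pick] += 1
                at_lowest = (pick == lowest)
        if slot <= N // 2:                         # stage 1 produces a level-1 sample
            done[1] += 1
    return slot, idle

# the 128-point, four-tap example of Fig. 5: stage 2 starts at slot 15 and,
# with no idle slots, finishes 13 slots after stage 1, i.e., at slot 64 + 13 = 77
print(simulate_schedule(128, 4, 15))               # -> (77, 0)
```

For the example discussed next, this model has stage 2 starting at slot 15 and finishing at slot 77 without idle slots, in agreement with $n_\gamma = 13$.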

We now consider an example to illustrate the synchronization scheme that has been presented earlier. For this purpose, we consider a 128-point $(N = 128)$ DWT computation using four-tap $(L = 4)$ FIR filters. The synchronized operation of the two stages is shown in Fig. 5, in which each rectangle represents a time slot during which a low-pass output sample is produced. Stage 1 starts the computation of the first level-1 output sample at slot 1 and continues its operation until slot 64, when the computation of the 64th level-1 output sample is completed. Equation (7) can be used to obtain the value of $n_\gamma$ as 13. Thus, at slot number $n_\gamma + 2 = 15$, stage 2 starts the computation of the first level-2 output sample. At this point, the reservoir of level-1


Fig. 6. Block diagram of the two-stage architecture.

available samples contains 14 samples. Note that the number of samples in this reservoir decreases by one as one new level-2 sample is computed and increases by one as one sample at a level higher than 2 is computed. However, the general trend is a decline in the number of available level-1 samples, from 14 samples at slot 15 to 4 samples at slot 65, when the computations of all the level-1 samples are completed. At slot 66, an output sample at level 4 is computed, since the required samples from level 3 have become available for its computation. After this computation, stage 2 returns its operation to the computation of the last level-2 output sample. Note that, for the computation of this last level-2 sample, two padded samples are required, since, at this time, no level-1 output sample is unused. Beyond this point, all the remaining samples from levels 3 to 7 are computed using Step 3) of the synchronization scheme.

B. Design of Stages

Since, in the stage-equalized architecture, the two stages together perform the DWT computation, with the amount and type of computations of the individual stages being the same, each of the two stages can use identical processing units. However, the control units to be employed by the stages have to be different, since, as seen from Algorithm 1 of the previous section, the operation of stage 1 is autonomous, whereas stage 2 must always synchronize its operation with that of stage 1. Based on this algorithm, the design of the control unit used by stage 2 is somewhat more involved than that of the control unit used by stage 1. Obviously, in order to synchronize the operation of stage 2 with that of stage 1, a buffer has to be used to store the low-pass output samples from the two stages. Fig. 6 shows a block diagram incorporating all these requirements for the design of the proposed architecture. The two processing units are referred to as PU$_1$ in stage 1 and PU$_2$ in stage 2. Note that, in this architecture, the high-pass samples from PU$_1$ and PU$_2$ are output directly.

In each stage, the processing unit, by employing an $L$-MAC-cell network, performs an $L$-tap filtering operation and, at each clock cycle, generates a total of $L$ product terms and their sum. Since, normally, the interval between two consecutive input samples must not be smaller than the delay of a MAC cell, the maximal allowable data rate of the input to the processing unit would be determined by this delay. However, if the $L$-MAC-cell network is organized into subnetworks operating in parallel, the input samples can be applied to these subnetworks in an interleaved manner. The interval between two consecutive input samples can thus be shortened by a factor equal to the number of subnetworks. To this end, considering the problem at hand, in which a two-subband filtering

Fig. 7. Block diagram of the processing unit for $L$-tap filtering computation, assuming $L$ to be an even number.

operation is performed and, for each consecutive decomposition level, the input data are decimated by a factor of two, the $L$ MAC cells can be conveniently organized into a pair of even and odd filter blocks. These even and odd filter blocks, which receive the even- and odd-numbered input samples, respectively, employ $(L/2)$-MAC-cell networks, and each produces only $L/2$ product terms and their sums. The partial sums from the two networks are then added in an accumulation block by using a carry propagation adder (CPA), as shown in Fig. 7. Since the delay of the accumulation block is comparable to that of the $(L/2)$-MAC-cell network, it is useful to pipeline them for parallel computation. Since the high-pass operation differs from the low-pass operation only in reversing the sign of the even-numbered coefficients, the proposed organization of the processing unit allows the filter blocks to use the same filter coefficients by simply introducing a sign-inversion block into the even filter block, as shown in Fig. 7.

As discussed earlier and seen from Fig. 6, all the output data must be synchronized in accordance with Algorithm 1. This synchronization process is facilitated by introducing in stage 2 a buffer, which stores the output data from the two stages and provides the input data to stage 2. According to Step 2) of the synchronization scheme, during time span $T_\alpha$, the number of samples that need to be stored for the operation of stage 2 increases until stage 2 starts its operation. However, this number is not exceeded during time spans $T_\beta$ and $T_\gamma$, since the number of samples newly produced by stages 1 and 2 is equal to or less than that consumed by stage 2. Thus, the minimum capacity of the buffer for the operation of stage 2 is fixed by this number of registers. Since the number of output samples at a level that would be needed to compute an output sample at the next higher level will not exceed the filter length $L$, the buffer is divided into channels, as shown in Fig. 8. Each channel consists of a fixed number of shift registers, except the last channel, which has only the remaining number of registers. Channel 1 is used for storing only the level-1 samples produced by PU$_1$, whereas channel $j$ is used for the level-$j$ samples during $T_\beta$ and $T_\gamma$ and would also be used for storing the level-1 samples during $T_\alpha$. Note that channel 2 is also chosen to store


Fig. 8. Structure of the buffer.

Fig. 9. (a) Formation of an array of partial products. (b) Two types of bitwise adders. (c) A layered organization of bitwise addition using the two modules in (b).

the samples of the remaining levels, starting at the time slot when all the level-2 samples have been consumed.

C. Design of the $(L/2)$-MAC-Cell Network

In the processing unit shown in Fig. 7, each physical link from a given input bit to an output bit of an $(L/2)$-MAC-cell network gives rise to a channel or data path having a delay that depends on the number and the types of operations being carried out along that path [23]. Thus, it is crucial to aim at achieving the shortest critical data path when designing an $(L/2)$-MAC-cell network for our architecture. In order to have a better appreciation of the operations of an $(L/2)$-MAC-cell network, let us consider an example of the filtering operation of one such network with $L = 4$. Let us assume that the input samples and the filter coefficients have wordlengths of six and three, respectively. Each MAC-cell network then has six partial products, with a total of 36 bits, which can be produced in parallel, as shown in Fig. 9(a). Our objective is to design a MAC-cell network in which the bits of the partial products are accumulated in such a way as to optimize the delays of the data paths from the individual bits of the partial products to the output bits of the MAC-cell network.

Even though all the bits of the partial products, as given by the array shown in Fig. 9(a), are available simultaneously, they cannot be used in parallel to produce simultaneously all the bits of an output sample. The reason for this is that the processes of accumulation of the bits in each column of the array of the partial products have to be carried out bitwise, and at the same time, one has to take care of the propagation of the carry bits. In other words, the accumulation of the partial products has to be carried out in a certain sequence. Thus, the task of accumulation can be divided into a sequence of layers such that the operations of the first layer depend only on the partial-product bits and those of the succeeding layers depend on the partial-product bits not yet used, as well as on the bits of the results of the preceding layers. In order to meet our goal of minimizing the critical path from a partial-product bit to a bit of the output sample, we can organize the layers of the MAC-cell network that carry out the accumulation of the partial products based on the following guiding principle: minimize the number of layers while minimizing the delay of each layer. The number of layers can be minimized by assigning to each layer the maximum number of such tasks that can be performed independently of each other in parallel. The accumulation task in each layer can be performed by using full-adder (3:2) and double-adder (2×2:3) modules, as shown in Fig. 9(b). The two types of modules are chosen for the following reasons: 1) their delays are about the same, so that the delay of any layer can be made equal to this delay irrespective of whether the layer uses one or two types of modules, and 2) the two modules can be used together in such a way that they produce a smaller number of propagating carry bits, and therefore, their combined use helps in reducing the number of layers.

With the choice of the combination of full and double adders, the first layer can be formed by using as many modules as necessary, with the maximum number of partial-product bits being utilized as 3- or 4-bit inputs to the respective modules. Scanning the partial-product array from right to left, a maximum number of bits of this array are first used as inputs to as many full-adder modules as necessary, since, in comparison to a double adder, this module is more efficient in consuming the bits of the input array. In this process, whenever in a column 1) only two bits of the partial-product array are left unused, these two bits, along with a pair of bits from the neighboring left column of the array, are used as inputs to a double-adder module, and whenever in a column 2) only one bit of the partial-product array is left unused, this bit is used in the next layer for accumulation. Note that the case of using a double adder also helps in propagating two carry bits, i.e., one internal and the other external to the adder, to the left within the same time delay as that of the full adder. The next layer can then be formed again by using as many modules as necessary, with inputs from the partial-product bits still unused and the sum and carry output bits from the previous layers being utilized in a carry-save manner. This process is continued until, after the last layer, all the bits of an output sample are produced.

Based on the principles and the procedure enunciated previously, we can now give formally an algorithm, namely, Algorithm 2, which carries out the organization of a MAC-cell network, given the input samples and filter coefficients. Fig. 9(c) shows an illustration of the organization of the adder modules into three layers of a MAC-cell network for the example considered earlier. It is seen from this figure that the delay


of the critical path is equal to that of three full adders for this particular example.

Algorithm 2: Organizing the bitwise modules of the MAC-cell network

Initialize an array of partial-product bits from the $n$-bit samples and $m$-bit filter coefficients
While there remain columns with more than two unprocessed bits
    Initialize the input array of the current layer with the bits not yet consumed (the partial-product bits still unused together with the sum and carry bits produced by the preceding layers)
    For every column, scanned from right to left
        While at least three unused bits remain in the column
            Assign three of these bits as inputs to a full adder; append its sum bit to the same column and its carry bit to the column on the left of the input array of the next layer
        End while
        If exactly two unused bits remain in the column and a pair of unused bits is available in the neighboring left column then
            Assign these 2×2 bits as inputs to a double adder; append its two sum bits to the respective columns and its carry bit to the column on the left of the input array of the next layer
        Else
            Carry forward the unused bits to the input array of the next layer
        End if
    End for
End while
End algorithm

Using Algorithm 2, a generalized structure for the MAC-cell network, as shown in Fig. 10, can be generated with $n$-bit samples and $m$-bit filter coefficients as inputs to the network. The partial-product generation layer produces a total of $(L/2)\,nm$ partial-product bits. The accumulations of these partial-product bits are carried out successively by a set of layers of adder modules. A variable-size array is used as the input to each layer. This array initially contains only the partial-product bits, and for successive layers, it contains the sum and carry bits from the previous layers and the partial-product bits still unused. An input to a layer that consists of a partial-product bit or a sum bit is shown in the figure by an arrow going down vertically into the layer, whereas an input that consists of a carry bit is shown by an

Fig. 10. Structure of the $(L/2)$-MAC-cell network.

arrow going down leftward. The MAC-cell network has the minimum possible number of layers, obtained with the choice of using the maximum number of full adders, followed by, if necessary, the double adders in each layer. The number of adder modules used in each layer progressively decreases from the first layer to the last. The output bits of the MAC-cell network are then used by the accumulation block of the processing unit to produce the final sum. In the aforementioned design of the MAC-cell network, optimization of its critical path is carried out by incorporating and arranging the multiply and accumulate operations into multiple layers. This leads to a network that has a critical path with a smaller delay than that of the MAC cell used in DSP processors, in which the delay of the critical path is simply the sum of the delays associated with a multiplier and an accumulator. The critical path of the MAC-cell network could be shortened further by encoding the input data to the MAC-cell network using Booth encoders. In this way, the delay of the MAC-cell network is reduced by making a smaller number of carry bits propagate through the network. However, such an improvement is achieved at the expense of additional hardware resources for the encoders.
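To make the layering concrete, the following is a small Python model of the reduction performed by Algorithm 2 on the column heights of the partial-product array. The rules it encodes (full adders first, a double adder when exactly two bits remain in a column and the left neighbour can supply two more, reduction stopping when two rows remain for the CPA of the accumulation block) follow the interpretation given above; the representation and names are illustrative only.

```python
def reduce_layer(cols):
    """One layer of the reduction applied to the column heights `cols`
    (cols[w] = number of bits of weight 2**w still to be accumulated).
    Full adders (3:2) are used first in each column; a double adder (2x2:3)
    is used when exactly two bits remain and the left neighbour can spare two;
    single leftover bits pass through to the next layer."""
    rem = cols + [0, 0]
    nxt = [0] * (len(cols) + 2)
    for w in range(len(cols)):
        while rem[w] >= 3:                    # full adder: sum stays, carry moves left
            rem[w] -= 3
            nxt[w] += 1
            nxt[w + 1] += 1
        if rem[w] == 2 and rem[w + 1] >= 2:   # double adder spanning two columns
            rem[w] -= 2
            rem[w + 1] -= 2
            nxt[w] += 1
            nxt[w + 1] += 1
            nxt[w + 2] += 1
        nxt[w] += rem[w]                      # 0, 1, or 2 bits carried forward
        rem[w] = 0
    return nxt

def count_layers(n_bits, m_bits, macs):
    """Layers needed to reduce the partial-product array of `macs` MAC cells
    (n-bit samples times m-bit coefficients) down to two rows for the CPA."""
    cols = [macs * sum(1 for i in range(n_bits) for k in range(m_bits) if i + k == w)
            for w in range(n_bits + m_bits - 1)]
    layers = 0
    while max(cols) > 2:
        cols = reduce_layer(cols)
        layers += 1
    return layers

# the example of Fig. 9: two MAC cells (L = 4), 6-bit samples, 3-bit coefficients;
# the 36 partial-product bits collapse in three layers, i.e., a critical path of
# roughly three full-adder delays
print(count_layers(6, 3, 2))                  # -> 3
```

For the example of Fig. 9, this model needs three layers, consistent with the critical path of about three full-adder delays noted earlier.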

V. PERFORMANCE EVALUATION AND FPGA IMPLEMENTATION

In order to evaluate the performance of the architecture resulting from the proposed scheme, we need to make use of certain metrics that characterize the architecture in terms of the hardware resources used and the computation time. The hardware resources used for the filtering operation are measured by the number of multipliers and the number of adders, and those used for the memory space and pipeline latches are measured by the number of registers. The computation time, in general, is technology dependent. However, a metric that is independent of the technology used but can be utilized to determine the computation time $T$ is the number of clock cycles $N_{\mathrm{clk}}$ consumed from the instant the first sample is input to the instant the last sample is output, assuming a given clock-cycle period, for example, unity, as the latency of a MAC cell.

For a $J$-level DWT computation of an $N$-sample sequence using $L$-tap filters, the expressions for the metrics mentioned before for the various architectures are summarized in Table II. Assuming that the number of samples $N$ is much larger than $L$, it is seen from the table that, compared to the architecture of [17], all the other architectures, including the proposed one, require


TABLE II. COMPARISON OF VARIOUS ARCHITECTURES

approximately twice the number of clock cycles, except the architecture of [14], which requires four times as many clock cycles. This performance of [17] is achieved by utilizing hardware resources of adders and multipliers that are four times those required by the architecture of [14] and twice those required by any of the other architectures. However, if the value of $L$ cannot be neglected in comparison to that of $N$, then the values of $N$, $L$, and $J$ should be taken into consideration while comparing the architectures in terms of the number of clock cycles. In this regard, only for the proposed architecture and the architecture of [18] is the number of clock cycles independent of the filter length, with the proposed architecture giving the lowest value for a given $N$. The proposed architecture requires a number of registers that is at least 20% less than that required by any of the other architectures when the filter length is large. It should be noted that approximately 20% of the hardware resource comprises registers.

Since the area of the circuit for the DWT computation depends on the filter length $L$ and the total number of samples $N$, it would be useful to have a measure of the area of the circuit as a function of $L$ and $N$. Only the proposed architecture and those of [13] and [18] are used for this comparison, since the numbers of multipliers and the numbers of adders for these architectures are the same. Thus, any difference in the areas of the three architectures can be accounted for mainly by the difference in the number of registers used by each of the architectures. The numbers of registers for the architectures of [13] and [18] are as given in Table II. However, the number of registers for the proposed architecture depends not only directly on the filter length $L$ but also indirectly on $N$ and $L$ through the parameter $n_\gamma$. These dependences are intuitively obvious from the fact that, as the filter length or the number of samples increases, the starting point of stage 2 gets more delayed; in other words, $n_\gamma$ is increased. It is seen from Fig. 11 that the dependence of $n_\gamma$ on $N$ is relatively much more nonlinear than its dependence on $L$. The results of Fig. 11 can be used to obtain a measure of the area of the proposed architecture as a function of $L$ and $N$. We estimate the areas of the proposed architecture along with those of the other two architectures under the assumption that the ratio of the areas of one multiplier, one adder, and one register is 12:3:1. The plots of the estimates of the areas as functions of $L$ and $N$ are shown in

Fig. 11. Estimated values of $n_\gamma$. (a) $n_\gamma$ versus $N$. (b) $n_\gamma$ versus $L$.

Fig. 12. Estimated areas of the three architectures. (a) Area versus $N$. (b) Area versus $L$.

Fig. 12. It is obvious from this figure that the area of the proposed architecture is, in general, lower than those of the other two architectures. The lower area of the proposed architecture can be attributed mainly to the presence of the parameter $n_\gamma$ in its expression for the number of registers. Recall that $n_\gamma$ is a parameter that we minimized in the design of the proposed architecture in order to maximize the parallelism between the two stages, and a lower value of $n_\gamma$, in turn, results in a smaller number of registers required to store the results of the operations of stage 1 before the operation of stage 2 starts.

Considering the clock-cycle period $\tau$ as the delay of the MAC cell used by an architecture, the computation time can be obtained as $T = N_{\mathrm{clk}}\,\tau$. Note that the reciprocal of $\tau$ is simply the throughput of the architecture, assuming that one sample is input during each clock cycle. Using $T$, one can determine the area-time complexity $AT$, where the area $A$ mainly comprises the areas of the multipliers, adders, and registers. In order to evaluate the performance of the architectures in terms of $\tau$ and $AT$, we consider an example of designing a circuit for the DWT computation for a given sample size and number of decomposition levels. We use Daubechies six-tap $(L = 6)$ filters as the analysis filters, and the sample and filter-coefficient wordlengths are taken as 8 bits. The CPA of the processing unit utilizes the structure of a combination of carry-skip and carry-select adders [24]. The registers are designed using D-type flip-flops (DFFs). All the modules, such as the partial-product generator, DFF, full adder, double adder, multiplexer, and demultiplexer, used in the proposed architecture are designed using 0.35-μm CMOS technology and simulated using HSpice to obtain the delays. Note that these same modules are also used to evaluate the performance of all the other architectures. Table III shows the values of the clock-cycle period and the area-time complexity for the various architectures. It is seen from this table that the proposed architecture has a significantly smaller clock-cycle period compared to that


TABLE III
EVALUATION OF VARIOUS ARCHITECTURES

of all the other architectures. Among all the architectures considered, the proposed architecture has the highest throughput of 138 MB/s and the lowest area–time complexity.
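The timing metrics used in this comparison can be illustrated with a small numerical sketch; the 7.2-ns clock period matches the estimate quoted later for the proposed design, while the cycle count and the relative area are placeholder values, not the entries of Table III.

```python
# Illustrative computation of the metrics defined above (not the Table III data).
# Computation time T = (number of clock cycles) x (clock-cycle period),
# where the clock-cycle period is taken as the delay of one MAC cell.
# Throughput is the reciprocal of the clock-cycle period when one sample
# is input per cycle, and the area-time complexity is area x T.

clock_period_ns = 7.2       # estimated MAC-cell delay for the proposed design
num_clock_cycles = 1024     # hypothetical number of cycles for the whole computation
relative_area = 500         # hypothetical area in register-area units

computation_time_ns = num_clock_cycles * clock_period_ns
throughput_msamples = 1e3 / clock_period_ns          # Msamples/s for one sample per cycle
area_time = relative_area * computation_time_ns

print(f"T  = {computation_time_ns:.0f} ns")
print(f"Throughput = {throughput_msamples:.0f} Msamples/s")   # about 139 for a 7.2-ns period
print(f"AT = {area_time:.0f} register-units x ns")
```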

In order to estimate the power consumption of the proposed architecture, an example of the proposed architecture is constructed for a seven-level DWT computation of 8-bit samples using six-tap filters and simulated at a clock frequency of 138 MHz using Synopsys Power Compiler. The resulting power-consumption values are 154.2 and 67.6 mW using the 0.35-μm and 0.18-μm technologies, respectively.

In order to have a fair comparison of the power-consumption performance of different architectures, the circuit complexities and the technologies used for the circuit design of the architectures under consideration must be the same. In this regard, estimates of the power consumption for the architectures listed in Table III are either unavailable or, if available, the underlying architectures have been designed with substantial differences in circuit complexity and process technology. Despite this difficulty in carrying out a fair comparison of the power consumption of the architectures, we compare the estimated power consumption of the proposed architecture with that given in [25]. The architecture in [25] is also a pipeline architecture that uses the same filter core as that used in [19] of Table III. In [25], an example of the architecture using 9/3 filters and 9-bit samples has been constructed and simulated for operation at a 100-MHz clock frequency using 0.35-μm technology. The resulting power-consumption figure is 325 mW. This value is more than twice the value of 154.2 mW obtained for the example of the proposed architecture in the 0.35-μm technology, which is constructed by employing six-tap filters operating on 8-bit samples at a 138-MHz clock frequency.

In order to verify the estimated results for the example of the DWT computation considered above, an implementation of the circuit is carried out on an FPGA. Verilog is used for the hardware description and Xilinx ISE 8.2i for the synthesis of the circuit on a Virtex-II Pro XC2VP7-7 board. The FPGA chip consists of a 36 × 36 array with 11 088 logic cells, and it is capable of operating at a clock frequency of up to 400 MHz. The implementation is evaluated with respect to the clock period (throughput), measured as the delay of the critical path of the MAC-cell network, and the resource utilization (area), measured as the numbers of configurable-logic-block slices, DFFs, lookup tables, and input/output blocks. The resources used by the implementation are listed in Table IV. The circuit is found to perform well with a clock period as short as 8.7 ns, a value that is reasonably close to the estimated value of 7.2 ns. The power consumption of the FPGA chip on which the designed circuit is implemented

TABLE IV
RESOURCES USED IN THE FPGA DEVICE

is measured to be 105 mW. Thus, the simulated value of 67.6 mW is reasonably realistic for the power consumption of the circuit realizing the proposed architecture, considering that the measured value also includes the power dissipated by the unused slices in the FPGA.

VI. CONCLUSION

In this paper, a scheme for the design of a pipeline architecture for real-time computation of the 1-D DWT has been presented. The objective has been to achieve a low computation time by maximizing the operational frequency and minimizing the number of clock cycles required for the DWT computation, which, in turn, have been realized by developing a scheme for enhanced inter- and intrastage parallelisms for the pipeline architecture.

A study has been undertaken, which suggests that, in view of the nature of the DWT computation, it is most efficient to map the overall task of the DWT computation to only two pipeline stages, i.e., one for performing the task of the level-1 DWT computation and the other for performing that of all the remaining decomposition levels. In view of the fact that the amount and nature of the computation performed by the two stages are the same, their internal designs ought to be the same. Two main ideas have been employed for the internal design of each stage in order to enhance the intrastage parallelism. The first idea is to decompose the filtering operation into two subtasks that operate independently on the even- and odd-numbered input samples, respectively; a small illustrative sketch of this decomposition follows this paragraph. This idea stems from the fact that the DWT computation is a two-subband filtering operation and that, for each consecutive decomposition level, the input data are decimated by a factor of two. Each subtask of the filtering operation is performed by a MAC-cell network, which is essentially a 2-D array of bitwise adders. The second idea employed for enhancing the intrastage parallelism is to organize this array in a way that minimizes the delay of the critical path from a partial-product input bit to a bit of an output sample through the array. In this paper, this has been accomplished by minimizing the number of layers of the array while minimizing the delay of each layer.
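The even/odd decomposition summarized above can be checked with a short numerical sketch; it models only the filter-and-decimate arithmetic of one subband, not the MAC-cell hardware, and the six-tap coefficients used are approximate Daubechies values included purely for the verification.

```python
import numpy as np

def subband_direct(x, h):
    """Reference: low- or high-pass filtering followed by decimation by two."""
    return np.convolve(x, h)[::2]

def subband_even_odd(x, h):
    """The same decimated outputs obtained from two independent subtasks,
    one operating on the even-indexed and one on the odd-indexed samples."""
    x_even, x_odd = x[0::2], x[1::2]
    h_even, h_odd = h[0::2], h[1::2]
    y_even = np.convolve(x_even, h_even)   # subtask 1: even samples
    y_odd = np.convolve(x_odd, h_odd)      # subtask 2: odd samples
    n_out = (len(x) + len(h)) // 2         # number of decimated outputs
    out = np.zeros(n_out)
    out[:len(y_even)] += y_even
    out[1:1 + len(y_odd)] += y_odd         # odd-sample subtask lags by one output
    return out

x = np.random.randn(64)                      # arbitrary test signal
h = np.array([0.33267, 0.80689, 0.45988,     # approximate Daubechies six-tap
              -0.13501, -0.08544, 0.03523])  # low-pass coefficients
assert np.allclose(subband_direct(x, h), subband_even_odd(x, h))
```

Because the two subtasks share no input data, they can be carried out in parallel, which is the property that the MAC-cell networks of each stage exploit.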

In order to assess the effectiveness of the proposed scheme, a pipeline architecture has been designed using this scheme and simulated. The simulation results have shown that the architecture designed based on the proposed scheme requires the smallest number of clock cycles to compute the output samples and a reduction of at least 30% in the period of the clock cycle in comparison with those required by the other architectures with a comparable hardware requirement. An FPGA implementation of the designed architecture has been carried out,


demonstrating the effectiveness of the proposed scheme for designing efficient and realizable architectures for the DWT computation. Finally, it should be pointed out that the principle of maximizing the inter- and intrastage parallelisms presented in this paper for the design of an architecture for the 1-D DWT computation is extendable to that for the 2-D DWT computation.

REFERENCES

[1] S. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.

[2] J. Chilo and T. Lindblad, “Hardware implementation of 1D wavelet transform on an FPGA for infrasound signal classification,” IEEE Trans. Nucl. Sci., vol. 55, no. 1, pp. 9–13, Feb. 2008.

[3] S. Cheng, C. Tseng, and M. Cole, “Efficient and effective VLSI architecture for a wavelet-based broadband sonar signal detection system,” in Proc. IEEE 14th ICECS, Marrakech, Morocco, Dec. 2007, pp. 593–596.

[4] K. G. Oweiss, A. Mason, Y. Suhail, A. M. Kamboh, and K. E. Thomson, “A scalable wavelet transform VLSI architecture for real-time signal processing in high-density intra-cortical implants,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 6, pp. 1266–1278, Jun. 2007.

[5] C. Chakrabarti, M. Vishwanath, and R. M. Owens, “Architectures for wavelet transforms: A survey,” J. VLSI Signal Process., vol. 14, no. 2, pp. 171–192, Nov. 1996.

[6] C. Huang, P. Tseng, and L. Chen, “Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform,” IEEE Trans. Signal Process., vol. 53, no. 4, pp. 1575–1586, Apr. 2005.

[7] M. Martina and G. Masera, “Multiplierless, folded 9/7–5/3 wavelet VLSI architecture,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 9, pp. 770–774, Sep. 2007.

[8] A. Acharyya, K. Maharatna, B. M. Al-Hashimi, and S. R. Gunn, “Memory reduction methodology for distributed-arithmetic-based DWT/IDWT exploiting data symmetry,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 4, pp. 285–289, Apr. 2009.

[9] K. A. Kotteri, S. Barua, A. E. Bell, and J. E. Carletta, “A comparison of hardware implementations of the biorthogonal 9/7 DWT: Convolution versus lifting,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 52, no. 5, pp. 256–260, May 2005.

[10] C. Wang and W. S. Gan, “Efficient VLSI architecture for lifting-based discrete wavelet packet transform,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 5, pp. 422–426, May 2007.

[11] G. Shi, W. Liu, L. Zhang, and F. Li, “An efficient folded architecture for lifting-based discrete wavelet transform,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 4, pp. 290–294, Apr. 2009.

[12] A. S. Lewis and G. Knowles, “VLSI architecture for 2D Daubechies wavelet transform without multipliers,” Electron. Lett., vol. 27, no. 2, pp. 171–173, Jan. 1991.

[13] C. Chakrabarti and M. Vishwanath, “Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mapping on SIMD array computers,” IEEE Trans. Signal Process., vol. 43, no. 3, pp. 759–771, Mar. 1995.

[14] A. Grzesczak, M. K. Mandal, and S. Panchanathan, “VLSI implementation of discrete wavelet transform,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 4, no. 4, pp. 421–433, Dec. 1996.

[15] C. Cheng and K. K. Parhi, “High-speed VLSI implementation of 2-D discrete wavelet transform,” IEEE Trans. Signal Process., vol. 56, no. 1, pp. 393–403, Jan. 2008.

[16] S. S. Nayak, “Bit-level systolic implementation of 1D and 2D discrete wavelet transform,” Proc. Inst. Elect. Eng.—Circuits Devices Syst., vol. 152, no. 1, pp. 25–32, Feb. 2005.

[17] F. Marino, D. Guevorkian, and J. Astola, “Highly efficient high-speed/low-power architectures for 1-D discrete wavelet transform,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 47, no. 12, pp. 1492–1502, Dec. 2000.

[18] T. Park, “Efficient VLSI architecture for one-dimensional discrete wavelet transform using a scalable data recorder unit,” in Proc. ITC-CSCC, Phuket, Thailand, Jul. 2002, pp. 353–356.

[19] S. Masud and J. V. McCanny, “Reusable silicon IP cores for discrete wavelet transform applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 6, pp. 1114–1124, Jun. 2004.

[20] M. Vishwanath, “The recursive pyramid algorithm for the discrete wavelet transform,” IEEE Trans. Signal Process., vol. 42, no. 3, pp. 673–677, Mar. 1994.

[21] M. Ferretti and D. Rizzo, “Handling borders in systolic architectures for the 1-D discrete wavelet transform for perfect reconstruction,” IEEE Trans. Signal Process., vol. 48, no. 5, pp. 1365–1378, May 2000.

[22] C. Zhang, C. Wang, and M. O. Ahmad, “An efficient buffer-based architecture for on-line computation of 1-D discrete wavelet transform,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Montreal, QC, Canada, May 2004, vol. 5, pp. 201–204.

[23] C. Zhang, C. Wang, and M. O. Ahmad, “A VLSI architecture for a high-speed computation of the 1D discrete wavelet transform,” in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, May 2005, vol. 2, pp. 1461–1464.

[24] A. Satoh, N. Ooba, K. Takano, and E. D’Avignon, “High-speed MARS hardware,” in Proc. 3rd AES Conf., New York, Apr. 2000, pp. 305–316.

[25] S. Masud and J. V. McCanny, “Rapid design of biorthogonal wavelet transforms,” Proc. Inst. Elect. Eng.—Circuits Devices Syst., vol. 147, no. 5, pp. 293–296, Oct. 2000.

Chengjun Zhang received the B.S. and M.S. degrees in physics from Nanjing University, Nanjing, China, in 1994 and 1997, respectively. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada.

His research interests include signal processing, architecture design, and VLSI implementation of digital systems.

Chunyan Wang (SM’04) received the B.Eng. degree in electronics from Shanghai Jiao Tong University, Shanghai, China, and the M.Eng. and Ph.D. degrees from Université Paris-Sud, Paris, France.

In 1997, she joined Concordia University, Montreal, QC, Canada, as an Assistant Professor, where she is currently an Associate Professor of Electrical and Computer Engineering. Her current research areas are low-power analog- and mixed-signal VLSI design, CMOS sensor integration, and VLSI implementation of digital signal processing systems.

M. Omair Ahmad (S’69–M’78–SM’83–F’01) received the B.Eng. degree in electrical engineering from Sir George Williams University, Montreal, QC, Canada, and the Ph.D. degree in electrical engineering from Concordia University, Montreal.

From 1978 to 1979, he was a Member of the Faculty with the New York University College, Buffalo. In September 1979, he joined the faculty of Concordia University as an Assistant Professor of computer science. Subsequently, he joined the Department of Electrical and Computer Engineering, Concordia University, where he was the Chair of the department from June 2002 to May 2005 and is currently a Professor. He holds the Concordia University Research Chair (Tier I) in Multimedia Signal Processing. He has published extensively in the area of signal processing and is the holder of four patents. His current research interests include the areas of multidimensional filter design; speech, image, and video processing; nonlinear signal processing; communication digital signal processing; artificial neural networks; and VLSI circuits for signal processing. He was a Founding Researcher with Micronet from its inception in 1990 as a Canadian Network of Centers of Excellence until its expiration in 2004. Previously, he was an Examiner of the Order of Engineers of Quebec.


Dr. Ahmad was an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—PART I: FUNDAMENTAL THEORY AND APPLICATIONS from June 1999 to December 2001. He was the Local Arrangements Chairman of the 1984 IEEE International Symposium on Circuits and Systems. During 1988, he was a member of the Admission and Advancement Committee of the IEEE. He has also served as the Program Co-Chair for the 1995 IEEE International Conference on Neural Networks and Signal Processing, the 2003 IEEE International Conference on Neural Networks and Signal Processing, and the 2004 IEEE International Midwest Symposium on Circuits and Systems. He was the General Co-Chair for the 2008 IEEE International Conference on Neural Networks and Signal Processing. He is currently the Chair of the Montreal Chapter of the IEEE Circuits and Systems Society. He is a recipient of numerous honors and awards, including the Wighton Fellowship from the Sandford Fleming Foundation, an induction to the Provost’s Circle of Distinction for career achievements, and the Award of Excellence in Doctoral Supervision from the Faculty of Engineering and Computer Science, Concordia University.