

Signal Processing 87 (2007) 1089–1099

www.elsevier.com/locate/sigpro

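Efficient array architectures for multi-dimensional lifting-based discrete wavelet transforms

Cheng-yi Xiong a,b,*, Jian-hua Hou a,b, Jin-wen Tian b, Jian Liu b

a College of Electronic Information Engineering, South-Center University for Nationalities, Wuhan 430074, China
b Institute of Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan 430074, China

* Corresponding author. College of Electronic Information Engineering, South-Center University for Nationalities, Wuhan 430074, China. Tel./fax: +86 2767842854. E-mail address: [email protected] (C.-y. Xiong).

doi:10.1016/j.sigpro.2006.10.001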

Received 5 March 2006; received in revised form 16 September 2006; accepted 2 October 2006

Available online 26 October 2006

Abstract

Efficient array architectures for the multi-dimensional (m-D) discrete wavelet transform (DWT), e.g. $m = 2, 3$, are presented, in which the lifting scheme of the DWT is used to reduce hardware complexity. The parallelism of the $2^m$ subband transforms in the lifting-based m-D DWT is explored, which increases the throughput rate of the separable m-D DWT with little additional hardware overhead. The proposed architecture is composed of $m2^{m-1}$ 1-D DWT modules working in parallel and in pipelined fashion, and is designed to process $2^m$ input samples per clock cycle and to generate $2^m$ subband coefficients synchronously. The total time for one level of decomposition of a 2-D image of size $N^2$ is approximately $N^2/4$ intra-clock cycles (ccs), and that for a 3-D image sequence of size $MN^2$ is approximately $MN^2/8$ ccs. To the best of our knowledge, efficient line-based architecture frameworks for both the 2D+t (spatial domain decomposition first, followed by temporal directional decomposition) and the t+2D (temporal directional decomposition first, followed by spatial domain decomposition) 3-D DWT are proposed here for the first time. Compared with similar works reported in the literature, the proposed architectures perform well in terms of throughput rate and system output latency, and offer a good tradeoff between throughput rate and hardware complexity. The proposed architectures are simple, regular, scalable and well suited for VLSI implementation.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Discrete wavelet transform; Multi-dimensional; Lifting scheme; Parallel; VLSI architecture

1. Introduction

The discrete wavelet transform (DWT) has been widely used as a powerful tool in many applications, such as signal processing, numerical analysis, computer graphics, image compression, etc. [1,2]. The two-dimensional (2-D) DWT has been adopted in compression applications for still images and image sequences [3,4]. The 3-D DWT has been employed in applications such as video compression [5,8] and magnetic resonance image (MRI) compression [6], as well as noise reduction between frames of a video sequence [7]. Since the DWT is a computation-intensive algorithm, dedicated VLSI solutions have been considered to meet the real-time requirements of practical applications.

So far, a large number of efficient architectures for the 2-D and 3-D wavelet transform (WT) have been presented [9–31]. DWT implementations can be classified into two categories: one is based on the convolution operation [1], and the other is based on the lifting scheme [33,34]. The lifting-based DWT has many advantages over the conventional convolution-based one [33–35]. In particular, the lifting scheme can reduce the computational complexity of the DWT. When the multi-dimensional (m-D) wavelet basis functions are separable, there are two main approaches to computing the m-D DWT: the separable approach and the non-separable approach [10,11]. The separable approach performs the m-D DWT as a sequence of 1-D DWTs, dimension by dimension, which requires a very large memory to save the intermediate data that must be transposed for the next dimension, and has long output latency and system latency (SL). The non-separable approach does not require any transposition but requires more multipliers and accumulators (MACs) than the separable approach.

In order to trade off speed against area, some line-based [13] architectures for the 2-D DWT that exploit parallelism and pipelining have been proposed. Chrysafis et al. [13] first proposed line-based architectures for the 2-D DWT and an image coder in order to reduce memory. Wu et al. [14] proposed an efficient line-based architecture for the direct 2-D DWT, in which the polyphase decomposition technique and the coefficient folding technique were employed to increase the speed and the hardware utilization. Park et al. [15] proposed a high-speed lattice-based VLSI architecture for the 2-D DWT for real-time video signal processing. Marino [17] proposed a high-speed/low-power pipelined architecture for the direct 2-D DWT in which the four subband transforms are performed in parallel. However, those architectures were all developed for the convolution-based DWT, and hence had higher hardware complexity. Recently, some researchers have used the lifting scheme in DWT architectures to further improve the performance of 2-D DWT hardware implementations. Jiang et al. [21] first proposed a novel lifting-based system architecture based on overlap-state sequential and split-and-merge parallel processing with a boundary post-processing technique, in order to reduce the memory requirements and the communication between the processors. Andra et al. [22] proposed a block-based four-processor architecture. However, those architectures all use a block-based input mode and thus require a large raw data buffer. Dillen et al. [24] proposed a combined line-based architecture for the Le Gall 5/3 and Daubechies 9/7 DWT, which was implemented for one-level decomposition. Liu et al. [32] proposed an efficient line-based 2-D architecture using a spatial combinative lifting algorithm for the Daubechies 9/7 DWT, with a reduced number of multiply operations. Liao et al. [24] proposed a lifting-based 2-D multi-level architecture using the recursive pyramid algorithm, and a one-level architecture using a dual scan fashion. Wu et al. [30] proposed a high-performance and memory-efficient pipeline architecture for the Le Gall 5/3 and Daubechies 9/7 WT. However, all of them still had a limited data processing throughput rate.

For the implementation of the 3-D DWT, Weeks et al. [20] first proposed two efficient architectures. Dai et al. [25] proposed a high-speed architecture using the polyphase decomposition technique. Das et al. [28] proposed a solution for implementing a running 3-D DWT using Daubechies four-tap (D4) wavelet filters. However, these are all based on the convolution DWT, and thus had higher hardware complexity and longer output latency.

In this paper, novel generic VLSI architectures for the 2-D and 3-D DWT are proposed, using the lifting scheme and building on our previous work [31]. The proposed approach can be extended straightforwardly to the design of architectures for higher dimensional DWTs. The parallelism of the $2^m$ subband transforms in the lifting-based m-D DWT is explored, which increases the throughput rate of the separable m-D DWT with little additional hardware overhead. The proposed architecture is composed of $m2^{m-1}$ 1-D DWT modules working in parallel and in pipelined fashion, and is designed to process $2^m$ input samples per working clock cycle and to generate $2^m$ subband coefficients synchronously. The total time for one level of decomposition of a 2-D image of size $N^2$ is approximately $N^2/4$ intra-clock cycles (ccs), and that for decomposing a 3-D image sequence of size $MN^2$ is approximately $MN^2/8$ ccs. To the best of our knowledge, efficient line-based architecture frameworks for both the 2D+t (spatial domain decomposition first, followed by temporal directional decomposition) and the t+2D (temporal directional decomposition first, followed by spatial domain decomposition) 3-D DWT are proposed here for the first time. Compared with similar works reported in the literature, the proposed architectures perform well in terms of system output latency and throughput rate, and offer a good tradeoff between throughput rate and hardware complexity. The new architectures are simple, regular, scalable and well suited for VLSI implementation.

The rest of the paper is structured as follows. Section 2 gives a brief review of the lifting scheme for the DWT. Section 3 describes the proposed architectures for the multi-dimensional DWT. The performance analysis of the proposed architectures and comparisons with other designs are presented in Section 4. Finally, a brief conclusion is drawn in Section 5.

2. Lifting scheme of the DWT

The lifting scheme is a method for constructing wavelet bases, first introduced by Sweldens in the 1990s [33]. It was developed from the earlier work of Donoho, which built wavelets from interpolating scaling functions, and the work of Lounsbery et al., which constructed wavelets for polyhedral surfaces. The main difference from such classical constructions as [33,34] is that it depends entirely on the spatial domain. It is therefore suitable for constructing wavelets in settings that lack translation and dilation, where the Fourier transform is no longer available. Wavelets built in this way are called second-generation wavelets. The scheme can also be used to build first-generation wavelets, and leads to a faster, fully in-place implementation of the DWT.

The basic idea behind the lifting scheme is a relationship among all biorthogonal wavelets that share the same scaling function, such that one can construct the desired wavelet from a simple one. Daubechies and Sweldens proved [34] that any wavelet with FIR filters can be factorized into a finite number of alternating lifting and dual lifting steps starting from the Lazy wavelet. This implies that any such wavelet can be derived from an arbitrary wavelet, including the Lazy wavelet, by a finite number of lifting and dual lifting steps.

The main characteristic of the lifting-based DWT scheme is that it decomposes the high-pass and low-pass filters into a sequence of upper and lower triangular matrices and converts the filter implementation into banded matrix multiplications [34]. Such a scheme has several advantages, including "in-place" computation of the DWT, the integer-to-integer wavelet transform (IWT) [35], symmetric forward and inverse transforms, etc. It therefore comes as no surprise that lifting has been chosen in the still image compression standard JPEG2000.

Let $h(z)$ and $g(z)$ denote the low-pass and high-pass analysis filters, and $\tilde{h}(z)$ and $\tilde{g}(z)$ the low-pass and high-pass synthesis filters, respectively. The corresponding decomposition and reconstruction polyphase matrices, denoted as $P(z)$ and $\tilde{P}(z)$, respectively, are defined as follows:

$$P(z) = \begin{bmatrix} h_e(z) & h_o(z) \\ g_e(z) & g_o(z) \end{bmatrix} \tag{1a}$$

and

$$\tilde{P}(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix}, \tag{1b}$$

where $h_e(z)$ and $g_e(z)$ ($h_o(z)$ and $g_o(z)$) represent the even parts (odd parts) of the low-pass and high-pass wavelet filters, respectively. It has been shown in [33,34] that if $h(z)$ and $g(z)$ are a pair of complementary filters, then $P(z)$ can always be factorized by the lifting scheme as follows [34]:

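$$P(z) = \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \prod_{i=1}^{l_t} \left\{ \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \right\} \tag{2a}$$

or

$$P(z) = \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \prod_{i=1}^{l_t} \left\{ \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \right\}. \tag{2b}$$

Here $K$ is a constant, $t_i(z)$ and $s_i(z)$ denote the primary lifting and dual lifting polynomials (or vice versa), respectively, and $l_t$ is the total number of lifting steps required. For example, the Daubechies four-tap (D4) wavelet filters can be factored as [34]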

$$P(z) = \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b + c z^{-1} & 1 \end{bmatrix} \begin{bmatrix} 1 & z \\ 0 & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}, \tag{3}$$

where $a = -\sqrt{3}$, $b = \sqrt{3}/4$, $c = (\sqrt{3} - 2)/4$ and $K = (\sqrt{3} + 1)/\sqrt{2}$.

Accordingly, the 1-D DWT can be written as

$$[L(z), H(z)]^t = P(z^{-1})^t \, [X_e(z), X_o(z)]^t, \tag{4}$$

where $L(z)$ and $H(z)$ represent the Z-transforms of the low-frequency subband sequence $l(n)$ and the high-frequency subband sequence $h(n)$, respectively; $x(n)$ denotes the input sequence, and $X_e(z)$ and $X_o(z)$ denote the Z-transforms of the even-numbered sequence $x_e(n) = x(2n)$ and the odd-numbered sequence $x_o(n) = x(2n+1)$, respectively; the superscript $t$ signifies the transpose operation.
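As an illustration of how (3) and (4) turn into arithmetic, the following is a minimal Python sketch, not the authors' implementation. It expands $P(z^{-1})^t$ from (3) into three lifting steps followed by scaling. The function names `d4_forward` and `d4_inverse` are hypothetical, and periodic signal extension (via `np.roll`) is an assumption made here for simplicity; the paper does not specify boundary handling.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
A = -SQRT3                          # a in Eq. (3)
B = SQRT3 / 4.0                     # b in Eq. (3)
C = (SQRT3 - 2.0) / 4.0             # c in Eq. (3)
K = (SQRT3 + 1.0) / np.sqrt(2.0)    # scale factor K in Eq. (3)


def d4_forward(x):
    """One level of D4 analysis by lifting (periodic extension assumed)."""
    x = np.asarray(x, dtype=float)
    xe, xo = x[0::2], x[1::2]                 # x_e(n) = x(2n), x_o(n) = x(2n+1)
    # The transpose and z -> 1/z in Eq. (4) reverse the factor order of Eq. (3).
    d1 = xo + A * xe                          # factor [1 a; 0 1]
    s1 = xe + B * d1 + C * np.roll(d1, -1)    # factor with b + c z^-1: uses d1[n+1]
    d2 = d1 + np.roll(s1, 1)                  # factor [1 z; 0 1]: uses s1[n-1]
    return K * s1, d2 / K                     # scale normalization


def d4_inverse(low, high):
    """Undo d4_forward by running the lifting steps backwards."""
    s1 = np.asarray(low, dtype=float) / K
    d2 = np.asarray(high, dtype=float) * K
    d1 = d2 - np.roll(s1, 1)
    xe = s1 - B * d1 - C * np.roll(d1, -1)
    xo = d1 - A * xe
    x = np.empty(2 * xe.size)
    x[0::2], x[1::2] = xe, xo
    return x


ramp = np.arange(16.0)
low_ramp, high_ramp = d4_forward(ramp)
```

Because every lifting step is undone by subtracting what was added, the inverse needs no separate filter design. Away from the periodic wrap-around, the high-pass output of a linear ramp vanishes, which is the expected behaviour of a filter with two vanishing moments.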

Fig. 2. Scanned pattern of the input image.

3. Proposed architecture for m-D DWT

In this section, several new architectures for the m-D DWT are presented, in which the lifting scheme of the DWT is used to reduce hardware complexity, and the parallelism of the $2^m$ subband transforms in the lifting-based m-D DWT is explored to increase the throughput rate. For simplicity of description, only the 2-D and 3-D architectures are introduced in detail. The proposed architecture for the 2-D DWT is described first, followed by that for the 3-D DWT. The approach can be extended straightforwardly to the design of architectures for separable higher dimensional DWTs.

3.1. Proposed 2-D architecture

3.1.1. Algorithm

The flow diagram of the separable 2-D DWT implementation is shown in Fig. 1. Assuming that the input image is scanned row by row, from left to right, as shown in Fig. 2, the original input image (denoted as $x$) is first decomposed along the horizontal direction; the resulting outputs are then decomposed along the vertical direction, and four subbands are finally obtained, usually denoted as ll (low-low frequency), lh (low-high frequency), hl (high-low frequency) and hh (high-high frequency). The ll subband can be further decomposed in the same way. Let $x_{ee}(m,n)$ [$x_{oe}(m,n)$] represent the samples of the even-numbered rows and even-numbered columns (odd-numbered rows and even-numbered columns), and $x_{eo}(m,n)$ [$x_{oo}(m,n)$] represent the samples of the even-numbered rows and odd-numbered columns (odd-numbered rows and odd-numbered columns). Let $l_e(m,n)$ and $h_e(m,n)$ [$l_o(m,n)$ and $h_o(m,n)$] denote the low-frequency and high-frequency coefficients of the even-numbered rows (odd-numbered rows), respectively, where $m = 0, 1, 2, \ldots, M/2-1$, $n = 0, 1, 2, \ldots, N/2-1$, and $M$ and $N$ are even integers denoting the height and width of the image, respectively.

Fig. 1. Flow diagram of the separable 2-D DWT implementation: the input $x$ passes through a horizontal 1-D DWT producing $l$ and $h$, each of which is followed by a vertical 1-D DWT producing ll, lh and hl, hh, respectively.

We define the Z-transform of the $m$th row vector of the matrix $y(m,n)$, i.e. $(y(m,0)\;\; y(m,1)\;\; \ldots\;\; y(m,N/2-1))$, as

$$Y_m(z_1) = \sum_{n=0}^{N/2-1} y(m,n)\, z_1^{-n} \tag{5a}$$

and the Z-transform of the $n$th column vector of the matrix $y(m,n)$, i.e. $(y(0,n)\;\; y(1,n)\;\; \ldots\;\; y(M/2-1,n))$, as

$$Y_n(z_2) = \sum_{m=0}^{M/2-1} y(m,n)\, z_2^{-m}, \tag{5b}$$

where $z_1 = z^{-1}$ is equivalent to a unit time delay, while $z_2 = z^{-N/2}$ is equivalent to a delay of $N/2$ time units.

Let $X_{ee}(z)$ [$X_{oe}(z)$] and $X_{eo}(z)$ [$X_{oo}(z)$] represent the Z-transforms of the sequences $x_{ee}(m,n)$ [$x_{oe}(m,n)$] and $x_{eo}(m,n)$ [$x_{oo}(m,n)$], respectively; let $L_e(z)$ and $H_e(z)$ [$L_o(z)$ and $H_o(z)$] represent the Z-transforms of $l_e(m,n)$ and $h_e(m,n)$ [$l_o(m,n)$ and $h_o(m,n)$]; and let $LL(z)$, $LH(z)$, $HL(z)$ and $HH(z)$ represent the Z-transforms of the matrices ll, lh, hl and hh, respectively. The separable 2-D DWT can then be written as follows (the subscripts $m$ and/or $n$ are omitted for simplicity):

$$[L_e(z_1), H_e(z_1)]^t = P(z_1^{-1})^t \, [X_{ee}(z_1), X_{eo}(z_1)]^t, \tag{6a}$$

$$[L_o(z_1), H_o(z_1)]^t = P(z_1^{-1})^t \, [X_{oe}(z_1), X_{oo}(z_1)]^t, \tag{6b}$$

$$[LL(z_2), LH(z_2)]^t = P(z_2^{-1})^t \, [L_e(z_2), L_o(z_2)]^t, \tag{7a}$$

$$[HL(z_2), HH(z_2)]^t = P(z_2^{-1})^t \, [H_e(z_2), H_o(z_2)]^t, \tag{7b}$$

where (6a) expresses the 1-D row transform of the even-row inputs and (6b) the 1-D row transform of the odd-row inputs, while (7a) represents the 1-D column transform of the $l$ subband coefficients and (7b) the 1-D column transform of the $h$ subband coefficients.

Fig. 4. Circuit block diagram of SNU in Fig. 3 (multiplier labels $K^2$ and $1/K^2$; outputs ll, lh, hl, hh).
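The four transforms in (6) and (7) can be sketched in Python as four calls to one lifting routine, two along the rows and two along the columns. This is only an illustrative sketch under assumptions not taken from the paper: the D4 factorization of (3) is used as the example filter, periodic extension is assumed at the borders, and `lift_pair` and `dwt2_one_level` are hypothetical names.

```python
import numpy as np

S3 = np.sqrt(3.0)
A, B, C = -S3, S3 / 4.0, (S3 - 2.0) / 4.0   # a, b, c of Eq. (3)
K = (S3 + 1.0) / np.sqrt(2.0)               # K of Eq. (3)


def lift_pair(even, odd, axis):
    """D4 lifting of an even/odd sample pair along `axis` (periodic extension)."""
    d1 = odd + A * even
    s1 = even + B * d1 + C * np.roll(d1, -1, axis=axis)
    d2 = d1 + np.roll(s1, 1, axis=axis)
    return K * s1, d2 / K


def dwt2_one_level(x):
    """One decomposition level following Eqs. (6a)-(7b)."""
    x = np.asarray(x, dtype=float)
    x_ee, x_eo = x[0::2, 0::2], x[0::2, 1::2]   # even rows: even / odd columns
    x_oe, x_oo = x[1::2, 0::2], x[1::2, 1::2]   # odd rows: even / odd columns
    l_e, h_e = lift_pair(x_ee, x_eo, axis=1)    # Eq. (6a), row transform
    l_o, h_o = lift_pair(x_oe, x_oo, axis=1)    # Eq. (6b), row transform
    ll, lh = lift_pair(l_e, l_o, axis=0)        # Eq. (7a), column transform
    hl, hh = lift_pair(h_e, h_o, axis=0)        # Eq. (7b), column transform
    return ll, lh, hl, hh


flat = np.ones((8, 8))
ll_flat, lh_flat, hl_flat, hh_flat = dwt2_one_level(flat)
```

Each of the four `lift_pair` calls corresponds to one 1-D DWT module, and none of them needs a transposed copy of the data: the column transform only pairs the row outputs of an even row with those of the following odd row.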

3.1.2. Architecture

In order to increase the throughput rate of the 2-D DWT, an efficient solution is to start the decomposition along the vertical direction as soon as sufficient data have been generated by the decomposition along the horizontal direction. In the remainder of this section, a novel line-based architecture for the separable 2-D DWT is proposed, as shown in Fig. 3. It significantly increases the throughput rate with little additional hardware overhead by making the best use of the parallelism of the four subband transforms implied by (6) and (7). The input image is still assumed to be scanned line by line, as shown in Fig. 2. The proposed 2-D architecture is composed of an input buffer unit (IBU) and a WT module. The WT module is a four-input/four-output architecture that includes two row-wise 1-D DWT modules (RW1 and RW2), two column-wise 1-D DWT modules (CW1 and CW2), and a scale normalization unit (SNU). The four 1-D DWT modules perform the filtering along the horizontal and vertical directions, respectively, working in parallel and in pipelined fashion. The SNU, designed as shown in Fig. 4, integrates the scale normalization operations required in the row transform and the column transform. This reduces the number of multipliers required in the 2-D DWT architecture, because the scale normalization factor for low-pass filtering is the inverse of that for high-pass filtering, as indicated by (2).

Fig. 3. Proposed line-based architecture for the separable 2-D DWT. The IBU (FIFO1 to FIFO4) supplies $x_{ee}$, $x_{eo}$, $x_{oe}$ and $x_{oo}$ to the WT module, in which RW1 and RW2 produce $l_e$, $h_e$, $l_o$ and $h_o$, and CW1, CW2 and the SNU produce ll, lh, hl and hh.
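The saving obtained by merging the normalizations can be checked numerically. The sketch below is an illustration under assumptions, not the circuit of Fig. 4: it uses the D4 lifting steps of (3) with periodic extension, and the names `lift` and `dwt2` are hypothetical. With `merged_snu=True` the four lifting passes skip their own scaling, and a single final stage multiplies ll by $K^2$ and hh by $1/K^2$; lh and hl need no multiplier because $K \cdot (1/K) = 1$.

```python
import numpy as np

S3 = np.sqrt(3.0)
A, B, C = -S3, S3 / 4.0, (S3 - 2.0) / 4.0   # a, b, c of Eq. (3)
K = (S3 + 1.0) / np.sqrt(2.0)               # K of Eq. (3)


def lift(even, odd, axis, scale):
    """D4 lifting steps along `axis`; the K, 1/K scaling is optional."""
    d1 = odd + A * even
    s1 = even + B * d1 + C * np.roll(d1, -1, axis=axis)
    d2 = d1 + np.roll(s1, 1, axis=axis)
    return (K * s1, d2 / K) if scale else (s1, d2)


def dwt2(x, merged_snu):
    """2-D DWT with per-module scaling, or with one merged scaling stage."""
    x = np.asarray(x, dtype=float)
    scale = not merged_snu
    l_e, h_e = lift(x[0::2, 0::2], x[0::2, 1::2], 1, scale)   # RW1
    l_o, h_o = lift(x[1::2, 0::2], x[1::2, 1::2], 1, scale)   # RW2
    ll, lh = lift(l_e, l_o, 0, scale)                         # CW1
    hl, hh = lift(h_e, h_o, 0, scale)                         # CW2
    if merged_snu:
        # Merged normalization: only two multiplications remain.
        ll, hh = (K * K) * ll, hh / (K * K)
    return ll, lh, hl, hh


rng = np.random.default_rng(2)
image = rng.normal(size=(8, 8))
per_module = dwt2(image, merged_snu=False)
merged = dwt2(image, merged_snu=True)
```

Both variants give the same subbands because the scaling is a constant gain and the lifting steps are linear, so the gains can be moved to the output and multiplied together.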

Four input samples must be fed to the WT module simultaneously in each internal working clock cycle, while four subband coefficients (one for each subband) are generated synchronously. Two of the input samples come from an even-numbered row, and the other two from an odd-numbered row, so two lines of samples must be available to the WT module at the same time. Since the data samples are assumed to arrive line by line, as explained above, an IBU is required to buffer the needed lines of data. The IBU can be implemented by four FIFOs (first-in-first-out buffers, named FIFO1, FIFO2, FIFO3 and FIFO4, with sizes of about $N/2$, $N/2$, $N/4$ and $N/4$, respectively, where $N$ represents the width of the image), which store the samples from the even-row/even-column, even-row/odd-column, odd-row/even-column and odd-row/odd-column positions separately. In order to provide four samples in each internal working clock cycle, the input data samples must be acquired at a clock rate four times faster than the internal working clock, i.e. $f_s = 4f_w$, where $f_s$ denotes the input data sampling frequency and $f_w$ the internal working frequency. Buffering of the odd-numbered (or even-numbered) rows of samples can be completed during the period in which the second halves of the rows are being processed.
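The routing performed by the IBU can be illustrated with a small behavioural model. The sketch below is hypothetical and deliberately simplified: it sorts raster-scanned samples into four parity FIFOs and fires one working cycle whenever all four hold a sample. It does not model the clock ratio $f_s = 4f_w$ or the FIFO depths quoted above; it only shows which samples are grouped together and that an $M \times N$ image needs $MN/4$ working cycles. The function name `ibu_schedule` is an assumption.

```python
from collections import deque


def ibu_schedule(image):
    """Group raster-scanned samples into (x_ee, x_eo, x_oe, x_oo) quadruples."""
    order = ("ee", "eo", "oe", "oo")
    fifos = {name: deque() for name in order}
    cycles = []
    for m, row in enumerate(image):          # rows arrive one after another
        for n, sample in enumerate(row):     # samples arrive left to right
            name = "eo"[m % 2] + "eo"[n % 2]     # row parity, then column parity
            fifos[name].append(sample)
            # One working cycle consumes one sample from each FIFO.
            if all(fifos[k] for k in order):
                cycles.append(tuple(fifos[k].popleft() for k in order))
    return cycles


# Tag each sample with 10*row + column so the grouping is easy to read.
demo = [[10 * m + n for n in range(6)] for m in range(4)]
schedule = ibu_schedule(demo)
```

For the 4 x 6 example the model produces six quadruples, i.e. one quarter of the 24 input samples, in line with the $N^2/4$ cycle count stated for a square image.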

As before, let $x_{ee}(m,n)$ [$x_{oe}(m,n)$] represent the samples of the even-numbered rows and even-numbered columns (odd-numbered rows and even-numbered columns), and $x_{eo}(m,n)$ [$x_{oo}(m,n)$] the samples of the even-numbered rows and odd-numbered columns (odd-numbered rows and odd-numbered columns). Let $l_e(m,n)$ and $h_e(m,n)$ [$l_o(m,n)$ and $h_o(m,n)$] denote the low-frequency and high-frequency coefficients of the even-numbered rows (odd-numbered rows), respectively. In each internal clock cycle, the four inputs are fed in parallel to the row modules: $x_{ee}(m,n)$ and $x_{eo}(m,n)$ to RW1, and $x_{oe}(m,n)$ and $x_{oo}(m,n)$ to RW2. RW1 generates one low-frequency coefficient $l_e(m,n)$ and one high-frequency coefficient $h_e(m,n)$ for the even-numbered row in each clock cycle, while RW2 produces one low-frequency coefficient $l_o(m,n)$ and one high-frequency coefficient $h_o(m,n)$ for the odd-numbered row in each clock cycle. The outputs of RW1 and RW2 are then pipelined to CW1 and CW2: $l_e(m,n)$ and $l_o(m,n)$ are fed to CW1 and decomposed into the low-low frequency (ll) and low-high frequency (lh) subband components, which are completed by the scale normalization operations. Meanwhile, $h_e(m,n)$ and $h_o(m,n)$ are fed in parallel to CW2 and decomposed into the high-low frequency (hl) and high-high frequency (hh) subband components, again completed by the scale normalization operations. Note that the arguments $(m,n)$ of all the above variables are omitted in Fig. 3 for simplicity.

Fig. 6. Flow diagram of the separable 3-D DWT implementation: (a) t+2D type (temporal, then horizontal, then vertical 1-D DWT); (b) 2D+t type (horizontal, then vertical, then temporal 1-D DWT). Both produce the subbands lll, llh, lhl, lhh, hll, hlh, hhl and hhh from the input $x$.

The architecture of the RW module can be designed by directly mapping the lifting factorization of the chosen wavelet filter; it is a two-input/two-output architecture implemented using parallel and pipeline techniques. For example, the flow diagram of the 1-D architecture designed for the lifting-based D4 DWT is shown in Fig. 5 (the scale normalizations are omitted there). The architecture of the CW module is obtained from that of the RW module: the only difference is that the delay registers used in RW are replaced by the corresponding delay lines. The length of each delay line should be equal to a delay of $N/2$ time units, owing to the decimation of the DWT.
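The relation between the RW and CW modules can be illustrated with a streaming model in which every unit delay is a buffer of configurable length. The sketch below is an assumption-laden illustration rather than the register allocation of Fig. 5: it implements the D4 lifting steps of (3) causally, with three delay buffers chosen here for clarity and zero initial state. With `delay=1` it acts along a row; with `delay` set to the row length of the coefficient stream ($N/2$) the same code acts along a column of raster-ordered data. The name `stream_lift` is hypothetical.

```python
from collections import deque
import numpy as np

S3 = np.sqrt(3.0)
A, B, C = -S3, S3 / 4.0, (S3 - 2.0) / 4.0   # a, b, c of Eq. (3)
K = (S3 + 1.0) / np.sqrt(2.0)               # K of Eq. (3)


def stream_lift(pairs, delay):
    """Causal D4 lifting on (even, odd) pairs; each z^-1 is `delay` samples long.

    The pair emitted at step k belongs to stream index k - delay.
    """
    xe_line = deque([0.0] * delay, maxlen=delay)
    d1_line = deque([0.0] * delay, maxlen=delay)
    s1_line = deque([0.0] * delay, maxlen=delay)
    out = []
    for xe, xo in pairs:
        d1 = xo + A * xe
        s1 = xe_line[0] + B * d1_line[0] + C * d1   # s1 at index k - delay
        d2 = d1_line[0] + s1_line[0]                # adds s1 at index k - 2*delay
        out.append((K * s1, d2 / K))
        xe_line.append(xe)
        d1_line.append(d1)
        s1_line.append(s1)
    return out


# Column-wise use: raster-ordered l_e / l_o rows of width W = N/2.
W, ROWS = 4, 5
rng = np.random.default_rng(3)
l_e, l_o = rng.normal(size=(2, ROWS, W))
out = stream_lift(zip(l_e.ravel(), l_o.ravel()), delay=W)
ll_stream = np.array([o[0] for o in out[W:]]).reshape(ROWS - 1, W)
lh_stream = np.array([o[1] for o in out[2 * W:]]).reshape(ROWS - 2, W)

# Batch reference along the vertical axis, used only for comparison.
d1_ref = l_o + A * l_e
s1_ref = l_e + B * d1_ref + C * np.roll(d1_ref, -1, axis=0)
ll_ref = K * s1_ref
lh_ref = (d1_ref + np.roll(s1_ref, 1, axis=0)) / K
```

The streamed outputs match the batch column transform on every row that is not affected by start-up state or by the periodic wrap of the reference, which is why the comparison skips the first and last rows.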

3.2. Proposed 3-D architecture

The 3-D DWT has been used in video compression, MRI compression, and so on. The separable 3-D DWT can be implemented in two ways: t+2D and 2D+t. In the t+2D type, the 3-D DWT is implemented by performing the temporal (inter-frame) transform first, followed by the spatial transform, while in the 2D+t type the spatial transform is performed first and the temporal transform next. The flow diagrams of the two types of 3-D transform are shown in Fig. 6(a) and (b), respectively. In Fig. 6(a), the input image sequence is first decomposed along the temporal direction, then along the horizontal direction, and last along the vertical direction. In Fig. 6(b), the input image sequence is first decomposed along the horizontal direction, then along the vertical direction, and last along the temporal direction. The resulting eight subbands are usually denoted as lll, llh, lhl, lhh, hll, hlh, hhl and hhh, where lll denotes the low-low-low frequency subband, llh the low-low-high frequency subband, and so on. The lll subband can then be further decomposed in the same way. The structure illustrated in Fig. 6(a) can work in two operation modes, causal and non-causal, whereas that illustrated in Fig. 6(b) can only work in the causal mode. In the causal mode, the image sequence is scanned in real time in the following order: first row-wise, then column-wise, then frame-wise. In the non-causal mode, a block of the image sequence must first be stored in an external memory, and is then read in the following order: first frame-wise, then row-wise, then column-wise. As shown later in the paper, the causal mode requires a smaller external memory and has a shorter output latency than the non-causal mode, but requires a larger intermediate data buffer.

Fig. 5. Flow diagram of 1-D architecture for lifting-based D4 wavelet filters (inputs $x_e$, $x_o$; outputs $l$, $h$; multipliers $a$, $b$, $c$; delay element $Z^{-1}$).

The algorithm for implementing the separable 2-D DWT in the Z-domain, as described by (6) and (7), can be extended directly to the separable 3-D DWT. Efficient line-based architectures for the separable 3-D DWT, obtained by directly extending that for the separable 2-D DWT, are therefore proposed as shown in Fig. 7. The architecture in Fig. 7(a) implements the t+2D type 3-D DWT, while that in Fig. 7(b) implements the 2D+t type 3-D DWT. Both architectures, working in the causal mode, are basically similar and are composed of an IBU and a WT module. The architecture for the non-causal mode can be designed in a similar way to that shown in Fig. 7(a), except that its IBU can be removed. The WT module is an eight-input/eight-output architecture with three stages of transforms working in parallel and in pipelined fashion, which includes four frame-wise 1-D DWT modules (FW1 to FW4), four row-wise 1-D DWT modules (RW1 to RW4), four column-wise 1-D DWT modules (CW1 to CW4), and an SNU. A set of eight input samples is required in each intra-clock cycle to feed the first stage of decomposition modules. These samples come separately from the even-frame even-row even-column, even-frame even-row odd-column, even-frame odd-row even-column, even-frame odd-row odd-column, odd-frame even-row even-column, odd-frame even-row odd-column, odd-frame odd-row even-column, and odd-frame odd-row odd-column positions, and are denoted in turn as $x_{eee}(m,n)$, $x_{eeo}(m,n)$, $x_{eoe}(m,n)$, $x_{eoo}(m,n)$, $x_{oee}(m,n)$, $x_{oeo}(m,n)$, $x_{ooe}(m,n)$ and $x_{ooo}(m,n)$. The 12 1-D DWT modules perform the filtering along the temporal, horizontal and vertical directions, respectively, and work in parallel and in pipelined fashion.

Fig. 7. Proposed line-based architecture for the separable 3-D DWT: (a) t+2D type; (b) 2D+t type. In both, the IBU (FIFO1 to FIFO8) supplies the eight parity streams to the WT module, whose three stages of four 1-D DWT modules each are followed by the SNU, which outputs lll, llh, hll, hlh, lhl, lhh, hhl and hhh.
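The three-stage structure with four modules per stage can be mimicked in a few lines of Python. The sketch below is illustrative only and rests on assumptions: D4 lifting from (3) as the example filter, periodic extension, an array layout `x[frame, row, column]`, and hypothetical names `lift` and `dwt3`. Subband keys are written in (temporal, vertical, horizontal) axis order regardless of processing order, which is a labelling choice made here so that the two orders can be compared directly; it is not the labelling of Fig. 6.

```python
from itertools import product
import numpy as np

S3 = np.sqrt(3.0)
A, B, C = -S3, S3 / 4.0, (S3 - 2.0) / 4.0   # a, b, c of Eq. (3)
K = (S3 + 1.0) / np.sqrt(2.0)               # K of Eq. (3)


def lift(even, odd, axis):
    """D4 lifting of an even/odd pair along `axis` (periodic extension)."""
    d1 = odd + A * even
    s1 = even + B * d1 + C * np.roll(d1, -1, axis=axis)
    d2 = d1 + np.roll(s1, 1, axis=axis)
    return K * s1, d2 / K


def dwt3(x, order):
    """One 3-D level; `order` lists the axes in processing order."""
    x = np.asarray(x, dtype=float)
    # Eight parity streams: 0 = even, 1 = odd for (frame, row, column).
    streams = {p: x[p[0]::2, p[1]::2, p[2]::2] for p in product((0, 1), repeat=3)}
    calls = 0
    for axis in order:
        nxt = {}
        for key, even in streams.items():
            if key[axis] != 0:          # visit each even/odd pair once
                continue
            odd = streams[key[:axis] + (1,) + key[axis + 1:]]
            low, high = lift(even, odd, axis)      # one 1-D DWT module
            nxt[key[:axis] + ("l",) + key[axis + 1:]] = low
            nxt[key[:axis] + ("h",) + key[axis + 1:]] = high
            calls += 1
        streams = nxt
    return {"".join(k): v for k, v in streams.items()}, calls


rng = np.random.default_rng(4)
video = rng.normal(size=(4, 8, 8))
t2d, calls_t2d = dwt3(video, order=(0, 2, 1))   # temporal, horizontal, vertical
d2t, calls_d2t = dwt3(video, order=(2, 1, 0))   # horizontal, vertical, temporal
```

Each stage issues four lifting calls, giving the 12 modules mentioned above. In this idealized model the t+2D and 2D+t orders yield the same coefficients, since the 1-D transforms act on different axes; the two architectures differ in buffering and latency rather than in the values they compute.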

Fig. 8. Flow diagram of SNU in Fig. 7 (multiplier labels $K^3$, $K$, $1/K$ and $1/K^3$; outputs lll, llh, hll, hlh, lhh, lhl, hhl, hhh).

Similarly, the SNU of Fig. 7, designed as shown in Fig. 8, integrates the scale normalization operations required in the temporal transform, the row transform and the column transform, which reduces the number of multipliers required in the 3-D DWT architecture. The IBU can be implemented by eight FIFOs (named FIFO1 to FIFO8, with sizes of about $N^2/4$, $N^2/4$, $N^2/4$, $N^2/4$, $N^2/8$, $N^2/8$, $N^2/8$ and $N^2/8$, respectively, where $N^2$ represents the size of a frame image), which buffer the samples from the even-row/even-column, even-row/odd-column, odd-row/even-column and odd-row/odd-column positions of a successive even-indexed frame and odd-indexed frame. In order to provide eight samples in each internal working clock cycle, the input data samples must be acquired at a clock rate eight times faster than the internal working clock, i.e. $f_s = 8f_w$, where $f_s$ denotes the input data sampling frequency and $f_w$ the internal working frequency of the WT module. Buffering of the odd-numbered (or even-numbered) frames of samples can be completed during the period in which the second halves of the frames are being processed. In each internal clock cycle, eight input samples, one from each of the eight FIFOs, are fed in parallel to the four 1-D DWT modules of the first stage, while eight output coefficients are generated synchronously by the SNU. The detailed data flow can be seen in Fig. 7.
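The merged scale factors follow a simple pattern: each low-pass pass contributes $K$ and each high-pass pass contributes $1/K$, so a subband with $p$ letters "l" and $q$ letters "h" is scaled by $K^{p-q}$. The short Python sketch below tabulates these factors for any dimension. It is an arithmetic illustration only; the function name `snu_factors` is hypothetical and the D4 value of $K$ from (3) is used as an example.

```python
import math
from itertools import product

K = (math.sqrt(3.0) + 1.0) / math.sqrt(2.0)   # D4 value of K from Eq. (3)


def snu_factors(m, k=K):
    """Merged normalization factor of every subband of an m-D DWT."""
    return {
        "".join(band): k ** (band.count("l") - band.count("h"))
        for band in product("lh", repeat=m)
    }


factors_2d = snu_factors(2)   # ll: K^2, lh and hl: 1, hh: 1/K^2
factors_3d = snu_factors(3)   # lll: K^3, two l's: K, two h's: 1/K, hhh: 1/K^3
multipliers_2d = sum(1 for f in factors_2d.values() if f != 1.0)
multipliers_3d = sum(1 for f in factors_3d.values() if f != 1.0)
```

Under this count the 2-D SNU needs two non-trivial multipliers and the 3-D SNU needs eight, which agrees with the labels recoverable from Figs. 4 and 8 ($K^2$, $1/K^2$ and $K^3$, $K$, $1/K$, $1/K^3$).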

In causal mode, the architectures for RW and CW are the same as those for 2-D DWT presented in Section 3.1. The architecture for FW is similar to that for CW, but the size of the delay line required in the former is increased from N/2 to N^2/4. In non-causal mode, the delay lines required in FW working under causal mode are replaced by the corresponding delay registers, while the delay registers used in RW are replaced by the corresponding delay lines of size M/2, and the size of the delay line required in CW is increased from N/2 to NM/4, where M represents the frame number of the buffered image sequence.

4. Performance evaluations

Performance evaluation for several architectures of m-D DWT in terms of hardware complexity and computation complexity is given in this section. Our architectures are compared in detail with several other efficient designs (i.e. those presented in [14,24,25,28,30]). The work in [14] presented an efficient 2-D architecture for convolution-based DWT. The work in [24] presented an efficient 2-D architecture for lifting-based DWT, and that in [30] presented an efficient lifting-based 2-D architecture for the Legall5/3 and Daubechies9/7 wavelet filters. The work in [25] presented an efficient m-D architecture for convolution-based DWT, and that in [28] presented an efficient 3-D architecture for convolution-based DWT using Daubechies 4-tap filters. The numbers of multipliers and adders, the size of buffer memory, and the control complexity (CC) are used to measure the hardware complexity for m-D DWT. The computation complexity is measured by the SL of performing one-level m-D DWT, which is normalized to intra-clock cycles (ccs).

According to the architectures shown in Figs. 3 and 7, four and 12 1-D DWT modules are used to implement the 2-D and 3-D DWT, respectively. The total numbers of multipliers and adders used in the WT module are multiples of those required in the original 1-D DWT architecture, and the size of buffer occupied in the WT module is proportional to the number of delay registers used in the original 1-D DWT architecture. They all depend on the chosen wavelet filters and their lifting factorization. Let K_M, K_A and K_R represent, respectively, the numbers of multipliers, adders and delay registers used in the original 1-D DWT architecture. Then K_M = 6 multipliers, K_A = 8 adders and K_R = 4 delay registers are required when the lifting-based Daubechies9/7 [34] WT is chosen, and K_M = 5 multipliers, K_A = 4 adders and K_R = 1 delay register are required when the lifting-based D4 [34] WT is chosen. Because the normalization operations are integrated, the number of multipliers required in the 2-D WT module is 4K_M - 6, and that required in the 3-D WT module is 12K_M - 16. The sizes of buffer required are 1.5N + 2K_R(N/2) = (1.5 + K_R)N for the 2-D architecture, and 1.5N^2 + 4K_R(N^2/4) + 4K_R(N/2) = (1.5 + K_R)N^2 + 2K_R N for the 3-D architecture in causal mode, while 4K_R M(N/4) + 4K_R(M/2) = K_R(N + 2)M for the 3-D architecture in non-causal mode. Here the parameter N denotes the width and height of the input image, and M represents the frame number of the buffered image sequence in non-causal mode.
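The resource expressions above are easy to evaluate for concrete filters. The following sketch plugs in the K_M, K_A and K_R values quoted for the DB9/7 and D4 filters; the adder counts 4K_A and 12K_A are taken from Tables 1 and 3, and the function name `resources` as well as the sizes N = 512 and M = 16 are assumptions chosen only for illustration.

```python
def resources(K_M, K_A, K_R, N, M):
    """Evaluate the multiplier, adder and buffer expressions of this section."""
    return {
        "mult_2d": 4 * K_M - 6,
        "mult_3d": 12 * K_M - 16,
        "add_2d": 4 * K_A,    # adder count as listed in Table 1
        "add_3d": 12 * K_A,   # adder count as listed in Table 3
        "buf_2d": (1.5 + K_R) * N,
        "buf_3d_causal": (1.5 + K_R) * N**2 + 2 * K_R * N,
        "buf_3d_noncausal": K_R * (N + 2) * M,
    }


# Example sizes (assumed): 512 x 512 frames, 16 buffered frames.
db97 = resources(K_M=6, K_A=8, K_R=4, N=512, M=16)
d4 = resources(K_M=5, K_A=4, K_R=1, N=512, M=16)

for name, value in db97.items():
    print(f"DB9/7 {name}: {value}")
```

For the DB9/7 case this gives 18 and 56 multipliers and buffer sizes of 5.5N and 5.5N^2 + 8N, which agree with the corresponding entries of Tables 2 and 4.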

The detailed performance comparisons of several 2-D architectures are listed in Tables 1 and 2, while those for 3-D architectures are listed in Tables 3 and 4. Tables 1 and 3 give the comparison results in the general case, and Tables 2 and 4 list the results when the Daubechies9/7 (DB9/7) and D4 wavelet filters [34] are chosen. For fairness of comparison, an IBU of size 1.5N is added to the buffer storage required in Dai's and Das's 2-D architectures, an IBU of size 1.5N^2 is added to the buffer storage required in Dai's and Das's 3-D architectures, and the optimized multiplier implementation adopted in Das's method is ignored. K_g and K_h represent, respectively, the lengths of the high-pass filter and the low-pass filter of the chosen wavelet filters. Note that, for simplicity of expression, the effect of the delay registers on memory size and output latency is omitted, because it is far smaller than that of the delay lines. In addition, our architectures are very regular, so their control logic and control circuits will be very simple. The comparison results demonstrate that the proposed architectures perform better in terms of the product of SL and hardware cost.

Table 1
Performance comparison in general case for several 2-D architectures

Architecture  Multipliers     Adders              Buffer memory         SL (ccs)  CC
Wu [14]       2(K_g + K_h)    2(K_g + K_h)        KN                    N^2/2     Complex
Liao [24]     2K_M            2K_A                (1 + K_R)N            N^2/2     Simple
Dai [25]      4(K_g + K_h)    4(K_g + K_h - 2)    (K_g + K_h - 2.5)N    N^2/4     Medium
Proposed      4K_M - 6        4K_A                (1.5 + K_R)N          N^2/4     Simple

Note: K = ceiling(K_g/2) + ceiling(K_h/2). N denotes the width of the image; N^2 denotes the size of the image.

Table 2
Performance comparisons for several 2-D architectures in cases of DB9/7 and D4 wavelet filters being chosen

Architecture  Filter  Multipliers  Adders  Buffer memory  OL (ccs)  SL (ccs)  CC
Wu [14]       DB9/7   34           32      9N             4N        N^2/2     Complex
              D4      16           16      4N             1.5N
Liao [24]     DB9/7   12           16      5N             2N        N^2/2     Simple
              D4      10           8       2N             N
Das [28]      D4      16           14      3.5N           N         N^2/2     Medium
Wu [30]       DB9/7   6            8       5.5N           4N        N^2       Medium
Dai [25]      DB9/7   64           56      13.5N          2N        N^2/4     Medium
              D4      32           24      5.5N           N/2
Proposed      DB9/7   18           32      5.5N           N         N^2/4     Simple
              D4      14           16      2.5N           N/2

Table 3
Performance comparisons in general case for several 3-D architectures

Architecture  Mode        Multipliers      Adders               On-chip memory              Off-chip memory  SL (ccs)  CC
Dai [25]      Non-causal  12(K_g + K_h)    12(K_g + K_h - 2)    (K_g + K_h - 4)(N + 2)M     M × N^2          MN^2/8    Complex
Proposed      Causal      12K_M - 16       12K_A                (1.5 + K_R)N^2 + 2K_R N     –                MN^2/8    Simple
              Non-causal  12K_M - 16       12K_A                K_R(N + 2)M                 M × N^2                    Medium

Table 4
Performance comparisons for several 3-D architectures in cases of DB9/7 and D4 wavelet filters being chosen

Architecture  Filter  Mode        Multipliers  Adders  On-chip memory  SL (ccs)  OL
Dai [25]      DB9/7   Non-causal  192          168     12(N + 2)M      MN^2/8    4s + MN^2
              D4                  96           72      4(N + 2)M       MN^2/8    s + MN^2
Das [28]      D4      Causal      48           40      3.5N^2 + 4N     MN^2/4    N + N^2
Proposed      DB9/7   Causal      56           96      5.5N^2 + 8N     MN^2/8    2s'
              D4                  44           48      2.5N^2 + 2N     MN^2/8    s'
              DB9/7   Non-causal  56           96      4(N + 2)M       MN^2/8    2s + MN^2
              D4                  44           48      (N + 2)M        MN^2/8    s + MN^2

Note: s = (2M + MN)/4, s' = (2N + N^2)/4.
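The claim about the product of SL and hardware cost can be spot-checked against the DB9/7 rows of Table 2. The text does not define "hardware cost" as a single number, so the sketch below uses the multiplier count alone as a stand-in; that choice, the helper name `multiplier_sl_product` and the size N = 512 are assumptions, and other cost measures (adders, buffer size) may rank the designs differently.

```python
# (multipliers, SL expressed as a fraction of N^2) for the DB9/7 rows of Table 2
rows = {
    "Wu [14]": (34, 1 / 2),
    "Liao [24]": (12, 1 / 2),
    "Wu [30]": (6, 1),
    "Dai [25]": (64, 1 / 4),
    "Proposed": (18, 1 / 4),
}


def multiplier_sl_product(name, N):
    """Multiplier count times SL (in ccs) for one architecture."""
    multipliers, sl_fraction = rows[name]
    return multipliers * sl_fraction * N**2


N = 512  # assumed image width
products = {name: multiplier_sl_product(name, N) for name in rows}
best = min(products, key=products.get)

for name, value in sorted(products.items(), key=lambda item: item[1]):
    print(f"{name}: {value / N**2:.2f} * N^2")
```

By this particular measure the proposed 2-D design has the smallest product (4.5 N^2, against 6 N^2 for the designs of [24] and [30]).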

5. Conclusions

Efficient VLSI architectures for 2-D and 3-D DWT have been proposed by using the lifting scheme of DWT. The parallelism among all subband transforms in lifting-based m-D DWT is explored to optimize the design, which efficiently increases the throughput rate of separable m-D DWT. Compared with similar works, our approach improves performance in terms of system output latency and throughput rate, and is a good alternative in the tradeoff between throughput rate and hardware complexity. The proposed architectures are simple, regular, scalable, and well suited for VLSI implementation.

Acknowledgements

The authors would like to thank the associate editor and anonymous reviewers for their valuable comments. This work was supported by the State "eleventh-five" Key Plan Project Foundation of China (C1120061304), the State Natural Science Foundation of China (60572048) and the Natural Science Foundation of Hubei Province (2006ABA370).

References

[1] S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (July 1989) 674–693.
[2] I. Daubechies, The wavelet transform, time frequency, localization and signal analysis, IEEE Trans. Inform. Theory 36 (9) (September 1990) 961–1005.
[3] A. Averbuch, D. Lazar, M. Israeli, Image compression using wavelet transform and multiresolution decomposition, IEEE Trans. Image Process. 5 (1) (January 1996) 4–15.
[4] A.S. Lewis, G. Knowles, Image compression using the 2-D wavelet transform, IEEE Trans. Image Process. 1 (2) (February 1992) 244–250.
[5] A.S. Lewis, G. Knowles, Video compression using 3-D wavelet transforms, Electron. Lett. 26 (6) (March 1990) 396–398.
[6] M. Weeks, Architectures for the 3-D discrete wavelet transform, Ph.D. Dissertation, University of Southwestern Louisiana, 1998.
[7] A. Bruce, D. Donoho, H.-Y. Gao, Wavelet analysis, IEEE Spectrum 33 (10) (October 1996) 26–35.
[8] J.W. Woods, J.-R. Ohm, Special issue on subband/wavelet interframe video coding, Signal Process. Image Commun. 19 (2004) 557–579.
[9] K.K. Parhi, T. Nishitani, VLSI architectures for discrete wavelet transforms, IEEE Trans. VLSI Systems 1 (6) (June 1993) 191–202.
[10] C. Chakrabarti, M. Vishwanath, Architectures for wavelet transforms: a survey, J. VLSI Signal Process. 14 (1996) 171–192.
[11] M. Weeks, M. Bayoumi, Discrete wavelet transform: architectures, design and performance issues, J. VLSI Signal Process. 35 (2) (February 2003) 155–178.
[12] C. Chakrabarti, M. Vishwanath, Efficient realizations of the discrete and continuous wavelet transforms: from single chip implementations to mappings on SIMD array computers, IEEE Trans. Signal Process. 43 (3) (March 1995) 759–771.
[13] C. Chrysafis, A. Ortega, Line-based, reduced memory, wavelet image compression, IEEE Trans. Image Process. 9 (3) (March 2000) 378–389.
[14] P. Wu, L. Chen, An efficient architecture for two-dimensional discrete wavelet transform, IEEE Trans. Circuits Systems Video Technol. 11 (4) (April 2001) 536–545.
[15] T. Park, S. Jung, High speed lattice based VLSI architecture of 2D discrete wavelet transform for real-time video signal processing, IEEE Trans. Consumer Electron. 48 (4) (April 2002) 1026–1032.
[16] F. Marino, Two fast architectures for the direct 2-D discrete wavelet transform-signal processing, IEEE Trans. Signal Process. 49 (6) (June 2001) 1248–1259.
[17] F. Marino, Efficient high-speed low-power pipelined architecture for the direct 2-D discrete wavelet transform, IEEE Trans. Circuits Systems II: Analog Digital Signal Process. 47 (12) (February 2000) 1476–1491.
[18] T. Park, S. Jung, High speed lattice based VLSI architecture of 2D discrete wavelet transform for real-time video signal processing, IEEE Trans. Consumer Electron. 48 (4) (April 2002) 1026–1032.
[19] M. Weeks, M. Bayoumi, 3-D discrete wavelet transform architectures, in: Proceeding of the IEEE International Symposium on Circuits and Systems (ISCAS '98), Monterey, CA, May 1998, pp. 57–60.
[20] M. Weeks, M. Bayoumi, 3-D discrete wavelet transform architectures, IEEE Trans. Signal Process. 50 (8) (August 2002) 2050–2063.
[21] W. Jiang, A. Ortega, Lifting factorization-based discrete wavelet transform architecture design, IEEE Trans. Circuits Systems Video Technol. 11 (5) (May 2001) 651–657.
[22] K. Andra, C. Chakrabarti, T. Acharya, A VLSI architecture for lifting-based forward and inverse wavelet transform, IEEE Trans. Signal Process. 50 (4) (April 2002) 966–977.
[23] G. Dillen, B. Georis, J.D. Legat, O. Cantineau, Combined line-based architecture for the 5-3 and 9-7 wavelet transform of jpeg2000, IEEE Trans. Circuits Systems Video Technol. 13 (9) (September 2003) 944–950.
[24] H. Liao, M.M. Mandal, B.F. Cockburn, Efficient architecture for 1-D and 2-D lifting-based wavelet transforms, IEEE Trans. Signal Process. 52 (5) (May 2004) 1315–1326.
[25] Q. Dai, X. Chen, C. Lin, A novel VLSI architecture for multidimensional discrete wavelet transform, IEEE Trans. Circuits Systems Video Technol. 14 (8) (August 2004) 1105–1110.
[26] A. Grzeszczak, M.K. Mandal, S. Panchanathan, VLSI implementation of discrete wavelet transform, IEEE Trans. Very Large Scale Integration (VLSI) Systems 4 (4) (December 1996) 421–433.
[27] M. Vishwanath, R.M. Owens, M.J. Irwin, VLSI architectures for the discrete wavelet transform, IEEE Trans. Circuits Systems II: Analog Digital Signal Process. 42 (5) (May 1995) 305–316.
[28] B. Das, S. Banerjee, Data-folded architecture for running 3D DWT using 4-tap Daubechies filters, IEE Proc. Circuits Devices Systems 152 (1) (February 2005) 17–24.
[29] S.M. Aroutchelvame, K. Raahemifar, An efficient architecture for lifting-based forward and inverse discrete wavelet transform, in: Proceedings of the IEEE International Conference on Multimedia and Expo, July 2005, pp. 816–819.
[30] B. Wu, C. Lin, A high-performance and memory-efficient pipeline architecture for the Legall5/3 and 9/7 discrete wavelet transform of JPEG2000 codec, IEEE Trans. Circuits Systems Video Technol. 15 (12) (December 2005) 1615–1628.
[31] C. Xiong, J. Tian, J. Liu, Efficient parallel architecture for lifting-based two dimensional discrete wavelet transform, in: Proceedings of the IEEE International Workshop on VLSI Design and Video Technology, June 2005, pp. 75–78.
[32] L. Liu, X. Wang, A VLSI architecture of spatial combinative lifting algorithm based 2-D DWT/IDWT, in: Proceedings of the IEEE Asia-pacific Conference on Circuits and System, May 2002, pp. 299–304.
[33] W. Sweldens, The lifting scheme: a new philosophy in biorthogonal wavelet constructions, Proc. SPIE 2569 (1995) 68–79.
[34] I. Daubechies, W. Sweldens, Factoring wavelet transforms into lifting schemes, J. Fourier Anal. Appl. 4 (3) (1998) 247–269.
[35] A.R. Calderbank, I. Daubechies, W. Sweldens, B.L. Yeo, Wavelet transforms that map integers to integers, Appl. Comput. Harmonic Anal. (ACHA) 5 (3) (March 1999) 332–369.