ARTICLE IN PRESS
Signal Processing 87 (2007) 1089–1099
www.elsevier.com/locate/sigpro
Efficient array architectures for multi-dimensional lifting-based discrete wavelet transforms
Cheng-yi Xiong a,b,*, Jian-hua Hou a,b, Jin-wen Tian b, Jian Liu b
a College of Electronic Information Engineering, South-Center University for Nationalities, Wuhan 430074, China
b Institute of Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan 430074, China
Received 5 March 2006; received in revised form 16 September 2006; accepted 2 October 2006
Available online 26 October 2006
Abstract
Efficient array architectures for the multi-dimensional (m-D) discrete wavelet transform (DWT), e.g. $m = 2, 3$, are presented, in which the lifting scheme of the DWT is used to reduce hardware complexity. The parallelism among the $2^m$ subband transforms in the lifting-based m-D DWT is exploited, which increases the throughput rate of the separable m-D DWT with little additional hardware overhead. The proposed architecture is composed of $m2^{m-1}$ 1-D DWT modules working in parallel and in pipelined fashion; it is designed to process $2^m$ input samples per clock cycle and to generate $2^m$ subband coefficients synchronously. The total time for one level of decomposition of a 2-D image of size $N^2$ is approximately $N^2/4$ intra-clock cycles (ccs), and that for a 3-D image sequence of size $MN^2$ is approximately $MN^2/8$ ccs. Efficient line-based architecture frameworks for both 2D+t (spatial domain decomposition first, followed by temporal directional decomposition) and t+2D (temporal directional decomposition first, followed by spatial domain decomposition) 3-D DWT are, to the best of our knowledge, proposed here for the first time. Compared with similar works reported in the literature, the proposed architectures perform well in terms of throughput rate and system output latency, and offer a good tradeoff between throughput rate and hardware complexity. The proposed architectures are simple, regular, scalable and well suited for VLSI implementation.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Discrete wavelet transform; Multi-dimensional; Lifting scheme; Parallel; VLSI architecture
1. Introduction
The discrete wavelet transform (DWT) has been widely used as a powerful tool in many applications, such as signal processing, numerical analysis, computer graphics and image compression [1,2]. The two-dimensional (2-D) DWT has been adopted in compression applications for still images and image sequences [3,4]. The 3-D DWT has been employed in applications such as video compression [5,8] and magnetic resonance image (MRI) compression [6], as well as noise reduction between frames of a video sequence [7]. Since the DWT is a computation-intensive algorithm, dedicated VLSI solutions have been considered to meet the real-time requirements of practical applications.
0165-1684/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2006.10.001
*Corresponding author. College of Electronic Information Engineering, South-Center University for Nationalities, Wuhan 430074, China. Tel./fax: +86 2767842854. E-mail address: [email protected] (C.-y. Xiong).
So far, a large number of efficient architectures for the 2-D and 3-D wavelet transform (WT) have been presented [9–31]. DWT implementations can be classified into two categories: one is based on the convolution operation [1], and the other is based on the lifting scheme [33,34]. The lifting-based DWT has many advantages over the conventional convolution-based one [33–35]; in particular, the lifting scheme reduces the computational complexity of the DWT. When the multi-dimensional (m-D) wavelet basis functions are separable, there are two main approaches to computing the m-D DWT: the separable approach and the non-separable approach [10,11]. The separable approach performs the m-D DWT as a sequence of 1-D DWTs, dimension by dimension; it requires a large extra memory to hold the intermediate data that must be transposed before the next dimension is processed, and it has long output latency and system latency (SL). The non-separable approach does not require any transposition but requires more multipliers and accumulators (MACs) than the separable approach.
To trade off speed against area, several line-based [13] architectures for the 2-D DWT that exploit parallelism and pipelining have been proposed. Chrysafis et al. [13] first proposed line-based architectures for the 2-D DWT and an image coder to reduce memory. Wu et al. [14] proposed an efficient line-based architecture for the direct 2-D DWT, in which the polyphase decomposition technique and the coefficient folding technique are employed to increase the speed and the hardware utilization. Park et al. [15] proposed a high-speed lattice-based VLSI architecture for the 2-D DWT for real-time video signal processing. Marino [17] proposed a high-speed/low-power pipelined architecture for the direct 2-D DWT in which the four subband transforms are performed in parallel. However, those architectures were all developed for the convolution-based DWT and hence have higher hardware complexity. Recently, some researchers have applied the lifting scheme to DWT architectures to further improve the performance of 2-D DWT hardware implementations. Jiang et al. [21] first proposed a lifting-based system architecture based on overlap-state sequential and split-and-merge parallel processing with a boundary post-processing technique, which reduces the memory requirements and the communication between processors. Andra et al. [22] proposed a block-based architecture with four processors. However, those architectures all use a block-based input mode and thus require a large raw data buffer. Dillen et al. [24] proposed a combined line-based architecture for the Legall5/3 and Daubechies9/7 DWT, which was implemented for one-level decomposition. Liu et al. [32] proposed an efficient line-based 2-D architecture using a spatial combinative lifting algorithm for the Daubechies9/7 DWT, which reduces the number of multiply operations. Liao et al. [24] proposed a lifting-based 2-D multi-level architecture using the recursive pyramid algorithm and a one-level architecture using a dual-scan fashion. Wu et al. [30] proposed a high-performance and memory-efficient pipeline architecture for the Legall5/3 and Daubechies9/7 WT. However, all of these still have a limited data processing throughput rate.
For the 3-D DWT, Weeks et al. [20] first proposed two efficient architectures. Dai et al. [25] proposed a high-speed architecture with the polyphase decomposition technique. Das et al. [28] proposed a solution for implementing a running 3-D DWT using Daubechies four-tap (D4) wavelet filters. However, these are all based on the convolution DWT and thus have higher hardware complexity and longer output latency.
In this paper, novel generic VLSI architectures for the 2-D and 3-D DWT are proposed using the lifting scheme, on the basis of our previous work [31]. The proposed approach can be extended straightforwardly to the design of architectures for higher dimensional DWTs. The parallelism among the $2^m$ subband transforms in the lifting-based m-D DWT is exploited, which increases the throughput rate of the separable m-D DWT with little additional hardware overhead. The proposed architecture is composed of $m2^{m-1}$ 1-D DWT modules working in parallel and in pipelined fashion; it is designed to process $2^m$ input samples per working clock cycle and to generate $2^m$ subband coefficients synchronously. The total time for one level of decomposition of a 2-D image of size $N^2$ is approximately $N^2/4$ intra-clock cycles (ccs), and that for a 3-D image sequence of size $MN^2$ is approximately $MN^2/8$ ccs. Efficient line-based architecture frameworks for both 2D+t (spatial domain decomposition first, followed by temporal directional decomposition) and t+2D (temporal directional decomposition first, followed by spatial domain decomposition) 3-D DWT are, to the best of our knowledge, proposed here for the first time. Compared with similar works reported in the literature, the proposed architectures perform well in terms of system output latency and throughput rate, and offer a good tradeoff between throughput rate and hardware complexity. The new architectures are simple, regular, scalable and well suited for VLSI implementation.
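The counts quoted above follow from simple arithmetic, which the following minimal sketch reproduces. The function name `array_parameters` and its dictionary keys are illustrative choices, not taken from the paper; the sketch only evaluates the stated expressions $m2^{m-1}$ and $2^m$ and divides the number of input samples by the samples consumed per cycle.

```python
def array_parameters(m, samples):
    """Evaluate the module count and approximate cycle count stated in the text.

    m       -- number of dimensions (2 or 3 in the paper)
    samples -- total number of input samples (N*N for 2-D, M*N*N for 3-D)
    """
    samples_per_cycle = 2 ** m          # 2^m inputs consumed per working clock cycle
    return {
        "modules": m * 2 ** (m - 1),    # m * 2^(m-1) one-dimensional DWT modules
        "samples_per_cycle": samples_per_cycle,
        "cycles": samples // samples_per_cycle,
    }


if __name__ == "__main__":
    print(array_parameters(2, 512 * 512))      # 2-D image with N = 512
    print(array_parameters(3, 16 * 64 * 64))   # 3-D sequence with M = 16, N = 64
```

For $m = 2$ this gives four modules and $N^2/4$ cycles, and for $m = 3$ twelve modules and $MN^2/8$ cycles, in line with the figures quoted in the text.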
The rest of the paper is structured as follows. Section 2 gives a brief review of the lifting scheme for the DWT. Section 3 describes the proposed architectures for the multi-dimensional DWT. The performance analysis of the proposed architectures and comparisons with other designs are presented in Section 4. Finally, a brief conclusion is drawn in Section 5.
2. Lifting scheme of the DWT
The lifting scheme is a method for constructing wavelet bases that was first introduced by Sweldens in the 1990s [33]. It was developed from the earlier work of Donoho on building wavelets from interpolating scaling functions and the work of Lounsbery et al., which constructed wavelets for polyhedral surfaces. The main difference from classical constructions such as [33,34] is that it is carried out entirely in the spatial domain. It is therefore suitable for constructing wavelets in settings that lack translation and dilation, where the Fourier transform is no longer available. Wavelets built in this way are called second-generation wavelets. The scheme can also be used to build first-generation wavelets, and it leads to a faster, fully in-place implementation of the DWT.
The basic idea behind the lifting scheme is a relationship among all biorthogonal wavelets that share the same scaling function, such that one can construct the desired wavelet from a simple one. Daubechies and Sweldens proved [34] that any wavelet with FIR filters can be factorized into a finite number of alternating lifting and dual lifting steps starting from the Lazy wavelet. This implies that any such wavelet can be derived from an arbitrary one, including the Lazy wavelet, by a finite number of lifting and dual lifting steps.
The main characteristic of the lifting-based DWT scheme is that it decomposes the high-pass and low-pass filters into a sequence of upper and lower triangular matrices and converts the filter implementation into banded matrix multiplications [34]. Such a scheme has several advantages, including "in-place" computation of the DWT, the integer-to-integer wavelet transform (IWT) [35], and symmetric forward and inverse transforms. It therefore comes as no surprise that lifting has been chosen in the new still image compression standard JPEG2000.
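The integer-to-integer property mentioned above can be illustrated with a minimal sketch. It uses the well-known reversible 5/3 lifting steps with floor rounding; the function names and the periodic boundary extension are assumptions made here for brevity (JPEG2000 itself uses symmetric extension), and the sketch is not the architecture proposed in this paper.

```python
import numpy as np


def legall53_int_forward(x):
    """One level of the reversible 5/3 lifting transform (periodic extension assumed)."""
    x = np.asarray(x, dtype=np.int64)
    even, odd = x[0::2], x[1::2]                      # Lazy wavelet: polyphase split
    # Predict step: high(n) = odd(n) - floor((even(n) + even(n+1)) / 2)
    high = odd - (even + np.roll(even, -1)) // 2
    # Update step: low(n) = even(n) + floor((high(n-1) + high(n) + 2) / 4)
    low = even + (np.roll(high, 1) + high + 2) // 4
    return low, high


def legall53_int_inverse(low, high):
    """Undo the lifting steps in reverse order; rounding cancels exactly."""
    low = np.asarray(low, dtype=np.int64)
    high = np.asarray(high, dtype=np.int64)
    even = low - (np.roll(high, 1) + high + 2) // 4
    odd = high + (even + np.roll(even, -1)) // 2
    x = np.empty(even.size + odd.size, dtype=np.int64)
    x[0::2], x[1::2] = even, odd                      # merge the polyphase components
    return x


if __name__ == "__main__":
    signal = np.array([3, 7, 1, 8, 2, 9, 4, 6])
    low, high = legall53_int_forward(signal)
    print(low, high, legall53_int_inverse(low, high))
```

Because each lifting step only adds a function of the other channel, the inverse subtracts the same quantity, so reconstruction is exact even though the intermediate values are rounded.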
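Let $h(z)$ and $g(z)$ denote the low-pass and high-pass analysis filters, and $\tilde{h}(z)$ and $\tilde{g}(z)$ the low-pass and high-pass synthesis filters, respectively. The corresponding decomposition and reconstruction polyphase matrices, denoted as $P(z)$ and $\tilde{P}(z)$, respectively, are defined as follows:

$$P(z) = \begin{bmatrix} h_e(z) & h_o(z) \\ g_e(z) & g_o(z) \end{bmatrix} \tag{1a}$$

and

$$\tilde{P}(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix}, \tag{1b}$$

where $h_e(z)$ and $g_e(z)$ ($h_o(z)$ and $g_o(z)$) represent the even parts (odd parts) of the low-pass and high-pass wavelet filters, respectively. It has been shown in [33,34] that if $h(z)$ and $g(z)$ are a complementary filter pair, then $P(z)$ can always be factorized by the lifting scheme as follows [34]:

$$P(z) = \begin{pmatrix} K & 0 \\ 0 & 1/K \end{pmatrix} \prod_{i=1}^{l_t} \left\{ \begin{pmatrix} 1 & s_i(z) \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ t_i(z) & 1 \end{pmatrix} \right\} \tag{2a}$$

or

$$P(z) = \begin{pmatrix} K & 0 \\ 0 & 1/K \end{pmatrix} \prod_{i=1}^{l_t} \left\{ \begin{pmatrix} 1 & 0 \\ t_i(z) & 1 \end{pmatrix} \begin{pmatrix} 1 & s_i(z) \\ 0 & 1 \end{pmatrix} \right\}. \tag{2b}$$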
Here $K$ is a constant, $t_i(z)$ and $s_i(z)$ denote the primary lifting and dual lifting polynomials (or vice versa), respectively, and $l_t$ is the total number of lifting steps required. For example, the Daubechies four-tap (D4) wavelet filters can be factored as [34]

$$P(z) = \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b + cz^{-1} & 1 \end{bmatrix} \begin{bmatrix} 1 & z \\ 0 & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}, \tag{3}$$

where $a = -\sqrt{3}$, $b = \sqrt{3}/4$, $c = (\sqrt{3} - 2)/4$ and $K = (\sqrt{3} + 1)/\sqrt{2}$.
Accordingly, the 1-D DWT can be expressed as

$$[L(z), H(z)]^t = P(z^{-1})^t [X_e(z), X_o(z)]^t, \tag{4}$$

where $L(z)$ and $H(z)$ represent the Z-transforms of the low-frequency subband sequence $l(n)$ and the high-frequency subband sequence $h(n)$, respectively; $x(n)$ denotes the input sequence; $X_e(z)$ and $X_o(z)$ denote the Z-transforms of the even-numbered sequence $x_e(n) = x(2n)$ and the odd-numbered sequence $x_o(n) = x(2n+1)$, respectively; and the superscript $t$ signifies the transpose operation.
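As an illustration of how the D4 factorization in (3) turns into a short sequence of lifting steps, the following sketch applies the constants $a$, $b$, $c$ and $K$ in the time domain. The function names, the choice of which neighbouring sample each delay refers to, and the periodic boundary extension are assumptions of this sketch (one common time-domain reading of the factorization), not a description of the hardware in Section 3.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
A = -SQRT3                          # a in Eq. (3)
B = SQRT3 / 4.0                     # b in Eq. (3)
C = (SQRT3 - 2.0) / 4.0             # c in Eq. (3)
K = (SQRT3 + 1.0) / np.sqrt(2.0)    # scale normalization constant


def d4_lifting_forward(x):
    """One level of a D4 lifting transform with periodic extension (assumed)."""
    x = np.asarray(x, dtype=float)
    xe, xo = x[0::2], x[1::2]                  # polyphase split: x(2n), x(2n+1)
    d = xo + A * xe                            # first lifting step
    s = xe + B * d + C * np.roll(d, -1)        # second step, uses d(n) and d(n+1)
    d = d + np.roll(s, 1)                      # third step, uses s(n-1)
    return K * s, d / K                        # low-pass scaled by K, high-pass by 1/K


def d4_lifting_inverse(low, high):
    """Run the same steps backwards with opposite signs."""
    s = np.asarray(low, dtype=float) / K
    d = np.asarray(high, dtype=float) * K
    d = d - np.roll(s, 1)
    xe = s - B * d - C * np.roll(d, -1)
    xo = d - A * xe
    x = np.empty(xe.size + xo.size)
    x[0::2], x[1::2] = xe, xo
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(16)
    low, high = d4_lifting_forward(signal)
    print(np.allclose(d4_lifting_inverse(low, high), signal))
```

Each step needs one or two multiplications per output pair, which is the saving over direct four-tap convolution that motivates the lifting-based hardware.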
Fig. 2. Scanned pattern of the input image.
3. Proposed architecture for m-D DWT
In this section, several new architectures for the m-D DWT are presented, in which the lifting scheme of the DWT is used to reduce hardware complexity, and the parallelism among the $2^m$ subband transforms in the lifting-based m-D DWT is exploited to increase the throughput rate. For simplicity of description, only the 2-D and 3-D architectures are introduced in detail in this section. The proposed architecture for the 2-D DWT is described first, followed by that for the 3-D DWT. The approach can be extended straightforwardly to the design of architectures for separable higher dimensional DWTs.
3.1. Proposed 2-D architecture
3.1.1. Algorithm
The flow diagram of the separable 2-D DWT implementation is shown in Fig. 1. Assuming the input image is scanned row by row, from left to right, as shown in Fig. 2, the original input image (denoted as $x$) is first decomposed along the horizontal direction, the resulting outputs are then decomposed along the vertical direction, and four subbands are finally obtained, usually denoted as ll (low–low frequency), lh (low–high frequency), hl (high–low frequency) and hh (high–high frequency). The ll subband can be further decomposed in the same way. Let $x_{ee}(m,n)$ [$x_{oe}(m,n)$] represent the samples of the even-numbered rows and even-numbered columns (odd-numbered rows and even-numbered columns), and $x_{eo}(m,n)$ [$x_{oo}(m,n)$] represent the samples of the even-numbered rows and odd-numbered columns (odd-numbered rows and odd-numbered columns), respectively. Let $l_e(m,n)$ and $h_e(m,n)$ [$l_o(m,n)$ and $h_o(m,n)$] denote the low-frequency and high-frequency coefficients of the even-numbered rows (odd-numbered rows), respectively. Here $m = 0, 1, 2, \ldots, M/2 - 1$ and $n = 0, 1, 2, \ldots, N/2 - 1$, where $M$ and $N$ are even integers denoting the height and width of the image, respectively.
Fig. 1. Flow diagram of the separable 2-D DWT implementation (blocks: a horizontal 1-D DWT of x into l and h, followed by vertical 1-D DWTs into ll, lh, hl and hh).
We define the Z-transform of the $m$th row vector of the matrix $y(m,n)$, i.e. $(y(m,0)\; y(m,1)\; \ldots\; y(m,N/2-1))$, as

$$Y_m(z_1) = \sum_{n=0}^{N/2-1} y(m,n) z_1^{-n} \tag{5a}$$

and the Z-transform of the $n$th column vector of the matrix $y(m,n)$, i.e. $(y(0,n)\; y(1,n)\; \ldots\; y(M/2-1,n))$, as

$$Y_n(z_2) = \sum_{m=0}^{M/2-1} y(m,n) z_2^{-m}, \tag{5b}$$

where $z_1 = z^{-1}$ is equivalent to a unit time delay, while $z_2 = z^{-N/2}$ is equivalent to a delay of $N/2$ time units.
Let $X_{ee}(z)$ [$X_{oe}(z)$] and $X_{eo}(z)$ [$X_{oo}(z)$] represent the Z-transforms of the sequences $x_{ee}(m,n)$ [$x_{oe}(m,n)$] and $x_{eo}(m,n)$ [$x_{oo}(m,n)$], respectively; let $L_e(z)$ and $H_e(z)$ [$L_o(z)$ and $H_o(z)$] represent the Z-transforms of $l_e(m,n)$ and $h_e(m,n)$ [$l_o(m,n)$ and $h_o(m,n)$], respectively; and let $LL(z)$, $LH(z)$, $HL(z)$ and $HH(z)$ represent the Z-transforms of the matrices ll, lh, hl and hh, respectively. The separable 2-D DWT can then be expressed as follows (the subscripts $m$ and/or $n$ are omitted for simplicity):

$$[L_e(z_1), H_e(z_1)]^t = P(z_1^{-1})^t [X_{ee}(z_1), X_{eo}(z_1)]^t, \tag{6a}$$

$$[L_o(z_1), H_o(z_1)]^t = P(z_1^{-1})^t [X_{oe}(z_1), X_{oo}(z_1)]^t, \tag{6b}$$

$$[LL(z_2), LH(z_2)]^t = P(z_2^{-1})^t [L_e(z_2), L_o(z_2)]^t, \tag{7a}$$

$$[HL(z_2), HH(z_2)]^t = P(z_2^{-1})^t [H_e(z_2), H_o(z_2)]^t, \tag{7b}$$

where (6a) expresses the 1-D row transform of the even-row inputs and (6b) that of the odd-row inputs, while (7a) represents the 1-D column transform of the l subband coefficients and (7b) that of the h subband coefficients.
Fig. 4. Circuit block diagram of SNU in Fig. 3 (the SNU applies the factors $K^2$ and $1/K^2$; outputs ll, lh, hl and hh).
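The structure of (6) and (7), two row transforms feeding two column transforms, can be sketched in a few lines. To keep the example short, an unnormalized Haar lifting step stands in for the general lifting filter $P(z)$; that substitution and the function names are assumptions of this sketch. Because Haar needs no neighbouring samples, the sketch shows only the data routing, not the delay elements.

```python
import numpy as np


def haar_lift(even, odd):
    """Two-input/two-output lifting step (unnormalized Haar, used as a stand-in filter)."""
    high = odd - even            # predict
    low = even + high / 2.0      # update
    return low, high


def dwt2d_four_modules(x):
    """One decomposition level organised as in Eqs. (6a)-(7b)."""
    x = np.asarray(x, dtype=float)
    xee, xeo = x[0::2, 0::2], x[0::2, 1::2]   # even rows: even / odd columns
    xoe, xoo = x[1::2, 0::2], x[1::2, 1::2]   # odd rows: even / odd columns
    le, he = haar_lift(xee, xeo)              # row transform of even rows, Eq. (6a)
    lo, ho = haar_lift(xoe, xoo)              # row transform of odd rows, Eq. (6b)
    ll, lh = haar_lift(le, lo)                # column transform of l, Eq. (7a)
    hl, hh = haar_lift(he, ho)                # column transform of h, Eq. (7b)
    return ll, lh, hl, hh


if __name__ == "__main__":
    image = np.arange(16).reshape(4, 4)
    for name, band in zip(("ll", "lh", "hl", "hh"), dwt2d_four_modules(image)):
        print(name, band.tolist())
```

The two row transforms are independent of each other, as are the two column transforms, which is the parallelism the architecture in Section 3.1.2 exploits.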
3.1.2. Architecture
To increase the throughput rate of the 2-D DWT, an efficient solution is to start the decomposition along the vertical direction as soon as sufficient data have been generated by the decomposition along the horizontal direction. A novel line-based architecture for the separable 2-D DWT built on this idea is shown in Fig. 3. It significantly increases the throughput rate with little additional hardware overhead by making the best use of the parallelism of the four subband transforms implied by (6) and (7). The input image is still assumed to be scanned line by line, as shown in Fig. 2. The proposed 2-D architecture is composed of an input buffer unit (IBU) and a WT module. The WT module is a four-input/four-output architecture that includes two row-wise 1-D DWT modules (RW1 and RW2), two column-wise 1-D DWT modules (CW1 and CW2), and a scale normalization unit (SNU). The four 1-D DWT modules perform the filtering along the horizontal and vertical directions, respectively, working in parallel and in pipelined fashion. The SNU, designed as shown in Fig. 4, integrates the scale normalization operations required by the row transform and the column transform. This reduces the number of multipliers required in the 2-D DWT architecture, because the scale normalization factor for low-pass filtering is the inverse of that for high-pass filtering, as indicated by (2).
Fig. 3. Proposed line-based architecture for the separable 2-D DWT (blocks: IBU with FIFO1–FIFO4; WT module with RW1, RW2, CW1, CW2 and SNU).
Four input samples must be fed to the WT module simultaneously in each internal working clock cycle, while four subband coefficients (one for each subband) are generated synchronously. Two input samples come from an even-numbered row and the other two from an odd-numbered row, so two lines of samples must be available to the WT module at the same time. Since the data samples are assumed to arrive line by line as explained above, an IBU is required to buffer the needed lines of data. The IBU can be implemented by four FIFOs (first-in-first-out buffers, named FIFO1, FIFO2, FIFO3 and FIFO4, with sizes of about $N/2$, $N/2$, $N/4$ and $N/4$, respectively, where $N$ represents the width of the image), which store the samples from the even-row–even-column, even-row–odd-column, odd-row–even-column and odd-row–odd-column positions separately. To provide four samples per internal working clock cycle, the input data must be acquired at a clock rate four times faster than the internal working clock, i.e. $f_s = 4f_w$, where $f_s$ denotes the input data sampling frequency and $f_w$ the internal working frequency. The buffering of the odd-numbered (or even-numbered) rows of samples can be carried out in the period when the second halves of the rows of samples are being processed.
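The role of the IBU can be illustrated with a small behavioural model that regroups a line-by-line sample stream into the four-sample tuples consumed by the WT module. The generator name `ibu_stream` is hypothetical, and for simplicity the model buffers two complete rows rather than reproducing the FIFO sizes given above, so it shows the data ordering only, not the memory cost.

```python
def ibu_stream(raster, width):
    """Yield (xee, xeo, xoe, xoo) tuples, one per working clock cycle.

    raster -- iterable of samples in row-by-row, left-to-right order
    width  -- image width N (assumed even)
    """
    pair = []                                # one even row followed by one odd row
    for sample in raster:                    # samples arrive at the sampling rate f_s
        pair.append(sample)
        if len(pair) == 2 * width:           # both rows of the pair are now available
            even_row, odd_row = pair[:width], pair[width:]
            for n in range(0, width, 2):     # four samples leave per working cycle
                yield even_row[n], even_row[n + 1], odd_row[n], odd_row[n + 1]
            pair = []


if __name__ == "__main__":
    # A 4x4 image whose sample values equal their raster index.
    for cycle, group in enumerate(ibu_stream(range(16), 4)):
        print(cycle, group)
```

A 4×4 image yields four tuples, i.e. $N^2/4$ working cycles for 16 input samples, which is the ratio behind $f_s = 4f_w$.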
As before, $x_{ee}(m,n)$, $x_{eo}(m,n)$, $x_{oe}(m,n)$ and $x_{oo}(m,n)$ denote the samples at the even-row–even-column, even-row–odd-column, odd-row–even-column and odd-row–odd-column positions, and $l_e(m,n)$ and $h_e(m,n)$ [$l_o(m,n)$ and $h_o(m,n)$] denote the low-frequency and high-frequency coefficients of the even-numbered rows (odd-numbered rows), respectively. In each internal clock cycle, the four inputs are fed in parallel to the row-wise modules: $x_{ee}(m,n)$ and $x_{eo}(m,n)$ to RW1, and $x_{oe}(m,n)$ and $x_{oo}(m,n)$ to RW2. RW1 generates one low-frequency coefficient $l_e(m,n)$ and one high-frequency coefficient $h_e(m,n)$ for the even-numbered row of samples in each clock cycle, while RW2 produces one low-frequency coefficient $l_o(m,n)$ and one high-frequency coefficient $h_o(m,n)$ for the odd-numbered row of samples in each clock cycle. The outputs of RW1 and RW2 are then pipelined to CW1 and CW2: $l_e(m,n)$ and $l_o(m,n)$ are fed to CW1 and decomposed into the low–low frequency (ll) and low–high frequency (lh) components, which are then scale normalized. Meanwhile, $h_e(m,n)$ and $h_o(m,n)$ are fed in parallel to CW2 and decomposed into the high–low frequency (hl) and high–high frequency (hh) components, which are scale normalized as well. Note that the arguments $(m,n)$ of all the above variables are omitted in Fig. 3 for simplicity.
Fig. 6. Flow diagram of the separable 3-D DWT implementation: (a) t+2D type; (b) 2D+t type.
The architecture of the RW module can be designed by directly mapping the lifting factorization of the chosen wavelet filter; it is a two-input/two-output architecture implemented with parallel and pipeline techniques. For example, the flow diagram of the 1-D architecture designed for the lifting-based D4 DWT is shown in Fig. 5 (the scale normalizations are omitted). The architecture of the CW module is obtained from that of RW by replacing each delay register with a corresponding delay line. The length of each delay line should be equal to $N/2$ unit time delays because of the decimation in the DWT.
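Why a delay line of length $N/2$ turns a row-wise module into a column-wise one can be seen from a small model. The class `DelayLine` below is a hypothetical software stand-in for a hardware delay element, and the sketch assumes that the row-transform coefficients arrive in raster order with $N/2$ coefficients per row.

```python
from collections import deque


class DelayLine:
    """Delay element of a given length; length 1 behaves like a single register."""

    def __init__(self, length):
        # Pre-filled with zeros so the first `length` outputs are zero.
        self.buf = deque([0.0] * length, maxlen=length)

    def push(self, value):
        delayed = self.buf[0]      # value that entered `length` pushes ago
        self.buf.append(value)     # the oldest entry is discarded automatically
        return delayed


if __name__ == "__main__":
    half_width = 4                               # N/2 coefficients per row
    rows = 3
    # Coefficient at (m, n) is encoded as m * half_width + n for easy checking.
    stream = [m * half_width + n for m in range(rows) for n in range(half_width)]
    line = DelayLine(half_width)
    for value in stream:
        print(value, line.push(value))           # current sample, sample one row above
```

With length 1 the element returns the horizontally preceding sample; with length $N/2$ it returns the sample at the same column of the preceding row, which is what the column filter needs.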
3.2. Proposed 3-D architecture
The 3-D DWT has been used in video compression, MRI compression, and so on. The separable 3-D DWT can be implemented in two ways: t+2D and 2D+t. In the t+2D type, the temporal (inter-frame) transform is performed first, followed by the spatial transform, while in the 2D+t type the spatial transform is performed first and the temporal transform next. The flow diagrams of the two types are shown in Fig. 6(a) and (b), respectively. In Fig. 6(a), the input image sequence is first decomposed along the temporal direction, then along the horizontal direction, and last along the vertical direction. In Fig. 6(b), the input image sequence is first decomposed along the horizontal direction, then along the vertical direction, and last along the temporal direction. The resulting eight subbands are usually denoted as lll, llh, lhl, lhh, hll, hlh, hhl and hhh, where lll denotes the low–low–low frequency subband, llh the low–low–high frequency subband, and so on. The lll subband can then be further decomposed in the same way. The structure illustrated in Fig. 6(a) can work in two operation modes, causal and non-causal, whereas that illustrated in Fig. 6(b) can work only in causal mode. In the causal mode, the image sequence is scanned in real time in the following order: first row-wise, then column-wise, and then frame-wise. In the non-causal mode, a block of the image sequence must first be stored in an external memory and is then read in the following order: first frame-wise, then row-wise, and then column-wise. As shown later in the paper, the causal mode requires less external memory and has shorter output latency than the non-causal mode, but it requires a larger intermediate data buffer.
Fig. 5. Flow diagram of 1-D architecture for lifting-based D4 wavelet filters (inputs $x_e$, $x_o$; outputs l, h; multipliers a, b, c; delay $z^{-1}$).
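The two processing orders can be compared with a short sketch. It again uses an unnormalized Haar lifting step as a stand-in filter, stores the sequence as an array indexed `[frame, row, column]`, and labels each subband with one letter per stage in processing order; the function names and these conventions are assumptions of the sketch rather than definitions from the paper.

```python
import numpy as np


def haar_lift_axis(v, axis):
    """Apply an unnormalized Haar lifting step along one axis of an array."""
    v = np.moveaxis(np.asarray(v, dtype=float), axis, 0)
    even, odd = v[0::2], v[1::2]
    high = odd - even
    low = even + high / 2.0
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)


def dwt3d(x, order):
    """One level of a separable 3-D DWT; `order` lists the axes in processing order."""
    bands = {"": np.asarray(x, dtype=float)}
    for axis in order:                        # each stage doubles the number of bands
        next_bands = {}
        for name, data in bands.items():
            low, high = haar_lift_axis(data, axis)
            next_bands[name + "l"] = low
            next_bands[name + "h"] = high
        bands = next_bands
    return bands


if __name__ == "__main__":
    video = np.random.default_rng(0).standard_normal((4, 4, 4))
    t_plus_2d = dwt3d(video, order=(0, 2, 1))   # temporal, horizontal, vertical
    two_d_plus_t = dwt3d(video, order=(2, 1, 0))  # horizontal, vertical, temporal
    print(sorted(t_plus_2d))
    print(np.allclose(t_plus_2d["lhl"], two_d_plus_t["hll"]))
```

For a linear separable filter the 1-D transforms along different axes commute, so the two orders give the same coefficients once the subband letters are rearranged; the orders differ in scan order, buffering and latency, which is what distinguishes the architectures discussed in this section.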
The algorithm for implementing the separable 2-D DWT in the Z-domain, as described by (6) and (7), can be directly extended to the separable 3-D DWT. Efficient line-based architectures for the separable 3-D DWT, obtained by directly extending that for the separable 2-D DWT, are thus proposed as shown in Fig. 7. The architecture in Fig. 7(a) implements the t+2D type 3-D DWT, while that in Fig. 7(b) implements the 2D+t type. Both architectures, working under causal mode, are basically similar and are composed of an IBU and a WT module. The architecture for the non-causal mode can be designed similarly to that shown in Fig. 7(a), except that the IBU can be removed. The WT module is an eight-input/eight-output architecture with three stages of transforms working in parallel and in pipelined fashion, which includes four frame-wise 1-D DWT modules (FW1–FW4), four row-wise 1-D DWT modules (RW1–RW4), four column-wise 1-D DWT modules (CW1–CW4), and an SNU. A set of eight input samples is required in each intra-clock cycle to feed the first stage of decomposition modules. These samples come separately from the even-frame even-row even-column, even-frame even-row odd-column, even-frame odd-row even-column, even-frame odd-row odd-column, odd-frame even-row even-column, odd-frame even-row odd-column, odd-frame odd-row even-column, and odd-frame odd-row odd-column positions, and are denoted in turn as $x_{eee}(m,n)$, $x_{eeo}(m,n)$, $x_{eoe}(m,n)$, $x_{eoo}(m,n)$, $x_{oee}(m,n)$, $x_{oeo}(m,n)$, $x_{ooe}(m,n)$ and $x_{ooo}(m,n)$. The twelve 1-D DWT modules perform the filtering along the temporal, horizontal and vertical directions, respectively, and work in parallel and in pipelined fashion.
Fig. 7. Proposed line-based architecture for the separable 3-D DWT: (a) t+2D type; (b) 2D+t type.
Fig. 8. Flow diagram of SNU in Fig. 7 (the SNU applies the factors $K^3$, $K$, $1/K$ and $1/K^3$ to the eight subbands).
Similarly, the SNU of Fig. 7, designed as shown in Fig. 8, integrates the scale normalization operations required by the temporal, row and column transforms, which reduces the number of multipliers required in the 3-D DWT architecture. The IBU can be implemented by eight FIFOs (named FIFO1–FIFO8, with sizes of about $N^2/4$, $N^2/4$, $N^2/4$, $N^2/4$, $N^2/8$, $N^2/8$, $N^2/8$ and $N^2/8$, respectively, where $N^2$ represents the size of a frame), which buffer the samples from the even-row–even-column, even-row–odd-column, odd-row–even-column and odd-row–odd-column positions of two successive frames, one even-indexed and one odd-indexed. To provide eight samples per internal working clock cycle, the input data must be acquired at a clock rate eight times faster than the internal working clock, i.e. $f_s = 8f_w$, where $f_s$ denotes the input data sampling frequency and $f_w$ the internal working frequency of the WT module. The buffering of the odd-numbered (or even-numbered) frames of samples can be carried out in the period when the second halves of the frames of samples are being processed. In each internal clock cycle, eight input samples, one from each of the eight FIFOs, are fed in parallel to the four 1-D DWT modules of the first stage, while eight output coefficients are generated synchronously by the SNU. The detailed data flow is shown in Fig. 7.
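The effect of merging the per-stage normalizations can be tabulated with a short sketch. It assumes, following the diagonal factor in (2), that every low-pass stage contributes a factor $K$ and every high-pass stage a factor $1/K$, so that each subband needs a single combined factor $K^e$; the function name `snu_exponents` is illustrative.

```python
from itertools import product


def snu_exponents(m):
    """Map each m-D subband label to the exponent e of its combined factor K**e."""
    return {
        "".join(band): band.count("l") - band.count("h")   # +1 per 'l', -1 per 'h'
        for band in product("lh", repeat=m)
    }


if __name__ == "__main__":
    for m in (2, 3):
        exps = snu_exponents(m)
        non_unity = sum(1 for e in exps.values() if e != 0)
        print(m, exps, "multipliers in SNU:", non_unity)
```

Under this assumption the 2-D case needs only $K^2$ and $1/K^2$, since the lh and hl factors cancel to one, and the 3-D case needs $K^3$, $K$, $1/K$ and $1/K^3$, which matches the factors that appear in Figs. 4 and 8.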
In causal mode, the architectures of RW and CW are the same as those for the 2-D DWT presented in Section 3.1. The architecture of FW is similar to that of CW, but the size of each delay line is increased from $N/2$ to $N^2/4$. In non-causal mode, the delay lines required in FW under causal mode are replaced by delay registers, the delay registers used in RW are replaced by delay lines of size $M/2$, and the size of each delay line in CW is increased from $N/2$ to $NM/4$, where $M$ represents the number of frames in the buffered image sequence.
4. Performance evaluations
This section evaluates several m-D DWT architectures in terms of hardware complexity and computation complexity. Our architectures are compared in detail with several other efficient designs (i.e. those presented in [14,24,25,28,30]). The work in [14] presented an efficient 2-D architecture for the convolution-based DWT. The work in [24] presented an efficient 2-D architecture for the lifting-based DWT, and that in [30] presented an efficient 2-D lifting-based architecture for the Legall5/3 and Daubechies9/7 wavelet filters. The work in [25] presented an efficient m-D architecture for the convolution-based DWT, and that in [28] presented an efficient 3-D architecture for convolution-based Daubechies 4-tap filters. The numbers of multipliers and adders, the size of the buffer memory, and the control complexity (CC) are used to measure the hardware complexity of the m-D DWT. The computation complexity is measured by the SL of performing a one-level m-D DWT, normalized to intra-clock cycles (ccs).
According to the architectures shown in Figs. 3 and 7, four and twelve 1-D DWT modules are used to implement the 2-D and 3-D DWT, respectively. The total numbers of multipliers and adders used in the WT module are multiples of those required in the original 1-D DWT architecture, and the size of the buffer occupied in the WT module is proportional to the number of delay registers used in the original 1-D DWT architecture. They all depend on the chosen wavelet filters and their lifting factorization. Let $K_M$, $K_A$ and $K_R$ represent the numbers of multipliers, adders and delay registers used in the original 1-D DWT architecture, respectively. Then $K_M = 6$ multipliers, $K_A = 8$ adders and $K_R = 4$ delay registers are required when the lifting-based Daubechies9/7 [34] WT is chosen, and $K_M = 5$ multipliers, $K_A = 4$ adders and $K_R = 1$ delay register are required when the lifting-based D4 [34] WT is chosen. Because the normalization operations are integrated, the number of multipliers required in the 2-D WT module is $4K_M - 6$, and that required in the 3-D WT module is $12K_M - 16$. The required buffer sizes are $1.5N + 2K_R(N/2) = (1.5 + K_R)N$ for the 2-D architecture, $1.5N^2 + 4K_R(N^2/4) + 4K_R(N/2) = (1.5 + K_R)N^2 + 2K_RN$ for the 3-D architecture in causal mode, and $4K_RM(N/4) + 4K_R(M/2) = K_R(N + 2)M$ for the 3-D architecture in non-causal mode, where $N$ denotes the width and height of the input image and $M$ represents the number of frames of the buffered image sequence in non-causal mode.
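The expressions above are easy to evaluate for a concrete filter and image size. The following sketch simply codes the stated formulas; the function names, the dictionary layout and the example sizes are illustrative assumptions.

```python
# Per-module resource counts quoted in the text for the two example filters.
FILTERS = {
    "DB9/7": {"KM": 6, "KA": 8, "KR": 4},
    "D4": {"KM": 5, "KA": 4, "KR": 1},
}


def cost_2d(KM, KA, KR, N):
    """Resources of the proposed 2-D architecture for an N x N image."""
    return {
        "multipliers": 4 * KM - 6,
        "adders": 4 * KA,
        "buffer": (1.5 + KR) * N,          # 1.5N for the IBU plus 2*KR delay lines of N/2
        "SL": N * N / 4,
    }


def cost_3d(KM, KA, KR, N, M, causal=True):
    """Resources of the proposed 3-D architecture for M frames of N x N samples."""
    if causal:
        buffer = (1.5 + KR) * N * N + 2 * KR * N
    else:
        buffer = KR * (N + 2) * M
    return {
        "multipliers": 12 * KM - 16,
        "adders": 12 * KA,
        "buffer": buffer,
        "SL": M * N * N / 8,
    }


if __name__ == "__main__":
    for name, k in FILTERS.items():
        print(name, cost_2d(N=512, **k))
        print(name, cost_3d(N=64, M=16, causal=True, **k))
        print(name, cost_3d(N=64, M=16, causal=False, **k))
```

For DB9/7 this gives 18 multipliers and a $5.5N$ buffer in 2-D and 56 multipliers in 3-D; for D4 it gives 14 and 44 multipliers, which are the values listed for the proposed design in Tables 2 and 4.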
The detailed performance comparisons of several 2-D architectures are listed in Tables 1 and 2, while those for 3-D architectures are listed in Tables 3 and 4. Tables 1 and 3 give the comparison results in the general case, and Tables 2 and 4 list the results when the Daubechies 9/7 (DB9/7) and D4 wavelet filters [34] are chosen. For fairness of comparison, an IBU of size $1.5N$ is added to the buffer storage required by Dai's and Das's 2-D architectures, an IBU of size $1.5N^2$ is added to the buffer storage required by Dai's and Das's 3-D architectures, and the optimized multiplier implementation adopted in Das's method is ignored. $K_g$ and $K_h$ denote the lengths of the high-pass and low-pass filters, respectively, of the chosen wavelet. Note that, for simplicity of expression, the effect of the delay registers used in the architecture on memory size and output latency is omitted, because it is far smaller than that of the delay line. In addition, our architectures are very regular, so their control logic and control circuitry will be very simple. The comparison results demonstrate that the proposed architectures perform better in terms of the product of SL and hardware cost.

Table 1
Performance comparison in general case for several 2-D architectures

| Architecture | Multipliers | Adders | Buffer memory | SL(ccs) | CC |
|---|---|---|---|---|---|
| Wu [14] | $2(K_g + K_h)$ | $2(K_g + K_h)$ | $KN$ | $N^2/2$ | Complex |
| Liao [24] | $2K_M$ | $2K_A$ | $(1 + K_R)N$ | $N^2/2$ | Simple |
| Dai [25] | $4(K_g + K_h)$ | $4(K_g + K_h - 2)$ | $(K_g + K_h - 2.5)N$ | $N^2/4$ | Medium |
| Proposed | $4K_M - 6$ | $4K_A$ | $(1.5 + K_R)N$ | $N^2/4$ | Simple |

Note: $K = \lceil K_g/2 \rceil + \lceil K_h/2 \rceil$. $N$ denotes the width of the image; $N^2$ denotes the size of the image.

Table 2
Performance comparisons for several 2-D architectures in cases of DB9/7 and D4 wavelet filters being chosen

| Architecture | Filter | Multipliers | Adders | Buffer memory | OL(ccs) | SL(ccs) | CC |
|---|---|---|---|---|---|---|---|
| Wu [14] | DB9/7 | 34 | 32 | $9N$ | $4N$ | $N^2/2$ | Complex |
| Wu [14] | D4 | 16 | 16 | $4N$ | $1.5N$ | $N^2/2$ | Complex |
| Liao [24] | DB9/7 | 12 | 16 | $5N$ | $2N$ | $N^2/2$ | Simple |
| Liao [24] | D4 | 10 | 8 | $2N$ | $N$ | $N^2/2$ | Simple |
| Das [28] | D4 | 16 | 14 | $3.5N$ | $N$ | $N^2/2$ | Medium |
| Wu [30] | DB9/7 | 6 | 8 | $5.5N$ | $4N$ | $N^2$ | Medium |
| Dai [25] | DB9/7 | 64 | 56 | $13.5N$ | $2N$ | $N^2/4$ | Medium |
| Dai [25] | D4 | 32 | 24 | $5.5N$ | $N/2$ | $N^2/4$ | Medium |
| Proposed | DB9/7 | 18 | 32 | $5.5N$ | $N$ | $N^2/4$ | Simple |
| Proposed | D4 | 14 | 16 | $2.5N$ | $N/2$ | $N^2/4$ | Simple |

Table 3
Performance comparisons in general case for several 3-D architectures

| Architecture | Mode | Multipliers | Adders | On-chip memory | Off-chip memory | SL(ccs) | CC |
|---|---|---|---|---|---|---|---|
| Dai [25] | Non-causal | $12(K_g + K_h)$ | $12(K_g + K_h - 2)$ | $(K_g + K_h - 4)(N + 2)M$ | $M \times N^2$ | $MN^2/8$ | Complex |
| Proposed | Causal | $12K_M - 16$ | $12K_A$ | $(1.5 + K_R)N^2 + 2K_R N$ | – | $MN^2/8$ | Simple |
| Proposed | Non-causal | $12K_M - 16$ | $12K_A$ | $K_R(N + 2)M$ | $M \times N^2$ | $MN^2/8$ | Medium |

Table 4
Performance comparisons for several 3-D architectures in cases of DB9/7 and D4 wavelet filters being chosen

| Architecture | Filter | Mode | Multipliers | Adders | On-chip memory | SL(ccs) | OL |
|---|---|---|---|---|---|---|---|
| Dai [25] | DB9/7 | Non-causal | 192 | 168 | $12(N + 2)M$ | $MN^2/8$ | $4s + MN^2$ |
| Dai [25] | D4 | Non-causal | 96 | 72 | $4(N + 2)M$ | $MN^2/8$ | $s + MN^2$ |
| Das [28] | D4 | Causal | 48 | 40 | $3.5N^2 + 4N$ | $MN^2/4$ | $N + N^2$ |
| Proposed | DB9/7 | Causal | 56 | 96 | $5.5N^2 + 8N$ | $MN^2/8$ | $2s'$ |
| Proposed | D4 | Causal | 44 | 48 | $2.5N^2 + 2N$ | $MN^2/8$ | $s'$ |
| Proposed | DB9/7 | Non-causal | 56 | 96 | $4(N + 2)M$ | $MN^2/8$ | $2s + MN^2$ |
| Proposed | D4 | Non-causal | 44 | 48 | $(N + 2)M$ | $MN^2/8$ | $s + MN^2$ |

Note: $s = (2M + MN)/4$, $s' = (2N + N^2)/4$.
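To give a feel for the magnitudes involved, the short Python sketch below evaluates the auxiliary quantities $s$ and $s'$ from the note to Table 4 and the SL entries for one sequence size. The function names, the `SL_DIVISOR` mapping, and the chosen size are illustrative assumptions; only the formulas themselves are taken from Table 4.

```python
from typing import Tuple


def latency_terms(n: int, m: int) -> Tuple[float, float]:
    """Return (s, s') as defined in the note to Table 4."""
    s = (2 * m + m * n) / 4        # s  = (2M + MN) / 4
    s_prime = (2 * n + n * n) / 4  # s' = (2N + N^2) / 4
    return s, s_prime


# Divisor d such that SL = M * N^2 / d clock cycles, read from the SL column
# of Table 4 (hypothetical helper mapping, not part of the paper).
SL_DIVISOR = {"Dai [25]": 8, "Das [28]": 4, "Proposed": 8}


def sl_cycles(n: int, m: int, divisor: int) -> float:
    """SL in clock cycles for an M-frame sequence of N x N images."""
    return m * n * n / divisor


if __name__ == "__main__":
    n, m = 256, 8  # hypothetical sequence: 8 frames of 256 x 256 pixels
    s, s_prime = latency_terms(n, m)
    print(f"s = {s:.0f} ccs, s' = {s_prime:.0f} ccs")
    for name, divisor in SL_DIVISOR.items():
        print(f"{name}: SL = {sl_cycles(n, m, divisor):.0f} ccs")
```

For this size the proposed causal architecture with the D4 filter has an output latency of $s' = 16512$ ccs and an SL of 65536 ccs, against 131072 ccs for the SL entry of Das [28].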
5. Conclusions
Efficient VLSI architectures for 2-D and 3-D DWT have been proposed using the lifting scheme of DWT. The parallelism among all subband transforms in the lifting-based m-D DWT is explored to optimize the design, which efficiently increases the throughput rate of the separable m-D DWT. Compared with similar works, our approach improves system output latency and throughput rate, and is a good alternative in the tradeoff between throughput rate and hardware complexity. The proposed architectures are simple, regular, scalable, and well suited for VLSI implementation.
Acknowledgements
The authors would like to thank the associate editor and anonymous reviewers for their valuable comments. This work was supported by the State "eleventh-five" Key Plan Project Foundation of China (C1120061304), the State Natural Science Foundation of China (60572048) and the Natural Science Foundation of Hubei Province (2006ABA370).
References
[1] S.G. Mallat, A theory for multiresolution signal decomposition:
the wavelet representation, IEEE Trans. Pattern Anal.
Mach. Intell. 11 (7) (July 1989) 674–693.
[2] I. Daubechies, The wavelet transform, time-frequency
localization and signal analysis, IEEE Trans. Inform.
Theory 36 (9) (September 1990) 961–1005.
[3] A. Averbuch, D. Lazar, M. Israeli, Image compression using
wavelet transform and multiresolution decomposition, IEEE
Trans. Image Process. 5 (1) (January 1996) 4–15.
[4] A.S. Lewis, G. Knowles, Image compression using the 2-D
wavelet transform, IEEE Trans. Image Process. 1 (2)
(February 1992) 244–250.
[5] A.S. Lewis, G. Knowles, Video compression using 3-D
wavelet transforms, Electron. Lett. 26 (6) (March 1990)
396–398.
[6] M. Weeks, Architectures for the 3-D discrete wavelet
transform, Ph.D. Dissertation, University of Southwestern
Louisiana, 1998.
[7] A. Bruce, D. Donoho, H.-Y. Gao, Wavelet analysis, IEEE
Spectrum 33 (10) (October 1996) 26–35.
[8] J.W. Woods, J.-R. Ohm, Special issue on subband/wavelet
interframe video coding, Signal Process. Image Commun. 19
(2004) 557–579.
[9] K.K. Parhi, T. Nishitani, VLSI architectures for discrete
wavelet transforms, IEEE Trans. VLSI Systems 1 (6) (June
1993) 191–202.
[10] C. Chakrabarti, M. Vishwanath, Architectures for wavelet
transforms: a survey, J. VLSI Signal Process. 14 (1996)
171–192.
[11] M. Weeks, M. Bayoumi, Discrete wavelet transform:
architectures, design and performance issues, J. VLSI Signal
Process. 35 (2) (February 2003) 155–178.
[12] C. Chakrabarti, M. Vishwanath, Efficient realizations
of the discrete and continuous wavelet transforms: from
single chip implementations to mappings on SIMD array
computers, IEEE Trans. Signal Process. 43 (3) (March 1995)
759–771.
[13] C. Chrysafis, A. Ortega, Line-based, reduced memory,
wavelet image compression, IEEE Trans. Image Process. 9
(3) (March 2000) 378–389.
[14] P. Wu, L. Chen, An efficient architecture for two-dimensional
discrete wavelet transform, IEEE Trans. Circuits
Systems Video Technol. 11 (4) (April 2001) 536–545.
[15] T. Park, S. Jung, High speed lattice based VLSI architecture
of 2D discrete wavelet transform for real-time video signal
processing, IEEE Trans. Consumer Electron. 48 (4) (April
2002) 1026–1032.
[16] F. Marino, Two fast architectures for the direct 2-D discrete
wavelet transform-signal processing, IEEE Trans. Signal
Process. 49 (6) (June 2001) 1248–1259.
[17] F. Marino, Efficient high-speed low-power pipelined architecture
for the direct 2-D discrete wavelet transform, IEEE
Trans. Circuits Systems II: Analog Digital Signal Process. 47
(12) (February 2000) 1476–1491.
[18] T. Park, S. Jung, High speed lattice based VLSI architecture
of 2D discrete wavelet transform for real-time video signal
processing, IEEE Trans. Consumer Electron. 48 (4) (April
2002) 1026–1032.
[19] M. Weeks, M. Bayoumi, 3-D discrete wavelet transform
architectures, in: Proceeding of the IEEE International
Symposium on Circuits and Systems (ISCAS ’98), Monterey,
CA, May 1998, pp. 57–60.
[20] M. Weeks, M. Bayoumi, 3-D discrete wavelet transform
architectures, IEEE Trans. Signal Process. 50 (8) (August
2002) 2050–2063.
[21] W. Jiang, A. Ortega, Lifting factorization-based discrete
wavelet transform architecture design, IEEE Trans. Circuits
Systems Video Technol. 11 (5) (May 2001) 651–657.
[22] K. Andra, C. Chakrabarti, T. Acharya, A VLSI architecture
for lifting-based forward and inverse wavelet transform,
IEEE Trans. Signal Process. 50 (4) (April 2002) 966–977.
[23] G. Dillen, B. Georis, J.D. Legat, O. Cantineau, Combined
line-based architecture for the 5-3 and 9-7 wavelet transform
of jpeg2000, IEEE Trans. Circuits Systems Video Technol.
13 (9) (September 2003) 944–950.
[24] H. Liao, M.K. Mandal, B.F. Cockburn, Efficient architecture
for 1-D and 2-D lifting-based wavelet transforms, IEEE
Trans. Signal Process. 52 (5) (May 2004) 1315–1326.
[25] Q. Dai, X. Chen, C. Lin, A novel VLSI architecture for
multidimensional discrete wavelet transform, IEEE Trans.
Circuits Systems Video Technol. 14 (8) (August 2004)
1105–1110.
[26] A. Grzeszczak, M.K. Mandal, S. Panchanathan, VLSI
implementation of discrete wavelet transform, IEEE
Trans. Very Large Scale Integration (VLSI) Systems 4 (4)
(December 1996) 421–433.
[27] M. Vishwanath, R.M. Owens, M.J. Irwin, VLSI architectures
for the discrete wavelet transform, IEEE Trans.
Circuits Systems II: Analog Digital Signal Process. 42 (5)
(May 1995) 305–316.
[28] B. Das, S. Banerjee, Data-folded architecture for running 3D
DWT using 4-tap Daubechies filters, IEE Proc. Circuits
Devices Systems 152 (1) (February 2005) 17–24.
[29] S.M. Aroutchelvame, K. Raahemifar, An efficient architecture
for lifting-based forward and inverse discrete
wavelet transform, in: Proceedings of the IEEE International
Conference on Multimedia and Expo, July 2005,
pp. 816–819.
[30] B. Wu, C. Lin, A high-performance and memory-efficient
pipeline architecture for the Legall5/3 and 9/7 discrete
wavelet transform of JPEG2000 codec, IEEE Trans.
Circuits Systems Video Technol. 15 (12) (December 2005)
1615–1628.
[31] C. Xiong, J. Tian, J. Liu, Efficient parallel architecture for
lifting-based two dimensional discrete wavelet transform, in:
Proceedings of the IEEE International Workshop on VLSI
Design and Video Technology, June 2005, pp. 75–78.
[32] L. Liu, X. Wang, A VLSI architecture of spatial combinative
lifting algorithm based 2-D DWT/IDWT, in: Proceedings of
the IEEE Asia-pacific Conference on Circuits and System,
May 2002, pp. 299–304.
[33] W. Sweldens, The lifting scheme: a new philosophy in
biorthogonal wavelet constructions, Proc. SPIE 2569 (1995)
68–79.
[34] I. Daubechies, W. Sweldens, Factoring wavelet transforms
into lifting schemes, J. Fourier Anal. Appl. 4 (3) (1998)
247–269.
[35] A.R. Calderbank, I. Daubechies, W. Sweldens, B.L. Yeo,
Wavelet transforms that map integers to integers, Appl.
Comput. Harmonic Anal. (ACHA) 5 (3) (March 1999)
332–369.