An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavelet Transform

Rahul Jain

Preeti Ranjan Panda

IIT-Delhi

28 May 2007, ISCAS 2007

Agenda

• Existing Work
• Proposed Architecture
• Comparative Results
• Conclusion

Discrete Wavelet Transform (DWT)

• At the core of the JPEG2000 standard
• (9, 7) Daubechies coefficients defined in JPEG2000
• 1-D DWT using Daubechies (9, 7) filter
    - two lifting steps
    - one scaling step
• Each lifting step
    - a prediction step
    - an update step

Hardware Implementation of DWT

• 2-D DWT implemented by row-wise and column-wise 1-D DWT
• Dominated by memory size and bandwidth
• Number of pipeline registers ∝ memory size
• Objective
    - Smaller critical path
    - Fewer pipeline registers

1-D DWT Equation

1. P1: Y(2i+1) = a * ( X(2i) + X(2i+2) ) + X(2i+1)

2. U1: Y(2i) = b * ( Y(2i-1) + Y(2i+1) ) + X(2i)

3. P2: Z(2i+1) = c * ( Y(2i) + Y(2i+2) ) + Y(2i+1)

4. U2: Z(2i) = d * ( Z(2i-1) + Z(2i+1) ) + Y(2i)

5. S: Z(2i) = k * Z(2i)

6. S: Z(2i+1) = (1/k) * Z(2i+1)

P: Prediction Step

U: Update Step

S: Scaling Step

a, b, c, d, k: constants defined in JPEG2000 standard
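
The six steps above can be sketched in software as follows. This is a minimal Python illustration, not the hardware architecture: the function name `lifting_dwt97_1d`, the in-place update order, and the symmetric boundary extension in `_mirror` are assumptions made for the example. The constant values are the ones commonly quoted for the (9, 7) filter, and the value of k in particular depends on the normalization convention in use.

```python
# Commonly quoted (9, 7) lifting constants; treat these values as assumptions.
A = -1.586134342
B = -0.05298011854
C = 0.8829110762
D = 0.4435068522
K = 1.149604398  # scaling constant; conventions differ between references


def _mirror(n, length):
    """Map an out-of-range index back into [0, length) by symmetric extension."""
    if n < 0:
        return -n
    if n >= length:
        return 2 * (length - 1) - n
    return n


def lifting_dwt97_1d(x, a=A, b=B, c=C, d=D, k=K):
    """Return (low, high) coefficient lists for an even-length input sequence."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("input length must be even and at least 2")
    y = [float(v) for v in x]

    def at(i):
        return y[_mirror(i, n)]

    for i in range(1, n, 2):      # P1: Y(2i+1) = a*(X(2i) + X(2i+2)) + X(2i+1)
        y[i] += a * (at(i - 1) + at(i + 1))
    for i in range(0, n, 2):      # U1: Y(2i) = b*(Y(2i-1) + Y(2i+1)) + X(2i)
        y[i] += b * (at(i - 1) + at(i + 1))
    for i in range(1, n, 2):      # P2: Z(2i+1) = c*(Y(2i) + Y(2i+2)) + Y(2i+1)
        y[i] += c * (at(i - 1) + at(i + 1))
    for i in range(0, n, 2):      # U2: Z(2i) = d*(Z(2i-1) + Z(2i+1)) + Y(2i)
        y[i] += d * (at(i - 1) + at(i + 1))

    low = [k * v for v in y[0::2]]    # S: even samples scaled by k
    high = [v / k for v in y[1::2]]   # S: odd samples scaled by 1/k
    return low, high
```

A quick sanity check under these assumptions: for a constant input, the high-pass outputs should come out very close to zero.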

Data Flow Graph (DFG)

• DFG derived from the equations
• a, b, c and d nodes show the corresponding constant coefficient multipliers
• X7 and X8 are the off-chip reads required to compute Z4 and Z5
• X6, Y5, Y4 and Z3 are read from the on-chip buffer

Existing Architectures

• Non-Pipelined Direct Implementation
    - Requires 6 registers with critical path: 4Tm + 8Ta
• Fully Pipelined Direct Implementation
    - Requires 32 registers with critical path: Tm
• High Performance Architecture
    - Lifting step equations modified
    - Throughput of 1 input/output per cycle
    - Requires 20 registers with critical path: Tm
• Flipping Architecture

Flipping Architecture

• Multiplications moved from critical path using inverse multipliers
• Critical path reduced to Tm + 5Ta
• No hardware overhead
• 5-stage pipelined implementation
• 11 registers required
• Critical path: Tm

Proposed DFG Optimizations

• X6 in the present cycle essentially becomes X8 in the next cycle
• “a*X6” computed now can be stored and reused to obtain the “a*X8”
    - no need to re-compute “a*X8”
• Similar argument for computations involving Y5, Y4 and Z3

Optimized DFG

1. e1 = X6 * a
2. e2 = X6 + Y5 * b
3. e3 = Y5 + Y4 * c
4. e4 = Y4 + Z3 * d
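
One way to read the four stored terms: substituting e1 to e4 into the lifting equations gives Y7 = e1 + a*X8 + X7, Y6 = e2 + b*Y7, Z5 = e3 + c*Y6 and Z4 = e4 + d*Z5, and the same products then form the stored terms for the following cycle. The Python sketch below models one such cycle. The function name `dfg_step` and the state tuple are illustrative and not taken from the original, the constants are the commonly quoted (9, 7) values, and the scaling step is left out.

```python
# Commonly quoted (9, 7) lifting constants; treat these values as assumptions.
A = -1.586134342
B = -0.05298011854
C = 0.8829110762
D = 0.4435068522


def dfg_step(state, x_odd, x_even):
    """Model one cycle of the optimized DFG (hypothetical helper).

    state  = (e1, e2, e3, e4) carried over from the previous cycle
    x_odd  = new odd input sample  (X7 in the slide's example)
    x_even = new even input sample (X8 in the slide's example)
    Returns (next_state, z_even, z_odd), i.e. Z4 and Z5 in the example.
    """
    e1, e2, e3, e4 = state

    t1 = A * x_even          # a*X8, also next cycle's e1
    y_odd = e1 + t1 + x_odd  # Y7 = a*(X6 + X8) + X7
    t2 = B * y_odd           # b*Y7
    y_even = e2 + t2         # Y6 = b*(Y5 + Y7) + X6
    t3 = C * y_even          # c*Y6
    z_odd = e3 + t3          # Z5 = c*(Y4 + Y6) + Y5
    t4 = D * z_odd           # d*Z5
    z_even = e4 + t4         # Z4 = d*(Z3 + Z5) + Y4

    # Each product is reused to form the stored terms for the next cycle.
    next_state = (t1, x_even + t2, y_odd + t3, y_even + t4)
    return next_state, z_even, z_odd
```

Written this way, each step uses four multiplications and eight additions, which agrees with the multiplier and adder counts quoted for an initiation interval of 1.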

4 Stage Pipelining

• Critical path is Ta + Tm
• Initiation Interval = 1, Resource Requirement
    - 4 Multipliers
    - 8 Adders
    - 10 Registers
        - 6 Pipelining Registers
        - 4 for e1-e4
• Initiation Interval = 2, Resource Requirement
    - 2 Multipliers
    - 4 Adders
    - 8 Registers

Reducing the Scaling Step Multiplier Requirement

• 1D-DWT
    - Low-pass coefficients multiplied by k
    - High-pass coefficients multiplied by 1/k
• Effectively in 2D-DWT
    - 25% of coefficients multiplied by k*k
    - 25% of coefficients multiplied by 1/(k*k)
    - 50% of coefficients multiplied by 1
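
The 25% / 25% / 50% split can be checked by tracking the power of k that each coefficient picks up across the row and column passes. The Python sketch below assumes an interleaved layout in which even indices hold low-pass outputs and odd indices hold high-pass outputs, following the Z(2i) and Z(2i+1) convention of the 1-D equations. The function names are illustrative.

```python
from collections import Counter


def scale_exponent(row, col):
    """Power of k applied to the coefficient at (row, col) after both passes.

    Even index = low-pass (factor k), odd index = high-pass (factor 1/k).
    This interleaved layout is an assumption of the sketch.
    """
    row_pass = 1 if col % 2 == 0 else -1     # row-wise 1-D DWT acts along columns
    column_pass = 1 if row % 2 == 0 else -1  # column-wise 1-D DWT acts along rows
    return row_pass + column_pass            # +2 -> k*k, 0 -> 1, -2 -> 1/(k*k)


def scale_histogram(height, width):
    """Fraction of coefficients for each exponent of k (+2, 0, -2)."""
    counts = Counter(
        scale_exponent(r, c) for r in range(height) for c in range(width)
    )
    total = height * width
    return {e: counts[e] / total for e in (2, 0, -2)}
```

For any image with even height and width, `scale_histogram` should report one quarter of the coefficients at exponent +2, one quarter at -2, and one half at 0, where the two factors cancel.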

Combining the 2 Scaling Steps

• Combine the scaling steps of row-wise and column-wise 1D-DWT
    - Eliminates 75% of the scaling step multiplications
    - Saves 3 multipliers at a throughput of 2 inputs/outputs per cycle
• Proposed Architecture
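
The 75% figure is consistent with a simple count. The Python sketch below assumes that, with separate scaling, every coefficient is multiplied once in the row pass and once in the column pass, and that with combined scaling only the coefficients whose net factor differs from 1 are multiplied. The function name and the interleaved even/odd layout are assumptions of the example.

```python
def scaling_multiplications(height, width):
    """Return (separate, combined) scaling multiplication counts for one level.

    separate: each coefficient scaled in the row pass and again in the column pass.
    combined: only coefficients with net factor k*k or 1/(k*k) are scaled,
              i.e. those whose row and column indices have the same parity.
    """
    separate = 2 * height * width
    combined = sum(
        1
        for r in range(height)
        for c in range(width)
        if r % 2 == c % 2
    )
    return separate, combined


def reduction(height, width):
    """Fraction of scaling multiplications removed by combining the two steps."""
    separate, combined = scaling_multiplications(height, width)
    return 1 - combined / separate
```

Under these counting assumptions, `reduction` evaluates to 0.75 for even image dimensions.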

Multiplier and Adder Synthesis

• Existing work reports critical paths under the assumption that Tm > 2*Ta
• In DWT, the multiplications are by constants
• DWT constant multipliers synthesized
• Result: Tm = 1.6*Ta

Tm: Multiplier Latency, Ta: Adder Latency
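
To see what the synthesized ratio means for the critical-path expressions quoted in these slides, they can be evaluated in units of Ta. The short Python sketch below does only that arithmetic. The function name and dictionary labels are illustrative, and the results are normalized delays, not synthesis numbers.

```python
def critical_paths(tm_over_ta):
    """Critical-path delays in units of Ta for a given Tm/Ta ratio."""
    tm = tm_over_ta
    return {
        "non-pipelined direct (4Tm + 8Ta)": 4 * tm + 8,
        "flipping, before pipelining (Tm + 5Ta)": tm + 5,
        "proposed 4-stage (Ta + Tm)": 1 + tm,
        "fully pipelined (Tm)": tm,
    }


if __name__ == "__main__":
    # Compare the synthesized ratio with the Tm = 2*Ta boundary assumed earlier.
    for ratio in (1.6, 2.0):
        print(f"Tm = {ratio} * Ta")
        for name, delay in critical_paths(ratio).items():
            print(f"  {name}: {delay:.1f} Ta")
```

With Tm = 1.6*Ta, the proposed Ta + Tm path is 2.6 Ta, so the extra adder delay costs roughly one Ta relative to a path of Tm alone.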

Comparison of 1D-DWT

• Critical Path column takes the multiplier synthesis results into account
• Proposed Architecture uses one fewer register than the Flipping Architecture

Comparison of 2D-DWT

• Combining the scaling step multiplications
    - 3 fewer multipliers required
    - removes a pipeline register, which reduces the temporary buffer requirement

Flipping vs Proposed @ 4ns Clock

• Both architectures synthesized under the same clock constraints
    - 20% area saving
    - 25% power saving
• 3 fewer registers required
    - Simplifies clock network => clock power saving

Conclusion

• 1D-DWT DFG optimizations proposed
• In (9,7) DWT, Tm comparable to Ta
• Fewer registers required
    - Area saving
    - Lower memory requirement
    - Simpler clock network
• Scaling steps combined
    - Fewer multiplications
    - Area saving
    - Power saving
