An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavelet Transform

Rahul Jain and Preeti Ranjan Panda
IIT-Delhi

Agenda
  • Existing Work
  • Proposed Architecture
  • Comparative Results
  • Conclusion

28 May 2007, ISCAS 2007

Discrete Wavelet Transform (DWT)
  • At the core of the JPEG2000 standard
  • (9, 7) Daubechies coefficients defined in JPEG2000
  • 1-D DWT using the Daubechies (9, 7) filter
      • two lifting steps
      • one scaling step
  • Each lifting step
      • a prediction step
      • an update step

Hardware Implementation of DWT
  • 2-D DWT implemented by row-wise and column-wise 1-D DWT
  • Dominated by memory size and bandwidth
  • Number of pipeline registers determines memory size
  • Objective
      • Smaller critical path
      • Fewer pipeline registers

1-D DWT Equation
  1. P1: Y(2i+1) = a * ( X(2i) + X(2i+2) ) + X(2i+1)
  2. U1: Y(2i) = b * ( Y(2i-1) + Y(2i+1) ) + X(2i)
  3. P2: Z(2i+1) = c * ( Y(2i) + Y(2i+2) ) + Y(2i+1)
  4. U2: Z(2i) = d * ( Z(2i-1) + Z(2i+1) ) + Y(2i)
  5. S: Z(2i) = k * Z(2i)
  6. S: Z(2i+1) = (1/k) * Z(2i+1)
  • P: Prediction Step
  • U: Update Step
  • S: Scaling Step
  • a, b, c, d, k: constants defined in the JPEG2000 standard

Data Flow Graph (DFG)
  • DFG derived from the equations
  • The a, b, c and d nodes show the corresponding constant coefficient multipliers
  • X7 and X8 are the off-chip reads required to compute Z4 and Z5
  • X6, Y5, Y4 and Z3 are read from the on-chip buffer

Existing Architectures
  • Non-Pipelined Direct Implementation
      • Requires 6 registers, with critical path 4Tm + 8Ta
  • Fully Pipelined Direct Implementation
      • Requires 32 registers, with critical path Tm
  • High Performance Architecture
      • Lifting step equations modified
      • Throughput of 1 input/output per cycle
      • Requires 20 registers, with critical path Tm
  • Flipping Architecture

Flipping Architecture
  • Multiplications moved off the critical path using inverse multipliers
  • Critical path reduced to Tm + 5Ta
  • No hardware overhead
  • 5-stage pipelined implementation
      • 11 registers required
      • Critical path: Tm

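The six equations on the 1-D DWT Equation slide can be tried out in software. The following is a minimal Python sketch, not the hardware design from the talk. The numeric constants are the commonly quoted (9, 7) lifting values and are an assumption here, since the slides only say they are defined in the JPEG2000 standard; the value and direction of the scaling constant k differ between texts. The function name `lifting_97` and the mirrored boundary handling are likewise illustrative choices.

```python
# Commonly quoted (9, 7) lifting constants (assumed; the slides give no numbers).
A, B, C, D = -1.586134342, -0.052980118, 0.882911075, 0.443506852
K = 1.149604398  # assumed scaling constant; conventions differ between texts


def lifting_97(x):
    """Apply P1, U1, P2, U2 and S to an even-length list of samples."""
    n = len(x)
    assert n >= 2 and n % 2 == 0

    def at(seq, i):
        # Mirror out-of-range indices (symmetric extension, an assumption).
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
        return seq[i]

    y = list(x)
    for i in range(1, n, 2):  # P1: Y(2i+1) = a*(X(2i) + X(2i+2)) + X(2i+1)
        y[i] = A * (at(x, i - 1) + at(x, i + 1)) + x[i]
    for i in range(0, n, 2):  # U1: Y(2i) = b*(Y(2i-1) + Y(2i+1)) + X(2i)
        y[i] = B * (at(y, i - 1) + at(y, i + 1)) + x[i]

    z = list(y)
    for i in range(1, n, 2):  # P2: Z(2i+1) = c*(Y(2i) + Y(2i+2)) + Y(2i+1)
        z[i] = C * (at(y, i - 1) + at(y, i + 1)) + y[i]
    for i in range(0, n, 2):  # U2: Z(2i) = d*(Z(2i-1) + Z(2i+1)) + Y(2i)
        z[i] = D * (at(z, i - 1) + at(z, i + 1)) + y[i]

    low = [K * z[i] for i in range(0, n, 2)]   # S: even outputs times k
    high = [z[i] / K for i in range(1, n, 2)]  # S: odd outputs times 1/k
    return low, high
```

A quick sanity check under these assumptions: a constant input such as `lifting_97([3.0] * 8)` should give high-pass outputs that are zero up to rounding, because the prediction steps cancel a flat signal.
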
Proposed DFG Optimizations
  • The sample read as X8 in the present cycle becomes X6 in the next cycle
  • a*X8 computed now can be stored and reused as a*X6 in the next cycle
  • No need to recompute the product
  • A similar argument holds for the computations involving Y5, Y4 and Z3

Optimized DFG
  1. e1 = X6 * a
  2. e2 = X6 + Y5 * b
  3. e3 = Y5 + Y4 * c
  4. e4 = Y4 + Z3 * d

4 Stage Pipelining
  • Critical path is Ta + Tm
  • Initiation interval = 1, resource requirement
      • 4 multipliers
      • 8 adders
      • 10 registers (6 pipelining registers, 4 for e1-e4)
  • Initiation interval = 2, resource requirement
      • 2 multipliers
      • 4 adders
      • 8 registers

Reducing the Scaling Step Multiplier Requirement
  • 1D-DWT
      • Low-pass coefficients multiplied by k
      • High-pass coefficients multiplied by 1/k
  • Effectively, in 2D-DWT
      • 25% of coefficients multiplied by k*k
      • 25% of coefficients multiplied by 1/(k*k)
      • 50% of coefficients multiplied by 1

Combining the 2 Scaling Steps
  • Combine the scaling steps of the row-wise and column-wise 1D-DWT
  • Reduces scaling step multiplications by 75%
  • Saves 3 multipliers at a throughput of 2 inputs/outputs per cycle
  • [Figure: Proposed Architecture]

Multiplier and Adder Synthesis
  • Existing work presented critical paths under the assumption that Tm > 2*Ta
  • In DWT, the multiplications are by constants
  • DWT constant multipliers synthesized: Tm = 1.6*Ta
  • Tm: multiplier latency, Ta: adder latency

Comparison of 1D-DWT
  • The Critical Path column takes the multiplier synthesis results into account
  • The proposed architecture uses 1 register fewer than the Flipping Architecture

Comparison of 2D-DWT
  • Combining the scaling step multiplications
      • 3 fewer multipliers required
      • Removes a pipeline register, which reduces the temporary buffer requirement

Flipping vs Proposed @ 4ns Clock
  • The 2 architectures were synthesized under the same clock constraints
  • 20% area saving
  • 25% power saving
  • 3 fewer registers required
  • Simplifies the clock network => clock power saving

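The reuse of e1 to e4 from the Optimized DFG slide can be illustrated with a small software model. This is a sketch under stated assumptions, not the authors' RTL: the constants are the commonly quoted (9, 7) values, samples before x[0] are taken as zero, the scaling step is left out, and the names `stream_lifting` and `direct_z` are hypothetical. Each step reads two new samples (X7, X8), performs four multiplications and eight additions, and carries only e1 to e4 forward.

```python
# Assumed numeric values of the lifting constants a, b, c, d.
A, B, C, D = -1.586134342, -0.052980118, 0.882911075, 0.443506852


def stream_lifting(x):
    """Process x two samples per step, keeping only e1..e4 between steps.

    Step p reads X7 = x[2p-1] and X8 = x[2p] (with x[-1] = 0) and emits the
    unscaled pair (Z(2p-4), Z(2p-3)).
    """
    e1 = e2 = e3 = e4 = 0.0          # zero history before x[0]
    padded = [0.0] + list(x)
    out = []
    for j in range(0, len(padded) - 1, 2):
        x7, x8 = padded[j], padded[j + 1]
        ax8 = A * x8                 # multiplier a
        y7 = e1 + ax8 + x7           # P1, reusing e1 = a*X6
        by7 = B * y7                 # multiplier b
        y6 = e2 + by7                # U1, reusing e2 = X6 + b*Y5
        cy6 = C * y6                 # multiplier c
        z5 = e3 + cy6                # P2, reusing e3 = Y5 + c*Y4
        dz5 = D * z5                 # multiplier d
        z4 = e4 + dz5                # U2, reusing e4 = Y4 + d*Z3
        # Values stored for the next step, where X8 plays the role of X6.
        e1, e2, e3, e4 = ax8, x8 + by7, y7 + cy6, y6 + dz5
        out.append((z4, z5))
    return out


def direct_z(x, m):
    """Reference: evaluate Z(m) directly from P1, U1, P2, U2 (zero extension)."""
    def xs(i):
        return x[i] if 0 <= i < len(x) else 0.0

    def y(i):
        if i % 2:
            return A * (xs(i - 1) + xs(i + 1)) + xs(i)
        return B * (y(i - 1) + y(i + 1)) + xs(i)

    def z(i):
        if i % 2:
            return C * (y(i - 1) + y(i + 1)) + y(i)
        return D * (z(i - 1) + z(i + 1)) + y(i)

    return z(m)
```

Comparing `stream_lifting(x)[p]` with `(direct_z(x, 2 * p - 4), direct_z(x, 2 * p - 3))` for each step p is one way to confirm that storing the partial results leaves the outputs unchanged.
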
Conclusion
  • 1D-DWT DFG optimizations proposed
  • In (9,7) DWT, Tm is comparable to Ta
  • Fewer registers required
      • Area saving
      • Lower memory requirement
      • Simpler clock network
  • Scaling steps combined
      • Fewer multiplications
      • Area saving
      • Power saving

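The 75% figure quoted for combining the two scaling steps can be checked with a short count. The sketch below assumes an interleaved coefficient layout in which even indices hold low-pass and odd indices hold high-pass results; the value of k and the function names are illustrative assumptions, not taken from the slides.

```python
K = 1.149604398  # assumed scaling constant; only its role matters here


def subband_scale(row_is_low, col_is_low, k=K):
    """Combined scale factor after the row-wise and column-wise scaling steps."""
    row = k if row_is_low else 1.0 / k
    col = k if col_is_low else 1.0 / k
    return row * col


def count_scaling_mults(rows, cols):
    """Return (separate, combined) scaling multiplication counts for one image."""
    # Separate steps: every coefficient is scaled once per 1-D pass.
    separate = 2 * rows * cols
    # Combined step: only coefficients whose factor is k*k or 1/(k*k) need a
    # multiplier; mixed low/high subbands have factor k * (1/k) = 1.
    combined = sum(
        1
        for r in range(rows)
        for c in range(cols)
        if (r % 2) == (c % 2)
    )
    return separate, combined
```

For an 8 x 8 block, `count_scaling_mults(8, 8)` gives 128 multiplications for separate scaling against 32 when combined, which is the 75% reduction stated on the slide.
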