A Power Efficient Architecture for 2-D Discrete Wavelet Transform


Rahul Jain, CoWare India

Preeti Ranjan Panda, IIT-Delhi

10 August 2006 10th IEEE VLSI Design And Test Symposium, 2006


Agenda

• Memory Power Optimization
• Existing Z-Scan based Schemes
• Low Power Z-Scan (Proposed Architecture)
• Results
• Conclusion


Memory Power Optimization

• Importance of Optimizing Memory System Energy
  - Many emerging applications like JPEG2000 are data intensive
  - Memory system can contribute up to 90% of the energy
• Concurrently Optimizing Memory Architecture and Accesses
  - Algorithm Level
    - Reduce memory requirement
    - Improve regularity of accesses
  - Build optimized memory architecture
    - Memory Partitioning
    - Custom Circuits


Z-Scan based Schemes [Chiu-SIPS’03]

• Suspending a DWT line computation
  - Store 4 intermediate values
• Z-Scan
  - Column processing starts early
  - On-Chip Buffer Required = 4*M, where M = image tile height
• Optimal Z-Scan
  - EBCOT Code-Block size (CW*CH) considered
  - On-Chip Buffer Required = 4*M + 4*2*CW
  - Usually CW = CH = 64 (values used in the experiments)

[Figure: scan-order diagram; labeled dimensions 2*CW and 2*CH]


Low-Power Z-Scan (1)

• Generalize the Z-Scan
  - Compute r elements in a row
  - For Z-Scan, r = 2
  - For Optimal Z-Scan, r = 2*CW
  - On-Chip Buffer Required = 4*M + 4*r

[Figure: generalized scan-order diagram; labeled dimensions r and 2*CH]
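The buffer-size expressions quoted for the three schemes can be compared with a short Python sketch. The function names are illustrative only and do not come from the slides, and the unit of the result (buffer entries) is an assumption, since the slides do not state one.

```python
def z_scan_buffer(M):
    """Buffer size quoted for plain Z-Scan: 4*M (M = image tile height)."""
    return 4 * M


def generalized_buffer(M, r):
    """Buffer size quoted for the generalized scan: 4*M + 4*r."""
    return 4 * M + 4 * r


def optimal_z_scan_buffer(M, CW):
    """Optimal Z-Scan is the generalized scan with r = 2*CW."""
    return generalized_buffer(M, 2 * CW)


if __name__ == "__main__":
    M, CW = 512, 64  # CW = 64 is the code-block width used in the slides
    print("Z-Scan:", z_scan_buffer(M))
    print("Optimal Z-Scan:", optimal_z_scan_buffer(M, CW))
    for r in (16, 32, 64):
        print("Generalized, r =", r, ":", generalized_buffer(M, r))
```

Choosing r below 2*CW shrinks only the 4*r term; the 4*M term is unchanged.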


Low-Power Z-Scan (2)

• r is a sub-multiple of 2*CW (2*CW divided by an integer)
  - This takes the Code Block size into account
• 2 separate buffers used
  - Row Buffer (RB) = 4*M
  - Column Buffer (CB) = 4*r
• How to decide the value of r?
  - Size of CB ∝ r
  - RB sleep time ∝ r

[Figure: diagram labels "CB: r locations", "RB in Low Power Mode", "RB access"]


Memory Power Analysis (1)

• Assume that each element is computed in unit time (so energy and power can be used interchangeably)
• For a memory of size 2^n, let
  - Pa(2^n): memory access power
  - Ps(2^n): sleep mode / data retention mode power
  - Pw(2^n): wakeup power for each state transition from sleep mode to active mode
• Let Ps(2^n) = s * Pa(2^n) and Pw(2^n) = w * Pa(2^n)
  - s = 0.1, w = 0.33 (assumed for the experiments)
• Buffer Accesses
  - Read at Resumption
  - Write at Suspension


Memory Power Analysis (2)

• Row Buffer Power
  - 2 accesses per r elements
  - RB in sleep mode for r-2 element computations
  - Wake up RB once per row
  - Power per r-element computation:
    Prow_buffer(r, M) = 2 * Pa(M) + (r-2) * Ps(M) + Pw(M)

[Figure: diagram labels "Row Computation Suspends", "RB in Low Power Mode", "Wakeup", "Row Computation Resumes"]


Memory Power Analysis (3)

• Column Buffer Power
  - 1 access per element
  - Power consumption per element computation:
    Pcol_buffer(r) = Pa(r)
• Power per 2-D DWT Element Computation:
    Prow_buffer(r, M)/r + Pcol_buffer(r)

[Figure: diagram labels "Col Computation Suspends", "Col Computation Resumes"]
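A minimal Python sketch of the unbanked model may help in exploring how the per-element power varies with r. The factors s = 0.1 and w = 0.33 are the values assumed on the slides. The access-power function `access_power` is purely hypothetical (square-root scaling in arbitrary units); the slides take Pa from an empirical SRAM model whose equations are not reproduced here, so the numbers this sketch produces are illustrative only.

```python
import math

S = 0.1   # Ps(size) = S * Pa(size), value assumed on the slides
W = 0.33  # Pw(size) = W * Pa(size), value assumed on the slides


def access_power(size):
    """HYPOTHETICAL Pa(size): sqrt scaling, arbitrary units (not from the source)."""
    return math.sqrt(size)


def row_buffer_power(r, M, pa=access_power):
    """Prow_buffer(r, M) = 2*Pa(M) + (r-2)*Ps(M) + Pw(M), per r elements."""
    return 2 * pa(M) + (r - 2) * S * pa(M) + W * pa(M)


def col_buffer_power(r, pa=access_power):
    """Pcol_buffer(r) = Pa(r), per element."""
    return pa(r)


def power_per_element(r, M, pa=access_power):
    """Prow_buffer(r, M)/r + Pcol_buffer(r)."""
    return row_buffer_power(r, M, pa) / r + col_buffer_power(r, pa)


def best_r(M, candidates=(2, 4, 8, 16, 32, 64, 128), pa=access_power):
    """Return the candidate r with the lowest per-element power."""
    return min(candidates, key=lambda r: power_per_element(r, M, pa))


if __name__ == "__main__":
    for M in (32, 64, 128, 256, 512):
        print("M =", M, "best r under the assumed Pa:", best_r(M))
```

The row-buffer term falls roughly as 1/r while the column-buffer term grows with r, which is why a minimum appears at an intermediate r. Passing a different `pa` callable swaps in another access-power model.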


Variation of Power with r

[Figure: Energy (J), from 0 to 6.00E-10, versus value of r (2, 4, 8, 16, 32, 64, 128), with one curve each for M = 32, 64, 128, 256, 512; annotations at r=16 and r=32]


Power Implications of Banking (1)

• Banked Buffer
  - Increases the average idleness of each buffer
  - Lower Access Power
  - Predictable state changes, no timing overheads
• Let there be b RB banks and c CB banks
• Average RB power per element:
    Prow = [Power of bank in use * M/b + Sleep Power * (M - M/b)] / M
         = [{Prow_buffer(r, M/b) / r} * M/b + Ps(M/b) * (M - M/b)] / M
• Each bank is woken up once for M*r elements
  - Additional Row Buffer Wakeups per Element = b/(M*r)


Power Implications of Banking (2)

• Average column-buffer power per element:
    Pcol = [{Pcol_buffer(r/c)} * r/c + Ps(r/c) * (r - r/c)] / r
• Number of Column Buffer Wakeups per Element = c/r
• Additional Wakeup Power:
    Pwakeups = [Pw(M/b) * b/(M*r)] + [Pw(r/c) * c/r]
• MUX power considered
• Total Power per Element:
    Prow + Pcol + Pwakeups + Pmux
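The banked expressions can be evaluated with a Python sketch along the following lines. As before, `pa` is a hypothetical access-power function (square-root scaling, arbitrary units) standing in for the empirical model used in the slides, and `p_mux` is a placeholder parameter because the slides do not quantify the MUX power. The wakeup-count term is read as b/(M*r), following the statement that each bank is woken once per M*r elements.

```python
import math

S, W = 0.1, 0.33  # sleep and wakeup factors assumed on the slides


def pa(size):
    """HYPOTHETICAL access power Pa(size); not taken from the source."""
    return math.sqrt(size)


def ps(size):
    return S * pa(size)  # Ps = s * Pa


def pw(size):
    return W * pa(size)  # Pw = w * Pa


def row_buffer_power(r, size):
    """Prow_buffer(r, size) = 2*Pa + (r-2)*Ps + Pw, per r elements."""
    return 2 * pa(size) + (r - 2) * ps(size) + pw(size)


def banked_row_power(r, M, b):
    """Prow = [{Prow_buffer(r, M/b)/r} * M/b + Ps(M/b) * (M - M/b)] / M."""
    bank = M / b
    return (row_buffer_power(r, bank) / r * bank + ps(bank) * (M - bank)) / M


def banked_col_power(r, c):
    """Pcol = [Pcol_buffer(r/c) * r/c + Ps(r/c) * (r - r/c)] / r."""
    bank = r / c
    return (pa(bank) * bank + ps(bank) * (r - bank)) / r


def wakeup_power(r, M, b, c):
    """Pwakeups = Pw(M/b) * b/(M*r) + Pw(r/c) * c/r."""
    return pw(M / b) * b / (M * r) + pw(r / c) * c / r


def total_power(r, M, b, c, p_mux=0.0):
    """Prow + Pcol + Pwakeups + Pmux; p_mux is a placeholder value."""
    return (banked_row_power(r, M, b) + banked_col_power(r, c)
            + wakeup_power(r, M, b, c) + p_mux)


if __name__ == "__main__":
    M = 512
    for r, c, b in [(32, 4, 4), (64, 4, 4), (64, 4, 8)]:
        print("r, c, b =", (r, c, b), "total:", total_power(r, M, b, c))
```

With b = 1 and c = 1 the row and column terms reduce to the unbanked expressions, which gives a quick consistency check on the formulas.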


r vs Power (Banked Case, M=512)

Min Power with r=64, c=4, b=8


Energy Consumption Comparison

M      Z-scan       Optimal Z-scan   Low-Power Z-scan   r    c   b   % imp
       (10^-11 J)   (10^-11 J)       (10^-11 J)
32     23.4         29.1             8.08               32   4   4   72.2
64     25.5         29.3             8.13               64   4   4   72.3
128    29.9         29.7             8.18               64   4   8   72.5
256    38.5         30.6             8.29               64   4   8   72.9
512    55.8         32.3             8.49               64   4   8   73.7
1024   90.3         35.8             8.89               64   4   8   75.2

Up to 90% and 75% improvement over Z-Scan and Optimal Z-Scan respectively
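The improvement percentages can be rechecked from the tabulated energies with a few lines of Python. The energy values below are copied from the table (in units of 10^-11 J); the helper name `improvement` is illustrative. The reported "% imp" column is consistent with a comparison against Optimal Z-scan.

```python
# (M, Z-scan, Optimal Z-scan, Low-Power Z-scan, reported % imp), energies in 1e-11 J
ROWS = [
    (32, 23.4, 29.1, 8.08, 72.2),
    (64, 25.5, 29.3, 8.13, 72.3),
    (128, 29.9, 29.7, 8.18, 72.5),
    (256, 38.5, 30.6, 8.29, 72.9),
    (512, 55.8, 32.3, 8.49, 73.7),
    (1024, 90.3, 35.8, 8.89, 75.2),
]


def improvement(baseline, proposed):
    """Percentage energy reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    for M, z_scan, optimal, low_power, reported in ROWS:
        print(M,
              "vs Z-scan: %.1f%%" % improvement(z_scan, low_power),
              "vs Optimal Z-scan: %.1f%%" % improvement(optimal, low_power),
              "(reported %.1f%%)" % reported)
```

For M = 1024 the reduction relative to Z-scan comes out just above 90%, matching the headline figure.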


Energy Modelling

• Sequential Access Memory [Moon-CICC’02]
  - Configured as a circular buffer
  - Address sequencing logic and decoders replaced with a row sequencer to get low power and high speed
  - Banked implementation used for big memory
• Energy Modelling [Coumeri-TVLSI’00]
  - Empirical equations for modelling the energy of on-chip SRAM memory
  - Model parameters are Size, Bit Width, Access Mode
  - Individual equations for different memory components
  - To model SAM, the row decoder, column decoder and buffers are not considered


Conclusion

• A methodology to arrive at a low-power DWT architecture is proposed
• Co-optimization of memory architecture and access pattern is performed
• Up to 90% energy saving achieved
• The derived architecture depends on the target memory technology
  - Would lead to different architectures for ASIC and FPGA implementations


References:

• [Chiu-SIPS’03]: Mu-Yu Chiu et al. (2003). Optimal data transfer and buffering schemes for JPEG2000 encoder. IEEE Workshop on SIPS, Aug. 2003, pp. 177 – 182

• [Moon-CICC’02]: Joong-Seok Moon et al. (2002). Low-power sequential access memory design. Custom Integrated Circuits Conference, 2002, pp. 111 – 114

• [Coumeri-TVLSI’00]: Coumeri, S.L. et al. (2000). Memory modelling for System Synthesis. IEEE Trans. VLSI Systems, June 2000, pp. 327 – 334


Thank You

Questions!

Backup Slides


Discrete Wavelet Transform

• 2D wavelet transform:
  - 1st: 1D wavelet transform to all rows
  - 2nd: 1D wavelet transform to all columns
• Each Row/Column can be computed independently
• Store 4 values at line computation suspension

[Figure: data-flow diagram with node rows X(i), Y(2i+1), Y(2i), Z(2i+1), Z(2i) over sample indices 0 to 8. Colored arrows show multiplication by constants a, b, c, d defined in the JPEG2000 standard]


Buffer Structure

• The buffers are full at all times
• They are accessed like a circular FIFO
• General memory row decoder not required
  - use a counter
  - use a shift register loaded with a 1 initially
• Every write signal
  - Increments the counter
  - Shifts the register
• Store all 4 intermediate values in one column
  - No need for the column decoder
• This would be similar to Sequential Access Memory (SAM) [Moon-CICC’02]
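The access scheme described above can be modelled behaviourally in Python as a circular buffer whose row is selected by a one-hot shift register instead of an address decoder. The class name `SequentialAccessBuffer` and its methods are hypothetical; this is a software illustration of the idea under those assumptions, not the circuit from [Moon-CICC’02].

```python
class SequentialAccessBuffer:
    """Behavioural model of a circular buffer addressed by a one-hot shift register."""

    def __init__(self, depth, width=4):
        # Each row holds `width` values (4 intermediate values per suspended line).
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        # Shift register loaded with a single 1: it selects row 0 initially.
        self.select = [1] + [0] * (depth - 1)

    def _active_row(self):
        """Index of the row currently selected by the one-hot register."""
        return self.select.index(1)

    def read(self):
        """Read the selected row (done when a computation resumes)."""
        return list(self.rows[self._active_row()])

    def write(self, values):
        """Write the selected row (done at suspension), then shift to the next row."""
        if len(values) != self.width:
            raise ValueError("expected %d values" % self.width)
        self.rows[self._active_row()] = list(values)
        # Rotate the register by one position; it wraps around at the end.
        self.select = [self.select[-1]] + self.select[:-1]


if __name__ == "__main__":
    buf = SequentialAccessBuffer(depth=3)
    for k in range(3):
        buf.write([k, k, k, k])
    print(buf.read())  # the register has wrapped around to the first row
```

Because every access is a whole row and rows are visited strictly in order, neither a row decoder nor a column decoder appears in the model.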
