A POWER EFFICIENT ARCHITECTURE FOR 2-D DISCRETE WAVELET TRANSFORM
Rahul Jain, CoWare India
Preeti Ranjan Panda, IIT-Delhi
10 August 2006 10th IEEE VLSI Design And Test Symposium, 2006
Agenda
• Memory Power Optimization
• Existing Z-Scan based Schemes
• Low Power Z-Scan (Proposed Architecture)
• Results
• Conclusion
Memory Power Optimization
• Importance of Optimizing Memory System Energy
  • Many emerging applications like JPEG2000 are data intensive
  • Memory system can contribute up to 90% of the energy
• Concurrently Optimizing Memory Architecture and Accesses
  • Algorithm Level
    • Reduce memory requirement
    • Improve regularity of accesses
  • Build optimized memory architecture
    • Memory Partitioning
    • Custom Circuits
Z-Scan based Schemes [Chiu-SIPS’03]
• Suspending a DWT line computation
  • Store 4 intermediate values
• Z-Scan
  • Column Processing starts early
  • On-Chip Buffer Required = 4*M, where M = Image Tile height
• Optimal Z-Scan
  • EBCOT Code-Block size (CW*CH) considered
  • On-Chip Buffer Required = 4*M + 4*2*CW
  • Usually CW = CH = 64 (values used in experiments)
[Figure: scan pattern over a region labelled 2*CW wide and 2*CH high]
Low-Power Z-Scan (1)
• Generalize the Z-Scan
  • Compute r elements in a row
  • For Z-Scan, r = 2
  • For Optimal Z-Scan, r = 2*CW
  • On-Chip Buffer Required = 4*M + 4*r
[Figure: generalized scan pattern with strips of width r over a height of 2*CH]
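The buffer formula for the generalized scan can be checked with a short calculation. This is a minimal sketch: the function name `on_chip_buffer` and the example tile height M = 512 are assumptions for illustration, and the result is counted in buffer locations exactly as the slide's formula does.

```python
def on_chip_buffer(M: int, r: int) -> int:
    """On-chip buffer requirement from the slide: 4*M + 4*r locations."""
    return 4 * M + 4 * r


CW = 64   # code-block width used in the experiments
M = 512   # example tile height (assumed for illustration)

z_scan = on_chip_buffer(M, r=2)          # plain Z-Scan
optimal = on_chip_buffer(M, r=2 * CW)    # Optimal Z-Scan
print(z_scan, optimal)
```

With r = 2 the generalized formula gives 4*M + 8, which differs from the 4*M quoted for plain Z-Scan only by the constant 4*r = 8.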
Low-Power Z-Scan (2)
• r will be an integral sub-multiple of 2*CW (2*CW is an integer multiple of r)
  • This considers the Code Block Size
• 2 separate buffers used
  • Row Buffer (RB) = 4*M
  • Column Buffer (CB) = 4*r
• How to decide the value of r?
  • Size of CB ∝ r
  • RB Sleep Time ∝ r
[Figure: CB with r locations; RB stays in low-power mode between RB accesses]
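One way to picture the generalized scan is as a traversal order over the tile. The sketch below assumes each r-wide strip runs down all M rows before the next strip starts, which is how the 4*M row buffer reads; the exact traversal in [Chiu-SIPS’03] may differ. The names `strip_scan` and `N` (tile width) are hypothetical.

```python
from typing import Iterator, Tuple


def strip_scan(M: int, N: int, r: int) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) positions: r elements of one row, then the next row,
    working through the M x N tile one r-wide strip at a time (assumed order)."""
    for strip_start in range(0, N, r):
        for row in range(M):
            # The row computation resumes here, produces up to r elements,
            # and is suspended again before moving to the next row.
            for col in range(strip_start, min(strip_start + r, N)):
                yield (row, col)


order = list(strip_scan(M=4, N=4, r=2))
print(order[:6])
```

In this order every element extends a column computation, while each row computation is touched only once per r elements, which is the asymmetry the two buffers exploit.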
Memory Power Analysis (1)
• Let us assume that each element is computed in unit time (so Energy and Power can be used interchangeably)
• For a memory of size 2^n, let
  • Pa(2^n): memory access power
  • Ps(2^n): sleep mode / data retention mode power
  • Pw(2^n): wakeup power for each state transition from sleep mode to active mode
• Let Ps(2^n) = s * Pa(2^n) and Pw(2^n) = w * Pa(2^n)
  • s = 0.1, w = 0.33 (assumed for experiments)
• Buffer Accesses
  • Read at Resumption
  • Write at Suspension
Memory Power Analysis (2)
• Row Buffer Power
  • 2 accesses per r elements
  • RB in sleep mode for r-2 element computations
  • Wakeup RB once per row
• Power per r-element computation:
  Prow_buffer(r, M) = 2 * Pa(M) + (r-2) * Ps(M) + Pw(M)
[Figure: timeline showing row computation resuming after an RB wakeup, the RB in low-power mode, and row computation suspending]
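Substituting the ratios Ps = s * Pa and Pw = w * Pa into the row-buffer expression makes the dependence on r explicit. The derivation below is only a restatement of the slide's formula with the assumed experimental values s = 0.1 and w = 0.33; dividing by r gives the share charged to each element.

```latex
\begin{aligned}
P_{\text{row\_buffer}}(r, M) &= 2\,P_a(M) + (r-2)\,s\,P_a(M) + w\,P_a(M) \\
                             &= P_a(M)\,\bigl[\,2 + w + s\,(r-2)\,\bigr] \\[4pt]
\frac{P_{\text{row\_buffer}}(r, M)}{r} &= P_a(M)\left[\,s + \frac{2 + w - 2s}{r}\,\right] \\
                             &= P_a(M)\left[\,0.1 + \frac{2.13}{r}\,\right]
                                \qquad (s = 0.1,\; w = 0.33)
\end{aligned}
```

The per-element row-buffer cost therefore falls toward the sleep floor s * Pa(M) as r grows.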
Memory Power Analysis (3)
• Column Buffer Power
  • 1 access per element
• Power consumption per element computation:
  Pcol_buffer(r) = Pa(r)
• Power per 2-D DWT Element Computation:
  Prow_buffer(r, M)/r + Pcol_buffer(r)
[Figure: timeline marking where column computation suspends and resumes]
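The trade-off in r can be explored numerically. The sketch below implements the two slide formulas as written; the access-power model `p_access` (square-root growth with memory size) is purely a HYPOTHETICAL placeholder, since the slides take Pa from an empirical SRAM model, so the r it selects is illustrative only.

```python
import math

S, W = 0.1, 0.33  # sleep and wakeup ratios assumed in the slides


def p_access(size: float) -> float:
    """HYPOTHETICAL access-power model; not the empirical model used by the authors."""
    return math.sqrt(size)


def p_row_buffer(r, M, pa=p_access):
    """Row-buffer power per r elements: 2 accesses, r-2 sleep slots, 1 wakeup."""
    return 2 * pa(M) + (r - 2) * S * pa(M) + W * pa(M)


def p_col_buffer(r, pa=p_access):
    """Column-buffer power per element: one access to a buffer of size r."""
    return pa(r)


def p_per_element(r, M, pa=p_access):
    """Power per 2-D DWT element: row share plus column access."""
    return p_row_buffer(r, M, pa) / r + p_col_buffer(r, pa)


def best_r(M, candidates=(2, 4, 8, 16, 32, 64, 128), pa=p_access):
    """Candidate r with the lowest per-element power under the given model."""
    return min(candidates, key=lambda r: p_per_element(r, M, pa))


for M in (32, 64, 128, 256, 512):
    print(M, best_r(M))
```

Passing a different `pa` callable swaps in another memory model without touching the formulas.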
Variation of Power with r
[Figure: Energy (J), from 0.00E+00 to 6.00E-10, versus value of r (2, 4, 8, 16, 32, 64, 128), one curve each for M = 32, 64, 128, 256 and 512; annotations mark r=16 and r=32]
Power Implications of Banking (1)
• Banked Buffer
  • Increases the average idleness of each buffer
  • Lower Access Power
  • Predictable state changes, no timing overheads
• Let there be b RB banks and c CB banks
• Average RB power per element:
  Prow = [Power of bank in use * M/b + Sleep Power * (M - M/b)] / M
       = [{Prow_buffer(r, M/b) / r} * M/b + Ps(M/b) * (M - M/b)] / M
• Each bank is woken up once per M*r elements
  • Additional Row Buffer Wakeups per Element = b/(M*r)
Power Implications of Banking (2)
• Average column-buffer power per element:
  Pcol = [{Pcol_buffer(r/c)} * r/c + Ps(r/c) * (r - r/c)] / r
• No. of Column Buffer Wakeups per Element = c/r
• Additional Wakeup Power:
  Pwakeups = [Pw(M/b) * b/(M*r)] + [Pw(r/c) * c/r]
• MUX power considered
• Total Power per Element:
  Prow + Pcol + Pwakeups + Pmux
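The banked expressions can be combined into one function for experimentation. This is a sketch under stated assumptions: the wakeup rate for the row buffer is read as b/(M*r), `p_access` is a HYPOTHETICAL square-root model standing in for the authors' empirical one, and `p_mux` defaults to 0 because the slides give no value for the MUX power.

```python
import math

S, W = 0.1, 0.33  # sleep and wakeup ratios assumed in the slides


def p_access(size: float) -> float:
    """HYPOTHETICAL access-power model; not from the slides."""
    return math.sqrt(size)


def p_row_buffer(r, size, pa):
    """Unbanked row-buffer power per r elements for a buffer of the given size."""
    return 2 * pa(size) + (r - 2) * S * pa(size) + W * pa(size)


def banked_power(r, M, b, c, pa=p_access, p_mux=0.0):
    """Total power per element with b row-buffer banks and c column-buffer banks."""
    rb_bank = M / b  # size of one row-buffer bank
    cb_bank = r / c  # size of one column-buffer bank

    # Active bank weighted by its share, remaining banks held in sleep mode.
    p_row = ((p_row_buffer(r, rb_bank, pa) / r) * rb_bank
             + S * pa(rb_bank) * (M - rb_bank)) / M
    p_col = (pa(cb_bank) * cb_bank + S * pa(cb_bank) * (r - cb_bank)) / r

    # Extra wakeups: b per M*r elements for the RB, c per r elements for the CB.
    p_wakeups = W * pa(rb_bank) * b / (M * r) + W * pa(cb_bank) * c / r
    return p_row + p_col + p_wakeups + p_mux


print(banked_power(r=64, M=512, b=8, c=4))
```

Because `p_mux` is zero here, this toy model has no penalty for adding banks; a realistic MUX cost would be needed before searching over b and c.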
r vs Power (Banked Case, M=512)
Min Power with r=64, c=4, b=8
Energy Consumption Comparison
M      Z-scan       Optimal Z-scan   Low-Power Z-scan   r    c   b   % imp
       (10^-11 J)   (10^-11 J)       (10^-11 J)
32     23.4         29.1             8.08               32   4   4   72.2
64     25.5         29.3             8.13               64   4   4   72.3
128    29.9         29.7             8.18               64   4   8   72.5
256    38.5         30.6             8.29               64   4   8   72.9
512    55.8         32.3             8.49               64   4   8   73.7
1024   90.3         35.8             8.89               64   4   8   75.2
Up to 90% and 75% improvement over Z-Scan and Optimal Z-Scan respectively
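The quoted percentages can be recomputed from the table. The sketch below uses the energy figures as listed (in units of 10^-11 J); the helper name `improvement` is an assumption. Recomputing suggests the "% imp" column is measured against Optimal Z-scan.

```python
# (M, Z-scan, Optimal Z-scan, Low-Power Z-scan), energies in 10^-11 J
rows = [
    (32, 23.4, 29.1, 8.08),
    (64, 25.5, 29.3, 8.13),
    (128, 29.9, 29.7, 8.18),
    (256, 38.5, 30.6, 8.29),
    (512, 55.8, 32.3, 8.49),
    (1024, 90.3, 35.8, 8.89),
]


def improvement(baseline: float, proposed: float) -> float:
    """Percentage energy reduction of the proposed scheme relative to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


for M, z, opt, low in rows:
    print(M, round(improvement(z, low), 1), round(improvement(opt, low), 1))
```

For M = 1024 this gives about 90.2% over Z-scan and 75.2% over Optimal Z-scan, in line with the summary figures.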
Energy Modelling
• Sequential Access Memory [Moon-CICC’02]
  • Configured as a circular buffer
  • Address sequencing logic and decoders replaced with a row sequencer to get low power and high speed
  • Banked implementation used for big memory
• Energy Modelling [Coumeri-TVLSI’00]
  • Empirical equations for modelling energy of on-chip SRAM memory
  • Model parameters are Size, Bit Width, Access Mode
  • Individual equations for different memory components
  • To model SAM, the Row Decoder, Column Decoder and Buffers are not considered
Conclusion
• A methodology to arrive at a Low-Power DWT architecture proposed
• Co-Optimization of Memory Architecture and Access pattern done
• Up to 90% energy saving achieved
• The derived architecture depends on the target memory technology
  • Would lead to different architectures for ASIC and FPGA implementations
References:
• [Chiu-SIPS’03]: Mu-Yu Chiu et al. (2003). Optimal data transfer and buffering schemes for JPEG2000 encode. IEEE Workshop on SIPS, Aug. 2003, pp. 177–182
• [Moon-CICC’02]: Joong-Seok Moon et al. (2002). Low-power sequential access memory design. Custom Integrated Circuits Conference, 2002, pp. 111–114
• [Coumeri-TVLSI’00]: S. L. Coumeri et al. (2000). Memory modelling for System Synthesis. IEEE Trans. VLSI Systems, June 2000, pp. 327–334
Thank You
Questions!
Backup Slides
Discrete Wavelet Transform
• 2D wavelet transform:
  • 1st: 1D wavelet transform to all rows
  • 2nd: 1D wavelet transform to all columns
• Each Row/Column can be computed independently
• Store 4 values at line computation suspension
[Figure: data-flow diagram over sample positions 0 to 8, from inputs X(i) through intermediate values Y(2i) and Y(2i+1) to outputs Z(2i) and Z(2i+1)]
Colored arrows show multiplication by constants a, b, c, d defined in the JPEG2000 standard
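The row-then-column structure of the 2D transform can be sketched independently of the filter. In the example below a one-level Haar step (`haar_1d`) is a stand-in chosen for brevity; it is NOT the JPEG2000 filter with constants a, b, c, d, and the function names are hypothetical. Row and column lengths are assumed to be even.

```python
from typing import Callable, List

Line = List[float]


def haar_1d(x: Line) -> Line:
    """Stand-in 1D transform (one-level Haar): averages first, then differences."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def dwt_2d(image: List[Line], transform_1d: Callable[[Line], Line] = haar_1d) -> List[Line]:
    """Apply the 1D transform to every row, then to every column of the result."""
    rows_done = [transform_1d(row) for row in image]        # 1st: all rows
    columns = [list(col) for col in zip(*rows_done)]        # transpose
    cols_done = [transform_1d(col) for col in columns]      # 2nd: all columns
    return [list(row) for row in zip(*cols_done)]           # transpose back


print(dwt_2d([[1.0, 2.0], [3.0, 4.0]]))
```

Each call to `transform_1d` is independent of the others, which is what allows a line computation to be suspended and resumed in a different order.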
Buffer Structure
• The buffers are always full
• They are accessed like a circular FIFO
• General Memory Row Decoder not required
  • use a counter
  • use a shift register loaded with a 1 initially
• Every Write Signal
  • Increments the counter
  • Shifts the Register
• Store all the 4 intermediate values in one Column
  • No need for the Column Decoder
• This would be similar to Sequential Access Memory (SAM) [Moon-CICC’02]
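A behavioural model helps show why no address decoder is needed. The sketch below is an assumption-laden software analogy, not the circuit: the class name `SequentialBuffer` and its methods are hypothetical, the one-hot `select` integer plays the role of the shift register, and each entry holds the 4 intermediate values of one suspended line.

```python
class SequentialBuffer:
    """Always-full circular buffer selected by a one-hot shift register."""

    def __init__(self, initial_entries):
        self.rows = list(initial_entries)
        self.select = 1  # shift register loaded with a single 1: row 0 selected

    def _index(self) -> int:
        # Position of the single set bit identifies the selected row.
        return self.select.bit_length() - 1

    def read(self):
        """Read at resumption: return the selected entry without advancing."""
        return self.rows[self._index()]

    def write(self, entry) -> None:
        """Write at suspension: overwrite the selected entry, then shift the register."""
        self.rows[self._index()] = entry
        self.select <<= 1
        if self.select >> len(self.rows):  # shifted past the last row: wrap around
            self.select = 1


buf = SequentialBuffer([(0, 0, 0, 0)] * 3)
state = buf.read()           # resume line 0
buf.write((1, 1, 1, 1))      # suspend line 0, move on to line 1
print(buf.select)
```

Reads and writes always hit the currently selected entry, so the access pattern is fully determined by the write signal.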