Thesis Documentation Lit Survey

8/8/2019 Thesis Documentation Lit Survey

http://slidepdf.com/reader/full/thesis-documentation-lit-survey 1/12

Thesis Documentation -1

Literature Survey

Title: Low-Power Multi-Port Register File for DSPs

Introduction:

In recent years, the demand for portable operation of all types of electronic systems has become clear. A major factor in the weight and size of portable devices is the number of batteries, which is directly determined by the power dissipated by the electronic circuits. In addition, the cost of providing power and the associated cooling has generated significant interest in power reduction even in non-portable applications that have access to a power source.

In particular, digital signal processors (DSPs) are widely used in mobile applications such as mobile phones, and power consumption is critical in these computation-intensive processors. Almost all DSP algorithms demand the following:

1. Fast computational blocks such as MAC (multiply-and-accumulate) units.

2. Multiple sets of data accessible simultaneously.

3. A large number of registers for implementing delays and accumulation.

A large multi-ported register file is therefore indispensable for DSPs. These register files are so large that they often occupy nearly 40% of the total area and account for 40-50% of the data-path power. An energy-efficient, large multi-ported register file is thus a necessity.

In surveying the literature in this area, the first work to mention is ref [1], which primarily guides our thinking on low-power design of any electronic system. It investigates all aspects of system design with the goal of reducing power consumption. The authors assume that the application to be implemented is known, which is partially true in our case since we consider register files for DSPs. Fundamentally, CMOS circuits dissipate power whenever capacitances are switched. Power can be reduced by minimizing the switched capacitance through operation reduction, choice of number representation, exploitation of signal correlations, resynchronization to minimize glitching, logic design, circuit design and physical design. We take only the gist of their work, since we confine ourselves to signal processing.
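
The dependence of switching power on capacitance can be illustrated with the commonly used first-order model P = alpha * C * Vdd^2 * f. The sketch below is only an illustration of that model; the function name and all numerical values are hypothetical and are not taken from ref [1].

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """First-order CMOS switching power in watts: alpha * C * Vdd^2 * f.

    alpha  : switching activity (fraction of cycles with a charging transition)
    c_load : switched capacitance in farads
    vdd    : supply voltage in volts
    freq   : clock frequency in hertz
    """
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical operating point, chosen only for illustration.
baseline = dynamic_power(alpha=0.2, c_load=1e-12, vdd=1.8, freq=100e6)
half_cap = dynamic_power(alpha=0.2, c_load=0.5e-12, vdd=1.8, freq=100e6)
half_vdd = dynamic_power(alpha=0.2, c_load=1e-12, vdd=0.9, freq=100e6)

# Power relative to the baseline for each change.
print(half_cap / baseline, half_vdd / baseline)
```

In this model, halving the switched capacitance halves the power, while halving the supply voltage at the same capacitance and frequency reduces it to a quarter.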

Maintaining a given level of computation, or throughput, is a common requirement in signal processing applications: there is no advantage in performing the computation faster than some given rate, since the processor simply has to wait until further processing is required. This enables an architecture-driven voltage scaling strategy, in which a reduction in operating voltage reduces energy consumption and the resulting loss in logic speed is compensated through parallel architectures. The authors examine various aspects of design and show their effect on power dissipation, such as:

1.  Capacitive voltage transitions

a.  Logic function choice: for example, NOR and NAND gates yield different output switching probabilities for the same input characteristics.

b.  Logic style choice: dynamic and pseudo-NMOS logic dissipate more power than static logic.

c.  Input signal characteristics.

d.  Circuit topology: the manner in which logic gates are interconnected can have a strong influence on switching activity.

2.  Leakage component of Power

a.  Reverse biased diode leakage.

b.  Sub-threshold conduction.

3.  Short-Circuit current component of Power

They also discuss physical-, circuit- and logic-level aspects of design from a power point of view.
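
Item 1(a) above can be made concrete with a small calculation. The sketch below assumes static CMOS gates whose two inputs are statistically independent, both across inputs and from cycle to cycle; under that assumption the probability of a power-consuming 0->1 output transition is P(out = 0) * P(out = 1). The function names and the input probabilities are illustrative assumptions.

```python
def p_high_nand(pa, pb):
    """Probability that a 2-input NAND output is 1, given P(A=1) and P(B=1)."""
    return 1.0 - pa * pb


def p_high_nor(pa, pb):
    """Probability that a 2-input NOR output is 1, given P(A=1) and P(B=1)."""
    return (1.0 - pa) * (1.0 - pb)


def transition_probability(p_high):
    """Probability of a 0->1 output transition between two independent cycles."""
    return (1.0 - p_high) * p_high


# Same input statistics applied to both gates (hypothetical values).
for p in (0.5, 0.2):
    nand = transition_probability(p_high_nand(p, p))
    nor = transition_probability(p_high_nor(p, p))
    print(f"P(in=1)={p}: NAND {nand:.4f}, NOR {nor:.4f}")
```

With equiprobable inputs (0.5) the two gates have the same transition probability, but with skewed inputs (0.2) the NOR output switches far more often than the NAND output, which is the kind of difference the item refers to.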

The next work in this field is ref [2], which takes us closer to our core task of constructing register files. This work, together with the next two references, provides examples of energy-efficient register file design.

Consider the basic block diagram of a typical register file, which is similar to an SRAM except that registers are more delay-optimized and occupy much more area per bit than SRAM.

Figure: Block diagram of Typical Register file

Source: ref[2]

The write decoder and read decoder select the word line to be written or read, respectively. The storage cell contains two inverters connected back to back, just as in an SRAM cell. All the bit lines are precharged in the first half of the clock cycle, and in the second half they either remain at 1 or discharge, depending on the value stored in the cell. Two kinds of cell structure arise from the way the cells are sensed: 1. single-ended and 2. differential. The figure above shows the differential type, in which a sense amplifier (S.A.) senses the bit line quickly from the voltage difference between bit and bit_bar. A detailed treatment of both kinds follows later.
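
The decode, precharge and sense sequence described above can be sketched as a purely behavioral model. This is a logical illustration only, not a circuit simulation; the function names are hypothetical, and the polarity (a stored 1 discharges bit_bar, a stored 0 discharges bit) is an assumption.

```python
def decode(address, n_entries):
    """Return one-hot word lines: exactly one line is 1 for a valid address."""
    if not 0 <= address < n_entries:
        raise ValueError("address out of range")
    return [1 if i == address else 0 for i in range(n_entries)]


def read_bit_differential(column, word_lines):
    """Read one bit position.

    column     : stored bit of this position for every entry
    word_lines : one-hot word lines produced by decode()
    """
    # First half of the clock: both bit lines are precharged to 1.
    bit, bit_bar = 1, 1
    # Second half: the selected cell discharges one of the two lines.
    for stored, wl in zip(column, word_lines):
        if wl:
            if stored:
                bit_bar = 0  # assumed polarity
            else:
                bit = 0
    # The sense amplifier resolves the difference between bit and bit_bar.
    return 1 if bit > bit_bar else 0


column = [1, 0, 1, 1]  # hypothetical contents of one bit position, 4 entries
print([read_bit_differential(column, decode(a, 4)) for a in range(4)])
```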

The figure above shows a register file with a single read port and a single write port. A multi-port register file is similar, except that it has additional sets of decoders and output stages (one per port) and some added control circuitry, and looks as follows.

Source: ref[2]

Looking at the register file, we can divide the total power consumption into that of the decoder, the control circuitry and the bank. The output stage, including buffers, is counted as bank power. The authors report the simulated power of these three components as follows.

Figure: Power dissipated by different components of a register file

Source : ref[2] 

The graph shows clearly that the bank contributes far more to the power consumption than the other components in all cases. We therefore take a detailed look at the structure of the register cell. As seen earlier, there are basically two kinds of structure:

1.  Differential-end Structure:

The following figure shows a traditional 1-bit N-entry 2R1W differential-end register file circuit. WWL is the write word line control signal. The gray region is the storage cell. In a write operation, the write decoder decodes the address of the Nth entry and activates the Nth WWL line. M1 and M2 then turn ON, which allows the storage cell to be written with the input data. The read decoder and the RWL signal perform the same function during a read operation.

Source: ref[2]  

2.  Single-end structure

The following figure shows a typical 1-bit N-entry single-ended register file.

Source: ref[2]

Its operation is a modified form of that of the differential-end structure. An inverter is present at the output in order to drive the bit line.

Comparison of both circuits:

• Differential-end circuit, for each read cycle:

 –  Either bit or bit_bar is active => more power

 –  More transistors => more leakage power

 –  But access time is low

 –  Suitable for register files with many entries

• Single-end circuit:

 –  Access time is high

 –  Not suitable for register files with many entries (due to large bit-line capacitance)

Source: ref[2] 

The authors build a register file with many entries from a combination of smaller register files, which can be designed using single-ended cells, and connect the bit lines of the smaller files to an AND gate. They build a 128-entry register file from four 32-entry register files, each with a separate precharge circuit; 32 four-input AND gates are required to combine them. This reduces the bit-line capacitance, because each bank of the smaller register file has its own precharge circuit.
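
A behavioral sketch of this banking scheme for one bit position is given below. It assumes that each local bit line is precharged to 1, that only the selected cell can discharge its bank's line, and that a stored 0 is what discharges it; the polarity and the names are assumptions for illustration, not details taken from ref [2].

```python
BANKS = 4
ENTRIES_PER_BANK = 32


def read_bit_banked(column, address):
    """Read one bit position of a 128-entry file built from four 32-entry banks.

    column  : 128 stored bits (one bit position of every entry)
    address : entry to read, 0..127
    """
    bank_sel, local = divmod(address, ENTRIES_PER_BANK)
    local_bitlines = []
    for bank in range(BANKS):
        bitline = 1  # each bank has its own precharge circuit
        if bank == bank_sel:
            stored = column[bank * ENTRIES_PER_BANK + local]
            if stored == 0:
                bitline = 0  # assumed polarity: a stored 0 discharges the line
        local_bitlines.append(bitline)
    # A 4-input AND gate combines the local bit lines; unselected banks
    # stay at 1 and therefore do not affect the result.
    return int(all(local_bitlines))


column = [1 if i % 3 == 0 else 0 for i in range(BANKS * ENTRIES_PER_BANK)]
print(all(read_bit_banked(column, a) == column[a] for a in range(128)))
```

Each local bit line is loaded by 32 cells instead of 128, which is where the capacitance reduction comes from.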

The next work is ref [3], an example of exploiting the kind of operations performed by a DSP. The authors consider a conventional register file architecture, which they call the “word-level parallel architecture”, for implementing multi-channel data processing hardware. It offers a higher degree of random access than this application requires. They therefore propose a new register file architecture, the “word-level serial architecture”, whose limited degree of random access is sufficient for this application. The following figure helps in understanding this.

Figures: (a) Word-level parallel and (b) word-level serial architectures

Source: ref[3] 

Registers storing the state values of the same channel are colored identically, and the different state registers of a channel are numbered 1, 2, 3, ..., r. If there are n channels, n*r registers are required. In the parallel architecture, all the registers are accessible to the ALU. In the serial architecture, only the registers of one channel are accessible to the ALU.

This reduces the number of registers seen by the bus while writing or reading data, thereby reducing the capacitance seen by the ports.

Power dissipation is reduced in this architecture because of:

1.  Reduction in the capacitance of the ports.

2.  Reduction in the size of the MUX needed.

3.  Reduction in the number of clock-gating cells required, from (n*r) to r.

Here n is the number of rows (channels) and r is the number of columns (states) in the WL-serial architecture, so n*r gives the total number of register entries.

The first two reasons are evident from figures 2 and 3. The third reason is explained by the following figure:

Source: ref[3]

Usually, when writing to a register entry, the data is made available to all the entries, and the enable signals from the write decoder drive the clock-gating cells so that only the required entry is written. The number of clock-gating cells required is therefore smaller in the word-level serial architecture.
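
The counts quoted above for the two architectures can be tabulated with a few lines of code. The function and key names are hypothetical, and the example values of n and r are chosen only for illustration.

```python
def register_file_costs(n_channels, r_states):
    """Compare WL-parallel and WL-serial register files with n*r registers.

    Uses only the counts stated in the text: in the parallel architecture
    all n*r registers are visible to the ALU and each needs a clock-gating
    cell; in the serial architecture only the r registers of one channel
    are visible, and r clock-gating cells suffice.
    """
    total = n_channels * r_states
    return {
        "registers": total,
        "parallel": {"registers_on_port": total, "clock_gating_cells": total},
        "serial": {"registers_on_port": r_states, "clock_gating_cells": r_states},
    }


costs = register_file_costs(n_channels=8, r_states=4)  # hypothetical sizes
print(costs["parallel"], costs["serial"])
```

For 8 channels of 4 states, the ports see 4 registers instead of 32, and the clock-gating cell count drops by the same factor n.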

The same idea is extended to single-channel processing at the end of the work by partitioning the computations into two parts and computing the values with half the number of computational units. The following figure makes this clearer.

The next work (ref [4]) provides more practical guidelines for our design. Although it does not emphasize low-power design, it gives a basic idea of how multi-ported register files are designed. The authors design a 16-port (10 read, 6 write) register file that provides synchronous reads and asynchronous writes. They ensure that, in every clock cycle, all the reads happen after the write requests are fulfilled. This is done by generating two out-of-phase local clocks, for the read circuitry and the write circuitry, from the global clock available throughout the processor.

Every storage cell is associated with 16 ports, which calls for a large cell to accommodate these lines. They therefore choose a 7-transistor single-ended cell design, because a differential design would double the number of lines required in each cell, increasing the cell area and hence the length of the lines, which slows down register file operation.

Figure: structure of cell

Source: ref[4]

As we can see, 10 read ports and 6 write ports are connected to the back-to-back inverters. The additional buffer drives the large load of the ports. The line ‘Neq’ is the only addition; it reduces the delay of the read ports by providing a lower-resistance path while discharging the bit line.

The floor plan of the register file is shown in the following figure. The sixteen 32-bit registers are placed so that 16 bits of each entry sit on either side of the decoders. The sixteen decoders sit in the middle and drive the word lines on both sides; the two biggest blocks are the memory cells. Each read port has an output latch to buffer the read value and drive it onto the bus. Each write port has an input register to latch the value to be written from the bus. In addition, there is one address register per port to store the address to be written or read. Clock control circuitry generates the local clocks from the global clock.

Figure: Floor-plan of the register file referred in ref[4]

On the positive edge of the system clock, the addresses, access enable signals and data are latched into the input registers. The write decoder whose enable signal is high then translates the address into word lines. Only one word line is pulled up and the others stay low. In practice, the data is driven at the same time, and the write bit line (WBL) arrives earlier than the write word line (WWL). So when the selected word line reaches logic ’1’, the data is already available and can be stored into the memory cells quickly. The read operation is similar to the write operation, except that the read bit line (RBL) is precharged until the word line goes high and is then discharged through the read cell. As soon as the RBL is stable, the output latch amplifies and outputs the data. The timing diagram is shown below.

Figure: Timing diagram of register file

Source: ref [4] 
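
The write-before-read ordering within one clock cycle can be sketched with a behavioral model. The sizes (sixteen 32-bit entries, 10 read ports, 6 write ports) follow the description above; the class and method names are hypothetical, and rejecting two writes to the same entry is an assumption of this sketch, since the text does not say how ref [4] handles that case.

```python
class MultiPortRegisterFile:
    """Behavioral model only; it does not represent the circuit of ref [4]."""

    def __init__(self, entries=16, width=32, read_ports=10, write_ports=6):
        self.mask = (1 << width) - 1
        self.regs = [0] * entries
        self.read_ports = read_ports
        self.write_ports = write_ports

    def cycle(self, writes, reads):
        """Model one clock cycle.

        writes : list of (address, data) pairs, at most one per write port
        reads  : list of addresses, at most one per read port
        """
        if len(writes) > self.write_ports or len(reads) > self.read_ports:
            raise ValueError("more requests than ports")
        targets = [addr for addr, _ in writes]
        if len(set(targets)) != len(targets):
            # Assumption: conflicting writes are treated as an error here.
            raise ValueError("two write ports target the same entry")
        # Write phase (first local clock): all writes complete first.
        for addr, data in writes:
            self.regs[addr] = data & self.mask
        # Read phase (second, out-of-phase local clock).
        return [self.regs[addr] for addr in reads]


rf = MultiPortRegisterFile()
print(rf.cycle(writes=[(3, 0xDEADBEEF)], reads=[3, 0]))
```

Because the write phase precedes the read phase, a read of entry 3 in the same cycle returns the value just written.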

The next work, ref [5], is again a design example, but with some emphasis on low-power design. The authors choose single-ended read ports and double-ended write ports in order to achieve a quicker write. They design a register file with 9 write ports and 17 read ports. The read operation is the same as in the previous example. A differential current-mode write is employed to meet the performance requirements. The following figure shows the structural difference in the cell with respect to the previous example.

Figure: structure of register file referred in ref[5]

The work in ref [6] introduces a novel technique, conditional charge sharing, which detects bit-line flips and avoids unnecessary precharging of the bit lines when no flip occurs. This reduces the power dissipated during writes that do not require bit-line flips.
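
To get a feel for how often bit lines actually need to flip, one can count the bits that differ between consecutive written values. The sketch below does only that; it is not a model of the circuit in ref [6], and it assumes that the write bit lines hold the previously written value until the next write. All names and data values are hypothetical.

```python
def bitline_flips(previous, new, width=32):
    """Number of bit positions that differ between two written words."""
    mask = (1 << width) - 1
    return bin((previous ^ new) & mask).count("1")


def flips_per_write(values, width=32, initial=0):
    """Bit-line flips for each write in a sequence, starting from `initial`."""
    mask = (1 << width) - 1
    flips, state = [], initial & mask
    for value in values:
        flips.append(bitline_flips(state, value, width))
        state = value & mask
    return flips


# Hypothetical write sequence: the second write repeats the first value.
print(flips_per_write([0x0000FFFF, 0x0000FFFF, 0x0000FF00]))  # [16, 0, 8]
```

A write with zero flips, such as the repeated value here, is the case in which precharging the bit lines would be wasted effort.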

REFERENCES

1.  A.P. Chandrakasan, R.W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proceedings of the IEEE, vol. 83, no. 4, pp. 498-523, Apr. 1995.

2.  Ting-Sheng Jau, Wei-Bin Yang, Chung-Yu Chang, "Analysis and Design of High Performance, Low Power Multiple Ports Register Files," in IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 1453-1456, 4-7 Dec. 2006, Singapore.

3.  M. Mueller, A. Wortmann, S. Simon, S. Woke, S. Buch, M. Wroblewski, J.A. Nossek, "Low power register file architecture for application specific DSPs," in IEEE International Symposium on Circuits and Systems (ISCAS), pp. VI-89 to VI-92, 26-29 May 2002, Arizona.

4.  Yu Qian, Wang Dong-hui, Zhang Tie-jun, Hou Chao-huan, "A design of 500MHz 10-read 6-write register file," ASIC, 2005. ASICON 2005. 6th International Conference On, vol. 1, pp. 311-315, 24-0 Oct. 2005.

5.  Shenglong Li, Zhaolin Li, Fang Wang, "Design of a high-speed low-power multiport register file," Microelectronics & Electronics, 2009. PrimeAsia 2009. Asia Pacific Conference on Postgraduate Research in, pp. 408-411, 19-21 Jan. 2009.

6.  Kimish Patel, Wonbok Lee, M. Pedram, "Minimizing power dissipation during write operation to register files," Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, pp. 183-188, 27-29 Aug. 2007.

7.  Andrei Pavlov, Manoj Sachdev, "SRAM circuit design and operation," CMOS SRAM Circuit Design and Parametric Test in Nano-Scaled Technologies, vol. 40, pp. 13-38, 2008.

8.  Masaki Kondo, Hiroshi Nakamura, "A Small, Fast and Low-Power Register File by Bit-Partitioning," in 11th International Symposium on High-Performance Computer Architecture (HPCA-11), pp. 40-49, 12-16 Feb. 2005, San Francisco.

9.  K.K. Parhi, VLSI Digital Signal Processing Systems, Wiley-Interscience, 1999.

10.  Jessica Hui-Chun Tseng, "Energy Efficient Register File Design," M.S. Thesis, Dept. of Electrical Engg. and Computer Science, Massachusetts Inst. of Tech., Dec. 1999.

11.  D. Suvakovic, C.A.T. Salama, "Guidelines for use of registers and multiplexers in low power low voltage DSP systems," in Proceedings of the 8th Great Lakes Symposium on VLSI, pp. 26-29, 19-21 Feb. 1998.