low power digital

1 LOW-POWER VLSI DESIGN:

AN OVERVIEW

1.1 WHY LOW-POWER?

Historically, VLSI designers have used circuit speed as the "performance" metric. Large gains, in terms of performance and silicon area, have been made for digital processors, microprocessors, DSPs (Digital Signal Processors), ASICs (Application Specific ICs), etc. In general, "small area" and "high performance" are two conflicting constraints. The IC designers' activities have been involved in trading off these constraints. Power dissipation was not a design criterion but an afterthought. In fact, power considerations have been the ultimate design criteria in special portable applications such as wristwatches and pacemakers for a long time. The objective in these applications was minimum power for maximum battery lifetime.

Recently, power dissipation has become an important constraint in a design. Several reasons underlie the emergence of this issue. Among them we cite:

Battery-powered systems such as laptop/notebook computers, electronic organizers, etc. The need for low power in these systems arises from the need to extend battery life. Many portable electronics use rechargeable Nickel-Cadmium (NiCd) batteries. Although the battery industry has been making efforts to develop batteries with higher energy capacity than that of NiCd, a strident increase does not seem imminent. The expected improvement of the energy density is 40% by the turn of the century. With recent NiCd batteries, the energy density is around 20 Watt-hour/pound and the voltage is around 1.2 V. So, for example, for a notebook consuming a typical power of 10 Watts and using 1.5 pound of batteries, the time of operation between recharges is 3 hours. Even with advanced battery technologies, such as Nickel-Metal Hydride (Ni-MH), which provide large energy density characteristics (~30 Watt-hour/pound), the lifetime of the battery is still low. Since battery technology has offered limited improvement, low-power design techniques are essential for portable devices.

Low-power design is not only needed for portable applications but also to reduce the power of high-performance systems. With large integration density and improved speed of operation, systems with high clock frequencies are emerging. These systems use high-speed products such as microprocessors. The cost associated with the packaging, cooling and fans required by these systems to remove the heat is increasing significantly. Table 1.1 shows the power consumption of various microprocessors that operate in the frequency range of 66 to 300 MHz. This table demonstrates that, at higher frequencies, the power dissipation is excessive.

Another issue related to high power dissipation is reliability. With the generation of high on-chip temperatures, failure mechanisms are provoked [8]. Among them, we cite silicon interconnect fatigue, package-related failure, electrical parameter shift, electromigration, junction fatigue, etc.

In addition, there is a trend to keep computers from using more than a 5% share of the total US power budget [9]. Note that 50% of office power is used by PCs. Since the processors' frequency is increasing, which results in increased power, low-power design techniques are prerequisites.
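The notebook battery estimate quoted above (20 Watt-hour/pound, 1.5 pound of batteries, 10 Watts) can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ideal battery delivering its full rated energy at a constant load; the function name is ours and purely illustrative.

```python
def battery_runtime_hours(energy_density_wh_per_lb, battery_weight_lb, load_watts):
    """Runtime = stored energy / average power draw (ideal battery assumed)."""
    stored_energy_wh = energy_density_wh_per_lb * battery_weight_lb
    return stored_energy_wh / load_watts


# Figures taken from the text: NiCd ~20 Wh/lb, Ni-MH ~30 Wh/lb,
# 1.5 lb of batteries, 10 W notebook.
nicd_hours = battery_runtime_hours(20.0, 1.5, 10.0)
nimh_hours = battery_runtime_hours(30.0, 1.5, 10.0)
print(f"NiCd: {nicd_hours:.1f} h, Ni-MH: {nimh_hours:.1f} h")
```

Under these assumptions the NiCd case gives the 3 hours stated in the text, and moving to Ni-MH at the same weight adds only about 1.5 hours, which illustrates why battery progress alone is not sufficient.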

The power dissipation issues and the devices' reliability problems, when they are scaled down to 0.5 µm and below, have driven the electronics industry to adopt a supply voltage lower than the old standard, 5 V. The new industry standard for IC operating voltage is 3.3 V (± 10%). The effect of lowering the voltage to much lower values can be impressive in terms of power saving. Not only is the power reduced, but also the weight and volume associated with batteries in battery-operated systems.

Table 1.2 Examples of low-power microprocessors

Processor      Clock (MHz)   Technology (µm)   VDD (V)   Power (W)   Ref.
PowerPC 603    80            0.5               3.3       2.2         [10]
IBM 486SLC2    66            0.8               3.3       1.8         [11]
MIPS R4200     80            0.64              3.3       1.8         [12]
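To give a feeling for the power saving obtained by lowering the supply voltage, the sketch below assumes that the dissipation is dominated by the dynamic (switching) component, which is proportional to the square of VDD at fixed capacitance and frequency. This assumption, and the function name, are ours; static and short-circuit components are ignored.

```python
def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power after/before a supply change,
    assuming P is proportional to VDD**2 at fixed C and f."""
    return (v_new / v_old) ** 2


for vdd in (3.3, 2.5, 1.8):
    ratio = dynamic_power_ratio(vdd, 5.0)
    print(f"5 V -> {vdd} V: power scales by {ratio:.2f}")
```

Under this assumption, moving from 5 V to 3.3 V alone reduces the dynamic power to a little under half of its original value.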

1.2 LOW-POWER APPLICATIONS

Low-power design is opening a new era in VLSI technology, as it impacts many applications, such as:

Battery-powered portable systems, for example notebooks, palmtops, CDs, language translators, etc. These systems represent an important growing market in the computer industry. High-performance capabilities, comparable to those of desktops, are demanded. Several low-power microprocessors have been designed for these computers. Table 1.2 shows some examples of these low-power processors. However, these circuits still consume significant power, on the order of 1 to 3 Watts. These systems have their power dissipation dominated by I/O devices such as hard disk drives and LCD displays. The total expected power dissipation of notebooks is 2 Watts with 4 pounds weight and daily recharge.

Electronic pocket communication products such as cordless and cellular telephones, PDAs (Personal Digital Assistants), pagers, etc. Table 1.3 shows a battery analysis for a handheld cellular system. Low power is crucial for extending the battery life of these systems. Also, battery improvement is needed. The PDAs require a large amount of data processing with multimedia capabilities. The expected power of PDAs is around 0.5 Watt with 0.5 pound weight. Also, the expected power for pagers is 10 mW with 0.125 pound weight.

Table 1.3 Battery analysis for a handheld cellular system

Handheld cellular example    Motorola Microtac
RF power                     600 mW
Battery                      750 mAh secondary NiCd
Battery life                 75 minutes talk time, 20 hours standby
Total power load             650 mA x 6 V = 3900 mW

Sub-GHz processors for high-performance workstations and computers. Systems running at 100 MHz and over are emerging, and 500 MHz and higher will be common by the end of the decade. Since the power consumed increases with frequency, processors with new architectures and circuits optimized for low power are crucial.

Other applications such as WLANs (Wireless Local Area Networks) and electronic goods (calculators, hearing aids, watches, etc.).
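The handheld cellular figures of Table 1.3 can be cross-checked with simple arithmetic. The sketch below assumes an ideal battery (full rated capacity available at any discharge rate); the helper names are ours. It recomputes the total power load and estimates the talk time and the average standby current implied by the quoted battery life.

```python
def load_power_mw(current_ma, voltage_v):
    """Power in mW from current in mA and voltage in V."""
    return current_ma * voltage_v


def runtime_hours(capacity_mah, current_ma):
    """Ideal runtime: rated capacity divided by the average current."""
    return capacity_mah / current_ma


total_load_mw = load_power_mw(650.0, 6.0)            # values from Table 1.3
talk_minutes = 60.0 * runtime_hours(750.0, 650.0)    # ideal talk time
standby_current_ma = 750.0 / 20.0                    # implied by 20 h standby

print(f"Total load: {total_load_mw:.0f} mW")
print(f"Ideal talk time: {talk_minutes:.0f} min")
print(f"Implied standby current: {standby_current_ma:.1f} mA")
```

The ideal talk time comes out at roughly 69 minutes, the same order as the 75 minutes quoted in the table; the difference suggests the 650 mA load is a peak rather than an average figure, although the source does not say so.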


1.3 LOW-POWER DESIGN METHODOLOGY

In order to optimize the power dissipation of digital systems, a low-power methodology should be applied throughout the design process, from system level to process level, while realizing that performance is still essential. During optimization, it is very important to know the power distribution within a processor. Thus, the parts or blocks consuming an important fraction of the power are properly optimized for power saving. Fig. 1.1 shows the different design levels of an integrated system. The process technology is under the control of the device/process designer. However, the other levels are controlled by the circuit designer.

1.3.1 Power Reduction Through Process Technology

Figure 1.1 Power reduction design space (levels legible in the figure: logic/circuit, device/process)

One way to reduce the power dissipation is to reduce the power supply voltage. However, the delay increases significantly, particularly when VDD approaches the threshold voltage. To overcome this problem, the devices should be scaled properly. The advantages of scaling for low-power operation are the following:

Improved device characteristics for low-voltage operation. This is due to the improvement of the current drive capabilities;

Reduced capacitances through small geometries and junction capacitances;

Improved interconnect technology;

Availability of multiple and variable threshold devices. This results in good management of the active and standby power trade-off; and

Higher density of integration. It was shown that the integration of a whole system into a single chip provides orders of magnitude in power savings.
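The delay penalty of supply reduction mentioned at the start of this section can be illustrated with a first-order model. The sketch below assumes the simple long-channel (square-law) approximation, in which gate delay is proportional to VDD / (VDD - VT)^2, and an illustrative threshold voltage of 0.7 V. Both the model and the numbers are our assumptions, not values from the text.

```python
def relative_delay(vdd, vt=0.7):
    """First-order gate delay, up to a constant factor:
    delay ~ VDD / (VDD - VT)**2 (square-law MOSFET assumption)."""
    if vdd <= vt:
        raise ValueError("model is only valid for VDD above VT")
    return vdd / (vdd - vt) ** 2


def normalized_delay(vdd, vdd_ref=5.0, vt=0.7):
    """Delay at vdd relative to the delay at the reference supply."""
    return relative_delay(vdd, vt) / relative_delay(vdd_ref, vt)


for vdd in (5.0, 3.3, 2.0, 1.5, 1.0):
    print(f"VDD = {vdd:.1f} V: delay x {normalized_delay(vdd):.1f}")
```

The loop shows the delay growing slowly at first and then very steeply as VDD comes within a few hundred millivolts of VT, which is the behavior that motivates device scaling and lower thresholds.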


Table 1.4 shows the effect of scaling on microprocessor performance [14]. The power dissipation can be reduced by one order of magnitude at a fixed frequency of operation.

L (µm)         0.50         0.35        0.25         0.15
Leff (µm)      0.35         0.25        0.15         0.10
VDD (V)        3.3          2.5         1.8          1.5

Area (mm²)     8 x 10       5.6 x 7     4 x 5        2.5 x 3
Clock (MHz)    100          150         225          330
Power (W)      5.0          3.3         2.35         1.5

At 100 MHz:
Area (mm²)     6.4 x 8.4    4.5 x 6     3.2 x 4.2    2 x 2.5
Power (W)      5.0          2.2         1            0.45
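The fixed-frequency rows of Table 1.4 can be separated into a voltage contribution and a capacitance contribution. The sketch below assumes that the power is purely dynamic (proportional to C x VDD^2 x f), so that at fixed frequency the ratio of powers divided by the ratio of squared voltages gives the implied reduction in switched capacitance. This decomposition is our illustration and is not taken from [14].

```python
VDD = [3.3, 2.5, 1.8, 1.5]                # V, from Table 1.4
POWER_AT_100MHZ = [5.0, 2.2, 1.0, 0.45]   # W, fixed-frequency rows


def implied_capacitance_ratios(vdd, power):
    """Switched-capacitance ratio versus the first generation,
    assuming P ~ C * VDD**2 at fixed clock frequency."""
    v0, p0 = vdd[0], power[0]
    ratios = []
    for v, p in zip(vdd, power):
        power_ratio = p / p0
        voltage_term = (v / v0) ** 2
        ratios.append(power_ratio / voltage_term)
    return ratios


cap_ratios = implied_capacitance_ratios(VDD, POWER_AT_100MHZ)
for v, p, c in zip(VDD, POWER_AT_100MHZ, cap_ratios):
    print(f"VDD = {v} V: power = {p} W, implied C ratio = {c:.2f}")
print(f"Overall power reduction: {POWER_AT_100MHZ[0] / POWER_AT_100MHZ[-1]:.1f}x")
```

Under this assumption, the order-of-magnitude saving comes from both terms: the supply reduction accounts for roughly a factor of five, and the smaller geometries for the remainder.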

1.3.2 Power Reduction Through Circuit/Logic Design

To minimize the power at the circuit/logic level, many techniques can be used, such as:

Use of more static style over dynamic style;

Reduce the switching activity by logic optimization;

Optimize clock and bus loading;

Clever circuit techniques that minimize device count and internal swing;

Custom design may improve the power; however, the design cost increases;

Reduce VDD in non-critical paths and use proper transistor sizing;

Use of multi-VT logic circuits; and

Re-encoding of sequential circuits.
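To make the notion of switching activity in the list above concrete, the following sketch estimates the probability of a power-consuming 0-to-1 output transition for a static gate. It assumes independent, equiprobable inputs and no glitching, so that the transition probability is P(output = 0) x P(output = 1). The gate definitions and names are ours and purely illustrative.

```python
from itertools import product


def switching_activity(gate, n_inputs):
    """P(0 -> 1 output transition) for a static gate, assuming
    independent inputs that are 0 or 1 with equal probability."""
    outputs = [gate(*bits) for bits in product((0, 1), repeat=n_inputs)]
    p_one = sum(outputs) / len(outputs)
    return (1.0 - p_one) * p_one


def nand2(a, b):
    return 1 - (a & b)


def nor2(a, b):
    return 1 - (a | b)


def xor2(a, b):
    return a ^ b


for name, gate in (("NAND2", nand2), ("NOR2", nor2), ("XOR2", xor2)):
    print(f"{name}: activity = {switching_activity(gate, 2):.4f}")
```

With these assumptions the two-input NAND and NOR gates have an activity of 3/16, while the XOR gate reaches 1/4, which is one reason the choice of logic function and structure affects the dynamic power.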


1.3.3 Power Reduction Through Architectural Design

At the architecture level, several approaches can be applied to the design:

Power management techniques where unused blocks are shut down;

Low-power architectures based on parallelism, pipelining, etc.;

Memory partition with selectively enabled blocks;

Reduction of the number of global busses; and

Minimization of the instruction set for simple decoding and execution.
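The benefit of parallelism listed above can be sketched as follows. The idea is that N parallel units, each clocked N times slower, keep the same throughput while allowing a lower supply voltage. The sketch assumes purely dynamic power (C x VDD^2 x f); the 15% capacitance overhead and the 2.0 V scaled supply are hypothetical numbers chosen only for illustration.

```python
def parallel_power_ratio(n_units, v_ref, v_scaled, overhead=0.0):
    """Power of an n-way parallel datapath relative to the original,
    at equal throughput, assuming P ~ C * VDD**2 * f.

    overhead: assumed fractional extra capacitance for routing/muxing.
    """
    cap_ratio = n_units * (1.0 + overhead)   # n copies plus overhead
    freq_ratio = 1.0 / n_units               # each copy runs n times slower
    volt_ratio = (v_scaled / v_ref) ** 2     # quadratic supply dependence
    return cap_ratio * freq_ratio * volt_ratio


# Hypothetical case: two units, 15% overhead, supply lowered 3.3 V -> 2.0 V.
ratio = parallel_power_ratio(2, 3.3, 2.0, overhead=0.15)
print(f"Relative power: {ratio:.2f}")
```

Note that without the voltage reduction the parallel version consumes more power than the original because of the overhead; the saving comes entirely from trading the relaxed timing for a lower VDD.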

1.3.4 Power Reduction Through Algorithm Selection

Among the techniques to minimize the power at the algorithmic level, we cite:

Minimizing the number of operations and hence the number of hardware resources; and

Data coding for minimum switching activity.
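As an illustration of data coding for minimum switching activity, the sketch below compares the number of bit transitions on a 4-bit bus that carries a counting sequence in plain binary and in Gray code. Gray code is used here only as a well-known example of a low-activity code; the text does not name a specific coding scheme.

```python
def bit_transitions(sequence):
    """Total number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


def to_gray(n):
    """Standard binary-reflected Gray code of n."""
    return n ^ (n >> 1)


binary_seq = list(range(16))
gray_seq = [to_gray(n) for n in binary_seq]

print("binary transitions:", bit_transitions(binary_seq))  # 26
print("gray transitions:  ", bit_transitions(gray_seq))    # 15
```

For sequential data such as addresses, the Gray-coded bus toggles exactly one line per step, so the switched capacitance per transfer is noticeably lower than with binary coding.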

1.3.5 Power Reduction in System Integration

The system level is also important to the whole process of power optimization. Some techniques are:

Utilize low system clocks. Higher frequencies are generated with an on-chip phase-locked loop; and

High level of integration. Integrate off-chip memories (ROM, RAM, etc.) and other ICs such as digital and analog peripherals.

1.4 THIS BOOK

This book is an early contribution to the field of low-power digital VLSI circuit and system design. It targets two types of audiences: senior undergraduate and postgraduate university students, and VLSI circuit and system designers working in industry. In this book we have tried to cover the basics of VLSI systems, from the process technologies and device modeling to the architecture level. The fundamentals of power dissipation in CMOS circuits are presented to provide the readers with sufficient background to be familiar with the low-power design world. Several practical circuit examples and low-power techniques, mainly in CMOS technology, are discussed. Low-voltage issues for digital CMOS and BiCMOS circuits are also emphasized. This book also provides an extensive study of advanced CMOS subsystem design. Various power minimization techniques, at the circuit, logic, architecture and algorithm levels, are presented. Finally, the book includes a rich list of references, treating advanced topics, at the end of each chapter. This allows the readers to study, in depth, any topics they find interesting.

This book is organized into eight chapters. The first chapter is an introduction to low-power design. The other chapters are presented in the following sections.

1.4.1 Low-Voltage Process Technology

Chapter 2 deals with CMOS bulk, bipolar, BiCMOS and CMOS Silicon On Insulator (SOI) process technologies. Several CMOS technologies (N-well and twin-tub) and low-voltage CMOS enhancements are reviewed. Bipolar technology, with emphasis on advanced structures, is considered. The topic of the isolation techniques used for both bipolar and CMOS is addressed. Three BiCMOS technologies, with different performance/cost, are presented. A complementary BiCMOS structure, where a vertical isolated PNP transistor is merged with an NPN transistor in a CMOS process, is introduced. The design rules of a 0.8 µm BiCMOS process are supplied. Finally, SOI technology is reviewed for low-voltage and low-power applications.

1.4.2 Low-Voltage Device Modeling

Chapter 3 addresses the topic of device modeling. This topic is of interest to those readers who need to analyze, design and/or simulate circuits. It introduces commonly used models of both MOS and bipolar devices. In this chapter we consider simple analytical models which can be used for circuit analysis and design of deep-submicrometer MOSFETs at low voltage. Also, a simple model to compute the leakage current, and hence the static power dissipation, of MOSFETs is discussed. The SPICE¹ device models of a 0.8 µm CMOS/BiCMOS process are also presented. This should help the reader to appreciate the meaning of the model parameters as well as to analyze the power and delay of the low-voltage circuits presented throughout the book. Supply voltage scaling, due to reliability and power dissipation issues, is presented.

1.4.3 Low-Voltage Low-Power VLSI CMOS Circuit Design

Chapter 4 focuses on CMOS logic circuit design. The sources of power dissipation in these circuits are reviewed. Simple models for delay and power dissipation estimation are presented. The concept of switching activity is introduced and examples are given. The power dissipation due to spurious transitions is described. Several CMOS design styles, such as pseudo-NMOS, dynamic and NO RAce (NORA) logic, are studied. Guidelines for low-power physical design are presented. Other circuit variations of static complementary CMOS, which are suitable for low-power applications, are discussed. These include the pass-transistor logic family, such as Complementary Pass-transistor Logic (CPL), Dual Pass-transistor Logic (DPL), and Swing Restored Pass-transistor Logic (SRPL). An overview of clocking strategy in VLSI systems is also covered. Included in this chapter is one important area, the I/O circuits. The power dissipation of the I/O circuits is also analyzed. Finally, techniques to reduce the static and dynamic power components for CMOS design are reviewed. This chapter is intended to provide the readers with sufficient background in low-power circuit design.

1.4.4 Low-Voltage VLSI BiCMOS Circuit Design

A variety of BiCMOS logic circuits suitable for 3.3 V and sub-3.3 V operation are presented in Chapter 5. The chapter starts with the introduction of the conventional BiCMOS (totem-pole) gate which was used in 5 V applications. The degradation of this gate with supply voltage scaling is demonstrated. The BiNMOS family, suitable for low-voltage applications (3.3 - 2 V range), is introduced. It is shown that it provides better performance and delay-power product than CMOS at these voltages, even at low fan-out. Other logic families, for low power supply voltage operation, are also discussed. Finally, this chapter presents several low-voltage applications of BiCMOS.

¹SPICE is the most commonly used circuit simulator.

1.4.5 Low-Power CMOS Random Access Memory Circuits

The objective of Chapter 6 is two-fold. It is intended to present circuit techniques for active and standby power reduction in static and dynamic RAMs, and to apply the concepts behind these techniques to other applications, because RAMs have seen remarkable and rapid progress in power reduction. These techniques are applied at the architectural and circuit levels. Several advanced circuit structures and memory organizations are described. Circuits operating at a power supply as low as 1 V are also discussed. The Voltage Down Converters (VDCs) used as DC-DC converters are also treated. Their low-power aspects are investigated.

1.4.6 VLSI CMOS SubSystem Design

Chapter 7 presents a subsystem view of CMOS design. A variety of building blocks of VLSI systems such as adders, multipliers, ALUs, data paths, ROMs, PLAs, etc. are discussed. Several options for each subsystem are presented with emphasis on power dissipation. The use of PLLs in high-speed CMOS systems for deskewing the internal clock is also examined. Low-power issues of CMOS subsystems are also included.

1.4.7 Low-Power VLSI Design Methodology

In Chapter 8, advanced techniques to reduce the dynamic power component at several levels of design are presented. Lowering the power supply voltage while maintaining the performance is one technique for power reduction addressed extensively in this chapter. It is shown that low-power techniques at the high level (algorithmic and architectural) of the design lead to a power saving of several orders of magnitude. Several examples are included to give the reader a clear picture of low-power design aspects. In addition, the power estimation techniques, at the circuit, logical, architectural and behavioral levels, are overviewed. The goal of power estimation is to optimize power, meet requirements and know the power distribution through the chip.

REFERENCES

[1] Special Report, "The New Contenders," IEEE Spectrum, pp. 20-25, December 1993.

[2] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.

[3] W. J. Bowhill et al., "A 300 MHz 64b Quad-Issue CMOS RISC Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 182-183, February 1995.

[4] Technology 1995: Solid State, IEEE Spectrum, pp. 35-39, January 1995.

[5] D. Bearden et al., "A 133 MHz 64b Four-Issue CMOS Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 174-175, February 1995.

[6] MIPS Press release, 1994.

[7] A. Charnas et al., "A 64b Microprocessor with Multimedia Support," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 178-179, February 1995.

[8] C. Small, "Shrinking Devices Put the Squeeze on System Packaging," EDN, vol. 39, no. 4, pp. 41-46, February 1994.

[9] P. Verhofstadt, "Keynote Address," IEEE Symposium on Low Power Electronics, Tech. Dig., October 1994.

[10] G. Gerosa et al., "A 2.2 W 80 MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, December 1994.

[11] R. Bechade et al., "A 32b 66MHz Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 208-209, February 1994.

[12] N. K. Yeung, Y-H. Sutu, T. Y-F. Su, E. T. Pak, C-C Chao, S. Akki, D. D. Yau, and R. Ladenquai, "The Design of a 55SPECint92 RISC Processor under 2W," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 206-207, February 1994.

[13] S. Lipoff and A. D. Little, "Evaluation of New Battery Technology in Selected Applications," IEEE Workshop on Low-power Electronics, Phoenix, AZ, August 1993.

[14] J. M. C. Stork, "Technology Leverage for Ultra-Low Power Information Systems," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 52-55, October 1994.

2 LOW-VOLTAGE PROCESS

TECHNOLOGY

This chapter serves as an introduction to IC fabrication of CMOS bulk, bipolar, BiCMOS and CMOS SOI devices, including sub-micron devices for low-voltage applications. Section 2.1 is a review of CMOS process technologies. Examples of an N-well CMOS process and a twin-tub CMOS process are considered. Section 2.2 deals with bipolar technology with emphasis on advanced bipolar structures. The topic of the isolation techniques used for both bipolar and CMOS is addressed in Section 2.3. In Section 2.4 we discuss the similarities between advanced CMOS and advanced bipolar transistor structures to demonstrate how both technologies are indeed converging. The BiCMOS technologies are introduced in Section 2.5, with emphasis on CMOS-based processes. Three BiCMOS technologies, with different performance/cost, are presented. Section 2.6 introduces a complementary BiCMOS structure, where a vertical isolated PNP transistor is merged with an NPN transistor in a CMOS process. In Section 2.7, a table with the design rules of a generic 0.8 µm BiCMOS process is supplied. Finally, in Section 2.8, SOI technology is reviewed for low-voltage applications.

2.1 CMOS PROCESS TECHNOLOGY

The idea of CMOS was first proposed by Wanlass and Sah [1]. In the 1980's, it was widely acknowledged that CMOS is the technology for VLSI because of its unique advantages, such as low power, high noise margin, wider temperature and voltage operation range, overall circuit simplification and layout ease. The development of VLSI in the 80's has driven the integration density to millions of transistors on a single chip.

In this section we review two CMOS bulk technologies: the N-well and twin-tub processes. Other processes, such as retrograde-well technology, are not discussed.

2.1.1 N-well CMOS Process

In the N-well CMOS process, the P-channel transistor is formed in the N-well itself and the N-channel transistor in the P-substrate. Fig. 2.1 illustrates cross-sectional views and process steps of a typical N-well process.

The process starts by growing an oxide on the wafer. The oxide is then patterned to open N-well windows. Phosphorus atoms are implanted into the silicon, followed by a high-temperature annealing to diffuse the well [Fig. 2.1(a)]. The LOCOS (Local Oxidation of Silicon)¹ technique is used to isolate the different active areas. After removing the nitride used in the LOCOS process, a photoresist layer is deposited and is then patterned by a P-well mask (new mask). This is followed by low-energy ion implantation of boron (B I/I) to adjust the threshold voltage of the N-channel transistor [Fig. 2.1(b)]. A second ion implantation can be applied to eliminate punchthrough in the short channel device. Similarly, the threshold voltage of the P-channel transistor is adjusted [Fig. 2.1(c)]. A thin gate oxide is then grown and a layer of polysilicon is deposited and doped with phosphorus. The polysilicon is patterned to form the gates of all the transistors and an interconnection layer [Fig. 2.1(d)]. The source and drain regions are then implanted by using a photoresist mask. Boron is used for the P+ regions of the P-channel transistors and arsenic for the N+ regions of the N-channel transistors [Fig. 2.1(e)]. The N+ and P+ regions are also used as N- and P-well contacts, respectively. The photoresist is removed and a thick oxide is deposited by Chemical Vapor Deposition (CVD) as an isolation layer between the polysilicon layer and the subsequent metal layer. Contact holes are opened in the oxide layer and metal (usually aluminum) is deposited on the whole wafer. At this stage, the metal is patterned and annealed at a relatively low temperature (450 °C) [Fig. 2.1(f)]. One or two other metal layers are usually added. At the end, the wafer is passivated and windows are patterned over the metal bonding pads to provide electrical contacts with pins.

¹For more details on the LOCOS isolation see Section 2.3.1.

Figure 2.1 (continued). Process steps legible in the panels: strip resist/oxide, grow gate oxide, deposit polysilicon, apply photoresist and pattern, strip resist; apply photoresist, pattern S/D regions for P-channel, implant P+ S/D, strip photoresist, repeat for N+ S/D, strip photoresist; grow oxide, etch contact holes, deposit metal, pattern metal, metal anneal.

2.1.2 Twin-Tub CMOS Process

An alternative approach for CMOS device fabrication is to use two separate wells (tubs) for N- and P-channel transistors in a lightly doped N- or P-type substrate. This "twin-tub" CMOS technology uses a single mask that allows it to form two independently doped and self-aligned tubs [2]; hence both CMOS device types are optimized independently. This flexibility in selecting the substrate type with no change in the process flow is the major advantage of twin-tub CMOS. This technology is also more attractive when the devices are scaled down to submicron dimensions.


Fig. 2.2 shows the major steps involved in a typical twin-tub process. The starting material is a lightly doped P-epitaxial material over a P+ substrate to reduce latch-up. In addition to the conventional N-tub process, another N-type (arsenic) shallow implant is used to increase the surface doping of the N-tub to prevent punchthrough (for short channel devices). It is also used to form the channel-stoppers² for the P-channel transistors [Fig. 2.2(a)]. The photoresist is stripped and a selective oxidation of the N-tub is performed. The nitride/pad oxide layers are removed to implant boron, which is driven in to form the P-tub. This is followed by a second boron ion implantation for the channel-stoppers for the N-channel device [Fig. 2.2(b)]. The N-tub oxide is then stripped. So far only one mask (N-tub mask, MASK#1) is required for the self-aligned wells and channel-stopper processes. Both tubs are driven in. LOCOS isolation is developed to isolate between the devices using MASK#2, which defines the active areas. After the LOCOS process, boron is implanted through the pad oxide (used in the LOCOS) to reduce the threshold voltage of the P-channel transistor using MASK#3. This process results in a buried-channel PMOS transistor. The pad oxide is then removed. The remaining steps are similar to those used in the N-well process, where MASK#4 is needed to pattern the polysilicon [Fig. 2.2(c)]. MASK#5 and MASK#6 are required to form the N+ and P+ sources/drains (S/D), respectively, MASK#7 is used for contact openings, and MASK#8 for patterning the metal [Fig. 2.2(d)].
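The mask sequence of the twin-tub process described above can be summarized as a small ordered table. The sketch below simply restates the text as data; the mask numbers 5 and 6 for the N+ and P+ source/drain implants are our reading of the sequence (they fall between MASK#4 and MASK#7), and the wording of each purpose is ours.

```python
# (mask number, purpose) in processing order, as described in the text.
TWIN_TUB_MASKS = [
    (1, "N-tub definition (self-aligned wells and channel-stoppers)"),
    (2, "Active areas (LOCOS isolation)"),
    (3, "P-channel threshold-adjust boron implant"),
    (4, "Polysilicon patterning"),
    (5, "N+ source/drain implant"),   # number inferred from the sequence
    (6, "P+ source/drain implant"),   # number inferred from the sequence
    (7, "Contact openings"),
    (8, "Metal patterning"),
]


def print_mask_flow(masks):
    """Print the mask sequence, one step per line."""
    for number, purpose in masks:
        print(f"MASK#{number}: {purpose}")


print_mask_flow(TWIN_TUB_MASKS)
```

Listing the flow this way makes it easy to see that the two wells and both channel-stop implants share a single mask, which is the self-aligned feature emphasized in the text.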

The fabrication of submicron MOS transistors requires additional process steps to avoid hot carrier effects. Fig. 2.3 illustrates a CMOS twin-tub structure with Lightly Doped Drain (LDD). Both NMOS and PMOS devices have lightly doped extensions to the source and drain regions. The electric field near the drain is reduced due to its light doping. This prevents the generation of hot carriers. The major process steps to fabricate the LDD structure are shown in Fig. 2.4.

2.1.3 Low-Voltage CMOS Technology

Scaled CMOS has been recognized as the technology suitable for low-power battery-operated systems demanding high-speed operation. Conventional scaled CMOS technology undergoes a drastic reduction in speed when the power supply is reduced to 1 V and sub-1 V. If the threshold voltage is scaled aggressively, the subthreshold leakage current increases drastically, which causes limitations for battery applications. Hence, high-performance low-power scaled CMOS technology is needed for ultra-low voltage operation. One key in achieving low-power CMOS devices is the reduction of the junction capacitances as well as other parasitic capacitances. Also, the subthreshold current should be reduced when a low threshold voltage (VT ≤ 0.3 V) is used.

²For more details on the channel-stoppers refer to Section 2.3.

Figure 2.2 Twin-tub process sequence. Labels legible in the panels: P-tub and N-tub in a P epi-layer; strip resist, grow selective thick oxide, remove nitride/pad oxide, B I/I (P-well), B anneal (P-well), 2nd B I/I (channel-stoppers); N+ S/D, P+ S/D, contacts, metallization.
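The sensitivity of the subthreshold leakage to the threshold voltage, noted in the paragraph on aggressive VT scaling, can be illustrated with the usual exponential model, in which the off-state current changes by one decade for every S millivolts of threshold shift (S being the subthreshold slope). The value S = 85 mV/decade and the threshold voltages used below are illustrative assumptions, not figures from the text.

```python
def leakage_increase_factor(vt_old, vt_new, slope_mv_per_decade=85.0):
    """Factor by which subthreshold leakage grows when VT is lowered,
    assuming I_off ~ 10 ** (-VT / S)."""
    delta_mv = (vt_old - vt_new) * 1000.0
    return 10.0 ** (delta_mv / slope_mv_per_decade)


# Hypothetical example: threshold lowered from 0.7 V to 0.3 V.
factor = leakage_increase_factor(0.7, 0.3)
print(f"Leakage grows by about {factor:.0f}x")
```

With these assumed values the leakage rises by more than four decades, which shows why a low threshold voltage must be accompanied by measures that reduce the subthreshold current or slope.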

Extensions and variations of the standard CMOS process have been proposed to enhance the performance of devices at low voltage [3, 4]. These devices have good short channel behavior, low junction capacitance and reduced parasitic resistance. The power supply choice depends on performance/reliability/power trade-offs. A reduced power supply is needed for low-power applications, but a deep-submicron CMOS device with ultrathin gate oxide and low threshold voltage should be used to improve performance. Table 2.1 shows the speed achieved at low voltages using deep-submicron processes.

Table 2.1 Performance comparison at low voltage.

Name [Ref.]    CMOS Process   Voltage (V)   Delay (ps)
IBM [3]        0.10 µm        …             …
AT&T [4]       0.10 µm        …             …
NEC [5]        0.15 µm        …             …
Fujitsu [6]    0.10 µm        …             21.0
…              0.15 µm        …             50.0
Toshiba [8]    0.35 µm        …             52.0

An example of an improved-performance CMOS technology suitable for low voltage is the one proposed by Toshiba [8], called CMOS Shallow Junction Well FET (SJET). Fig. 2.5 shows the cross-sectional view of the CMOS-SJET process. The N-well and P-well depths are very shallow and comparable to the maximum depletion layer width in the channel. With this CMOS-SJET structure the depletion layer of the NMOS device, for example, is extended compared to the original one and reaches the depletion layer of the P-well and the N-type substrate. As a result, the total depletion layer width is increased and a low depletion capacitance, C_D, is obtained. This leads to the reduction of the subthreshold slope (see Section 3.3.2). Thus, the threshold voltage can be reduced at low power supply voltage compared to the conventional CMOS process. Furthermore, the wells are designed to reduce the junction capacitance of the S/D regions by 40 to 55% compared to the conventional one. The structure of Fig. 2.5 also uses dual polysilicon gates, N+ and P+, to optimize the threshold voltages of the MOS devices. Also, W-polycide gates are used to reduce the poly sheet resistance. The delay of the CMOS-SJET inverter is 2.5 times better than that of conventional CMOS using the same gate size (0.5 µm technology) at 1.5 V power supply. The power-delay product of a CMOS-SJET gate at 1.5 V using 0.35 µm technology is 1.3 fJ which is 113 times improvement of that for conventional CMOS devices. However, the main drawback with the CMOS-SJET is the large body effect due to its retrograde doping profile.

Figure 2.5 Cross-sectional view of the CMOS-SJET structure (labels legible in the figure: P MOSFET, N MOSFET, N-substrate)
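The link between a lower depletion capacitance and a smaller subthreshold slope, used in the CMOS-SJET discussion, follows from the standard first-order expression S = ln(10) x (kT/q) x (1 + C_D/C_ox). The sketch below evaluates it for a few capacitance ratios. The expression is the usual textbook approximation (interface-state capacitance neglected) and the ratios are illustrative; neither is taken from the Toshiba reference.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def subthreshold_slope_mv(cd_over_cox, temperature_k=300.0):
    """Subthreshold slope in mV/decade:
    S = ln(10) * (kT/q) * (1 + C_D / C_ox)."""
    thermal_voltage = BOLTZMANN_J_PER_K * temperature_k / ELECTRON_CHARGE_C
    return 1000.0 * math.log(10.0) * thermal_voltage * (1.0 + cd_over_cox)


for ratio in (0.5, 0.3, 0.1, 0.0):
    print(f"C_D/C_ox = {ratio:.1f}: S = {subthreshold_slope_mv(ratio):.1f} mV/decade")
```

As the ratio C_D/C_ox falls, S approaches its room-temperature limit of roughly 60 mV/decade, so a given off-current can be met with a lower threshold voltage.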

2.2 BIPOLAR PROCESS TECHNOLOGY

The technology of epitaxial growth gave rise to the economical manufacturing of monolithic bipolar ICs, as it allows a high-quality thin film of semiconductor to be grown on top of a substrate. Junction-isolation and epitaxy techniques triggered the progress of bipolar technology. Although most of the focus has been on the development of CMOS for the last ten years, bipolar technology has achieved significant progress as well. Impressive high-speed results were demonstrated at the 1985 ISSCC (International Solid-State Circuits Conference) and thereafter. ECL (Emitter Coupled Logic) gate delays of 15 ps have been reported [9]. It was shown that advanced silicon bipolar technologies, although quite complex, could be integrated at the LSI level and operate at frequencies above those of CMOS circuits. Since then, the interest in advanced bipolar processes has increased. The key features for such technologies are: i) self-aligned base, ii) advanced isolation techniques such as deep-trench, and iii) polysilicon emitter contact.

Figure 2.7 Cross-sectional view of the SICOS bipolar device structure [11]

have been replaced by the sidewall base electrodes. This allows the base area to be almost as large as the emitter. The SICOS structure is suitable for VLSI applications because of its density and low parasitics.

One of the features of advanced bipolar transistors is the replacement of aluminum by polysilicon for the contact of the emitter. This step has led to a noticeable improvement in the current gain of bipolar transistors. For further reading on polysilicon emitter BJTs refer to [10, 12, 13].

In this section, we introduce a typical Double-Polysilicon Self-Aligned (DPSA) process technology as an example of the advanced bipolar technologies³.

Any bipolar process typically starts with creating the buried layers and the epitaxial layer. Fig. 2.8 illustrates the major steps of the epitaxial growth with an N+ buried layer (BL). This buried layer is introduced to reduce the collector resistance of a bipolar transistor, while the epitaxial layer offers the high-quality silicon host for the bipolar transistor. The steps involved in Fig. 2.8 are the following. First, an oxide layer is grown on the substrate and is then patterned using the buried layer mask. The photoresist on the oxide serves as a mask against etching and ion implantation. After etching the oxide, the exposed regions of the silicon surface are implanted with arsenic or antimony to form the N+ buried layers. The photoresist is then removed and an annealing step is carried out. All oxide is then stripped. An N-epitaxial layer is grown on the substrate as shown in Fig. 2.8(b). The thickness of this epitaxial layer can be as low as 0.8 µm for advanced digital bipolar technology. The problems limiting the scaling down of the thickness of the epitaxial layer are the autodoping and out-diffusion of the buried layer.

³A review of conventional bipolar technology using the junction isolation technique can be found in [la].

Figure 2.8 Epitaxial growth with an N+ buried layer. Labels legible in the panels: photoresist; grow oxide, apply photoresist, pattern/etch N+ BL mask, implant Sb; strip resist, anneal, strip oxide, Si epitaxial layer, epitaxy (intrinsic layer).

Fig. 2.9 illustrates the sequence of a DPSA process, assuming a starting structure with N+ buried layer, N-epitaxial layer and isolation oxide as shown in Fig. 2.9(a). First, photoresist is deposited and patterned to define the collector contact region (deep N+ collector sink). This region is then implanted with phosphorus to increase its doping level. The photoresist is stripped and a P-type base is implanted through a pre-implantation oxide as shown in Fig. 2.9(b). The resist and the oxide are then removed. A combination of P+ polysilicon and oxide layers is deposited over the wafer. These layers are then etched as shown in Fig. 2.9(c). A CVD oxide is deposited over the wafer. The oxide is then dry etched using reactive ion etching (RIE). The P+ polysilicon is walled with the oxide (called sidewall spacers) [Fig. 2.9(d)]. The second level of polysilicon is deposited and implanted with phosphorus that will ultimately form the diffused emitter junction. At this stage, the wafer is annealed to drive the dopants from the P+ and N+ polysilicon layers. Fig. 2.9(e) illustrates the structure after patterning the N+ polysilicon. The P+ diffusion under the polysilicon forms the extrinsic base. The contact openings to the P+ and N+ polysilicon, and to the collector, are etched. This is followed by the metallization step. At the end, the metal is patterned as shown in Fig. 2.9(f).

[Figure 2.9. Step labels: initial structure with oxide isolation; apply photoresist; pattern photoresist (N+ collector mask); P I/I for the N+ sink; CVD oxide; strip photoresist/oxide; deposit P+ polysilicon/oxide; pattern/etch oxide/polysilicon; deposit CVD oxide; RIE etch of oxide; deposit the second level of polysilicon; P I/I (N+ poly); anneal; pattern/etch N+ polysilicon; deposit oxide; open contact holes; deposit metal; pattern/etch metal.]

The advantage of bipolar devices is their high-speed performance. However, they are not suitable for battery backup systems because they consume high DC current. Many logic circuit techniques have been proposed for low-power and low-voltage operation, particularly for telecommunications applications [15, 16].

2.3 ISOLATION IN CMOS AND BIPOLAR TECHNOLOGIES

2.3.1 CMOS Device Isolation Techniques

Isolation in an integrated circuit means electrically isolating similar or different transistors from one another. In a CMOS chip, where more than one million transistors can be integrated, 1 µA/transistor of leakage current due to bad isolation can lead to a few watts of DC power consumption. Moreover, this leakage current increases the susceptibility to latch-up, as will be discussed in Section 3.1.6.
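The leakage figure quoted above is a simple product of transistor count, per-transistor leakage and supply voltage, and can be checked with a short sketch. The per-transistor leakage is taken as 1 µA; the supply voltages (3.3 V and 5 V) and the function name are assumptions chosen only for illustration.

```python
def leakage_power_watts(n_transistors, leak_per_transistor_a, vdd_v):
    """Static power due to isolation leakage: P = N * I_leak * VDD."""
    total_current_a = n_transistors * leak_per_transistor_a
    return total_current_a * vdd_v


# One million transistors, 1 uA of leakage each (about 1 A in total).
# The supply voltages below are illustrative assumptions.
for vdd in (3.3, 5.0):
    power = leakage_power_watts(1_000_000, 1e-6, vdd)
    print(f"VDD = {vdd} V -> about {power:.1f} W of DC power")
```

With about 1 A of total leakage, any supply of a few volts gives a few watts, which is consistent with the estimate in the text.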

Isolation in CMOS is required to separate the devices electrically by eliminating the inversion layers, which might be induced by the interconnection layer between the transistors. The principle of isolation in CMOS is based on the formation of a field oxide between two active areas [Fig. 2.10]. The width of the isolation region should be minimized to attain a dense layout, particularly for VLSI circuits.

Figure 2.10 Field oxide isolation in MOS integrated circuits.

Several isolation techniques have been proposed and used. The most popular are LOCOS (Local Oxidation of Silicon) [17], trench isolation [18, 19, 20, 21], and selective epitaxy [22]. Selective epitaxy is not studied in this chapter.

2.3.1.1 Local Oxidation of Silicon (LOCOS)

LOCOS is a relatively simple process for the isolation of active devices in CMOS technology. It is realized by forming a thick field oxide (FOX) between the active areas. FOX is very thick (0.4 - 0.6 µm), hence the corresponding field threshold voltage is high. The condition for preventing an inversion layer under the FOX and between two active regions is that this field threshold voltage should be higher than the highest power supply voltage used on chip. The field threshold voltage can be further increased by raising the doping level under the FOX. This can be achieved by selectively implanting the regions over which the FOX is subsequently grown. These regions are commonly known as channel-stoppers.

The steps of the LOCOS process are illustrated in Fig. 2.11. A pad oxide of 40 nm is grown and is followed by chemical vapor deposition of a 100 nm thick nitride layer, which masks the active region. The pad oxide is called stress-relief-oxide (SRO) because it protects the silicon from stress caused by the nitride during subsequent high temperature processes. Silicon nitride is used as a mask to protect the active region from oxidation. A layer of photoresist is applied to the wafer and then patterned using the mask of the active areas. The nitride/oxide layers are etched [Fig. 2.11(a)]. A P-type dopant is implanted to form the channel-stoppers [Fig. 2.11(b)]. The photoresist, which is used for protection against ion implantation, is stripped and a thick thermal oxide, i.e. the FOX, is grown. Only local oxidation is realized because the nitride masks the regions beneath it. At the end, the nitride/oxide layers are removed [Fig. 2.11(c)]. During this LOCOS process, 56% of the FOX thickness is under the silicon surface because the oxidation consumes some of the silicon. This process is called semi-recessed LOCOS isolation. One problem associated with this process is the lateral extension of the field oxide under the nitride during the oxidation, forming what is called bird's beak encroachment [Fig. 2.11(c)]. A typical value of this encroachment is 0.5 µm/side. This encroachment limits the scaling of the active areas and the channel width of the MOS device. Moreover, this bird's beak introduces imprecise channel widths.

Figure 2.12 Poly buffered LOCOS process.

The Poly Buffered LOCOS process was developed to reduce the bird's beak encroachment [23]. In this modified LOCOS process, the nitride mask thickness has been increased to 240 nm and a polysilicon stress relief buffer layer of 50 nm has been added between the nitride and a 10 nm pad oxide [Fig. 2.12(a)]. This arrangement prevents deep lateral extension of the field oxide under the nitride layer [Fig. 2.12(b)]. A 0.8 µm field oxide thickness results in 0.15 µm/side of encroachment and 2.2 µm minimum isolation pitch. Other techniques to solve the problem of the bird's beak encroachment can be found in [24, 25, 26].
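To make the effect of the bird's beak on device dimensions concrete, the following sketch compares the two encroachment values quoted in the text (0.5 µm/side for LOCOS, 0.15 µm/side for poly buffered LOCOS). It assumes a first-order model in which the encroachment is symmetric and subtracts directly from the drawn active width; the drawn width of 2.0 µm and the function name are hypothetical.

```python
def effective_width_um(drawn_width_um, encroachment_per_side_um):
    """First-order model: the beak eats into the active area on both sides."""
    return max(drawn_width_um - 2 * encroachment_per_side_um, 0.0)


# Encroachment per side, in micrometres, as quoted in the text.
ENCROACHMENT_UM = {"LOCOS": 0.5, "poly buffered LOCOS": 0.15}

DRAWN_WIDTH_UM = 2.0  # hypothetical drawn active width, for illustration only

for process, encroachment in ENCROACHMENT_UM.items():
    width = effective_width_um(DRAWN_WIDTH_UM, encroachment)
    print(f"{process}: effective width of about {width:.2f} um")
```

Under this simplified model the conventional LOCOS device loses half of its drawn width, while the poly buffered variant loses only 0.3 µm, which illustrates why the encroachment limits scaling of the active areas.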

2.3.1.2 Trench Isolation

Trench isolation is another alternative to the LOCOS isolation process. This technology has been accepted relatively quickly by the industry [27]. It addresses the isolation problem between opposite type devices (like N-channel and P-channel MOSFETs in CMOS technology). The advantages of trench isolation are: i) no bird's beak encroachment, ii) latch-up free structure, and iii) planar surface.

Fig. 2.13 illustrates the steps of the trench isolation process. First, the pad oxide, the nitride and the thick oxide layers are patterned using the mask of the active areas. The thick oxide serves as a mask in the trench processing [Fig. 2.13(a)]. A deep trench is formed by dry etching (RIE). This is followed by a boron implant to create the P+ channel-stoppers at the bottom of the trench. The top thick oxide is removed, and the trench sidewalls are oxidized [Fig. 2.13(b)]. Polysilicon is deposited over the whole wafer, filling the trenches. Polysilicon is used as the trench fill material because it fills the trenches more uniformly than other dielectrics. The surface polysilicon is then etched to yield the structure shown in Fig. 2.13(c). The wafer is oxidized using the nitride as a mask. The nitride is finally removed as illustrated in Fig. 2.13(d). At this stage, conventional processing can be used to integrate the CMOS devices.

Although trench isolation permits reduction of the separation between the active regions, it has several drawbacks: i) it is a costly process because of the large number of processing steps, and ii) it cannot be used as an isolation region for the inactive parts of the chip. In this case, LOCOS is usually used. The description of other trench isolation processes can be found in [28].

2.3.2 Bipolar Device Isolation Techniques

The first technique used for bipolar isolation was based on collector/substrate junction isolation [Fig. 2.14]. The N-wells (N collectors) of the adjacent transistors were separated by P+ isles, which are deeply diffused to reach the P-type substrate. By tying these isles and the substrate to the most negative voltage, the junctions between them and the N-type collectors are reverse biased. Thus, all the components in different N-wells (N collectors) are isolated. The area consumed by the isolation isles is large relative to the transistor area.

[Figure 2.13. Step labels: grow oxide/nitride/oxide; pattern active region; RIE trench; implant boron; remove thick oxide; oxidize trench walls; oxidize; remove nitride.]

Figure 2.15 Cross-sectional view of an NPN bipolar transistor with LOCOS isolation.

The packing density of the bipolar technology can be improved by replacing the junction isolation with LOCOS isolation. An additional advantage of LOCOS isolation is the reduction of the parasitic collector-substrate capacitance. Fig. 2.15 illustrates the cross-sectional view of an NPN bipolar transistor with LOCOS isolation. The area occupied by the oxide isolation is proportional to the epitaxial layer thickness. As the epitaxial thickness is reduced for higher device performance, the oxide isolation area becomes smaller, which means that LOCOS may become a practical isolation technique for advanced bipolar and BiCMOS technologies.

Fig. 2.16 illustrates the process steps for oxide isolation in a bipolar process. After epitaxial growth, a thin layer of SiO2 is grown and a layer of Si3N4 is deposited. A photoresist layer is applied and patterned with an isolation mask [Fig. 2.16(a)]. Then the nitride/pad oxide layers and approximately half of the epitaxial layer are dry etched. A boron implant is performed to form the channel-stopper [Fig. 2.16(b)]. The photoresist is then removed and the wafer is oxidized to grow the thick isolation oxide. This oxide is called recessed oxide. The Si3N4 and the pad oxide are stripped at this stage. The resulting structure is almost planar. In this structure the bird's beak is formed as in the MOS case [Fig. 2.16(c)].

In the early 1980's, new isolation techniques such as grooves and trenches [29, 30, 31] were demonstrated. These techniques reduced the collector-substrate capacitance and increased the packing density. Hence they improve circuit speed. The fabrication process is the same as the one described for CMOS trench isolation.

2.4 CMOS AND BIPOLAR PROCESSES CONVERGENCE

An interesting exchange of process technology know-how between the CMOS and the bipolar domains has taken place over the years. We have seen that epitaxial and buried layers have been used in CMOS to suppress latch-up. At the same time LOCOS, which was originally developed for CMOS, has been used for isolating bipolar transistors. The use of polysilicon for creating self-aligned MOS transistors was later adapted for self-aligned poly emitter bipolar transistors. Another example of the convergence between bipolar and CMOS is the use of oxide spacers: in CMOS they are used for the formation of LDD regions, while in bipolar they are used to reduce the separation between the base contact and the emitter. The convergence of both technologies made the attractive idea of merging bipolar and CMOS seem more rational and feasible than ever.

Many of the steps of the advanced CMOS and bipolar processes are similar; hence, they can be shared for the fabrication of MOS and bipolar transistors when they are integrated in a BiCMOS process. Some examples of these steps are:

[Figure 2.16. Step labels: N+ BL process; grow epi-layer (N type); grow pad oxide; deposit nitride/resist; pattern resist; strip resist; grow selective oxide; remove nitride/oxide.]

1. The N-well, which can be used as the body of the PMOS transistor and as the N-collector of the NPN transistor;

2. The N+ buried layer of the NPN can be used to form a retrograde well for the PMOS to reduce the latch-up susceptibility;

3. The polysilicon can be used for the CMOS gates and for the emitter contacts;

4. The shallow P-type implantation can be shared by the PMOS S/D and the self-aligned extrinsic base of the NPN transistor;

5. The shallow N-type implantation can be shared by the NMOS S/D and the emitter of the NPN transistor; and

6. The final annealing steps match.

However, as more steps are shared by the different devices, the device characteristics have to be compromised. There is a tradeoff between the process complexity and device quality.

2.5 BICMOS TECHNOLOGY

Although the idea of merging bipolar and CMOS on the same chip originated 20 years ago [32], it was not feasible from a practical point of view because of the lack of adequate process technology. With the technological progress achieved in recent years, this idea has been revived. There are many techniques to merge bipolar and CMOS devices as reported in the literature [33, 34, 35, 36, 37, 38]. There are two ways of classifying BiCMOS processes. One way is to classify them according to the baseline process. A CMOS-based BiCMOS process is a CMOS baseline process to which a bipolar transistor is added. Similarly, a bipolar-based BiCMOS process is a bipolar baseline process to which CMOS transistors are added. In both cases, the added device would have to be compromised, which means that its characteristics cannot be optimized. Alternatively, BiCMOS processes can be classified according to their cost/performance. In this regard, three categories can be identified:


1. Low-cost;

2. Medium-performance; and

3. High-performance (high-speed).

In this section, we present three examples of BiCMOS processes. The first one represents a low-cost process. It needs only one mask to incorporate the bipolar device in a CMOS-based process. The second example shows a medium-performance BiCMOS process, which adds 3 extra masks to a CMOS process. The third example illustrates a high-performance process in which polysilicon emitter and self-aligned structures are used.

2.5.1 Example 1: Low-Cost BiCMOS Process

In a low-cost BiCMOS process, a bipolar transistor is added to a CMOS process with minimum additional process steps. A typical N-well CMOS/bipolar process sequence is listed in Fig. 2.17(a). The N-well of the PMOS is used for the collector of the vertical NPN. The base is implanted in a separate step using an additional mask. The P+ S/D and the extrinsic base share the same implantation step. The emitter and the N+ S/D of the NMOS are also implanted in the same step. Fig. 2.17(b) illustrates the cross-section of an N-well BiCMOS structure. The process complexity is comparable to that of the CMOS. However, there are many tradeoffs in designing the emitter, base, and collector of the NPN. If the CMOS process is optimized, some of the bipolar device parameters, such as the breakdown voltage and the gain, may be satisfactory, but many others are degraded. For example, due to the absence of the buried layer and the deep N+ collector in the NPN, the collector resistance is high. Hence, the cut-off frequency is low, the current drive is poor, and the collector-emitter saturation voltage is high.

2.5.2 Example 2: Medium-Performance BiCMOS Process

Fig. 2.18 shows a cross-sectional view of a BiCMOS structure, which can be realized by adding an NPN to a baseline twin-tub CMOS process. This structure has an N+ buried layer and a deep N+ collector sink which enhance the collector conductivity. The N+ buried layer under the PMOS, together with the uniform N-well, forms a desired retrograde N-well. Similarly, the P+ buried layer creates a retrograde P-well for the NMOS transistor. It also acts as an isolation region between the N+ buried layers. A thin epitaxial layer (1 µm - 2 µm) is used to increase the cutoff frequency of the NPN transistor and to reduce the required width of the isolation isles between the bipolar transistors. The N collector is formed at the same time as the N-well of the PMOS transistor. After the formation of LOCOS, a deep N+ sink is implanted and driven in. The P+ extrinsic base is implanted at the same time as the P+ S/D regions of the PMOS transistor. The N+ emitter and the N+ S/D share the same implantation step. In this process an aluminum emitter contact is used. Therefore, the size of the emitter is larger compared to the case where a self-aligned polysilicon emitter contact is used. This process uses only 3 extra masks to form the bipolar transistor. The first mask is needed for the N+ buried layer. The second mask is used to implant the N+ deep collector, and the third one is used for the base implantation.

[Figure 2.17. (a) Process sequence. CMOS (base): P-substrate; N-well; LOCOS isolation; NMOS channel implantation; PMOS channel implantation; gate oxide; polysilicon gate; S/D N+ implantation; S/D P+ implantation; contact opening; metallization. Bipolar (addition): collector (N-well); base P implantation; P extrinsic base. (b) Cross-section showing the NPN, NMOS and PMOS devices.]

The BiCMOS process described above can be optimized for use in high-performance circuits. The collector resistance is low in comparison to the low-cost process (example 1). For a 0.8 µm process, the cut-off frequency (fT) of a bipolar transistor can be as high as 5 GHz.

2.5.3 Example 3: High-Performance BiCMOS Process

A high-performance BiCMOS process can be achieved by replacing the N+ S/D implant, used to form the emitter in example 2, by a doped polysilicon emitter. One extra mask is required to open the emitter window of the bipolar transistor. The ion implantation of the poly emitter and the MOS gates is performed simultaneously. As shown in Fig. 2.19, four additional mask levels (N+ buried layer, N+ deep collector, P-base, and emitter window) are required to obtain an advanced BiCMOS.

After the formation of the N+/P+ buried layers, the conventional twin-tub process is carried out. LOCOS is used to isolate the devices. The deep N+ collector is implanted and driven in, and the P-base is then patterned and implanted. The threshold voltages of the MOS transistors are adjusted by additional ion implantations. After the gate oxide growth, a thin polysilicon layer is deposited as shown in Fig. 2.20(a). The emitter window is then patterned and a second polysilicon layer is deposited [Fig. 2.20(b)]. The polysilicon is then doped by implantation and patterned to define the CMOS gates and the polysilicon emitter [Fig. 2.20(c)]. Next, implants are selectively carried out to form the LDD regions for CMOS. Before implanting the N+/P+ S/D regions, a sidewall oxide is formed near the emitter and gate edges. Fig. 2.19(b) shows the final cross-section of this BiCMOS process.

[Figure 2.20. Step labels: apply photoresist; pattern emitter; etch poly/oxide; strip resist; deposit LPCVD poly (250 nm), 2nd part of split poly; implant As; apply photoresist; pattern poly; dry etch poly; strip resist; anneal.]

The BJTs realized in the presented high-performance BiCMOS process have low collector resistance (because of the buried layer and deep sink), high current gain (because of the poly emitter contact) and low parasitic capacitances (because of the self-alignment). With this BiCMOS process, fT's greater than 5 GHz can be achieved.

BiCMOS technology has a relatively high cost and complexity, because it requires a total of 15 masks for a submicron process. Several solutions have been proposed to reduce the number of process steps and thereby lower process complexity and cost. Recently one idea [40] has resulted in a low-cost 0.35 µm BiCMOS technology which needs only 11 masks by using a W-plug trench collector sink. This technology is suitable for a 3.3 V power supply voltage and is promising for low-power mixed-signal applications.
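The mask overhead of the three example processes can be tallied in a few lines. The extra masks per category are those named in the text of the three examples; the CMOS baseline mask count used below is a hypothetical value chosen only for illustration, and the names `EXTRA_MASKS` and `total_masks` are likewise illustrative.

```python
# Extra masks each example process adds to its CMOS baseline (from the text).
EXTRA_MASKS = {
    "low-cost": ["P-base"],
    "medium-performance": ["N+ buried layer", "N+ deep collector", "P-base"],
    "high-performance": [
        "N+ buried layer",
        "N+ deep collector",
        "P-base",
        "emitter window",
    ],
}


def total_masks(cmos_baseline_masks, category):
    """Total mask count = CMOS baseline + bipolar-specific extra masks."""
    return cmos_baseline_masks + len(EXTRA_MASKS[category])


HYPOTHETICAL_CMOS_MASKS = 11  # assumption, not a figure taken from the text

for category, masks in EXTRA_MASKS.items():
    total = total_masks(HYPOTHETICAL_CMOS_MASKS, category)
    print(f"{category}: +{len(masks)} masks, {total} in total")
```

The point of the sketch is the trend rather than the absolute numbers: each step up in bipolar performance costs additional mask levels on top of whatever the CMOS baseline requires.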

Recently, BiCMOS technologies with high NPN transistor fT's, from 10 to 30 GHz, have been reported [38, 40, 41]. The applications of these technologies include, for example, low-voltage (3 V and sub-3 V) and high-speed logic circuits. Another application of BiCMOS is mixed analog/digital ICs, ranging from telecommunication circuits and high-speed networks to wireless systems. Among these applications, BiCMOS can be used for low-power high-frequency portable systems. Bipolar devices can be used for the high-frequency and high-speed parts with low-power innovative circuits, and CMOS can be used for the low-speed ultra-low-power parts.

2.6 COMPLEMENTARY BICMOS TECHNOLOGY

In a Complementary BiCMOS (CBiCMOS) process both vertical NPN and PNP transistors are merged with CMOS on the same chip. Recent investigations indicate that CBiCMOS allows for improving the performance of BiCMOS gates at low supply voltages [42, 43, 44]. Moreover, for wireless applications, where high-speed and low-power characteristics are required, CBiCMOS technology is one of the solutions. The PNP device added to conventional BiCMOS can be used to efficiently design low-voltage circuits. Further discussion on CBiCMOS circuits can be found in Section 5.3.2. Although, to date, the NPN has shown superior performance to that of the PNP, the trend indicates that PNP performance is approaching that of the NPN. Some of the problems associated with the PNP transistor are its high collector resistance, low current gain, and high base transit time.

It has been recently reported that CBiCMOS processes can offer NPNs with fT's of 8-20 GHz and PNPs with fT's of 2-7 GHz [45, 46, 47, 48, 49, 50]. Fig. 2.21 shows a cross-sectional view and process flow of a CBiCMOS [46]. The N+ buried layer of the NPN transistor creates a retrograde well for the PMOS transistor. The P+ buried layer is only used for isolation isles between NPN transistors. After the epitaxial layer growth, twin-well and LOCOS processes are performed. The P-well of the NMOS device is used as the collector of the PNP transistor. A second high energy (600 keV) boron ion implantation is carried out to form the retrograde well (2nd P-well) for the NMOS and the P+ buried layer for the PNP device. The S/D implants of the MOS transistors are used simultaneously for the extrinsic bases of the NPN and the PNP transistors. The emitters of the NPN and the PNP are formed by the self-aligned contact doping technique to simplify the process flow. Finally, the metal is deposited and patterned.

Complementary BiCMOS offers a technology with versatile devices. It adds flexibility for mixed bipolar/MOS circuit design. The CBiCMOS technology promises further improvements to the performance of BiCMOS circuits.

2.7 BICMOS DESIGN RULES

In this section, a set of lambda-based design rules of a typical BiCMOS process⁶ (for 0.8 µm, λ = 0.4 µm) is presented. The corresponding device parameters are presented in Chapter 3.

The minimum length of the MOS gate is 2λ, and the minimum length and width of the bipolar emitter contact are 2λ and 4λ respectively. Table 2.2 describes the basic masks used in the layout design of BiCMOS devices. The rest of the masks are generated automatically.

Table 2.3 lists the design rules for the (design) masks only of a typical BiCMOS technology in terms of the parameter λ. The corresponding graphical representation of the design rules is illustrated in Plate I. Plate II shows the layouts of minimum size PMOS, NMOS and bipolar transistors in a 0.8 µm BiCMOS technology.

⁶ The given design rules are typical of a generic 0.8 µm high-performance BiCMOS process.
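Because the rules are expressed in λ, converting them to physical dimensions is a single multiplication. The sketch below applies λ = 0.4 µm to a few of the rules quoted in this section (gate length, emitter contact dimensions, N-well width and spacing); the dictionary and function names are illustrative assumptions.

```python
LAMBDA_UM = 0.4  # value of lambda for the 0.8 um process described in the text

# A few rules from this section, expressed in multiples of lambda.
RULES_IN_LAMBDA = {
    "MOS gate minimum length": 2,
    "emitter contact minimum length": 2,
    "emitter contact minimum width": 4,
    "N-well minimum width": 12,
    "N-well minimum spacing": 12,
}


def to_microns(rule_in_lambda, lambda_um=LAMBDA_UM):
    """Convert a lambda-based rule to micrometres."""
    return rule_in_lambda * lambda_um


for name, value in RULES_IN_LAMBDA.items():
    print(f"{name}: {value} lambda = {to_microns(value):.1f} um")
```

The 2λ gate length evaluates to 0.8 µm, which matches the feature size of the process; the same rule set could be reused for another technology by changing only the value of λ, which is the main attraction of lambda-based rules.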


[Figure 2.21(a). Process flow labels: P-substrate; N+/P+ buried layer; N-type epitaxial layer; N/P twin well (1st P-well for PNP); field isolation; collector deep N+; deep P+ I/I for NMOS retrograde well and 2nd P-well for PNP (P+ buried layer); gate (CMOS); NMOS S/D (N+ extrinsic base for PNP); PMOS S/D (P+ extrinsic base for NPN); NPN base; PNP base; contact holes; N+ and P+ emitter implant; metallization.]

Figure 2.21 (a) Fabrication process flow; (b) Cross-sectional view of CBiCMOS [48].

Table 2.2 Basic BiCMOS Design Masks.

N-well (NW): The NW mask is used to define the N substrate (bulk) of the PMOS and the N-collector of the NPN transistor.

N+ deep collector (CN): The CN mask defines the area which is exposed for the N+ sink implantation.

P base (CP): The CP mask defines the region which is to receive a P-implant to create the base diffusion.

Polysilicon (PO): The PO mask defines the gate and the emitter electrodes, and the polysilicon interconnect layer.

Emitter window (EW): The EW mask defines the opening for the emitter window.

N+ and P+ (DN and DP): The DN (DP) mask defines the N+ (P+) source and drain regions of the N-channel (P-channel) device within the P-well (N-well), and the body contact regions in the N-well (P-well) respectively.

Contact (CO): The CO mask defines the contact openings.

Metal 1 (M1): The M1 mask defines the metal 1 interconnects.

Via (VIA): The VIA mask defines the openings of the via that connects metal 1 to metal 2.

Metal 2 (M2): The M2 mask defines the metal 2 interconnects.

Table 2.3 Design rules of a typical BiCMOS technology, in terms of λ.

1. N-well (NW)
1.1 minimum width: 12λ
1.2 minimum spacing: 12λ

2. N+ diffusion (DN)
2.1 minimum width: 3λ
2.2 minimum spacing: 3λ
2.3 minimum NW overlap of DN: 0λ
2.4 minimum NW to external DN spacing: 6λ

3. P+ diffusion (DP)
3.1 minimum width: 3λ
3.2 minimum spacing: 3λ
3.3 minimum NW overlap of DP: 4λ
3.4 minimum NW to external DP spacing: 4λ
3.5 minimum space to DN (same potential): 0λ
3.6 minimum space to DN (different potential): 3λ

4. N-collector plug (CN)
4.1 minimum width: 4λ
4.2 minimum spacing: 12λ
4.3 minimum space to NW: 10λ
4.4 minimum NW overlap of CN: 3λ
4.5 minimum space to DN: 6λ
4.6 minimum space to DP: 5λ

5. P-base diffusion (CP)
5.1 minimum width: 4λ
5.2 minimum spacing: 4λ
5.3 minimum NW overlap of CP: 3λ
5.4 minimum space to CN: 5λ
5.5 minimum space to DN: 3λ
5.6 minimum space to DP: 3λ

6. Polysilicon (PO)
6.1 minimum width: 2λ
6.2 minimum spacing: 3λ
6.3 minimum space to DP or DN: 2λ
6.4 gate overhang of DP or DN: 2λ
6.5 minimum space to CN or CP: 1λ

7. Emitter window (EW)
7.1 minimum width: 2λ
7.2 minimum length: 4λ
7.3 minimum spacing: 3λ
7.4 minimum CP overlap of EW: 2λ
7.5 minimum poly overlap of EW: 2λ

8. Contact (CO)
8.1 minimum size (single)
8.2 minimum size (double)
8.3 minimum spacing
8.4 minimum DN or DP overlap of CO
8.5 minimum space to gate
8.6 minimum PO overlap of CO
8.7 minimum CN or CP overlap of CO
8.8 minimum PO to CO spacing in P base
8.9 minimum poly emitter CO to CP spacing
(values printed for this group, not individually assignable: 1λ, 1λ, 2λ, 2λ)

9. Metal 1 (M1)
9.1 minimum width: 2λ
9.2 minimum spacing: 3λ
9.3 minimum M1 overlap of CO: 1λ
9.4 maximum current density: 1 mA/µm

10. Metal 2 (M2)
10.1 minimum width
10.2 minimum spacing
10.3 maximum current density

11. Via (VIA)
11.1 minimum size
11.2 minimum spacing
11.3 minimum M1 or M2 overlap of VIA
11.4 minimum VIA to CO spacing
11.5 minimum PO to VIA spacing
11.6 minimum PO overlap of VIA

Plate I: Design Rules of Table 2.3.

Plate II: Layouts of minimum size PMOS, NMOS and bipolar transistors.

2.8 SILICON ON INSULATOR

Silicon On Insulator (SOI) has recently received renewed interest for low-voltage and low-power applications. This is due to the reduction of its cost and the improvement of its performance at lower voltage. Thin-film SOI CMOS processes have emerged and demonstrated excellent characteristics for deep submicron ULSI applications.

Many techniques exist to grow silicon on insulator [51]. The most mature technique is the epitaxial growth of Silicon On Sapphire (SOS). Many LSI/VLSI circuits have been fabricated using SOS technology. SOI can also be produced by using what is called SIMOX (Separation by IMplanted OXygen) [52] technology. It is fabricated simply by the formation of a buried oxide (SiO2) through implantation of oxygen underneath the surface of the silicon, as illustrated in Fig. 2.22. The dose and energy of the oxygen ions are as high as 2 x 10^18 cm^-2 and 200 keV respectively. A subsequent thermal annealing at high temperature is performed to improve the quality of the silicon overlayer. The buried oxide can be several hundreds of nm thick and the thin silicon layer can be several tens of nm thick. Compared to SOS, SOI SIMOX materials have better defect density and better control of the thin silicon layer. The dislocation density can be lower than 10^[exponent illegible] cm^-2. One important phenomenon which exists in CMOS SOI devices is the kink effect. It consists of a "kink" which appears in the output characteristics of an SOI MOSFET, as illustrated in Fig. 2.23. It is due mainly to the floating substrate of an NMOS device. An explanation of this phenomenon can be found in [51].


Figure 2.23 Kink effect in the output characteristics of an SOI MOS device (drain current versus drain voltage).

SOI SIMOX is now a mature material and represents a potential technology for low-power applications. Several LSI/VLSI circuits have been fabricated in SOI/SIMOX, particularly for low-power applications. Such circuits include a PLL (Phase Locked Loop) for wireless terminal applications [54], and a 1.2-GHz frequency divider operating under a 1-V power supply [55]. The SOI technology was also applied to design a fully pipelined 512-Kb SRAM [53]. This SRAM worked successfully down to 0.? V with an access time of less than 5 ns.

Fig. 2.24 shows a thin film SOI/SIMOX CMOS process cross-section. The process starts with the formation of a buried oxide in the silicon wafer as explained above [Fig. 2.24(a)]. Then, an oxide is grown on the surface silicon and a nitride layer is deposited. Silicon nitride is used as a mask to protect the active region from oxidation. The nitride/oxide layers are patterned and a LOCOS isolation is applied [Fig. 2.24(b)]. At the end, the nitride/oxide layers are removed. This is followed by a P-type ion implantation (I/I) to adjust the threshold voltage of the N-channel transistor. Similarly, the threshold voltage of the P-channel transistor is adjusted by I/I. A thin gate oxide is then grown and a layer of polysilicon is deposited and doped with phosphorus. Then the P+ source and drain regions of the PMOS are patterned and implanted with boron [Fig. 2.24(c)]. Similarly, the N+ S/D regions of the NMOS are patterned and implanted with phosphorus. A thick oxide is then deposited as an isolation layer between the polysilicon and the subsequent metal layer. The oxide is etched at contact locations. Next, the metal layer (aluminum) is deposited over the whole surface. Finally, the metal is etched and annealed.

[Figure 2.24 step labels: strip nitride and oxide; P-Ch VT implant; N-Ch VT pattern; N-Ch VT implant; grow gate oxide; deposit polysilicon and pattern.]

Figure 2.24 Main process steps of CMOS thin film SOI/SIMOX devices.

This simple process description shows that the SOI process is much simpler than bulk CMOS. For instance, the wells are no longer needed, and the punchthrough implants are also unnecessary if thin-film SOI is used. Fig. 2.25 shows a …


Due to the dielectric isolation, SOI MOS devices have several advantages over bulk CMOS, such as absence of latch-up, high packing density and lower parasitic capacitances. SOI reduces the circuit capacitance by 30% [57]. It has been discovered that if the silicon (containing the devices) is made sufficiently thin (< 100 nm), the MOSFET devices are fully depleted [51] even when VGS = 0. Fully depleted thin film SOI MOS devices offer attractive characteristics for CMOS applications, such as immunity from short channel effects, absence of the kink effect, superior subthreshold leakage and high drain saturation current (due to low channel doping) [58, 59, 60].
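The benefit of the 30% capacitance reduction can be illustrated with the usual switching-power expression P = a C V^2 f, which is assumed here rather than taken from this section. The capacitance, supply voltage and frequency values below are hypothetical and only serve to show that, at equal voltage and frequency, dynamic power scales directly with capacitance.

```python
def dynamic_power_w(c_farads, vdd_v, freq_hz, activity=1.0):
    """Switching power under the assumed model P = a * C * VDD^2 * f."""
    return activity * c_farads * vdd_v ** 2 * freq_hz


# Hypothetical operating point, for illustration only.
BULK_CAPACITANCE_F = 1e-9
VDD_V = 3.3
FREQ_HZ = 50e6

# SOI capacitance: 30% lower than bulk, as quoted in the text.
SOI_CAPACITANCE_F = 0.7 * BULK_CAPACITANCE_F

bulk_power = dynamic_power_w(BULK_CAPACITANCE_F, VDD_V, FREQ_HZ)
soi_power = dynamic_power_w(SOI_CAPACITANCE_F, VDD_V, FREQ_HZ)

print(f"bulk CMOS: {bulk_power * 1e3:.1f} mW")
print(f"SOI:       {soi_power * 1e3:.1f} mW")
print(f"ratio SOI/bulk: {soi_power / bulk_power:.2f}")
```

Under this model the power ratio equals the capacitance ratio, so the saving is independent of the particular voltage and frequency chosen.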

Unfortunately, the technology has minor disadvantages, such as floating body effects, which result in i) floating body induced threshold voltage lowering and ii) low drain-to-source breakdown voltage. For a 1 V power supply this is not a problem. However, for 3 V operation this could be an important limitation. Also, the threshold voltage is very sensitive to the thickness uniformity of the superficial silicon. In addition, the low thermal conductivity of the oxide underneath the thin film silicon layer is a severe problem when the SOI circuit is operating at high frequency. Therefore, technological improvements are still needed to solve these limitations.

2.9 CHAPTER SUMMARY

In this chapter, we have studied the process technologies of CMOS and bipolar devices. We have shown that the advanced CMOS and bipolar processes are converging, and many process techniques can be shared for the fabrication of both devices. The different options for merging bipolar and CMOS devices were then discussed. Three examples of BiCMOS processes with different complexities were presented. The complementary BiCMOS process was also considered. A table of design rules for a state-of-the-art BiCMOS technology was given for layout exercises. Several advanced technologies such as CMOS SOI/SIMOX and CMOS-SJET were reviewed for low-voltage operation.

REFERENCES

[1] F.M. Wanlass and C.T. Sah, "Nanowatt Logic using Field-Effect MOS Triodes," International Solid-State Circuits Conference Tech. Dig., pp. 32-33, 1963.

[2] L.C. Parrillo, R.S. Payne, R.E. Davis, G.W. Reutlinger, and R.L. Field, "Twin-Tub CMOS: A Technology for VLSI Circuits," International Electron Devices Meeting Tech. Dig., pp. 752-755, December 1980.

[3] Y. Taur et al., "High-Performance 0.1 μm CMOS Devices with 1.5 V Power Supply," International Electron Devices Meeting Tech. Dig., pp. 127-130, December 1993.

[4] K.F. Lee et al., "Room Temperature 0.1 μm CMOS Technology with 11.8 ps Gate Delay," International Electron Devices Meeting Tech. Dig., pp. 131-134, December 1993.

[5] K. Takeuchi et al., "0.15 μm CMOS with High Reliability and Performance," International Electron Devices Meeting Tech. Dig., pp. 883-886, December 1993.

[6] T. Yamazaki, K. Goto, T. Fukano, Y. Nara, T. Sugii, and T. Ito, "21 ps Switching 0.1 μm-CMOS at Room Temperature using High Performance Co Salicide Process," International Electron Devices Meeting Tech. Dig., pp. 906-908, December 1993.

[7] H. Oyamatsu, K. Kinugawa, and M. Kakumu, "Design Methodology of Deep Submicron CMOS Devices for 1 V Operation," Symposium on VLSI Technology Tech. Dig., pp. 89-90, 1993.

[8] H. Yoshimura, F. Matsuoka, and M. Kakumu, "New CMOS Shallow Junction Well FET Structure (CMOS-SJET) for Low Power-Supply Voltage," International Electron Devices Meeting Tech. Dig., pp. 909-912, December 1992.

[9] T. Uchino, T. Shiba, T. Kikuchi, Y. Tamaki, A. Watanabe, Y. Kiyota, and M. Honda, "15-ps ECL/74-GHz fT Bipolar Technology," International Electron Devices Meeting Tech. Dig., pp. 67-70, December 1993.


[10] T.H. Ning and D.D. Tang, "Bipolar Trends," Proc. IEEE, vol. 74, no. 12, pp. 1669-1671, December 1986.

[11] T. Nakamura, T. Miyazaki, S. Takahashi, T. Kure, T. Okabe, and M. Nagata, "Self-Aligned Bipolar Transistor with Polysilicon Sidewall Base Electrode for High Packing Density and High Speed," IEEE Journal of Solid-State Circuits, vol. 17, no. 2, pp. 226-230, April 1982.

[12] T.H. Ning and R.D. Isaac, "Effect of Emitter Contact on Current Gain of Silicon Bipolar Devices," IEEE Trans. Electron Devices, vol. ED-27, pp. 2051-2055, November 1980.

[13] A.K. Kapoor and D.J. Roulston, "Polysilicon Emitter Bipolar Transistors," IEEE Press Book, 1989.

[14] M.I. Elmasry, "Digital Bipolar Integrated Circuits," John Wiley & Sons, New York, 1983.

[15] B. Razavi, Y. Ota, and R.G. Swartz, "Design Techniques for Low-Voltage High-Speed Digital Bipolar Circuits," IEEE J. Solid-State Circuits, vol. 29, no. 3, pp. 332-339, March 1994.

[16] W. Wilhelm and P. Weger, "Low-Power Bipolar Logic," International Solid-State Circuits Conf. Tech. Dig., pp. 94-95, February 1994.

[17] E. Kooi, J.G. Van Lierop, and J.A. Appels, "Formation of Silicon Nitride at a Si-SiO2 Interface during Local Oxidation of Silicon and During Heat Treatment of Oxidized Silicon in NH3 Gas," J. Electrochem. Soc., vol. 123, p. 1117, 1976.

[18] R.D. Rung, H. Momose, and Y. Nagakubo, "Deep-Trench Isolated CMOS Devices," International Electron Devices Meeting Tech. Dig., pp. 6-9, December 1982.

[19] T. Yamaguchi, S. Morimoto, G. Kawamoto, H.K. Park, and G.C. Eiden, "High-Speed Latch-up Free 0.5 μm-Channel CMOS using Self-Aligned TiSi2 and Deep-Trench Isolation Technologies," International Electron Devices Meeting Tech. Dig., pp. 522-525, December 1983.

[20] R.D. Rung, "Trench Isolation Prospects for Application in CMOS VLSI," International Electron Devices Meeting Tech. Dig., pp. 574-577, December 1984.

[21] H. Mikoshiba, T. Homma, and K. Hamano, "A New Trench Isolation Technology as a Replacement for LOCOS," International Electron Devices Meeting Tech. Dig., pp. 578-581, December 1984.

[22] P. Singer, "Selective Epitaxial Growth Finds New Applications," Semiconductor International, p. 15, January 1988.

[23] R.A. Chapman et al., "An 0.8 μm CMOS Technology for High-Performance Logic Applications," International Electron Devices Meeting Tech. Dig., pp. 362-365, December 1987.

[24] K.Y. Chiu, R. Fang, J. Lin, and J.L. Moll, "The SWAMI - A Defect Free and Near-Zero Bird's Beak Local Oxidation Technology for VLSI," Symp. on VLSI Technology Tech. Dig., pp. 28-29, 1982.

[25] K.Y. Chiu, J.L. Moll, and J. Manoliu, "A Bird's Beak Free Local Oxidation Technology Feasible for VLSI Circuits Fabrication," IEEE Trans. on Electron Devices, vol. ED-29, pp. 536-540, 1982.

[26] J. Hui, P. Vande Voorde, and J. Moll, "Scaling Limitations of Submicron Local Oxidation Technology," International Electron Devices Meeting Tech. Dig., pp. 392-395, December 1985.

[27] H.B. Pogge, "Trench Isolation Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 18-25, September 1990.

[28] Y. Niitsu, "Latch-up Free CMOS Structure using Shallow Trench Isolation," International Electron Devices Meeting Tech. Dig., pp. 509-512, December 1985.

[29] H. Yamamoto, O. Mizuno, T. Kubota, M. Nakamae, H. Shiraki, and Y. Ikushima, "High-Speed Performance of a Basic ECL Gate with 1.25 Micron Design Rule," Symp. on VLSI Technology Tech. Dig., pp. 38-39, 1981.

[30] Y. Tamaki, T. Shiba, N. Honma, S. Mizuo, and A. Hayasaka, "New U-Groove Isolation Technology for High-Speed Bipolar Memory," Symp. on VLSI Technology Tech. Dig., pp. 24-25, 1983.

[31] D.D. Tang, P.M. Solomon, T.H. Ning, R.D. Isaac, and R.E. Burger, "1.25 μm Deep-Groove-Isolated Self-Aligned Bipolar Circuits," IEEE Journal of Solid-State Circuits, vol. SC-17, pp. 925-931, 1982.

[32] H.C. Lin, J.C. Ho, R.R. Iyer, and K. Kwong, "CMOS-Bipolar Transistor Structure," IEEE Trans. Electron Devices, vol. ED-16, no. 11, pp. 945-951, November 1969.

[33] T. Ikeda, A. Watanabe, Y. Nishio, I. Masuda, N. Tamba, M. Okada, and K. Ogiue, "High-Speed BiCMOS Technology with a Buried Twin Well Structure," IEEE Trans. on Electron Devices, vol. ED-34, no. 6, pp. 1304-1309, June 1987.


[34] H. Momose, K.M. Cham, C.I. Drowley, H.R. Grinolds, and R.S. Fu, "0.5 Micron BiCMOS Technology," International Electron Devices Meeting Tech. Dig., pp. 838-840, December 1987.

[35] A.R. Alvarez, J. Teplik, D.W. Schucker, T. Hulseweh, H.B. Liang, M. Dydyk, and I. Rahim, "Second Generation BiCMOS Gate Array Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 113-117, 1987.

[36] B. Bastani, C. Lage, L. Wong, J. Small, R. Lahri, L. Bouknight, T. Bowman, J. Manoliu, and T. Tuntasood, "Advanced 1 Micron BiCMOS Technology for High Speed 256k SRAM's," Symp. on VLSI Technology Tech. Dig., pp. 41-42, 1987.

[37] T. Yamaguchi and T.H. Yuzuriha, "Process Integration and Device Performance of a Submicron BiCMOS with 16-GHz fT Double Poly-Bipolar Devices," IEEE Trans. on Electron Devices, vol. 36, no. 5, pp. 890-896, May 1989.

[38] C.K. Lau, C-H Lin, and D.L. Packwood, "Sub-micron BiCMOS Process Design for Manufacturing," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 76-83, 1992.

[39] C.H. Wang and J. Van Der Velden, "A Single-Poly BiCMOS Technology with a 30 GHz Bipolar fT," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 234-237, October 1994.

[40] H. Yoshida, H. Suzuki, Y. Kinoshita, K. Imai, T. Akimoto, K. Tokashiki, and T. Yamazaki, "Process Integration Technology for Low Process Complexity BiCMOS using Trench Collector Sink," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 230-233, October 1994.

[41] J.M. Sung et al., "BEST2 - A High Performance Super-Aligned 3V/5V BiCMOS Technology with Extremely Low Parasitics for Low-Power Mixed-Signal Applications," IEEE Custom Integrated Circuits Conf. Tech. Dig., pp. 15-18, May 1994.

[42] H.J. Shin, "Performance Comparison of Driver Configurations and Full-Swing Techniques for BiCMOS Logic Circuits," IEEE Journal of Solid-State Circuits, vol. 25, no. 3, pp. 863-865, June 1990.

[43] S.H.K. Embabi, A. Bellaouar, M.I. Elmasry, and R.A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. SC-26, pp. 150-153, February 1991.

[44] M. Hiraki, K. Yano, M. Minami, K. Sato, N. Matsuzaki, A. Watanabe, T. Nishida, K. Sasaki, and K. Seki, "A 1.5-V Full-Swing BiCMOS Logic Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1568-1574, November 1992.

[45] Y. Kobayashi, C. Yamaguchi, Y. Amemiya, and T. Sakai, "High Performance LSI Process Technology: SST CBiCMOS," International Electron Devices Meeting Tech. Dig., pp. 760-763, December 1988.

[46] K. Higashitani, H. Honda, K. Ueda, M. Hatanaka, and S. Nagao, "A Novel CBi-CMOS Technology by DIIP Process," Symp. on VLSI Technology Tech. Dig., pp. 77-78, 1990.

[47] T. Maeda, K. Ishimaru, and H. Momose, "Lower Submicron FCBiMOS (Fully Complementary BiMOS) Process with RTP and MeV Implanted 5GHz Vertical PNP Transistor," Symp. on VLSI Technology Tech. Dig., pp. 79-80, 1990.

[48] W.R. Burger, C. Lage, B. Landau, M. DeLong, and J. Small, "An Advanced 0.8 Micron Complementary BiCMOS Technology for Ultra-High Speed Circuit Performance," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 78-81, December 1990.

[49] S.W. Sun et al., "A Fully Complementary BiCMOS Technology for Sub-Half-Micrometer Microprocessor Applications," IEEE Trans. Electron Devices, vol. 39, no. 12, pp. 2733-2739, December 1992.

[50] T. Ikeda, T. Nakashima, S. Kubo, H. Jouba, and M. Yamawaki, "A High Performance CBiCMOS with Novel Self-Aligned Vertical PNP," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 238-240, October 1994.

[51] J.P. Colinge, "SOI Technology: Materials to VLSI," Kluwer Academic Publishers, 1991.

[52] K. Izumi, M. Doken, and H. Ariyoshi, "CMOS Devices Fabricated on Buried SiO2 Layers Formed by Oxygen Implantation into Silicon," Electron. Lett., vol. 14, pp. 593-594, 1978.

[53] G.G. Shahidi, T.H. Ning, R.H. Dennard, and B. Davari, "SOI for Low-Voltage and High-Speed CMOS," International Conf. SSDM, Japan, pp. 265-267, 1994.

[54] Y. Kado, T. Ohno, M. Harada, K. Deguchi, and T. Tsuchiya, "Enhanced Performance of Multi-GHz PLL LSIs using Sub-1/4-micron Gate Ultrathin Film CMOS/SIMOX Technology with Synchrotron X-ray Lithography," IEDM Tech. Digest, pp. 243-246, December 1993.

[55] M. Fujishima, K. Asada, Y. Omura, and K. Izumi, "Low-Power 1/2 Frequency Dividers using 0.1-μm CMOS Circuits Built with Ultrathin SIMOX Substrate," IEEE Journal of Solid-State Circuits, vol. 28, no. 4, pp. 510-512, April 1993.

[56] T. Ohno, Y. Kado, M. Harada, and T. Tsuchiya, "A High-Performance Ultra-Thin Quarter-Micron CMOS/SIMOX Technology," IEEE Symposium on VLSI Technology Tech. Dig., pp. 25-26, 1993.

[57] Y. Yamaguchi, A. Ishibashi, M. Shimizu, T. Nishimura, K. Tsukamoto, K. Horie, and Y. Akasaka, "A High-Speed 0.6-μm 16K CMOS Gate Array on a Thin SIMOX Film," IEEE Trans. Electron Devices, vol. 40, no. 1, pp. 179-186, January 1993.

[58] J.P. Colinge, "Subthreshold Slope of Thin Film SOI MOSFET's," IEEE Electron Device Letters, pp. 274-276, September 1988.

[59] J.C. Sturm, K. Tokunaga, and J.P. Colinge, "Increased Drain Saturation Current in Ultrathin SOI MOS Transistors," IEEE Electron Device Letters, vol. 9, no. 9, pp. 460-?, September 1988.

[60] Y. Omura, S. Nakashima, K. Izumi, and T. Ishii, "0.1-μm Gate Ultrathin Film CMOS Devices using SIMOX Substrate with 80-nm Thick Buried Oxide Layer," IEDM Tech. Dig., pp. 675-678, December 1991.

3 LOW-VOLTAGE DEVICE MODELING

The objective of this chapter is two-fold. It is intended to review the basics of the MOS transistor, which is a prerequisite for Chapters 4 to 7, and to introduce commonly used models of both MOS and bipolar devices [Sections 3.1, 3.2, and 3.6]. In this chapter we consider simple analytical models which can be used for circuit analysis and design of deep-submicrometer MOSFETs at low voltage. Also, a simple model to compute the leakage current of MOSFETs is presented [Section 3.3]. The more sophisticated SPICE device models are also presented to allow the reader to appreciate the meaning of the model parameters as well as the capabilities and limitations of these models. The SPICE parameters for the 0.8 μm CMOS/BiCMOS process presented in Chapter 2 are included in this chapter for readers who are interested in designing and simulating low-voltage CMOS circuits as well as BiCMOS circuits. In Section 3.4, supply voltage scaling due to reliability and power dissipation issues is presented.

3.1 MOSFET STRUCTURE AND OPERATION

Fig. 3.1 shows cross-sections and views of an N-channel MOS transistor. By applying a positive voltage $V_{GS}$ on the gate, a depletion layer is induced in the channel. A further increase in $V_{GS}$ results in a surface inversion layer. The surface charge of the semiconductor ($Q_S$, in coul/cm$^2$) is equal in magnitude to the charge of the gate electrode ($Q_G$, in coul/cm$^2$). Thus, we have

$$Q_S = -Q_G = -(V_{GS} - V_{FB} - \phi_s)\,C_{ox} \tag{3.1}$$

where $V_{GS}$ is the gate-source voltage and $\phi_s$ is the semiconductor surface potential. $C_{ox}$ is the gate oxide capacitance per unit area and is given by

$$C_{ox} = \frac{\varepsilon_{ox}}{t_{ox}} \tag{3.2}$$

where $\varepsilon_{ox}$ is the oxide permittivity and $t_{ox}$ is the gate oxide thickness. The flatband voltage $V_{FB}$ is given by

$$V_{FB} = \phi_{ms} - \frac{Q_o}{C_{ox}} \tag{3.3}$$

$Q_o$ is the total of all charges in the oxide and near the oxide/silicon interface. This charge is positive. The work function difference between the gate electrode and the semiconductor, $\phi_{ms}$, depends on the type of the electrode and the doping concentration of the semiconductor. For an aluminum electrode, we have

$$\phi_{ms} = -0.61 + \phi_f \tag{3.4}$$

For an N+ polysilicon electrode, we have

$$\phi_{ms} = -0.55 + \phi_f \tag{3.5}$$

The Fermi potential $\phi_f$ in Equations (3.4) and (3.5) is given by

$$\phi_{fp} = -V_t \ln\left(\frac{N_a}{n_i}\right) \quad \text{for P-type Si} \tag{3.6}$$

$$\phi_{fn} = +V_t \ln\left(\frac{N_d}{n_i}\right) \quad \text{for N-type Si} \tag{3.7}$$

where $V_t = kT/q$, $N_a$ and $N_d$ are the acceptor and donor concentrations, and $n_i$ is the intrinsic carrier concentration. The charge $Q_S$ is the sum of the charge in the depletion layer $Q_B$ and the inversion layer $Q_I$. Therefore,

$$V_{GS} = V_{FB} + \phi_s - \frac{Q_B + Q_I}{C_{ox}} \tag{3.8}$$

The bulk depletion charge (per unit area) consists of ionized acceptors (P-type substrate) or donors (N-type substrate). The depletion charge of a P-type bulk, with zero bulk-source voltage ($V_{BS} = 0$), is given by

$$Q_{B0} = -qN_aW_D \tag{3.9}$$

where $q$ is the electron charge and $N_a$ is the acceptor concentration. The width of the depletion layer in the bulk ($W_D$) is given by

$$W_D = \sqrt{\frac{2\varepsilon_{si}\phi_s}{qN_a}} \tag{3.10}$$

where $\varepsilon_{si}$ is the permittivity of silicon.

Figure 3.1 (a) The layout and cross-sectional view of an NMOS transistor; (b) Symbols of different types of MOS transistors: NMOS enhancement mode, NMOS depletion mode, PMOS enhancement mode.
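The Fermi potential and work function relations of Equations (3.4) to (3.7) can be sketched numerically as follows. This is only an illustration: the intrinsic concentration (1.45e10 cm^-3 at 300 K), the doping level and the function names are assumptions that do not come from the text.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C
N_INTRINSIC = 1.45e10              # cm^-3, assumed value for silicon near 300 K


def thermal_voltage(temp_k=300.0):
    """Thermal voltage V_t = kT/q in volts."""
    return BOLTZMANN * temp_k / ELECTRON_CHARGE


def fermi_potential(doping_cm3, substrate, temp_k=300.0):
    """Fermi potential: negative for P-type, positive for N-type silicon."""
    magnitude = thermal_voltage(temp_k) * math.log(doping_cm3 / N_INTRINSIC)
    if substrate == "p":
        return -magnitude
    if substrate == "n":
        return magnitude
    raise ValueError("substrate must be 'p' or 'n'")


def work_function_difference(phi_f, gate):
    """phi_ms for the two gate electrodes quoted in Equations (3.4) and (3.5)."""
    offsets = {"aluminum": -0.61, "n+poly": -0.55}
    return offsets[gate] + phi_f


# Hypothetical P-type substrate with N_a = 1e16 cm^-3 and an N+ polysilicon gate.
phi_f = fermi_potential(1e16, "p")
phi_ms = work_function_difference(phi_f, "n+poly")
print(f"phi_f = {phi_f:.3f} V, phi_ms = {phi_ms:.3f} V")
```

The sign convention follows the text: the Fermi potential is negative for a P-type substrate, so the work function difference becomes more negative as the substrate doping increases.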

The turn-on (or threshold) voltage of an NMOS transistor is defined as the gate-source voltage at which the surface potential $\phi_s$ is equal to $2|\phi_f|$. This condition also defines what is known as strong inversion. At the onset of strong inversion we can assume that $Q_S \approx Q_B$. Using Equation (3.8), we can write the following expression for the threshold voltage

$$V_{T0} = V_{FB} + 2|\phi_f| - \frac{Q_{B0}}{C_{ox}} \tag{3.11}$$

$Q_{B0}$ is equal to $-qN_aW_{Dm}$, where $W_{Dm} = W_D(\phi_s = 2|\phi_f|)$. Thus, the threshold voltage can be rewritten as

$$V_{T0} = V_{FB} + 2|\phi_f| + \frac{\sqrt{2q\varepsilon_{si}N_a\,(2|\phi_f|)}}{C_{ox}} \tag{3.12}$$

If the bulk-source junction is reverse biased ($|V_{BS}| > 0$), the threshold voltage becomes

$$V_T = V_{FB} + 2|\phi_f| + \frac{\sqrt{2q\varepsilon_{si}N_a\,(|V_{BS}| + 2|\phi_f|)}}{C_{ox}} \tag{3.13}$$

This equation can be rewritten as

$$V_T = V_{T0} + \gamma\left(\sqrt{2|\phi_f| + |V_{BS}|} - \sqrt{2|\phi_f|}\right) \tag{3.14}$$

where the body effect coefficient $\gamma$ is given by

$$\gamma = \frac{\sqrt{2q\varepsilon_{si}N_a}}{C_{ox}} \tag{3.15}$$
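As a rough numerical illustration of Equations (3.2) and (3.15), the sketch below computes the oxide capacitance and the body effect coefficient in SI units. The relative permittivities (3.9 for SiO2, 11.7 for silicon), the doping level and the function names are assumptions made for this example; the 17.5 nm oxide thickness is taken from the TOX entry of Table 3.2.

```python
import math

ELECTRON_CHARGE = 1.602176634e-19  # C
EPS_0 = 8.8541878128e-12           # F/m
EPS_SI = 11.7 * EPS_0              # assumed relative permittivity of silicon
EPS_OX = 3.9 * EPS_0               # assumed relative permittivity of SiO2


def oxide_capacitance(t_ox_m):
    """Gate oxide capacitance per unit area in F/m^2, Equation (3.2)."""
    return EPS_OX / t_ox_m


def body_effect_coefficient(n_a_cm3, t_ox_m):
    """Body effect coefficient gamma in V^(1/2), Equation (3.15)."""
    n_a_m3 = n_a_cm3 * 1e6  # convert cm^-3 to m^-3
    return math.sqrt(2.0 * ELECTRON_CHARGE * EPS_SI * n_a_m3) / oxide_capacitance(t_ox_m)


c_ox = oxide_capacitance(17.5e-9)
gamma = body_effect_coefficient(1e16, 17.5e-9)  # hypothetical N_a = 1e16 cm^-3
print(f"C_ox = {c_ox:.3e} F/m^2, gamma = {gamma:.3f} V^0.5")
```

Thinner oxides raise $C_{ox}$ and therefore reduce $\gamma$, while heavier substrate doping increases it.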


This value is negative and is not suitable for digital circuits, where a positive $V_{T0}$ is required for switching. To get a reasonable $V_{T0}$, the device surface is implanted with boron. The implanted dose $D_I$ causes $V_{T0}$ to increase by the amount $qD_I/C_{ox}$. The threshold voltage is hence given by

$$V_{T0} = V_{FB} + 2|\phi_f| + \gamma\sqrt{2|\phi_f|} + \frac{qD_I}{C_{ox}} \tag{3.16}$$

Consider now the previous example. With $D_I = 1.725 \times 10^{12}$ cm$^{-2}$ and $\gamma = 0.238$ V$^{1/2}$, we find that $V_T$ is equal to 0.7 V when $|V_{BS}| = 0$ V and is equal to 0.98 V when $|V_{BS}| = 3.3$ V.
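The body effect figures quoted in this example can be checked with a short sketch of Equation (3.14). The values $V_{T0} = 0.7$ V and $\gamma = 0.238$ V$^{1/2}$ come from the text; the surface potential term $2|\phi_f| = 0.7$ V is an assumption chosen for illustration, since its value is not stated here, and the function name is hypothetical.

```python
import math


def threshold_with_body_effect(v_t0, gamma, v_bs_abs, two_phi_f=0.7):
    """Threshold voltage under reverse body bias, Equation (3.14).

    two_phi_f is 2|phi_f| in volts; 0.7 V is an assumed illustrative value.
    """
    return v_t0 + gamma * (math.sqrt(two_phi_f + v_bs_abs) - math.sqrt(two_phi_f))


for v_bs in (0.0, 3.3):
    v_t = threshold_with_body_effect(0.7, 0.238, v_bs)
    print(f"|V_BS| = {v_bs:.1f} V  ->  V_T = {v_t:.2f} V")
```

With the assumed $2|\phi_f|$, the result at $|V_{BS}| = 3.3$ V lands close to the 0.98 V quoted above, which suggests the assumption is in a reasonable range.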

The symbols of the NMOS and PMOS transistors are shown in Fig. 3.1(b). Typical values of $V_T$ are -2.5 V to -4 V for depletion-mode NMOS devices. For low-voltage CMOS they are 0.3 V to 0.8 V for enhancement-mode NMOS devices, and -0.3 V to -0.8 V for enhancement-mode PMOS devices.

When $V_{GS} < V_{T0}$, the transistor is in the cutoff region, since no inversion layer exists, as shown in Fig. 3.2(a). The drain current is, therefore, approximately zero. When $V_{GS} > V_{T0}$, the channel is formed and a drain current flows from the drain to the source [Fig. 3.2(b)]. The transistor is in the linear region (also called ohmic region) when $V_{GD}$ (i.e. $V_{GS} - V_{DS}$) $\ge V_T$. When $V_{GS} > V_T$ and $V_{DS} > V_{GS} - V_T$ (i.e. $V_{GD} < V_T$), the channel is pinched off as illustrated in Fig. 3.2(c) and the device enters the saturation region. The drain-source voltage which causes the channel to pinch off at the drain edge is commonly known as the saturation drain-source voltage $V_{DSsat}$ and is equal to $V_{GS} - V_T$.

The voltage drop between the pinchoff point and the source is $V_{DSsat}$. Any $V_{DS}$ higher than $V_{DSsat}$ will appear between the pinchoff point and the drain. If we assume that the distance between the pinchoff point and the drain is extremely small compared with the overall length, then for $V_{DS} > V_{DSsat}$ the drain current is constant. The carriers which reach the pinchoff point are swept across to the drain by the potential ($V_{DS} - V_{DSsat}$) between the drain and the end of the channel.


3.2 SPICE MODELS OF THE MOS TRANSISTOR

3.2.1 The Simple MOS DC Model

Let us now analyze the simple DC model describing the I-V characteristics of an MOS transistor.

From Fig. 3.3 it can be shown that the element $dx$ has a resistance

$$dR = -\frac{dx}{W_{eff}\,\mu\,Q_I(x)} \tag{3.17}$$

We assume that the mobility ($\mu$) of the electrons in the channel of an NMOS device is constant. A current $I_{DS}$ crossing the incremental resistance $dR$ causes a voltage drop of

$$dV = I_{DS}\,dR \tag{3.18}$$

Substituting from Equation (3.17) in Equation (3.18) and integrating from the source to the drain, we obtain

$$I_{DS}\int_0^{L_{eff}} dx = -W_{eff}\,\mu \int_0^{V_{DS}} Q_I(V)\,dV \tag{3.19}$$

To solve this integration, we need to express the electron inversion charge density $Q_I(x)$ in terms of $V$. From Equation (3.8), we have

$$Q_I(x) = -C_{ox}\left[V_{GS} - V_{FB} - \phi_s(x) + \frac{Q_{B0}}{C_{ox}}\right] \tag{3.20}$$

The surface potential $\phi_s$ at any point $x$ along the channel is equal to $2|\phi_f| + V(x)$. By substituting $V_{T0}$ for $V_{FB} - Q_{B0}/C_{ox} + 2|\phi_f|$ [Equation (3.11)] in Equation (3.20) we get

$$Q_I(x) = -C_{ox}\left[V_{GS} - V_{T0} - V(x)\right] \tag{3.21}$$

The surface potential at the drain is larger than that at the source by $V_{DS}$. Therefore, the magnitude of $Q_I$ decreases with the distance across the channel. This is why the inversion layer is triangular, as illustrated in Fig. 3.3. Assuming that $Q_{B0}$ is constant across the channel and substituting for $Q_I$ from Equation (3.21) into Equation (3.19), we obtain

$$I_{DS}\,L_{eff} = \mu C_{ox} W_{eff} \int_0^{V_{DS}} \left[V_{GS} - V_{T0} - V\right] dV$$

$$I_{DS} = k_p\,\frac{W_{eff}}{L_{eff}}\left[(V_{GS} - V_{T0})V_{DS} - \frac{V_{DS}^2}{2}\right] \tag{3.24}$$

where $k_p$ is a process-dependent parameter defined as $k_p = \mu C_{ox}$. Equation (3.24) is valid only for $V_{DS} \le V_{DSsat}$ (ohmic region). When $V_{DS}$ exceeds $V_{DSsat}$ the drain-source current saturates. The saturation current can be found by substituting $V_{DSsat} = V_{GS} - V_{T0}$ for $V_{DS}$ in Equation (3.24) and is hence given by

$$I_{DSsat} = \frac{k_p}{2}\,\frac{W_{eff}}{L_{eff}}\,(V_{GS} - V_{T0})^2 \tag{3.25}$$

The characteristics of an MOS transistor based on Equations (3.24) and (3.25) are shown in Fig. 3.4. The current equations (3.24) and (3.25) have to be modified if the magnitude of the bulk-source voltage is greater than zero, by replacing $V_{T0}$ by $V_T$ [see Equation (3.14)]. Note that when $V_{DS}$ is small (say 60 mV), Equation (3.24) can be approximated by

$$I_{DS} \approx k_p\,\frac{W_{eff}}{L_{eff}}\,(V_{GS} - V_{T0})\,V_{DS} \tag{3.26}$$

This equation expresses a linear relationship between $I_{DS}$ and $V_{GS}$. Using linear extrapolation, $V_{T0}$ and $k_p$ can be determined as shown in Fig. 3.4(b).

The measured I-V characteristics show that the drain current, in the saturation region, is a weak function of $V_{DS}$. This is due to the channel length modulation phenomenon, which can be explained as follows. Let us define

$$L'_{eff} = L_{eff} - \Delta L \tag{3.27}$$

where $\Delta L$ is the width of the depletion layer between the pinchoff point and the drain, as shown in Fig. 3.5. The voltage across this depletion layer is $V_{DS} - V_{DSsat}$, therefore $\Delta L$ can be written as

$$\Delta L = \sqrt{\frac{2\varepsilon_{si}\,(V_{DS} - V_{DSsat})}{qN_a}} \tag{3.28}$$

The corrected saturation current becomes

$$I_{DS} = \frac{k_p}{2}\,\frac{W_{eff}}{L'_{eff}}\,(V_{GS} - V_{T0})^2 \tag{3.29}$$

If we assume that $\Delta L / L_{eff} \ll 1$, then we can rewrite the current as

$$I_{DS} \approx \frac{k_p}{2}\,\frac{W_{eff}}{L_{eff}}\left(1 + \frac{\Delta L}{L_{eff}}\right)(V_{GS} - V_{T0})^2 \tag{3.30}$$

The ratio $\Delta L / L_{eff}$ can be related to $V_{DS}$ by the following empirical relation

$$\frac{\Delta L}{L_{eff}} = \lambda V_{DS} \tag{3.31}$$

The channel modulation factor $\lambda$ is very small. A typical value of $\lambda$ is 0.01 V$^{-1}$.

The drain current model described so far is known as the LEVEL 1 (MOS1) model in SPICE (SPICE2G6, 3B1 or 3C1). This model is also called the Shichman-Hodges model. However, this model, which was used in the 1970s, is too simple to account for state-of-the-art CMOS devices and might lead to a 100% error in the current, particularly for low-voltage deep-submicrometer CMOS devices. Nevertheless, $k_p$ (or $\mu$) can be used as a fitting parameter to reduce this error. This model is most suitable for preliminary analysis.
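The LEVEL 1 equations above can be collected into a small sketch that selects the operating region and evaluates the drain current of an NMOS device from Equations (3.24), (3.25) and (3.30) with (3.31). The function name and the numerical values in the usage lines are hypothetical. Following the text, the channel length modulation factor is applied in saturation only, so with a nonzero $\lambda$ this sketch has a small step at $V_{DS} = V_{DSsat}$.

```python
def level1_drain_current(v_gs, v_ds, v_t, k_p, w_eff, l_eff, lam=0.0):
    """Simple LEVEL 1 style drain current of an NMOS transistor, in amperes.

    v_t is the threshold voltage (V_T0, or V_T when body bias is present),
    k_p = mu * C_ox in A/V^2, lam is the channel modulation factor in 1/V.
    """
    v_ov = v_gs - v_t              # gate overdrive, equal to V_DSsat in this model
    if v_ov <= 0.0:
        return 0.0                 # cutoff: no inversion layer
    beta = k_p * w_eff / l_eff
    if v_ds <= v_ov:
        # ohmic (linear) region, Equation (3.24)
        return beta * (v_ov * v_ds - 0.5 * v_ds ** 2)
    # saturation region, Equation (3.25) corrected by (1 + lambda * V_DS)
    return 0.5 * beta * v_ov ** 2 * (1.0 + lam * v_ds)


# Hypothetical device: k_p = 100 uA/V^2, W/L = 10 um / 1 um, V_T = 0.7 V.
for v_ds in (0.5, 2.0):
    i_ds = level1_drain_current(1.7, v_ds, 0.7, 100e-6, 10e-6, 1e-6, lam=0.01)
    print(f"V_DS = {v_ds:.1f} V  ->  I_DS = {i_ds * 1e6:.1f} uA")
```

Circuit simulators commonly avoid the step at the region boundary by applying the same correction factor in both regions; that variation is not shown here.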


3.2.2 Semi-Empirical Short-Channel Model (LEVEL 3)

The MOS3 model (or MOS LEVEL 3) has been developed for short- and narrow-channel MOS ($L \le 2\ \mu$m, $W \le 2\ \mu$m) [1]. The MOS3 model has the following features (compared to MOS1):

* A model for mobility degradation with the vertical and the horizontal electric fields;

* A model for the threshold voltage of short- and narrow-channel devices (the Drain Induced Barrier Lowering (DIBL) effect is accounted for);

* An improved model for the channel length modulation phenomenon;

* Weak inversion conduction (subthreshold conduction).

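The threshold voltage expression is given by [1]

$$V_T = V_{FB} + 2|\phi_f| - \sigma V_{DS} + \gamma F_S\sqrt{2|\phi_f| + |V_{BS}|} + F_N\left(2|\phi_f| + |V_{BS}|\right) \tag{3.32}$$

$\gamma$ in this expression is given by Equation (3.15). This expression includes:

* The static feedback effect coefficient $\sigma$ (due to the DIBL effect) [2]

$$\sigma = \eta\,\frac{8.15 \times 10^{-22}}{C_{ox}L_{eff}^3} \tag{3.33}$$

where $\eta$ is an empirical coefficient;

* The correction factor for the short-channel effect, which is based on a modified trapezoidal approach for calculating the charge $Q_B$ [Fig. 3.6]. The correction factor can be obtained from [3]

$$F_S = 1 - \frac{X_j}{L_{eff}}\left[\frac{L_D + W_c}{X_j}\sqrt{1 - \left(\frac{W_D}{X_j + W_D}\right)^2} - \frac{L_D}{X_j}\right] \tag{3.34}$$

where $X_j$ is the junction depth, $L_D$ is the lateral diffusion, and $W_c$, the depletion layer width of a cylindrical junction, is given by

$$\frac{W_c}{X_j} = 0.0631353 + 0.8013292\,\frac{W_D}{X_j} - 0.01110777\left(\frac{W_D}{X_j}\right)^2 \tag{3.35}$$

* The correction factor for narrow-channel MOS, which is given by

$$F_N = \frac{\delta\,\pi\,\varepsilon_{si}}{2\,C_{ox}W_{eff}} \tag{3.36}$$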

3.2.2.1 Mobility degradation

The mobility degradation due to the vertical electric field is modeled by the following simple equation [4]

$$\mu_s = \frac{\mu_0}{1 + \theta\,(V_{GS} - V_T)} \tag{3.37}$$

where $\theta$ is an empirical constant which depends on the oxide thickness. A typical value of $\theta$ is 0.05. To account for the effect of the lateral average electric field, the effective mobility is related to the drain-source voltage and the channel length by [4]

$$\mu_{eff} = \frac{\mu_s}{1 + \dfrac{\mu_s V_{DS}}{v_{max}L_{eff}}} \tag{3.38}$$

In this expression, when the device operates in saturation, $V_{DS}$ is replaced by $V_{DSsat}$.
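A minimal sketch of the two mobility corrections follows, assuming the commonly published LEVEL 3 forms $\mu_s = \mu_0 / [1 + \theta (V_{GS} - V_T)]$ and $\mu_{eff} = \mu_s / [1 + \mu_s V_{DS} / (v_{max} L_{eff})]$. The function names are hypothetical and the numbers in the usage lines are illustrative values in SI units, not fitted data.

```python
def surface_mobility(mu_0, theta, v_gs, v_t):
    """Mobility reduced by the vertical (gate) field."""
    return mu_0 / (1.0 + theta * (v_gs - v_t))


def effective_mobility(mu_s, v_ds, v_max, l_eff):
    """Mobility further reduced by the average lateral field V_DS / L_eff.

    In saturation, v_ds should be replaced by V_DSsat.
    """
    return mu_s / (1.0 + mu_s * v_ds / (v_max * l_eff))


# Illustrative numbers: mu_0 = 503 cm^2/Vs = 0.0503 m^2/Vs, theta = 0.05 1/V,
# gate overdrive of 2 V, V_DS = 1 V, v_max = 1.5e5 m/s, L_eff = 0.8 um.
mu_s = surface_mobility(0.0503, 0.05, 2.7, 0.7)
mu_eff = effective_mobility(mu_s, 1.0, 1.5e5, 0.8e-6)
print(f"mu_s = {mu_s * 1e4:.0f} cm^2/Vs, mu_eff = {mu_eff * 1e4:.0f} cm^2/Vs")
```

For a fixed $V_{DS}$ the lateral field term grows as $L_{eff}$ shrinks, which is why the second correction matters mainly for short-channel devices.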

3.2.2.2 Channel length modulation

When $V_{DS} \ge V_{DSsat}$, the channel length is modulated by an amount $\Delta L$. This channel length reduction is formulated in MOS3 by Baum's model [5]. In this model the voltage across the depletion surface of length $\Delta L$ is modeled by $\kappa(V_{DS} - V_{DSsat})$, where $\kappa$ is a fitting parameter.

3.2.2.3 Drain current

In the LEVEL 1 model of SPICE, the drain current in the weak inversion region was assumed zero. The modeling of the subthreshold current in LEVEL 3 is based on the analysis by Swanson and Meindl [6]. The drain current in weak inversion, which is basically a diffusion current, is given by

$$I_{DS} = I_{on}\,e^{(V_{GS} - V_{on})/(nV_t)} \tag{3.39}$$

where

$$V_{on} = V_T + nV_t \tag{3.40}$$

and

$$n = 1 + \frac{qN_{FS} + C_d}{C_{ox}} \tag{3.41}$$

where

$$C_d = \frac{dQ_B}{dV_{BS}} \tag{3.42}$$

and $N_{FS}$ is a curve fitting parameter. $V_{on}$ marks the point between the weak and strong inversion modes. Typical values of $n$ range from 1.0 to 2.5. $I_{on}$ is the drain current at $V_{GS} = V_{on}$, the point where Equation (3.39) joins the strong inversion current.

Fig. 3.7 illustrates the transfer characteristics of the weak inversion and drift model. The voltage $V_{on}$ ensures the continuity of the current, but it is clear from the figure that at $V_{GS} = V_{on}$ a discontinuity exists in the derivative. Therefore, the MOS3 model is not precise in simulating the intermediate region where the diffusion and drift currents are comparable.
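The weak inversion expressions can be exercised with the sketch below, which assumes the forms $I_{DS} = I_{on} \exp[(V_{GS} - V_{on})/(nV_t)]$, $V_{on} = V_T + nV_t$ and $n = 1 + (qN_{FS} + C_d)/C_{ox}$. The function names, the default thermal voltage and the numbers used are assumptions for illustration only.

```python
import math

ELECTRON_CHARGE = 1.602176634e-19  # C


def subthreshold_slope_factor(n_fs_per_m2, c_d, c_ox):
    """n = 1 + (q*N_FS + C_d) / C_ox, all capacitances per unit area (SI units)."""
    return 1.0 + (ELECTRON_CHARGE * n_fs_per_m2 + c_d) / c_ox


def weak_inversion_current(v_gs, v_t, n, i_on, v_thermal=0.02585):
    """Subthreshold drain current; i_on is the current at V_GS = V_on."""
    v_on = v_t + n * v_thermal
    return i_on * math.exp((v_gs - v_on) / (n * v_thermal))


# Hypothetical device: V_T = 0.7 V, n = 1.5, I_on = 1 uA.
n = 1.5
v_on = 0.7 + n * 0.02585
decade = n * 0.02585 * math.log(10.0)   # gate swing for a tenfold current change
for steps in range(4):
    v_gs = v_on - steps * decade
    print(f"V_GS = {v_gs:.3f} V  ->  I_DS = {weak_inversion_current(v_gs, 0.7, n, 1e-6):.2e} A")
```

Each step of $nV_t \ln 10$ in gate voltage changes the current by one decade, which is the subthreshold swing implied by the exponential law.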

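In strong inversion, the drain current can be expressed as

$$I_{DS} = \mu_{eff}C_{ox}\,\frac{W_{eff}}{L_{eff}}\int_0^{V_{DS}}\left[V_{GS} - V_T(x) - V(x)\right]dV \tag{3.43}$$

The threshold voltage along the channel is given by

$$V_T(x) = V_T + \gamma F_S\left(\sqrt{2|\phi_f| + |V_{BS}| + V(x)} - \sqrt{2|\phi_f| + |V_{BS}|}\right) + F_N V(x) \tag{3.44}$$

Using a Taylor series expansion, we have

$$V_T(x) + V(x) \approx V_T + (1 + F_B)\,V(x) \tag{3.45}$$

where

$$F_B = \frac{\gamma F_S}{2\sqrt{2|\phi_f| + |V_{BS}|}} + F_N \tag{3.46}$$

By substituting from Equation (3.45) in Equation (3.43) and integrating, we obtain the following expression for the drain current

$$I_{DS} = \mu_{eff}C_{ox}\,\frac{W_{eff}}{L_{eff}}\left[V_{GS} - V_T - \frac{1 + F_B}{2}V_{DS}\right]V_{DS} \tag{3.47}$$

The saturation voltage, which takes into account the carrier velocity saturation effect, is given by

$$V_{DSsat} = V_{sat} + V_c - \sqrt{V_{sat}^2 + V_c^2} \tag{3.48}$$

where

$$V_{sat} = \frac{V_{GS} - V_T}{1 + F_B} \tag{3.49}$$

$$V_c = \frac{v_{max}L_{eff}}{\mu_s} \tag{3.50}$$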
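The saturation voltage of Equations (3.48) to (3.50) can be sketched as below, assuming the forms $V_{DSsat} = V_{sat} + V_c - \sqrt{V_{sat}^2 + V_c^2}$, $V_{sat} = (V_{GS} - V_T)/(1 + F_B)$ and $V_c = v_{max} L_{eff} / \mu_s$. The function name and all numerical values are hypothetical and given in SI units.

```python
import math


def level3_saturation_voltage(v_gs, v_t, f_b, v_max, l_eff, mu_s):
    """Drain saturation voltage including carrier velocity saturation."""
    v_sat = (v_gs - v_t) / (1.0 + f_b)   # long-channel pinchoff voltage
    v_c = v_max * l_eff / mu_s           # critical voltage set by velocity saturation
    return v_sat + v_c - math.sqrt(v_sat ** 2 + v_c ** 2)


# Illustrative comparison of a very long and a short channel
# (V_GS - V_T = 1 V, F_B = 0.2, v_max = 1.5e5 m/s, mu_s = 0.05 m^2/Vs).
for l_eff in (1e-3, 0.8e-6):
    v_dsat = level3_saturation_voltage(1.7, 0.7, 0.2, 1.5e5, l_eff, 0.05)
    print(f"L_eff = {l_eff:.1e} m  ->  V_DSsat = {v_dsat:.3f} V")
```

For a long channel $V_c$ is large and $V_{DSsat}$ tends to $V_{sat}$, while for a short channel velocity saturation pulls $V_{DSsat}$ below the pinchoff value.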

Table 3.1 shows the correspondence between the CMOS device parameters and the HSPICE keywords. Typical values for the parameters of LEVEL 3 are shown in Table 3.2 for the MOS devices of the 0.8 μm BiCMOS process described in Chapter 2.

The LEVEL 3 model approximates the device physics and relies on the proper choice of the empirical parameters to accurately reproduce the device characteristics.

3.2.3 BSIM Model (LEVEL 4)

BSIM (Berkeley Short-Channel IGFET Model) is a simple and accurate short channel MOS transistor model [7]. It is implemented in SPICE as LEVEL 4. The model was tested for effective channel lengths down to 1 μm. This model includes:

* Carrier velocity saturation;

* Drain-induced barrier lowering effect;

* Vertical field dependence of carrier mobility;

* Non-uniform doping in the channel surface and sub-surface regions effect;

Table 3.1 CMOS device parameters and HSPICE correspondence

SPICE Keyword   Description
LEVEL           Model level
VTO             Zero-bias threshold voltage
TOX             Gate oxide thickness
NSUB            Substrate doping
NFS             Surface fast state density
UO              Surface mobility
VMAX            Maximum drift velocity of carriers
ETA             Static feedback on threshold voltage
KAPPA           Saturation field factor
THETA           Mobility degradation factor
DELTA           Width effect on threshold voltage
XJ              Junction depth
CJ              Zero-bias bulk junction capacitance
JS              Bulk junction saturation current
JSW             Sidewall bulk junction saturation current
MJ              Bulk junction grading coefficient
PB              Junction potential
CJSW            Zero-bias sidewall capacitance
MJSW            Sidewall capacitance grading coefficient
CGDO            Gate-drain overlap capacitance
CGSO            Gate-source overlap capacitance
CGBO            Gate-bulk overlap capacitance
RD              Drain ohmic resistance
RS              Source ohmic resistance
LD              Lateral diffusion from drain or source
WD              Lateral diffusion along the width
XL              Masking and etching effects on L
XW              Masking and etching effects on W
ACM             Area calculation method
LDIF            Lateral diffusion beyond the gate

Table 3.2 HSPICE MOSFET model parameters (LEVEL=3) (0.8 μm BiCMOS process)

SPICE Keyword   N-Channel        P-Channel
LEVEL           3                3
VTO             0.8              -0.9
TOX             17.5 x 10^-9     17.5 x 10^-9
NSUB            3.23 x 10^16     3.37 x 10^16
NFS             820 x 10^9       764 x 10^9
UO              503              165
VMAX            150 x 10^3       190 x 10^3
ETA             45 x 10^-3       121 x 10^-3
KAPPA           6.7 x 10^-3      1.45
THETA           63.4 x 10^-3     135 x 10^-3
DELTA           0.728            0.336
XJ              275 x 10^-9      230 x 10^-9
CJ              250 x 10^-6      450 x 10^-6
JS              5 x 10^-4        5 x 10^-4
JSW             5.5 x 10^-9      5.5 x 10^-9
MJ              0.50             0.50
PB              0.92             0.92
CJSW            205 x 10^-12     212 x 10^-12
MJSW            0.30             0.30
CGDO            274 x 10^-12     215 x 10^-12
CGSO            274 x 10^-12     215 x 10^-12
CGBO            571 x 10^-12     571 x 10^-12
RD              596              1189
RS              596              1189
LD              59.5 x 10^-9     0.
WD              0.               0.
XL              0.               0.
XW              0.               0.
ACM             2                2
LDIF            940 x 10^-9      1 x 10^-6

* Geometric dependencies;

* Channel-length modulation;

* Depletion charge sharing by the drain and source;

* Dependence of some electrical parameters on drain and substrate biases; and

* Better modeling of weak-, medium-, and strong-inversion regions and elimination of the discontinuity problem in the drain current.

3.2.3.1 Threshold voltage

The threshold voltage is given by

$$V_T = V_{FB} + \phi_s + K_1\sqrt{\phi_s + |V_{BS}|} - K_2\left(\phi_s + |V_{BS}|\right) - \eta V_{DS} \tag{3.51}$$

The two parameters, $K_1$ and $K_2$, model the effect of non-uniform doping of the substrate on the threshold voltage. Typical values for $K_1$ and $K_2$ are 1 V$^{1/2}$ and 0.12 respectively. The factor $\eta$ models the DIBL effect and accounts for the channel-length modulation effect. It is a function of $V_{DS}$ and $V_{BS}$.
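A minimal sketch of the BSIM threshold expression of Equation (3.51) is given below. The typical values $K_1 = 1$ V$^{1/2}$ and $K_2 = 0.12$ come from the text; the flatband voltage, surface potential and $\eta$ used in the usage lines are hypothetical, $\eta$ is treated as a constant although the text notes that it is bias dependent, and the function name is an assumption.

```python
import math


def bsim_threshold(v_fb, phi_s, k1, k2, eta, v_bs_abs, v_ds):
    """BSIM threshold voltage with body bias and DIBL terms, Equation (3.51)."""
    body = phi_s + v_bs_abs
    return v_fb + phi_s + k1 * math.sqrt(body) - k2 * body - eta * v_ds


# Hypothetical values: V_FB = -0.9 V, phi_s = 0.7 V, eta = 0.02.
for v_ds in (0.0, 3.3):
    v_t = bsim_threshold(-0.9, 0.7, 1.0, 0.12, 0.02, 0.0, v_ds)
    print(f"V_DS = {v_ds:.1f} V  ->  V_T = {v_t:.3f} V")
```

The $-\eta V_{DS}$ term lowers the threshold as the drain voltage rises, which is how the DIBL effect enters the model.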

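3.2.3.2 Drain current

When $V_{DS} \le V_{DSsat}$ we have

$$I_{DS} = \frac{\mu_0}{1 + U_0(V_{GS} - V_T)}\cdot\frac{C_{ox}\dfrac{W_{eff}}{L_{eff}}}{1 + \dfrac{U_1}{L_{eff}}V_{DS}}\left[(V_{GS} - V_T)V_{DS} - \frac{a}{2}V_{DS}^2\right] \tag{3.52}$$

where

$$a = 1 + \frac{gK_1}{2}\left(\phi_s + |V_{BS}|\right)^{-1/2} \tag{3.53}$$

and

$$g = 1 - \frac{1}{1.744 + 0.836\left(\phi_s + |V_{BS}|\right)} \tag{3.54}$$

The parameters $U_0 = U_0(V_{BS})$, $U_1 = U_1(V_{BS})$ and $\mu_0 = \mu_0(V_{BS}, V_{DS})$ are bias sensitive. For $V_{DS} > V_{DSsat}$, the drain current is given by

$$I_{DS} = \frac{\mu_0}{1 + U_0(V_{GS} - V_T)}\cdot C_{ox}\frac{W_{eff}}{L_{eff}}\cdot\frac{(V_{GS} - V_T)^2}{2aK} \tag{3.55}$$

where

$$K = \frac{1 + v_c + \sqrt{1 + 2v_c}}{2} \tag{3.56}$$

and

$$v_c = \frac{U_1}{L_{eff}}\cdot\frac{V_{GS} - V_T}{a} \tag{3.57}$$

The drain-source saturation voltage is given by

$$V_{DSsat} = \frac{V_{GS} - V_T}{a\sqrt{K}} \tag{3.58}$$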

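3.2.3.3 Subthreshold current

In BSIM, the total drain current is modeled as the linear sum of a strong-inversion component and a weak-inversion component $I_w$. $I_w$ is expressed as

$$I_w = \frac{I_{exp}\,I_{limit}}{I_{exp} + I_{limit}} \tag{3.59}$$

where

$$I_{exp} = \mu_0C_{ox}\frac{W_{eff}}{L_{eff}}\,V_t^2\,e^{1.8}\,e^{(V_{GS} - V_T)/(nV_t)}\left[1 - e^{-V_{DS}/V_t}\right] \tag{3.60}$$

and

$$I_{limit} = \frac{\mu_0C_{ox}}{2}\,\frac{W_{eff}}{L_{eff}}\,(3V_t)^2 \tag{3.61}$$

The factor $e^{1.8}$ is empirical and chosen to achieve the best fit. The subthreshold parameter $n$ is a function of $V_{DS}$ and $V_{BS}$.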

3.2.3.4 Sensitivity Factors of Model Parameters

BSIM uses the following formula to account for the sensitivity of each parameter to the width and length of the channel

$$P = P_0 + \frac{L_{P0}}{L_{eff}} + \frac{W_{P0}}{W_{eff}} \tag{3.62}$$

where $P_0$ is an arbitrary parameter, and $L_{P0}$ and $W_{P0}$ are the $L$ and $W$ sensitivity factors of $P_0$.

Another deep-submicrometer MOSFET model called BSIM3 [8] has been developed for circuit simulation. It uses improved threshold voltage, drain current and channel-length modulation models. The model is also simple and has a small number of parameters (≈ 25).

3.2.4 MOS Capacitances

In transient simulation, MOS capacitances are very important for CMOS and BiCMOS circuit analysis. The MOS capacitances can be divided into two types of lumped capacitors:

* the depletion capacitors of the bulk-drain and bulk-source pn junctions ($C_{BD}$ and $C_{BS}$) [Fig. 3.8].

* the capacitors associated with the gate ($C_{GS}$, $C_{GD}$, $C_{GB}$, $C_{GSov}$, $C_{GDov}$ and $C_{GBov}$) [see Fig. 3.8, except for $C_{GBov}$].

3.2.4.1 Junction Depletion Capacitances

The bulk-source and the bulk-drain junctions have a bottom area $A_S$ and $A_D$ respectively and a sidewall with a perimeter $P_S$ and $P_D$ respectively. Both the bottom area and the sidewall contribute to the total depletion capacitance. The bottom area capacitance is measured per unit area, while the sidewall capacitance is measured per unit perimeter. Both of these components are voltage dependent. As these junctions are normally reverse biased, we will consider the case when the bulk-source and bulk-drain voltages ($V_{BS}$ and $V_{BD}$) are less than or equal to $0.5\phi_j$ ($\phi_j$ is the junction built-in potential).

The total bulk-source and bulk-drain capacitances can be expressed by the following relations [1]

$$C_{BS} = \frac{C_jA_S}{\left(1 - \dfrac{V_{BS}}{\phi_j}\right)^{M_j}} + \frac{C_{jsw}P_S}{\left(1 - \dfrac{V_{BS}}{\phi_j}\right)^{M_{jsw}}} \tag{3.63}$$

$$C_{BD} = \frac{C_jA_D}{\left(1 - \dfrac{V_{BD}}{\phi_j}\right)^{M_j}} + \frac{C_{jsw}P_D}{\left(1 - \dfrac{V_{BD}}{\phi_j}\right)^{M_{jsw}}} \tag{3.64}$$

The exponential factors $M_j$ and $M_{jsw}$ are in the order of 0.3-0.5. $C_j$ is the zero-bias capacitance of the bottom junction per unit area and $C_{jsw}$ is the zero-bias capacitance per unit perimeter.
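The junction capacitance relations can be sketched as follows, assuming the usual form in which each component is its zero-bias value divided by $(1 - V/\phi_j)^M$. The function name, the junction geometry and the parameter values in the usage lines are illustrative assumptions in SI units, and a reverse bias is entered as a negative voltage.

```python
def junction_capacitance(v_bias, area, perimeter, c_j, c_jsw, m_j, m_jsw, phi_j):
    """Total depletion capacitance (bottom plus sidewall) of a bulk junction, in farads.

    v_bias is V_BS or V_BD in volts (negative for reverse bias); the expression
    is only used for v_bias <= 0.5 * phi_j, as stated in the text.
    """
    if v_bias > 0.5 * phi_j:
        raise ValueError("expression is only valid for v_bias <= 0.5 * phi_j")
    scale = 1.0 - v_bias / phi_j
    bottom = c_j * area / scale ** m_j
    sidewall = c_jsw * perimeter / scale ** m_jsw
    return bottom + sidewall


# Illustrative 2 um x 2 um junction: C_j = 250e-6 F/m^2, C_jsw = 205e-12 F/m,
# M_j = 0.5, M_jsw = 0.3, phi_j = 0.92 V.
for v in (0.0, -3.3):
    c = junction_capacitance(v, 4e-12, 8e-6, 250e-6, 205e-12, 0.5, 0.3, 0.92)
    print(f"V = {v:+.1f} V  ->  C = {c * 1e15:.2f} fF")
```

The capacitance falls as the reverse bias grows because the depletion region widens; the sidewall term falls more slowly since its grading exponent is smaller.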


3.2.4.2 Gate Capacitances

The gate capacitances can be divided into two categories:

■ The fixed overlap capacitances: gate-drain ($C_{GDov}$), gate-source ($C_{GSov}$), and gate-bulk ($C_{GBov}$) overlap capacitances. Both $C_{GSov}$ and $C_{GDov}$ exist due to the lateral diffusion of the source and drain under the gate. They are usually given per unit width as $C_{GSO}$ and $C_{GDO}$. The total gate-source and gate-drain overlap capacitances are given by:

$$C_{GSov} = C_{GSO} W_{eff} \tag{3.65}$$

$$C_{GDov} = C_{GDO} W_{eff} \tag{3.66}$$

where $C_{GSO}$ and $C_{GDO}$ are equal to $C_{ox} L_D$. The capacitor $C_{GBov}$ is due to the overlap of the gate oxide and the bulk along the channel length at both ends of the active area of the transistor. This capacitance is typically normalized to the effective channel length; the total $C_{GBov}$ is hence given by

$$C_{GBov} = C_{GBO} L_{eff} \tag{3.67}$$

where $C_{GBO}$ is equal to $C_{ox} W_d$.

■ The nonlinear capacitance due to the charge of the bulk or the channel. This capacitance is actually distributed but can be modeled by lumped capacitances. In the case when the channel does not exist, the capacitance can be expressed as

$$C_{GB} = C_{ox} W_{eff} L_{eff} \tag{3.68}$$

When the device is in the linear region, the channel extends uniformly from the source to the drain. The channel shields the bulk and the capacitance exists only between the gate and the channel. The gate-bulk capacitance goes to zero. The gate-channel capacitance can be expressed in terms of two equal lumped capacitances, a gate-source and a gate-drain capacitance, which are denoted $C_{GS}$ and $C_{GD}$ and are given by

$$C_{GS} = C_{GD} = \frac{1}{2} C_{ox} W_{eff} L_{eff} \tag{3.69}$$

Finally, when the device enters saturation, the channel at the drain pinches off and hence the gate-drain capacitance component becomes zero, while the gate-source capacitance can be expressed by

$$C_{GS} = \frac{2}{3} C_{ox} W_{eff} L_{eff} \tag{3.70}$$

Fig. 3.9 depicts the change of the capacitance components as a function of the gate-source voltage (assuming that the source-bulk voltage is zero). The total gate-source capacitance is given by the sum of $C_{GSov}$ and $C_{GS}$ and, similarly, the total gate-drain capacitance is given by the sum of $C_{GDov}$ and $C_{GD}$.

The above described capacitance model can be used for circuit analysis and circuit design. SPICE uses a charge-control model, which was developed by Ward and Dutton [9]. This model is based on the actual distribution of charge in the MOS structure and its conservation.
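The region-dependent lumped gate capacitances can be summarized in a short Python sketch. It is only a hand-calculation aid under the simple piecewise model described in this section (no channel, linear region, saturation); the function name, the region labels and the dictionary keys are hypothetical, and the overlap terms default to zero.

```python
def gate_capacitances(region, cox, w_eff, l_eff,
                      cgso=0.0, cgdo=0.0, cgbo=0.0):
    """Total gate capacitances for one operating region.

    region  one of "off" (no channel), "linear", "saturation"
    cox     gate oxide capacitance per unit area
    cgso, cgdo  overlap capacitances per unit width
    cgbo        overlap capacitance per unit length
    """
    c_gate = cox * w_eff * l_eff
    # (C_GS, C_GD, C_GB) intrinsic parts for each region.
    intrinsic = {
        "off": (0.0, 0.0, c_gate),
        "linear": (0.5 * c_gate, 0.5 * c_gate, 0.0),
        "saturation": (2.0 * c_gate / 3.0, 0.0, 0.0),
    }
    cgs, cgd, cgb = intrinsic[region]
    return {
        "CGS": cgs + cgso * w_eff,   # intrinsic + gate-source overlap
        "CGD": cgd + cgdo * w_eff,   # intrinsic + gate-drain overlap
        "CGB": cgb + cgbo * l_eff,   # intrinsic + gate-bulk overlap
    }
```

In saturation, for instance, the gate-drain capacitance returned by this sketch consists of the overlap term only.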

3.3 CMOS LOW-VOLTAGE ANALYTICAL MODEL

The MOS models discussed previously have been developed for circuit simulators. These models (e.g. BSIM) involve large numbers of parameters whose values must be derived from device measurements. With these models it is difficult to develop an intuitive understanding of the device behavior. Therefore, an analytical drain current model valid for submicrometer MOSFETs operating at low voltage is needed for hand calculation and first-order circuit analysis, with reasonable accuracy.

3.3.1 Threshold Voltage Definitions

The threshold voltage, $V_T$, has several definitions which are important for the estimation of the static power dissipation. The first definition is the threshold voltage extrapolated from the $I_{DS}$ - $V_{GS}$ characteristic [see Section 3.2.1]. Another one is the constant-current (e.g., 10 nA per unit width) threshold voltage. These voltages do not have the same value [10, 11]. The extrapolated $V_T$ is approximately 0.2 V higher than the constant-current one [11]. The extrapolated threshold voltage should be scaled down proportionally to the supply voltage. This is because the drive (saturation) current depends on ($V_{DD} - V_{T(extrapolated)}$).


3.3.2 Subthreshold Current

When the threshold voltage is scaled for low power supply voltage operation, the subthreshold current increases significantly. This current is a limiting factor for battery-operated circuits. As shown in Fig. 3.10, the drain current in the subthreshold region can be modeled by

$$I_{DS,sub} = \frac{W_{eff}}{W_0} I_0 \, 10^{(V_{GS} - V_T)/S} \tag{3.71}$$

where $V_T$ here is the constant-current threshold voltage. $I_0$ and $W_0$ are the drain current and the gate width used to define $V_T$. $S$ is the subthreshold swing parameter, which is the gate voltage swing required to reduce the drain current by one decade. The current $I_0$ is related to $V_{DS}$ by

$$I_0 = I_0' \left(1 - e^{-V_{DS}/V_t}\right) \tag{3.72}$$

The subthreshold swing is given by [12]

$$S \approx 2.3 V_t \left(1 + \frac{C_d}{C_{ox}}\right) \ \text{V/decade} \tag{3.73}$$

where $C_d$ is the depletion-layer capacitance of the source/drain junctions. Thus, $S$ has a theoretical minimum limit which is 60 mV/decade.

The leakage current, due to the subthreshold conduction, is computed from $I_{DS,sub}$ when $V_{GS} = 0$. Then

$$I_{leak} = \frac{W_{eff}}{W_0} I_0 \, 10^{-V_T/S} \tag{3.74}$$

Using the examples of Fig. 3.10, typical values for the constant-current and extrapolated threshold voltages are 0.3 V and 0.5 V respectively. The parameter $S$ is equal to 75 mV/decade and the leakage current is equal to 1 pA/µm.
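The numbers quoted above can be checked with a few lines of Python. The sketch below evaluates the subthreshold current per micrometer of gate width, assuming a constant-current definition of 10 nA/µm at $V_{GS} = V_T$ and a drain voltage large enough that the $V_{DS}$ dependence of $I_0$ can be ignored; both are assumptions of this example, and the function name is hypothetical.

```python
def subthreshold_current(vgs, vt_cc, swing, i0_per_um=10e-9, w_eff_um=1.0):
    """Subthreshold drain current in amperes.

    vgs        gate-source voltage (V)
    vt_cc      constant-current threshold voltage (V)
    swing      subthreshold swing S (V/decade)
    i0_per_um  current per um of width that defines vt_cc (assumed 10 nA/um)
    w_eff_um   effective gate width (um)
    """
    return w_eff_um * i0_per_um * 10.0 ** ((vgs - vt_cc) / swing)


# Leakage at V_GS = 0 for V_T = 0.3 V and S = 75 mV/decade.
leak = subthreshold_current(0.0, 0.3, 0.075)
print(f"leakage = {leak * 1e12:.2f} pA/um")
```

With these values the exponent is $-4$, so the leakage is four decades below 10 nA/µm, that is about 1 pA/µm.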

When estimating the static power dissipation, the worst-case leakage current has to be evaluated. In this case, the worst-case threshold voltage, $V_{T,wc}$, has to be used where

$$V_{T,wc} = V_T - \Delta V_T \tag{3.75}$$

$\Delta V_T$ is the variation of the threshold voltage due to the fluctuation of process parameters such as the oxide thickness, doping profile, junction depth, gate width and length, etc. $\Delta V_T$ can be as high as 50 mV on the same wafer and 150 mV for different wafers. This results in an increase of almost two decades in the leakage current. The temperature effect also has to be considered when the leakage current is computed. The temperature affects both $V_T$ and $S$. A typical value of the temperature coefficient of the threshold voltage is a 1.6 mV decrease per degree Celsius. The subthreshold swing $S$ increases by 0.25 mV/(decade·°C) [see Equation (3.73)]. For example, if the temperature increases from 25 °C to 75 °C, the threshold voltage decreases by 80 mV and the leakage current equals 30 pA/µm (initial extrapolated $V_T$ = 0.5 V). This value is 30 times higher than that at 25 °C. Both the temperature and process effects can result in a drastic increase of the worst-case static power dissipation. Note that this variation of $V_T$ greatly affects the delay of CMOS circuits at low supply voltage, since the drive current is proportional to ($V_{DD} - V_T$).
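The process and temperature figures in this paragraph can be reproduced with a short Python sketch. It assumes a constant-current threshold of 0.3 V and a swing of 75 mV/decade at 25 °C, a reference current of 10 nA/µm, and linear temperature coefficients of 1.6 mV/°C and 0.25 mV/(decade·°C); the function and argument names are hypothetical.

```python
def worst_case_leakage(temp_c=25.0, delta_vt=0.0,
                       vt_cc_25=0.3, swing_25=0.075,
                       dvt_dt=1.6e-3, ds_dt=0.25e-3,
                       i0_per_um=10e-9):
    """Leakage current per um of width (A/um) at V_GS = 0.

    temp_c    junction temperature in degrees Celsius
    delta_vt  process-induced threshold reduction (V)
    The remaining arguments are the assumed 25 C values and
    temperature coefficients described in the lead-in.
    """
    dt = temp_c - 25.0
    vt = vt_cc_25 - delta_vt - dvt_dt * dt   # V_T drops with T and process
    swing = swing_25 + ds_dt * dt            # S grows with T
    return i0_per_um * 10.0 ** (-vt / swing)


nominal = worst_case_leakage()
print(f"25 C nominal      : {nominal * 1e12:6.1f} pA/um")
print(f"25 C, dVT = 150 mV: {worst_case_leakage(delta_vt=0.15) * 1e12:6.1f} pA/um")
print(f"75 C nominal      : {worst_case_leakage(temp_c=75.0) * 1e12:6.1f} pA/um")
```

A 150 mV threshold reduction at a 75 mV/decade swing gives exactly two decades of additional leakage, and the 75 °C case lands close to the 30 pA/µm quoted in the text.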

3.3.3 Low-Voltage Drain Current

A part of this model is based on the one proposed by [13]. For long-channel devices, the carrier drift velocity $v$ is related to the horizontal electric field $E$ by a simple linear relation ($v = \mu E$) where the carrier mobility is constant. For short-channel devices, the mobility is no longer a constant and is a function of the vertical electric field in the inversion layer. At this point we prefer to use the symbol $\mu_{eff}$ for the mobility to denote its dependence on the vertical electric field. Also, the velocity ($v$) is no longer proportional to $E$ but is given by the following two-region piecewise empirical model [14]

$$v = \begin{cases} \dfrac{\mu_{eff} E}{1 + E/E_c} & E < E_c \\ v_{sat} & E \ge E_c \end{cases} \tag{3.76}$$

where

$$E_c = \frac{2 v_{sat}}{\mu_{eff}} \tag{3.77}$$

and where the saturation velocity $v_{sat}$ is equal to $8 \times 10^6$ cm/s for electrons (NMOS device) and $6.5 \times 10^6$ cm/s for holes (PMOS device).

The drain current in the triode region ($V_{DS} \le V_{DS,sat}$) is given by [13]

$$I_{DS} = \frac{W_{eff}}{L_{eff}} \, \mu_{eff} C_{ox} \, \frac{(V_{GS} - V_T - V_{DS}/2) V_{DS}}{1 + V_{DS}/(E_c L_{eff})} \tag{3.78}$$

The saturation current can be expressed by

$$I_{DS,sat} = v_{sat} C_{ox} W_{eff} (V_{GS} - V_T - V_{DS,sat}) \tag{3.79}$$

By equating (3.78) and (3.79) we can derive the following expression for $V_{DS,sat}$

$$V_{DS,sat} = (1 - K)(V_{GS} - V_T) \tag{3.80}$$

where

$$K = \frac{1}{1 + E_c L_{eff}/(V_{GS} - V_T)} \tag{3.81}$$

The drain current in saturation can be rewritten as

$$I_{DS,sat} = K v_{sat} C_{ox} W_{eff} (V_{GS} - V_T) \tag{3.82}$$

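Note that $V_T$, in the current equation, is the extrapolated threshold voltage. The mobility $\mu_{eff}$ for electrons can be expressed as [15]

$$\mu_n = 240 \sqrt{\frac{0.06\, t_{ox}}{V_{GS} + V_T}} \quad \text{for N}^+ \text{ poly-gate} \tag{3.83}$$

and for holes

$$\mu_p = \begin{cases} 65 \left[\dfrac{0.06\, t_{ox}}{V_{GS} - V_T}\right]^{1/2} & \text{for P}^+ \text{ poly-gate} \\ 65 \left[\dfrac{0.06\, t_{ox}}{V_{GS} - V_T - 1}\right]^{1/2} & \text{for N}^+ \text{ poly-gate} \end{cases} \tag{3.84}$$

where $t_{ox}$ is in Å and the mobility is in cm²/(V·s). This analytical model can be used for gate lengths down to the deep-submicron range.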
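For hand calculations, the saturation current of this model can be evaluated with a short Python sketch. It assumes the velocity-saturation form with a critical field $E_c = 2 v_{sat}/\mu_{eff}$ and a factor $K = 1/[1 + E_c L_{eff}/(V_{GS} - V_T)]$; the mobility is passed in directly rather than computed from the oxide thickness, and all names and numerical values in the usage lines are hypothetical.

```python
def saturation_current(vgs, vt, w_eff, l_eff, cox, mu_eff, v_sat=8e6):
    """Velocity-saturated drain current and saturation voltage.

    Units: volts, cm, F/cm^2, cm^2/(V s), cm/s.
    Returns (I_DSsat in amperes, V_DSsat in volts).
    """
    vg = vgs - vt
    if vg <= 0.0:
        raise ValueError("device is not in strong inversion")
    e_c = 2.0 * v_sat / mu_eff                 # critical field (V/cm)
    k = 1.0 / (1.0 + e_c * l_eff / vg)         # velocity-saturation factor
    vdsat = (1.0 - k) * vg
    idsat = k * v_sat * cox * w_eff * vg
    return idsat, vdsat


# Illustrative NMOS: W = 10 um, L = 0.5 um, Cox = 2.3e-7 F/cm^2.
i_sat, vd_sat = saturation_current(1.5, 0.5, 10e-4, 0.5e-4, 2.3e-7, 400.0)
print(f"I_DSsat = {i_sat * 1e3:.3f} mA, V_DSsat = {vd_sat:.3f} V")
```

For a very long channel, $E_c L_{eff}$ dominates and the expression approaches the familiar square law $\mu_{eff} C_{ox} (W/2L)(V_{GS} - V_T)^2$, which is a useful sanity check on the sketch.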

3.4 CMOS POWER SUPPLY VOLTAGE SCALING

Scaling device feature size has been used to increase packing density and speed. MOSFET scaling can follow three theories:

1. Constant Electric Field (CE) scaling [16].

2. Constant Voltage (CV) scaling [17].

3. Quasi-Constant Voltage (QCV) scaling [17].

[Table 3.3: scaling of device parameters. Rows: Dimensions, Gate oxide, Doping, Voltage, Capacitance, Current, Gate Delay, Dynamic Power, Dynamic Energy. First column header: Expression.]

In the CE scheme all horizontal and vertical dimensions and voltages scale linearly with the same factor. In the CV scheme, the dimensions are scaled, while the voltages are kept constant. This scenario has been the most commonly used. While the constant electric field scaling is natural from the device physics point of view, the constant voltage scaling is more practical from the systems standpoint. Changing the supply voltage every technology generation (when the feature sizes are scaled) is too expensive because multiple power supply generators will be required for each PC board. However, as the channel length scales below about 0.6 µm the 5 V supply voltage must be reduced for reliability reasons (e.g. hot carrier effects, breakdown, etc.). The quasi-constant voltage scaling is an intermediary scheme between the CE and CV views. The scaling factors of the horizontal dimensions and the voltage are denoted by $k_h$ and $k_v$, respectively. Table 3.3 summarizes the scaling of the important device parameters according to the three theories as a function of the horizontal scaling factor ($k_h$). Note that in the QCV scheme, the dimensions scale more aggressively than the voltage ($k_v = k_h^{1/2}$).

For the drain current, the following average value is used

$$I_{DS} \propto \frac{W}{L} C_{ox} (V_{GS} - V_T)^{1.5} \tag{3.85}$$

This expression is not far from the one proposed by [El. Table 3.3 shows the effect of device scaling on the delay, power and energy. It is assumed that a gate drives other gates, where the load is mainly the gate capacitance. The threshold voltage is scaled in proportion to $V_{DD}$. The gate delays improve with scaling for all the scenarios, but at a better rate in the CV scheme. However, the dynamic power, at maximal frequency, of the gate increases by a factor $k_h$ in the case of CV. For the CE scheme, the power is reduced by a high factor equal to $k_h^{1.5}$. Also in this table, the dynamic energy dissipated by a gate is reported. This is independent of frequency. For all schemes, it improves significantly, particularly for the CE case.
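The trends described above can be reproduced from the average current law $I_{DS} \propto (W/L) C_{ox} (V_{GS} - V_T)^{1.5}$ with a small Python sketch. It is not a transcription of Table 3.3: it assumes that all dimensions, including the gate oxide thickness, shrink by $1/k_h$ in every scheme, that voltages shrink by $1/k_v$ with $k_v = k_h$ (CE), $k_v = 1$ (CV) or $k_v = \sqrt{k_h}$ (QCV), and that the load is purely gate capacitance. The function name and dictionary keys are hypothetical.

```python
def scaling_factors(kh, scheme):
    """Relative change of gate-level quantities after scaling by kh.

    Every returned value is the ratio (scaled / original).
    Assumes oxide thickness scales with the horizontal dimensions.
    """
    kv = {"CE": kh, "CV": 1.0, "QCV": kh ** 0.5}[scheme]
    cox = kh                       # capacitance per unit area, t_ox -> t_ox/kh
    cap = cox / kh ** 2            # gate capacitance C_ox * W * L
    voltage = 1.0 / kv
    current = cox * voltage ** 1.5           # W/L is unchanged
    delay = cap * voltage / current          # C V / I
    power = voltage * current                # C V^2 / delay at max frequency
    energy = cap * voltage ** 2              # C V^2
    return {"delay": delay, "power": power, "energy": energy}


for scheme in ("CE", "CV", "QCV"):
    f = scaling_factors(2.0, scheme)
    print(scheme, {name: round(value, 3) for name, value in f.items()})
```

Under these assumptions the delay scales as $k_v^{1/2}/k_h^2$, the power as $k_h/k_v^{2.5}$ and the energy as $1/(k_h k_v^2)$, so CV gives the fastest gates but increases power, while CE gives the largest power and energy reduction.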

Scaling the supply voltage is an efficient way to reduce the power consumption. However, to get a better performance at low voltage the device sizes and the threshold voltage have to be properly scaled. For a fixed sub-micron technology, the supply voltage cannot be reduced aggressively, otherwise the speed is degraded. However, for each fixed technology generation, there is a lower limit power supply voltage $V_{DD,min}$ [18]. For values of $V_{DD}$ higher than this minimum limit the speed does not improve significantly. Typical values for $V_{DD,min}$ are 3.3 V and 2.5 V for $L_{eff}$ of 0.5 µm and 0.3 µm, respectively. On the other hand, the higher limit of $V_{DD}$ is driven by the reliability and the power dissipation limitation. The value of this $V_{DD}$ is proportional to the square root of the design rules ($\delta$) [19]. For 0.6 µm and 0.3 µm design rules with LDD structure, these high limits are 4.5 V and 3.3 V, respectively.


3.5 MODELING OF THE BIPOLAR TRANSISTOR

3.5.1 BJT Structure and Operation

Fig. 3.11 shows a cross-sectional view of an NPN bipolar junction transistor with its geometrical layout and the corresponding symbols for NPN and PNP. To understand the basic operation of the bipolar transistor, a one-dimensional representation of the active region can be used. Fig. 3.12(a) illustrates a typical profile of the one-dimensional section of the active region [Fig. 3.12(b)]. The N+PN- sandwich forms the heart of the BJT.

Consider an NPN transistor with $V_{BE} > 0.5$ V and $V_{BC} < 0$ V (forward-active mode). The corresponding energy band diagram is shown in Fig. 3.12(c). When the N+P (emitter-base) junction is forward-biased, electrons are injected from the emitter into the base (current $I_{nE}$). A small fraction of these electrons recombine in the neutral base ($I_{rB}$)⁸. The rest of the electrons, which constitute the current $I_{nC}$, diffuse through the base towards the reverse-biased base-collector junction where they are swept by the electric field into the base-collector depletion layer. On the other hand, some of the holes in the base are injected into the N+ emitter region, resulting in a current $I_{pE}$. This component is small compared to $I_{nE}$ because the hole concentration in the base is much smaller than the electron concentration in the emitter. The emitter-base depletion layer can be a site for the recombination between the injected electrons and holes, resulting in a current $I_{rec}$. Moreover, some holes are swept into the base due to the generation in the base-collector depletion region, but this component is very small ($\approx 10^{-17}$ A/µm²). The terminal currents can be written as follows

$$I_C = I_{nC} \tag{3.86}$$

$$I_B = I_{pE} + I_{rec} + I_{rB} \tag{3.87}$$

$$I_E = I_{nE} + I_{rec} + I_{pE} \tag{3.88}$$

Note that it has been assumed that the base and collector currents are flowing into the device, while the emitter current is flowing out of it [Fig. 3.12]. The emitter injection efficiency, which is defined as the ratio of the electron current injected into the base to the total emitter current, is given by

$$\gamma = \frac{I_{nE}}{I_E} \tag{3.89}$$

This ratio has to be near unity; that is, the emitter current should mostly be due to electrons for an NPN transistor. The ratio

$$\beta = \frac{I_C}{I_B} \tag{3.90}$$

is defined as the DC current gain.

When the emitter-base junction is reverse-biased and the collector-base junction is forward-biased, the transistor is in the inverse region, where the emitter and collector may be exchanged. When both junctions are reverse-biased the transistor is in the cutoff region. But when they are both forward-biased, the device is said to be in the saturation region. In this situation, both junctions are injecting into the base, and the small electric fields in the two depletion regions sweep the carriers into the emitter and collector regions. Both junctions collect as well as emit.

3.5.2 Ebers-Moll Model

In this section, we present the Ebers-Moll (EM) model, which is a simple DC model of the bipolar transistor. The Ebers-Moll model can be used for hand calculations and first-order circuit analysis. The derivation of the model equations in this section is based on the analysis by Roulston [20]. In Section 3.5.1, we have discussed the device operation in the forward active region only. For a general analysis, we assume that the base-emitter and the base-collector junctions are forward biased. In the following discussion we will neglect the currents due to recombination in the space charge layers and in the base. This implies that $I_{nE} = I_{nC}$; hence, Equation (3.88) reduces to

$$I_E = I_{nC} + I_{pE} \tag{3.91}$$

The current due to the holes injected from the base into the emitter is given by [20]

$$I_{pE} = \frac{q A_E D_{pE}\, p_{nE0}}{W_E} \left[e^{V_{BE}/V_t} - 1\right] \tag{3.92}$$

where $p_{nE0}$ is the equilibrium hole concentration in the emitter and $W_E$ is the neutral emitter width. The current $I_{nC}$ is dominated by the diffusion current in the base and is proportional to the gradient of the minority carriers (electrons) in the neutral base. Because the neutral base ($W_B$) is very thin, this gradient is approximately constant. Therefore, we can write $I_{nC}$ as [20]

$$I_{nC} = q A_E D_{nB} \left[\frac{n_B(0) - n_B(W_B)}{W_B}\right] \tag{3.93}$$

where $n_B(0)$ and $n_B(W_B)$ are the electron concentrations at the edges of the emitter-base and collector-base depletion regions, respectively [see Fig. 3.13]. Note that the slope of the electron concentration in the base is given by the term between the brackets, as demonstrated by Fig. 3.13.

⁸ By applying KCL (i.e. $I_B + I_C - I_E = 0$), we see that $I_{rB}$ is the difference between $I_{nE}$ and $I_{nC}$. If the recombination in the base is neglected ($I_{rB} = 0$), we can assume that $I_{nE} \approx I_{nC}$.

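Using the junction law, the electron concentrations $n_B(0)$ and $n_B(W_B)$ can be expressed in terms of $V_{BE}$ and $V_{BC}$, respectively. The current $I_{nC}$ can hence be given by [20]

$$I_{nC} = \frac{q A_E D_{nB}\, n_i^2}{N_B W_B} \left[e^{V_{BE}/V_t} - e^{V_{BC}/V_t}\right] \tag{3.94}$$

where $N_B$ is the base impurity concentration and $n_i$ is the intrinsic carrier concentration.

The collector current is given by

$$I_C = I_{nC} - I_{pC} \tag{3.95}$$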

The current $I_{pC}$ is due to the holes injected from the base to the collector⁹. The base-collector junction is basically a P+NN+ structure as shown in Fig. 3.12(a). An expression for $I_{pC}$ can be derived from the analysis of a P+NN+ diode. The reader is advised to consult reference [20] for the details of this analysis. The current $I_{pC}$ is given by Equation (3.96), where $p_{nC0}$ is the equilibrium hole concentration in the collector, $W_C$ is the epitaxial thickness under the base and $\tau_{pC}$ is the hole lifetime in the epitaxial layer.

⁹ Note that $I_{pC}$ was not included in Equation (3.86) because in deriving Equation (3.86) we have assumed that the collector-base junction was reverse biased.

By substituting from Equations (3.92) and (3.94) in Equation (3.91) and from Equations (3.94) and (3.96) in Equation (3.95) we get the following equations for $I_E$ and $I_C$

$$I_E = I_F - \alpha_r I_R \tag{3.97}$$

$$I_C = -I_R + \alpha_f I_F \tag{3.98}$$

Equations (3.97) and (3.98) are called the Ebers-Moll equations. Fig. 3.14 shows the equivalent circuit of the BJT based on the Ebers-Moll equations.
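A compact way to experiment with the Ebers-Moll equations is the Python sketch below. It assumes the usual diode forms $I_F = I_{ES}(e^{V_{BE}/V_t} - 1)$ and $I_R = I_{CS}(e^{V_{BC}/V_t} - 1)$ for the two junction currents, takes the base current from KCL, and uses a room-temperature thermal voltage; the function name, parameter names and numerical values are hypothetical.

```python
import math


def ebers_moll(vbe, vbc, i_es, i_cs, alpha_f, alpha_r, v_t=0.02585):
    """Terminal currents (I_E, I_C, I_B) of an NPN transistor.

    I_B and I_C flow into the device, I_E flows out of it.
    i_es, i_cs  saturation currents of the two junction diodes (assumed)
    v_t         thermal voltage in volts (about 25.85 mV near 300 K)
    """
    i_f = i_es * (math.exp(vbe / v_t) - 1.0)   # forward diode current
    i_r = i_cs * (math.exp(vbc / v_t) - 1.0)   # reverse diode current
    i_e = i_f - alpha_r * i_r
    i_c = alpha_f * i_f - i_r
    i_b = i_e - i_c                            # KCL
    return i_e, i_c, i_b


# Forward-active example: V_BE = 0.8 V, V_BC = -2 V.
i_e, i_c, i_b = ebers_moll(0.8, -2.0, 1e-16, 1.98e-16, 0.99, 0.5)
print(f"I_C = {i_c * 1e3:.3f} mA, beta = {i_c / i_b:.1f}")
```

In the forward active region the reverse diode current is negligible, so the ratio $I_C/I_B$ returned by the sketch approaches $\alpha_f/(1 - \alpha_f)$.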

The Ebers-Moll model described above is general and can be used for any region of operation by substituting the appropriate values for $V_{BE}$ and $V_{BC}$. In the forward active region, assuming that $V_{BE} = 0.8$ V and $V_{BC} < 0.3$ V, the emitter and collector currents of Equations (3.97) and (3.98) reduce to

$$I_E = I_F \approx I_{ES}\, e^{V_{BE}/V_t} \tag{3.102}$$

$$I_C = \alpha_f I_F \approx \alpha_f I_{ES}\, e^{V_{BE}/V_t} \tag{3.103}$$

where the reverse saturation current of the base-emitter junction, $I_{ES}$, can be derived from Equation (3.99) and is given by Equation (3.104).

Figure 3.14 Equivalent DC circuit of the BJT based on the Ebers-Moll model

It can easily be shown that the base current can be expressed as

$$I_B = \frac{1 - \alpha_f}{\alpha_f} I_C \tag{3.105}$$

Equations (3.102), (3.103) and (3.105) are the well-known current equations of a forward-biased bipolar transistor. Note that Equation (3.105) yields the famous relation between $\alpha_f$ and the DC forward current gain $\beta$ [$\beta = \alpha_f/(1 - \alpha_f)$].

The simple Ebers-Moll model lacks accuracy for the following three reasons

1. It does not account for the parasitic resistors of the emitter, base and collector.

2. It does not account for the Early effect, which causes the collector current to increase as the collector-emitter voltage increases.

3. It does not account for the effect of the high collector currents on the current gain.

Next, we will discuss the modeling of each phenomenon separately.

3.5.2.1 The Parasitic Resistors of a Bipolar Transistor

Fig. 3.15 shows the modification of the EM model by the addition of the base resistance $R_B$, the collector resistance $R_C$ and the emitter resistance $R_E$. These extrinsic components represent the transistor's parasitic resistances from its active region to its base, collector and emitter terminals, respectively.

The effect of the parasitic resistances is important because the voltage drops across them contribute to the external base-emitter and collector-emitter voltages $V_{B'E'}$ and $V_{C'E'}$, respectively, as shown by the following two equations

$$V_{B'E'} = V_{BE} + R_B I_B + R_E I_E \tag{3.106}$$

$$V_{C'E'} = V_{CE} + R_C I_C + R_E I_E \tag{3.107}$$


The drop across the parasitic resistors has to be accounted for to get more accurate results from the EM model. Neglecting these drops may even lead to erroneous results. For example, if the external collector-emitter voltage is found to be equal to 2 V, one may deduce that the BJT operates in the active region. However, if $R_C = 1.8$ kΩ, $R_E = 0.1$ kΩ and $I_C \approx I_E = 1$ mA, then the intrinsic collector-emitter voltage ($V_{CE}$) is 0.1 V. This implies that the bipolar transistor is actually saturated. This phenomenon is known as Quasi-Saturation.
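The quasi-saturation example can be verified numerically. The Python sketch below subtracts the drops across the collector and emitter resistances from the external collector-emitter voltage; the function names are hypothetical, and the 0.2 V saturation threshold used by the helper is an illustrative assumption, not a value taken from the text.

```python
def intrinsic_vce(vce_ext, i_c, i_e, r_c, r_e):
    """Intrinsic collector-emitter voltage (V).

    vce_ext  external (terminal) collector-emitter voltage (V)
    i_c,i_e  collector and emitter currents (A)
    r_c,r_e  parasitic collector and emitter resistances (ohm)
    """
    return vce_ext - r_c * i_c - r_e * i_e


def looks_saturated(vce, vce_sat=0.2):
    """Crude check: an intrinsic V_CE below vce_sat suggests saturation.

    vce_sat is an assumed threshold used only for this illustration.
    """
    return vce < vce_sat


vce = intrinsic_vce(2.0, 1e-3, 1e-3, 1800.0, 100.0)
print(f"intrinsic V_CE = {vce:.2f} V, saturated: {looks_saturated(vce)}")
```

With 1.8 V dropped across $R_C$ and 0.1 V across $R_E$, only about 0.1 V remains across the intrinsic device even though 2 V is applied externally.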

3.5.2.2 The Early Effect

The Early effect refers to the base width modulation due to the change of the collector-base reverse voltage (in the forward active region). As the collector-base reverse voltage increases, the base-collector depletion layer widens. The resulting reduction in the neutral base width causes the current gain to increase which, in turn, leads to an increase in the collector current [see Fig. 3.16]. This effect can be modeled by introducing the Early voltage ($V_{Af}$) in the expression of the collector current as follows

(3.108)

The inverse of the forward Early voltage, $1/V_{Af}$, is analogous to the coefficient $\lambda$ in an MOS transistor. A typical value of $V_{Af}$ is 50 V. The AC output resistance of the BJT in the forward active region is related to the Early voltage and is given by

$$r_o = \frac{V_{Af}}{I_C} \tag{3.109}$$

The Early effect in the inverse active region can be modeled by using the reverse Early voltage ($V_{Ar}$), which characterizes the slope of the collector current in that region (inverse active region).
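To get a feel for the magnitude of the Early effect, the following Python sketch uses the common first-order form in which the collector current is multiplied by $(1 + V_{CE}/V_{Af})$. This particular form is an assumption of the example (other texts use $V_{CB}$ in place of $V_{CE}$), and the function names are hypothetical.

```python
def collector_current_with_early(i_c0, v_ce, v_af=50.0):
    """Collector current including a first-order Early-effect term.

    i_c0  collector current without base-width modulation (A)
    v_ce  collector-emitter voltage (V)
    v_af  forward Early voltage (V), 50 V being the typical value quoted
    """
    return i_c0 * (1.0 + v_ce / v_af)


def output_resistance(i_c, v_af=50.0):
    """Small-signal output resistance r_o = V_Af / I_C (ohm)."""
    return v_af / i_c


print(f"r_o at 1 mA = {output_resistance(1e-3) / 1e3:.0f} kOhm")
print(f"I_C at V_CE = 5 V: {collector_current_with_early(1e-3, 5.0) * 1e3:.2f} mA")
```

For $V_{Af}$ = 50 V and a 1 mA bias, the output resistance is about 50 kΩ, and a 5 V collector-emitter voltage raises the current by 10%.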

Figure 3.16 The I-V characteristics of a BJT

3.5.2.3 High Current Effects

The current gain and the cut-off frequency are degraded at high collector current. Fig. 3.17 shows the effect of the collector current on the gain. This degradation can be attributed to the high-level injection in the base (Webster effect) and/or the base pushout (Kirk effect). For a detailed discussion of these phenomena, the reader is advised to consult reference [20]. In the case where the injection level in the base is high (Webster effect), the collector current can be expressed as [23]

$$I_C = \sqrt{I_S I_{Kf}}\; e^{V_{BE}/2V_t} \tag{3.110}$$

where the forward knee current $I_{Kf}$ is defined as the collector current at which its slope in the Gummel plot changes from 1 to 1/2 [see Fig. 3.18]. This current marks the onset of high-level injection. The degradation of the current gain, when $I_C > I_{Kf}$, can be described by the following relation [20]

$$\beta = \frac{I_C}{I_B} = \beta_0 \frac{I_{Kf}}{I_C} \tag{3.111}$$

where $\beta_0$ is the value of the gain when $I_C < I_{Kf}$. The modeling of the Kirk effect is very complex. However, a simple model for the current gain, which can be used in first-order circuit analysis, is given below [21]

(3.112)

The accuracy of the simple EM model can be enhanced by accounting for the parasitic resistors, the Early effect and the high current effect, which can be modeled by simple analytical expressions as shown above.

3.5.3 Bipolar Models in SPICE

Two BJT models are implemented in SPICE: the Ebers-Moll model and a more sophisticated one, which is based on the Gummel-Poon (GP) model [22]. The second model includes the following second-order effects:

■ Base width modulation effect.

■ Very low current effect on the gain.

■ High-level injection effects (the Kirk effect is not included).

■ Base resistance variation with current.

The GP model is based on one-dimensional analysis. It is valid for all regions of operation: cutoff, forward-active, inverse-active, and saturation. The GP-based bipolar model is illustrated by the equivalent circuit shown in Fig. 3.19.

* A typical value of $I_{Kf}$ is 1 mA/µm².

Figure 3.19 The GP-based model of a bipolar transistor

The two back-to-back diodes on the right represent the intrinsic base-emitter and base-collector junctions and their currents are given by [23]

$$I_{CC} = \frac{I_S}{q_b} \left(e^{V_{BE}/n_f V_t} - 1\right) \tag{3.113}$$

$$I_{EC} = \frac{I_S}{q_b} \left(e^{V_{BC}/n_r V_t} - 1\right) \tag{3.114}$$

where $I_S$ is given by [23]

(3.115)

The forward and reverse current emission coefficients ($n_f$ and $n_r$), which are introduced in Equations (3.113) and (3.114), are used to model the low currents. The parameter $q_b$ (base charge factor) accounts for the high current and base width modulation effects. It is given by [23]

$$q_b = \frac{q_1}{2} + \sqrt{\left(\frac{q_1}{2}\right)^2 + q_2} \tag{3.116}$$

$q_1$ models the effects of base width modulation and can be expressed as

The general expression of $q_b$ [Equation (3.116)] can be simplified for low-level and high-level injection conditions.

$$q_b \approx \begin{cases} q_1 & \text{if } q_2 \ll q_1^2/4 \text{ (low-level injection)} \\ \sqrt{q_2} & \text{if } q_2 \gg q_1^2/4 \text{ (high-level injection)} \end{cases} \tag{3.119}$$
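The two limiting cases can be checked numerically with the Python sketch below. It assumes the general form $q_b = q_1/2 + \sqrt{(q_1/2)^2 + q_2}$ for the base charge factor; the function name and the test values are hypothetical.

```python
import math


def base_charge_factor(q1, q2):
    """Normalized base charge q_b from q1 (base-width modulation)
    and q2 (high-level injection), assuming the general quadratic form."""
    return q1 / 2.0 + math.sqrt((q1 / 2.0) ** 2 + q2)


# Low-level injection: q2 << q1^2 / 4, so q_b is close to q1.
print(base_charge_factor(1.0, 1e-8))
# High-level injection: q2 >> q1^2 / 4, so q_b is close to sqrt(q2).
print(base_charge_factor(1.0, 1e6), math.sqrt(1e6))
```

The two approximations differ from the general expression by well under 1% once $q_2$ is a few orders of magnitude away from $q_1^2/4$.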

The two back-to-back diodes on the left [Fig. 3.19] account for the currents caused by the recombination of carriers in the emitter-base and the collector-base space-charge layers and other recombinations. These currents can be modeled by [23]

$$C_2 I_S \left(e^{V_{BE}/n_e V_t} - 1\right) \tag{3.120}$$

$$C_4 I_S \left(e^{V_{BC}/n_c V_t} - 1\right) \tag{3.121}$$

where $C_2$, $C_4$, $n_e$ and $n_c$ have been introduced to fit the measured currents.

Further improvements to this model are possible by the inclusion of three parasitic resistances ($R_C$, $R_E$, $R_B$); three junction capacitances ($C_E$, $C_C$, $C_S$); and two diffusion capacitances ($C_{DE}$, $C_{DC}$) as shown in Fig. 3.19.

The model of the base resistance takes into account the effect of the current (current crowding) through the following expression [24]

$$R_B(I) = R_{Bm} + 3 (R_B - R_{Bm}) \left[\frac{\tan(z) - z}{z \tan^2(z)}\right] \tag{3.122}$$

where the variable $z$ is given by

$R_B$ represents the low-current maximum resistance and $R_{Bm}$ the high-current minimum resistance.

The junction depletion capacitance is a function of the junction voltage ($V$). This function can be approximated by the following two expressions

$$C_{j,dep} = C_j \left(1 - \frac{V}{\phi_j}\right)^{-M_j} \quad \text{if } V < FC\,\phi_j \tag{3.124}$$

The empirical factor FC has a value between 0 and 1. Its default value in SPICE is 0.5. Note that Equations (3.124) and (3.125) apply for a reverse and a forward biased junction, respectively.
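The role of the factor FC can be illustrated with a short Python sketch. Below $FC\,\phi_j$ it uses the power-law expression above; above that voltage it assumes the standard SPICE linear extrapolation, which keeps the capacitance and its slope continuous at the transition. That forward-bias branch is an assumption of this example, and the function name and numerical values are hypothetical.

```python
def depletion_capacitance(v, cj0, phi_j, mj, fc=0.5):
    """Junction depletion capacitance versus junction voltage.

    v      junction voltage (V), positive when forward biased
    cj0    zero-bias capacitance
    phi_j  built-in potential (V)
    mj     grading factor
    fc     forward-bias coefficient (SPICE default 0.5)
    """
    if v < fc * phi_j:
        # Power-law expression, valid for reverse and small forward bias.
        return cj0 * (1.0 - v / phi_j) ** (-mj)
    # Assumed SPICE-style linear extrapolation beyond fc * phi_j.
    return (cj0 * (1.0 - fc) ** (-(1.0 + mj))
            * (1.0 - fc * (1.0 + mj) + mj * v / phi_j))


for v in (-2.0, 0.0, 0.4, 0.435, 0.8):
    print(f"V = {v:6.3f} V  C/Cj0 = {depletion_capacitance(v, 1.0, 0.87, 0.265):.4f}")
```

Without the linear branch the power law would diverge as $V$ approaches $\phi_j$, which is why simulators switch expressions at $FC\,\phi_j$.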

The diffusion capacitances model the charge associated with injected carriers. For example, the electrons injected in the base have a corresponding storage charge

$$Q_{DE} = \tau_f I_{CC} \tag{3.126}$$

The forward transit time $\tau_f$ is current-dependent and is given by an empirical expression [24]

where VTF is a fitting parameter to model the change of $\tau_f$ as a function of $V_{BC}$ (or $V_{CB}$), ITF models the change due to $I_C$ and XTF controls the increase of $\tau_f$. $I_{CC}$ is the collector current in the absence of the high-current effects, which corresponds to that of the Ebers-Moll model.

The diffusion capacitance (associated with the electrons injected from the emitter into the base, when the base-emitter junction is forward biased) is given by

$$C_{DE} = \frac{\partial Q_{DE}}{\partial V_{BE}} \tag{3.128}$$

Similarly, the base-collector junction has a diffusion capacitance, which is given by

$$C_{DC} = \frac{\partial Q_{DC}}{\partial V_{BC}} \tag{3.129}$$

where

$$Q_{DC} = \tau_r I_{EC} \tag{3.130}$$

Although the SPICE models account for most of the first and second order effects, they are not highly accurate. This originates from some weaknesses in the theory on which the models are based. As the device features are scaled down, the currently available models become less accurate. The physics and the theory of the scaled devices are more complex. Hence, accurate modeling becomes very difficult. One way around that problem is to choose the model parameters such that the simulated device characteristics agree with measurements. In practice, the models' parameters are extracted automatically using parameter analyzers with software tools to obtain the best fit. As a result, the values of the extracted parameters may not correspond to their actual values. For example, it is common to find a discrepancy of 20% between the measured current gain of a bipolar transistor and that listed in the SPICE file. Another approach, which is equivalent to tweaking the parameters, is to use empirical models (e.g. the BSIM model), in which the empirical (fitting) parameters can be optimized to get the best fit between simulation and measurements.

Typical GP parameters for the 0.8 µm BiCMOS presented in Chapter 2 are shown in Tables 3.4 and 3.5.

Table 3.4 Bipolar device parameters and HSPICE correspondence

SPICE Keyword   Description
IS              Saturation current
BF              Ideal maximum forward gain
BR              Ideal maximum reverse gain
NF              Forward current-emission coefficient
NR              Reverse current-emission coefficient
VAF             Forward Early voltage
VAR             Reverse Early voltage
IKF             Forward-knee current
IKR             Reverse-knee current
ISE             Base-emitter leakage saturation current
ISC             Base-collector leakage saturation current
NE              Base-emitter leakage emission coefficient
NC              Base-collector leakage emission coefficient
RE              Emitter resistance
RC              Collector resistance
RB              Base resistance at zero current
IRB             Base current where RB = RB(0)/2
RBM             Minimum high-current base resistance
CJE             Base-emitter zero-bias depletion cap.
VJE             Base-emitter built-in potential
MJE             Base-emitter junction grading factor
CJC             Base-collector zero-bias depletion cap.
VJC             Base-collector built-in potential
MJC             Base-collector junction grading factor
CJS             Collector-substrate zero-bias cap.
VJS             Collector-substrate built-in potential
MJS             Collector-substrate junction grading factor
XCJC            Internal base fraction of base-collector cap.
FC              Coefficient for forward-bias depletion cap.
TF              Forward transit time ($\tau_f$)
XTF             TF bias-dependent coefficient
VTF             TF base-collector voltage dependence coeff.
ITF             TF high-current parameter
TR              Reverse transit time ($\tau_r$)
XTB             Forward and reverse beta temperature exponent
XTI             Saturation current temperature exponent
EG              Energy gap
KF              Flicker noise coefficient
AF              Flicker noise exponent

Table 3.5 HSPICE BJT model parameters (0.8 µm BiCMOS process)

SPICE Keyword   Value            Units
IS              2 x 10^…         A
BF              100
BR              1
NF              1
NR              1
VAF             50               V
VAR             5                V
IKF             5 x 10^…         A
IKR             0                A
ISE             0                A
RE              30               Ω
RC              87               Ω
RB              650              Ω
IRB             0                A
RBM             650              Ω
CJE             1.51 x 10^-…     F
VJE             0.87             V
MJE             0.265
CJC             1.15 x 10^-14    F
VJC             0.713            V
FC              0.5
TF              12.5 x 10^…      s
XTF             916.2
VTF             1.6
ITF             8.7 x 10^-2
TR              4 x 10^-8        s
XTB             1.4
XTI             3.5
EG              1.11             eV
KF              2.9 x 10^-…
AF              2.0


3.5.4 Chapter Summary

In this chapter, we have reviewed the fundamentals of the MOS and bipolar devices. The most common device models used in SPICE have been presented.

The key device parameters of each model have been defined and explained, so that the reader is familiar with the details of these models and can appreciate the importance of the different model parameters. The reader is given a list of model parameters, for a typical 0.8 µm BiCMOS process, that can be used for circuit simulations. These models can be used even at low-voltage operation. Moreover, a simple analytical model valid for submicrometer MOSFETs has been discussed.

REFERENCES

[I] A. Vlrudimirescu, and S. Lio, "The simulation of MOS Integrated Circaits using SPICEZ," M m o . No. UCB/ERL M80/7, Univ. Cdifomia, Berkeley, October 1980.

[Z] H. Masuda, M. Nakai and M, Kubo, "Characteristics and Limitations of Scaled Down MOSFET's Due to Two Dimensional Field Effect," IEEE Trans. on Electron Devices. Vol. ED-26, pp. 980-986, 1979.

[3] R.L.M. D u g , "A Simple Current Model for Short-Channel IGFET and Its Application to Circuit Simulation," IEEE Journal of Solid-State Circuits, vol. SC-14, pp. 358-367,1979.

(41 G. Merkd, J . Bore1 and N.Z. Cupces. "An Accurate Large Signal MOS Transistor Model for Use in Computer-Aided Design," IEEE Trans. an Electron Devices, vol. ED-IS, 1972.

[5] G. Baum and 8 . Beneking, 'Drift Velocity Saturation in MOS Tranris- tors," IEEE Trans. on Electron Devices, YOI. ED-17, pp. 481-482, 1970.

[6] R.M. Swanson and J.D. Meindl, "Ion-Implanted Complementary MOS Transistors in Lou-Voltage Circuits," IEEE Journal of Solid-state Cir- cuits, vol. SC-7, pp. 146-153, 1972.

171 B.J. Sheu, D.L. Scharfetter, P.-K. KO, and M.C. Jeng, "BSIM Berke- ley Short-Channel IGFET Model for MOS Transistors," IEEE Journal of Solid-state Circuits, vol. SC-22, pp. 558-566, 1987.

[8] J. 8. Huang, Z. H. Liu, M. C. Jeng, P. K. KO, and C. Ha, "A Robust physical and Predictive Model for Deep-Snbmicmmeter MOS Circuit Sim- ulation," IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.2.1- 14.2.4, May 1993.

[9] D.E. Ward and R.W. Dutton, "A Chargeoriented Model for MOS Tran- sistors Capacitances," IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 703-707, 1978.


[10] Y. P. Tsividis, "Operation and Modeling of the MOS Transistor," McGraw-Hill, 1988.

[11] T. Sakata et al., "Subthreshold-Current Reduction Circuits for Multi-Gigabit DRAM's," IEEE Journal of Solid-State Circuits, vol. 29, no. 7, pp. 761-769, July 1994.

[12] S.M. Sze, "Physics of Semiconductor Devices," John Wiley & Sons, 1981.

[13] C.G. Sodini, P.-K. Ko, and J.L. Moll, "The Effect of High Fields on MOS Device and Circuit Performance," IEEE Trans. on Electron Devices, vol. ED-31, no. 10, pp. 1386-1393, October 1984.

[14] B. Hoefflinger, H. Sibbert, and G. Zimmer, "Model and Performance of Hot-Electron MOS Transistors for VLSI," IEEE Trans. on Electron Devices, vol. ED-26, p. 513, 1979.

[15] C. Hu, "Low-Voltage CMOS Device Scaling," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 86-87, 1994.

[16] R.H. Dennard et al., "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE Journal of Solid-State Circuits, vol. SC-9, pp. 256-266, October 1974.

[17] P.K. Chatterjee et al., "The Impact of Scaling Laws on the Choice of N-Channel or P-Channel for MOS VLSI," IEEE Electron Device Letters, vol. EDL-1, pp. 220-223, October 1980.

[18] M. Kakumu, "Process and Device Technologies of CMOS Devices for Low-Voltage Operation," IEICE Trans. Electron., vol. E76-C, no. 5, pp. 672-680, May 1993.

[19] M. Kakumu, M. Kinugawa, and K. Hashimoto, "Choice of Power-Supply Voltage for Half-Micrometer and Lower Submicrometer CMOS Devices," IEEE Trans. on Electron Devices, vol. 37, no. 6, pp. 1334-1342, May 1990.

[20] D.J. Roulston, "Bipolar Semiconductor Devices," McGraw-Hill Publishing Company, 1990.

[21] K. Nakazato et al., "Characteristics and Scaling Properties of n-p-n Transistors with a Sidewall Base Contact Structure," IEEE Trans. on Electron Devices, vol. ED-32, no. 2, pp. 328-332, February 1985.

[22] H.K. Gummel and H.C. Poon, "An Integral Charge Control Model of Bipolar Transistors," Bell Syst. Tech. J., vol. 49, 1970.

[23] I. Getreu, "Modeling the Bipolar Transistor," Tektronix, Inc., 1976.

[24] P. Antognetti and G. Massobrio, "Semiconductor Device Modeling with SPICE," McGraw-Hill, 1988.

4 LOW-VOLTAGE LOW-POWER VLSI CMOS CIRCUIT DESIGN

In this chapter we introduce the CMOS logic gate and develop simple models for delay and power dissipation estimation. These analyses permit us to understand the mechanisms that control the performance, particularly the power dissipation, of a logic circuit. Several CMOS design styles, such as pseudo-NMOS, dynamic logic and NORA, are presented. Other circuit variations of the static complementary CMOS, which are suitable for low-power applications, are discussed. These include the pass-transistor logic families such as Complementary Pass-transistor Logic (CPL), Dual Pass-transistor Logic (DPL), and Swing Restored Pass-transistor Logic (SRPL). An overview of clocking strategy in VLSI systems is also covered. Included in this chapter is one important area, the I/O circuits, whose power dissipation is also analyzed. Finally, low-power techniques for CMOS design are reviewed at the transistor level. We will cover the low-power issues at the subsystem/system/architecture levels in Chapters 6, 7 and 8 in more detail. Several books treat other CMOS circuit design aspects in detail [1, 2, 3]. The reader can refer to them.

Many issues existing in today's advanced CMOS circuit structures are considered, such as:

• Concept of switching activity;

• Single-phase clocking strategy;

• Clock skew issue;

• Power dissipation components of a CMOS gate and their importance;

• Power dissipation in I/O circuits;

• Clock distribution in VLSI systems;

• Ground bouncing; and

• Low-power circuit techniques and design guidelines.

4.1 CMOS INVERTER DC CHARACTERISTICS

Fig. 4.1 shows the basic complementary MOS inverter. Before deriving the DC transfer characteristics of this inverter (the output voltage versus the input voltage), let us understand the operation of this circuit.

When the input is HIGH, which means at $V_{DD}$, we have

$V_{GSn} = V_{in} = V_{DD}$ (4.1)

$V_{GSp} = V_{in} - V_{DD} = 0$ (4.2)

In this case, $V_{GSn} > V_{Tn}$ and $|V_{GSp}| < |V_{Tp}|$. The PMOS is OFF and the NMOS is ON. The NMOS transistor N provides a current path to ground. The final stable value of the output voltage $V_o$ is

$V_o = 0$ (4.3)

At the steady state, the DC current from $V_{DD}$ to ground is controlled by the subthreshold current of the PMOS P, since this device is OFF and the NMOS N has a $V_{DS}$ equal to zero. We assume that the junction leakage is negligible. If $V_{Tp}$ (see footnote 1) is low enough (lower, for example, than -0.5 V), the subthreshold current is negligible (< 1 pA/µm width). If $V_{Tp}$ (negative) is high, the subthreshold current is not negligible and can be as high as 1 µA/µm for $V_{Tp}$ = -0.05 V [see Section 3.3.2]. In this case the output is not exactly at zero and can have a value of tens of mV. In this section we assume that the subthreshold current is not important. Low-$V_T$ CMOS circuits are treated in Section 4.10.

Similarly, when $V_{in}$ is low (0 V), $V_{GSn} < V_{Tn}$ and $|V_{GSp}| > |V_{Tp}|$. The PMOS transistor is ON and the NMOS transistor is OFF. The output voltage is given by

$V_o = V_{DD}$ (4.4)

Here, too, we assume that the leakage current is negligible.

Footnote 1: Extrapolated threshold voltage.

Figure 4.1 A CMOS inverter

The logic levels of the CMOS inverter are close to $V_{DD}$ and ground, and the logic swing is equal to $V_{DD}$. This is a main feature of CMOS gates.

4.1.1 Transfer Characteristics

In this section we discuss the DC characteristics of the CMOS inverter of Fig. 4.1. Fig. 4.2 shows the DC transfer characteristic with the different regions of operation. For simplicity we use, for the MOS devices, the simple current models presented in Section 3.2.1. The circuit operation can be divided into five regions:

Region (A): $0 \le V_{in} < V_{Tn}$. The NMOS transistor is operating in the subthreshold region and the current is assumed zero. Hence the PMOS current is also zero. The PMOS transistor is in the linear region. Thus, $V_o = V_{DD}$.

Region (B): $V_{Tn} < V_{in} < V_{inv}$. $V_{inv}$ is defined as the input voltage at which the gain of the inverter is maximum and is also called the gate threshold voltage. In this region, the NMOS transistor is operating in the saturation region and the PMOS is in the linear region. Since the current in both devices is the same (in absolute value), we have

$I_{DSp} = -I_{DSn}$ (4.5)

The PMOS current is given by

$I_{DSp} = -\beta_p \left[ (V_{in} - V_{DD} - V_{Tp})(V_o - V_{DD}) - \frac{1}{2}(V_o - V_{DD})^2 \right]$ (4.6)

where

$\beta_p = k_p \frac{W_{eff}}{L_{eff}}$ (4.7)

and

$V_{GSp} = V_{in} - V_{DD}$ (4.8)

The saturation current of the NMOS is given by

$I_{DSn} = \frac{\beta_n}{2}(V_{GSn} - V_{Tn})^2$ (4.10)

where

$\beta_n = k_n \frac{W_{eff}}{L_{eff}}$ (4.11)

and

$V_{GSn} = V_{in}$ (4.12)

Using equations (4.5), (4.6) and (4.10), the output voltage is given by

$V_o = (V_{in} - V_{Tp}) + \sqrt{(V_{in} - V_{Tp})^2 - 2\left(V_{in} - \frac{V_{DD}}{2} - V_{Tp}\right)V_{DD} - \frac{\beta_n}{\beta_p}(V_{in} - V_{Tn})^2}$ (4.13)

This equation of $V_o$ versus $V_{in}$ is plotted in Fig. 4.2, region (B).

Region (C): $V_{in} = V_{inv}$. Both the NMOS and PMOS transistors are in the saturation region. In this case, the PMOS current is given by

$I_{DSp} = -\frac{\beta_p}{2}(V_{in} - V_{DD} - V_{Tp})^2$ (4.14)

The NMOS saturation current is given in Equation (4.10). By equalizing the absolute values of the two drain currents we have

$V_{inv} = \frac{V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta}}{1 + \sqrt{\beta}}$ (4.15)

where

$\beta = \frac{\beta_n}{\beta_p}$ (4.16)

This equation is very useful from a design point of view. Note, from this equation, that the logic threshold voltage of this gate is set by the designer, since the parameters $\beta_n$ and $\beta_p$ depend on $W_{eff}$ and $L_{eff}$. Moreover, region (C) is defined for only one point of $V_{in}$. For symmetrical NMOS and PMOS devices we have

$V_{Tn} = -V_{Tp}$ (4.17)

If the designer sets

$\beta_n = \beta_p$ (4.18)

This ratio is a typical example. The designer should set the size ratio as

(4.20)

We obtain

$V_{in} = V_{inv} = \frac{V_{DD}}{2}$ (4.21)

An inverter with this $V_{inv}$ is sometimes called a symmetrical gate. The output voltage in this case is not necessarily equal to $V_{DD}/2$ and is given by the following inequality

$V_{in} - V_{Tn} < V_o < V_{in} + |V_{Tp}|$ (4.22)

In reality, $V_o$ is set by the slight dependence of $I_{DS}$ on $V_{DS}$.

Region (D): $V_{inv} < V_{in} < V_{DD} + V_{Tp}$. In this region the NMOS is in the linear region while the PMOS is in the saturation region. An analysis similar to that used in region (B) can be applied. The output voltage is given by

$V_o = (V_{in} - V_{Tn}) - \sqrt{(V_{in} - V_{Tn})^2 - \frac{\beta_p}{\beta_n}(V_{in} - V_{DD} - V_{Tp})^2}$ (4.23)

Region (E): $V_{DD} + V_{Tp} < V_{in} \le V_{DD}$. In this region the NMOS transistor is ON, and in the linear region, and the PMOS is operating in the subthreshold region. If we assume that this current is negligibly small, then

$V_o = 0$ (4.24)

The current flowing from $V_{DD}$ to ground, versus the input voltage, is plotted in Fig. 4.2(b). It reaches its maximum when both MOS transistors are in saturation. It is important to note that for $V_{in} = V_{inv}$ the DC power dissipation is maximal.
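The five regions can be stitched into a single transfer curve. The following is a minimal sketch, assuming the square-law current model and the region expressions in the forms $V_o = (V_{in} - V_{Tp}) + \sqrt{(V_{in} - V_{DD} - V_{Tp})^2 - \beta (V_{in} - V_{Tn})^2}$ for region (B) and $V_o = (V_{in} - V_{Tn}) - \sqrt{(V_{in} - V_{Tn})^2 - (V_{in} - V_{DD} - V_{Tp})^2/\beta}$ for region (D), with $\beta = \beta_n/\beta_p$. The function name and the default device values are illustrative choices, not taken from the text.

```python
import math

def inverter_vtc(vin, vdd=3.3, vtn=0.5, vtp=-0.5, beta=1.0):
    """Output voltage of a CMOS inverter for a DC input (square-law model).

    beta is the ratio beta_n / beta_p; vtp is negative.
    Subthreshold currents are ignored, as in the text.
    """
    root = math.sqrt(beta)
    vinv = (vdd + vtp + vtn * root) / (1.0 + root)   # gate threshold
    if vin <= vtn:                  # region (A): NMOS off
        return vdd
    if vin >= vdd + vtp:            # region (E): PMOS off
        return 0.0
    if vin < vinv:                  # region (B): NMOS saturated, PMOS linear
        a = vin - vdd - vtp
        disc = max(0.0, a * a - beta * (vin - vtn) ** 2)
        return (vin - vtp) + math.sqrt(disc)
    if vin > vinv:                  # region (D): NMOS linear, PMOS saturated
        b = vin - vtn
        disc = max(0.0, b * b - (vin - vdd - vtp) ** 2 / beta)
        return b - math.sqrt(disc)
    # region (C): the output is only bounded; return the midpoint of the bounds
    return vin - 0.5 * (vtn + vtp)

if __name__ == "__main__":
    for step in range(0, 34, 3):
        vin = step / 10.0
        print(f"Vin = {vin:.1f} V  ->  Vo = {inverter_vtc(vin):.3f} V")
```

The curve is flat at $V_{DD}$ and at ground in regions (A) and (E) and drops steeply around the gate threshold, which is where the DC current peaks.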

Figure 4.3 Effect of the ratio β on the (a) DC transfer characteristic; (b) threshold voltage of the CMOS inverter

4.1.2 Effect of β

As we discussed before, the ratio β controls the threshold voltage of the CMOS inverter. This parameter is set by the circuit designer through the transistor sizes. Other parameters, such as the mobility and the threshold voltage of the devices, are set during fabrication and the circuit designer cannot change them. Fig. 4.3 illustrates the dependence of the DC transfer characteristics and the threshold voltage of the CMOS inverter on the ratio β. Increasing β decreases the voltage $V_{inv}$. $V_{inv}$ has a practical maximum less than $V_{DD} + V_{Tp}$ and a practical minimum greater than $V_{Tn}$. Practical values mean that β cannot be zero or infinite. In general, the circuit designer tries to set β = 1 for symmetrical operation unless the gate is used to switch an input swing different from a CMOS swing (from ground to $V_{DD}$).
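The trend can be checked numerically. This is a small sketch, assuming the gate threshold follows the usual both-devices-saturated expression $V_{inv} = (V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta})/(1 + \sqrt{\beta})$; the function name and the sample values (3.3 V supply, ±0.5 V thresholds) are illustrative.

```python
import math

def gate_threshold(vdd, vtn, vtp, beta):
    """Inverter gate threshold V_inv; beta = beta_n / beta_p, vtp < 0."""
    root = math.sqrt(beta)
    return (vdd + vtp + vtn * root) / (1.0 + root)

if __name__ == "__main__":
    for beta in (0.1, 0.5, 1.0, 2.0, 10.0):
        vinv = gate_threshold(3.3, 0.5, -0.5, beta)
        print(f"beta = {beta:5.1f}  ->  V_inv = {vinv:.3f} V")
```

For β = 1 and symmetrical thresholds the result is $V_{DD}/2$; as β grows the threshold moves toward $V_{Tn}$, and as β shrinks it moves toward $V_{DD} + V_{Tp}$, without reaching either limit.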

4.1.3 Noise Margins

Noise margin is an important parameter in logic design. It is defined as the allowable noise voltage on the input such that the output is not affected. In other words, we define the valid logic levels such that they are restored when they propagate through a digital circuit. The logic levels can be extracted from the DC characteristic. As illustrated in Fig. 4.4, we define the levels at the input by

• Logic 0: for $0 \le V_{in} \le V_{IL}$

• Logic 1: for $V_{IH} \le V_{in} \le V_{DD}$

and at the output by

• Logic 0: for $0 \le V_o \le V_{OL}$

• Logic 1: for $V_{OH} \le V_o \le V_{DD}$

The LOW noise margin is defined by

$NM_L = |V_{IL} - V_{OL}|$ (4.25)

and the HIGH noise margin is defined by

$NM_H = |V_{OH} - V_{IH}|$ (4.26)

The $V_{IL}$ and $V_{IH}$ levels can be defined as the points where the slope of the DC transfer characteristic is -1, i.e.,

$\frac{dV_o}{dV_{in}} = -1$ (4.27)

These values can be deduced using equations (4.13) and (4.23). To have good noise margins, it is desirable to have $V_{IL}$ and $V_{IH}$ near each other, around the point $V_{DD}/2$.

For CMOS circuits, the output voltage levels can be defined by letting $V_{OH} = V_{DD}$ and $V_{OL} = 0$. The CMOS logic inverter has a fairly ideal transfer function and it tends to have very good noise margins. In some applications, either $NM_H$ or $NM_L$ is compromised to obtain good speed of operation.

4.1.4 Minimum Power Supply

To obtain the maximum power saving in CMOS logic circuits, the power supply voltage should be reduced. So, what is the lowest practical supply voltage at which CMOS will operate? In 1972, Swanson and Meindl [4] demonstrated that the minimum supply voltage is given by

$V_{DD,min} = \frac{8kT}{q}$ (4.28)

At room temperature this value is equal to 0.2 V. This demonstrates that CMOS is a good candidate for ultra-low-power applications.
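The 0.2 V figure is easy to reproduce. The sketch below assumes the limit is read as eight thermal voltages, $8kT/q$, and takes room temperature as 300 K; both the function name and the choice of 300 K are illustrative.

```python
BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C

def min_supply_voltage(temperature_kelvin):
    """Minimum CMOS supply voltage taken as 8 kT/q, in volts."""
    return 8.0 * BOLTZMANN * temperature_kelvin / ELECTRON_CHARGE

if __name__ == "__main__":
    for temp in (250.0, 300.0, 400.0):
        print(f"T = {temp:.0f} K  ->  V_DD,min = {min_supply_voltage(temp):.3f} V")
```

Because the limit scales linearly with absolute temperature, a circuit that must operate hot needs proportionally more headroom.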

4.1.5 Example of Noise Margins

For an inverter with $W_p = 2W_n = 4$ µm (in 0.8 µm CMOS technology), and using a threshold voltage $V_T = V_{Tn} = |V_{Tp}| = 0.5$ V, we have the following values for $NM_L$ and $NM_H$. At a 3.3 V power supply voltage, $NM_L$ = 1.15 V and $NM_H$ = 1.45 V. However, at 1.5 V, $NM_L$ = 0.60 V and $NM_H$ = 0.65 V. So the noise level should be kept low, particularly at low power supply voltage.

Figure 4.5 CMOS inverter and switching characteristic

4.2 CMOS INVERTER SWITCHING CHARACTERISTICS

In this section, we present the transient behavior of the CMOS inverter. A very simple analytic model for delay is developed. The objective of this analysis is to understand the parameters that affect the speed of the gate. We assume that the input has a step waveform. The delay $t_d$ is the time difference between the midpoint of the input swing and the midpoint of the swing of the output signal. Referring to Fig. 4.5,

• $t_{dr}$ is the 50% delay when the output is rising; and

• $t_{df}$ is the 50% delay when the output is falling.

The power dissipation during switching is considered in Section 4.3.

4.2.1 Analytic Delay Models

The load capacitance shown in Fig. 4.5 at the output of the CMOS inverter represents the total of the input capacitance of the driven gates, the parasitic capacitance at the output of the gate itself and the wiring capacitance. In Section 4.4, we discuss the estimation of this load capacitance. For simplicity we assume, for the 50% delay, that the MOS current is averaged and is equal to the saturation current. The equation of the saturation current used in this section is the one given by Equation (3.82) in Section 3.3.3. This saturation current is well modeled for short-channel devices.

4.2.1.1 Fall Delay

When the input goes from low (ground) to high ($V_{DD}$), the output is initially at $V_{DD}$ and the pull-down NMOS of Fig. 4.5 is in the saturation region. We assume that until the output falls to $V_{DD}/2$, the NMOS drain current is approximated by the saturation current $I_{DSn,sat}$. Referring to the equivalent circuit of Fig. 4.6(a), the delay is computed from the following differential equation

$C_L \frac{dV_o}{dt} = -I_{DSn,sat}$ (4.29)

where

$I_{DSn,sat} = K_n v_{sat} C_{ox} W_{eff,n} (V_{GSn} - V_{Tn})$ (4.30)

We assume that the factor $K_n$ does not change. By integrating Equation (4.29) from $t = t_1$, corresponding to $V_o = V_{DD}$, to $t = t_2$, corresponding to $V_o = V_{DD}/2$, and substituting (4.30) into (4.29) with $V_{GSn} = V_{DD}$, we obtain

$t_{df} = t_2 - t_1 = \frac{C_L V_{DD}}{2 K_n v_{sat} C_{ox} W_{eff,n} (V_{DD} - V_{Tn})}$ (4.31)

Note from this equation that the delay is inversely proportional to the width of the MOS transistor. So by sizing the gate we can reduce the delay of the gate alone.

4.2.1.2 Rise Delay

When the input goes from high ($V_{DD}$) to low (ground), the output is initially at zero. The pull-up PMOS transistor operates in the saturation region. Similarly, using the equivalent circuit of Fig. 4.6(b), the rise delay is given by

$t_{dr} = t_4 - t_3 = \frac{C_L V_{DD}}{2 |I_{DSp,sat}|}$ (4.32)

where $I_{DSp,sat}$ is the PMOS saturation current, of the same form as (4.30).

Figure 4.6 Equivalent circuits for the delay computation (labels: at $t = t_1$, $V_o = V_{DD}$; at $t = t_3$, $V_o = 0$; at $t = t_4$, $V_o = V_{DD}/2$)

From the above equation we can deduce that the rise delay is greater than the fall delay for equally sized MOS transistors. So $W_{eff,p}$ should be sized such that the two saturation currents are almost equal in order to get symmetrical rise and fall delays.

4.2.1.3 Delay Time

By definition, the delay time (sometimes called propagation delay) is given by

$t_d = \frac{1}{2}(t_{dr} + t_{df})$ (4.33)

Hence, for $V_{Tn} = -V_{Tp} = V_T$ the delay is given by

Or the equation can be written as

(4.35)

The constant is slightly affected by $V_{DD}$ through the parameter $K$. This equation shows a simple analytic expression for the delay time. We can observe that the delay is linearly proportional to the total load capacitance. Secondly, the delay increases when the power supply is scaled down. When $V_{DD}$ approaches the threshold voltage of the device, the delay increases drastically. If the threshold voltage is scaled down with the supply voltage and the oxide thickness is scaled down too, then the delay can improve with $V_{DD}$ scaling. From the CMOS circuit designer's point of view, the only parameters that can be controlled to optimize the speed of CMOS gates are:

• The width of the MOS transistor;

• The load capacitances (input of the next stage, wiring, etc.); and

• The supply voltage $V_{DD}$.

Fig. 4.7(a) shows the simulated effect of the power supply voltage on the delay of an inverter with a fanout of 3, using the device parameters given in Chapter 3. We buffer the input voltage with one inverter stage to obtain accurate results. The delay is almost stable at high $V_{DD}$; however, when $V_{DD}$ approaches the threshold voltage of the NMOS and PMOS devices, it increases drastically, as expected from Equation (4.35). Therefore, the threshold voltage should be reduced to overcome this problem. In Fig. 4.7(b), the delay of the inverter is plotted versus the ratio $V_T/V_{DD}$ at $V_{DD}$ = 2.5 V. For $V_T/V_{DD}$ > 0.5, the delay increases rapidly. In order to maintain improvement in circuit performance at reduced power supply voltage, $V_T/V_{DD}$ must be ≤ 0.2.
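The shape of this curve can be illustrated with a first-order model. The sketch below assumes the delay scales as $C_L V_{DD}/(V_{DD} - V_T)$, which follows if the drive current is taken as the velocity-saturated current of Equation (4.30) at $V_{GS} = V_{DD}$ with the factor $K$ held constant. It is not the simulated data of Fig. 4.7; the function name and the 3.3 V reference are illustrative choices.

```python
def relative_delay(vdd, vt, vdd_ref=3.3):
    """Delay at vdd normalized to the delay at vdd_ref.

    Assumes delay proportional to vdd / (vdd - vt), i.e. a drive
    current proportional to (vdd - vt) and a fixed load capacitance.
    """
    if vdd <= vt or vdd_ref <= vt:
        raise ValueError("supply must exceed the threshold voltage")
    return (vdd / (vdd - vt)) / (vdd_ref / (vdd_ref - vt))

if __name__ == "__main__":
    vt = 0.5
    for vdd in (3.3, 2.5, 1.5, 1.0, 0.7, 0.6):
        ratio = vt / vdd
        print(f"VDD = {vdd:.1f} V  VT/VDD = {ratio:.2f}  "
              f"relative delay = {relative_delay(vdd, vt):.2f}")
```

Under this assumption the delay changes little while $V_T/V_{DD}$ is small and grows without bound as the supply approaches the threshold voltage.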

4.2.2 Delay Characterization with SPICE

A data sheet for the delay of a cell (e.g., a CMOS inverter) can be easily prepared using SPICE. For example, the load capacitance or the fanout of a CMOS inverter is swept during the simulation, and a relation of the type $t_d = a + b \cdot C_L$ (or fanout) can be obtained. Fig. 4.8 shows the delay vs. the external load capacitance $C_L$. Other parameters can also be extracted.
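Once the sweep has been simulated, the coefficients $a$ and $b$ can be extracted with an ordinary least-squares line fit. The sketch below uses made-up sample points (not the data of Fig. 4.8) purely to show the procedure; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

if __name__ == "__main__":
    # Hypothetical sweep results: load in pF, delay in ns.
    loads_pf = [0.0, 0.05, 0.1, 0.2]
    delays_ns = [0.15, 0.275, 0.4, 0.65]
    a, b = fit_line(loads_pf, delays_ns)
    print(f"t_d = {a:.3f} ns + {b:.3f} ns/pF * C_L")
```

The intercept $a$ is the intrinsic (unloaded) delay of the cell and the slope $b$ is its sensitivity to external load, which is what a cell data sheet usually tabulates.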


4.3 POWER DISSIPATION

To minimize the power consumption of a CMOS circuit, the various power components and their effects must be identified. There are two types of power dissipation. One is the maximum power dissipation, which is related to the peak of the instantaneous current, and the other is the average power dissipation. The peak current has an effect on the supply voltage noise due to the power line resistance. It can also cause heating of the device, resulting in performance degradation. From the battery lifetime point of view, the average power dissipation is more important.

There are three power dissipation components within the CMOS inverter. These are:

1. Static power caused by the leakage current and other static current $I_{st}$ due to the value of the input voltage;

2. Dynamic power caused by the total output capacitance $C_L$; and

3. Dynamic power caused by the short-circuit current $I_{sc}$ during the switching transient.

Sometimes components (2) and (3) are merged as total dynamic power.

4.3.1 Static Power

This component is sometimes split into two components. The sources of static power dissipation in a complementary CMOS inverter are leakage currents ($P_{s1}$) and current drawn from the supply due to the input voltage ($P_{s2}$). Hence the total static power is given by

$P_s = P_{s1} + P_{s2}$ (4.36)

Leakage current consists of MOS junction leakage currents. Fig. 4.9 shows the parasitic diodes in a CMOS inverter. The body ties in this structure are such that the parasitic diodes are not conducting (i.e., reverse biased and/or at zero voltage). The current in a diode is given by

$I_d = I_s \left( \exp\frac{qV_d}{nkT} - 1 \right)$ (4.37)

where $n$ is the emission coefficient of the diode (sometimes equal to 1) and $V_d$ is the voltage applied to the diode. Note that the current parameter $I_s$ increases with temperature. The total power dissipation due to these leakage currents is given by

$P_{s1} = \sum_i I_{d,i} V_{DD}$ (4.38)

A typical value of this leakage current $I_d$ is 1 fA per device junction. This value is too small to have any effect on the static power: with one million devices, the total contribution to the power would be about 0.01 µW. This first component of the static power is neglected in the analysis throughout this book, except in Chapter 6 in the case of memory design.
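The order of magnitude can be checked with a one-line product. The sketch below assumes one leaking junction per device and a 5 V supply; both are illustrative assumptions, since the text does not state which supply or junction count underlies the quoted figure.

```python
def junction_leakage_power(n_junctions, leakage_per_junction, vdd):
    """Static power (W) from reverse-biased junction leakage, as in (4.38)
    with identical junctions: P_s1 = N * I_d * V_DD."""
    return n_junctions * leakage_per_junction * vdd

if __name__ == "__main__":
    power = junction_leakage_power(1_000_000, 1e-15, 5.0)
    print(f"P_s1 = {power * 1e6:.4f} uW")
```

The result is a few nanowatts, in line with the quoted order of 0.01 µW and negligible next to typical dynamic power.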

We now consider the second component of the static power, which is a function of the input voltage $V_{in}$. Assume that the input of the pull-down NMOS of the inverter is at a voltage $0 \le V_{in} < V_T$. In this case the current is given by the subthreshold expression (Fig. 4.10)

$I_{DS} = I_0 \frac{W_{eff}}{W_0} 10^{(V_{GS} - V_T)/S}$ (4.39)

where $V_T$ is the constant-current threshold voltage. For $V_{in} > V_T$ the current is given by the expressions discussed in Chapter 3. The corresponding static power dissipation is given by

$P_{s2} = I_{DS,mean} V_{DD}$ (4.40)

The mean value of the current is taken over both the PMOS and NMOS transistors. For example, if $V_{in}$ = 0, $V_T$ = 0.15 V, $W_{eff}$ = 10 µm and $S$ = 75 mV/decade, this current is 1 nA. For 1 million integrated devices, the total static power would be significant (1 mA of current). Note that this current increases drastically with increasing temperature [see Section 3.3.2]. Such a value, in standby mode, is not acceptable for battery-operated applications. CMOS circuits have been known to consume energy only during switching. But this is no longer true, since low-$V_T$ CMOS is used for low-voltage operation. Some CMOS circuits which exhibit a high DC current are discussed in Section 4.6.
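The example numbers follow from the decade-per-$S$ behavior of the subthreshold current. The sketch below assumes the form $I_{DS} = I_{VT} \cdot 10^{(V_{GS} - V_T)/S}$, where `current_at_vt` is the current at $V_{GS} = V_T$ for the given width. The value 100 nA is an assumption chosen so that the text's 1 nA result is reproduced; it is not stated in the source.

```python
def subthreshold_current(vgs, vt, swing, current_at_vt):
    """Subthreshold drain current (A).

    swing is the subthreshold swing S in volts per decade;
    current_at_vt is the drain current at vgs == vt for this device width.
    """
    return current_at_vt * 10.0 ** ((vgs - vt) / swing)

if __name__ == "__main__":
    # Assumed 100 nA at threshold for a 10 um wide device (illustrative).
    i_off = subthreshold_current(vgs=0.0, vt=0.15, swing=0.075,
                                 current_at_vt=100e-9)
    print(f"off current per device = {i_off * 1e9:.2f} nA")
    print(f"1 million devices      = {1_000_000 * i_off * 1e3:.2f} mA")
    # Lowering VT by one swing (75 mV) costs one decade of leakage.
    print(subthreshold_current(0.0, 0.075, 0.075, 100e-9) / i_off)
```

With $V_T$ two swings above $V_{GS}$, the current is two decades below its value at threshold; every further 75 mV reduction of $V_T$ multiplies the standby current by ten.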

4.3.2 Dynamic Power of the Output Load

In this section we estimate the power dissipation due to the total output load capacitance $C_L$. This power is due to the currents needed to charge and discharge $C_L$, as shown in Figs. 4.11 and 4.12. We assume a step input, so that the PMOS and NMOS are never on simultaneously. The average dynamic power $P_d$ required to charge and discharge a capacitance $C_L$ at a switching frequency $f = 1/T$ (Fig. 4.12) is given by

$P_d = \frac{1}{T} \int_0^T i_o(t)\, v_o(t)\, dt$ (4.41)

The output current is given during the charging phase by

$i_o = i_p = C_L \frac{dv_o}{dt}$ (4.42)

and during the discharge phase by

$i_o = i_n = -C_L \frac{dv_o}{dt}$ (4.43)

Then Equation (4.41) becomes

$P_d = \frac{1}{T} \left[ \int_0^{V_{DD}} C_L v_o\, dv_o - \int_{V_{DD}}^{0} C_L v_o\, dv_o \right]$ (4.44)

Finally the dynamic power dissipation is

$P_d = \frac{C_L V_{DD}^2}{T} = C_L V_{DD}^2 f$ (4.45)

This equation shows that the power dissipation is proportional to the operating frequency. Moreover, reducing the power supply drastically reduces the power dissipation. Ideally, a 3.3 V supply voltage reduces the power dissipation by 56% compared to that at 5 V. Moreover, at 1 V the power is reduced by 96% compared to 5 V. The expression for dynamic power in Equation (4.45) is valid only for an inverter. For a complex gate, the concept of switching activity is introduced [see Section 4.5.3].
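The quoted savings follow directly from the quadratic dependence on the supply voltage. This is a short sketch, assuming the dynamic power has the form $P_d = C_L V_{DD}^2 f$; the function names and the 100 fF, 50 MHz operating point are illustrative.

```python
def dynamic_power(c_load, vdd, freq):
    """Average power (W) to charge and discharge c_load once per cycle."""
    return c_load * vdd ** 2 * freq

def power_saving(vdd_new, vdd_old):
    """Fractional reduction in dynamic power when the supply is lowered,
    with load capacitance and frequency unchanged."""
    return 1.0 - (vdd_new / vdd_old) ** 2

if __name__ == "__main__":
    print(f"P_d at 5 V, 100 fF, 50 MHz: "
          f"{dynamic_power(100e-15, 5.0, 50e6) * 1e6:.1f} uW")
    print(f"saving 5 V -> 3.3 V: {power_saving(3.3, 5.0):.1%}")
    print(f"saving 5 V -> 1.0 V: {power_saving(1.0, 5.0):.1%}")
```

The savings are "ideal" in the sense used above: they ignore the frequency reduction that a lower supply may force through increased gate delay.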

During the first output transition (charging) from 0 to $V_{DD}$, the energy drawn from the power supply is $E_d = C_L V_{DD}^2$. For this transition, the energy stored in the load capacitor is

$E_c = \frac{1}{2} C_L V_{DD}^2$ (4.46)

This means that during the output transition from 0 to $V_{DD}$, half of the energy drawn from the supply is stored in the capacitor and the other half is consumed by the pull-up PMOS transistor. For the output transition from $V_{DD}$ to 0, the energy ($\frac{1}{2} C_L V_{DD}^2$) stored in the capacitor is consumed by the pull-down NMOS transistor and no current is drawn from the supply.

4.3.2.1 Energy vs. Power

It is important to distinguish between energy and power. If, for example, we reduce the clock rate ($f$) of a CMOS gate, its power consumption will be reduced by the same proportion. However, its energy will still be the same. Assume that the gate is powered by a battery to perform computations. The time required to complete the computation at a low clock rate will be increased. Therefore, after the computation the battery will be just as dead as if the computation had been performed at a high clock rate. So low-energy design is more important than low-power design. The figure of merit in this case can be defined as the product of energy and delay. The conventional term, low-power, is used throughout this book to mean that we design for low energy.
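The battery argument can be made concrete with a small calculation. The sketch assumes a task that needs a fixed number of switching cycles and a dynamic power of the form $C_L V_{DD}^2 f$; the cycle count, load and supply values are illustrative.

```python
def switching_power(c_load, vdd, freq):
    """Average dynamic power (W) at clock frequency freq."""
    return c_load * vdd ** 2 * freq

def task_energy(c_load, vdd, n_cycles):
    """Energy (J) to complete n_cycles switching cycles; independent of freq."""
    return n_cycles * c_load * vdd ** 2

if __name__ == "__main__":
    c_load, vdd, n_cycles = 100e-15, 3.3, 1_000_000
    for freq in (50e6, 25e6):
        power = switching_power(c_load, vdd, freq)
        duration = n_cycles / freq
        print(f"f = {freq / 1e6:.0f} MHz: P = {power * 1e6:.2f} uW, "
              f"time = {duration * 1e3:.1f} ms, "
              f"energy = {power * duration * 1e9:.2f} nJ")
```

Halving the clock halves the power but doubles the run time, so the energy taken from the battery for the task is unchanged.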

4.3.3 Short-circuit Power Dissipation

Even if there were no load capacitance on the output of the inverter and the parasitics were negligible, the gate would still dissipate switching energy. If the input changes slowly, both the NMOS and PMOS transistors are ON and excess power is dissipated due to the short-circuit current. Fig. 4.13 shows the short-circuit currents as the inverter switches, as a function of the rise time of the input. We are assuming that the rise time of the input is equal to the fall time. The short-circuit power is

$P_{sc} = I_{mean} V_{DD}$ (4.47)

Figure 4.13 Short-circuit current as a function of the input slope

To estimate $I_{mean}$, we use the simple model of the short-circuit current of Fig. 4.14 [5]. We also assume that the inverter has symmetrical devices, which means that $\beta_n = \beta_p = \beta$ and $V_{Tn} = -V_{Tp} = V_T$, and that the rise time is equal to the fall time of the input signal ($\tau_r = \tau_f = \tau$). The mean short-circuit current in the unloaded inverter is

$I_{mean} = 2 \times \frac{1}{T} \left[ \int_{t_1}^{t_2} i(t)\, dt + \int_{t_2}^{t_3} i(t)\, dt \right]$ (4.48)

Due to symmetry we have

$I_{mean} = \frac{4}{T} \int_{t_1}^{t_2} i(t)\, dt$ (4.49)

The NMOS transistor is operating in saturation, hence the above equation becomes

$I_{mean} = \frac{4}{T} \int_{t_1}^{t_2} \frac{\beta}{2} \left( V_{in}(t) - V_T \right)^2 dt$ (4.50)

The input voltage is given by

$V_{in}(t) = \frac{V_{DD}}{\tau} t$ (4.51)

It can be derived from Fig. 4.14 that

$t_1 = \frac{V_T}{V_{DD}} \tau$ and $t_2 = \frac{\tau}{2}$ (4.52)

Then the integral leads to

$P_{sc} = \frac{\beta}{12} (V_{DD} - 2V_T)^3 \frac{\tau}{T}$ (4.53)

Figure 4.14 Input voltage and short-circuit current model

This equation shows that the short-circuit power dissipation is also proportional to the frequency. The only parameters that can be controlled by the circuit designer, at a given frequency and power supply, to reduce $P_{sc}$ are β and τ. Power supply scaling greatly reduces the short-circuit power dissipation. Note that this analysis was done for an unloaded inverter. For a loaded gate, if the output signal and input signal have equal rise/fall times, the short-circuit power dissipation will be less than 20% of the total power [5]. So it is very important to keep the edges fast in order to have negligible $P_{sc}$, or at least it is desirable to have equal input and output rise/fall times.

If the load capacitance is high, the output rise/fall times become larger than the input ones. In this case, the input changes completely before the output changes significantly. Therefore, the short-circuit current is near zero. Note that if $V_{DD}$ approaches ($V_{Tn} + |V_{Tp}|$) or is less, the short-circuit current can be eliminated because both devices cannot conduct simultaneously.
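The closed form for the unloaded inverter can be cross-checked by integrating the current model numerically. The sketch assumes symmetrical devices, a linear input ramp $V_{in}(t) = V_{DD}\,t/\tau$, a saturated square-law current $\frac{\beta}{2}(V_{in} - V_T)^2$ between $t_1 = \tau V_T/V_{DD}$ and $t_2 = \tau/2$, and the resulting expression $P_{sc} = \frac{\beta}{12}(V_{DD} - 2V_T)^3\,\tau/T$. The function names and parameter values are illustrative.

```python
def short_circuit_power_closed(beta, vdd, vt, tau, period):
    """Closed-form short-circuit power (W) of an unloaded symmetric inverter."""
    if vdd <= 2.0 * vt:
        return 0.0  # both devices can never conduct at the same time
    return beta / 12.0 * (vdd - 2.0 * vt) ** 3 * tau / period

def short_circuit_power_numeric(beta, vdd, vt, tau, period, steps=10000):
    """Same quantity by midpoint integration of the saturated NMOS current."""
    if vdd <= 2.0 * vt:
        return 0.0
    t1 = vt / vdd * tau
    t2 = tau / 2.0
    dt = (t2 - t1) / steps
    charge = 0.0
    for k in range(steps):
        t = t1 + (k + 0.5) * dt
        vin = vdd * t / tau
        charge += 0.5 * beta * (vin - vt) ** 2 * dt
    mean_current = 4.0 / period * charge   # four identical segments per period
    return mean_current * vdd

if __name__ == "__main__":
    args = dict(beta=1e-4, vdd=3.3, vt=0.5, tau=1e-9, period=1e-8)
    print(short_circuit_power_closed(**args))
    print(short_circuit_power_numeric(**args))
    print(short_circuit_power_closed(1e-4, 1.0, 0.5, 1e-9, 1e-8))
```

The last call illustrates the limiting case in the text: once the supply is no larger than the sum of the threshold magnitudes, the model predicts no short-circuit power at all.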

4.3.4 Other Power Issues

The total power dissipation of a CMOS gate is given by

$P_{total} = P_s + P_d + P_{sc}$ (4.54)

It represents the total power of a gate when it is switching at the same rate as the operating frequency. In Chapter 8, we will discuss how to estimate the power dissipation of a complex circuit.

Other power dissipation issues exist, such as worst-case power estimation and temperature effects. The worst-case conditions are: maximum $V_{DD}$ and junction temperature, and fast-fast process. Static power dissipation (subthreshold current) is increased by increased temperature and increased power supply. Dynamic power is not sensitive to the temperature but it is affected greatly by the worst-case $V_{DD}$. Short-circuit power dissipation depends on the temperature just as the short-circuit current does. It is also dependent on the power supply. The mobility and the threshold voltage decrease with increasing temperature. Each of these two parameters has an opposite effect on the current. So it is important to consider the worst-case power consumption evaluation in any design.

The simulated average total power dissipation can be easily measured by the SPICE simulator using power measurement commands. However, several papers in the literature have introduced a "power meter" in circuit simulation to measure the power dissipation [6, 7, 8].

4.4 CAPACITANCE ESTIMATION

Previously we saw that the speed and power dissipation of CMOS gates depend strongly on the total output load capacitance. This capacitance is the sum of three components, as shown in Fig. 4.15:

• Total input capacitances of the $N$ driven gates, noted $C_{in}$;

• Parasitic output capacitance of the driving gate, noted $C_p$; and

• Wiring capacitance, noted $C_w$.

For simplicity we estimate, in this section, the average value of $C_L$ over the range of the output swing. This approach is used only for initial estimation of the design. More circuit simulation, layout extraction and post-layout simulation are needed for more accuracy. Moreover, it is sometimes interesting to derive a simple expression for the load capacitance to see the impact of important parameters on the speed and the power dissipation. We first examine the different components of the output load capacitance; then we illustrate the estimation approach with an example.

4.4.1 Estimation of $C_{in}$

The total capacitance of the driven gates can be evaluated by summing the input capacitances of all the receiving gates, and we have

$C_{in} = \sum_{i=1}^{N} C_{gate,i}$ (4.55)

The gate capacitance of the receiving gate can be approximated by

$C_{gate} = C_{ox} \sum_{i=1}^{n} (WL)_i$ (4.56)

where $n$ is the number of transistors of the gate. This expression sums the gate capacitances of all the transistors composing the driven circuit. For a CMOS inverter it is given by

$C_{gate} = C_{ox} (W_p L_p + W_n L_n)$ (4.57)


Figure 4.16 shows an example of the equivalent gate capacitance of the receiving gate. The driven inverter has the following drawn sizes: $W_p = W_n$ = 20 µm and $L$ = 0.8 µm. This gate can be replaced by an equivalent capacitance $C_{gate} \approx$ 50 fF, which is approximately the same as the one calculated from Equation (4.57).

4.4.2 Parasitic Capacitances

Fig. 4.17 shows the main contributions to the output parasitic capacitances of a CMOS inverter. Thus, it is estimated by

$C_p = C_{gdp} + C_{gdn} + C_{jp} + C_{jn}$ (4.58)

The drain overlap capacitance for NMOS and PMOS is given by

$C_{gd} = C_o W$ (4.59)

$C_o$ is defined in the SPICE parameters of Chapter 3 as CGDO. The drain junction capacitance is a function of the reverse applied voltage during the switching of the inverter. The average value of this capacitance over the range of the output swing is defined by

$C_j = C_{ja} A_D + C_{jsw} P_D$ (4.60)

where $A_D$ and $P_D$ are the area and the perimeter of the drain junction, as shown in Fig. 4.18. The average bottom junction capacitance is

(4.61)

The average side-wall capacitance is

(4.62)

4.4.3 Wiring Capacitance

The simple model of wiring capacitance is based on the parallel-plate model [Fig. 4.19], given by

$C_{wa} = \frac{\varepsilon_{ox}}{H}$ (4.63)

where $H$ is the thickness of the insulator layer (oxide), and $C_{wa}$ is the capacitance per unit area. The total capacitance of the wire is

$C_w = l\, W\, C_{wa}$ (4.64)

where $W$ is the width of the wire (metal or poly), and $l$ is the length of the wire. Table 4.1 gives some values of the wiring capacitance per area for the 0.8 µm process presented in Chapter 2. This capacitance cannot be known in the early design stage but can be known after layout extraction.

When the thickness of the insulator becomes comparable to that of the wire, $T$, the fringing fields at the edge of the wire become important. The effect of the fringing fields is manifested by an increase of the effective area of the plates [Fig. 4.19]. Many approximations have been proposed to compute the effect of fringing capacitance. One relatively accurate empirical approximation is given by [9]

$C_{wl} = \varepsilon_{ox} \left[ \left(\frac{W}{H}\right) + 0.77 + 1.06 \left(\frac{W}{H}\right)^{0.25} + 1.06 \left(\frac{T}{H}\right)^{0.5} \right]$ (4.65)

where $C_{wl}$ is the total capacitance of the wire per unit length. The contribution of the fringing effect is important in many cases. Table 4.2 shows the fringing capacitance per unit of length.

Table 4.1 Typical 0.8-µm CMOS wiring area capacitance

Layer                        Area capacitance (10^-3 fF/µm^2)
Metal2 to substrate          11
Metal2 to Metal1             25
Metal1 to substrate          19
Metal1 to poly               28
Metal1 to diffusion          27
Gate poly over field oxide   58

Table 4.2 Typical 0.8-µm CMOS wire fringing capacitance

Layer                        Perimeter capacitance (10^-3 fF/µm)
Metal2 to substrate          38
Metal2 to Metal1             47
Metal1 to substrate          44
Metal1 to poly               48
Metal1 to diffusion          47
Gate poly over field oxide   44
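To see how much the fringing terms add, the empirical expression can be evaluated next to the plain parallel-plate value. The sketch assumes the form $\varepsilon_{ox}[(W/H) + 0.77 + 1.06(W/H)^{0.25} + 1.06(T/H)^{0.5}]$ and a silicon dioxide relative permittivity of 3.9. The wire dimensions are hypothetical and do not correspond to the process of Tables 4.1 and 4.2.

```python
EPS_0 = 8.854e-3          # vacuum permittivity in fF/um
EPS_OX = 3.9 * EPS_0      # assumed SiO2 permittivity in fF/um

def wire_cap_per_length(width, thickness, height):
    """Total wire capacitance per unit length (fF/um), fringing included.

    width and thickness describe the wire; height is the insulator
    thickness under it. All dimensions in micrometers.
    """
    ratio = width / height
    return EPS_OX * (ratio + 0.77 + 1.06 * ratio ** 0.25
                     + 1.06 * (thickness / height) ** 0.5)

def parallel_plate_per_length(width, height):
    """Parallel-plate part only (fF/um)."""
    return EPS_OX * width / height

if __name__ == "__main__":
    for width in (1.0, 2.0, 10.0):
        total = wire_cap_per_length(width, thickness=1.0, height=1.0)
        plate = parallel_plate_per_length(width, height=1.0)
        print(f"W = {width:4.1f} um: total = {total:.3f} fF/um, "
              f"plate only = {plate:.3f} fF/um, "
              f"fringing share = {(total - plate) / total:.0%}")
```

For narrow wires the fringing share dominates, and it shrinks as the wire becomes wide compared to the insulator thickness.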

4.4.4 Example

Consider an inverter with $W_p = 2W_n$ = 20 µm with a 3 µm length for each drain and source. This inverter is driving a line of metal1 of 100 µm length by 2 µm width and an inverter with $W_p = 2W_n$ = 20 µm, operating at $V_{DD}$ = 3.3 V.

The total load capacitance is computed using the 0.8 µm device parameters presented in Chapter 3 as follows:

m The gate capacitance of the dzivcn inverter is

c, = [%L,+W"I;,IC, = [20 x 0.8 + 10 x 0.81 x 2 f F w 48fF . The total ovedap capacitance at the ontput is

c,, = CGD,W, + CODhiW"

Then

C,, = 20 x 215 x lo-'+ 10 x 214 x lo-' = 4.30 t 2.14 w 7 fF

rn The total drain junction capacitances can be approximated at mid- voltage of 1.65 V (1/2 of V D ~ ) instead of eompnting integrh. We have far one drain junction

The drain areas are 60 pma and 30 p d far PMOS and NMOS respectively. The drain perimeters are 46 p m and 26 pm for the PMOS and NMOS transistors respectively. The total junction capacitance can be easily calculated and is

Cj s 3 2 f F Note that this capacitance increaser with the power supply voltage reduction.

The wire capacitance is estimated by adding the two components psx- allel plate and fringing capacitances. The ares of the wire is 200 pm' while its perimeter is 204 pm. We have

m

c, = w x I x C W ( p e V m a ) + Z(W + i ) x C&r length) = =

200pm' x 19 Y lO-'fF/pm' + 204pm x 44 x 10-3fF/pm 3.8 + 9.0 c 13 f F

Note that the fringing capacitance is an important portion of the total wire capacitance.


Hence the total capacitance at the output is 100 fF. Note that the contribution of the junction capacitance is important. The contribution of each component varies from one circuit to another and depends on the layout style used. Before starting any circuit layout, it is important to keep in mind an estimate of capacitances such as the gate and output capacitance of a 1 unit size inverter and the wire capacitance of, for example, a 100 µm poly line and a 100 µm metal1 line. With these data, when starting the design, it is possible to size the different transistors correctly.
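The bookkeeping of this example is easy to script, which helps when the same estimate has to be repeated for other wire lengths or device widths. The sketch below simply re-adds the four components using the numbers quoted in the example; the junction capacitance is taken directly from the text rather than recomputed, and all variable names are illustrative.

```python
# Load capacitance estimate for the example of Section 4.4.4.
# All capacitances in fF, all dimensions in um.
C_OX = 2.0                      # gate capacitance per area, fF/um^2
L_GATE = 0.8                    # channel length
W_P, W_N = 20.0, 10.0           # widths of the driven (and driving) inverter
CGDO_P, CGDO_N = 0.215, 0.214   # overlap capacitance per width, fF/um
C_AREA_M1 = 19e-3               # metal1 to substrate, fF/um^2 (Table 4.1)
C_FRINGE_M1 = 44e-3             # metal1 to substrate, fF/um (Table 4.2)
WIRE_W, WIRE_L = 2.0, 100.0     # metal1 line
C_JUNCTION = 32.0               # quoted in the text, not recomputed here

c_gate = (W_P * L_GATE + W_N * L_GATE) * C_OX
c_overlap = CGDO_P * W_P + CGDO_N * W_N
c_wire_plate = WIRE_W * WIRE_L * C_AREA_M1
c_wire_fringe = 2 * (WIRE_W + WIRE_L) * C_FRINGE_M1
c_wire = c_wire_plate + c_wire_fringe
c_total = c_gate + c_overlap + C_JUNCTION + c_wire

for name, value in [("gate", c_gate), ("overlap", c_overlap),
                    ("junction", C_JUNCTION), ("wire", c_wire),
                    ("total", c_total)]:
    print(f"{name:9s}: {value:6.1f} fF")
```

Without the intermediate rounding used in the text, the total comes out slightly below 100 fF; the fringing part of the wire capacitance is more than twice the parallel-plate part.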

4.5 CMOS STATIC LOGIC DESIGN

From the CMOS inverter we can realize any static logic function by using complementary NMOS and PMOS transistors. In this section we present the design of NAND/NOR, complex and transmission gates. The fan-in of any complex gate is defined as the number of inputs of this gate. The fan-out of a complex logic gate is the number of driven inputs attached to the output of this gate.

4.5.1 NAND/NOR Gates

Fig. 4.20 shows a 2-input NAND gate (NAND2) and a 2-input NOR gate (NOR2). Each input requires a complementary pair. In the case of the NAND gate, the PMOS transistors are connected in parallel, while the NMOS transistors are connected in series. In the case of the NOR gate, the NMOS devices are connected in parallel, while the PMOS devices are connected in series. These gates consume only dynamic power while the DC power dissipation is zero (if the $V_T$'s are high) because there is no DC path between $V_{DD}$ and ground for any logic combination of the inputs. For the NAND and NOR gates of Fig. 4.20, for any input combination (AB = 00, 01, 11, and 10) there is no path between the two rails.

The design of these gates, or any CMOS static gate, follows that of an inverter. As discussed in Sections 4.1 and 4.2, an inverter is designed to meet a given DC and transient performance, then $(W/L)_n$ and $(W/L)_p$ are determined. The $(W/L)_n$ and $(W/L)_p$ of the devices of a logic gate are determined as follows. For example, we want to design a 3-input NAND (Fig. 4.21(a)) to have the same DC and transient performance as that of an inverter driving the same $C_L$ (Fig. 4.21(b)).


We assume that

$$W_n = W_{n1} = W_{n2} = W_{n3} \qquad (4.66)$$

and

$$W_p = W_{p1} = W_{p2} = W_{p3} \qquad (4.67)$$

The first thing to do is to approximate the gate by an equivalent inverter where the effective $\beta$ is given by

$$\frac{1}{\beta_{neff}} = \frac{1}{\beta_{n1}} + \frac{1}{\beta_{n2}} + \frac{1}{\beta_{n3}} = \frac{3}{\beta_n} \qquad (4.68)$$

and

$$\beta_{peff} = \beta_p \qquad (4.69)$$

To have the logic threshold of the gate midway of the power supply in the DC characteristics, the following condition should be satisfied for the 3-input NAND gate (see Equation 4.18)

$$\beta_{peff} = \beta_{neff} \qquad (4.70)$$

which means that

$$\beta_p = \frac{\beta_n}{3} \qquad (4.71)$$

To have the same delay as an inverter with determined sizes, we should have (assuming that $L$ is the same)

$$W_p = W_{peff} = W_{pi} \qquad (4.72)$$

and

$$\frac{W_n}{3} = W_{neff} = W_{ni} \qquad (4.73)$$

But in practice the size of these transistors, composing the 3-input NAND gate, should be increased because the output parasitic capacitance of the NAND gate (or any complex gate) is larger than that of the inverter. Hence

$$W_p > W_{pi} \qquad (4.74)$$

and

$$W_n > 3W_{ni} \qquad (4.75)$$

Note that by circuit simulation, we can properly size the transistors. Moreover, it should be noted that the back-gate bias effect has to be taken into consideration in the design of the series NMOS devices in a NAND gate (or series PMOS in a NOR). The series-connected MOSFETs, during switching, exhibit a threshold voltage increase due to a non-null source-substrate voltage as shown in the simulation example of Fig. 4.22. In Fig. 4.22(a), the transistor $N_1$ of the first NAND3 gate, near the output out1, is driven by the latest signal because $N_2$ and $N_3$ are already ON. Therefore, the node $a_1$ is at the ground level and the source of the transistor $N_1$ is not subject to the body effect. In the other NAND3 gate, the transistors $N_4$ and $N_5$ are ON, while $N_6$ receives the input signal. In this case, the nodes $a_2$ and $b_2$ are at a certain voltage level. Hence, during the discharging period the transistors $N_4$ and $N_5$ are subject to the body effect. This effect slows the discharge of the output as shown in Fig. 4.22(b). The output out1 is discharged more rapidly than the output out2. One way to reduce the body effect at the logic level is to put the transistor driven by the latest arriving signal near the output. The early arriving signals should be used to discharge the nodes susceptible to the body effect. For example, in an adder circuit, the transistor driven by the carry is placed near the output.

Let us derive the output parasitic capacitance of the m-input NAND gate and compare it to that of the CMOS inverter of Fig. 4.21(b). We have

$$C_o = m W_p C_{dp} + W_n C_{dn} + m C_{op} + C_{on} \qquad (4.76)$$

The $C_{on}$ of the m-input gate is larger than that of the CMOS inverter by the ratio $W_n/W_{ni}$. From the above equation it is obvious that $C_o$ of the m-input NAND gate is larger than that of the CMOS inverter.

Note that for the same performance and for the same number of inputs the NAND gate consumes less silicon area than a NOR gate because of the smaller area taken by the NMOS devices. Hence, CMOS NAND gates are more widely used than NOR gates. Moreover, the NOR gate consumes more power than the NAND gate.

4.5.2 Complex CMOS Logic Gates

The strategy used to build NAND/NOR gates can be extended to build more complex logic gates. Complex logic functions can be realized by connecting several NAND, NOR and INVERTER gates. However, they can also be efficiently realized using a single CMOS logic gate. Any complex CMOS gate is formed by two N and P logic blocks as shown in Fig. 4.23(a). The two blocks have the same number of transistors. Fig. 4.23(b) shows a three-input complex CMOS gate and its logic equivalent symbol. The topology of the block N is the dual of the block P, i.e., parallel connections become series and vice versa. In either the P or the N logic blocks, the parallel combination is placed far from the output to minimize the output capacitance and hence improve the speed and possibly the dynamic power dissipation. For example, the contribution of the N block to the output capacitance in Fig. 4.23(b) is less than that of Fig. 4.23(c). There is no direct DC path between $V_{DD}$ and ground for any of the logic input combinations. In practice, the complex CMOS gates are used for a maximum fan-in of 5-6.

Figure 4.23 CMOS complex logic gate.

4.5.3 Switching Activity Concept

So far, we have discussed the dynamic power dissipation of an inverter due to the load capacitance. What about a CMOS complex gate driving a load capacitance? The dynamic power dissipation has two components in a complex gate: the internal cell power, $P_{int}$, and the capacitive load power. The internal cell power consists of the power dissipated by the internal capacitive nodes. Sometimes the internal short-circuit power is added to the internal cell dynamic power.

The dynamic power for a complex gate cannot be estimated by the simple expression $C_L V_{DD}^2 f$, because the gate might not always switch when the clock is switching. The switching activity determines how often this switching occurs on a capacitive node. For $N$ periods of $0 \to V_{DD}$ and $V_{DD} \to 0$ transitions, the switching activity $\alpha$ determines how many $0 \to V_{DD}$² transitions occur at the output. In other words, the activity $\alpha$ represents the probability³ that a transition $0 \to V_{DD}$ will occur during the period $T = 1/f$, where $f$ is the periodicity of the inputs of the gate. The average dynamic power of a complex gate due to the output load capacitance is

$$P_d = \alpha\,C_L V_{DD}^2 f \qquad (4.77)$$

The internal power dissipation, due to the internal capacitive nodes, can be characterized by simulation. Fig. 4.24 illustrates an example of a complex gate with internal nodes. The internal dynamic power of a cell is given by

$$P_{int} = \sum_{i=1}^{n} \alpha_i\,C_i\,V_i\,V_{DD}\,f \qquad (4.78)$$

where $n$ is the number of internal nodes, $\alpha_i$ is the switching activity of each node $i$, $C_i$ is the parasitic capacitance of the internal node, and $V_i$ is the internal voltage swing of each node $i$. The parasitic capacitance at the output is included with the load $C_L$. Note that the internal voltage swing can be different from $V_{DD}$.
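Equations (4.77) and (4.78) translate directly into a few lines of code. The sketch below is only illustrative: the activity, capacitance, swing and frequency values are hypothetical and do not correspond to the gate of Fig. 4.24.

```python
def load_power(alpha, c_load, vdd, f):
    """Eq. (4.77): average dynamic power (W) due to the output load."""
    return alpha * c_load * vdd ** 2 * f

def internal_power(nodes, vdd, f):
    """Eq. (4.78): nodes is an iterable of (alpha_i, C_i, V_i) tuples."""
    return sum(alpha * c * v * vdd * f for alpha, c, v in nodes)

# Hypothetical operating point.
VDD = 3.3          # V
F = 50e6           # Hz
ALPHA_OUT = 3 / 16
C_LOAD = 100e-15   # F, includes the output parasitic capacitance

# Hypothetical internal nodes: (activity, capacitance in F, swing in V).
# The swing of an internal node can be lower than VDD.
INTERNAL_NODES = [(0.10, 10e-15, 2.5), (0.05, 8e-15, 2.5)]

p_load = load_power(ALPHA_OUT, C_LOAD, VDD, F)
p_int = internal_power(INTERNAL_NODES, VDD, F)
print(f"load power     : {p_load * 1e6:.2f} uW")
print(f"internal power : {p_int * 1e6:.2f} uW")
```

With an activity of 1 the first function reduces to the familiar $C_L V_{DD}^2 f$ of a gate that switches every cycle.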

4.5.4 Switching Activity of Static CMOS Gates

² During this transition the energy $C_L V_{DD}^2$ is drawn from the supply.
³ We assume that the gate does not experience glitching.

In this section we consider the computation of the switching activity of static CMOS gates. We will discuss the case of dynamic gates and other circuit styles in the next sections. First we consider the case of a NOR gate. Then we treat several static gates. Table 4.3 illustrates the truth table of the NOR gate. From the table, the probability that the output is at zero is 3/4 and that it is at one is 1/4. The probability for a $0 \to V_{DD}$ transition is computed by multiplying the probability that the output will be at zero, $P_0$, by the probability that it will be at one, $P_1$.

$$P_{NOR2} = P_0 \cdot P_1 = \frac{3}{4} \times \frac{1}{4} = \frac{3}{16} \qquad (4.79)$$

We assume that the inputs are uniformly distributed (i.e., the probabilities P(A=1) = P(B=1) = 1/2).

We show that for any Boolean function, the activity of a static gate is given by

$$\alpha = P(0 \to 1) = P_0 \cdot P_1 \qquad (4.80)$$

where $P_0$ is computed by dividing the number of zeros by the total number of input combinations ($N = 2^n$ for an n-input gate) and $P_1$ is computed by dividing the number of ones by $N$. $P_0$ is also equal to $(1 - P_1)$. Fig. 4.25 shows the probability that the output makes a $0 \to 1$ transition for several static gates. The probabilities of transition at the inputs are assumed uniformly distributed.
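Since Eq. (4.80) only requires counting zeros and ones in a truth table, it can be checked mechanically. The following sketch enumerates all input combinations of a gate, assuming uniformly distributed and uncorrelated inputs as in the text; the helper name `activity` is ours.

```python
from fractions import Fraction
from itertools import product

def activity(func, n):
    """Eq. (4.80): P(0 -> 1) = P0 * P1 for an n-input static gate
    with uniformly distributed, uncorrelated inputs."""
    outputs = [func(*bits) for bits in product((0, 1), repeat=n)]
    p1 = Fraction(sum(outputs), len(outputs))
    return (1 - p1) * p1

def nor2(a, b):
    return int(not (a or b))

def nand2(a, b):
    return int(not (a and b))

def xor2(a, b):
    return a ^ b

def and6(*bits):
    return int(all(bits))

print("NOR2 :", activity(nor2, 2))
print("NAND2:", activity(nand2, 2))
print("XOR2 :", activity(xor2, 2))
print("AND6 :", activity(and6, 6))
```

The NOR2 result reproduces the 3/16 of Eq. (4.79). Exact fractions are used so that the values can be compared directly with those of a hand calculation.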

Figure 4.25 Output activities for static logic gates with uniformly distributed inputs.

4.5.4.1 Example

As an example of a logic decision for low power, consider the different implementations of a 6-input AND gate driving a 0.1 pF load. As shown in Fig. 4.26, we may compare the following implementations:

- Implementation 1: a 6-input NAND and an inverter.
- Implementation 2: two 3-input NANDs and one 2-input NOR.
- Implementation 3: three 2-input NANDs and one 3-input NOR.

The library used for such a comparison is a high-performance standard cell library optimized for speed. Table 4.4 shows some characteristics of the library, where the average delay reported is the average value of the rise and fall delay times. $W_p = 2W_n = 10$ µm is set for all the transistors composing the different gates. The delay is a function of the output load capacitance⁴ $C_L$ in pF. The area is a function of a unit area called cell grid. Each unit area for a cell has a certain height and width. Also included in this table are the input capacitance of a gate and the output parasitic capacitance in fF. We make, for this example, the following assumptions:

- We neglect the wiring capacitance between the different cells; and
- We also neglect the internal power of each gate.

⁴ This capacitance does not include the output parasitic one.

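Table 4.4 Characteristics of the high-performance standard cell library.

| Gate type | Area (cell unit) | Output cap. (fF) | Input cap. (fF) | Average delay (ns) |
|---|---|---|---|---|
| INV | 2 | 85 | 48 | $0.22 + 1.00\,C_L$ |
| NAND2 | 3 | 105 | 48 | $0.30 + 1.24\,C_L$ |
| NAND3 | 4 | 132 | 48 | $0.37 + 1.50\,C_L$ |
| NAND6 | 7 | 200 | 48 | $0.65 + 2.30\,C_L$ |
| NOR2 | 3 | 101 | 48 | $0.27 + 1.50\,C_L$ |
| NOR3 | 4 | 117 | 48 | $0.31 + 2.00\,C_L$ |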

First we compare the delay and the area of the different implementations. Using the data of Table 4.4, the results are reported in Table 4.5. The delay may be computed or simulated by SPICE as illustrated in Table 4.5. Implementations 2 and 3 offer the best speed compared to the first one. However, they require more area.

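Table 4.5 Area and delay of the three implementations.

| | Implem. 1 | Implem. 2 | Implem. 3 |
|---|---|---|---|
| Area (cell unit) | 9 | 11 | 13 |
| Computed delay (ns) | 1.1 | 0.85 | 0.87 |
| SPICE delay (ns) | 1.1 | 0.86 | 0.83 |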
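The computed entries of Table 4.5 can be reproduced from the library data of Table 4.4. The sketch below assumes that the first stage of each implementation is loaded only by the input capacitance of the second stage (wiring neglected, as stated above) and that the second stage drives the 0.1 pF load; the function and dictionary names are illustrative.

```python
# Table 4.4: gate -> (area in cell units, input cap in pF,
#                     intrinsic delay in ns, delay slope in ns/pF)
LIB = {
    "INV":   (2, 0.048, 0.22, 1.00),
    "NAND2": (3, 0.048, 0.30, 1.24),
    "NAND3": (4, 0.048, 0.37, 1.50),
    "NAND6": (7, 0.048, 0.65, 2.30),
    "NOR2":  (3, 0.048, 0.27, 1.50),
    "NOR3":  (4, 0.048, 0.31, 2.00),
}
C_LOAD = 0.1  # pF, external load on the final gate

def gate_delay(gate, c_load):
    """Average delay (ns) of a gate driving c_load (pF)."""
    _, _, d0, slope = LIB[gate]
    return d0 + slope * c_load

def evaluate(first, count, second):
    """Area and path delay of `count` first-stage gates feeding one
    second-stage gate."""
    area = count * LIB[first][0] + LIB[second][0]
    delay = gate_delay(first, LIB[second][1]) + gate_delay(second, C_LOAD)
    return area, delay

RESULTS = {
    1: evaluate("NAND6", 1, "INV"),
    2: evaluate("NAND3", 2, "NOR2"),
    3: evaluate("NAND2", 3, "NOR3"),
}
for k, (area, delay) in RESULTS.items():
    print(f"Implementation {k}: area = {area} cells, delay = {delay:.2f} ns")
```

Under these assumptions the areas match Table 4.5 exactly and the delays agree with the computed row to within about 0.02 ns.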

Let us now compare the power dissipation using the power cost function. It is defined by

$$\text{Power cost} = \sum_i P_{0\to1,i}\,C_i \qquad (4.86)$$

where $P_{0\to1,i}$ is the probability of a $0 \to 1$ transition at each node $i$ and $C_i$ is the total capacitance at each node $i$. We assume that the inputs A, B, C, D, E, and F are uncorrelated and random (i.e., $P_1 = 0.5$). For the implementations of Fig. 4.26, we compute the transition probabilities. Table 4.6 summarizes the procedure of probability computation for the different nodes in the circuit.

Table 4.6 Computation of the node transition probabilities.

| Implementation | Node | $P_1$ | $P_0 = 1 - P_1$ | $P_{0\to1}$ |
|---|---|---|---|---|
| 1 | O1 | 63/64 | 1/64 | 63/4096 |
| 1 | Z | 1/64 | 63/64 | 63/4096 |
| 2 | O1 | 7/8 | 1/8 | 7/64 |
| 2 | O2 | 7/8 | 1/8 | 7/64 |
| 2 | Z | 1/64 | 63/64 | 63/4096 |

Note that the node O1, in implementation 1, has a lower switching activity compared to the other two. To compute the power cost function we will not include the primary inputs. Table 4.7 illustrates the results of this calculation. The results indicate that implementation 1 has the lowest power. So technology mapping is important for low-power applications.

We consider now another example using a low-area 0.8 µm CMOS standard cell library for the 6-input AND implementation. Some characteristics of this library are shown in Table 4.8. Compared to the library presented in Table 4.4, this library uses small transistors with $W_p = W_n = 4$ µm. Compared to the case of the high-performance library, the cell area unit, in the low-area case, is smaller by a factor of 1.5. Note that the delays of the different gates are higher. However, the input gate and output parasitic capacitances are lower. Thus, this library can be used for low-power function implementation.

Table 4.8 Characteristics of a low-area 0.8 µm CMOS library.

| Gate type | Area (cell unit) | Output cap. (fF) | Input cap. (fF) | Average delay (ns) |
|---|---|---|---|---|
| INV | 2 | 35 | 13 | $0.23 + 3.73\,C_L$ |
| NAND2 | 3 | 60 | 13 | $0.28 + 4.40\,C_L$ |
| NAND3 | 4 | 65 | 13 | $0.34 + 6.00\,C_L$ |
| NAND6 | 7 | 81 | 13 | $0.53 + 7.13\,C_L$ |
| NOR2 | 3 | 62 | 13 | $0.35 + 6.27\,C_L$ |
| NOR3 | 4 | 69 | 13 | $0.47 + 8.84\,C_L$ |

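Table 4.9 Power cost of the three implementations with the low-area library.

| | Implem. 1 | Implem. 2 | Implem. 3 |
|---|---|---|---|
| Power cost (fF) | 3.5 | 19.5 | 43.7 |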

The delays reported in Table 4.8 do not include the effect of the input voltage slope. The delay of the different implementations was simulated with SPICE and it is almost the same for all the configurations. The delay is ~1.5 ns. Using the same reasoning discussed earlier we can compute the power cost function using this library. The transition probabilities are the same; only the total node capacitances are different. The results of the power cost evaluation are illustrated in Table 4.9.

The power cost, in the case of the low-power library, is almost half of that of the high-performance one. Still, implementation 1 has a low-power characteristic while the speed is almost the same compared to the others. The area is also lower than that of the other implementations. This example shows that the power dissipation can be reduced at the gate level. Even if we take into account the wire capacitances between the cells, the conclusion is still valid. The topic of low power at the gate level is discussed more in Chapter 8. Keep in mind that in this comparison, the internal power of the gates has not been considered.
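The power cost of Eq. (4.86) can be evaluated for both libraries with a short script. The sketch below assumes that each internal node carries the output capacitance of its driving NAND plus the input capacitance of the following gate, and that the output node carries the output capacitance of the final gate plus the 100 fF load. Primary inputs are excluded, as in the text. The names are illustrative.

```python
from fractions import Fraction

C_LOAD = 100  # fF, external load

# gate -> (output parasitic cap, input cap), both in fF
HIGH_PERF = {"INV": (85, 48), "NAND2": (105, 48), "NAND3": (132, 48),
             "NAND6": (200, 48), "NOR2": (101, 48), "NOR3": (117, 48)}
LOW_AREA = {"INV": (35, 13), "NAND2": (60, 13), "NAND3": (65, 13),
            "NAND6": (81, 13), "NOR2": (62, 13), "NOR3": (69, 13)}

# implementation -> (first-stage NAND, how many, its fan-in, second stage)
IMPLEMENTATIONS = {1: ("NAND6", 1, 6, "INV"),
                   2: ("NAND3", 2, 3, "NOR2"),
                   3: ("NAND2", 3, 2, "NOR3")}

def power_cost(lib, first, count, fan_in, second):
    """Eq. (4.86) in fF for a two-level 6-input AND implementation."""
    # A NAND output is 0 only when all of its inputs are 1.
    p1_first = 1 - Fraction(1, 2 ** fan_in)
    a_first = p1_first * (1 - p1_first)
    # The 6-input AND output is 1 only when all six inputs are 1.
    p1_out = Fraction(1, 64)
    a_out = p1_out * (1 - p1_out)
    c_first = lib[first][0] + lib[second][1]
    c_out = lib[second][0] + C_LOAD
    return float(count * a_first * c_first + a_out * c_out)

COSTS = {name: {k: power_cost(lib, *impl)
                for k, impl in IMPLEMENTATIONS.items()}
         for name, lib in (("high-performance", HIGH_PERF),
                           ("low-area", LOW_AREA))}
for name, costs in COSTS.items():
    print(name, {k: round(v, 2) for k, v in costs.items()})
```

With the low-area library the results agree with the 3.5, 19.5 and 43.7 fF of Table 4.9 to within rounding, and the high-performance values come out roughly twice as large, in line with the remark above.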

4.5.5 Glitching Power

Note that in the probability discussed so far, we assumed that the gates had zero delay. In that case, we are not taking into account the glitches and we consider only the transitions between stable states. Glitches must be considered if we assume non-zero delay at gates. Thus the total dynamic power of a circuit is the sum of the dynamic power with zero delays and the glitching power. So what is the glitching phenomenon?

In a static logic gate, the output or internal nodes can switch before the correct logical value is stable. To illustrate this spurious transition, Fig. 4.27 shows an example of a circuit with a cascaded configuration. When the inputs ABC make the transition 100 → 111, the output, with zero delay gates, should stay high. However, considering a unit delay for each gate, the output O1 is delayed compared to the input C, hence causing the output Z to evaluate with the new value of C and the old value of O1. In that case, the output experiences a dynamic hazard (glitch). This transition increases the dynamic power of the circuit and adds a dynamic component to the switching activity.

Another example is shown in Fig. 4.28(a). The cascaded circuit exhibits a glitching problem. However, the same function can be implemented using a balanced delay implementation as shown in Fig. 4.28(b). These are some rules to avoid this problem:

- Balance delay paths, particularly on highly loaded nodes;
- Insert, if possible, buffers to equalize the fast path;
- Avoid if possible the cascaded implementation; and
- Redesign the logic when the power due to the glitches is an important component.
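The 100 → 111 hazard described above can be reproduced with a very small unit-delay simulation. Since the gate types are not stated in the text, the two-NAND cascade used here (O1 = NAND(A, B), Z = NAND(O1, C)) is an assumption, chosen only because it shows the described behaviour.

```python
def nand(x, y):
    return 1 - (x & y)

def simulate(before, after, steps=4):
    """Unit-delay simulation of O1 = NAND(A, B), Z = NAND(O1, C).
    Returns the list of (O1, Z) values, starting from the settled state."""
    a, b, c = before
    o1 = nand(a, b)
    z = nand(o1, c)
    trace = [(o1, z)]
    a, b, c = after  # all inputs switch at t = 0
    for _ in range(steps):
        # Each gate sees the node values of the previous time step.
        o1, z = nand(a, b), nand(o1, c)
        trace.append((o1, z))
    return trace

trace = simulate(before=(1, 0, 0), after=(1, 1, 1))
z_wave = [z for _, z in trace]
extra_rises = sum(1 for prev, cur in zip(z_wave, z_wave[1:])
                  if (prev, cur) == (0, 1))
print("Z waveform        :", z_wave)
print("0 -> 1 transitions:", extra_rises)
```

With zero-delay gates Z is high before and after the input change, so no transition would be counted. With unit delays the output dips low for one time step, and the extra 0 → 1 transition draws additional energy from the supply.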

4.5.6 Basic Physical Design

To implement simple gates, the physical layout should be performed. It is usually easy to draw a layout of a gate with well arranged transistors. For example, for the inverter, Fig. 4.29(a) shows a possible layout implementation. The metal1 is used for the power lines. Many variations can be drawn, depending on the use of the gate. Fig. 4.29(b) shows another layout variation of the inverter where metal2 is used as the power lines. For clarity the wells and body ties are not shown in these layouts.

Similarly, the schematic of NAND2 and NOR2 gates can be converted to layouts. Fig. 4.30(a) shows one possible layout of a two-input NAND gate. The layout can also be arranged to draw the input poly lines vertically. The layout artist should draw the gate taking into consideration the environment of this cell (the connectivity to others). Fig. 4.30(b) shows the layout of a two-input NOR gate. Note that the junction areas should be optimized during the layout to reduce the power dissipation and improve the speed of the cell. An implementation of a 2-input NOR gate with a high output drain junction capacitance is shown in Fig. 4.31.

To do a layout of a complex gate (i.e., several tens of transistors), the following general layout guidelines can be used:

- Set the sizing of the transistors composing the gate;
- Run $V_{DD}$ and $V_{SS}$ in metal (1 or 2) horizontally. For example, $V_{DD}$ at the top and $V_{SS}$ at the bottom of the cell in semi-rectangular form;
- Define the polysilicon gate line orientations and order them for maximum active area crossover to form the gate regions;
- Place the N-block (NMOS transistors) near $V_{SS}$ and the P-block (PMOS transistors) near $V_{DD}$. The PMOS devices should be located in a common N-well if they use the same bulk potential;
- Adhere to the design rules and use if possible an interactive DRC (Design Rule Checker);
- Keep the internal junction and wire capacitances to the minimum to minimize the power and the delay; and
- Complete the connection of different nodes inside the cell using the different layers available (metal1, poly, etc.).

Note that the power line widths are drawn taking into consideration the current consumed by the cell, because the electromigration phenomenon sets the minimum width of conductors.

For low-power design, these are some layout guidelines:

- Identify, in your circuit, the high switching activity nodes;
- Use for these high activity nodes low-capacitance layers such as metal1, metal2, etc.;
- Keep the wires of high activity nodes short;
- Use low-capacitance layers for high capacitive nodes and busses;
- For large width devices, use special layout, such as interdigitated fingers [3] and donut (round transistor), to achieve a low drain junction capacitance; and
- Design complex cells or blocks using, as much as possible, a custom approach.

4.5.7 Physical Design Methodologies

There are many layout methodologies to do the physical implementation of a complex circuit. The first methodology is called full-custom design, where the layout of each transistor is optimized. The layout of a complex block is performed by custom design for reasons of speed. However, this style leads to low design productivity and is rarely used in ASICs and digital processors. But when low power is an issue, the full-custom design can be used to minimize the power of the circuit.

Another design methodology is the standard-cell approach (or semi-custom design). That is, several gates and functions are created in the library such as:

- NAND, NOR, XOR, AOI, OAI, latches, buffers, multiplexers, full-adder, flip-flops, etc.;
- Linear cells: low-battery detector, power-up reset, etc.;
- MSI/LSI functions: ALU (Arithmetic and Logic Unit), counters, magnitude comparators, etc.;
- Compiled macrocells: register file, FIFO (First In First Out), ROM (Read Only Memory), parallel multiplier, etc.; and
- Macrocells: 8/16-bit microcontroller, 16-b fixed point DSP, UART (Universal Asynchronous Receiver/Transmitter), etc.

A circuit is designed by capturing the schematic or the functional model (VHDL, Verilog, etc.) of the cells. The layout is generated by automatic placement and routing. An example of a CMOS standard cell library can be found in [10]. In the standard cell approach, the logic cells have the same height and the width is variable. In many libraries, the cells are available in two layout styles. In the area-optimized style, the cells are made as small as possible. In the performance-optimized style, cells are optimized for high-speed performance and, as a result, occupy more area than the small cells. Even the height of the cells in the two styles is different. A typical standard cell layout for a NAND gate is shown in Fig. 4.32. This methodology provides lower cost and higher productivity than the full-custom one. For low-power applications, the small and large cells for the same function can be carefully chosen to optimize the power in a complex design without degrading the timing requirement.

The third layout methodology is the gate array. The gate arrays consist of pre-implemented cells and need only the personalization steps. Fig. 4.33 illustrates an example of a gate-array core using a Sea-Of-Gates structure. It consists of I/O and internal cell areas. The I/O cell area contains pads with input/output buffers. The internal cell array contains a continuous array of NMOS and PMOS transistors. Hence, the transistors and interconnects are already predefined. The design of a logic gate consists of wiring the different transistors using metallization and contacts. The isolation of a logic gate is performed by tying the polysilicon gates of the limiting transistors to $V_{SS}$ or $V_{DD}$ depending on the type of gate diffusion. Routing channels are routed over unused transistors. This methodology permits the reduction of the design cost at the expense of area, power and performance. One recent gate array architecture was based on multiplexers with small size transistors to maintain low-power characteristics [11].


Figure 4.32 An example of standard cell layout (NAND2).

Figure 4.33 (labels: I/O cell area; $V_{DD}$ (metal); P-diffusion; polysilicon gates; N-diffusion; $V_{SS}$ (metal)).

Comparing these layout approaches, the full-custom methodology offers the best approach to minimize the power dissipation. However, for a complex design, it is costly to use such a design strategy. The standard cells approach provides good performance and an improved design time. However, in many libraries the devices are oversized for performance purposes and consequently, the power dissipation would be high. To efficiently use the standard cells technique for low-power applications, the library should be expanded to include several versions of the same function with different driving capabilities. In that case, powerful synthesis tools are needed to optimize the power while maintaining the timing specifications. Moreover, both the standard cells and gate arrays styles require new place and route tools for low-power design.

Figure 4.34 (a) CMOS transmission gate; (b) and (c) schematic symbols.

4.5.8 Conventional CMOS Pass-Transistor Logic

Another alternative to CMOS static complementary logic is the conventional pass-transistor logic based on MOS switches. Fig. 4.34 shows a CMOS transmission gate (TG) as the primitive element. It consists of a complementary pair connected in parallel. It acts as a switch, with the logic variable A as the control input. If A is low, the gate is OFF and presents a high resistance between the terminals. If A is high, the gate is ON and acts as a switch with the on-resistances $R_n$ and $R_p$ in parallel. The equivalent resistance of the TG is $R_{TG} = R_n \parallel R_p$. This resistance is always less than the smaller of $R_n$ and $R_p$. This permits a fast switching characteristic. When the input I is at $V_{DD}$, the output F is quickly charged, initially by the NMOS, then at the end by the PMOS transistor, as illustrated by the equivalent resistances of Fig. 4.35. In this figure, we assume that at $V_{out} = 0$, A and $\bar{A}$ are set to their final values. During this transient switching phase the NMOS is subject to the body effect while the PMOS is not. When a zero, at the input I, is to be transmitted then the PMOS is subject to the body effect. The PMOS and NMOS transistors should be sized such that they charge and discharge the output symmetrically. If $V_{Tn} = |V_{Tp}|$ and the body effect is symmetrical then we can size the devices such that $\beta_n = \beta_p$. Sometimes, equal sized NMOS and PMOS devices can be used. It is easy to see that the delay of the TG gate is approximately independent of the input level. This is not the case if the pass-logic uses a single-channel transistor. A drawback of the CMOS TG is that it consumes more area than a single-channel transmission gate (NMOS TG or PMOS TG). Thus, if the area is of prime concern, NMOS TGs are used.

Any CMOS TG logic (we call it here conventional pass-transistor logic) function can be implemented using the TG primitive element described above. In such an implementation the transistor count, hence the silicon area, is low compared to a standard static CMOS implementation. This is highlighted in the implementation of such functions as multiplexing, demultiplexing, decoding and addition. Fig. 4.36 shows a 4:1 multiplexer, where the data lines A, B, C and D are controlled by $S_1$ and $S_2$ such that

$$F = A\,\bar{S}_1\bar{S}_2 + B\,S_1\bar{S}_2 + C\,\bar{S}_1 S_2 + D\,S_1 S_2 \qquad (4.87)$$
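A behavioural model makes the selection property of Eq. (4.87) explicit: for every value of the select lines exactly one product term is enabled, which corresponds to exactly one conducting TG path. In the sketch below the assignment of select codes to the data lines B and C is an assumption, and the function name is ours.

```python
from itertools import product

def tg_mux4(a, b, c, d, s1, s2):
    """Behavioural model of the 4:1 TG multiplexer of Eq. (4.87).
    The select-code assignment for B and C is assumed."""
    ns1, ns2 = 1 - s1, 1 - s2
    return (a & ns1 & ns2) | (b & s1 & ns2) | (c & ns1 & s2) | (d & s1 & s2)

def enabled_paths(s1, s2):
    """Number of product terms (conducting TG paths) for a select code."""
    ns1, ns2 = 1 - s1, 1 - s2
    return (ns1 & ns2) + (s1 & ns2) + (ns1 & s2) + (s1 & s2)

# Exactly one path conducts for each select code, so F is never floating
# and two data lines are never shorted together.
for s1, s2 in product((0, 1), repeat=2):
    print(f"S1={s1} S2={s2}: paths ON = {enabled_paths(s1, s2)}, "
          f"F(A=1,B=0,C=0,D=0) = {tg_mux4(1, 0, 0, 0, s1, s2)}")
```

The model captures only the logic function; the limited driving capability and the delay of long TG chains are electrical effects that it does not represent.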

This form of logic is used when the inputs and their logic complements are available. The implementation does not need $V_{DD}$ or ground lines. However, the implementation suffers from a number of drawbacks: the driving capability of the circuit is limited and the delay increases with long TG chains. Moreover, the circuit does not provide a restoration of the logic levels, i.e., the logic gates are passive with no gain elements. Fig. 4.37 shows an example of how to restore the voltage levels in chained TGs. When 8 TGs are put in series, the output signal changes very slowly. However, when an inverter stage is added every 4 TG stages, the level is restored as shown in the SPICE voltage waveforms of Fig. 4.37.

The CMOS TG logic can be used in CMOS circuit design, offering an extra degree of circuit design freedom. An example is the full-adder. The adder circuits will be discussed in detail in Chapter 7. Fig. 4.38 shows the schematic of the XOR gate which is used by the adder. When the input A is low, $\bar{A}$ is high. The transmission gate TG is closed, and the output is equal to B. When A is high, $\bar{A}$ is low. The inverter formed by the transistors N and P is enabled, and the output is equal to $\bar{B}$. The TG gate is open in this case. To implement an adder, let us first review its functions. The Boolean functions of a full-adder are:

$$S_{out} = A \oplus B \oplus C_{in} \qquad (4.88)$$

$$C_{out} = A \cdot B + C_{in}(A + B) \qquad (4.89)$$

A and B are the inputs, $C_{in}$ is the carry input, $S_{out}$ is the sum output, and $C_{out}$ is the carry output. The truth table of an adder is shown in Table 4.10.
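As a quick check of Eqs. (4.88) and (4.89), the two Boolean functions can be evaluated over all input combinations and compared with the arithmetic sum of the three input bits. This is a minimal sketch; the function name is ours.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Eqs. (4.88) and (4.89) evaluated on single bits."""
    s_out = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a | b))
    return s_out, c_out

print(" A B Cin | Sout Cout")
for a, b, c_in in product((0, 1), repeat=3):
    s_out, c_out = full_adder(a, b, c_in)
    # The pair (Cout, Sout) is the two-bit binary value of A + B + Cin.
    assert 2 * c_out + s_out == a + b + c_in
    print(f" {a} {b}  {c_in}  |  {s_out}    {c_out}")
```

The assertion inside the loop confirms that the carry and sum outputs together encode the number of inputs that are at one.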

The CMOS implementation of a one-bit full-adder is shown in Fig. 4.39(a). It requires 28 transistors and has two gate delays. In this circuit the transistors controlled by the carry signal $C_{in}$ should be placed close to the output. This will offset the body effect problem, since the carry is the latest arriving signal. An optimized implementation of the full-adder is shown in Fig. 4.39(b). It uses only 18 transistors and is based on the XOR function shown in Fig. 4.38 and the TG gates. Hence, this adder is more compact and faster and consumes less power than the complementary static one.

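Figure 4.38 TG XOR gate.

Table 4.10 Adder truth table.

| $A$ | $B$ | $C_{in}$ | $S_{out}$ | $C_{out}$ |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |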

4.5.9 CMOS Static Latch

Fig. 4.40 shows a cross-coupled CMOS static latch. In the storage mode (input LD = 0), when the node A is high, B is low, $P_1$ and $N_2$ are ON while $P_2$ and $N_1$ are OFF. Similarly, when A is low, B is high, $P_1$ and $N_2$ are OFF while $P_2$ and $N_1$ are ON. The standby power dissipation of the cell is very small. The state of the latch is changed by turning the two transmission gates ON (LD high) and applying the input and its complement.

Figure 4.40 CMOS cross-coupled static latch.

4.6 CMOS LOGIC STYLES

CMOS logic has been known to have a negligible static power dissipation, although this is valid only as long as $V_T$ is not too low. However, it has low speed and consumes a large area because, for n inputs, twice the number of transistors is required. As a result, it is sometimes desirable to have faster and smaller logic gates, possibly at the cost of parameters such as noise margins, power dissipation, etc. This section discusses many CMOS logic alternatives to complementary CMOS and also the clocking issues in a VLSI system.

4.6.1 Pseudo-NMOS CMOS Logic

The gate area of complementary CMOS can be reduced if CMOS circuits are designed in a way similar to the NMOS circuit family [12]. A PMOS device is used to replace the depletion-type device of the NMOS family. This type of circuit is referred to as pseudo-NMOS, as shown in the inverter of Fig. 4.41. When the input A is low, the output is high and at $V_{DD}$. When A goes to a high level, N turns ON while P is still ON. In this case, the output never reaches zero and takes a value $V_{OL}$ determined by the ratio $\beta_p/\beta_n$, and the logic is called ratioed. To examine $V_{OL}$, we use a simple analysis. When A is at $V_{DD}$, N is in the linear region while P is saturated. By equating the currents using simple models, we have


Thus VOL depends strongly on the ratio βp/βn. For example, if we need VOL = 0.04 VDD with VT = 0.2 VDD, then the ratio βp/βn should be no larger than about 0.1. If the NMOS transistor is minimum size, the PMOS should be weak to provide adequate noise margins (low VOL). In this case, the rise time of the gate is too slow. If we improve the rise time, the ratio condition tends to increase the gate area and hence the input capacitance.
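The numbers in this example can be checked with a short calculation. The sketch below is not the book's own equation; it assumes the simple square-law device models (NMOS in the linear region, PMOS saturated) and equal threshold magnitudes for both devices. The function name and the normalization VDD = 1 are illustrative choices.

```python
def beta_ratio_for_vol(vol, vdd, vtn, vtp_abs):
    """Return beta_p / beta_n that yields the output-low voltage `vol`.

    Assumed square-law current balance (input at VDD):
      beta_n * [(vdd - vtn) * vol - vol**2 / 2] = (beta_p / 2) * (vdd - vtp_abs)**2
    """
    nmos_term = (vdd - vtn) * vol - vol ** 2 / 2.0   # linear-region NMOS, per unit beta_n
    pmos_term = 0.5 * (vdd - vtp_abs) ** 2           # saturated PMOS, per unit beta_p
    return nmos_term / pmos_term


vdd = 1.0  # voltages normalized to VDD
ratio = beta_ratio_for_vol(vol=0.04 * vdd, vdd=vdd, vtn=0.2 * vdd, vtp_abs=0.2 * vdd)
print(f"beta_p / beta_n = {ratio:.4f}")
```

Under these assumptions the ratio comes out close to 0.1, which is why a minimum-size NMOS forces a weak (long-channel) PMOS pull-up.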

Although this circuit offers a reduction in total transistor count and ease of layout, it has the disadvantage of non-zero static power dissipation. Since the pull-up PMOS is always ON, a current flows from VDD to ground whenever the pull-down section of the pseudo-NMOS gate is turned ON. This current is the source of the static power dissipation. When a pseudo-NMOS gate, with its output at VOL, is driving another one, the driven gate, with its pull-down section OFF, leaks a high subthreshold current, but this current is still lower than the one flowing when the pull-down is ON. For an n-input pseudo-NMOS gate there are (n+1) transistors. Fig. 4.42 illustrates an example of a complex gate implemented in pseudo-NMOS style. This logic has been used in many applications such as decoding logic for memories and PLAs. Because of its high static power, it is not suitable for low-power applications.

4.6.2 Dynamic CMOS Logic

Figure 4.42 Pseudo-NMOS complex logic gate.

To reduce the area and improve the speed of CMOS circuits, another popular style called dynamic logic is used. Fig. 4.43 shows a dynamic CMOS gate. This logic is referred to as domino CMOS logic [13]. The domino gate shown in Fig. 4.43(a) consists of a dynamic CMOS circuit followed by a static CMOS buffer. The dynamic circuit consists of a PMOS precharge transistor P1, an evaluation NMOS transistor N1, a storage capacitor C, and an N-logic block, which is a series-parallel combination of NMOS transistors activated by the inputs and implementing the required logic. The storage capacitance represents the parasitic capacitance at node A.

This circuit uses a single clock phase clk. During the precharge phase (clk = 0), the storage capacitance is charged through the PMOS pull-up P1 to VDD and the inputs have no effect since there is no path to ground. The output of the buffer is precharged to ground. During the evaluation phase (clk = 1), N1 is ON and, depending on the logic performed by the N-logic block, the node A is either discharged or stays precharged.

Fig. 4.43(b) shows an example of a complex gate. In a cascaded set of domino logic stages, as shown in Fig. 4.44, the first stage evaluates and causes the next one to evaluate (like falling dominoes). The number of cascaded stages is limited by the evaluation clock phase.

Compared to pseudo-NMOS, domino logic has the same input capacitance and an improved rise time. However, the fall time is affected since there is one more transistor in the pull-down section. The gate is also suitable for high-fanout operation because of the CMOS buffer. Moreover, it is efficient in area for high fan-in because n + 4 transistors are required, compared to 2n for a static CMOS gate.
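The transistor counts quoted so far (2n for static CMOS, n + 1 for pseudo-NMOS, n + 4 for domino) can be tabulated to see where domino logic starts to pay off. This is only a counting sketch; the function name is illustrative and the counts ignore any keeper or extra precharge devices.

```python
def transistor_count(n_inputs):
    """Transistor count of an n-input gate in each logic style (counts from the text)."""
    return {
        "static CMOS": 2 * n_inputs,   # complementary N and P networks
        "pseudo-NMOS": n_inputs + 1,   # N network plus one always-ON PMOS load
        "domino": n_inputs + 4,        # N network, precharge, evaluate, 2-transistor buffer
    }


for n in (2, 4, 8):
    print(n, transistor_count(n))
```

Domino uses fewer transistors than static CMOS only when n + 4 < 2n, that is, for gates with more than four inputs.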

Figure 4.44 Domino logic chain.

Some limitations of the gate are:

The domino gate has a problem called charge sharing, or charge redistribution. Fig. 4.45 gives an example to explain this problem. During the precharge, the node A is at VDD and a charge C VDD is stored on the capacitance C. We assume (worst case) that the parasitic capacitances of nodes B and C, C1 and C2 respectively, hold zero charge. During the evaluation, the node A should stay at VDD; however, due to C1 and C2, charge sharing takes place. Using the charge conservation principle before and after redistribution, we have

\[ C\,V_{DD} = (C + C_1 + C_2)\,V_A \qquad (4.92) \]

Hence the final voltage of node A is

\[ V_A = \frac{C}{C + C_1 + C_2}\,V_{DD} \qquad (4.93) \]

If, for example, C1 = C2 = 0.5C, then this voltage would be VDD/2. This voltage can alter the logic state and cause the CMOS buffer to dissipate high static power.

Figure 4.45 Charge sharing in dynamic CMOS logic.

If the clock frequency is too low, the node A leaks the charge stored on C due to the leakage currents. The dynamic node can leak its charge in a time of a few hundred μs to a few ms, depending on the temperature, the storage capacitance and the leakage current. When using power-down techniques, the dynamic nodes should not be left floating for a long time. If the leakage is high, as with low-VT devices, the charge can be depleted in a time as low as 100 ns. This problem is similar to charge sharing. Fig. 4.46 shows two alternatives to solve the problems of charge sharing and leakage. In Fig. 4.46(a), a weak PMOS (low W/L) is added as a pull-up transistor. This circuit operates like pseudo-NMOS during the evaluation phase; hence it consumes some static power. If the circuit operates at high frequency, the added weak PMOS has no role because it does not have enough time to operate. Note that this weak PMOS increases the output capacitance and therefore slows the dynamic gate. To eliminate the DC path during evaluation, the gate of the weak PMOS can be driven from the output of the CMOS buffer, as shown in Fig. 4.46(b). This circuit adds another capacitance at the output of the inverter. A third alternative circuit, which solves only the problem of charge sharing, is shown in Fig. 4.47. In this circuit configuration, intermediate nodes of the complex gate are precharged with additional precharge PMOS devices.

Another limitation of the domino logic gate is that it implements only noninverting logic functions. However, this is not a serious limitation and can be overcome, if the need arises, by using static CMOS gates. The designer can mix both static and dynamic CMOS logic circuits in a given design to optimize the overall performance.
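The charge-sharing relation of Eqs. (4.92) and (4.93) is easy to evaluate numerically. The following is a minimal sketch: the function name is illustrative, the parasitic nodes are assumed to start fully discharged (the worst case used in the text), and the 3.3 V supply is only an example value.

```python
def charge_shared_voltage(c_store, parasitics, vdd):
    """Final voltage of the dynamic node after charge redistribution.

    c_store    : storage capacitance C, precharged to vdd
    parasitics : iterable of internal-node capacitances, assumed initially at 0 V
    """
    # Charge conservation: C * VDD = (C + sum(Ci)) * VA
    return c_store * vdd / (c_store + sum(parasitics))


# Example from the text: C1 = C2 = 0.5 C gives VA = VDD / 2.
va = charge_shared_voltage(c_store=1.0, parasitics=[0.5, 0.5], vdd=3.3)
print(f"VA = {va:.2f} V")
```

A node that drops to half the supply sits near the switching point of the output buffer, which is why the text warns of both a logic error and static current in the buffer.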



Historically, dynamic design styles were devised for their low-power characteristics, owing to the reduced device count. Moreover, dynamic gates do not experience the short-circuit power dissipation and glitching problems found in static circuits. However, to drive the clocked transistors, a huge clock distribution network is needed. This highly loaded network consumes a significant amount of dynamic power, particularly at high frequency of operation. The switching activities of dynamic gates are higher than those of static gates. In a dynamic gate the output makes a 0 → 1 transition during the precharge cycle only if the N-block discharged the output during the evaluation phase. Hence, the probability of a 0 → 1 transition is given by

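\[ P_{0 \to 1} = P_0 \qquad (4.94) \]

where P0 is the probability that the output is "0". For a two-input NAND dynamic gate, the output is zero for only one of the 4 input states. So,

\[ P_{0 \to 1} = P_0 = \frac{1}{2^2} = \frac{1}{4} \qquad (4.95) \]

For a NOR2 gate, the output is zero for three of the 4 input states, so we have

\[ P_{0 \to 1} = P_0 = \frac{3}{4} \qquad (4.96) \]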
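As a numerical check of these probabilities, the sketch below enumerates all input states of a gate, assuming equiprobable and independent inputs. The comparison with a static gate uses the usual estimate P0 · (1 − P0), which is an assumption added here for contrast and is not taken from this section; all function names are illustrative.

```python
from itertools import product


def p_zero(func, n_inputs):
    """Probability that the gate output is 0 for equiprobable, independent inputs."""
    states = list(product((0, 1), repeat=n_inputs))
    zeros = sum(1 for s in states if func(*s) == 0)
    return zeros / len(states)


def dynamic_p01(func, n_inputs):
    """Dynamic gate: every evaluated 0 is followed by a precharge, so P(0->1) = P0."""
    return p_zero(func, n_inputs)


def static_p01(func, n_inputs):
    """Static gate (assumed model): P(0->1) = P0 * P1 for uncorrelated cycles."""
    p0 = p_zero(func, n_inputs)
    return p0 * (1.0 - p0)


def nand2(a, b):
    return 1 - (a & b)


def nor2(a, b):
    return 1 - (a | b)


for name, gate in (("NAND2", nand2), ("NOR2", nor2)):
    print(name, dynamic_p01(gate, 2), static_p01(gate, 2))
```

For both gates the dynamic activity (1/4 and 3/4) exceeds the static estimate (3/16), in line with the statement that dynamic gates switch more often.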

Another refinement of the domino CMOS logic is shown in Fig. 4.48 [14], where the CMOS buffer is removed. N and P logic blocks are alternated and each drives the other. When clk is low (0), the first and third stages are precharged high and the second stage is precharged low.

Fig. 4.49 shows another NP domino logic called NORA (No Race) [15]. Two sections, clk and clk-bar (driven by the complement of clk), are shown in Fig. 4.49. Each is constructed by cascading N and P blocks followed by a C2MOS (clocked CMOS) latch. CMOS buffers (inverters) are used to provide logic inversion. When clk = 1 (evaluation phase in the clk section), the C2MOS latch³ operates like an inverter. When clk = 0, the latch moves into the hold state because the output NMOS and PMOS transistors are OFF. In this case, the old data is latched at the output. This latch is used to avoid signal races. A NORA pipeline is shown in Fig. 4.50; it consists of alternating clk and clk-bar sections. Signal races do not occur in this structure because of the use of C2MOS latches. Another logic style has been proposed to overcome charge sharing by using additional clocking signals. It is called Zipper CMOS logic. For more details refer to [16].

³ See the example of the DEC Alpha chip in Section 4.8.4.

Figure 4.48 NP domino logic.

An example of a pipelined full-adder (FA) NORA circuit is shown in Fig. 4.51. This cell can be used in many designs such as a pipelined multiplier. The output C2MOS latches can use only three transistors rather than four: one clocked transistor can be removed from each output latch. The reason is that during the precharge phase (clk = 0), the output nodes A and B are set to ground and VDD respectively. Thus, one of the data-driven transistors in each latch is already turned OFF. Hence, the clocked transistor in series with it can be removed, and the FA cell is still isolated from the other sections during precharge.

4.6.3 Design Style Comparison

If we compare the design styles discussed above, static CMOS logic is the slowest circuit, but its power efficiency is the best, particularly if minimum-size devices are used. Hence, it is suitable for low-power, medium-speed applications. Note that static CMOS logic occupies the largest chip area because complementary functions are needed. The circuit designer can include pass-transistor logic in static logic to improve the speed and area. The pseudo-NMOS logic style can be faster than static CMOS logic; however, its rise time is long, being limited by the weak pull-up needed to meet the low output logic level. Moreover, the most serious drawback of pseudo-NMOS logic is the high power dissipation in the standby mode. N-P domino logic is fast because it has a small input capacitance, like pseudo-NMOS logic, and an improved rise time. The power dissipated by this logic is high due to the high switching activity of the clock, even if the circuit is not used. However, power-down techniques can be used to control the clock of the logic. Using this style requires the designer to spend more design effort than with the static style in order to solve all the problems of dynamic logic, such as charge sharing, clock skew, precharging, etc. Finally, we note that pass-transistor logic is very promising for high-performance low-voltage low-power applications.

[Figure 4.49, panel (a): NORA clk-section]

Figure 4.50 NORA pipelined logic.

Figure 4.51 Pipelined full-adder NORA circuit.


Figure 4.52 Clock skew.

4.6.4 Clock Skew in Dynamic Logic

Clock skew is a critical design parameter in high-speed circuits. Fig. 4.52 shows the clock skew in single complementary-phase clock signals. If clk-bar (the complement of clk) is generated from clk, clock skew is possible. The time skew is measured between the half-VDD points of the clk and clk-bar signals. In the presence of clock skew, a glitch can be transmitted from one section to another, as illustrated in the example of Fig. 4.53(b). This structure contains one stage between the two C2MOS latches, and a glitch can be transmitted to the last C2MOS latch. The example of Fig. 4.53(c) does not have this problem. It has been shown that, to eliminate the signal race in N-P domino logic, an even number of inversions should be used between stages [17]. Moreover, the clock skew should be minimized to improve the speed of dynamic circuits. One possible solution for single complementary-phase clock generation, with minimal skew and insensitive to process variations, is the one shown in Fig. 4.54 [18]. The delays from the input clock to clk and from the input clock to clk-bar are equalized with special buffer sizing.


4.7 CLOCKING

One way to synchronize thousands of signals in a VLSI system is to employ a clocking strategy. The clock controls the flow of data in the digital system and reduces the complexity of design.

Figure 4.55 Clocked pipeline system.

Most VLSI processors are constructed using a set of functional blocks (ALU, shifter, register file, etc.) connected via pipeline registers, as shown in the example of Fig. 4.55. The clock signal can be split into one, two, three or four phases. Typically the phases are non-overlapping.

First we present the different storage elements (latches, registers), then we treat two clocking strategies, single-phase and two-phase, with emphasis on the former, which is usually the main option available in standard-cell and gate-array approaches. The clock distribution issues are discussed in Section 4.9.4.


4.7.1 Storage Elements

There are many types of storage elements. Some of the ones used in VLSI design are the following:

4.7.1.1 D-Latch

This is sometimes called a level-sensitive latch. Its operation is shown in Fig. 4.56. The output changes with the input when the clock is high (case of a positive level-sensitive latch). The D input must be stable within a time window around the positive transition of the clock (Fig. 4.57). The input data is passed to the output within a delay t_d. The time window is defined by two times, called the setup time t_s and the hold time t_h. The setup time, t_s, is the time needed for the D input to be stable prior to the clock edge. More specifically, it is the delay between the input of the latch and the storage node. The hold time, t_h, is the time needed for the D input to be stable after the clock edge. This time relates to the delay between the clock input and the storage point.

There are a variety of implementations for this D-latch. Fig. 4.58 reviews some of the static versions. The circuit of Fig. 4.58(a) has a weak inverter used as a feedback path for the latch mode. The voltage at node A is not changed by noise or leakage because the feedback inverter keeps the level. The feedback inverter should have a low W/L for both NMOS and PMOS (weak inverter) compared to the transmission gate and the forward inverter. This ensures that the transmission gate is capable of overdriving the feedback inverter when data is being written to the latch. The feedback inverter should be carefully sized to guarantee switching for all process corners and the maximum fanout condition.


The problem of ratioed design in Fig. 4.58(a) can be avoided by using the modified version in Fig. 4.58(b), where a transmission gate is added in the feedback path. When clk = 1, the data is passed to the storage node and the feedback path is disconnected. When clk = 0, the feedback loop is closed, and the latch is in store (latch) mode. Fig. 4.58(c) shows another version of Fig. 4.58(b), where the outputs are buffered. This latter latch is found in the cell libraries of standard-cell and gate-array approaches. All the static latches described here store their state even if the clock is stopped. Note that these latches do not dissipate any DC power.

To reduce the size of the static latches, dynamic versions can be used, as illustrated in Fig. 4.59, Fig. 4.60 and Fig. 4.61. Fig. 4.59 shows a simple dynamic latch, where the storage node A temporarily stores the data. Note that latches have a property called "transparency": the output follows the input when the clock is asserted. Otherwise they are "opaque". Fig. 4.60 shows two other latches [19]. The circuit of Fig. 4.60(a) is transparent when the clock clk is high and latches the data (opaque) when the clock is low. This latch is positive level-sensitive. The negative level-sensitive latch is shown in Fig. 4.60(b). Note that these latches use one clock line (clk).

Figure 4.59 Simple dynamic CMOS single-clock latch.

The circuits of Fig. 4.60 have reduced noise immunity. For example, for the circuit of Fig. 4.60(a), when the latch is opaque (clk = 0), the node A may be tristated high with Q tristated low. The node A is isolated and may be susceptible to noise which reduces its voltage. The reduced voltage of node A can cause the PMOS P2 to leak current, thereby destroying the output Q. This problem was addressed with latches designed for the DEC Alpha microprocessor [21]. For example, the circuit of Fig. 4.61 is an improved version of the Yuan and Svensson latch [19]. A weak PMOS device P3 is added to solve the noise problem in the positive level-sensitive latch. The operation of this latch is as follows. When clk is high, P1, N1 and N3 function like an inverter. P2, N2 and N4 also function like an inverter. Therefore the latch passes the input D to the output Q. If D falls to low, then A is high and Q is low. When clk is low, N3 and N4 are OFF. If D goes high, P1 is OFF, while the nodes A and Q are tristated high and low respectively. The added P3, in this case, is ON and holds P2 OFF. This device supplies current to node A and counters any noise.

Figure 4.61 Non-inverting dynamic latch with improved noise immunity.

For reliability reasons many latches have been designed for the DEC Alpha chip [21]. Some are illustrated in Fig. 4.62. These latches have been designed for all process corners and circuit conditions (supply voltage, temperature, rise/fall times, etc.). The results showed no appreciable evidence of race-through for clk rise/fall times at or below 0.8 ns. With 1-ns rise/fall times, the latches showed some signs of failure. A limit of 0.5 ns for rise/fall times was set for the clock in this chip.

4.7.1.2 Edge-Triggered D-Flip-Flop (ETDFF)

Sometimes this flip-flop is called an edge-triggered register. Fig. 4.63 shows a static (buffered) version of the positive edge-triggered D flip-flop, and the voltage waveforms. It is constructed by using two latches. The first one, called the master, is transparent when the clock is low (negative level-sensitive). The second one, called the slave, is transparent when the clock is high (positive level-sensitive). When the clock is low, the storage node A follows the input, while the node B stores the old data and is disconnected. Then, when the clock makes a transition from 0 to 1, the node A stores the input value present during the transition, then ceases to sample any input data. When clk = 1, the master is in the hold mode and the node A passes the data to the storage node B of the slave latch, from which it is passed to the outputs Q and Q-bar. In this case, the output is disconnected from the input D. Hence, the flip-flop does not have the transparency property of the latch. When the clock returns back to 0, the slave is in hold mode. By reversing the two latches, a negative edge-triggered flip-flop can be constructed. This circuit can be found in standard-cell and gate-array libraries and represents an important cell in synchronous design. At high operating frequency, it is desirable to balance the delays of clk and clk-bar locally, to reduce the clock skew problem. Clock skew in a single-phase strategy can lead to invalid data storage.
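The master-slave operation just described can be mimicked with a small behavioral model. This is only a logic-level sketch, not a model of the circuit of Fig. 4.63: it follows the node-level description (node A follows D while the clock is low, node B follows A while the clock is high), and the class names, sample waveforms and ideal zero-delay behavior are all assumptions.

```python
class DLatch:
    """Level-sensitive latch: transparent while clk equals `transparent_level`."""

    def __init__(self, transparent_level):
        self.transparent_level = transparent_level
        self.q = 0  # stored state

    def update(self, clk, d):
        if clk == self.transparent_level:
            self.q = d  # transparent: output follows input
        return self.q   # otherwise opaque: hold the stored value


class PosEdgeDFF:
    """Positive edge-triggered flip-flop built from two latches."""

    def __init__(self):
        self.master = DLatch(transparent_level=0)  # node A follows D while clk = 0
        self.slave = DLatch(transparent_level=1)   # node B follows A while clk = 1

    def update(self, clk, d):
        a = self.master.update(clk, d)
        return self.slave.update(clk, a)


def simulate(clk_seq, d_seq):
    ff = PosEdgeDFF()
    return [ff.update(c, d) for c, d in zip(clk_seq, d_seq)]


clk = [0, 0, 1, 1, 0, 0, 1, 1]
d = [1, 1, 1, 0, 0, 0, 0, 1]
q = simulate(clk, d)
print(q)  # [0, 0, 1, 1, 1, 1, 0, 0]
```

Note that D changes while clk = 1 (fourth and last samples) without affecting Q, which is the non-transparency property stated in the text.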

A dynamic version of the positive ETDFF is shown in Fig. 4.64 [19]. The operation of this circuit is illustrated by the voltage waveforms. The value of the hold time of this flip-flop is close to zero [20]. This dynamic flip-flop, compared to the static one, needs only 9 transistors and one clock line. The negative ETDFF is shown in Fig. 4.65.

4.7.1.3 Miscellaneous

Many other latches and flip-flops are available, for example in gate-array libraries, such as the JK flip-flop and the toggle (T) flip-flop. Fig. 4.66 shows the T flip-flop with reset control. When clk = 1, the output Q is complemented, whereas when clk = 0, Q keeps its old state. This T flip-flop provides a divide-by-2 operation. A JK flip-flop is shown in Fig. 4.67. When the J and K inputs are low, the outputs are maintained on the positive edge of the clock. If J = 0 and K = 1, the output Q is set to 0, whereas when J = 1 and K = 0, the output Q is set to 1. When both J and K are high, the outputs are complemented.
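The JK behavior described above amounts to a four-entry next-state table, sketched below at the logic level. The function name is illustrative; one call corresponds to one active clock edge, and timing is ignored.

```python
def jk_next(q, j, k):
    """Next state of a JK flip-flop after one active clock edge."""
    if j == 0 and k == 0:
        return q        # hold
    if j == 0 and k == 1:
        return 0        # reset
    if j == 1 and k == 0:
        return 1        # set
    return 1 - q        # J = K = 1: toggle


# With J = K = 1 the flip-flop toggles on every edge, i.e. it divides the
# clock frequency by 2, like the T flip-flop.
q = 0
outputs = []
for _ in range(6):
    q = jk_next(q, 1, 1)
    outputs.append(q)
print(outputs)  # [1, 0, 1, 0, 1, 0]
```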

4.7.2 Single-Phase Clocking

A generic single-phase finite-state machine (FSM) is shown in Fig. 4.68. The storage element can be either a latch or a register (flip-flop). The latch case demands a more constrained design because of the transparency property of the latch. When the latch is transparent, the state signals can pass through the logic block more than once during one clock cycle. To avoid a race condition in this FSM, the clock width (of transparency) has to satisfy a two-sided constraint [22]. Hence, single-phase clocking with latches, in the case of an FSM, is insidiously complex.

To reduce the complexity of the timing constraints, single-phase ETDFFs can be used. The flip-flop is never transparent. At the clock edge, the state is stored and it cannot pass through the logic more than once during one clock cycle. Designing and synchronizing VLSI circuits with ETDFFs is rather simple and straightforward, particularly when using static flip-flops.

For high-speed CMOS applications it is necessary that the storage elements be carefully designed with minimum delay, setup time and clock skew. In this case, tristate dynamic latches can be used efficiently. Fig. 4.69 shows an example of using dynamic latches [21]. Notice that L1 and L2 are transparent latches separated by random logic and are not simultaneously active. When clk is high, L1 is transparent, whereas when clk is low, L2 is transparent. The minimum number of logic gates between latches can be zero and the maximum is constrained by the cycle time.

Figure 4.67 JK flip-flop.


Fig. 4.70 shows another example of a single-phase system, using ETDFFs. This system is edge-based and the minimum cycle time is given by [22]

\[ t_{cycle,min} = t_{ff,max} + t_{logic,max} + t_{setup,max} + t_{skew,max} \qquad (4.97) \]

where t_{ff,max}, t_{logic,max}, t_{setup,max} and t_{skew,max} are the worst-case values of the flip-flop delay, the combinational logic block delay, the setup time and the clock skew. When designing with gate-array and/or standard-cell approaches, the single-phase clocking scheme using static ETDFFs is the only option available to the designer.
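Eq. (4.97) is a plain sum of worst-case terms, so the maximum clock frequency follows directly from it. The sketch below uses made-up delay values in nanoseconds purely for illustration; the function name and the numbers are assumptions, not data from the text.

```python
def min_cycle_time(t_ff_max, t_logic_max, t_setup_max, t_skew_max):
    """Minimum clock period of an edge-triggered single-phase system, Eq. (4.97)."""
    return t_ff_max + t_logic_max + t_setup_max + t_skew_max


# Hypothetical worst-case delays, in ns.
t_cycle_ns = min_cycle_time(t_ff_max=0.5, t_logic_max=3.0, t_setup_max=0.3, t_skew_max=0.2)
f_max_mhz = 1e3 / t_cycle_ns  # 1 / (t in ns) expressed in MHz

print(f"t_cycle,min = {t_cycle_ns:.2f} ns, f_max = {f_max_mhz:.0f} MHz")
```

With these example numbers the period is about 4 ns (roughly 250 MHz); the skew term eats directly into the time available for logic.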

4.7.3 Two-Phase Clocking

A two-phase non-overlapping clocking strategy removes many constraints existing in the single-phase discipline. However, the use of two-phase (or multiple-phase) non-overlapping clock structures becomes more difficult as clock frequencies and chip size increase. This is because of the increase in clock skew and clock interconnect wiring. For high-speed applications, the single-phase strategy is preferred and tends to be widely used in many VLSI system designs.

Fig. 4.71 shows an example of a two-phase non-overlapping clocking scheme. The first latch L1 is transparent when the clock clk1 is high, whereas L2 is transparent when clk2 is high. The example of Fig. 4.71 is not the only way to build a two-phase system. Latches can be replaced by two-phase master-slave flip-flops where the master latch is clocked by clk1 and the slave latch by clk2. This latter structure does not have the transparency property.


4.8 PASS-TRANSISTOR LOGIC FAMILIES

Several pass-transistor logic families for logic circuit design have been proposed for improving the speed of CMOS circuits. Such families are: the conventional CMOS pass-transistor logic, the Complementary Pass-transistor Logic (CPL) [23], the Dual Pass-transistor Logic (DPL) [24], and the Swing Restored Pass-transistor Logic (SRPL) [25]. In this section, CPL, DPL, and SRPL logics are presented and compared.

4.8.1 CPL

The main concept behind CPL is shown in the block diagram of Fig. 4.72. It consists of an NMOS pass-transistor logic network driven by two sets of complementary inputs, and two CMOS inverters used as buffers.

Fig. 4.73 illustrates an example of an AND/NAND gate built in CPL logic. At the node Q, for example, we have

\[ Q = A \cdot B + \overline{B} \cdot B = A \cdot B \qquad (4.98) \]

At the output of the corresponding inverter we have the NAND function. The NMOS pass-transistor logic network forms the pull-up and pull-down functions. When the inputs (A B) have the combination (11), the voltage of the node Q is given by

\[ V_Q = V_{DD} - V_{Tn}(V_Q) \qquad (4.99) \]

Figure 4.72 Basic CPL logic circuit.

where VTn is the threshold voltage subject to the body effect. So the inverting buffers translate the swing of the output from (ground to VDD - VTn) to a full-rail logic swing (ground to VDD). The logic threshold voltage of the inverting buffers should be shifted to a voltage lower than VDD/2. Hence the β ratio of the inverter in this case should be higher than unity. This inverting buffer also permits large load capacitances to be driven efficiently. When the outputs of the logic networks are at VDD - VTn, all the output inverters are driven by a reduced swing, as shown in Fig. 4.74. Hence, the DC power of the inverter increases because the pull-up PMOS device is not completely OFF. The VGS of the pull-up PMOS is equal to -VTn. Moreover, the drive capability of the pull-down NMOS transistor is reduced, particularly if the power supply voltage is reduced. The noise margins are also affected. To solve the problem of DC power dissipation we can design the NMOS transistors with a lower VT than that of the PMOS transistor. Also, the body effect should be controlled. Another way to solve all the problems associated with the reduced high level is to add a PMOS latch to the CPL gate, as shown for the AND/NAND circuit of Fig. 4.75. In this case, the two added PMOS transistors can be sized to be minimum, as long as the high level reaches VDD in the given cycle time. We call this style PMOS latch CPL. Careful design is required when the NMOS network has minimum-size devices; otherwise the high level stored in the latch cannot be discharged.

Fig. 4.76 shows examples of CPL arrays for OR/NOR and XOR/XNOR functions. With only 4 transistors we can produce many two-input functions together with their complements. More examples are shown in Fig. 4.77 for 3-input AND/NAND and OR/NOR gates. In these examples 8 NMOS transistors are needed to generate the 3-input functions. Any complex logic function can be constructed easily using this principle of NMOS network transistors. For example, the full-adder circuit can be constructed using wired CPL as shown in Fig. 4.78. The circuit is constructed using the basic CPL primitives discussed before.
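At the logic level, each CPL network node behaves like a two-way multiplexer: one NMOS passes a signal when the control input is high and the other passes a second signal when the control is low. The sketch below models this for the two-input gates. The AND wiring follows Eq. (4.98); the OR and XOR wirings are the usual CPL arrangements and are assumed here rather than read from Fig. 4.76. Voltage levels (the VDD - VTn drop) and the output inverters are deliberately ignored, and all function names are illustrative.

```python
def pass_network(select, when_high, when_low):
    """Two NMOS pass transistors: one gated by `select`, one by its complement."""
    return when_high if select else when_low


def cpl_and_nand(a, b):
    q = pass_network(b, a, b)               # Q  = B.A + B'.B  = A.B
    qb = pass_network(b, 1 - a, 1 - b)      # Q' = B.A' + B'.B' = (A.B)'
    return q, qb


def cpl_or_nor(a, b):
    q = pass_network(b, b, a)               # Q  = B.B + B'.A  = A + B
    qb = pass_network(b, 1 - b, 1 - a)      # Q' = (A + B)'
    return q, qb


def cpl_xor_xnor(a, b):
    q = pass_network(b, 1 - a, a)           # Q  = B.A' + B'.A
    qb = pass_network(b, a, 1 - a)          # Q' = B.A + B'.A'
    return q, qb


for a in (0, 1):
    for b in (0, 1):
        print(a, b, cpl_and_nand(a, b), cpl_or_nor(a, b), cpl_xor_xnor(a, b))
```

Each function uses two pass transistors per output node, four in total, which matches the transistor count quoted for the two-input CPL gates.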

Figure 4.76 CPL OR/NOR and XOR/XNOR.

Figure 4.77 CPL 3-input: (a) AND/NAND; (b) OR/NOR logic arrays.

The sizes of the transistors, chosen for fast operation, are also shown in this figure. The transistors of the NMOS network far from the output have a larger size than those closer to the output. This is because the NMOS devices closer to the output pass a reduced swing. The sizing of the transistors depends on the circuit type, layout and device parameters. Compared to a full-adder implemented in standard static CMOS style, the adder of Fig. 4.78 is much faster and dissipates less power due to the low internal swing. Also, the schematic of this CPL adder is regular, resulting in a simplified layout.

One drawback associated with CPL logic is its limited driving capability, and the delay increases with long pass-transistor chains. So buffering is needed to restore the transmitted level and improve the driving capability.

4.8.2 DPL

The DPL is a modified version of CPL suitable for low-voltage applications. It alleviates the problems of CPL associated with the reduced high level. An example of an AND/NAND gate is illustrated in the schematic of Fig. 4.79. It consists of NMOS and PMOS pass transistors, in contrast to the CPL gate, where only NMOS devices are used. In the example of the AND/NAND gate, the NMOS transistors are used to pass the ground level while the PMOS transistors are used to pass the high level (VDD). The output of the DPL gate has a full rail-to-rail swing owing to the addition of the PMOS devices. However, this addition results in increased input capacitance compared to CPL. This will not limit the performance of DPL, as will be explained.

Figure 4.79 DPL AND/NAND gate.

Fig. 4.80 shows a comparison between the switching characteristics of CPL, conventional pass-transistor CMOS and DPL XOR gates. In the truth tables, the column labeled "Pass" shows which signals are passed to perform the XOR function. DPL has the following features:

The DPL gate has a balanced input capacitance. This reduces the dependence of the delay on the input data, contrary to CPL and conventional CMOS pass-transistor logic, where the input capacitances for the signals A and B are not the same.

In DPL, for any input combination, there are always two current paths driving the output. This compensates for any reduction in speed due to the additional PMOS. For example, when the inputs A and B are low, A is passed by a PMOS while B is passed by an NMOS.

A DPL full-adder implementation is shown in Fig. 4.81. When all the inputs A, B and C are low, for example, there are two current paths to the output buffer. This implementation uses DPL primitives such as AND/NAND, OR/NOR, XOR/XNOR and MUX to generate the carry and sum signals.

Figure 4.80 Comparison of CPL, conventional CMOS TG and DPL pass-transistor logics for XOR gate.

4.8.3 Modified CPL

Another technique which uses a CPL-like style suitable for low-power/low-voltage operation is the Swing Restored Pass-transistor Logic (SRPL) [25]. Figure 4.82 shows the basis of the SRPL logic gate. One part is the NMOS network with the CPL style discussed previously and the second part is a CMOS latch. The cross-coupled CMOS inverters (latch) restore the logic levels. So, any logic function in SRPL can be implemented using a CPL network and a CMOS latch at the output. The sizing of such a logic is critical for speed and power dissipation. Fig. 4.83 shows an example of an AND/NAND gate using SRPL. Increasing the size of the NMOS transistors in the network, W_network, improves the speed, as shown in the simulation curves of Fig. 4.84. It has been found that the size of the latch should be minimum for fast operation, using the 0.8 μm device parameters of Chapter 3. If the size of the NMOS transistors in the network is small, the output of the SRPL gate fails to switch to ground because the equivalent impedance of the network is then too high compared to the one seen by the output to VDD. This problem becomes worse when many gates are cascaded. Fig. 4.85 illustrates this problem for 2 cascaded AND/NAND gates. When the input goes from VDD to ground, the nodes A and B, initially at VDD, cannot be completely discharged.

Figure 4.81 DPL full-adder.



4.8.4 Pass-Transistor Logics Comparison

The speed and power dissipation of the different pass-logic styles presented so far depend on the circuit type and the application of the circuit (cascaded gates, driving a fixed load, etc.). For the case of a full-adder used in a multiplier array, a comparison is given in Chapter 7. In general, SRPL has the lowest power dissipation, but careful design is needed when small device sizes are used. The DPL consumes more power than SRPL and PMOS latch CPL because of its higher transistor count. Both CPL and SRPL circuits have the smallest area and the fastest speed. In summary, CPL-like styles are promising for low-power and high-speed applications.


4.9 I/O CIRCUITS

I/O circuits connect the on-chip logic circuitry to the external world. They play an important role in limiting the speed and power dissipation of the whole chip. In this section many I/O circuits are discussed, such as input and output buffers, clock distribution, clock buffering and low-swing I/O. The power dissipation issues related to these circuits are also studied. Layout techniques for I/O circuits are not covered in this chapter.

4.9.1 Input Circuits

To distribute an input signal to the internal circuitry of a chip, an input buffer is needed. It has its gate connected to the input pad. Excessive electrostatic charge on the input pad can break down the oxide and destroy the transistors of the input buffer. For an oxide thickness of 100 Å, the breakdown voltage is ≈ 7 V. The voltage built up on the gate by the electrostatic charge can be as high as 300 V [26]. Fig. 4.86 shows an example of electrostatic discharge protection. If the voltage at the node N goes above VDD or below ground, then the clamping diodes D1 and D2 limit the voltage excursion of the node N to within one diode drop of the rails, i.e. between -VBE and VDD + VBE. The role of the resistance R is to limit the peak current that flows in the diodes. Typical values of R are a few hundred Ω, and the resistor is realized using the diffusion layers. The input protection circuit has a parasitic RC time constant which can limit high-speed operation. It ranges from a few tens of ps to a few hundreds of ps.

The input buffer, connected to this input pad, consists in general of a number of inverter stages to drive the internal circuitry. The input buffer for clock distribution needs special care and design and is discussed in Section 4.9.4.

4.9.1.1 Static Power Dissipation

When the input signal has TTL (Transistor-Transistor Logic) levels, the conventional CMOS buffer is used to translate these levels to CMOS levels. The TTL interface has historically specified input voltage levels of 0.8 V for the low-level input maximum, and 2.0 V for the high-level input minimum. The recently passed 3.3 V "Low-Voltage TTL (LVTTL)" standard is shown in Table 4.11.

The individual input inverters are designed by setting their W/L ratio such that the switching point of the buffer is near 1.4 V (middle of $V_{IL}$ and $V_{IH}$). To have this switching point of 1.4 V at 5 V power supply voltage, the ratio $W_n/W_p$ of the input inverter of the buffer should be 2.9 using 0.8 µm CMOS technology. At 3.3 V, this ratio should only be equal to 0.7. However, since the TTL voltage swing is limited to 1.2 V, the input buffer is always dissipating

Table 4.11 LVTTL standard (column headings: minimum high output, minimum high input, maximum low output, maximum low input).

Figure 4.87 TTL input buffer.

DC power, as shown in Fig. 4.87, particularly if the $V_T$ of the devices is low. If the first inverter does not fully translate the input TTL levels then the second stage dissipates some DC power. The static power dissipated by a TTL input buffer is

$$P_{TTL} = V_{DD}\, I_{DTTL} \qquad (4.100)$$

where

$$I_{DTTL} = I_{DTTLL} + I_{DTTLH} \qquad (4.101)$$

$I_{DTTL}$ is the average dissipated current for the cases when the input is at low and high levels. At $V_{DD}$ = 3.3 V, the input buffer dissipates more static power when the input is high than when it is low. Fig. 4.88 shows the characteristics of the static power dissipation of the input buffer. Note that when $V_{DD}$ is scaled down the DC current is reduced because the $V_{GS}$ of the pull-up PMOS of the input buffer is reduced. If the number of TTL input pads is large, then the DC power of the input buffers could be an important and limiting factor. A static power-saving input buffer for reducing $I_{DTTL}$ for 5 V power supply voltage has been proposed in [21].

Figure 4.88 Simulated static power dissipation of input buffer.

4.9.1.2 Dynamic Power Dissipation

The dynamic power dissipation of the input pad is mainly internal power. The total dynamic power of all the input pads (of the same type, for example) is

$$P_I = A\, N_i\, E_{ii}\, f \qquad (4.102)$$

where A is the switching activity, $N_i$ the number of input pads and $E_{ii}$ the internal energy of the input pad in Watt/Hz.

When the input signal has ECL levels, then an ECL input buffer with an ECL-to-CMOS converter is used. In general they are implemented in BiCMOS technology and consume DC power. An ECL-CMOS converter can be designed in full CMOS ps].

4.9.2 Schmitt Trigger

When the input signal to a chip is changing slowly, a hysteresis circuit is needed at the input pad to generate a clean edge. A circuit called a Schmitt trigger can be used for this function. Schmitt triggers are often found at the on-chip inputs.

Fig. 4.89 illustrates the transfer characteristic of an ideal Schmitt inverter with hysteresis voltage $V_H = V_{T+} - V_{T-}$. For a 3.3 V power supply, with 3.6 V for the fast process and 3.0 V for the slow process, typical values are $V_{T+}$ = 1.7 V and $V_{T-}$ = 1.0 V. The Schmitt circuit switches at different thresholds. When the input is rising, it switches when $V_{in} = V_{T+}$, and when the input is falling, it switches when $V_{in} = V_{T-}$. Fig. 4.90 shows an example of how the Schmitt trigger turns a signal with a very slow transition into a signal with a sharp transition.
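The two-threshold behaviour described above can be sketched as a small behavioural model. This is only an illustration, not a circuit simulation: the function name `schmitt_inverter`, the triangular test input and the choice of initial state are assumptions, while the thresholds are the typical values quoted in the text.

```python
def schmitt_inverter(samples, vt_plus=1.7, vt_minus=1.0, vdd=3.3):
    """Behavioural model of an ideal inverting Schmitt trigger.

    The output goes low only when the input rises to vt_plus and goes
    high again only when the input falls to vt_minus.
    """
    out = []
    # Assumed initial state: output high if the first sample is below vt_plus.
    state_high = samples[0] < vt_plus
    for v in samples:
        if state_high and v >= vt_plus:
            state_high = False
        elif not state_high and v <= vt_minus:
            state_high = True
        out.append(vdd if state_high else 0.0)
    return out


# Slow triangular input: 0 V -> 3.3 V -> 0 V in 0.1 V steps (hypothetical stimulus).
ramp_up = [mv / 1000 for mv in range(0, 3400, 100)]
ramp = ramp_up + ramp_up[::-1]
vout = schmitt_inverter(ramp)

# Count output transitions: one on the rising ramp, one on the falling ramp.
transitions = sum(1 for a, b in zip(vout, vout[1:]) if a != b)
```

Because the rising and falling thresholds differ, the output changes exactly once per ramp even though the input crosses the 1.0 V to 1.7 V band slowly.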

A CMOS version of the Schmitt trigger is shown in Fig. 4.91. When the input is rising, initially the NMOS transistors are OFF. The $V_{GS}$ of the transistor $N_2$ is given by

$$V_{GS2} = V_{in} - V_{FN} \qquad (4.103)$$

Figure 4.91 The CMOS Schmitt trigger schematic.

When $V_{in} = V_{T+}$, $N_2$ enters conduction mode, which means $V_{GS2} = V_{Tn}$; then⁷

$$V_{FN} = V_{T+} - V_{Tn} \qquad (4.104)$$

⁷We neglect the body effect of $N_2$.

The voltage $V_{FN}$ is controlled by $N_1$ and $N_3$. These transistors operate in saturation because

$$V_{GS1} = V_{T+} \qquad (4.105)$$

$$V_{DS1} = V_{FN} = V_{T+} - V_{Tn} \qquad (4.106)$$

and

$$V_{GS3} = V_{DD} - V_{FN} \qquad (4.107)$$

$$V_{DS3} = V_{DD} - V_{FN} \qquad (4.108)$$

The drain currents flowing in $N_1$ and $N_3$ are equal. Then using a simple MOS model we have

$$\frac{\beta_1}{2}\,(V_{T+} - V_{Tn})^2 = \frac{\beta_3}{2}\,(V_{DD} - V_{T+})^2 \qquad (4.109)$$

We have

where

(4.110)

(4.111)

This equation shows that the trigger point is independent of the process parameters except for $V_{Tn}$. By symmetry, the trigger point for the falling transition can also be deduced from the pull-up section. We have

where

(4.112)

(4.113)

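If $\beta_1 = \beta_3$, with the same matching in the pull-up section, and $V_{Tn} = -V_{Tp} = V_T$, then

$$V_{T+} = \frac{V_{DD}}{2} + \frac{V_T}{2} \qquad (4.114)$$

$$V_{T-} = \frac{V_{DD}}{2} - \frac{V_T}{2} \qquad (4.115)$$

$$V_H = V_{T+} - V_{T-} = V_T \qquad (4.116)$$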
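The closed-form trigger points follow from (4.109) by taking the square root of both sides. The sketch below works through that algebra. The ratio symbols $R_n$ and $R_p$ and the pull-up gain factors $\beta_{P1}$ and $\beta_{P3}$ are names assumed here for illustration and may differ from the notation of (4.110)-(4.113); the pull-up result is obtained by assuming a circuit symmetric to the pull-down section.

```latex
\begin{align*}
\sqrt{\beta_1}\,(V_{T+} - V_{Tn}) &= \sqrt{\beta_3}\,(V_{DD} - V_{T+}) \\
V_{T+} &= \frac{V_{DD} + R_n V_{Tn}}{1 + R_n},
  \qquad R_n = \sqrt{\beta_1/\beta_3} \\
V_{T-} &= \frac{R_p\,(V_{DD} - |V_{Tp}|)}{1 + R_p},
  \qquad R_p = \sqrt{\beta_{P1}/\beta_{P3}} \quad \text{(assumed, by symmetry)} \\
R_n = R_p = 1,\ V_{Tn} = |V_{Tp}| = V_T:\qquad
V_{T+} &= \frac{V_{DD} + V_T}{2}, \quad
V_{T-} = \frac{V_{DD} - V_T}{2}, \quad
V_H = V_T
\end{align*}
```

For instance, with $V_{DD}$ = 3.3 V and an assumed $V_T$ = 0.7 V, the matched case gives $V_{T+}$ = 2.0 V, $V_{T-}$ = 1.3 V and $V_H$ = 0.7 V.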

In this case the hysteresis voltage can be made equal to $V_T$. The short-circuit power dissipation of the Schmitt trigger can be very important since the rise/fall times of the input signal are very long.


Fig. 4.92 shows a SPICE simulation of the circuit of Fig. 4.91 in 0.8 µm technology. In this example, the load capacitance is 0.1 pF and the total power dissipation is 0.85 mW. The dynamic power dissipation, due to the load and parasitic capacitances, is 0.40 mW. Therefore, the power due to the short-circuit current is 0.45 mW, which represents about 53% of the total power dissipation.

4.9.3 CMOS Buffer Sizing

When the gate is intended to drive a large load capacitance (larger than the input capacitance of the gate), the driving capability is limited and the delay is large. If we increase the size of the gate (driver configuration), we improve the rise/fall times but still the delay can be improved by putting several stages of buffering between the first gate and the load. The objective in a buffer configuration is to get the input signal to the load as quickly as possible. Each stage in the buffer chain should have its transistor widths larger than the previous


Question: What are the values of the size ratio a and the number of stages n to optimize the delay?

By differentiating the delay equation with respect to a and then setting it equal to zero, we have

$$a = e \approx 2.7 \qquad (4.124)$$

The optimum number of stages is

$$n_{opt} = \ln(C_L/C_{in}) \qquad (4.125)$$

In this analysis, we have neglected the parasitic output capacitance of each stage. Other studies [30, 31, 32, 33] illustrate that the size ratio a depends on the ratio of the parasitic output capacitance and load capacitance. In [34] a new approach for CMOS tapered buffers, with large $C_L/C_{in}$ ratio, was proposed. It uses a variable size ratio between the stages.
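A short numerical sketch can make the optimum concrete. It assumes the usual first-order tapered-buffer model, in which each of the n stages is a times larger than the previous one, $a^n = C_L/C_{in}$, and the total delay is proportional to $n \cdot a$. The function names and the unit delay `t0` are hypothetical, and parasitic output capacitance is neglected as in the analysis above.

```python
import math


def buffer_delay(n, c_load, c_in, t0=1.0):
    """Delay of an n-stage tapered buffer under the assumed model n * a * t0."""
    a = (c_load / c_in) ** (1.0 / n)  # stage ratio that satisfies a**n = C_L / C_in
    return n * a * t0


def optimum_stages(c_load, c_in):
    """Integer stage count nearest to ln(C_L/C_in) that gives the lowest delay."""
    n_real = math.log(c_load / c_in)
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(candidates, key=lambda n: buffer_delay(n, c_load, c_in))


ratio = 1000.0                     # assumed C_L / C_in, for illustration only
n_opt = optimum_stages(ratio, 1.0)
a_opt = ratio ** (1.0 / n_opt)     # resulting stage ratio, close to e
```

Since n must be an integer, the realised stage ratio lands near e rather than exactly on it; the delay curve is shallow around the optimum, which is why larger ratios are often acceptable in practice.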

The power dissipation of a CMOS buffer is mainly dominated by dynamic power dissipation for large $V_T$. The short-circuit power dissipation can be neglected in a first-order analysis [34]. If we include the parasitic output capacitance, stage i has a total output capacitance

$$C_i = a^i C_{in} + a^{i-1} C_p \qquad (4.126)$$

where we assume that the parasitic capacitance of stage i is proportional to the size ratio a. The dynamic power dissipation at the output of gate i is

$$P_i = C_i V_{DD}^2 f = V_{DD}^2 f\,(a^i C_{in} + a^{i-1} C_p) \qquad (4.127)$$

or

$$P_i = V_{DD}^2 f\, a^{i-1}\,(a C_{in} + C_p) \qquad (4.128)$$

The total power is

$$P_T = \sum_{i=1}^{n} P_i = V_{DD}^2 f\,(a C_{in} + C_p) \sum_{i=1}^{n} a^{i-1} \qquad (4.129)$$

Hence

$$P_T = V_{DD}^2 f\,(a C_{in} + C_p)\,\frac{a^n - 1}{a - 1} \qquad (4.130)$$

The power efficiency of the buffer can then be defined as

$$\eta = \frac{P_L}{P_T} \qquad (4.131)$$

where $P_L$ is the power dissipated due to the load $C_L$, which is simply $C_L V_{DD}^2 f$. $P_T$ is the total power dissipated, given by Equation (4.130). This power efficiency, for a given $C_L$, $C_{in}$ and $C_p$, is a function of only the factor a. The term $1 - \eta$ characterizes the additional power dissipation overhead needed by the buffer chain to drive the load $C_L$. For high values of a, the power efficiency of the buffer increases. In practice a can be in the range of 2 to 10. This value of a can be set depending on speed, delay and power dissipation constraints.
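The dependence of the efficiency on a can be sketched numerically from Equation (4.130). The sketch assumes that the last stage drives the load, so that $a^n = C_L/C_{in}$ (n is then allowed to be non-integer), and it uses illustrative capacitance values that are not taken from the text. The function name is hypothetical.

```python
def buffer_power_efficiency(a, c_load, c_in, c_p):
    """Ratio P_L / P_T for a tapered buffer, using Eq. (4.130).

    V_DD**2 * f cancels in the ratio. Assumes a**n = c_load / c_in.
    """
    geometric_sum = (c_load / c_in - 1.0) / (a - 1.0)  # (a**n - 1) / (a - 1)
    total_cap = (a * c_in + c_p) * geometric_sum
    return c_load / total_cap


# Illustrative values (assumptions): C_L = 1000 * C_in and C_p = C_in.
c_in, c_p, c_load = 1.0, 1.0, 1000.0
efficiency = {a: buffer_power_efficiency(a, c_load, c_in, c_p)
              for a in (2.0, 2.718, 5.0, 10.0)}
```

With these assumed values the efficiency rises from roughly one third at a = 2 to above 0.8 at a = 10, which matches the statement that larger stage ratios waste less power in the intermediate stages.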

4.9.4 Clock Drivers and Clock Distribution

Usually when the clock is to be distributed on-chip, input buffers are needed. The clock circuit has to drive a very high internal load with extremely fast fall/rise times. For example, in the case of the DEC Alpha chip [21] the clock load is 3.2 nF. If this load has to be driven by a large driver, in rise/fall times of 0.5 ns when the clock frequency is 200 MHz ($T_{clock}$ = 5 ns), then the average transient current would be

$$I_{av} = \frac{C\,V_{DD}}{\Delta t} = \frac{3.2 \times 10^{-9} \times 3.3}{0.5 \times 10^{-9}} = 21\ \text{A} \qquad (4.132)$$

at $V_{DD}$ = 3.3 V power supply. The corresponding dynamic power dissipation due to this clock loading is

$$P = C\,V_{DD}^2 f = 3.2 \times 10^{-9} \times 3.3^2 \times 200 \times 10^{6} \approx 7\ \text{W} \qquad (4.133)$$

This example shows how the clocking is an important design issue. A clocking strategy should be used to distribute the clock to the different functional blocks of the chip with minimum clock skew and low power dissipation.
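The arithmetic of this clock example is easy to reproduce. The short sketch below uses only the values quoted in the text (3.2 nF, 3.3 V, 0.5 ns edges, 200 MHz); the variable names are arbitrary.

```python
c_clock = 3.2e-9   # clock load in farads
vdd = 3.3          # supply voltage in volts
t_edge = 0.5e-9    # rise/fall time in seconds
f_clock = 200e6    # clock frequency in hertz

# Average current needed to move the charge C*V_DD during one edge.
i_avg = c_clock * vdd / t_edge

# Dynamic power of charging and discharging the clock load every cycle.
p_clock = c_clock * vdd ** 2 * f_clock

print(f"I_av = {i_avg:.1f} A, P = {p_clock:.2f} W")
```

The two results, about 21 A and about 7 W, are the figures given in (4.132) and (4.133).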

The clock skew problem is due mainly to two issues:

■ The difference in RC interconnect time constants: For example, in Fig. 4.94 node A and node B have two different branch lengths to node C. In this case, the delays of the signals at node A and node B vis-à-vis node C are different. Therefore, the clock skew is equal to the time difference between these two signals.

■ Unbalanced loads at different nodes: As shown in the example of Fig. 4.95, if the loads at the nodes A and B, $C_A$ and $C_B$ respectively, are different, then a skew exists between the signals at these nodes.

Figure 4.95 Clock skew due to the unbalanced loads at block A and block B.

Several strategies have been proposed to minimize clock skew. The first approach is to use cascaded inverters (buffer) to drive a large load and feed all blocks as shown in Fig. 4.96. The buffer chain is designed by the approach presented in Section 4.9.3. In another approach, the clock distribution is accomplished by using a tree of well-sized clock buffers as illustrated by Fig. 4.97. Identical buffers are used in each level and each buffer sees the same load capacitance. Equalizing clock buffer loads is possible by: 1) equalizing the interconnect lengths between the buffers of different levels, and 2) the addition of dummy buffers at the slightly loaded buffer outputs. The last distribution level has buffers which drive the functional elements such as registers. This structure results in very reduced skew and the only skew that exists is the one produced by variations in process parameters. To further minimize the skew, identical layout should be used for all the buffers. An example of the tree approach is the following case. To distribute the clock signal to 64 elements (for example registers), 3 stages (levels) of buffering with a 1-to-4 tree structure are required. A variety of software packages have been developed for clock tree synthesis [35, 36].
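The level count in the 64-element example can be checked with a small helper. This is only a counting sketch under the assumption that the tree starts from a single root buffer and that every buffer drives the same number of successors; the function names are hypothetical.

```python
def tree_levels(n_elements, fanout):
    """Number of buffer levels needed so that fanout**levels >= n_elements."""
    levels, reach = 0, 1
    while reach < n_elements:
        reach *= fanout
        levels += 1
    return levels


def buffers_per_level(n_elements, fanout):
    """Buffer count at each level, assuming a single root buffer at level 1."""
    return [fanout ** k for k in range(tree_levels(n_elements, fanout))]


levels = tree_levels(64, 4)
counts = buffers_per_level(64, 4)
```

For 64 elements and a 1-to-4 structure this gives three levels with 1, 4 and 16 buffers, the last level driving the 64 elements directly.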

To reduce the high dynamic power dissipation (a few Watts) in clock distribution at a fixed power supply, many techniques can be used such as:

1. Using a low capacitance clock routing line such as metal3. This layer of metal can be, for example, dedicated to clock distribution only.

2. Using low-swing drivers at the top level of the tree or in intermediate levels.


Figure 4.97 Clock tree distribution.

For the second approach, a half-swing clocking scheme has been proposed [37]. Fig. 4.98 shows the half-swing clock driver which generates half-$V_{DD}$ clock signals (four phases) to the elements (e.g., latches). Using the charge sharing principle, the node of half-$V_{DD}$ can be expressed by

H - V D D = V D ~ when clk is low (4.134) c, + c, + c, + cs

H-VDD = -VDD whenclk ia hwh (4.135)

where $C_A$ and $C_B$ are capacitors added to the power lines. $C_1$ through $C_4$ are the load capacitances of the driver. When $C_A$ is equal to $C_B$ and both are large enough compared to $C_1$-$C_4$, then the H-VDD node is stabilized at $V_{DD}/2$.

Fig. 4.99 shows the clocking schemes of the latches driven by the clock driver. Compared to the conventional scheme which uses two clock phases, the half-swing scheme requires four clock phases. Two phases are for PMOSs and two are for NMOSs as shown in Fig. 4.99(b). This scheme reduces the power by 75%. However, the delay of the latch is increased by the new clocking scheme, which can be acceptable [37].

ca + c3 + c, + G B

4.9.5 Output Circuits

To drive the output pad, a high drive capability driver is needed to achieve adequate rise and fall times. In this case, an inverter chain is used to handle the large load of the pad, package wiring, and off-chip load. This capacitance can be a few tens of pF. A typical value of this capacitance is 50 pF. There are many types of output pads such as tristate, bidirectional, low-$V_{DD}$ (3.3 V) to high-$V_{DD}$ (5 V) output buffer and low-swing output.

4.9.5.1 Tristate and Bidirectional Circuits

Fig. 4.100 shows a tristate circuit to drive a large pad capacitance. When the output enable signal is high, the output data is the same as the input data. When the output enable signal is low, then the output of the pad is in the high-impedance state (Z). Both the output NMOS and PMOS transistors are cutoff. Fig. 4.101 shows the bidirectional I/O circuit which is quite useful when we need to save on the number of I/O pads. Sometimes an input buffer is included in the bidirectional pad. The operation of this circuit is obvious.

4.9.5.2 Power Dissipation of Output Circuits

The total power dissipation at the output pads can be divided into the static power dissipation and the dynamic power dissipation. The static power dissipation is due mainly to the leakage currents (junction and subthreshold) if the output pads are driving CMOS logic. If the $V_T$ of the devices is large enough, then the static power dissipation of the output pads is neglected. However if $V_T$ is small, then the DC power, due to the subthreshold current, for the output pads is

$$P_s = N_o\, I_{DS,mean}\, V_{DD} \qquad (4.136)$$

where $N_o$ is the number of output pads and $I_{DS,mean}$ is the average subthreshold current for both cases when the input is low and high. For low $V_T$ the $I_{DS,mean}$ value would be important, because the devices in the output buffer have large size, particularly the output transistors. $I_{DS,mean}$ should be computed in the worst case where the $V_T$ has its minimum value. Thus for future technologies where the threshold voltage is low and the number of output pads is large, the static power dissipation would be very important and can be a limiting factor for low-power applications. Hence low-power circuit techniques are needed for output buffers.

Figure 4.101 Bidirectional pad.

If the CMOS output buffer is intended to drive bipolar TTL inputs (not CMOS TTL inputs), then an important current is sunk. Fig. 4.102 shows the final stage of the buffer driving a TTL logic. Since bipolar TTL inputs can source significant amounts of current, a CMOS output buffer must sink this current. For a 3.3 V power supply, this current can be in the range of 1 mA to 12 mA depending on the strength of the output driver. The static power dissipated by one output pad driving bipolar TTL inputs is

$$P = V_{OL}\, I_{OL} \qquad (4.137)$$

Figure 4.102 TTL output buffer.

where $I_{OL}$ is the current sunk by the output buffer and is equal to the sum of the currents from all the bipolar TTL inputs. $V_{OL}$ = 0.4 V for low TTL output. This dissipated power is due to the output NMOS pull-down transistor and can be an important issue as far as the chip heat is concerned. Note that the corresponding energy is not drawn from the internal power supply.

Another component of the total power dissipated at the output pads is the dynamic power. It is given by

$$P_O = A\,(N_o E_{io} + N_o C_o V_{DD}^2)\, f \qquad (4.138)$$

where $E_{io}$ is the internal switching energy of the output pad, and $C_o$ is the average output load capacitance (including the pad load). As an example, 64 output pads switching with an activity of 10% at 200 MHz dissipate 0.8 W ($V_{DD}$ = 3.3 V, $E_{io}$ = 70 µW/MHz and $C_o$ = 50 pF). This value is very important to take into account.
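The 0.8 W figure can be reproduced directly from Equation (4.138). The sketch below uses the values quoted in the example; the function name is hypothetical, and the internal energy of 70 µW/MHz is converted to 70 pJ per transition.

```python
def output_pad_dynamic_power(activity, n_pads, e_internal, c_load, vdd, freq):
    """Dynamic power of the output pads following Eq. (4.138), in watts."""
    return activity * (n_pads * e_internal + n_pads * c_load * vdd ** 2) * freq


p_pads = output_pad_dynamic_power(
    activity=0.10,       # 10% switching activity
    n_pads=64,
    e_internal=70e-12,   # 70 uW/MHz = 70e-6 W / 1e6 Hz = 70 pJ
    c_load=50e-12,       # 50 pF
    vdd=3.3,
    freq=200e6,
)
print(f"P = {p_pads:.2f} W")
```

The load term dominates: about 0.70 W comes from charging the 50 pF loads and about 0.09 W from the internal switching energy of the pads.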

The total power dissipation of the bidirectional pads can be evaluated using the approaches developed for the input and output circuits.

4.9.5.3 3.3-to-5 V Output Interface

When a 3.3 V chip is connected to a 5 V chip, zero DC power dissipation interfaces are needed. If the conventional CMOS is used to interface the 3.3 V logic to 5 V logic, the DC power would be large. Fig. 4.103 illustrates this problem. For example, if the 3.3 V inverter drives high into the 5 V inverter, the $|V_{GS}|$ of the PMOS transistor $P_1$ is equal to 1.7 V. This value is larger than the $V_T$ of the device and thus results in large DC power dissipation, in the range of milliwatts. Since this power is dissipated at every I/O, for a whole ASIC chip it could be hundreds of mW. This situation is unacceptable for low-power applications.

The circuit of Fig. 4.104 defines a solution to the problem of DC power dissipation [38]. The circuit has two power supplies, denoted $V_{DDL}$ and $V_{DDH}$, corresponding to low-$V_{DD}$ (for example 3.3 V) and high-$V_{DD}$ (for example 5 V), respectively. For low input data, node A is at $V_{DDL}$ and node B is at zero. The NMOS transistor N is conducting and the output is at $V_{SS}$. Since the output is zero, the feedback PMOS transistor, $P_f$, is also conducting. The pass NMOS transistor, $N_p$, is cutoff, thus the node C is pulled up to $V_{DDH}$. Then the PMOS transistor P is completely OFF. Hence there is no leakage in this state except the junction leakage currents and the subthreshold currents. For high input data, node A is at zero and node B is at $V_{DDL}$. In this case the NMOS transistor N is OFF and the pass transistor $N_p$ is conducting. Initially the feedback PMOS transistor $P_f$ is ON and, since $N_p$ is conducting, proper sizing of $P_f$ and $N_p$ (higher conductance of $N_p$) will permit node C to be discharged through $N_p$. This causes P to conduct, which in turn charges the output to $V_{DDH}$. Then the feedback device $P_f$ is completely OFF. Thus this interface results in very limited leakage current and solves the problem of interface.

As mentioned, the transistors $P_f$ and $N_p$ should be sized properly so that the circuit does not latch the previous data. $P_f$ should be much smaller than $N_p$. We use a simple analysis to find the relationship between the sizes of the two transistors. For high input data, initially the node C is at $V_{DDH}$. Thus the NMOS $N_p$ is in saturation and the PMOS $P_f$ is in the linear region. By assuming that the drain current of $N_p$ is much higher than that of $P_f$, we have

(4.140)

where $\beta_{Np}$ and $\beta_{Pf}$ are the βs of the NMOS transistor $N_p$ and the PMOS transistor $P_f$, respectively. The low-to-high voltage converter has a negligible DC current when the input is stable since all the devices are completely OFF. This technique can be used to interface any low voltage to a higher voltage.

4.9.6 Ground Bounce

When a high drive current CMOS driver switches, it generates high current spikes. This current can generate noise, as shown in Fig. 4.105. The current flows through the impedance between the pad and supply node and produces a voltage noise. This noise is often called $L\,di/dt$ noise or ground bounce. The L is due to the package inductance. The ground bounce is given by

$$V_L = L\,\frac{di}{dt} \qquad (4.141)$$


This noise problem can also occur on the power lead, where it is termed power bounce. We will use only one name to refer to this problem. Consider a CMOS output driver driving the output pad of 50 pF at 3.3 V in 2 ns rise/fall times. It can be shown [39] that $di/dt$ is related to the fall/rise times by

(4.142)

The $di/dt$ can be as high as 165 mA/ns. If, for example, 8 drivers are allowed to switch simultaneously per each $V_{DD}/V_{SS}$ pad pair, the resulting ground bounce for L = 1 nH is 1320 mV. This value can be a problem, particularly for low-voltage applications, since this ground bounce consumes a large fraction of the digital noise margins. Some of the problems encountered are 1) false triggering, 2) double clocking, and/or 3) missing clocked pulses.
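The numbers in this example can be checked with a few lines of Python. The expression used for the peak $di/dt$, $4 C V_{DD}/t^2$, is an assumption: it is a common first-order estimate for a triangular current pulse and it reproduces the quoted 165 mA/ns, but it is not necessarily the exact form of (4.142). The variable names are arbitrary.

```python
c_load = 50e-12      # output pad load in farads
vdd = 3.3            # supply voltage in volts
t_edge = 2e-9        # rise/fall time in seconds
l_package = 1e-9     # package inductance in henries
n_drivers = 8        # drivers switching simultaneously per supply pad pair

# Assumed first-order estimate of the peak current slope (A/s).
di_dt = 4.0 * c_load * vdd / t_edge ** 2

# Ground bounce from Eq. (4.141), summed over the simultaneously switching drivers.
v_bounce = n_drivers * l_package * di_dt

print(f"di/dt = {di_dt * 1e-6:.0f} mA/ns, bounce = {v_bounce * 1e3:.0f} mV")
```

A bounce of 1.32 V is 40% of a 3.3 V supply, which illustrates why the number of simultaneously switching drivers per supply pair has to be limited.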

I/O buffers are not the only source of ground bounce in CMOS circuits. Clock buffers and, to a lesser degree, the core logic can also cause serious ground bounce in the supply leads when driving large loads. Careful power supply routing is needed when we power large buffers. The resistance of the metal should be minimized so the voltage drop, due to the current spike, is reduced.

There are many techniques to reduce the ground bounce. One simple approach is to use separate supply pins for the output buffers. Some approaches, based on reducing L and $di/dt$, are the following:

■ Multiple supply pads and pins are one way to reduce the inductance of the supply. A recent chip uses 121 power/ground pins out of a total of 293 pins [40].

■ Placement of power and ground pins adjacent to one another reduces the effective inductance of the power and ground pins by mutual inductance. This approach causes an increase in chip size and cost.

■ Circuit techniques to reduce the $di/dt$ of the output and clock buffers, while maintaining adequate performance. The simplest way is to control the rise/fall times while maintaining the timing requirement. However, this approach has a serious problem, since the worst-case-slow process dictates the buffer sizing (worst-case delay), while the best-case-fast process dictates the ground bounce level. Hence the buffer design is constrained by the two extremes of process variations. Once the buffer is sized to satisfy the worst-case delay, the worst-case ground bounce may exceed the fixed level. This problem can be solved by controlling the signal slope at the input of the output transistors of the buffer [41].

■ For clock buffers, and in high-performance design, on-chip bypass capacitances are added between the power bus and the substrate as shown in Fig. 4.106. This capacitance lowers the impedance of the power supply. On-chip bypass capacitance does not reduce the noise produced by output buffers.

■ Another approach is to reduce the output voltage swing of the large buffer.

In conclusion, to reduce the ground bounce, all the techniques can be combined to reduce L and $di/dt$. The reader can refer to many other techniques to reduce the ground bounce [42, 43, 44, 45].


4.9.7 Low-Swing Output Circuit

With the advent of high-performance VLSI chips, which operate beyond 100 MHz and have over 100 I/Os on the same chip, high data rate CMOS I/O interfaces with low-swing signals are needed, such as ECL (Emitter Coupled Logic) [46, 47, 48], BTL [49], GTL [50], and CMTL (Current Mode Transceiver Logic) [51]. Conventional unterminated interconnects (between VLSI chips) for CMOS-level signals usually have poor signal quality with severe overshoot and ringing, accompanied by EMI (electromagnetic interference) and the possibility of triggering latch-up.

Fig. 4.107 shows two chips connected to the bidirectional transmission line (50 Ω termination resistors) through GTL I/O (Gunning I/O) transceivers. Both ends of the transmission line are terminated to prevent reflections. The load seen by each driver is 25 Ω. The termination voltage $V_{TM}$ is about 1.2 V. The output driver is an open-drain NMOS pull-down transistor, and when it is inactive the output is at the high-level signal $V_{OH}$, equal to $V_{TM}$. The input receiver uses a differential comparator with external reference voltage $V_{ref}$ = 0.8 V.


Figure 4.107 GTL I/O with two chips connected to transmission line.

Fig. 4.108 shows an output driver in open-drain configuration which includes circuitry to reduce overshoot and the turn-off $di/dt$. When $V_{in}$ is low, $P_1$ turns ON, which itself turns $N_3$ and $N_4$ ON. In this case, the maximum output voltage is $V_{OL,max}$ = 0.4 V. The power dissipated by the pull-down NMOS is maximum and mainly static. The static current is equal to $(V_{TM} - V_{OL})/R$ = 0.8/25 = 32 mA⁸. Hence, the maximum static power dissipated on-chip is P = 32 mA × 0.4 V = 12.8 mW for each I/O. The typical value of $V_{OL}$ is 0.24 V, thus the nominal power dissipated by each active driver is 9.2 mW. When the input goes from low to high, $N_2$ turns ON and $N_3$ is still ON because the signal through the two inverters $I_1$ and $I_2$ is delayed by about 1 ns. The transistor $N_1$ is weak, hence the output discharge is controlled by $N_2$ and $N_3$. These transistors keep the drain of $N_4$ connected to its gate as long as $V_{GS4}$ is higher than $V_T$. When $N_3$ turns OFF, $N_1$ discharges the gate of $N_4$ to ground. Thus, the turn-off of $N_4$ is controlled. In this case, there is no DC power dissipated.
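The static current and power figures for the active GTL driver follow from Ohm's law and can be verified with a short sketch. The function name is hypothetical; the voltages and the 25 Ω load are the values given in the text.

```python
def gtl_static(v_tm, v_ol, r_load):
    """Static current (A) and on-chip power (W) of an active open-drain driver."""
    current = (v_tm - v_ol) / r_load   # current through the termination network
    power = current * v_ol             # power dissipated in the pull-down NMOS
    return current, power


i_max, p_max = gtl_static(v_tm=1.2, v_ol=0.4, r_load=25.0)    # worst-case V_OL
i_typ, p_typ = gtl_static(v_tm=1.2, v_ol=0.24, r_load=25.0)   # typical V_OL

print(f"max: {i_max * 1e3:.1f} mA, {p_max * 1e3:.1f} mW")
print(f"typ: {i_typ * 1e3:.1f} mA, {p_typ * 1e3:.1f} mW")
```

Note that a lower $V_{OL}$ increases the current slightly but reduces the on-chip power, because most of the voltage then drops across the external termination.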

Fig. 4.109 shows the input buffer which employs a differential comparator. This circuit switches to high (low) $V_{out}$ when $V_{in} - V_{ref}$ > 50 mV (< -50 mV), respectively, over process, power supply and junction temperature variations. The average power dissipated by this input receiver is 5.5 mW at 5 V power supply.

⁸Note that this current is supplied by $V_{TM}$ and not $V_{DD}$.


4.10 LOW-POWER CIRCUIT TECHNIQUES

Remember that the total power dissipated by a circuit has three components. Two of them which are very important are: 1) the static power ($P_s$), and 2) the dynamic power ($P_d$). This section treats some of the circuit techniques for achieving low power while maintaining performance. Techniques to reduce the power at the subsystem/system and architecture levels will be discussed in Chapters 6, 7 and 8.

4.10.1 Low Static Power Techniques

One important source of static power dissipation is the use of low threshold voltage. With device scaling, the power supply voltage is scaled. If the threshold voltage is not scaled, and is equal to or greater than one half of $V_{DD}$, the gate delay increases drastically [52]. The threshold voltage should be less than 20% of $V_{DD}$ in order to maintain performance at low supply voltage. At 1 V power supply, the threshold voltages can be as low as 0.1 V. However, reducing $V_T$ causes a serious standby subthreshold current increase, due to the exponential relation between the current and $V_T$. With low $V_T$ the process fluctuation can increase this current further. For VLSI integration and future ULSI, the total standby current can be high and not acceptable for low-power applications.

To reduce this subthreshold current, associated with low $V_T$ devices, there are many techniques. These techniques are based on the principle of reverse biasing the $V_{GS}$ voltage of the MOS device (in the case of NMOS) in the standby mode of operation, as shown in Fig. 4.110. With $V_{GS}$ set to a negative value, the standby state of the device moves from state α to state β. We cite two techniques using this principle:

4.10.1.1 Self-Reverse Biasing

This technique has been used mainly to reduce the static power dissipation in standby mode of the memory decoder-driver [53]. The drivers, in memory, have a large number of circuits, arranged repeatedly, but only a few of them operate simultaneously. The circuit of Fig. 4.111 can drastically reduce the subthreshold current of the drivers. The technique simply consists of inserting a PMOS transistor $P_c$ with a size $W_c$ between the power supply $V_{DD}$ and the common source node A. All the PMOS transistors ($P_{d1}$, $P_{d2}$, ..., $P_{dn}$) of the drivers have, in this example, the same size $W_d$ and a common source (node A). The number of drivers n can be between a few hundreds and a few thousands. The MOS transistors in the drivers have low $|V_{Td}|$ (e.g., 0.1 V). The PMOS transistor $P_c$ has a threshold voltage $|V_{Tc}|$ slightly higher than $|V_{Td}|$ (e.g., 0.2 - 0.4 V).

⁹Constant-current threshold voltage.

In active mode, the input S is low and the transistor $P_c$ is ON. For the drivers only one circuit is ON. In order that the PMOS transistor $P_c$ does not affect the drive current of the drivers, its size $W_c$ should be larger than $W_d$, depending on the capacitance of the common source, which is huge for high n. In standby mode, the input S is high and the PMOS transistor $P_c$ is OFF. The inputs of all drivers are set to high ($V_{DD}$). Without the PMOS transistor $P_c$, the total subthreshold current would be n times the current of each driver. This makes this current very high. Hence $P_c$ reduces and limits the subthreshold current. The voltage of the common source node A is reduced by an amount $\Delta V_{SRB}$ (a few hundreds of mV). This causes the PMOS transistors of all drivers to have a self-reverse-biasing gate-source voltage, which drastically reduces the subthreshold current. The time needed for the node to stabilize to $V_{DD} - \Delta V_{SRB}$ (or the time needed to switch from the active to standby mode) is called the evolution time and can be very high (order of 1 ms) compared to the delay of the driver. The reason is that only the leakage and subthreshold currents discharge the node A in this mode. This time can be insignificant to low-power operation if the standby mode time is large enough, as in the case of many low-power applications. When the input S is turned low (active mode), the time needed for the common source A to recover (reach almost $V_{DD}$) is very short and can be lower than the delay time. Hence, it does not interrupt the start of normal operation.

Figure 4.111 Subthreshold current reduction by self-reverse biasing.

Let us now derive the subthreshold current expressions before and after reduction by the SRB technique. The total subthreshold current without the self-reverse-biasing technique is given by

$$I_{sub1} = n\, I_0\, \frac{W_d}{W_0} \exp\!\left(\frac{-|V_{Td}|}{S/\ln 10}\right) \qquad (4.143)$$

With the transistor $P_c$, the subthreshold current is given by

$$I_{sub2} = I_0\, \frac{W_c}{W_0} \exp\!\left(\frac{-|V_{Tc}|}{S/\ln 10}\right) \qquad (4.144)$$

We assume that the devices have the same $I_0$, $W_0$ and S. By dividing the current equations (4.143) and (4.144), we have, for the subthreshold current, a reduction factor γ,

$$\gamma = \frac{I_{sub1}}{I_{sub2}} = n\, \frac{W_d}{W_c} \exp\!\left(\frac{|V_{Tc}| - |V_{Td}|}{S/\ln 10}\right) \qquad (4.145)$$

For example, for n = 512, $W_c = 10\,W_d$ (with this ratio the speed is not affected), $V_{Tc}$ = 0.3 V, $V_{Td}$ = 0.1 V and S = 90 mV/decade, the factor γ = 8.5 × 10³. So, the saving in subthreshold current is sufficient. The parameter $\Delta V_{SRB}$ can be easily deduced. Note that this technique needs multi-$V_T$ technology.
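The reduction factor in this example can be reproduced by dividing (4.143) by (4.144). The sketch below assumes, as the text does, that both devices share the same $I_0$, $W_0$ and S, so those terms cancel; the function name is hypothetical. Since $\exp(x \ln 10 / S) = 10^{x/S}$, the exponential is written as a power of ten.

```python
def srb_reduction_factor(n_drivers, wc_over_wd, vt_c, vt_d, swing):
    """Ratio of standby current without and with the series PMOS.

    swing is the subthreshold swing S in volts per decade.
    """
    decades = (abs(vt_c) - abs(vt_d)) / swing   # threshold difference in decades
    return n_drivers / wc_over_wd * 10.0 ** decades


gamma = srb_reduction_factor(
    n_drivers=512,
    wc_over_wd=10.0,   # W_c = 10 * W_d
    vt_c=0.3,
    vt_d=0.1,
    swing=0.090,       # 90 mV/decade
)
print(f"reduction factor = {gamma:.2e}")
```

The factor of about 8.5 × 10³ comes from two contributions: roughly 51 from replacing 512 leaking drivers by a single device ten times wider, and roughly 167 from the 0.2 V higher threshold.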

4.10.1.2 Multi-VT Technique

This technique is similar to the one discussed above, but it can be applied to any CMOS logic [54, 56]. The basic idea is shown in the example of the NAND gate of Fig. 4.112. Here the MOS transistors P and N have high $V_T$ (e.g., 0.6 V extrapolated) for 1 V power supply applications. The logic gate itself has MOSFETs with low $V_T$ (≤ 0.3 V). The signal SL is used to switch the gate to active or sleep (standby) mode. The virtual supply lines $V_{DDV}$ and $V_{SSV}$ are common to many gates. We call this logic multi-threshold CMOS logic (MT-CMOS).

In the active mode, the signal SL is low, P and N are ON, so the virtual supply lines $V_{DDV}$ and $V_{SSV}$ can be set to almost $V_{DD}$ and ground, respectively. Hence, the low-$V_T$ logic can switch efficiently, but care should be taken in the sizing of the P/N devices compared to the logic. Fig. 4.113 shows the effect of sizing the high-$V_T$ devices on the delay of the gate. The width of P/N should be at least 10 times larger than that of the logic cells. This condition depends greatly on the parasitic capacitances of the virtual supply lines, $C_1$ and $C_2$ [see Fig. 4.112]. If $C_1$ and $C_2$ are large then the width of the P and N transistors can be reduced, because these capacitances tend to suppress the bouncing of $V_{DDV}$ and $V_{SSV}$ and hence improve the speed. The high-$V_T$ MOSFETs can be common to several logic gates (e.g., 10).

In the standby (sleep) mode, the signal SL is high, so P and N are OFF. Hence, the subthreshold current is limited by that of these high-$V_T$ devices. In this case, the static power dissipation is dramatically reduced in the sleep mode. The subthreshold reduction factor can be deduced using the analysis presented in the previous section. One problem associated with this MT logic is that the evolution and recovery times can be large.


Figure 4.113 Effect of high-VT MOS width on the performance of MT-CMOS (horizontal axis: high-VT transistor gate width, normalized).


The measured delay, as a function of the supply voltage, for a 2-input NAND gate with FO = 3 and a wiring load of 1 mm (0.25 pF) is shown in Fig. 4.114. The technology is 0.5-μm CMOS with low VTn = 0.25 V, low VTp = -0.35 V, high VTn = 0.55 V and high VTp = -0.65 V. The MT-CMOS logic has almost the same speed as the full low-VT logic. The logic delay time is reduced by 70% at 1 V as compared with that of the high-VT one.

For holding the level of the output during the sleep mode, a level holder is necessary, as shown in Fig. 4.115. It consists only of cross-coupled inverters with high-VT devices powered from the power supply VDD.

Figure 4.115 CMOS gate with level holder.

Low-VT devices are not the only source of static power dissipation. Several other issues contribute to static power increase. These are some circuit design guidelines to reduce the static power dissipation:

■ Avoid the use of pseudo-NMOS circuits in your design.

■ Avoid the use of TTL-compatible I/O, or devise low-DC-current level converters.

■ Do not use low-VT devices in the I/O buffers, otherwise the DC power increases remarkably because the MOS transistors of the I/O buffers have large sizes. If you do not have any option, then use the subthreshold reduction techniques.

4.10.2 Low Dynamic Power Techniques

ASIC and VLSI processor clocks are improving rapidly, reaching the sub-GHz range [21, 56]. The power dissipation of CMOS digital circuits operating at these high frequencies increases drastically and it can be the main performance-limiting factor. Therefore, low-power circuit techniques are needed to reduce the dynamic power of digital circuits. Moreover, low chip power consumption is extremely important in order to extend the battery life of portable systems [57].

In general the dynamic power dissipation of a gate (i) is given by:

$$P_{d,i} = \alpha_i C_i V_s V_{DD} f \qquad (4.146)$$

where $\alpha_i$ is the gate activity, $V_s$ is the voltage swing, $C_i$ is the load and parasitic capacitance and $f$ is the operating frequency of the system. Equation (4.146) demonstrates that there are several ways to reduce $P_d$:


1. Reduce the power supply voltage. Scaling VDD from 3.3 V to 1 V results in a power reduction factor of 11 (for a full swing, the power scales with the square of VDD). However, this approach leads to speed degradation for a given technology. But if device scaling is applied, in a next-generation technology, the delay will improve and hence the operating frequency. In a complex digital system, local supply reductions can be used for non-critical circuits.

2. Reduce, temporarily, the clock frequency of unused blocks on a VLSI chip using an on-chip power management unit, or reduce the gate activity. These can be done at the architectural level.

3. Reduce the output capacitance $C_i$. As a first-order approximation this capacitance is composed of the interconnect capacitance $C_{int}$ and the total input capacitance of the driven gates $C_{in}$. The latter can be reduced using a low-input-capacitance logic family [60], such as a CPL-like one. Also, using minimum-size logic gates in non-critical parts of the design can reduce the dynamic power significantly.

When $C_{int}$ dominates, as in busses and high-capacitance interconnections (interblock wires), circuit techniques based on low-swing signals, while maintaining the power supply voltage, can lead to power dissipation reduction [58, 59]. With increasing chip dimensions and integration density, the capacitances of wires will dominate. It is expected that the power dissipation associated with the busses and the interconnections in future ULSI chips will reach half of the total power dissipation [58].
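The three levers above can be checked numerically with a minimal sketch of Equation (4.146). The function name, the activity factor of 0.5, the 100 fF load and the 100 MHz clock are illustrative assumptions, not values from the text; only the 3.3 V and 1 V supplies come from the discussion above.

```python
def dynamic_power(alpha, c_load, v_swing, v_dd, freq):
    """Dynamic power of one gate: alpha * C * V_swing * V_DD * f (Eq. 4.146)."""
    return alpha * c_load * v_swing * v_dd * freq


# Hypothetical gate: activity 0.5, 100 fF load, 100 MHz clock.
ALPHA, C_LOAD, FREQ = 0.5, 100e-15, 100e6

p_full_33 = dynamic_power(ALPHA, C_LOAD, 3.3, 3.3, FREQ)  # full swing at 3.3 V
p_full_10 = dynamic_power(ALPHA, C_LOAD, 1.0, 1.0, FREQ)  # full swing at 1 V
p_low_swing = dynamic_power(ALPHA, C_LOAD, 1.0, 3.3, FREQ)  # 1 V swing, 3.3 V supply

supply_scaling_gain = p_full_33 / p_full_10
low_swing_gain = p_full_33 / p_low_swing

print(f"3.3 V -> 1 V supply scaling: {supply_scaling_gain:.2f}x less power")
print(f"1 V swing on a 3.3 V supply: {low_swing_gain:.2f}x less power")
```

With a full swing the power scales with the square of the supply, which gives the factor of about 11 quoted in item 1. Reducing only the swing while keeping the supply gives a linear gain, which is the case of interest for busses and long interconnections.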

These are some guidelines for the design of low-dynamic-power circuits:

■ Choose the technology that has low junction and oxide capacitances for the same performance.

■ Avoid, if possible, the use of the dynamic logic design style.

■ For any logic design, reduce the switching activity by logic reordering and by balancing delays through the gate tree to avoid glitching.

■ Use a low-input-capacitance logic family.

■ In non-critical paths, use minimum-size devices whenever it is possible without degrading the overall performance requirements.

■ If the pass-transistor logic style is used, careful design should be considered.


4.11 ADIABATIC COMPUTING

As discussed in Section 4.3.2, the energy provided by the supply to a driver with a load $C_L$ during charging and discharging is

$$E = C_L V^2 \qquad (4.147)$$

where V is the power supply voltage as shown in Fig. 4.116(a). Half of the energy is dissipated by the resistor of the pull-up PMOS device during the charging phase. A similar argument applies to the discharge resistor of the pull-down NMOS transistor. This analysis is valid even if a step power supply voltage, V, is applied to the network. From Fig. 4.116(b), the voltage drop $V_R$ across the resistor $R_p$ varies from V (supply voltage) to zero. Hence, the energy dissipated by $R_p$ is given by

$$E_R = \int V_R \, dQ = \int V_R \, C_L \, d(V - V_R) \qquad (4.148)$$

then

$$E_R = \frac{1}{2} C_L V^2 \qquad (4.149)$$

or, equivalently,

$$E_R = C_L V \bar{V} \qquad (4.150)$$

where $\bar{V}$ is the average voltage drop across the resistor of the pull-up PMOS.

If the power supply voltage has two half steps, as shown in Fig. 4.116(c), the energy dissipated by the resistor is

$$E_R = \frac{1}{4} C_L V^2 \qquad (4.151)$$

So less energy is dissipated by the resistor when the average voltage drop is reduced, while keeping the swing and load capacitance constant. This is the principle of Adiabatic Switching [61, 62, 63].

For a multi-step power supply voltage, as shown in Fig. 4.116(d), the total energy dissipated is given by [61]

$$E = \frac{C_L V^2}{N} = \frac{E_{conventional}}{N} \qquad (4.152)$$

and the one dissipated by the resistor is

$$E_R = \frac{1}{2} \frac{C_L V^2}{N} \qquad (4.153)$$


where N is the number of uniformly distributed voltage steps. Fig. 4.117 shows an example of a driver with uniformly distributed supplies which are switched in successively. The voltage $V_i$ is given by

$$V_i = \frac{i}{N} V$$

To charge the load, V1 through VN are connected to the load in succession (by closing switch 1, opening switch 1, closing switch 2, etc.). To discharge the load, VN-1 through V1 are switched in the same way, and then switch 0 is closed, connecting the output to ground. Note that the multi-step supply voltage needs a longer time period than the conventional case to charge up the load capacitance. This technique has been used for large loads.
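The stepwise charging sequence can be checked with a short energy bookkeeping sketch. It assumes ideal, uniformly spaced supplies and that each step is allowed to settle completely before the next switch closes; the function name and the 1 pF / 3.3 V example values are illustrative assumptions. For each step, the energy drawn from the active supply minus the increase in stored capacitor energy is the energy lost in the switch resistance.

```python
import math


def stepwise_charge_loss(c_load, v_final, n_steps):
    """Energy dissipated in the charging resistance for N uniform supply steps.

    Assumes every step settles fully (illustrative idealization).
    """
    loss = 0.0
    v_cap = 0.0
    for i in range(1, n_steps + 1):
        v_supply = v_final * i / n_steps           # i-th supply level
        dq = c_load * (v_supply - v_cap)           # charge delivered in this step
        e_supplied = v_supply * dq                 # energy drawn from this supply
        e_stored = 0.5 * c_load * (v_supply**2 - v_cap**2)
        loss += e_supplied - e_stored              # remainder is dissipated
        v_cap = v_supply
    return loss


C_LOAD, V_FINAL = 1e-12, 3.3  # hypothetical 1 pF load charged to 3.3 V
for n in (1, 2, 4, 8):
    simulated = stepwise_charge_loss(C_LOAD, V_FINAL, n)
    closed_form = C_LOAD * V_FINAL**2 / (2 * n)
    print(f"N={n}: simulated {simulated:.3e} J, C*V^2/(2N) = {closed_form:.3e} J")
```

The bookkeeping reproduces the single-step loss of half C V squared and the two-step loss of one quarter C V squared, and the loss keeps falling as 1/N, at the cost of the longer charging time noted above.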

Another variation is to use a supply voltage with a ramp form [62]. In this case, the energy is drastically reduced if a long time period is used. For the inverter, for example, pulsed power supplies (PPS) are applied to the circuit.

Adiabatic computing becomes attractive only when the delay is not critical, because in that technique the energy is traded for delay. The energy-delay product of the adiabatic circuit is much worse than that of conventional CMOS gates [64].

4.12 CHAPTER SUMMARY

This chapter has provided an introduction to low-power CMOS design. The power dissipation components of a CMOS gate have been discussed. Techniques to reduce the different components, at physical and circuit levels, were presented. Novel CMOS design styles such as CPL, DPL, and SRPL were examined. Several issues in CMOS circuit design, such as clock distribution, ground bouncing, etc., were reviewed. This chapter represents a base for Chapters 6, 7, and 8, where subsystems and low-power architectures are discussed.

REFERENCES

[1] N. H. E. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," second edition, Addison-Wesley, Reading, MA, 1993.

[2] J. P. Uyemura, "Circuit Design for CMOS VLSI," Kluwer Academic Publishers, Norwell, MA, 1992.

[3] M. I. Elmasry, "Digital MOS Integrated Circuits II," IEEE Press Book, 1993.

[4] R. M. Swanson and J. D. Meindl, "Ion-Implanted Complementary MOS Transistors in Low-Voltage Circuits," IEEE J. Solid-State Circuits, vol. 7, no. 2, pp. 146-153, April 1972.

[5] H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and Its Impact on the Design of Buffer Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 4, pp. 468-473, August 1984.

[6] S. M. Kang, "Accurate Simulation of Power Dissipation in VLSI Circuits," IEEE J. Solid-State Circuits, vol. 21, no. 5, pp. 889-891, October 1986.

[7] G. J. Fisher, "An Enhanced Power Meter for SPICE2 Circuit Simulation," IEEE Trans. Computer-Aided Design, vol. 7, pp. 641-643, May 1988.

[8] G. Y. Yacoub and W. H. Ku, "An Enhanced Technique for Simulating Short-Circuit Power Dissipation," IEEE J. Solid-State Circuits, vol. 24, no. 3, pp. 844-847, June 1989.

[9] N. Meijs and J. T. Fokkema, "VLSI Circuit Reconstruction From Mask Topology," Integration, vol. 2, no. 2, pp. 85-119, 1984.

[10] D. V. Heinbruch, "CMOS3 Cell Library," Addison-Wesley, Reading, MA, 1988.

[11] R. J. Landers and S. Mahant-Shetti, "Multiplexer-Based Architecture for High-Density, Low-Power Gate Arrays," in Symposium on VLSI Circuits, Tech. Dig., Honolulu, pp. 33-34, June 1994.

[12] M. I. Elmasry, "Digital MOS Integrated Circuits I," IEEE Press Book, 1981.

[13] R. H. Krambeck, C. M. Lee and H-F S. Law, "High-Speed Compact Circuits with CMOS," IEEE J. Solid-State Circuits, vol. 17, no. 3, pp. 614-619, June 1982.

[14] V. Friedman and S. Liu, "Dynamic Logic CMOS Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 2, pp. 263-266, April 1984.

[15] N. F. Goncalves and H. J. De Man, "NORA: A Race Free Dynamic CMOS Technique for Pipelined Logic Structures," IEEE J. Solid-State Circuits, vol. 18, no. 3, pp. 261-266, June 1983.

[16] C. M. Lee and E. W. Szeto, "Zipper CMOS," IEEE Circuits and Devices Mag., vol. 2, no. 3, pp. 10-17, May 1986.

[17] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," Addison-Wesley, Reading, MA, 1985.

[18] F. Lu and H. Samueli, "A 200-MHz CMOS Pipelined Multiplier-Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design," IEEE J. Solid-State Circuits, vol. 28, no. 2, pp. 123-132, February 1993.

[19] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62-71, February 1989.

[20] M. Afghahi and C. Svensson, "A Unified Single-Phase Clocking Scheme for VLSI Systems," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 225-233, February 1990.

[21] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.

[22] H. B. Bakoglu, "Circuits, Interconnects, and Packaging for VLSI," Addison-Wesley, Reading, MA, 1990.

[23] K. Yano et al., "A 3.8-ns CMOS 16x16 Multiplier Using Complementary Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-25, no. 2, pp. 388-394, April 1990.

[24] M. Suzuki et al., "A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-28, no. 11, pp. 1145-1151, November 1993.


[25] A. Parameswar, H. Hara, and T. Sakurai, "A High-Speed, Low-Power, Swing Restored Pass-Transistor Logic Based Multiply and Accumulate Circuit for Multimedia Applications," IEEE Custom Integrated Circuits Conference, Tech. Dig., San Diego, CA, pp. 278-281, May 1994.

[26] L. A. Glasser and D. W. Dobberpuhl, "The Design and Analysis of VLSI Circuits," Addison-Wesley, Reading, MA, 1985.

[27] T. Kobayashi et al., "A Current-Controlled Latch Sense Amplifier and a Static Power-Saving Input Buffer for Low-Power Architecture," IEEE J. Solid-State Circuits, vol. SC-28, no. 4, pp. 523-527, April 1993.

[28] M. S. J. Steyaert et al., "ECL-CMOS and CMOS-ECL Interface in 1.2-μm CMOS for 150-MHz Digital ECL Data Transmission Systems," IEEE J. Solid-State Circuits, vol. SC-26, no. 1, pp. 18-24, January 1991.

[29] C. Mead and L. Conway, "Introduction to VLSI Systems," Addison-Wesley, Reading, MA, 1980.

[30] N. C. Li, G. L. Haviland and A. A. Tuszynski, "CMOS Tapered Buffer," IEEE J. Solid-State Circuits, vol. SC-25, no. 4, pp. 1005-1008, August 1990.

[31] M. Nemes, "Driving Large Capacitances in MOS LSI Systems," IEEE J. Solid-State Circuits, vol. SC-19, no. 1, pp. 159-161, February 1984.

[32] N. Hedenstierna and K. O. Jeppson, "CMOS Circuit Speed and Buffer Optimization," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 2, pp. 276-281, March 1987.

[33] A. J. Al-Khalili, Y. Zhu and D. Al-Khalili, "A Module Generator for Optimized CMOS Buffer," IEEE Trans. Computer-Aided Design, vol. CAD-9, no. 10, pp. 1028-1046, October 1990.

[34] S. R. Vemuru and A. R. Thorbjornsen, "Variable-Taper CMOS Buffer," IEEE J. Solid-State Circuits, vol. SC-26, no. 9, pp. 1265-1269, September 1991.

[35] J. Burkis, "Clock Tree Synthesis for High Performance ASICs," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. PS-8.1-PS-8.3, September 1991.

[36] P. D. Ta and K. Do, "A Low-Power Clock Distribution Scheme for Complex IC System," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. P1-5.1-P1-5.4, September 1991.

[37] H. Kojima, S. Tanaka, and K. Sasaki, "Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry," Symposium on VLSI Circuits, Tech. Dig., Honolulu, pp. 23-24, June 1994.

[38] J. S. Caravella and J. H. Quigley, "Three Volt to Five Volt Interface Circuit with Device Leakage Limited DC Power Dissipation," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. 448-451, September 1993.

[39] M. Shoji, "CMOS Digital Circuit Technology," Prentice Hall Inc., Englewood Cliffs, NJ, 1988.

[40] F. Abu-Nofal et al., "A Three-Million Transistor Microprocessor," in IEEE International Solid-State Circuits Conf., pp. 108-109, February 1992.

[41] T. Gabara and D. Thompson, "Ground Bounce Control in CMOS Integrated Circuits," in IEEE International Solid-State Circuits Conf., pp. 88-89, February 1988.

[42] T. Gabara, "Ground Bounce Control and Improved Latch-up Suppression Through Substrate Conduction," IEEE J. Solid-State Circuits, vol. 23, no. 5, pp. 1224-1232, October 1988.

[43] M. Hashimoto and O-K Kwon, "Low dI/dt Noise and Reflection Free CMOS Signal Driver," in IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.4.1-14.4.4, 1989.

[44] T. Wada, M. Eino and K. Anami, "Simple Noise Model and Low-Noise Data-Output Buffer for Ultra-High-Speed Memories," IEEE J. Solid-State Circuits, vol. 25, no. 6, pp. 1586-1588, December 1990.

[45] R. Senthinathan and J. L. Prince, "Application Specific CMOS Output Driver Circuit Design Techniques to Reduce Simultaneous Switching Noise," IEEE J. Solid-State Circuits, vol. 28, no. 12, pp. 1383-1388, December 1993.

[46] T. Knight and A. Krymm, "A Self-Terminating Low-Voltage-Swing CMOS Output Driver," IEEE J. Solid-State Circuits, vol. 23, no. 2, pp. 457-464, April 1988.

[47] H-J Schumacher, J. Dikken and E. Seevinck, "CMOS Subnanosecond True ECL Output Buffer," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 150-154, February 1990.

[48] M. Pedersen and P. Metz, "A CMOS to 100K ECL Interface Circuit," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 226-227, February 1989.


[49] J. Martinez, "BTL Transceivers Enable High-Speed Bus Design," EDN, August 1992.

[50] B. Gunning, L. Yuan, T. Nguyen and T. Wong, "A CMOS Low-Voltage-Swing Transmission-Line Transceiver," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 58-59, February 1992.

[51] J. A. Quigley, J. S. Caravella and W. J. Neil, "Current Mode Transceiver Logic (CMTL) for Reduced Swing CMOS, Chip to Chip Communication," in IEEE International ASIC Conference and Exhibit, Rochester, NY, Tech. Dig., pp. 452-457, September 1993.

[52] M. Kakumu, "Process and Device Technologies of CMOS Devices for Low-Voltage Operation," IEICE Trans. Electron., vol. E76-C, no. 5, pp. 672-680, May 1993.

[53] T. Kawahara et al., "Subthreshold Current Reduction for Decoded-Driver by Self-Reverse-Biasing," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1136-1144, November 1993.

[54] S. Mutoh et al., "1 V High-Speed Digital Circuit Technology with 0.5-μm Multi-Threshold CMOS," in IEEE International ASIC Conference and Exhibit, Rochester, NY, Tech. Dig., pp. 186-189, September 1993.

[55] M. Horiguchi et al., "SSI CMOS Circuit for Low-Standby Subthreshold Current Giga-Scale LSI's," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1131-1135, November 1993.

[56] R. W. Badeau et al., "A 100-MHz Macropipelined VAX Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1585-1597, November 1992.

[57] R. Brodersen, A. Chandrakasan and S. Sheng, "Design Techniques for Portable Systems," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 168-169, February 1993.

[58] Y. Nakagome et al., "Sub-1-V Swing Internal Architecture for Future Low-Power ULSI's," IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 414-419, April 1993.

[59] A. Bellaouar, I. S. Abu-Khater, and M. I. Elmasry, "Low-Power CMOS/BiCMOS Drivers and Receivers for On-Chip Interconnects," IEEE J. Solid-State Circuits, vol. 30, no. 1, May 1995.

[60] A. Chandrakasan et al., "Low-Power CMOS Digital Design," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.

[61] L. J. Svensson and J. G. Koller, "Driving a Capacitive Load without Dissipating fCV²," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 100-101, October 1994.

[62] T. Gabara, "Pulsed Power Supply CMOS - PPS CMOS," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 98-99, October 1994.

[63] J. S. Denker, "A Review of Adiabatic Computing," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 94-97, October 1994.

[64] M. Horowitz, T. Indermaur, and R. Gonzalez, "Low-Power Digital Design," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 8-11, October 1994.

5 LOW-VOLTAGE VLSI BICMOS

CIRCUIT DESIGN

BiCMOS technology offers enhanced performance compared to CMOS at 5 V power supply voltage. Many high-speed BiCMOS SRAMs, gate arrays, ASICs, etc. have been fabricated [1]. In this chapter, we present a variety of BiCMOS logic circuits suitable for 3.3 V and sub-3.3 V operation. The potential gates for digital applications are identified. The chapter starts with the introduction of the conventional BiCMOS (totem-pole) gate which is used in 5 V applications. The degradation of this gate, with supply voltage scaling, is demonstrated. In Section 5.2, we introduce the BiNMOS family suitable for low-voltage applications. Other logic families, for low power supply voltage operation, are discussed in Section 5.3. Low-voltage digital applications of BiCMOS are identified. The reader is referred to BiCMOS books [2, 3] to get more familiar with BiCMOS circuits.

5.1 CONVENTIONAL BICMOS LOGIC

In this section, the conventional BiCMOS logic family is introduced. This family has been used successfully in many applications at 5 V power supply voltage. The reason for the speed advantage of BiCMOS compared to CMOS is explained. At low voltage, the performance degradation of conventional BiCMOS is shown.

The CMOS inverter of Fig. 5.1 suffers from limited current drive when the load capacitance is large. To increase the drive capability of CMOS, a bipolar driver can be added at the output of the CMOS inverter. Fig. 5.2 shows one possible configuration to construct what is called a conventional BiCMOS inverter. The addition of the bipolar driver stage to the basic CMOS inverter is responsible for the high current driving capability of BiCMOS over CMOS. As a result BiCMOS offers lower delay compared to that of CMOS, especially at high loading capacitance.

The operation of this gate is straightforward. When the input is low, the PMOS P is ON and its drain current turns the transistor Q1 ON. The collector current of Q1 charges the output load capacitance. As the output reaches VDD - VBEon, where VBEon is the turn-on voltage of the bipolar transistor and is about 0.7 V, Q1 gradually turns OFF. During this period, the NMOS transistor Nd2 is ON. Since Nd2 is conducting, Q2 is in the cutoff region. Transistor Nd2 can also be controlled by the output node. However, using the base node results in faster operation because the base of Q1 is pulled up faster than the output node and because the voltage level of the base node is larger. If the input is high, the NMOS transistors N and Nd1 are ON. Q1 is OFF while Q2 turns ON to discharge the output node. As a result, the load capacitance is pulled down. As the output Vo reaches VBEon, transistor Q2 turns OFF and the output stays at this level. The conventional BiCMOS gate provides high drive capability, zero static power dissipation and high input impedance. More discussions on this gate are given in the following sections.

Figure 5.2 Conventional BiCMOS inverter.

5.1.1 DC Characteristics

Fig. 5.3 shows the DC transfer characteristic of the conventional BiCMOS inverter of Fig. 5.2. When the input voltage to the BiCMOS inverter is zero, both bipolar transistors are OFF. The PMOS device P operates in the linear region with zero drain-source voltage. Due to the subthreshold current of the transistor N (~10 pA), the base-emitter voltage of Q1 is around 0.45 V. As a result, the output voltage Vo = 4.55 V (at VDD = 5 V). The base of the bipolar transistor Q2 is at zero voltage because Nd2 is ON.

As the input voltage increases, the subthreshold current of N increases, causing VBE,Q1 to rise and the output voltage to fall. When the input voltage is around mid-VDD, both the P and N MOSFETs are ON and operate in the saturation region. The bipolar devices are also ON. At this point, the BiCMOS inverter is in the high-gain region and the output voltage drops sharply towards its low level.

Figure 5.3 The DC transfer characteristic of the conventional BiCMOS at 5 V.

As the input voltage increases again, the base of Q2 follows the voltage of the output since N is ON. When the input voltage reaches VDD, the PMOS P is OFF. The discharge device, Nd1, is ON and the base of Q1 is at zero. Also, the output is completely discharged and N is ON. Then, the base of Q2 is at zero. In this case, the output voltage is zero and both base-emitter voltages are zero.

5.1.2 Transient Switching Characteristics

In this section we study the transient behavior of the conventional inverter of Fig. 5.2. The purpose of this analysis is threefold: i) to understand the transient switching behavior of the gate, ii) to develop a simple analytic model, and iii) to show the superiority of BiCMOS compared to CMOS. The objective of the delay analysis is to point out the important device and circuit parameters that affect the response of the gate. The developed model is very simple and can be used as a first-order approximation. We start with the analysis of the pull-up section. Then we show the difference in the case of the pull-down section. We assume a step input.

[Figure 5.4, panels (a) and (b); horizontal axis: Time (ns)]

5.1.2.1 Transient Behavior

Fig. 5.4 shows the transient behavior of the BiCMOS inverter of Fig. 5.2. When the input falls to ground, transistor P turns ON and operates initially in the saturation region. Its drain current charges the parasitic capacitances at the base and when VBE,Q1 = VBEon, Q1 turns ON. The emitter current increases in a relatively short time to its peak to charge the output load CL as shown in Fig. 5.4(b). The output voltage is pulled up following the base voltage of Q1 as shown in Fig. 5.4(a). As the base of Q1 exceeds VTn, Nd2 turns ON to discharge the base of Q2 to ground. But due to capacitive coupling, VBE,Q2 tends to be pulled up. When the base voltage is higher than VDD - VDSsat, where VDSsat is the saturation voltage of P, the PMOS transistor P enters the linear region and the drain (base) current drops gradually. Consequently, the emitter current of Q1 starts falling. As the output voltage Vo approaches the theoretical limit of VDD - VBEon, Q1 is expected to turn gradually OFF. However, due to the capacitive coupling between the base and the output node, Vo exceeds this limit as shown in Fig. 5.4(a). The same reasoning can be applied when the input rises to VDD.

5.1.2.2 Analytic Delay Model

A simple delay analysis is carried out in this section. The reader can refer to [4, 5, 6] for other detailed models. We take into account the parasitic capacitances and the bipolar high-current effects. We do not take into account the parasitic resistances since they have no appreciable effect with advanced bipolar technology. This model is based on a simple model [7].

Fig. 5.5 illustrates the transient equivalent circuit of the pull-up section (Fig. 5.2) of the conventional BiCMOS gate driving a load capacitance $C_L$. As we are interested in the 50% rise time, the PMOS current can be modeled by the saturation current of the device. This current is given by Equation (3.82) in Chapter 3

$$I_{DSsat} = K_p C_{ox} v_{sat,p} W_p \left(|V_{GS}| - |V_{Tp}|\right) \qquad (5.1)$$

where $V_{GS}$ is equal to $(V_{iL} - V_{DD})$, and $V_{iL}$ is the low level of the input. The capacitance $C_{p1}$ accounts for the parasitic capacitances of the MOS devices P, $N_{d1}$ and $N_{d2}$ at the base of the pull-up bipolar transistor. Therefore, it is given by

$$C_{p1} = C_{d,P} + C_{d,N_{d1}} + C_{g,N_{d2}} \qquad (5.2)$$

where $C_{d,P}$ and $C_{d,N_{d1}}$ are the drain junction capacitances of P and $N_{d1}$, and $C_{g,N_{d2}}$ is the gate oxide capacitance of $N_{d2}$. The overlap capacitances of P and $N_{d1}$ are assumed negligible. The bipolar parasitic capacitance $C_{p2}$ of Fig. 5.5(a) is given by

$$C_{p2} = C_{C,Q_1} + C_{E,Q_1} \qquad (5.3)$$

The total load capacitance, $C_o$, shown in Fig. 5.5(b), is given by

$$C_o = C_L + C_{S,Q_2} + C_{C,Q_2} \qquad (5.4)$$

where $C_L$ is the external load capacitance, $C_{S,Q_2}$ is the average collector-substrate capacitance of $Q_2$ and $C_{C,Q_2}$ is the average base-collector capacitance of $Q_2$. Recall from Section 3.5.3 that the base-emitter diffusion capacitance is given by

$$C_D = \tau_F \frac{dI_{C,Q_1}}{dV_{BE}} \qquad (5.5)$$

where $\tau_F$ is the forward transit time subject to high-level effects.

[Figure 5.5, panels (a) and (b); label: bipolar large-signal model]

The delay can be divided into three components:

1. The first component, $t_1$, is defined as the time required to turn $Q_1$ ON. The model of Fig. 5.5(a) can be used in this case. Writing the current equation at the base node of $Q_1$, we have

$$I_{DSsat} = (C_{p1} + C_{p2}) \frac{dV_{BE}}{dt} \qquad (5.6)$$

Solving that equation and assuming that initially the base-emitter voltage of $Q_1$ is zero, we have

$$t_1 = (C_{p1} + C_{p2}) \frac{V_{BEon}}{I_{DSsat}} \qquad (5.7)$$

If the initial $V_{BE}$ is not zero then the above expression should be corrected. A typical value of $t_1$ is 17.5 ps for a total parasitic capacitance at the base node of 50 fF, $V_{BEon}$ = 0.7 V, and $I_{DSsat}$ = 2 mA.

2. The second component, $t_2$, is defined as the time required to charge the diffusion capacitance, $C_{D,Q_1}$. Starting from $t_1$, the collector current begins to rise quickly and then reaches its peak value, $I_{CP}$. The output voltage changes slowly (see the waveforms of Fig. 5.4). So $t_2$ is defined as the time required for the collector current to reach its peak. This delay component is given by

$$t_2 I_{DSsat} = \tau_F I_{CP} \qquad (5.8)$$

which means that the charge furnished by the PMOS is needed to charge the diffusion capacitance. Therefore,

$$t_2 = \frac{\tau_F I_{CP}}{I_{DSsat}} \qquad (5.9)$$

The peak collector current of $Q_1$ can be approximated using Equation (3.111) [Section 3.5.2]. So we have

$$I_{CP} = \sqrt{\beta_0 I_{KF} I_{DSsat}} \qquad (5.10)$$

where $\beta_0$ is the value of the gain for low-level injection and $I_{KF}$ is the forward knee current. Note that $\tau_F$ is increased by the collector current [see Equation (3.127), Section 3.5.3]. Hence, an average value of the forward transit time should be used in the above delay expression. The initial value of $\tau_F$ is 12 ps and it can reach 50 ps when the collector current reaches, for example, 5 mA. For $I_{DSsat}$ = 2 mA, a typical value for $t_2$ is 78 ps (the average forward transit time is 31 ps).

3. The third component, $t_3$, is defined as the time required to charge the total load capacitance to the middle point of the output swing. If we assume that the voltage across the base-emitter of $Q_1$ is almost constant, then we have the following approximation

(5.11)

If we assume that $I_{C,Q_1}$ is constant during this time [see Fig. 5.4], and the mid-point of the output is $V_{DD}/2$, then we have

$$t_3 = \frac{C_o V_{DD}}{2\,(I_{DSsat} + I_{CP})} \qquad (5.12)$$

The value of this delay varies by more than an order of magnitude depending on the device size and the load capacitance. For example, for a load $C_L$ of 1 pF, this delay, $t_3$, has a typical value of 400 ps at 5 V power supply voltage, while for a load of 100 fF a typical value is 70 ps.

Hence, the total delay $t_d$ can be written as

$$t_d = t_1 + t_2 + t_3 \qquad (5.13)$$

The first delay is associated with the parasitics at the base, the second one with the forward transit time, and the last one is a function of the load capacitance. For small loads, $t_1$ and $t_2$ dominate. However, for large output loads, the third delay term, $t_3$, dominates.
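The three delay components can be evaluated with a minimal sketch that plugs in the typical values quoted in this section. The function names are illustrative. The expression used for t3 is an assumption taken from the delay-load slope VDD/2(IDSsat + ICP) stated in this section, and the output parasitics are ignored so that the total load is just the 1 pF external load. The values β0 = 100 and IKF = 0.125 mA are hypothetical, chosen only so that the peak current comes out at the 5 mA example value.

```python
import math


def peak_collector_current(beta0, i_kf, i_ds_sat):
    """High-injection peak collector current: sqrt(beta0 * I_KF * I_DSsat)."""
    return math.sqrt(beta0 * i_kf * i_ds_sat)


def bicmos_pullup_delay(c_base, v_be_on, i_ds_sat, tau_f, i_cp, c_out, v_dd):
    """Return (t1, t2, t3) of the first-order pull-up delay model."""
    t1 = c_base * v_be_on / i_ds_sat              # turn Q1 ON
    t2 = tau_f * i_cp / i_ds_sat                  # charge the diffusion capacitance
    t3 = c_out * v_dd / (2.0 * (i_ds_sat + i_cp))  # assumed form, from the quoted slope
    return t1, t2, t3


# Hypothetical bipolar parameters that give I_CP = 5 mA for I_DSsat = 2 mA.
i_cp = peak_collector_current(beta0=100.0, i_kf=0.125e-3, i_ds_sat=2e-3)

# Typical values from the text: 50 fF at the base, V_BEon = 0.7 V,
# I_DSsat = 2 mA, average tau_F = 31 ps, 1 pF load, V_DD = 5 V.
t1, t2, t3 = bicmos_pullup_delay(50e-15, 0.7, 2e-3, 31e-12, i_cp, 1e-12, 5.0)
t_d = t1 + t2 + t3

for name, value in (("t1", t1), ("t2", t2), ("t3", t3), ("td", t_d)):
    print(f"{name} = {value * 1e12:.1f} ps")
```

Under these assumptions t1 and t2 land on the 17.5 ps and roughly 78 ps quoted above, while t3 comes out somewhat below the quoted typical 400 ps, which is plausible for a first-order estimate that neglects the output parasitics. For the 1 pF load, t3 clearly dominates the total.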

The expression of the pull-down time is similar to that of the pull-up time except for the value of the drain current of the transistor N [see Fig. 5.2]. The saturation current of this device is given by

$$I_{DSsat} = K_n C_{ox} v_{sat,n} W_n \left(V_{GS} - V_{Tn}\right) \qquad (5.14)$$

The $V_{GS}$ of the NMOS during the switching is affected by the $V_{BE}$ drop while that of the PMOS is not. This voltage is given by

$$V_{GS} = V_{iH} - V_{BE} \qquad (5.15)$$

where $V_{iH}$ is the high level of the input. So the effective gate-source voltage of the NMOS is lower than that of the PMOS. The sizing of the NMOS and PMOS devices does not follow the rule used for CMOS. It can only be determined from circuit simulation to get symmetrical rise/fall delay times.

The slope of the delay-load characteristic of the BiCMOS gate is smaller than that of CMOS, since it is equal to VDD/[2(IDSsat + ICP)]. For a CMOS gate, the slope is simply VDD/(2 IDSsat). The saturation current in the CMOS is slightly higher than that of BiCMOS because the CMOS inverter has a slightly wider PMOS device [see next section]. However, the slope of the BiCMOS inverter is smaller due to the large ICP in the denominator. Therefore, the BiCMOS gate has a higher drivability than CMOS.


5.1.3 CMOS and BiCMOS Comparison

Let us compare the delay of a BiCMOS gate to that of a CMOS gate, both having the same input capacitance. We consider the case of inverters with the following sizes. For the BiCMOS inverter, we have: Wp = Wn = 10 μm, WNd1 = WNd2 = 2 μm, and the emitter area is 2× the minimum area. For the CMOS inverter, we have Wp = 15 μm and Wn = 7 μm. For unloaded inverters, and from the delay expression of the BiCMOS inverter discussed above, td,CMOS < td,BiCMOS because the BiCMOS circuit has more parasitics and requires an initial delay to turn ON the bipolar device. For large loads, td,CMOS > td,BiCMOS, as explained previously. Fig. 5.6 shows the simulated delays of the CMOS and BiCMOS inverters as a function of the fanout. Fanout is defined here as the ratio of the load seen by the gate to the input capacitance. In other words, fanout is equal to the number of gates connected to the output of the driving gate, all having the same input capacitance. The inputs are driven by a small-size inverter of the same type to have typical input waveform fall/rise times. For low fanout, 1 to 2, CMOS outperforms BiCMOS at 5 V power supply voltage. However, when the fanout is greater than 3, BiCMOS outperforms CMOS, particularly for high loads. In Fig. 5.6, the crossover capacitance (or fanout), denoted Cx, is typically in the order of 100 fF. This crossover value is critical for the performance of BiCMOS, particularly when the supply voltage is scaled down.

5.1.4 Power Dissipation

As discussed, the BiCMOS gate of Fig. 5.2 has no DC current path from VDD to VSS if the input has rail-to-rail swing. Hence the static power dissipation is negligible if VT of the MOS devices is high. The dynamic power dissipation of the gate can be estimated from the circuit diagram of Fig. 5.7.

It is estimated by

$$P_d = C_{p1} V_{DD}^2 f + C_{p2} V_{BE,max}^2 f + C_o V_{DD} (V_H - V_L) f \tag{5.16}$$

The first term is due to the total parasitic capacitance, C_p1, at the base node of Q1, where the swing is approximately VDD. The second term is also due to the parasitic capacitance, C_p2, at the base node of Q2. The swing at this node is limited to VBE,max when the collector current reaches its peak. Finally, the third term is related to the output capacitance, C_o, which combines the output load capacitance, CL, and the parasitic capacitance at the output. The swing is only VH - VL, where VH and VL are the high level and the low level of the output, respectively. These levels are affected by the output load.
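The three contributions to the dynamic power described above can be evaluated numerically. The sketch below simply adds the two base-node terms and the output term; every capacitance and voltage level in the example call is an assumed value for illustration, and the function and parameter names are not from the text.

```python
def bicmos_dynamic_power(c_p1, c_p2, c_out, vdd, vbe_max, v_high, v_low, freq):
    """Dynamic power (W) as the sum of the three terms described in the text."""
    base_q1 = c_p1 * vdd ** 2 * freq                 # base node of Q1, swing ~ VDD
    base_q2 = c_p2 * vbe_max ** 2 * freq             # base node of Q2, swing <= VBE,max
    output = c_out * vdd * (v_high - v_low) * freq   # output node, swing VH - VL
    return base_q1 + base_q2 + output

# Hypothetical operating point (assumed values, not from the text):
p_dyn = bicmos_dynamic_power(
    c_p1=20e-15, c_p2=20e-15, c_out=200e-15,
    vdd=5.0, vbe_max=0.9, v_high=4.3, v_low=0.7, freq=100e6,
)
print(f"dynamic power = {p_dyn * 1e3:.3f} mW")
```

With these assumed numbers the output term dominates, which is consistent with the remark that both gate types dissipate similar dynamic power at large loads.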


For small loads the power of BiCMOS is greater than that of CMOS, while for large loads they have almost the same dynamic power. Table 5.1 shows the simulation results of the power dissipation for both gates at 5 V power supply. At a fanout of 1, CMOS consumes much lower power than BiCMOS and it is faster. However, at a fanout of 10, the BiCMOS is faster (37.5% delay reduction) and it dissipates only 24% more power than CMOS.
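One way to weigh the two figures quoted for a fanout of 10 is to combine them into a relative power-delay product. This metric is introduced here only as an illustration; the sketch uses just the 37.5% delay reduction and the 24% power increase stated above.

```python
def relative_power_delay(delay_reduction, power_increase):
    """Power-delay product of BiCMOS relative to CMOS (1.0 means equal)."""
    return (1.0 - delay_reduction) * (1.0 + power_increase)

# Figures quoted in the text for a fanout of 10 at 5 V.
ratio = relative_power_delay(delay_reduction=0.375, power_increase=0.24)
print(f"BiCMOS / CMOS power-delay product = {ratio:.3f}")
```

A ratio below 1 indicates that, at this fanout, the speed gain outweighs the extra power in terms of energy per transition.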

When a BiCMOS gate is driving another BiCMOS gate, or a CMOS gate, the driven gate exhibits a DC power dissipation. This is due to the reduced swing at the output of the first gate. This DC current is not acceptable, particularly when the circuit is in standby mode. Fig. 5.8 shows an example of a BiCMOS gate driving a CMOS gate. If, for example, the low output level of the first gate (BiCMOS) is VBE, the VGS of the driven NMOS would be higher than zero and around VT, resulting in appreciable DC power. Furthermore, the drive current of the driven gate would be reduced, particularly at low power supply voltage. Another disadvantage of the reduced swing is the noise margin reduction.


Table 5.1 CMOS/BiCMOS power dissipation versus load at VDD = 5 V and f = 100 MHz.

Driver         Fanout=1   Fanout=5   Fanout=10
CMOS (mW)      0.23       0.58       1.02
BiCMOS (mW)    0.67       0.83       1.26

5.1.5 Full-Swing with Shunting Devices

Previously we have seen that BiCMOS circuits exhibit reduced output swing. To overcome this shortcoming, various types of BiCMOS gates have been devised. These are based on the conventional BiCMOS circuits with base-emitter or collector-emitter shunting techniques, or on other logic circuits which will be discussed in the following sections. Figure 5.9 shows some of the circuits based on shunting devices. Fig. 5.9(a) illustrates one full-swing (FS) configuration called "FS type" gate [8], which uses MOS devices to achieve full swing. For the charging phase, as the output exceeds VH, Q1 ceases to source current to the load, and the load capacitance is charged through the shunting PMOS transistor P_s. When the input goes HIGH, the load is discharged through N and N_s. When Vo falls below VL, Q2 ceases to sink current from the load capacitance. Then the output is discharged to ground through only the MOS transistors N and N_s. The final charging and discharging phases occur through the shunting devices. Hence, these phases can be slow because the MOS shunting devices have low drive capabilities. When this FS BiCMOS gate is operating at high frequency, the output swing can be reduced. Another drawback of this circuit is that part of the current supplied by P (N) is wasted through the shunting transistors, which weakens the bipolar drive. The shunting transistors P_s and N_s can be minimum size.

Figure 5.8 DC power dissipation of the driven gate: Gate 1 (BiCMOS) driving Gate 2 (CMOS).

The problem of the base drive inherent in the "FS type" BiCMOS gate can be overcome by using feedback (FB) from the output through an inverter, as shown in Fig. 5.9(b). This circuit is called "FB type" [9]. During the pull-up transition, the shunting device P_s is initially OFF and the PMOS transistor P supplies all its current to the base of Q1. When Vo approaches its high level, the inverter I turns ON P_s, which itself charges the output node to VDD. The pull-down transition can be explained similarly. The shunting devices P_s and N_s and the inverter I can be sized properly to achieve greater speed than the other configurations, even the conventional BiCMOS gate.

Figure 5.9 Full-swing BiCMOS gate types: (a) "FS type"; (b) "FB type"; (c) "CE-shunting type".

Another full-swing configuration is the one shown in Fig. 5.9(c). It uses a parallel inverter from the input to shunt the collector-emitter (CE) of Q1 and Q2 at the output. The disadvantage of this gate is the increased input capacitance.

5.1.6 Power Supply Voltage Scaling

The output bipolar stage introduces VBE voltage losses at the output node, as discussed earlier. When a BiCMOS gate is driving another BiCMOS gate, the conventional BiCMOS gate loses its superior performance over CMOS at lower power supply voltage. The major cause of this problem is the pull-down section of the BiCMOS gate. The VGS voltage of the driving NMOS transistor of the pull-down section is equal to VDD - 2VBE. As VDD is reduced, VGS is significantly reduced, resulting in degradation of the drain current and hence of the driving capability of the conventional BiCMOS gate. Fig. 5.10 shows the delay of a BiCMOS inverter in comparison to that of a CMOS inverter as the supply voltage is scaled down. The reported delay times were extracted from SPICE simulation by measuring the delay of the second gate in a chain of identical inverters. All gates were equally loaded by a load CL = 0.25 pF and one fanout. All the circuits have the same input capacitance. The BiCMOS inverter fails to operate at 2 V power supply. The BiCMOS outperforms CMOS, but at 3 V and sub-3 V it loses its superior performance.

The limit of operation of the conventional BiCMOS gate with the power supply voltage is determined by the NMOS device of the pull-down section. The drive current of this NMOS device depends on (VDD - 2VBE - VTn). Hence, VDD,min ≈ 2.2 V. Therefore, high-performance BiCMOS circuits, at low voltage, are needed that minimize

• Technology/process complexity;
• Circuit complexity by using fewer devices;
• Area occupied by the gate; and
• Power dissipation.
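The supply limit of about 2.2 V quoted above can be reproduced by requiring the gate overdrive of the pull-down NMOS, VDD - 2VBE - VTn, to stay positive. The sketch below assumes VBE = 0.75 V and VTn = 0.7 V; these two values are not given in the text and are chosen only because they are typical and yield the quoted limit.

```python
def nmos_overdrive(vdd, vbe, vtn):
    """Gate overdrive (V) of the pull-down NMOS in a conventional BiCMOS gate."""
    return vdd - 2.0 * vbe - vtn

def min_supply(vbe, vtn):
    """Supply voltage (V) at which the overdrive falls to zero."""
    return 2.0 * vbe + vtn

VBE, VTN = 0.75, 0.7  # assumed device values, not taken from the text

vdd_min = min_supply(VBE, VTN)
print(f"VDD,min = {vdd_min:.2f} V")
for vdd in (5.0, 3.3, 2.5):
    print(f"VDD = {vdd:.1f} V -> overdrive = {nmos_overdrive(vdd, VBE, VTN):.2f} V")
```

The loop shows how quickly the overdrive shrinks as the supply is scaled from 5 V toward the limit, which is the mechanism behind the loss of drive in the pull-down section.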


5.2 BINMOS LOGIC FAMILY

BiCMOS technology can gain much of its performance edge over CMOS with circuit techniques that minimize or eliminate the effects of VBE losses. To overcome the problem of delay degradation in conventional BiCMOS with supply voltage, many novel circuits were proposed. In this section, a practical family suitable for the 3.3 V and sub-3.3 V operation regime is outlined.

Fig. 5.11 shows the BiNMOS family of BiCMOS circuits. The basic circuit technique used in BiNMOS [10] is the use of the NPN bipolar transistor only in the pull-up section of the output stage [Fig. 5.11(a)]. The pull-down section is kept as CMOS. In CMOS circuits, the PMOS transistor is two to three times slower than an NMOS transistor when the same sizes are compared. In the BiNMOS circuit, the use of the PMOS, with the bipolar driver in the pull-up section, balances the unsymmetrical response of CMOS.

In the basic circuit of Fig. 5.11(a), the output reaches only the VDD - VBE level. This increases the delay and power dissipation of the subsequent gates. If a resistor (in this case the gate is called BiRNMOS) or a grounded-gate PMOS transistor is inserted between the emitter and the base of the pull-up bipolar transistor, the output achieves full swing. However, this degrades the speed of the gate because the base current is bypassed by the inserted element and hence is reduced.

Many alternatives have been proposed, such as BiPNMOS [11] and PBiNMOS [12], to realize full-swing output. The BiPNMOS is shown in Fig. 5.11(c). A small-size PMOS transistor and an inverter are added to the basic BiNMOS gate. The PMOS device realizes full-swing output when the output changes from low to high. The added PMOS turns ON only when the output reaches the threshold voltage of the feedback inverter. Hence, the base current supplied by the pull-up PMOS transistor is not affected by this added PMOS transistor. Consequently, the BiPNMOS gate has higher performance than conventional BiNMOS and BiRNMOS. One drawback of the BiPNMOS is the increased output load capacitance due to the inverter I.

The PBiNMOS gate configuration, shown in Fig. 5.11(d), uses a small-size PMOS device in parallel with the bipolar pull-up transistor to realize full-swing output. This configuration results in better performance compared to the other circuit structures but slightly increases the input capacitance of the gate. In this section, we show that a properly optimized PBiNMOS gate is faster than CMOS, even at low power supply and load.


5.2.1 BiNMOS Gate Design

In this section we discuss the effect of the circuit parameters available to the designer to optimize the PBiNMOS gate for low-fanout fast operation using the 0.8 µm BiCMOS device parameters discussed in Chapter 3. We optimize the design of the inverter. Then, the technique can be extended to more complex gates.

Finding the proper sizing of the input MOSFETs P and N (W_p and W_n, respectively) is not trivial. The sizing of the other two MOS devices [see Fig. 5.11(d)] is not critical. For typical applications, it is enough to use near-minimum-size devices. When the delay of the PBiNMOS is plotted versus the width of one of the devices P or N, for different fanouts, a common optimum width exists, as shown in Fig. 5.12(a), with a flattened region. This optimum is due to the fact that when the size is increased, the drivability of the gate increases. However, the equivalent output load also increases. Then, at a certain size, an optimum delay exists. From this figure, the optimum W_p is 9 µm and W_n = 11 µm (particularly for low fanout). Note that in Fig. 5.12(a), we have chosen W_p ≈ 0.8 W_n. This is explained in more detail below.

When the BiNMOS inverter is used as a driver of a fixed load (e.g., a bus), instead of driving gates, then we should consider the delay of the driver, including the delay of the stage that drives it. In Fig. 5.12(b), the total delay of the PBiNMOS driver and the CMOS inverter that drives it is plotted for two fixed loads: 0.2 pF and 0.5 pF. The CMOS stage has a minimum size. The minimum delay is around the point determined previously for the fanout case.

The choice of the emitter area in this gate depends on the technology and the load. For the 0.8 µm BiCMOS at 3.3 V power supply voltage, it was found that using the minimum emitter area (A_E × 1 = 0.8 × 4 µm²) gives the minimum delay for the range of loads ≤ 1 pF.

Fig. 5.13 shows that the optimal W_p/W_n ratio is the same for different fanouts and is equal to 0.8. This point also gives almost symmetrical fall/rise delays. So even if the fanout is unknown, the optimum gate is fixed and the sizes depend only on the device parameters. This result is very important for standard cells and gate arrays, where the cells are designed with unknown loads.
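As a quick check of the sizing rule, the optimum widths can be recovered from the ratio alone once the total input width is fixed. The sketch below assumes a total W_p + W_n of 20 µm (the value annotated on Fig. 5.13) and the ratio 0.8; the function name is illustrative.

```python
def split_width(total_width_um, ratio_p_to_n):
    """Split a fixed total input width into (W_p, W_n) for a given W_p/W_n ratio."""
    w_n = total_width_um / (1.0 + ratio_p_to_n)
    w_p = total_width_um - w_n
    return w_p, w_n

w_p, w_n = split_width(total_width_um=20.0, ratio_p_to_n=0.8)
print(f"W_p = {w_p:.1f} um, W_n = {w_n:.1f} um")
```

The result rounds to the 9 µm and 11 µm widths read from Fig. 5.12(a), so the two statements in the text are consistent with each other.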

Figure 5.13 The delay of the PBiNMOS inverter versus the ratio W_p/W_n for a fixed input capacitance (VDD = 3.3 V, W_p + W_n = 20 µm).

Figure 5.14 Comparison of the CMOS and PBiNMOS delays for the same input capacitance as a function of the fanout.

5.2.2 CMOS and BiNMOS Comparison

Fig. 5.14 shows the delay of CMOS and PBiNMOS inverters as a function of the fanout. Both gates have the same input capacitance. The important result of this plot is that the PBiNMOS gate is always faster than CMOS, except for a fanout of 1, where PBiNMOS is slightly slower. For a fanout of 3, which is a typical value in many designs, the delay is reduced by 20%. For a higher fanout, the delay is reduced by 25-40%. This result is quite different from the case of conventional BiCMOS, where a high fanout (or load) is required for BiCMOS to outperform CMOS.

Let us compare the power dissipation of the gates for different fanouts. Table 5.2 shows this comparison for small fanouts. The power dissipations of both gates are comparable and are nearly the same for a fanout > 3. The additional small-size bipolar device in the BiNMOS gate does not result in significant power dissipation overhead. This result shows that the BiNMOS family is an excellent choice for low-power and high-speed operation. However, for a fanout of 1-2, CMOS can still be used.

Table 5.2 CMOS/PBiNMOS power dissipation versus fanout at VDD = 3.3 V, f = 100 MHz.

Driver          Fanout=2   Fanout=3   Fanout=5
CMOS (µW)       149        192        277
PBiNMOS (µW)    171        203        287
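The power overhead of PBiNMOS over CMOS can be computed directly from the entries of Table 5.2. The sketch below uses the table values as printed and reads the first column as a fanout of 2, which is an interpretation of a partly illegible header.

```python
# Power dissipation in microwatts from Table 5.2: fanout -> (CMOS, PBiNMOS).
# The first fanout value (2) is read from a partly illegible column header.
power_uw = {2: (149, 171), 3: (192, 203), 5: (277, 287)}

def overhead_percent(cmos, pbinmos):
    """Extra power of PBiNMOS relative to CMOS, in percent."""
    return 100.0 * (pbinmos - cmos) / cmos

overheads = {fanout: overhead_percent(c, p) for fanout, (c, p) in power_uw.items()}
for fanout, value in overheads.items():
    print(f"fanout {fanout}: PBiNMOS overhead = {value:.1f} %")
```

The overhead falls steadily as the fanout grows, which supports the statement that the two gates dissipate nearly the same power for fanouts above 3.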

5.2.3 BiNMOS Logic Gates

Since the PBiNMOS is used extensively in 3.3 V digital integrated circuits, some logic gates are presented. Combinational PBiNMOS logic circuits are easily constructed using the basic PBiNMOS inverter of Fig. 5.11(d). Two-input NOR and NAND gates are shown in Fig. 5.15(a) and Fig. 5.15(b). The logic function is implemented using the PMOS and NMOS blocks as in CMOS technology. The bipolar device Q1 is used as a current driver. More complex functions can be implemented using standard CMOS gate formation theory. The layout of the PBiNMOS inverter is shown in Fig. 5.16. The BJT consumes area in the PBiNMOS gate. However, when complex gates are implemented with more MOS devices, the relative extra area of the BJT is reduced.

Figure 5.15 Circuit schematics of: (a) PBiNMOS NOR2; (b) PBiNMOS NAND2.

One technique to reduce the area penalty of the BJT is to use merged N-well bipolar and PMOS devices.

5.2.4 Power Supply Voltage Scaling

For future technologies, the power supply voltage will be scaled below 3.3 V. Fig. 5.17 shows the delay of PBiNMOS and CMOS inverters for a fanout of 3 versus the power supply voltage. The reported delay times were extracted from SPICE simulation by measuring the delay of the second gate in a chain of identical inverters. In this case, the full-swing operation at the input of a PBiNMOS inverter is provided by an identical gate, where a shunting PMOS is used. Fig. 5.17 shows that PBiNMOS is faster than CMOS down to 2.5 V. At 2.5 V the delay reduction is 15%. The crossover power supply voltage between PBiNMOS and CMOS is around 2.15 V. Note that in this comparison we used a 0.8 µm BiCMOS technology optimized for 5 V operation. In this case, to compare BiNMOS to CMOS at low voltage, deep-submicron technology should be used. From the device-level point of view, scaled technology is expected to improve the performance of BiNMOS at low voltage. However, 2 V is the limit of the use of BiNMOS, since almost half of the swing at sub-2 V is provided by the poor shunting PMOS device.

In summary, the BiNMOS family provides the following advantages:

• Simple gate compared to other BiCMOS logic circuits;
• Good performance at 3.3 and 2.5 V power supply voltage generations, even at low fanout; and
• Needs a simple BiCMOS process.

The only disadvantage of BiNMOS is its poor performance for sub-2 V operation. The small area penalty of BiNMOS is not a problem, since for complex gates the overhead of the bipolar device is minimized.

5.3 LOW-VOLTAGE BICMOS FAMILIES

In this section, several BiCMOS logic circuits proposed for low-voltage high-speed digital applications are reviewed [13]. Many of these circuits have not been widely used in BiCMOS products. However, some of the logic circuits presented in this section exhibit high performance at low voltage, down to approximately 1 V.

For fast operation at low voltage, the full-swing operation should be realized with bipolar devices. Otherwise, the techniques based on shunting devices do not provide high drivability.

5.3.1 Merged and Quasi-Complementary BiCMOS Logic

In this section two circuit techniques to overcome the shortcomings of the conventional BiCMOS gate are discussed and compared. These gates are intended to be used for sub-3.3 V operation. They are also devised to avoid the problem of using a PNP transistor (see next section on Complementary BiCMOS). In all these circuits, the improvement is made mainly on the pull-down section of the conventional BiCMOS, since it is the major cause of speed degradation at low voltage.

5.3.1.1 Merged BiCMOS (MBiCMOS)

To improve the performance of the pull-down section of the conventional BiCMOS circuit with power supply scaling, a PMOS/NPN pull-down BiCMOS gate has been proposed [14], as shown in Fig. 5.18. In this pull-down configuration, a PMOS transistor, P2, is used to drive the NPN bipolar transistor, Q2. The gate of the PMOS P2 is tied to the base of Q1. The CMOS inverter formed by the transistors P1 and N1 supplies rail-to-rail voltage swing to the pull-down PMOS. Hence, the VGS voltage of the driving PMOS transistor is not affected by the VBE loss as in the case of conventional BiCMOS. This gate is called Merged BiCMOS (MBiCMOS) because of the advantage of the gate for possible PMOS/NPN device merging.

The pull-up section is similar to the one in conventional BiCMOS. The operation of the pull-down section is as follows. When the input is high, N1 pulls the base of Q1 down to ground and P2 turns ON. The transistor P2 supplies the base current to Q2. The bipolar transistor Q2 discharges the load capacitance to a low voltage equal to or less than VBE(on).

Still, this structure suffers from the 2VBE losses. The only improvement in MBiCMOS, compared to conventional BiCMOS, is the higher drive current of the pull-down section. If the N-well of the pull-down PMOS transistor is tied to the VDD rail, its threshold voltage will experience a degradation due to the body effect during the pull-down transient. As a result, the drivability of the pull-down PMOS transistor is degraded. A simple solution to eliminate this problem is to shunt the source and the substrate of the PMOS transistor, P2.

Figure 5.18 The MBiCMOS structure.

It was shown that this configuration (with shunted source/substrate) is faster than its CMOS counterpart down to 2.2 V supply voltage using sub-0.5 µm BiCMOS technology [15, 16].

5.3.1.2 Quasi-Complementary BiCMOS

Another variation of the MBiCMOS is called "Quasi-complementary BiCMOS" (QCBiCMOS) [17]. A "quasi-PNP" connection is generated in the pull-down section of the conventional BiCMOS, as shown in Fig. 5.19. It consists of PMOS and NPN transistors (Fig. 5.19(b)). This configuration resembles the MBiCMOS gate of Fig. 5.18. The QCBiCMOS has two attractive features. The first one is that the drain current of the pull-down section does not suffer the 2VBE losses as in the case of conventional BiCMOS. The second one is that the pull-down waveform is steep, due to the good charge retention capability of the bipolar transistor. The feedback circuit formed by the two cross-coupled inverters, I1 and I2, permits the discharge of the base of the pull-down transistor immediately after the pull-down transition.

The QCBiCMOS gate keeps its superiority over CMOS down to 2 V. At 2 V it has better performance than the BiNMOS logic circuit. However, for sub-2 V operation, it loses its performance. Furthermore, it consumes large area and needs a relatively large fanout to outperform CMOS.


5.3.2 Emitter Follower Complementary BiCMOS Circuits

Full-swing operation can also be achieved by using what is called Complementary BiCMOS (CBiCMOS). The use of complementary BiCMOS has been encouraged by the recent advances in bipolar technology, which led to high-performance PNP transistors. It is expected that the NPN and PNP transistors will exhibit close performance when the devices are scaled down and the base doping increases. In this section, we study the emitter-follower (EF) CBiCMOS.

Fig. 5.20 shows the use of a complementary bipolar output stage to form the basic complementary BiCMOS circuit [18, 19]. The pull-up section is similar to that of the conventional BiCMOS. The pull-down section is symmetrical to the pull-up. The current of the NMOS transistor N does not suffer from the VBE reduction due to Q2 as in conventional BiCMOS. The static swing varies between VBE(on) and VDD - VBE(on). However, as explained in Section 5.1.2, the actual swing might be larger than the static design value. The balanced transconductance of the PMOS/NPN and NMOS/PNP pairs makes it easier to obtain symmetrical fall and rise times. Hence this circuit eliminates the degradation of the pull-down delay with power supply voltage seen in the conventional BiCMOS.

Figure 5.20 Schematic of the basic CBiCMOS.

The gate of Fig. 5.20 can be modified to achieve full-swing operation by using emitter-base shunting devices. Fig. 5.21(a) shows EF CBiCMOS with the shunting technique. The shunting MOS transistors of the base-emitter junctions permit restoration of the full logic level of the output. But the full swing is still achieved with the two slow MOS devices. Some of the base current can be consumed by the shunting devices, which weakens the drive of Q1 and Q2. To overcome this problem, the feedback technique can be used, as shown in the circuit of Fig. 5.21(b). The turn-ON of the shunting devices is delayed by the feedback inverter, I.

These CBiCMOS circuits have two drawbacks: poor performance at 2 V power supply voltage and below, and high processing cost because of the high-performance PNP device needed. This low performance at low voltage is due mainly to the fact that 2VBE of the output swing is generated by the two shunting transistors.
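The weight of the shunting devices at low supply voltage can be made concrete by computing the fraction of the rail-to-rail swing, 2VBE out of VDD, that the bipolar transistors cannot cover. The sketch assumes VBE = 0.7 V, a typical value that is not stated in the text.

```python
def shunted_swing_fraction(vdd, vbe=0.7):
    """Fraction of the output swing left to the MOS shunting devices (2*VBE / VDD)."""
    return 2.0 * vbe / vdd

# VBE = 0.7 V is an assumed value used only for illustration.
for vdd in (5.0, 3.3, 2.0):
    share = shunted_swing_fraction(vdd)
    print(f"VDD = {vdd:.1f} V -> {100 * share:.0f} % of the swing via shunting devices")
```

At 5 V the slow MOS devices handle under a third of the swing, whereas at 2 V they handle most of it, which is why the EF CBiCMOS gates lose their advantage at 2 V and below.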

5.3.3 Full-Swing Common-Emitter Complementary BiCMOS Circuits

So far all the presented full-swing circuits, such as PBiNMOS, CBiCMOS, MBiCMOS and QCBiCMOS, achieve the rail-to-rail swing by using resistors or MOSFETs that operate in the linear region. These techniques are effective only when the operating frequency is low, where the gate can complete its full-swing operation, and/or when the load capacitance is small [20]. Full-swing circuits with full bipolar drive are needed. In this section, a CBiCMOS variation suitable for sub-2 V operation, called Transient Saturation (TS), is presented.

Figure 5.21 Schematic of EF CBiCMOS gates with shunting devices.

Fig. 5.22 shows the basic common-emitter complementary BiCMOS (CE CBiCMOS) circuit. The circuit is symmetrical and has symmetrical fall and rise times. When the input goes high, N turns ON to sink current from the base of the PNP transistor Q2. When the base voltage of Q2 falls to VDD - VBE(on), Q2 turns ON to source current to the output load capacitance. Q2 eventually saturates and the output node is pulled up to VDD - VCE,sat. At the end of charging, the MOS device is still consuming current. The operation of the pull-down section can be explained similarly. Hence, the operation of CE CBiCMOS is noninverting, and the gate needs an extra CMOS inverter at the input to achieve the complement function. In this circuit, the MOS transistors operate in saturation, hence they supply high current for the bipolar transistors. Furthermore, the output swing is near rail-to-rail (VCE,sat to VDD - VCE,sat). This circuit offers high speed at low voltage, but has two drawbacks: (i) the high static power dissipation, due to the DC current flowing through the base of either Q1 or Q2, and (ii) the excess delay due to the slow process of turning the saturated BJTs OFF.

Figure 5.22 Common-emitter CBiCMOS gate.

These two problems have been solved with several implementations [21, 22]. One possible implementation is shown in Fig. 5.23. It is called Transient Saturation Full-Swing (TS-FS) BiCMOS. This logic uses the principle of CE CBiCMOS described in Fig. 5.22. When the input falls, we assume that the output is charged high, so P3 is ON. P2 turns ON and the base of Q1 is charged through P2 and P3 [Fig. 5.23(b)]. Consequently, Q1 discharges the output (load). When the output voltage approaches zero, the inverter I1 turns P3 OFF and N4 ON [Fig. 5.23(c)]. The base voltage of Q1 falls below VBE, causing it to turn OFF. Although Q1 saturates, this does not slow the next pull-up transition because the excess minority carriers of Q1 are discharged immediately after the pull-down operation. Thus, the bipolar transistor saturates transiently. The circuit is symmetrical, hence the operation of the pull-up section can be explained similarly. The PMOS transistor, P3, cuts off the DC current path during the pull-down transition to avoid any static power dissipation. The small-size output latch, composed of the inverters I1 and I2, holds the output level because in steady state there is no path between the output and the supply lines.

Compared to the BiCMOS logic circuits presented so far, TS-FS is faster below 2 V supply when the load is relatively large (≈ 1 pF). At 1.5 V it is twice as fast as CMOS for large loads. Although this circuit solves the problem of speed degradation of BiCMOS at 1.5 V power supply, it still has several drawbacks: process complexity due to the PNP bipolar transistor; large area; a relatively high crossover point with CMOS (≈ 0.4 pF); and it is a noninverting circuit.

Figure 5.23 (a) Circuit configuration of TS-FS BiCMOS; (b) and (c) transient saturation operation for the pull-down section.

5.3.4 Bootstrapped BiCMOS

An alternate way to avoid the negative effect of the VBE loss in BiCMOS is simply to use a second supply voltage equal to (VDD + VBE). However, this approach is costly because of the second supply voltage itself and the additional wires needed to distribute it across the chip. Another approach is to use a bootstrapping technique to pull up the base of the pull-up bipolar transistor to (VDD + VBE) and hence the output to VDD. The generation of voltages higher than the power supply at the gate level adds an extra degree of freedom to BiCMOS. Schottky BiNMOS/BiCMOS circuit configurations using bootstrapping have been proposed to overcome the negative effect of the VBE loss [20]. The full-swing operation is performed by saturating the bipolar transistor of the pull-up section with a base current pulse, after which the base is isolated and bootstrapped to a voltage higher than VDD. These Schottky circuits outperform all existing BiCMOS families in the sub-3 V regime down to 2 V, but they need a BiCMOS technology with a good integrated Schottky diode. Other examples of such a technique are the bootstrapped BiCMOS circuits published in [23, 24, 25]. The main advantage of the bootstrapped circuits is that they can be realized in a conventional BiCMOS process with CMOS and NPN transistors only. In this section, we present one bootstrapped circuit which overcomes many drawbacks of the BiCMOS logic families discussed previously.

5.3.4.1 Basic Concept of Operation

The Bootstrapped Full-swing BiCMOS (BFBiCMOS) inverter is shown in Fig. 5.24. It consists only of CMOS and NPN transistors. Hence, it can be built in a non-complementary BiCMOS technology. The pull-down circuitry is identical to that of TS-FS and was explained previously. The operating principle of the pull-up section can be explained as follows. When the input is high and the output is low, the PMOS transistor Pd is ON. In this case, the base voltage of Q1 is precharged to VTp, which is less than VBE(on) but close to it. The precharge PMOS transistor Pp is ON to charge the bootstrapped capacitor Cboot to the level VDD (precharge cycle). When the input goes low and P1 turns ON, the bipolar transistor Q1 turns ON almost instantaneously because its base-emitter junction is precharged near VBE(on). Consequently, the initial turn-on delay of the pull-up section is reduced. This has an impact on the minimum fanout required by BFBiCMOS to outperform CMOS. Once Q1 turns on, the output node starts to charge the load capacitor CL toward VDD. Since Pp is OFF, the node n1 is disconnected from VDD and is floating. Thus, as the output voltage Vo rises to VDD, the voltage at node n1 also rises towards VDD + VBE(on) (bootstrapping cycle).

When the input is low, the precharge PMOS transistor Pp has to turn OFF (almost instantaneously) during the bootstrapping cycle to prevent discharging the bootstrapped node through reverse current from n1 to VDD. This is achieved through the use of the pseudo-inverter formed by a PMOS and an NMOS device. During the bootstrapping cycle (the input is low), the PMOS of the pseudo-inverter turns ON and the gate of the precharge transistor Pp is pulled up towards the voltage of n1. Thus, Pp is completely OFF when the voltage at n1 exceeds VDD. Furthermore, the PMOS transistor Pd is completely OFF because its gate is driven by the boosted voltage through the pseudo-inverter PMOS.

Figure 5.24 The bootstrapped full-swing BiCMOS inverter (BFBiCMOS).

Compared to the Bootstrapped BiCMOS (BS-BiCMOS) [23] of Fig. 5.25, the BFBiCMOS has several advantages. First, the bootstrapped capacitor is driven by the output rather than by the input as in the BS-BiCMOS. Second, in BS-BiCMOS, the gate of the precharge transistor, Pp, is driven to VDD and the node n1 to VDD + VBE. Hence, when VT is lower than VBE, the bootstrapped node leaks its charge, which results in less efficient bootstrapping. Third, a PMOS transistor Pd is used to discharge the base to a precharged level VT, resulting in improved performance. Furthermore, the BS-BiCMOS has a higher crossover capacitance and lower performance than the BFBiCMOS.

Figure 5.25 The BS-BiCMOS inverter.

The simulated waveforms at 1.5 V power supply of the BFBiCMOS inverter are shown in Fig. 5.26. The base of Q1 goes to (VDD + VBE) when the input is low. Note that when the input is high the base voltage falls to VT.

5.3.4.2 Design Issues

As a first-order analysis, the minimum value of Cboot necessary for the bootstrapping condition can be obtained as follows. During the precharge cycle, the charge on the bootstrapped capacitor is VDD Cboot and the charge on Cp, the parasitic capacitance on the node n1, is VDD Cp. The total charge on n1 during the precharge cycle is

$$Q_{n1} = V_{DD} C_{boot} + V_{DD} C_p \tag{5.17}$$

In order for Vout to reach VDD, V_n1 must reach VDD + VBE(on) (during the bootstrapping cycle). Thus the charge on Cp is (VDD + VBE(on)) Cp and the charge on Cboot is VBE(on) Cboot. The new charge is given by

$$Q'_{n1} = V_{BE(on)} C_{boot} + (V_{DD} + V_{BE(on)}) C_p \tag{5.18}$$

The charge necessary for the base is

$$Q_b = Q_{n1} - Q'_{n1} \tag{5.19}$$

As an approximation, Q_b can be given by

$$Q_b = I_b t_r \tag{5.20}$$

where I_b is the average base current of Q1 and t_r is the rise time of the output. From Equations (5.17)-(5.20) we find that

$$C_{boot} = \frac{I_b t_r + V_{BE(on)} C_p}{V_{DD} - V_{BE(on)}}$$

This equation indicates that Cboot has to be increased as the power supply is scaled down. When power supply scaling is accompanied by device scaling, t_r improves and as a result Cboot can be kept small. At 3.3 V, a typical value of Cboot is 100 fF, while at 1.5 V, without technology scaling, it is equal to 250 fF.
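The dependence of the bootstrap capacitor on the supply voltage can be explored numerically. Subtracting the charge in (5.18) from that in (5.17) and equating the difference to I_b t_r gives (VDD - VBE(on)) Cboot - VBE(on) Cp = I_b t_r, which the sketch below solves for Cboot. The base current, rise time, parasitic capacitance and VBE(on) used in the example are assumed values, so the results show only the trend, not the 100 fF and 250 fF figures quoted above.

```python
def min_boot_capacitance(i_base, t_rise, c_par, vdd, vbe_on):
    """Minimum bootstrap capacitance (F) from the first-order charge balance."""
    if vdd <= vbe_on:
        raise ValueError("VDD must exceed VBE(on) for bootstrapping to work")
    return (i_base * t_rise + vbe_on * c_par) / (vdd - vbe_on)

# Assumed values for illustration only (not taken from the text):
I_BASE = 0.2e-3   # average base current, A
T_RISE = 1.0e-9   # output rise time, s
C_PAR = 20e-15    # parasitic capacitance on node n1, F
VBE_ON = 0.8      # base-emitter turn-on voltage, V

c_boot_3v3 = min_boot_capacitance(I_BASE, T_RISE, C_PAR, 3.3, VBE_ON)
c_boot_1v5 = min_boot_capacitance(I_BASE, T_RISE, C_PAR, 1.5, VBE_ON)
print(f"Cboot(3.3 V) = {c_boot_3v3 * 1e15:.1f} fF")
print(f"Cboot(1.5 V) = {c_boot_1v5 * 1e15:.1f} fF")
```

Holding the base charge fixed, the required capacitance grows sharply as VDD approaches VBE(on), in line with the statement that Cboot must be increased when the supply is scaled down.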

The bootstrapped capacitance can be implemented using an NMOS transistor with its source and drain connected together. In this case, the capacitance is related to the area and gate oxide thickness of the MOS transistor. Simulations have shown that for 1.5 V power supply voltage, the width and length of this bootstrapped NMOS are equal to 13 µm and 6 µm, respectively. A typical area increase for a two-input NAND gate due to Cboot is 10%.
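The relation between the capacitor dimensions and its value can be sketched with the parallel-plate approximation C = ε_ox W L / t_ox. The oxide thickness is not stated in this section, so the example works backwards: it computes the thickness that would make a 13 µm × 6 µm gate equal to 250 fF. This is an inference under an idealized model that ignores overlap and fringing capacitance, not a process parameter from the text.

```python
EPS_OX = 3.9 * 8.854e-12  # permittivity of SiO2, F/m

def gate_capacitance(width_um, length_um, t_ox_nm):
    """Parallel-plate gate capacitance (F) of a MOS capacitor."""
    area_m2 = (width_um * 1e-6) * (length_um * 1e-6)
    return EPS_OX * area_m2 / (t_ox_nm * 1e-9)

def implied_oxide_thickness_nm(width_um, length_um, cap_f):
    """Oxide thickness (nm) that yields cap_f for the given gate area."""
    area_m2 = (width_um * 1e-6) * (length_um * 1e-6)
    return EPS_OX * area_m2 / cap_f * 1e9

# Dimensions and target capacitance quoted in the text for 1.5 V operation.
t_ox = implied_oxide_thickness_nm(13.0, 6.0, 250e-15)
print(f"implied oxide thickness = {t_ox:.2f} nm")
print(f"check: C = {gate_capacitance(13.0, 6.0, t_ox) * 1e15:.1f} fF")
```

The same function can be used in the forward direction to size the capacitor for a different target value once the actual oxide thickness of the process is known.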

As shown in Fig. 5.24 of the BFBiCMOS inverter, the N-well of the PMOS devices Pp, P1 and the pseudo-inverter PMOS is connected to the bootstrapped node n1. This prevents their source/drain-well junctions from turning ON during the bootstrapping cycle. It also prevents any latch-up which might be caused by the parasitic SCR when the drain/source-well junctions are forward-biased. The PMOS transistor Pd also has its well connected to its source. This eliminates the body effect of the transistor and prevents any leakage during the bootstrapping.

5.3.4.3 BiNMOS Configuration

Fig. 5.27 shows the BiNMOS version of the bootstrapped circuit. The pull-down section uses an NMOS transistor (N1) as in CMOS.

The pull-up section of this BFBiNMOS configuration is slightly different from that of the BFBiCMOS, in that a small-size PMOS transistor (Pf) is added. Without this PMOS device, the base-emitter voltage of Q1 would be equal to $V_{BEon}$ when the output reaches $V_{DD}$. For a low output load, if the input goes high, the pull-down NMOS device, N1, discharges the output faster than the PMOS transistor Pd discharges the base. Thus, the bipolar transistor Q1 can turn ON to supply the output. This results in a high fall time delay. The added small-size PMOS transistor, Pf, in the pull-up section solves this problem. Through the use of inverter I2, it sets the voltage of nodes $n_1$ and B1 to $V_{DD}$ at the end of the bootstrapping. Hence, the base-emitter voltage of Q1 is almost equal to zero at the end of the bootstrapping. When the base is discharged from $(V_{DD} + V_{BEon})$ to $V_{DD}$ by the PMOS Pf, inverter I2 holds the output level at $V_{DD}$. Without this inverter, the output falls to a level equal to $(V_{DD} - V_{BE})$ due to the base-emitter coupling capacitance. The simulated waveforms of the different voltages are shown in Fig. 5.28.

Figure 5.27 Bootstrapped full-swing BiNMOS inverter (BFBiNMOS).

For an n-input gate implementation, the BFBiNMOS requires 4n input transistors, whereas the BFBiCMOS and the BS-BiCMOS require 5n and 6n input transistors, respectively. The crossover load capacitance represents one of the important parameters in circuit comparison. It is a measure of the load where BiCMOS circuits start to have a speed advantage over CMOS. In the range 1.2-3.3 V, BFBiCMOS/BFBiNMOS circuits require almost an equivalent minimum fanout of 5. The BS-BiCMOS circuits have a higher crossover capacitance.
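The input-transistor counts quoted above scale linearly with the number of gate inputs. A small sketch makes the comparison explicit; the function name and the dictionary are illustrative helpers, and only the per-input factors (4, 5 and 6) come from the text.

```python
# Input transistors required per gate input, as quoted in the text.
INPUT_TRANSISTORS_PER_INPUT = {
    "BFBiNMOS": 4,
    "BFBiCMOS": 5,
    "BS-BiCMOS": 6,
}


def input_transistor_count(family, n_inputs):
    """Number of input transistors for an n-input gate of the given family."""
    if n_inputs < 1:
        raise ValueError("a gate needs at least one input")
    return INPUT_TRANSISTORS_PER_INPUT[family] * n_inputs


for family in INPUT_TRANSISTORS_PER_INPUT:
    counts = [input_transistor_count(family, n) for n in (2, 3, 4)]
    print(f"{family:<10} 2/3/4-input gates: {counts}")
```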

Figure 5.28 Voltage waveforms of the input (in), the output (out), and the base of Q1.

5.3.5 Comparison of BiCMOS Logic Circuits

In this section, a brief comparison of several BiCMOS logic circuits is presented using a generic 0.35 µm BiCMOS technology given in Table 5.3. For a more detailed comparison, the reader can refer to [25].

A two-input NAND gate configuration was chosen to evaluate and compare the performance of the circuits shown in Fig. 5.29. The logic families compared are: CMOS [Fig. 5.29(a)], PBiNMOS [Fig. 5.29(b)], TS-FS [Fig. 5.29(c)], BS-BiCMOS [Fig. 5.29(d)], BFBiNMOS [Fig. 5.29(e)], and BFBiCMOS [Fig. 5.29(f)]. The simulations were carried out using a chain of gates. The reported 50% delay times are those of an intermediate gate.

Table 5.3 Key device parameters for 0.35 µm BiCMOS process.

0.35 µm   0.35 µm   0.33 µm   0.34 µm   4.9 mA   2.4 mA   @ V_GS = V_DS = 3.3 V, W = 10 µm

52 fF   73 fF   30 Ω   37 Ω   28 Ω   31 Ω   265 Ω   280 Ω

Table 5.4 shows the delay, the average power dissipation and the power-delay product of the different NAND gates at two supplies: 3.3 and 1.5 V. The simulation was carried out at a typical load capacitance of 1 pF.

The bootstrapped family consumes more power than CMOS because of the higher internal node capacitance. However, it provides a high speed of operation, particularly the BFBiCMOS, which has a factor of 3 speed advantage compared to CMOS at 1.5 V. Moreover, the delay-power product of the bootstrapped family is lower than that of CMOS. Notice that at 3.3 V, PBiNMOS has the lowest delay-power product and less delay than CMOS. BiNMOS at 1.5 V is slower than CMOS and is not reported in the table. These results also indicate that the use of the bootstrapped BiCMOS/BiNMOS gate would improve the delay-power product when $V_{DD}$ is scaled down to 1.5 V.


Logic Type   Delay (ps)   Power (µW/MHz)   Delay×Power (fJ/MHz)

TS-FS 20.0 18.5 26.4 7.6

TS-FS BS-BiCMOS BFBiNMOS

Delay   Power   Delay×Power

962 3.84 3.1 1175 4.60 3.2 686 3.50 4.1

5.3.6 Conclusion

We have demonstrated, in the previous sections, that the best family to use for a fanout higher than 5 is the bootstrapped BiCMOS for the power supply range of 1 to 3.3 V. However, due to the larger area it occupies, it can be used mainly in high-speed digital applications. Note that when the load is large, in the range of ~1 pF, the bootstrapped family provides a high speed and a good delay-power product. One drawback of this family, besides the large area, is that the bootstrapping is sensitive to the shape of the input voltage. One practical gate which can be used in several applications, even when the fanout is low, is the BiNMOS family. It has good performance for 3.3 and 2.5 V power supplies. It also provides a better delay-power product than CMOS. In the next section, many digital applications based on the BiNMOS family are outlined.


5.4 LOW-VOLTAGE BICMOS APPLICATIONS

In this section, we present the applications of BiCMOS digital circuits in the implementation of digital building blocks, microprocessors, memories, digital signal processors, and gate arrays. The BiNMOS family and its utilization in practical design at 3.3 V are emphasized. Many of the circuits cited are discussed in detail in Chapters 4, 6 and 7.

5.4.1 Microprocessors and Logic Circuits

BiNMOS logic has been used in several microprocessors [26, 27]. In this application, BiNMOS can be used for critical path delay reduction without increasing chip area, since BiNMOS needs only a low fanout to outperform CMOS. Among the critical paths, we cite:

- Decoders in the register file and the cache memory;

- Sense amplifiers and output buffers in the register file and the cache;

- Booth's encoder, Wallace tree, and the final adder in a multiplier;

- Arithmetic and logic unit in a microprocessor data path; and

- Critical path of the control unit.

In the microprocessor of [26], the PBiNMOS logic family is used at a 3.3 V power supply. The critical path of the control unit is reduced by 36% over CMOS. The BiNMOS gates keep their speed advantage even in the worst case ($V_{DD}$ = 2.7 V and T = 125 °C).

BiCMOS logic is not limited to conventional gates; many other logic styles can be devised. One such example is the pass-transistor BiNMOS used in the design of a 64-bit adder [28], similar to the CMOS CPL logic family discussed in Chapter 4. Fig. 5.30 shows an exclusive OR/NOR gate using the double-rail pass-transistor BiNMOS gate (abbreviated PT-BiCMOS). The outputs of the pass-transistor network are connected to the bases of the bipolar transistors Q1 and Q2 to reduce the intrinsic delay. The PMOS transistors P1 and P3 are cross-coupled to restore the high level of the pass logic to full $V_{DD}$. The PMOS transistors P2 and P4 charge the output to full swing. These transistors are subject to body effect, hence they turn ON later during transitions.


Fig. 5.31(a) compares the delay of exclusive OR and NOR gates using PT-BiCMOS, TG-type CMOS, and CPL-type CMOS in a 0.5 µm BiCMOS process at 3.3 V power supply voltage. A fanout of 1 is equivalent to a capacitance of 35 fF. The PT-BiCMOS gate is faster than the CMOS gates for any fanout. The power-delay product is also shown in Fig. 5.31(b). The TG gate has the best delay-power product for a fanout lower than 3. However, for a fanout greater than 3, the PT-BiCMOS gate is better.

This PT-BiCMOS has been used in the design of a 64-bit adder [28]. It is used mainly in the P, sum and carry blocks. A delay time of 3.5 ns was obtained for the 64-bit adder at 3.3 V, which is 25% better than the CMOS version. The area and power dissipation penalties of the PT-BiCMOS adder, compared to the CMOS, were 13% and 14% respectively. The speed advantage is kept down to almost 2 V.

5.4.2 Random Access Memories (RAMS)

One of the largest applications of BiCMOS is in RAM design, particularly Static RAMs (SRAMs). The first BiCMOS SRAM was proposed in 1985 [29], and many BiCMOS SRAMs were reported afterwards [30, 31, 32, 33, 34, 35, 36, 37]. The major applications of fast BiCMOS SRAMs are cache for workstations and main memory for supercomputers. Many BiCMOS SRAMs are in production



complexity. BiCMOS was limited to some periphery circuits due to layout-pitch matching. It was used in the I/O buffers, decoders and drivers, main sense amplifier and voltage down converter. In general, BiCMOS SRAMs and DRAMs are not suitable for low-power applications.

5.4.3 Digital Signal Processors

High-performance DSPs are needed in many applications such as video signal processors, convolvers, filters, etc. BiCMOS technology has been used successfully in DSPs operating at a frequency of 300 MHz [41, 42]. These DSPs operate at 3.3 V power supply voltage using the BiNMOS logic family. Among the characteristics of these BiCMOS DSPs, we cite:

- Parallel, pipelined architecture;

- High performance and high density of integration. In this case, critical data-path functional blocks are customized; and

- BiNMOS is used in blocks such as SRAM, ROM (Read Only Memory), ALU (Arithmetic Logic Unit), multiplier, clock driver, etc.

Fig. 5.33 shows a block diagram of a DSP [41]. This architecture can process any signal processing operation. The BiNMOS inverters are used as clock buffers to reduce the clock skew at 300 MHz clock frequency. The clock is distributed to about 1000 registers. A high clock frequency drastically increases the power and reduces the power supply voltage due to the power noise (an effect of the high dissipated current). The BiNMOS inverter used in the clock distribution is the conventional one, which has a high level of $V_{DD} - V_{BE}$. Hence, the dynamic power of the clock network is reduced by 17% compared to CMOS when using BiNMOS.
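The link between the reduced high level and the lower clock power can be sketched with the usual first-order expression for dynamic power, P = C · V_DD · V_swing · f, where the charge drawn from the supply per cycle is C · V_swing. This is an idealized estimate under assumed values: the 0.7 V base-emitter drop and the 100 pF clock load are hypothetical, and the model ignores the internal power of the buffers, so it is not expected to reproduce the 17% figure reported in the text.

```python
def dynamic_power(c_load, v_dd, v_swing, freq):
    """First-order dynamic power (W) of a node switching with a given swing."""
    return c_load * v_dd * v_swing * freq


def swing_power_saving(v_dd, v_be):
    """Fractional power saving when the swing drops from V_DD to V_DD - V_BE."""
    full = dynamic_power(1.0, v_dd, v_dd, 1.0)
    reduced = dynamic_power(1.0, v_dd, v_dd - v_be, 1.0)
    return 1.0 - reduced / full


V_DD = 3.3       # V, supply quoted in the text
V_BE = 0.7       # V, assumed base-emitter drop
F_CLK = 300e6    # Hz, clock frequency quoted in the text
C_CLK = 100e-12  # F, hypothetical total clock load

p_cmos = dynamic_power(C_CLK, V_DD, V_DD, F_CLK)
p_binmos = dynamic_power(C_CLK, V_DD, V_DD - V_BE, F_CLK)
print(f"CMOS clock power:   {p_cmos * 1e3:.1f} mW")
print(f"BiNMOS clock power: {p_binmos * 1e3:.1f} mW")
print(f"Ideal saving:       {swing_power_saving(V_DD, V_BE) * 100:.0f} %")
```

In this idealized model the saving reduces to V_BE / V_DD, so it depends only on the assumed base-emitter drop and the supply voltage.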

The BiNMOS logic is also used as:

- Output buffer of the Booth encoder of the convolver/multiplier block;

- Decoder driver of the register file; and

- Other drivers.

5.4.4 Gate Arrays

Gate arrays became very popular for a wide spectrum of applications because of their low cost and short turn-around time. Gate array chips consist of a large number of identical sites or basic cells which are usually placed in rows. The rows are separated by routing channels. The core of rows and channels is surrounded by I/O cells at the chip periphery as illustrated in Fig. 5.34.

Each of the basic cells is typically made up of a number of transistors which can be connected to form a two-input NAND or NOR gate or a simple latch. The only processing step that can be customized is the metallization. The user of a gate array can implement the system by specifying the required connections between the devices in each cell and then the connections between the various cells. This is done automatically using CAD tools. The number of metal levels used for wiring varies from 2 to 4. The first one or two levels are used for internal wiring of the cell and the upper levels (e.g. third and fourth) for wiring between the cells in the horizontal and vertical directions [43].


BiCMOS technology has been used extensively for building gate arrays and channelless gate arrays (sea-of-gates) [43, 44, 45, 46]. At 3.3 V power supply voltage, the BiNMOS logic family has been used [10, 11]. In [11], a BiPNMOS logic gate has been proposed for the channelless gate array. Fig. 5.35 shows a layout of a BiPNMOS basic cell in 0.5 µm BiCMOS technology. A bipolar transistor and a small-size MOS transistor are added to the pure CMOS basic cell. These transistors are used not only to implement BiPNMOS gates but also flip-flops, memory macros (RAM, ROM, and CAM), etc. A BiPNMOS two-input NAND gate has a 36% delay reduction compared to a similar CMOS gate for a fanout of 7. The speed advantage is maintained down to 2.5 V.

Figure 5.34 Gate-array floorplan.

5.4.5 Application Specific ICs (ASICs)

In order to realize high-performance ASICs, fast standard cell library macros for rapid design are important. This library contains custom functional macros such as: adder, Programmable Logic Array (PLA), register file, RAM, cache, Translation Look-aside Buffer (TLB), controller, etc. PBiNMOS logic has been used for such a standard cell library [12]. The cells of logic gates are designed in CMOS and PBiNMOS for the same logic functions. The PBiNMOS gates are used for a relatively high fanout and load, whereas CMOS gates are used for a small fanout. A CAD tool can be utilized to choose the most appropriate cells in the design.
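The fanout-based choice between the two cell versions can be pictured as a simple selection rule. The sketch below is purely illustrative and does not describe the CAD tool of [12]: the crossover fanout of 3 and the net names are hypothetical, and a real tool would work from extracted load capacitance and timing constraints rather than a fixed threshold.

```python
def choose_cell(fanout, crossover_fanout=3):
    """Pick the cell version for a gate driving the given fanout.

    crossover_fanout is a hypothetical threshold above which the PBiNMOS
    version is assumed to be faster than the CMOS version.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return "PBiNMOS" if fanout >= crossover_fanout else "CMOS"


# Hypothetical netlist: gate name -> fanout it has to drive.
netlist = {"nand_a": 1, "nand_b": 2, "decoder_drv": 8, "clk_buf": 16}

assignment = {gate: choose_cell(fo) for gate, fo in netlist.items()}
for gate, cell in assignment.items():
    print(f"{gate:<12} fanout {netlist[gate]:>2} -> {cell}")
```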


5.5 CHAPTER SUMMARY

In this chapter, we have demonstrated the advantage of using BiCMOS over CMOS in terms of speed. We have shown the historical evolution of the different BiCMOS logic families. A variety of alternative circuit techniques for low-voltage operation have been outlined and compared to the conventional BiCMOS. We have also shown how optimized BiNMOS gates are faster than CMOS even if the fanout is low (greater than 1). The design techniques can be extended to more complex gates and building blocks such as flip-flops, adders, etc. A variety of applications where BiCMOS, particularly BiNMOS, can be used at low voltage are reviewed. The addition of the bipolar transistor to CMOS to devise new structures enhances the performance of ICs. This feature improves the access time of memories, register files, ALUs, DSPs, etc. Notice that a large portion of a BiCMOS IC is implemented in CMOS, while bipolar transistors represent a small portion (0.5-4%) used for driving or sensing purposes. The power dissipation of BiCMOS circuits, compared to their CMOS counterparts, increases drastically if ECL is used because of the DC current. However, if only BiCMOS logic gates are used, the power increase is not significant compared to the speed enhancement. In some cases, such as the clock distribution network, the power dissipation is reduced when using BiNMOS.

REFERENCES

[1] A. R. Alvarez, "BiCMOS Technology and Applications," Kluwer Academic Pub., MA, Second Edition, 1993.

[2] S. H. K. Embabi, A. Bellaouar and M. I. Elmasry, "BiCMOS Digital Integrated Circuit Design", Kluwer Academic Pub., MA, 1993.

[3] M. I. Elmasry, "Design and Analysis of BiCMOS ICs", IEEE Press, 1994.

[4] G. P. Rosseel and R. W. Dutton, "Influence of Device Parameters on the Switching Speed of BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. 24, no. 1, pp. 90-99, February 1989.

[5] P. Raje, K. Chan, and K. Saraswat, "BiCMOS Gate Performance Optimization using Unified Delay Model," Symposium on VLSI Technology, Tech. Dig., pp. 91-92, 1990.

[6] S. H. K. Embabi, A. Bellaouar, and M. I. Elmasry, "Analysis and Optimization of BiCMOS Digital Circuit Structures," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 676-679, April 1991.

[7] P. A. Raje, K. C. Saraswat and K. M. Cham, "Performance-driven Scaling of BiCMOS Technology", IEEE Trans. on Electron Devices, ED-39, no. 3, pp. 685-693, March 1992.

[8] J. Gallia, et al., "High-Performance BiCMOS 100K-Gate Array," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.

[9] Y. Nishio, et al., "A BiCMOS Logic Gate with Positive Feedback," International Solid-State Circuits Conference, Tech. Dig., pp. 116-117, February 1989.

[10] A. E. Gamal et al., "BiNMOS a Basic Cell for BiCMOS Logic Circuits", in Custom Integrated Circuits Conf., Tech. Dig., pp. 8.3.1-8.3.4, 1989.

[11] H. Hara et al., "0.5-µm 2M-Transistor BiPNMOS Channelless Gate Array", IEEE Journal Solid-State Circuits, vol. 26, no. 11, pp. 1615-1620, November 1991.


[12] H. Hara et al., "0.5-µm 3.3-V BiCMOS Standard Cells with 32-kb Cache and Ten-Port Register File", IEEE Journal Solid-State Circuits, vol. 27, no. 11, pp. 1579-1584, November 1992.

[13] M. I. Elmasry and A. Bellaouar, "BiCMOS at Low-Supply Voltage," in IEEE Bipolar/BiCMOS Circuits and Technology Meeting, pp. 89-96, October 1993.

[14] P. Raje, et al., "MBiCMOS: A Device and Circuit Technique for Sub-micron, sub-2 V Regime," International Solid-State Circuits Conference, Tech. Dig., pp. 150-151, 1991.

[15] P. G. Y. Tsui et al., "Study of BiCMOS Logic Gate Configurations for Improved Low-Voltage Performance", IEEE Journal Solid-State Circuits, vol. 28, no. 3, pp. 371-374, March 1993.

[16] S. W. Sun et al., "A Fully Complementary BiCMOS Technology for Sub-Half-Micrometer Microprocessor Applications", IEEE Trans. Electron Devices, vol. 39, no. 12, pp. 2733-2739, December 1992.

[17] K. Yano et al., "Quasi-Complementary BiCMOS for Sub-3V Digital Circuits", IEEE Journal Solid-State Circuits, vol. 26, no. 11, pp. 1708-1719, November 1991.

[18] A. Watanabe et al., "Future BiCMOS Technologies for Scaled Supply Voltage", International Electron Devices Meeting, Tech. Dig., pp. 429-433, December 1989.

[19] A. J. Shin et al., "Full-swing CBiCMOS Logic Circuits", in IEEE Bipolar/BiCMOS Circuits and Technology Meeting, Tech. Dig., pp. 229-233, September 1989.

[20] A. Bellaouar, I. S. Abu-Khater, M. I. Elmasry, and A. Chekims, "Full-Swing Schottky BiCMOS/BiNMOS and the Effects of Operating Frequency and Supply Voltage Scaling," IEEE Journal of Solid-State Circuits, vol. 29, no. 6, pp. 693-700, June 1994.

[21] S. H. K. Embabi, A. Bellaouar, M. I. Elmasry, and R. A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers", IEEE Journal Solid-State Circuits, vol. 26, no. 2, pp. 150-153, February 1991.

[22] M. Hiraki et al., "A 1.5-V Full-Swing BiCMOS Logic Circuit", IEEE Journal Solid-State Circuits, vol. 27, no. 11, pp. 1568-1574, November 1992.

[23] R. Y. V. Chik and C. A. T. Salama, "1.5 V Bootstrapped BiCMOS Logic Gate", IEE Electronics Letters, vol. 29, no. 3, pp. 301-309, February 1993.

[24] S. H. K. Embabi, A. Bellaouar, and K. Islam, "A Bootstrapped Bipolar CMOS (B²CMOS) Gate for Low Voltage Applications," IEEE Journal of Solid-State Circuits, vol. 30, no. 1, pp. 47-53, January 1995.

[25] A. Bellaouar, M. I. Elmasry, and S. H. K. Embabi, "Bootstrapped Full-Swing BiCMOS/BiNMOS Logic Circuits for 1.2-3.3 V Supply Voltage Regime," IEEE Journal of Solid-State Circuits, vol. 30, no. 6, June 1995.

[26] J. Schutz, "A 3.3 V 0.6 µm BiCMOS Superscalar Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 202-203, 1994.

[27] F. Murabayashi, et al., "3.3 V, Novel Circuit Techniques for a 2.8-Million-Transistor BiCMOS RISC Microprocessor," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.1.1-12.1.4, May 1993.

[28] K. Ueda, H. Suzuki, K. Suda, Y. Tsujihashi, H. Shinohara, "A 64-bit Adder By Pass Transistor BiCMOS Circuit," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.2.1-12.2.4, May 1993.

[29] K. Ogiue, et al., "A 15 ns/250 mW 64K Static RAM," in ICCD, Tech. Dig., pp. 17-20, 1985.

[30] H. Tran et al., "An 8-ns 1-Mb ECL BiCMOS SRAM with a Configurable Memory Array Size," International Solid-State Circuits Conf., Tech. Dig., pp. 36-37, February 1989.

[31] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International Solid-State Circuits Conf., Tech. Dig., pp. 38-39, February 1989.

[32] Y. Maki et al., "A 6.5-ns 1-Mb BiCMOS ECL SRAM," International Solid-State Circuits Conf., Tech. Dig., pp. 136-137, February 1990.

[33] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.

[34] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free Redundancy," in Symp. VLSI Circuits Conf., Tech. Dig., pp. 41-42, May 1990.

[35] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing Circuit," International Solid-State Circuits Conf., Tech. Dig., pp. 54-55, February 1991.

[36] N. Tamba et al., "A 1.5 ns 256Kb BiCMOS SRAM with 11K 60 ps Logic Gates," International Solid-State Circuits Conf., Tech. Dig., pp. 246-247, February 1993.


[37] K. Nakamura et al., "A 200-MHz Pipelined 16-Mb BiCMOS SRAM with PLL Proportional Self-Timing Generator," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1317-1322, November 1994.

[38] G. Kitsukawa, et al., "An Experimental 1-Mb BiCMOS DRAM," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 657-662, October 1987.

[39] S. Watanabe, et al., "BiCMOS Circuit Technology for High Speed DRAMs," Symposium on VLSI Circuits, Tech. Dig., pp. 79-80, 1987.

[40] G. Kitsukawa, et al., "Design of ECL 1-Mb BiCMOS DRAM," Electronics and Communications in Japan, Part 2, vol. 76, no. 5, pp. 89-102, 1992.

[41] M. Namura et al., "A 300-MHz, 16-bit, 0.5-µm BiCMOS Digital Signal Processor Core LSI," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.6.1-12.6.4, May 1993.

[42] T. Inoue, et al., "A 300-MHz 16-bit BiCMOS Video Signal Processor," IEEE Journal of Solid-State Circuits, vol. 28, no. 12, pp. 1321-1329, December 1993.

[43] F. Murabayashi, et al., "A 0.5 micron BiCMOS Channelless Gate Array," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.7.1-8.7.4, May 1989.

[44] H. Hara, et al., "A 350 ps 50K 0.8 micron BiCMOS Gate Array with Shared Bipolar Cell Structure," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.5.1-8.5.4, May 1989.

[45] J. D. Gallia, et al., "High-Performance BiCMOS 100K-Gate Array," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.

[46] T. Hanibuchi, et al., "A Bipolar-PMOS Merged Basic Cell for 0.8 micron BiCMOS Sea of Gates," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, pp. 427-431, March 1991.

6 LOW-POWER CMOS RANDOM ACCESS MEMORY CIRCUITS

Low-power Random Access Memory (RAM) has seen remarkable and rapid progress in power reduction. Many circuit techniques for active and standby power reduction in static and dynamic RAMs have been devised. In this chapter we study low-power memory circuit techniques which are also very interesting for several other applications. Among these circuits, we examine memory cells, sense amplifiers, precharging circuits, etc. Circuit techniques for 1.x V power supply are also discussed. The voltage targets using NiCd and Mn batteries are 1.2 and 1.5 V respectively. The minimum voltage of a NiCd cell is 0.9 V. We also consider the Voltage Down Converters (VDCs) which are used in memories and processors. No consideration is given to the details of designing a complete memory chip because a single configuration requires an entire book.

6.1 STATIC RAM (SRAM)

Today, workstations, computers and supercomputers are demanding high-speed and high-density SRAMs, e.g., cache memories. These systems started to use 4-Mb fast SRAMs and will require, in the future, larger density memories with faster access time. Many 1-to-4-Mb BiCMOS SRAMs [1, 2, 3, 4, 5, 6] have achieved access times of 5 to 10 ns. In these SRAMs, the power dissipation is 275 to 1000 mW, which is not acceptable in many applications. On the other hand, high-density, low-power SRAMs are needed for applications such as hand-held terminals, laptops, notebooks and IC memory cards. Table 6.1 shows examples of high-density SRAMs with low-power characteristics. The standby current is on the order of 1 µA and sub-µA, which is suitable for battery-backup operation.

Table 6.1 Examples of high-density SRAMs with low-power characteristics.

Memory size (Ref.)   Power supply   CMOS technology   Access time   Power dissipation
1-Mb [7]             3.0 V          0.35 µm           7 ns          140 mW @ 100 MHz
4-Mb [8]             5.0 V          0.50 µm           23 ns         100 mW @ 10 MHz
4-Mb [9]             3.0 V          0.60 µm           68 ns         21 mW @ 10 MHz
16-Mb [10]           2.5 V          0.25 µm           15 ns         120 mW @ 20 MHz
16-Mb [11]           3.0 V          0.40 µm           15 ns         165 mW @ 30 MHz
16-Mb [12]           3.3 V          0.35 µm           9 ns          238 mW @ 30 MHz

The power dissipation reduction in SRAMs is not only due to power supply voltage reduction, but also to low-power circuit techniques. In this section we review some of these circuit techniques for low-power applications.

SRAMs have several advantages over Dynamic RAMs (DRAMs), such as:

- No refresh operation of the memory cells is needed.

- The speed of an SRAM is higher because of the differential pair of bit-lines.

- The operational modes are simpler because the row and column address signals are simultaneously loaded.

- A low data retention current, which is required by battery applications.

However, SRAMs have the great disadvantage of a large memory cell compared to DRAMs. For this reason, their capacities are smaller than those of DRAMs.

6.1.1 Basics of SRAMs

In order to treat the different circuit parts of an SRAM, it is important to understand some characteristics of these memories. In general the pins of an SRAM are:

1. Addresses (A0 ... An), which define the memory location;

2. Write Enable ($\overline{WE}$), which selects between the read and write modes;

3. Chip Select ($\overline{CS}$), which selects one memory out of several within a system;

4. Output Enable ($\overline{OE}$), which is used to enable the output buffer;

5. Input/Output data (I/O); and

6. Power supply pins.

A timing diagram for the read cycle is shown in Fig. 6.1(a). During this time the data stored in a specific SRAM location (defined by the address) is read out. For a read cycle, two times are shown in the figure: the read cycle time, $t_{RC}$, and the address access time, $t_{AA}$. Fig. 6.1(b) shows the write cycle, which permits the data in an SRAM to be changed. Two times are indicated: the write cycle time, $t_{WC}$, and the write recovery time, $t_{WR}$. Some of this information is used in this chapter. For more detail on the timing, the reader can refer to any memory data book.

A typical SRAM architecture is shown in Fig. 6.2. The memory array contains the memory cells, which are readable and writable. The row decoder (X-decoder) selects 1 out of $n = 2^k$ rows, while the column decoder (Y-decoder) selects $l = 2^i$ out of $m = 2^j$ columns. The addresses (row and column) are not multiplexed as in the case of a DRAM. Sense amplifiers detect small voltage variations on the complementary memory bit-lines, which reduces the reading time. The conditioning circuit permits the precharge of the bit-lines. The access time is determined by the critical path from the address input to the data output as shown in Fig. 6.3. This path contains the address input buffer, row decoder, memory cell array, sense amplifier and output buffer circuits. The word-line decoding and bit-line sensing delay times are critical delay components. To reduce the sensing time during a read operation, the swing on the bit-lines should be as small as possible.
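The non-multiplexed addressing described above can be sketched in a few lines: the full address arrives at once and is split into a row field for the X-decoder and a column field for the Y-decoder. This is a behavioral illustration only; the bit ordering (row bits in the upper part of the address) and the function names are assumptions and are not taken from Fig. 6.2.

```python
def split_address(address, row_bits, col_bits):
    """Split a flat address into (row index, column-group index).

    Assumes the row field occupies the upper bits of the address.
    """
    if not 0 <= address < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = address >> col_bits
    col = address & ((1 << col_bits) - 1)
    return row, col


def one_hot(index, width):
    """Decoder output: exactly one of `width` select lines is asserted."""
    if not 0 <= index < width:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(width)]


# Example with 3 row bits (8 word-lines) and 3 column bits (8 column groups).
row, col = split_address(0b101101, row_bits=3, col_bits=3)
word_lines = one_hot(row, 1 << 3)
column_select = one_hot(col, 1 << 3)
print("row:", row, "word-lines:", word_lines)
print("column group:", col, "column select:", column_select)
```

Because both fields are available simultaneously, the row and column decoders can operate in parallel, which is the property contrasted with the multiplexed addressing of a DRAM.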

For an asynchronous¹ SRAM, a special circuit called an Address Transition Detection (ATD) circuit permits the generation of internal pulses. These pulses are of two types: activation and equalization. Activation pulses selectively activate particular circuits, while equalization pulses permit the reduction of the delay by restoring and equalizing differential nodes prior to being selected. In this section we treat only asynchronous SRAMs.

¹ Not clocked externally.

Figure 6.1 Typical timing of an SRAM: (a) read cycle; (b) write cycle.

6.1.2 Static RAM Cells

The memory cell is an important circuit in the design of low-power and high-density SRAMs because the memory size is dominated by the cell area. There are various static memory cells. The cell of Fig. 6.4 has six transistors, in the form of two cross-coupled inverters with two pass-transistors connected to two complementary bit-lines BL and $\overline{BL}$. The pass-transistors are controlled by the signal WL (word-line).

During the read cycle, the bit-lines are held high (precharged). Assume that a "0" is stored at node A and a "1" is stored at node B. When the cell is selected, i.e., WL is set to "1", BL is discharged through N1 and N3.

To write in the cell, one of the bit-lines is pulled low and the other high, and then the cell is selected by WL. Assume that BL is set to "0" while initially a "1" is stored at node A ("0" at B). N1 and P1 should be sized such that node A is pulled down enough to turn P2 ON. This in turn causes node B to be pulled up. The cross-coupled inverter pair has a high gain, which causes the nodes A and B to switch to opposite voltages. The data retention (standby) current of this cell can be as low as $10^{-15}$ A. Although this full-CMOS cell has a low retention current, the cell area is so large that it does not allow high-density SRAMs. A typical cell area using 0.8 µm design rules is 75 µm².

The stability of the memory cell is its ability to hold a stable state. Fig. 6.5(a) shows the transfer curves of full CMOS SRAMs. The box between the two characteristics (I and II) defines the Static Noise Margin (SNM). Static noise is DC disturbance, such as offsets and mismatches, due to the processing and variations in process conditions. The SNM is defined as the maximum value of $V_n$ (the static noise source shown in Fig. 6.5(b)) that can be tolerated by the cross-coupled inverters before altering state. An important parameter in SNM is the memory cell ratio, $r$, defined by

$$r = \frac{(W/L)_{N_d}}{(W/L)_{N_a}} \qquad (6.1)$$

where transistors $N_a$ and $N_d$ are the access and driver NMOS transistors shown in Fig. 6.4. An analysis of SNM for memory cells is given in [13]. The static noise margin increases with the ratio $r$. However, it is limited by the cell area constraint. The stability of the cell is maintained even if $V_{DD}$ is scaled down.

Figure 6.4 CMOS memory cell with PMOS load.

Another memory cell configuration is shown in Fig. 6.6. This cell is similar to the full CMOS memory cell, except that the PMOS pull-up devices are replaced by high-resistance polysilicon loads. The memory cell area can be about 30% to 40% smaller than the CMOS six-transistor memory cell, because the two polysilicon resistances can be formed on top of the two NMOS driver transistors. The High Resistive Load (HRL) memory cell has been used in several SRAM generations from 4 Kb. The high-state storage node of Fig. 6.6 will be pulled down with time due to two kinds of leakage current: the leakage current of the drain junction and the subthreshold current. The voltage drop across the resistance R prevents regular cell operation if the leakage current reaches the level of the poly-Si resistor current. In several SRAM generations using the HRL memory cell, the total standby current was set to 1 µA per chip at room temperature for battery-backup applications. Thus, for each memory generation with quadrupled density, the poly-Si resistance value is also quadrupled. For a 4-Mb chip which has a total standby current less than 1 µA, typical values of resistance are in the $5 \times 10^{13}\,\Omega$ range and the resistance current is limited to $10^{-13}$ A. This current should be much larger than the total leakage current of the storage node of the cell to improve the data retention margin. The leakage current cannot be scaled because, first, the subthreshold current per channel width tends to increase, particularly with the trend to decrease the threshold voltage for low-voltage operation. Second, the leakage current of the drain junction per unit area tends to increase with technology scaling. Moreover, the junction area shrinks at a rate lower than the SRAM density increase rate. In [14], it was determined that the maximum SRAM capacity for low-power applications using an HRL memory cell is 4 Mb, where the retention current is 1 µA.
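The scaling argument for the load resistance can be checked with a short budget calculation: a fixed standby current per chip is divided among the cells, and the load resistance follows from Ohm's law. This is a simplified sketch under stated assumptions: the 5 V supply is assumed (it is not stated in the text), one conducting load per cell is assumed, and all leakage is ignored.

```python
def per_cell_current(i_standby_total, n_cells):
    """Average load-current budget per cell (A) for a given chip standby current."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return i_standby_total / n_cells


def load_resistance(v_dd, i_cell):
    """Resistance (ohm) that limits the load current to i_cell at v_dd."""
    return v_dd / i_cell


I_STANDBY = 1e-6   # A, 1 uA per chip as quoted in the text
V_DD = 5.0         # V, assumed supply voltage (not stated in the text)

for mbits in (1, 4, 16):
    n_cells = mbits * 2**20
    i_cell = per_cell_current(I_STANDBY, n_cells)
    r_load = load_resistance(V_DD, i_cell)
    print(f"{mbits:>2}-Mb: {i_cell:.2e} A per cell, R = {r_load:.2e} ohm")
```

Each quadrupling of the density divides the per-cell budget by four and multiplies the required resistance by four, so the load current approaches the level of the cell leakage currents, which is the retention limit discussed above.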

Note that the high-level node voltages of all poly-Si load memory cells are $(V_{DD} - V_T)$ after a write cycle, where $V_T$ is the threshold voltage of the access transistor, subject to body effect. These nodes need a time of several ms to charge up to $V_{DD}$. The SNM of the poly-Si load memory cell is more sensitive to the cell ratio $r$ than that of the full CMOS cell [13]. A typical value of $r$ is 3. Also, the cell stability is drastically degraded when $V_{DD}$ is 3 V or less. The transfer curves in the read mode can be easily plotted for different $V_{DD}$ to find out that the cell cannot store the data at a certain low voltage.


For 4 Mb and higher density SRAMs, the polysilicon load cell starts to be replaced by a polysilicon PMOS load called PMOS Thin Film Transistor (TFT) for low-power applications [8, 9, 15]. Fig. 6.7 shows a cross section and circuit diagram of the poly-Si PMOS load memory cell [8]. The TFT device is fabricated from amorphous silicon (a-Si). This material has a grain size of 2 µm while that of the conventional poly-Si material is 0.03 µm. The thickness of this a-Si is 100 nm and the gate oxide thickness of the TFT is 40 nm. This technology results in improved ON/OFF currents compared to the one using poly-Si. The N+ drain area of the NMOS transistor is used as the gate electrode for the PMOS TFT. To obtain a small area, the polysilicon PMOS must be stacked on the NMOS driver. The second polysilicon layer forms the channel regions. The TFT memory cell area is more than 40% smaller than the full CMOS one.

Fig. 6.8 shows the drain current of a PMOS TFT used in a 4-Mb SRAM as a function of the gate voltage. An ON current of more than 10^-7 A is obtained at a supply voltage of 3 V, while an OFF current of 10^-13 A is attained. The ON current is larger by more than six orders of magnitude than the memory cell leakage currents, which is much better than the current of the HRL cell. Thus, it results in an excellent data retention characteristic. Moreover, the very low OFF current results in a standby current of less than 1 µA for a 4-Mb SRAM. This current is low enough for battery back-up operation. At 1.2 V power supply, the current flowing in the PMOS TFT is more than one-and-a-half orders of magnitude larger than the OFF current. This demonstrates the ability of this technology for low-voltage operation.

After a write cycle, the high-storage node voltage in the cell becomes VDD - VT. The time needed for charging up this node to VDD is

$$t_{ch} = \frac{C_s V_T}{I_L} \qquad (6.2)$$

where $I_L$ is the current flowing in the load device and $C_s$ is the total parasitic capacitance of the node. Using 4-Mb data for the TFT memory cell, $V_T$ = 1 V, $C_s$ = 10 fF and $I_L$ = 10 pA, $t_{ch}$ is around 1 ms. For the poly-Si load this charge-up time is larger than 100 ms because $I_L$ is as low as 0.1 pA. The average interval time between two word-line selections (for the same word-line) is given by

$$t_a = \frac{N\, t_{cycle}}{M} \qquad (6.3)$$

where $N$ is the number of memory cells per SRAM chip, $M$ is the number of memory cells per word-line, and $t_{cycle}$ (also noted $t_{RC}$) is the operating cycle time. For 4 Mb, a typical value of $t_a$ is 4.5 ms when the cycle time is 70 ns and $M$ equals 64 cells per word-line. Comparing $t_a$ to $t_{ch}$ for the poly-Si load and the PMOS TFT we have

$$t_{ch} < t_a \quad \text{for PMOS TFT} \qquad (6.4)$$

$$t_{ch} > t_a \quad \text{for poly-Si load} \qquad (6.5)$$

Thus, the high-storage node, in the case of the PMOS TFT cell, is charged up quickly to VDD. For this reason, the Soft Error Rate (SER) of the PMOS TFT cell is much lower than that of the poly-Si cell [8].
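The comparison in Eqs. (6.2) to (6.5) can be checked numerically. The following is a minimal Python sketch using the values quoted in the text; the function and variable names are illustrative, and the 4-Mb capacity is taken as 4 × 2^20 cells, which is an assumption about how the rounded figure was obtained.

```python
# Sketch of Eqs. (6.2)-(6.5): node charge-up time versus word-line revisit interval.
# Names are illustrative; numeric values are the ones quoted in the text.

def charge_up_time(c_node, v_t, i_load):
    """Time for the load device to lift the high node from VDD - VT to VDD, Eq. (6.2)."""
    return c_node * v_t / i_load

def wordline_interval(n_cells, cells_per_wordline, t_cycle):
    """Average time between two selections of the same word-line, Eq. (6.3)."""
    return n_cells * t_cycle / cells_per_wordline

t_ch_tft = charge_up_time(10e-15, 1.0, 10e-12)    # PMOS TFT load, 10 pA
t_ch_poly = charge_up_time(10e-15, 1.0, 0.1e-12)  # poly-Si load, 0.1 pA
t_a = wordline_interval(4 * 2**20, 64, 70e-9)     # 4-Mb chip, 64 cells per word-line

print(f"t_ch (TFT)     = {t_ch_tft * 1e3:.2f} ms")
print(f"t_ch (poly-Si) = {t_ch_poly * 1e3:.2f} ms")
print(f"t_a            = {t_a * 1e3:.2f} ms")
print("TFT node recovers before the next access:", t_ch_tft < t_a)
print("poly-Si node recovers before the next access:", t_ch_poly < t_a)
```

With these numbers the TFT cell recovers in about 1 ms, well inside the roughly 4.6 ms revisit interval, whereas the poly-Si cell needs about 100 ms and is therefore usually accessed again before its high node has reached VDD.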

6.1.3 Read/Write Operation

Fig. 6.9 shows a simplified readout circuitry for an SRAM. The circuit has static bit-line loads composed of pull-up NMOS devices N1 and N2. The bit-lines are pulled up to a voltage (VDD - VT), where VT is the threshold voltage subject to body effect. When the word-line WL is asserted, one word is selected. At this time, the bit-line BL is pulled down to a level determined by the pull-up NMOS N1, the word-line transistor Na, and the driver NMOS transistor Nd, as shown in Fig. 6.9(b). The voltage at the node A should be low (near ground) so as not to alter the RAM content during this read operation. A small swing on BL is desirable to achieve high-speed readout, particularly if the bit-line capacitance C_BL is high. The Sense Amplifier (SA) amplifies the small swing, ΔV, on the bit-line. A typical value of ΔV is 100 mV. It should be noted that the SA should provide a wide operating margin over all process, temperature, and voltage corners.

Figure 6.10 Power reduction by pulsing the word line.

If the WL signal stays asserted, all selected columns consume a DC current flowing through the NMOS devices N1, Na and Nd. Thus, shortening the read mode duration is necessary to reduce the power dissipation during this active mode. This is possible by pulsing WL for just enough time to read the cell, as shown in Fig. 6.10. The generation of the pulsed WL signal is possible owing to the Address Transition Detection (ATD) technique, as will be discussed in Section 6.1.5.

Fig. 6.11(a) shows a simplified circuit configuration for the SRAM write operation. For a write operation the memory cell state should be flipped. When the write signal WE is asserted, the input data and its complement are placed on the bit-lines. If, for example, a zero has to be stored in the node A initially at VDD, the voltage at this node should be below the threshold voltage of the cell, as shown in the equivalent circuit of Fig. 6.11(b). The bit-line in this case is pulled down to almost 0 V. The design of the write circuitry should provide a wide operating margin over all process, temperature, and voltage corners. Note that a DC current is consumed during a write mode, hence the WE signal should also be short to cut this current at the end of the write operation. In high-speed SRAMs, write recovery time is an important component of the write cycle time. It is defined as the time necessary to recover from the write cycle to the read state after the WE signal is disabled. Note that the swing on the bit-lines after a write operation is large. Thus, an equalizer circuit is needed to reduce this swing, so that the read operation is performed quickly.

Fig. 6.12 illustrates a simplified schematic of an SRAM with read/write circuitry. At the end of the memory cycle a differential voltage exists on the bit-lines. A PMOS equalizing device is used to equalize the bit-lines after each read and write operation. The differential voltages on the bit-lines are restored



The decoders (row and column);

The memory array. If m memory cells are connected to the word-line, the active power of the memory array (in read mode) is given by

$$P_{mem\text{-}array} = m P_{act} + (n-1)\, m P_{ret} + m I_{DC}\, \Delta t\, f\, V_{DD} \qquad (6.6)$$

where $P_{act}$ is the power dissipated in active mode when selecting the m cells and $P_{ret}$ is the data retention (standby) power of the unselected memory cells in the m × n array. The second term is negligible. The third term is due to the DC current, $I_{DC}$, during the read operation. $\Delta t$ is the activation time of the DC consuming parts and $f$ is the operating frequency ($f = 1/t_{RC}$). An example of such a current is the DC current flowing from the bit-line load to the ground through the memory cell;

Sense amplifiers. They are dominated mainly by a DC current; and

Remaining periphery such as input/output buffer, write circuitry, etc.

Note that the power dissipated by the pads is not included. The power dissipation of the components, other than the memory array, depends on the total capacitances, the operating frequency and the internal voltage swing. It can include a DC component with a major contribution from the sense amplifier.
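To give a feel for the relative size of the three terms in Eq. (6.6), the following Python sketch evaluates them for a word-line held for the whole cycle and for a pulsed word-line. Every numeric value here (array size, per-cell powers, DC current, frequency, pulse width) is hypothetical and chosen only for illustration; the parameter names are likewise not from the text.

```python
# Illustrative evaluation of Eq. (6.6). All numeric values are hypothetical.

def array_active_power(m, n, p_act, p_ret, i_dc, dt, f, vdd):
    """Return the three terms of Eq. (6.6) in watts."""
    selected = m * p_act              # first term: m cells on the selected word-line
    unselected = (n - 1) * m * p_ret  # second term: retention power of the other rows
    dc = m * i_dc * dt * f * vdd      # third term: DC column current, on for dt each cycle
    return selected, unselected, dc

params = dict(m=128, n=512, p_act=1e-6, p_ret=1e-12, i_dc=100e-6, f=10e6, vdd=5.0)
full = array_active_power(dt=1 / params["f"], **params)  # word-line held for the whole cycle
pulsed = array_active_power(dt=10e-9, **params)          # word-line pulsed for 10 ns

for name, terms in (("static WL", full), ("pulsed WL", pulsed)):
    print(name, ["%.3g W" % t for t in terms], "total = %.3g W" % sum(terms))
```

Under these assumed values the retention term is several orders of magnitude below the others, and the DC term scales directly with the activation time, which is the motivation for pulsing the word-line.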

To reduce the active power consumption many techniques can be used. They are summarized as follows:

Reducing the capacitance of the word-line and the number, m, of cells connected to it. This is possible by using Hierarchical Word-Line (HWL) techniques.

Reducing the DC current by using the pulse operation technique for the word-line and the periphery circuits (including the sense amplifier).

Use of multi-stage static CMOS decoding to reduce the AC current.

Lowering the operating power supply voltage.

The standby power (sometimes expressed as the retention current) of an SRAM has a major contribution from the memory cells in the array if the sense amplifiers are disabled in this mode. It is given by

$$P_{standby} = m\, n\, P_{ret} \qquad (6.7)$$

One way to reduce the standby current is to reduce the operating voltage. However, note that the data-retention current will increase with memory capacity. Moreover, the leakage current per cell tends to increase because the threshold voltage is expected to be reduced for low-voltage operation.

In the following sections, many key circuits in an SRAM are reviewed. The circuit techniques and memory organization used to reduce the active and data-retention currents are presented.

6.1.5 Address Transition Detector (ATD) Circuit

To generate the different timing signals for word-lines, equalization and sensing, an on-chip pulse generator, which detects the address change, is needed. It is based on the address transition detection technique. The ATD is a key technique to reduce the active power of memories. Fig. 6.14(a) shows the schematic diagram of an ATD pulse generator. Short pulses are generated with XOR circuits when the address changes from "L" to "H" or from "H" to "L"; they are then summed through an OR gate. The overall pulse width is controlled by the RC delay line shown in Fig. 6.14(b). The corresponding waveforms are shown in Fig. 6.14(c). The ATD pulse is usually stretched out with a delay circuit to generate the different pulses needed in the SRAM. Note that the CS signal is also included as an input to the ATD generator.
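The XOR-and-OR principle of the ATD generator can be illustrated with a small behavioral model. This is only a discrete-time Python sketch, not the circuit of Fig. 6.14: the RC delay line is replaced by an assumed fixed delay of a few sample steps, and the function name and sample data are hypothetical.

```python
# Behavioral sketch of address transition detection (discrete time steps).
# The RC delay line is modeled as a fixed delay of `delay` samples (an assumption).

def atd_pulse(address_samples, n_bits, delay):
    """Return a 0/1 list that is 1 wherever any address bit differs from its delayed copy."""
    mask = (1 << n_bits) - 1
    pulse = []
    for t, addr in enumerate(address_samples):
        delayed = address_samples[max(t - delay, 0)]
        changed = (addr ^ delayed) & mask    # per-bit XOR of each line with its delayed copy
        pulse.append(1 if changed else 0)    # OR of all per-bit pulses
    return pulse

# Address changes at step 4 (bits 0 and 1 toggle) and at step 10 (bit 3 rises).
samples = [0b0101] * 4 + [0b0110] * 6 + [0b1110] * 6
print(atd_pulse(samples, n_bits=4, delay=3))
```

Each address change produces one pulse whose width equals the modeled delay, regardless of how many bits toggle or in which direction, which mirrors the role of the delay line in setting the pulse width.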

6.1.6 Decoders

Usually the decoding in an SRAM is performed by using complementary CMOS. Two kinds of decoders are used: the row and the column decoders. Fast static decoders are based on OR/NOR and AND/NAND gates. Fig. 6.15 shows an example of a two-bit input address row decoder. The input buffers have to drive the interconnect capacitance of the address lines and the input capacitance of the NAND gates. To match the pitch of the memory cell and to perform decoding for several blocks, two-stage decoders are used. The first stage performs predecoding and the second one performs the final decoding function [Fig. 6.16]. The two-stage decoder circuit has other advantages over the one-stage decoder, such as a reduced number of transistors and a reduced fan-in. It also reduces the loading on the address input buffers. This predecoding technique optimizes both speed and power. In the last stage an additional signal φ is included in the AND gate. This signal is generated from an ATD pulse generator to enable the decoder and ensure a pulse-activated word-line. There are several ways to build row decoders, and the choice depends on the RAM architecture division.
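The two-stage decoding idea can be sketched behaviorally as follows. This Python model is an illustration under stated assumptions, not the circuit of Fig. 6.16: the address is assumed to be split into 2-bit fields (so the address width must be a multiple of 2), and the names `predecode`, `final_decode`, and `phi` are hypothetical.

```python
# Behavioral sketch of a two-stage (predecode + final decode) row decoder.
# Assumes the address width is a multiple of `group`; names are illustrative.

def predecode(addr, n_bits, group=2):
    """First stage: split the address into `group`-bit fields and one-hot decode each field."""
    lines = []
    for lo in range(0, n_bits, group):
        field = (addr >> lo) & ((1 << group) - 1)
        lines.append([1 if k == field else 0 for k in range(1 << group)])
    return lines

def final_decode(pre_lines, phi, group=2):
    """Second stage: each word-line ANDs one predecoded line per field with the enable pulse phi."""
    n_rows = (1 << group) ** len(pre_lines)
    word_lines = []
    for row in range(n_rows):
        sel = phi
        for i, field_lines in enumerate(pre_lines):
            sel &= field_lines[(row >> (i * group)) & ((1 << group) - 1)]
        word_lines.append(sel)
    return word_lines

wl = final_decode(predecode(0b1001, n_bits=4), phi=1)
print("selected row:", wl.index(1), "active word-lines:", sum(wl))
print("with phi low:", sum(final_decode(predecode(0b1001, n_bits=4), phi=0)))
```

For a 4-bit address, each final gate in this model combines two predecoded lines plus the enable, instead of four address literals, which illustrates the fan-in reduction mentioned above.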

The column decoder permits the selection of 1 out of m bits of the accessed row. Fig. 6.17(a) shows the circuits involved for column selection using an example of 4 columns. The selected gate permits the transfer of the data from the bit-lines to the common data-lines I/O. The signals Yi are controlled by the AND/NAND column decoder as shown in Fig. 6.17(b).


6.1.7 Bit-line Conditioning Circuitry

The NMOS bit-line loads [Fig. 6.18] have been used in many SRAMs at 5 V power supply. They provide a precharge level on the bit-lines of VDD - VT. The threshold voltage of the load, VT, is subject to the body effect. A typical value of this precharge level for 5 V power supply is 3.5 V. This level is suitable for voltage-type sense amplifiers to provide large gain and fast sensing delay.

To reduce the DC current during the write cycle, a variable bit-line load technique can be employed [Fig. 6.19]. It realizes fast sensing in the read cycle and a short write pulse width in the write cycle. For fast sensing, the voltage swing of the bit-line should be small. To achieve this, the load impedance should be low. On the other hand, to obtain a low current during the write cycle, the load impedance of the bit-lines should be high. As shown in Fig. 6.19, during the read operation, all four NMOS transistors N1, N2, N3, and N4 are turned ON. The bit-lines are switched into a low-impedance state so that the voltage swing of the bit-lines is limited to a small value (e.g., 100 mV). During the write operation, two of the NMOS devices are switched OFF and only the two small-size transistors are turned ON.

Figure 6.19 Variable load bit-lines.


As the power supply voltage is scaled down to 3 V, the precharge level can be lower than 2 V. Thus, during the read operation the high-level node of the memory cell can become equal to the bit-line voltage. Hence, the noise margin of the memory cell is drastically degraded, and consequently the cell stability and soft error immunity are degraded. Therefore, at 3 V power supply voltage, a PMOS transistor can be used as the bit-line load [Fig. 6.20]. The bit-line precharge voltage is then VDD. For low-voltage bit-line precharge, special sense amplifiers should be used because conventional sensing circuits have poor voltage gain (less than 10). A variable impedance bit-line load, using PMOS transistors, can also be implemented.

6.1.8 Sense Amplifier

When reading a memory cell, the bit-lines are initially precharged. Then one of the two bit-lines goes down, while the other stays high. The operation of pulling down the bit-line is very slow because the discharging MOS device in the memory cell is small and the bit-line capacitance is high. This results in a very slow memory read time. Sense amplifiers are used to detect the small variation on the bit-lines and amplify it to obtain a full-swing signal at the end. A simple unbalanced inverter with a high logic threshold voltage can be used. Since its input is single-ended and has a very small noise margin, it is very sensitive to noise on the bit-line. Thus, sense amplification for the data-lines is a key to achieving fast access time and low power dissipation. In general, the delay of a sense amplifier (from the time of word-line activation) represents 30 to 40% of the whole read access time.

Various kinds of sense amplifiers have been devised for fast sensing operation and low power dissipation. Fig. 6.21(a) shows a single-end sense amplifier with an active current-mirror. This structure forms the basis for many SRAM sense amplifier circuits. It has two differential inputs, $DL$ and $\overline{DL}$. Noise affects both inputs equally and only the difference is detected. One NMOS transistor acts as a current source. Before the signal $\phi_{SA}$ is asserted, the data-lines $DL$ and $\overline{DL}$ are high. All the nodes, A, B and C, are high. The signal $\phi_{SA}$ is asserted when $DL$ starts, for example, to drop slowly. In this case, the current-source NMOS transistor is ON. The output voltage (node C) drops suddenly to a certain voltage. Thus, the input signal is amplified by the gain of this differential amplifier.

Fig. 6.21(b) shows the voltage waveforms of the single-end sense amplifier obtained by SPICE simulation. The signal $\phi_{SA}$ is generated with an ATD pulse. It is asserted for a time long enough to amplify the small variation (a few hundred mV) on the data-lines¹, then it is deactivated. In this scheme the DC current consumed by the sense amplifier is cut off. Usually the sense amplifier is common to many columns through the common data-lines. The small-signal gain of this amplifier is given by

$$A = \frac{g_{mn}}{g_o} \qquad (6.8)$$

where $g_{mn}$ is the transconductance of the driver NMOS Nd and $g_o$ is the combined output conductance of the PMOS load and the NMOS driver.

In many SRAMs multi-stage sense amplifiers are needed to attain a large voltage gain. In this case, the double-end sense amplifier is used, as shown in Fig. 6.22. This circuit has often been used in many SRAMs. To attain high-speed data sensing, a two- or three-stage sense amplifier technique can be adopted. Fig. 6.23 shows a two-stage amplifier structure. An equalization technique is used for the data-lines, using the equalization pulse $\phi_{eq}$, which is generated with an ATD pulse. It is indispensable, not only to attain faster data transfer during the read operation, but also to suppress incorrect data before the correct data appears in the sense amplifier [17]. For low-power applications, and also due to the plastic packaging limitations of static memories, this type of sense amplifier can result in high power dissipation for high-density memories even if the current source is pulsed.

¹The output of the sense amplifier is then latched.

Figure 6.24 PMOS cross-coupled sense amplifier.

Many circuits have been proposed to reduce the power of the sense amplifier while improving the sensing delay time. One of them is the PMOS cross-coupled amplifier [18] shown in Fig. 6.24. The PMOS loads, P1 and P2, are cross-coupled and the differential outputs $S$ and $\overline{S}$ are connected to their gates. The positive feedback in this latch amplifier permits a much faster sense speed than the conventional one. In this circuit the equalization technique is used for the reasons discussed above. Fig. 6.25 shows the sense delays of both the PMOS cross-coupled amplifier and the double-end current-mirror amplifier as a function of the average current of the amplifier. The input voltages simulate the common data-lines' voltages, and the sense delay $t_d$ is defined as the delay time from the crossover point of the input voltages to the point when the output reaches a 1 V difference. The PMOS cross-coupled amplifier has less than half the delay of the conventional current-mirror sense amplifier. Moreover, this latch amplifier consumes less than one-fifth of the power of a current-mirror amplifier. The PMOS cross-coupled latch amplifier requires much more accurate timing for $\phi_{SA}$ to optimize the sensing delay [18]. This circuit also has a low-power property compared to the current-mirror amplifier since it has nearly full-swing outputs with positive feedback.

Figure 6.25 (plot labels: 0.6 µm CMOS; conventional current-mirror SA; sense delay $t_d$).


When the power supply voltage is scaled to 3 V, the data-line voltage is near VDD, so level shifting can be performed. Fig. 6.26 shows a two-stage sense amplifier used for 3.3 V supply. The first stage is a cross-coupled NMOS amplifier which also performs level shifting of the common data-line voltage. In the second stage, a conventional sense amplifier is used, which operates at its maximum gain point since the levels on its inputs are medium levels.

Fig. 6.27 shows another sense amplifier developed for low-voltage power supply [19]. This circuit is used when the bit-lines are close to VDD, where the gain of a conventional current-mirror amplifier is poor. The circuit is composed of a level-shift circuit and a conventional current-mirror amplifier. The level-shifter shifts the bit-line voltage to a medium voltage, 0.6 to 0.7 V (at 1 V power supply voltage), where the gain is maximum. Low-VT NMOS devices N1 and N2 are used to provide these medium levels. These devices are subject to the body effect.

Recently, current sense amplifiers have been proposed to overcome the gain reduction of voltage amplifiers at low power supply [7, 12]. They also reduce the power dissipation of the sensing operation compared to voltage sense amplifiers at the same delay. These circuits require very careful design.

6.1.9 Output Latch

In low-power SRAMs, the pulse technique for the word-line and the sense amplifier is indispensable in order to reduce the DC current. In such a pulse mode, a data-latch circuit is required to store the data amplified by the sense amplifier from the memory cell for the data output circuitry. Fig. 6.28 shows an example of an output latch placed after the sense amplifier. The requirements of such an output latch are the following:

The latch circuit must not delay the read access time. Such a requirement is attained by connecting the latch with the data-bus lines in parallel. One input transmission gate, controlled by $\phi_I$, is used to enter the data into the latch. Another transmission gate, controlled by $\phi_O$, is used to put the data back onto the data-bus.

The latched data must not be destroyed by the noise entering the SRAM. Noise in an SRAM is generated and propagated by the following mechanism. On the system board, ground noise can enter the SRAM. When the peak level of the ground noise becomes large enough for the first gate of the address buffer to change the logic value of the address input, an ATD pulse noise is generated. This noise pulse could turn on the word-line and the sense amplifier for a short time, resulting in an unexpected signal on the data-bus. Therefore, the latched data could be destroyed if the input gate is ON. To avoid such a problem, two circuit techniques are included in the circuit of Fig. 6.28. The first one is the generation of $\phi_I$ only when the pulse width of the ATD is large enough compared to that of the noise. The other circuit technique is to place latch-protecting inverters [Fig. 6.28] in front of the output gates. The inverters prevent noise from entering the output gates.

The new data must be quickly latched into the data-latch. The circuit of Fig. 6.28 can be optimized for fast operation.

6.1.10 Hierarchical Word-Line for Low-Power Memory

With the increased memory size, the word-line delay and the column power increase. To solve this problem, a Divided Word-Line (DWL) structure was proposed [20]. The concept of DWL is shown in Fig. 6.29. The cell array and the word-line are divided into $n_B$ blocks (sub-arrays). If the SRAM has $n_C$ columns, each block has $n_C/n_B$ columns. The divided word-line of each block is activated by the main word-line and the corresponding block select signal. Consequently, only the memory cells connected to one divided word-line within a selected block are accessed in a cycle. Hence, the column current is reduced, since only the selected columns switch. Moreover, the word-line selection delay, which is the delay time from the address input to the divided word-line, is reduced. This delay is composed of the main word-line select delay and the divided word-line select delay. The main word-line selection delay is reduced compared to the conventional one, because the total capacitance of the connected transistors is reduced. In a conventional SRAM, the word-line has the gates of all the memory cells of a row connected to it. The main word-line delay increases as the number of blocks increases because the number of block select gates increases. On the other hand, the divided word-line delay decreases as the number of connected cells is reduced with the increasing number of blocks. Consequently, the word-line selection delay has a minimum for a certain number of blocks.

Figure 6.29 Divided Word-Line (DWL) concept [20].

Fig. 6.30 shows the effect of the number of blocks in the DWL structure on the word-line select delay and the column power for a 64-Kb SRAM [20]. In this example, a number of blocks of eight can be chosen. The area penalty for this case is only 5% compared to the conventional memory. As an example, for a 1-Mb SRAM, the cell array is divided into 16 blocks and each block consists of 512 rows by 128 columns. A 9-bit address (A0 ... A8) is used to select a row within a block using a two-stage row decoder. Global block selection is done using a 4-bit address.
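The 1-Mb organization quoted above can be checked with a short Python sketch that derives the address field widths from the block, row, and column counts. The ordering of the fields inside a flat address (block, then row, then column) and the helper names are assumptions made for illustration only.

```python
# Sketch of the address partition for a DWL organization.
# Field order inside the flat address is an assumption; names are illustrative.

def field_widths(n_blocks, rows_per_block, cols_per_block):
    """Address bits needed per field; each argument is assumed to be a power of two."""
    return {
        "block": (n_blocks - 1).bit_length(),
        "row": (rows_per_block - 1).bit_length(),
        "column": (cols_per_block - 1).bit_length(),
    }

def split_address(addr, widths):
    """Split a flat address into (block, row, column) using the assumed field order."""
    col = addr & ((1 << widths["column"]) - 1)
    row = (addr >> widths["column"]) & ((1 << widths["row"]) - 1)
    block = addr >> (widths["column"] + widths["row"])
    return block, row, col

widths = field_widths(16, 512, 128)
total_bits = sum(widths.values())
print(widths, "capacity =", 1 << total_bits, "bits")
print("columns activated per access:", 128, "instead of", 16 * 128)
print(split_address(0xABCDE, widths))
```

The widths come out as 4 block bits and 9 row bits, matching the text, and the three fields together address 2^20 cells. Only 128 of the 2048 columns are activated per access, which is the source of the column current reduction.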

The DWL structure has been widely used in high-density SRAMs for its low-power, high-speed characteristics. However, in high-density SRAMs with a capacity of more than 4 Mb, the number of blocks in the DWL structure will have to increase. Therefore, the capacitance of the global word-line increases, causing the delay and power to increase. To solve this problem, the concept of Hierarchical Word Decoding (HWD) was proposed in [21], as shown in Fig. 6.31. The word select line is divided into more than two levels. The number of levels (hierarchy) is determined by the total load capacitance of the word select line, so as to distribute it efficiently. Hence, the delay and the power are reduced. For 4 Mb, three levels of hierarchy have been used with 32 blocks, each block having 128 columns by 1024 rows. Fig. 6.32 compares the delay time and the total capacitance of the word decoding path for the optimized DWL and HWD structures of 256-Kb, 1-Mb, and 4-Mb SRAMs. For a 256-Kb SRAM there is no significant advantage of HWD over DWL. However, for high-density SRAMs the advantage of HWD, in terms of power and delay, becomes clear. The three-level scheme can be used efficiently for 16-Mb SRAMs.

6.1.11 Low-Voltage SRAM Operation and Circuitry

There are several applications which need a 1.2 V battery power supply. For such applications 1 V SRAMs are needed. At 1 V power supply, stable operation is targeted and it is very important that the noise is reduced. Moreover, the active and standby powers should be reduced to meet the requirements of battery operation.

For 1 V power supply, a full CMOS memory cell has a lower power dissipation in standby mode and greater immunity to transient noise and voltage variation than other cells. It can also operate at the lowest supply voltages. Although a full CMOS cell operates well at ultralow voltage, its area is almost double that of the PMOS TFT cell. Hence it is not suitable for high-density memories (size > 4 Mb).

When the full CMOS memory cell is operated at 1 V power supply, a typical cell ratio is 3 for stable operation. The SNM of this cell at 1 V can be almost the same as that of a poly-Si load memory cell at 5 V. When using the full CMOS cell, no boosting of the word-line is needed to write a high voltage level in the cell. However, the PMOS TFT cell requires a boosted voltage ($V_{ch}$ > VDD) on the word-line during the write cycle [19]. If the voltage of the word-line is raised only to VDD in the write cycle, the high node B of Fig. 6.33 is initially at VDD - VT, where VT is the threshold voltage of the access device subject to the body effect. This low level (VDD - VT) of the node B cannot charge up to VDD because of the poor drivability of the PMOS TFT device.

When the boosted word-line technique is applied to the PMOS TFT cell during a write cycle, a problem can arise. The unselected cells connected to the boosted common word-line suffer from an instability problem because a large current flows through the low node of the cell. This large current is due to the high voltage on the access transistor. Consequently, this technique is not suitable for 1 V operation.


Figure 6.34 Two-step technique for 1 V operation [19].

Figure 6.35 (a) TSW read/write circuitry [19]. (Diagram labels: word driver, low-VT MOSFET, Din, WE.)

A Two-Step Word (TSW) voltage technique has been proposed by Ishibashi et al. [19] to solve the cited problem. Fig. 6.34 shows the block diagram of the proposed memory. The boosted-level generator¹ generates a voltage $V_{ch}$ = 1.5 V for VDD = 1 V. The word-line voltage has two steps: one is VDD and the other is $V_{ch}$. The circuitry for the TSW method is shown in Fig. 6.35(a). When the control signal φ goes to zero, the signal WL is raised to $V_{WL}$ = VDD. Then, when $\phi_{ch}$ is asserted with a high level equal to $V_{ch}$, the transistor P1 turns ON and the WL level is increased to $V_{WL}$ = $V_{ch}$. In this case, the low-threshold-voltage NMOS device turns OFF and the inverter of the word driver is isolated to reduce any leakage current.

Fig. 6.35(b) shows the voltage waveforms for the TSW circuitry in read/write modes. During the write cycle, the high node A is first charged to a low voltage, then raised to a higher level. The bit-lines are initially floating, then precharged at the end of the write cycle. In the next read cycle, the bit-lines are floating. Before the word-line voltage rises to $V_{ch}$, the cell discharges BL through the low node B. Thus, when the word-line has risen to $V_{ch}$, current does not flow in the cell and the node B stays at a low voltage level. Note that this technique requires multi-VT CMOS devices and causes delay in writing because the bit-lines are discharged before writing.

¹The boosted-level generator is presented in Section 6.2.11.

However, the low-voltage SRAMs discussed above require a relatively high threshold voltage, VT ≥ 0.5 V. Thus, their speed is quite slow. As an example, a 256-Kb SRAM with full CMOS memory cells attained a 3 µs access time at 1 V power supply using 0.8 µm CMOS technology [22]. The active power at 0.1 MHz is 0.2 mW and the standby power is 5 nW. Another example is a 1-Mb SRAM with full CMOS memory cells which achieves a 200 ns access time at 1 V power supply using 0.5 µm CMOS technology [23]. The active power at 1 MHz is 0.1 mW and the standby power is 10 nW. Note that if the threshold voltage is too low for ultra-low-voltage applications, all the circuits composing the SRAM will suffer from subthreshold current leakage. Thus, the retention current increases drastically, causing a serious problem for low-power applications. Moreover, the temperature effect and the threshold voltage variation enhance this current. So far, no practical solution has been proposed.

6.2 DYNAMIC RAM

The first dynamic RAM (DRAM) was introduced in 1970 with a capacity of 1 Kb. Since then, the density has quadrupled every three years (one generation). Recently, some experimental 256-Mb DRAMs were reported [24, 25, 26]. At present, low-voltage 16-Mb DRAMs run in high-volume production. The development of these higher densities has made DRAMs the cheapest per bit compared with other types of memories. They are widely used as the main memory of mainframes, PCs, and workstations. The access time has been decreased from a few hundred ns for 4-Kb DRAMs to less than 50 ns for 256-Mb. Also, the power dissipation has been reduced by an order of magnitude from 4-Kb capacity to 256-Mb capacity, reaching 50 mW at 1.5 V power supply. The area of the memory cell has been reduced from more than 100 µm² for a 64-Kb DRAM to 1.28 µm² for a 64-Mb DRAM.

In addition to the trend for higher-density standard DRAMs, there are two other trends: Low-Power (LP) DRAMs and high-speed DRAMs. The high-speed DRAMs sacrifice the retention current as well as density for faster access time. Low-voltage low-power DRAMs are becoming important, particularly for battery operation. LP DRAMs extend the time of battery operation as well as battery back-up operation. The active current of LP DRAMs has been lowered. The data-retention current has also been reduced, but it is still about one order of magnitude higher than that of SRAMs¹. The 5 V power supply standard has been used externally for many DRAM generations, from 64 Kb to 16 Mb. This was followed by the 64-Mb DRAM powered with an external 3.3 V, not only to reduce the power dissipation but also to ensure reliability. The gate oxide reliability limits the maximum voltage, which is related to the boosted voltage inside the chip. Regarding the internal voltage, 5 V can be used up to a maximum DRAM capacity of 4 Mb. At the 16-Mb generation, the internal voltage is 3.3 V while maintaining an external 5 V with an on-chip voltage down converter [see Section 6.3]. However, the 3.3 V external power supply will dominate.

¹This comparison is made for 1-Mb memories.

Figure 6.36 Trends of DRAM supply [28]. (Horizontal axis: density 1M, 4M, 16M, 64M, 256M, 1G bit; feature size 1.3, 0.8, 0.5, 0.3, 0.2, 0.1 µm; Tox 25, 20, 15, 10, 7, 5 nm. Plot labels: limiter, WL swing, Mn, NiCd.)

Recently, activities to realize 1.5 V battery-operated DRAMs are accelerating the trend toward low-voltage operation [27, 28, 29]. Fig. 6.36 shows the trend of DRAM supply [28]. In battery operation, the chip must be operated on a variety of batteries with various supply voltages, over a long term and under supply fluctuations.


6.2.1 Basics of a DRAM

In general, the pins of a DRAM are:

■ Address, which is separated in time into two fields. These fields are the row and column addresses.

■ Row Address Strobe ($\overline{RAS}$). The row address is clocked by this signal.

■ Column Address Strobe ($\overline{CAS}$). The column address on the multiplexed pins is clocked by this signal.

■ Write Enable ($\overline{WE}$).

■ Input/output data pins.

■ External power supply pins.

It is clear that the multiplexed address penalizes the access delay, so for fast DRAMs separate address input pins can be used. The multiplexing permits the reduction of the pin count and the cost of packaging. An example of DRAM timing, using address multiplexing during read mode, is shown in Fig. 6.37. Some important times are shown, such as the access time from row, $t_{RAC}$, the row address strobe cycle time (or cycle time), $t_{RC}$, and the row address strobe low-state time, $t_{RAS}$.

Fig. 6.38 shows a general 4-Mb DRAM architecture. It uses almost the same circuit techniques as the SRAM except for the memory array. Some additional circuits are needed, such as a Back-Bias Generator (BBG), a Half-Voltage Generator (HVG), an optional Voltage-Down Converter (VDC), a Reference Voltage Generator (RVG), and a boosted voltage generator circuit. The substrate back-bias voltage is indispensable for stable operation of the DRAM array. The half-voltage generator generates the half-$V_{DD}$ precharge level for the bit-lines, as explained in the following sections. The reference voltage generator is needed for the VDC. The boosted voltage generator uses a charge-pump circuit and permits overdriving of the word-line WL to a voltage higher than $V_{DD}$. More details on these circuits, which compose the DRAM, are given in the following sections.

6.2.2 DRAM Memory Cell

CMOS DRAMs with three-transistor and four-transistor cells were used in the 1- and 4-kb generations. The one-transistor (1T) cell offers smaller chip size and low cost. These advantages justify the process complexity needed to fabricate the 1T cell, particularly its capacitor.

A schematic of a 1T DRAM cell is illustrated in Fig. 6.39(a). The charge is stored in the capacitor $C_S$. To prevent loss of the stored information, the capacitor must be refreshed within a specific time with special circuitry. The bit-line has a capacitance $C_{BL}$, including the parasitic load of the connected circuits. Typical values for the storage and the bit-line capacitors are 30 fF and 250 fF, respectively. The ratio $R = C_{BL}/C_S$ is very important for the sensing operation.


During the read operation (WL is selected) the bit-line voltage changes by

$$\Delta V_{BL} = \frac{V_{MC} - V_{BL}}{1 + R} \qquad (6.9)$$

where $(V_{MC} - V_{BL})$ is the difference between the memory cell voltage and the bit-line voltage before the selection of the cell. A typical value of the difference is $V_{DD}/2$. Hence, we have for the bit-line sense signal

$$V_S = \frac{V_{DD}/2}{1 + R} \qquad (6.10)$$

For a 3.3 V supply voltage, and using a ratio R = 8 for a 16-Mb DRAM, the sense signal is $V_S$ = 180 mV. This small voltage change of the bit-line requires sensing circuits. For low-voltage operation, $V_S$ decreases, thus a low ratio R is required. This is possible by reducing $C_{BL}$ and increasing $C_S$.
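The numerical example can be checked with a short sketch, assuming the charge-sharing relation $V_S = (V_{DD}/2)/(1 + R)$ with $R = C_{BL}/C_S$. The function name and the capacitor values (240 fF and 30 fF, chosen only so that R = 8) are illustrative assumptions, not values taken from a specific design.

```python
def sense_signal(vdd, c_bl, c_s):
    """Bit-line sense signal in volts, assuming V_S = (VDD/2) / (1 + R)."""
    r = c_bl / c_s  # capacitance ratio R = C_BL / C_S
    return (vdd / 2.0) / (1.0 + r)

# Hypothetical capacitances giving R = 8, as in the 16-Mb example
C_BL = 240e-15  # bit-line capacitance (F), assumed
C_S = 30e-15    # storage capacitance (F), assumed

for vdd in (5.0, 3.3, 1.5):
    v_s = sense_signal(vdd, C_BL, C_S)
    print(f"VDD = {vdd:.1f} V -> V_S = {v_s * 1e3:.0f} mV")
```

At 3.3 V this gives roughly the 180 mV quoted in the text, and it shows how quickly the sense signal shrinks at 1.5 V if R is not reduced.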

$C_S$ was implemented using a simple planar-type capacitor, as shown in the structure of Fig. 6.39(b). This structure was used in DRAMs with capacity up to 1-Mb. With the increased density, many three-dimensional approaches were used for DRAMs with capacity higher than 1-Mb. One approach is to stack the capacitor over the access transistor (STC cell). Another approach is to use a trench capacitor. For more details on advanced cell structures the reader can consult [30, 31].

The signal charge ($Q_{sig} = C_S \Delta V_S$) transferred to the bit-line during a read operation should have enough margin against noise. The sources of noise are the following:

■ bit-line noise, which is caused by capacitive couplings and other sources;

■ leakage charge, which is mainly due to the leakage in the junction of the NMOS transistor of a 1T memory cell; and

■ α-particle-induced soft errors.

In the early DRAMs, the plate of the capacitor was grounded to reduce the noise injection from the $V_{DD}$ power supply. However, for multi-Mb DRAMs, a $V_{DD}/2$ bias for the cell plate was used. This scheme has several advantages, such as the reduction of the stress on the thinner oxide of the storage capacitor and the reduction of supply voltage noise. Many 1-Mb DRAMs have used this cell biasing scheme.


For Gb DRAM cell design with reduced $V_{DD}$, the ratio R should be reduced. This is possible by reducing the bit-line capacitance, $C_{BL}$, and increasing the storage capacitance, $C_S$. On the other hand, the area occupied by $C_S$ should be reduced to increase the chip capacity. One solution for reducing the area of $C_S$ is the use of a capacitor insulator with extremely high permittivity ε, such as a ferroelectric material (for example, a BaSrTiO3 film). Consequently, a simple planar-type capacitor can be used in that case.

6.2.3 Read/Write Circuitry

Fig. 6.40 illustrates the different circuits for the read, write, precharge, and equalization functions. The read operation is performed as follows. Initially both bit-lines (BL and $\overline{BL}$) are precharged to a level equal to $V_{DD}/2$ and equalized before the data reading operation. This half-$V_{DD}$ precharge technique permits the reduction of the active power dissipation, as discussed in Section 6.2.9. The signal WL is selected by the row decoder. The high level of the word-line voltage has to be greater than $V_{DD}$ to increase the stored charge in the memory cell. The selected memory cell is connected to one bit-line. Then $\Delta V_{BL}$ (100 to 200 mV) appears between the bit-lines, immediately after the word-line rises. Then it is amplified by the latch-type CMOS sense amplifier


which is connected to both bit-lines. After the sensing and the restoring operations, the voltage levels of the bit-lines are in a full-swing condition. The bit-line differential voltage signal is transferred to the differential output-lines (O and $\overline{O}$) through a read circuit. The signal YR is selected at almost the same time as WL. The parasitic capacitance of the output-line is large (a typical value is 2 pF for a 4-Mb DRAM), and the readout circuit would need a long time to amplify the output-line signal. A main sense amplifier is used to read the output-lines, then the data is selected among several main SAs connected to different sub-arrays. Finally it is transferred to the output buffer.

The DRAM cell readout mechanism is destructive, and hence the same data must be rewritten to the cell on every read access. Consequently, on each bit-line pair, a CMOS amplifier is needed to amplify and restore the level. This mechanism is not needed in SRAMs since the read operation is non-destructive.

In the write mode, the YW signal is selected by a column decoder, as shown in Fig. 6.40. In this case, the write control signal is activated. The selected bit-lines are connected to a pair of write-lines W and $\overline{W}$, and the data are transferred to the memory cell when WL goes HIGH.

6.2.4 Low-Power Techniques

Fig. 6.38 can be used to identify the different sources of power dissipation in a DRAM. For simplicity we assume that the internal supply voltage is the same as the external one. The total power dissipated is the sum of two components: the active power and the data-retention power. The active power is the sum of the power dissipated by the following components:

■ The decoders (row and column);

■ The memory array. This is the dominant one. If m memory cells are connected to the word-line, the active power of the memory array is given by

$$P_{mem\,array} = m \times P_{actm} \qquad (6.11)$$

where $P_{actm}$ is the power dissipated in active mode when selecting the m cells. It is given by

$$P_{actm} = C_{BL}\,\Delta V_{BL}\,V_{DD}\,f \qquad (6.12)$$

■ The sense amplifier;

■ Other circuits such as the refresh circuit, substrate back-bias generator, boosted level generator, a voltage reference circuit, and a half-$V_{DD}$ generator. These circuits also dissipate a DC current;

■ The rest of the periphery, such as the main sense amplifier, input/output buffers, write circuitry, etc.

Note that the power dissipated by the pads is not included.
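As a rough illustration of the array term, the sketch below evaluates the product of m and the per-bit-line dynamic power $C_{BL}\,\Delta V_{BL}\,V_{DD}\,f$. All numerical values (4096 cells per word-line, 250 fF, a half-$V_{DD}$ swing, 10 MHz) are assumptions chosen for illustration only.

```python
def array_active_power(m, c_bl, dv_bl, vdd, f):
    """Active power of the memory array (W): m bit-lines, each dissipating C_BL * dV_BL * VDD * f."""
    p_act = c_bl * dv_bl * vdd * f  # power per selected bit-line
    return m * p_act                # m cells share the selected word-line

# Hypothetical operating point
M = 4096          # cells per word-line (assumed)
C_BL = 250e-15    # bit-line capacitance (F)
VDD = 3.3         # supply voltage (V)
F = 10e6          # cycle frequency (Hz), assumed

p_half = array_active_power(M, C_BL, VDD / 2, VDD, F)  # half-VDD bit-line swing
p_full = array_active_power(M, C_BL, VDD, VDD, F)      # full-VDD bit-line swing
print(f"half-VDD swing: {p_half * 1e3:.1f} mW, full swing: {p_full * 1e3:.1f} mW")
```

The comparison makes the two main levers visible: the product of m and $C_{BL}$, and the bit-line swing.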

To reduce this active power, many techniques can be used. They are summarized as follows:

■ Reducing all capacitances, particularly the bit-line and word-line capacitances. As seen from Equations (6.11) and (6.12), $m \times C_{BL}$ should be reduced. Techniques which permit this are partial activation of a multi-divided bit-line and shared I/O [see Section 6.2.7]. Also, to reduce the word-line capacitance, a technique such as partial activation of a multi-divided word-line can be used [see Section 6.2.8];

■ Lowering the internal $V_{DD}$. This includes the generation of half-$V_{DD}$ for precharging the bit-lines and reducing the external supply voltage; and

■ Reducing the DC power required by the periphery circuits. This is possible by using static CMOS decoders and the pulse operation technique using an ATD circuit (as in SRAMs).

The data-retention power in a DRAM is mainly due to the refresh operation and the DC power ($I_{DC}$) due to peripheral circuits such as the BBG, BVG, RVG, and HVG. The refresh process is performed by reading the m cells connected to each word-line and restoring them. Thus, n refresh cycles are needed for an n x m DRAM. The data-retention power can be estimated by

$$P_{retention} = \frac{P_{active}}{f}\,\frac{n}{t_{refresh}} + P_o \qquad (6.13)$$

where $P_{active}/f$ is the total dynamic energy per cycle (f is the operating frequency) and $t_{refresh}$ is the refresh time, within which the n word-lines of m cells each are refreshed. To reduce the power dissipation due to the refresh mode, one obvious technique is to increase $t_{refresh}$ and decrease n. $P_o$ is the AC and DC power dissipated by the other circuits such as the VDC, BBG, RVG, HVG, and boosted level generator. To reduce this power, many

Figure 6.41 Static CMOS word-line driver

techniques can be used. One of them is to reduce, in data-retention mode, the operating frequency of the circuits which dissipate high power during active mode. Another one is to reduce the DC current of these circuits using, for example, a dynamic concept.

In the following sections, the circuit techniques to reduce the active and data-retention power dissipation are presented. Also, the different circuits constituting a DRAM are described and the low-power issues of these circuits are discussed.

6.2.5 Decoder

In a DRAM, static CMOS NAND decoders are used. The power is reduced by using the predecoding technique. This topic is discussed further in Section 6.1.6 for SRAMs. Fig. 6.41 shows a static CMOS word-line driver. The boosted level, $V_{CH}$, generated by an internal charge pump circuit, is used in the output stage. When node A is high at $(V_{DD} - V_T)$, the output inverter leaks a high DC current because this level is lower than $V_{CH}$ by at least two threshold voltages, subject to body effect. Therefore, a small-size PMOS transistor P1 is used to restore the level of node A to the $V_{CH}$ level. This transistor also permits the latching of the low output level (ground). The $X_i$ signal, when selected, is normally at $V_{DD}$. The unselected $X_i$ is discharged to ground in the selected block before the row decoder selection.


6.2.6 Sense Amplifier

The main sense amplifier is the main source of DC current during the active mode. It employs the same sense amplifier discussed in Section 6.1.8 for SRAMs. The DC current can be shut down using the ATD technique.

6.2.7 Bit-Line Capacitance Reduction

Reducing the bit-line capacitance not only reduces the power dissipation but also improves the signal-to-noise ratio of the memory cell. This is possible by two approaches:

1. Reducing the number of memory cells n per bit-line. In this case, the multi-divided bit-line technique is used.

2. Reducing the junction capacitances of the connected transistors, such as the access devices. One possible solution is the back-bias of the substrate containing these devices. A negative voltage on the substrate reduces the junction capacitance. In addition, the use of the trench isolation technique for CMOS devices rather than LOCOS isolation results in an almost 50% reduction in capacitance.

Fig. 6.42 shows the principle of the multi-divided bit-line architecture for the memory array. The m x n array is now divided into m columns by k subarrays. Each subarray contains n/k word-lines. In this scheme the bit-line capacitance $C_{BL}$ is reduced by dividing it into k sections. The signal-to-noise ratio of the cell is also improved. Fig. 6.43 illustrates an example of a 1-Mb DRAM [32]. The memory is divided into two parts, upper and lower. One part is divided into N = 16 sub-arrays and the total number of subarrays is k = 32. Two sub-bit-lines share one amplifier and are selected by the isolation signals ISO and $\overline{ISO}$. Thus, a partial activation is performed by selecting only one SA along the bit-line. The switch SW is controlled by the Y signal from the shared column decoder. This signal runs in parallel to the bit-lines and uses metal-2. Thus, the I/O is shared by two sub-bit-lines. This principle results in reduced power dissipation and chip size. It has been used for many DRAM generations up to 16-Mb.
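The effect of dividing the bit-line can be sketched numerically. The example below assumes, as a simplification, that the bit-line capacitance scales exactly as 1/k (the fixed parasitics of the sense amplifier and switches are ignored) and reuses the sense-signal relation $V_S = (V_{DD}/2)/(1 + R)$. The undivided capacitance of 960 fF is a hypothetical value.

```python
def divided_bitline(c_bl_total, c_s, vdd, k):
    """Return (C_BL, R, V_S) for a bit-line split into k sections.

    Assumes C_BL scales as 1/k; real designs keep some fixed parasitic load.
    """
    c_bl = c_bl_total / k
    r = c_bl / c_s
    v_s = (vdd / 2.0) / (1.0 + r)
    return c_bl, r, v_s

C_TOTAL = 960e-15  # undivided bit-line capacitance (F), assumed
C_S = 30e-15       # storage capacitance (F)

for k in (1, 2, 4, 8):
    c_bl, r, v_s = divided_bitline(C_TOTAL, C_S, 3.3, k)
    print(f"k = {k}: C_BL = {c_bl * 1e15:.0f} fF, R = {r:.0f}, V_S = {v_s * 1e3:.0f} mV")
```

Under these assumptions each doubling of k halves the switched bit-line capacitance and raises the sense signal, which is the double benefit described above.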


6.2.8 Multi-Divided Word-Line

Figure 6.43 Multi-divided bit-line architecture with shared SA, I/O and column decoder [32].

Fig. 6.44 shows the hierarchical word-line structure proposed for a 256-Mb DRAM [26]. This scheme resembles the one used in the SRAM. The DRAM cell array is divided into several blocks, and each one is itself divided into subarrays. The Sub-Word-Line (SWL) circuitry is embedded in the subarray. Only one SWL is activated by the Main-Word-Line (MWL) and the row select signal. It is common to two subarrays, as shown in Fig. 6.44. Thus, only two cell subarrays are activated, which represents a very small portion of the total cell arrays. In the case of the 256-Mb DRAM, the active cell array size is 1/1024 of the total number. This structure results in reduced active current and ground bounce.


6.2.9 Half-voltage Generator

One efficient technique to reduce the memory array operating current is the half-$V_{DD}$ bit-line precharge [33, 34]. During the sensing operation, one bit-line switches from $V_{DD}/2$ to $V_{DD}$ and the other switches to zero. This reduces the power, as well as the peak current, by almost half compared to the full-$V_{DD}$ precharge case. Note that the reduction in peak current leads to suppression of noise. In addition, the precharge time is reduced and the cycle time is shortened. This precharging technique has been used starting from the 1-Mb DRAM generation.

A simple circuit which permits the generation of this half-$V_{DD}$ is shown in Fig. 6.45. The HVG circuit is composed of two stages. One stage is a bias generator which generates two voltage levels: $(V_{DD}/2 + V_T)$ and $(V_{DD}/2 - V_T)$. The second one is the push-pull output stage, which generates the level $V_{DD}/2$ distributed to the memory array. The load capacitance seen by the push-pull output stage is huge. A typical value is a few tens of nF. A typical response time when the circuit is powered up is a few tens of μs at a 3.3 V power supply voltage for a 16-Mb DRAM. This HVG circuit has many disadvantages such as



duty ratio of the HVGE signal in the data-retention mode. To solve the other problems cited, an HVG circuit was proposed in [28], but this circuit dissipates a DC current.

6.2.10 Back-Bias Generator

The back-bias voltage $V_{BB}$ is utilized in a DRAM to reduce the subthreshold current and the junction capacitances, to improve device isolation, to enhance latch-up immunity, and to protect the circuit against voltage undershoots of the input signals. This voltage can also be used to compensate for some device parameter variations.

For NMOS devices with a P-well (substrate), a negative $V_{BB}$ is generated by pumping electrons out of the ground node and into the substrate. A typical $V_{BB}$ generator configuration is shown in Fig. 6.47. This circuit is known as a charge pump. The node A oscillates between $V_T$ and $(V_T - V_{DD})$. During the high side of the cycle, the node A must be at least at $V_T$ to pump the charge from the ground. On the low side of the cycle, the node A must be a $V_T$ drop below $V_{BB}$. The output node $V_{BB}$ stabilizes at a voltage level equal to $(2V_T - V_{DD})$, since the load capacitance is huge. The clock (clk) is generated by a ring oscillator with N stages (N is an odd number). The frequency of oscillation, f, is approximately $1/(2Nt_d)$, where $t_d$ is the delay of one inverter. The buffer is needed to drive the huge $C_p$ capacitance. The average current pumped out of the substrate is approximated by

$$I_{pump} = (V_{BB} - V_{BBmin})\,C_p\,f \qquad (6.14)$$

where $V_{BBmin}$ is the back-bias voltage when no current is pumped and is equal to $(2V_T - V_{DD})$ (optimum value). During start-up a large current is pumped, equal to $(-V_{BBmin}\,C_p\,f)$.
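A minimal numerical sketch of the charge pump follows, assuming the relations $f \approx 1/(2Nt_d)$, $V_{BBmin} = 2V_T - V_{DD}$, and $I_{pump} = (V_{BB} - V_{BBmin})\,C_p\,f$. The stage count, inverter delay, pump capacitance, and threshold voltage are hypothetical values chosen only to show the orders of magnitude.

```python
def ring_osc_frequency(n_stages, t_d):
    """Approximate ring-oscillator frequency f = 1 / (2 * N * t_d); N must be odd."""
    if n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages")
    return 1.0 / (2 * n_stages * t_d)

def pump_current(v_bb, v_bb_min, c_p, f):
    """Average pumped current, I_pump = (V_BB - V_BBmin) * C_p * f."""
    return (v_bb - v_bb_min) * c_p * f

VDD, VT = 5.0, 1.0               # supply and threshold (V), VT assumed
V_BB_MIN = 2 * VT - VDD          # steady-state level, -3 V here
f = ring_osc_frequency(5, 10e-9) # 5 stages, 10 ns per inverter (assumed)
C_P = 10e-12                     # pump capacitance (F), assumed

i_start = pump_current(0.0, V_BB_MIN, C_P, f)      # start-up, V_BB = 0
i_final = pump_current(V_BB_MIN, V_BB_MIN, C_P, f) # steady state
print(f"f = {f / 1e6:.0f} MHz, start-up current = {i_start * 1e3:.2f} mA, final = {i_final:.1f} A")
```

The pumped current is largest at start-up and falls to zero as $V_{BB}$ approaches $V_{BBmin}$, which is why lowering f in data-retention mode costs little once the level is established.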

Another version of the charge-pump circuit, using PMOS devices, is shown in Fig. 6.48. Since the gate voltage of P1 only reaches $-V_{DD}$, $V_{BB}$ is pumped to a limit of $(V_T - V_{DD})$. For $V_{DD}$ = 5 V, the NMOS and PMOS charge pump circuits generate typical voltages of -3 and -4 V, respectively. For a 3.3 V power supply, the PMOS version can generate a negative voltage of -2.5 V, which is lower than the one generated by the NMOS version at this power supply.
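The two pumping limits can be compared with a small sketch, assuming the expressions $(2V_T - V_{DD})$ for the NMOS pump and $(V_T - V_{DD})$ for the PMOS pump, and a single illustrative threshold magnitude of 1 V with the body effect ignored. Both function names are hypothetical.

```python
def nmos_pump_limit(vdd, vt):
    """Steady-state V_BB of the NMOS pump: two threshold drops are lost."""
    return 2 * vt - vdd

def pmos_pump_limit(vdd, vt):
    """Steady-state V_BB of the PMOS pump: only one threshold drop is lost."""
    return vt - vdd

VT = 1.0  # threshold magnitude (V), assumed and treated as bias-independent
for vdd in (5.0, 3.3):
    print(f"VDD = {vdd} V: NMOS pump {nmos_pump_limit(vdd, VT):.1f} V, "
          f"PMOS pump {pmos_pump_limit(vdd, VT):.1f} V")
```

With these assumptions the 5 V case reproduces the -3 V and -4 V figures. At 3.3 V the PMOS pump keeps a one-threshold advantage, and the -2.5 V value quoted in the text would correspond to a threshold somewhat below 1 V.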

Fig. 6.49 shows a pumping circuit which avoids the $V_T$ losses and hence is suitable for low-voltage operation [35]. When the clock (clk) is low, the voltage of node A reaches $(|V_{Tp}| - V_{DD})$, and the PMOS transistor P1 clamps the voltage of node B to the ground level. The $V_{BB}$ level is, in that case, $(|V_{Tp}| - V_{DD} - V_{Tn})$. When clk goes to a high level, the voltage of A rises to $|V_{Tp}|$ and the voltage of B, by capacitive coupling, becomes $-V_{DD}$, causing $V_{BB}$ to be equal to $-V_{DD}$. Therefore $V_{BB}$ will be

$$V_{BB} = \max\{-V_{DD},\; |V_{Tp}| - V_{DD} - V_{Tn}\} \qquad (6.15)$$

This circuit needs a special triple-well structure to avoid minority carrier injection from the NMOS transistor N, as discussed in [35].

To reduce the power dissipation of the BBG circuit while the DRAM is not in an active mode, the BBG can be operated at low frequency. Fig. 6.50 shows a simplified circuit diagram of the BBG circuits for low-power operation [36]. In the normal mode, the ring oscillator works all the time to retain the $V_{BB}$ level. In the data-retention mode, the BBG Enable (BBGE) signal is clocked with a low duty ratio. The ring oscillator then operates at low frequency to refresh the pumping circuit.

6.2.11 Boosted Voltage Generator

A boosted level circuit is needed to generate a voltage level above $V_{DD}$ by at least $V_T$. The word-line driver is powered with this voltage, $V_{CH}$. A simple boosted voltage generator is shown in Fig. 6.51(a). It uses the charge pump circuit technique discussed in Section 6.2.10. The output of this circuit switches between $(V_{DD} - V_T)$ and $(2V_{DD} - V_T)$. The clock φ is generated by a simple ring oscillator. Another circuit, which switches between $V_{DD}$ and $2V_{DD}$, is shown in Fig. 6.51(b). It uses two non-overlapping clock phases. This second circuit configuration uses feedback NMOS devices, N1 and N2, to eliminate the threshold voltage loss and boost the output to a higher voltage. This circuit is not sensitive to power supply voltage reduction.

The boosted level cannot be used directly to drive the load. Thus a pass transistor is needed to isolate the switching boosted level from the load, as shown in the example circuit of Fig. 6.52(a) [28]. The charge pump circuit CP1 generates, at node A, a boosted signal switching between $V_{DD}$ and $2V_{DD}$. To control the pass transistor N, two pump circuits, CP2 and CP3, and an inverter INV are needed. The pump circuit CP2 generates, at node B, a signal switching between $2V_{DD}$ and $3V_{DD}$ and uses the boosted voltage $V_{CH}$. The other pump circuit, CP3, controls the inverter INV. The output of this inverter (node D) switches between $V_{DD}$ and $3V_{DD}$. The output of this BVG circuit is $V_{CH} = 2V_{DD}$, and it is stable since the load capacitance is large. The voltage waveforms are shown in Fig. 6.52(b). This circuit is insensitive to $V_{DD}$ reduction and can work down to a sub-1 V power supply.

6.2.12 Self-Refresh Technique

Standard DRAMs require an external DRAM controller to control the refresh process of the memory cells. The stored charge in the memory cell decreases due to the leakage current, at a high rate at high temperature. The refresh time (period) $t_{refresh}$ is determined from the time needed for the stored charge in the memory cell to keep enough margin against leakage at high temperature. This indicates that $t_{refresh}$ can be lower than what is expected at room temperature. One way to increase this time, and hence reduce the data-retention power dissipation, is to control the refresh period as a function of the chip temperature. Fig. 6.53 shows an on-chip self-refresh control circuit with a memory-cell leakage monitoring scheme. A refresh clock $\phi_{refresh}$ is generated automatically with a period of $t_{refresh}$. The monitor cell, which has a leakage current $I_L$, controls the refresh period. Initially node A is high, the NMOS transistor N is OFF, and node B is low. When the charge on node A has decreased to the point that the PMOS transistor P turns ON, node B rises. Then, during the time τ, a high pulse is generated at node C, which in turn charges node A up to the high level.


6.2.13 Low-Voltage DRAM Operation and Circuitry

Low-voltage operation is required to reduce the power dissipation and to assure the reliability of deep-submicrometer MOS devices in future DRAMs. The power supply voltage can be as low as 1 V to meet the requirement of battery operation for portable applications. To get high performance in a high-density DRAM at low supply voltage, the threshold voltage of the MOS devices should be reduced. This results in an increased subthreshold current, and hence circuit techniques are needed to reduce the standby current. In this section, circuit techniques to reduce the subthreshold current for the DRAM array circuits (equalizer, precharge and sense amplifier), the memory-cell access, and the word-line driver are described.

6.2.13.1 DRAM Array Circuits

Fig. 6.54 shows the conventional DRAM array circuit with the half-$V_{DD}$ bit-line precharging technique. This circuit has already been discussed in Section 6.2.3. When $V_{DD}$ is scaled down, this half-$V_{DD}$ scheme causes several problems with respect to the CMOS latch-type SA and the equalizer. For example, for the NMOS transistor $N_{S1}$ of the N-type SA (N-SA), the following problem can exist. When the signal $\phi_{sn}$ is pulled down during the readout operation, the sensing operation starts when the voltage $V_{GS1}$ [see Fig. 6.54] becomes larger than the $V_T$ of the NMOS transistor of the SA. However, if $V_{DD}/2$ is low enough, approaching the value of $V_T$, then the sensing operation is very slow due to the low value of $V_{GS1}$. Note that $V_T$ is subject to the body effect while the common source of the N-SA is falling to ground.

Another problem arises during the equalization period. The equalization is carried out by the NMOS device $N_{EQ}$ when the signal $\phi_P$ is activated. In the final stage of equalization, the drive current of the NMOS equalizer decreases drastically, particularly when $V_{DD}/2$ is not higher than $V_T$. Note that the threshold voltage of the equalizer is also subject to the body effect.

One solution to these problems is the use of low-$V_T$ devices in the DRAM array for the CMOS SA, precharge and equalizing circuits. However, this leads to a drastic increase in the leakage current during the active period. The leakage current paths are shown in Fig. 6.55. To significantly reduce this leakage current, the Well-Synchronized Sensing and Equalizing (WSSE) concept was proposed [37]. It is based on the following two concepts:

■ The voltage levels of the transistor sources and the well are equalized during the sensing, the restoring, and the equalizing periods. This eliminates the body effect.

■ A negative bias, $V_{BB}$, is applied to the P-well, and a positive bias to the N-well, during the active period. Thus, the leakage current is reduced because $V_T$ increases due to the body effect.

Fig. 6.56(a) shows the WSSE circuits using a triple-well structure. The N-well and the P-well control voltages, $V_{WN}$ and $V_{WP}$, respectively, are controlled by a special logic. Fig. 6.56(b) illustrates the voltage waveforms. Before the word-line is activated, the bit-lines, $\phi_{sn}$ and $\phi_{sp}$ are equalized to half-$V_{DD}$. The P-well and N-well levels are precharged to $(V_{DD}/2 - V_{Tn})$ and $(V_{DD}/2 + |V_{Tp}|)$, respectively. These voltage levels avoid any forward-biasing of the drain-well junctions during the initial time after WL activation. During this initial time, one bit-line is different from $V_{DD}/2$. In the sensing and restoring period, the signals $\phi_{sn}$ and $V_{WP}$ are pulled down while the signals $\phi_{sp}$ and $V_{WN}$ are pulled up; each pair is synchronized. After this period, the bit-lines BL and $\overline{BL}$ are in full-swing condition. Then, the level $V_{WP}$ is pulled below GND to $V_{BB}$ and isolated from $\phi_{sn}$, while the level $V_{WN}$ is pulled above $V_{DD}$ to a positive bias level and isolated from $\phi_{sp}$.

6.2.13.2 Memory Cell

First, let us discuss the requirements for the memory cell, particularly at low voltage. Fig. 6.57 shows the memory cell in the restoring operation. To restore the high level, $V_{DD}$, from the bit-line to the storage capacitor, the word-line must be boosted to a level $V_{CH}$. This level has the following requirement

$$V_{CH} > V_{DD} + V_T(V_{DD}) + \alpha \qquad (6.16)$$

where α is the voltage margin and $V_T(V_{DD})$ is the threshold voltage of the access NMOS transistor when its source is at $V_{DD}$. Note that the NMOS device has $(V_{DD} + |V_{BB}|)$ as an effective back-bias voltage. For transistor reliability, $V_{CH}$ should be as small as possible. This means that $V_T(V_{DD})$ is required to be small. This threshold voltage is given by

$$V_T(V_{DD}) = V_{T0} + \gamma\left(\sqrt{V_{DD} + |V_{BB}| + 2\phi_F} - \sqrt{2\phi_F}\right) \qquad (6.17)$$

where $V_{T0}$ is the threshold at zero source and substrate bias, γ is the body effect coefficient and $\phi_F$ is the Fermi potential.

Fig. 6.58 shows the unselected memory cell in long cycle operation. The bit-line has completed the sensing operation and is at ground level (GND). In this situation, the memory cell is exposed to the worst-case leakage condition. The charge stored in the cell leaks rapidly due to the subthreshold current. This situation sets the lower limit of the threshold voltage. Note that the access transistor of the memory cell has $|V_{BB}|$ as back-bias voltage. The threshold voltage in this mode is given by

$$V_T(0) = V_{T0} + \gamma\left(\sqrt{|V_{BB}| + 2\phi_F} - \sqrt{2\phi_F}\right) \qquad (6.18)$$

To meet these two requirements on the threshold voltage, the substrate should have a sufficient back-bias voltage to suppress the body effect.

For example, when the internal supply voltage is $V_{DD}$ = 1.5 V, $V_{BB}$ is set to -1 V. $V_T$(1.5 V)* is 1 V, $V_T(0)$ is 0.75 V, and S = 90 mV/decade. Therefore, the leakage current of a transistor with W = 1 μm is about 10 fA. In this case, $V_{CH}$ must be larger than $(V_{DD} + V_T(V_{DD}))$, which is 3 V.

*Extrapolated threshold voltage.

When the $V_T$ of the memory cell is reduced, the leakage current increases drastically. The concept of Boosted Sense Ground (BSG) [38] was proposed to shut down the subthreshold current in the memory cell access transistor. This is achieved by slightly boosting the low-level voltage of the bit-line. This level is called the BSG level, and is set at 0.5 V. During a long cycle operation, the gate-source voltage of an unselected cell is negative (-0.5 V), so the subthreshold current is reduced by 6 orders of magnitude (for S = 80 mV/decade). Fig. 6.59 shows the BSG circuit applied to a memory cell. The BSG line is common to all N-channel sense amplifiers. The BSG level is generated by a circuit similar to the VDC circuit [see Section 6.3]. In active mode, the differential amplifier and N1 are activated and the voltage of the sense ground becomes the BSG level. The N2 transistor has a large width and is activated by the signal SE at the beginning of the sensing period to suppress an unnecessary rise in the BSG level caused by the sensing current. In the standby mode, the differential amplifier is made inactive to reduce the standby current, as are N1 and N2. The BSG level is then clamped to the threshold voltage of the clamping NMOS transistor. Note that the boosted level, $V_{CH}$, is reduced compared to the conventional scheme because $V_T$ is reduced.
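The six-orders-of-magnitude figure follows directly from the subthreshold swing: each S volts of gate-source reduction lowers the current by one decade. The sketch below checks this, assuming the simple exponential model $I \propto 10^{V_{GS}/S}$; the function name is illustrative.

```python
import math

def subthreshold_reduction(delta_vgs, swing):
    """Factor by which the subthreshold current drops when V_GS is lowered
    by delta_vgs (V), for a subthreshold swing given in V/decade."""
    return 10 ** (delta_vgs / swing)

factor = subthreshold_reduction(0.5, 0.080)  # BSG level 0.5 V, S = 80 mV/decade
print(f"reduction: {math.log10(factor):.2f} decades (factor {factor:.2e})")
```

With a 0.5 V BSG level and S = 80 mV/decade the reduction is 0.5/0.08 = 6.25 decades, consistent with the text. With S = 90 mV/decade the same level would give about 5.6 decades.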

Figure 6.59 Boosted Sense Ground (BSG) circuit

6.2.13.3 Word-Line Driver

Scaling the threshold voltage down increases the subthreshold current of a DRAM, particularly for iterative circuits such as word-line drivers or decoders. If the DRAM is divided into k blocks and each block has n drivers, then the total number of word drivers is k·n. Fig. 6.60 shows an example of DRAM drivers. During the active mode, one driver out of the k·n drivers is selected by the row decoder, and the word-line is at the boosted level $V_{CH}$, generated by the internal charge pump circuit.

When the threshold voltage is low, the subthreshold current of each driver is significant. Then, for a DRAM, the total subthreshold current of the drivers is

$$I_{sub,dr} = k \cdot n \cdot I_{sub} \qquad (6.19)$$

where $I_{sub}$ is the subthreshold current of the NMOS and PMOS transistors (assumed the same). For a high-capacity DRAM, the current $I_{sub,dr}$ would be huge. For example, if a multi-giga-bit DRAM has 1 million drivers, and each driver has a subthreshold current of about 10 nA at room temperature, then the total subthreshold current would be 10 mA. At 75 °C, this current can be hundreds of mA. This high DC current destroys the $V_{CH}$ level because the charge pump circuit cannot handle such a DC current. Note that this current should always be evaluated in the worst case: maximum temperature and the lowest value of $V_T$. In the standby mode, all the drivers are turned OFF. The current $I_{sub,dr}$ is still the same.
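The arithmetic of the example can be reproduced with a one-line model, assuming the total driver leakage is simply the product of the number of drivers and the per-driver subthreshold current. The split of one million drivers into k = 1000 blocks of n = 1000 drivers is an arbitrary assumption made for illustration.

```python
def total_driver_leakage(k, n, i_sub):
    """Total subthreshold current (A) of k blocks of n word-line drivers each."""
    return k * n * i_sub

# One million drivers at about 10 nA each (room-temperature figure from the text)
i_total = total_driver_leakage(1000, 1000, 10e-9)
print(f"total subthreshold current = {i_total * 1e3:.0f} mA")
```

Because the result scales linearly with both the driver count and the per-driver leakage, any rise of the per-driver current with temperature or with a lower $V_T$ translates directly into the total.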

To solve this problem, the Self-Reverse-Biasing (SRB) scheme can be used [24]. This concept has already been discussed in Section 4.10 [Chapter 4]. Fig. 6.61 shows the application of the SRB scheme to word-line drivers. During the active mode, the control signal φ is low and the node SL is equal to $V_{CH}$. Only one word-line is selected. When φ goes high (standby mode), the PMOS device $P_S$ limits the subthreshold current. In this mode, all drivers are OFF, even the selected one. Fig. 6.62 shows the technique to turn off the selected driver in standby mode. When φ is low, node $A_i$ is high, and the selected word driver is then low.

One problem associated with the SRB scheme is that during the active mode, after one selected word-line driver is activated, all the other drivers are leaking, thereby contributing substantially to the active current. This problem is solved by the partial activation of a hierarchical power-line scheme [39]. Fig. 6.63 shows the principle of the 2-D selection scheme. In this scheme, the array of k blocks by n drivers is divided into k sub-blocks in columns and l sub-blocks in rows. The total number of sub-blocks, each containing a set of drivers, is k x l. During the active mode, only one sub-block is activated. Thus the subthreshold current in the active mode is drastically reduced.
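A rough estimate of the benefit of 2-D selection is sketched below. It assumes that only the drivers in the single activated sub-block leak at the full subthreshold current, and that the residual leakage of the non-activated sub-blocks is negligible, which is an idealization. The sub-block counts (16 by 16) and the driver figures are hypothetical.

```python
def active_leakage_2d(total_drivers, k, l, i_sub):
    """Active-mode driver leakage (A) when only one of k * l sub-blocks is activated.

    Idealized: non-activated sub-blocks are assumed to contribute nothing.
    """
    drivers_per_subblock = total_drivers / (k * l)
    return drivers_per_subblock * i_sub

TOTAL = 1_000_000  # word-line drivers (assumed)
I_SUB = 10e-9      # subthreshold current per driver (A)

flat = TOTAL * I_SUB                               # all drivers leaking
selected = active_leakage_2d(TOTAL, 16, 16, I_SUB) # one sub-block out of 256
print(f"all drivers: {flat * 1e3:.0f} mA, one sub-block: {selected * 1e6:.0f} uA")
```

Under these assumptions the active leakage falls by the number of sub-blocks, here a factor of 256.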

6.3 ON-CHIP VOLTAGE DOWN CONVERTER

Chip makers prefer to scale down VDD to enhance the device reliability, while the users prefer it the s a m e power supply voltage and dislike the frqumt changes. The reduction of VOD is &o important to achieve low-power characteristic. The strategy to meet these cantrildictory requirements is to use an on-chip Voltage Down Convwter (VDC). A VDC can be used to convert the old power supply voltage standard of 5 V to 3.3 V to power CMOS circuits using 0.5 p n and sub-0.5 pm technology. For the state-of-the-art 0.25 fim (SMOS technology, the power snpply voltage must be 2.5 V. However, the new standard is becoming 3.3 V and is likely to stay that way for many years. Thus a 3.312.5-V VDC is required.

On-chip VDCs are used for DRAMs as well as SRAMs, ASICs and digital processors. They are employed in commercial 16-Mb DRAMs to reduce the external 5 V to an internal voltage of 3.3 V. For SRAMs, they have not been as commonly used as in DRAMs, particularly in commercial ones. SRAMs can operate over a wide range of power supply voltages. Moreover, they already have a low data retention current, sufficient for battery-operated applications. In this section, we discuss the VDC circuit techniques for DRAMs, which are basically the same as for SRAMs and other circuits.

Numerous papers have reported designs of the VDC circuit for a DRAM [32, 40, 41, 42, 43, 44, 45] and for an SRAM [46]. Fig. 6.64 shows one approach using a VDC to reduce the internal voltage for a DRAM. The memory cell array and the periphery circuits are powered from the internal supply voltage, while the I/O buffers are powered with the external voltage to maintain compatibility. However, the VDC, in this situation, should be stable when supplying a large current to the periphery and the memory array. When the VDC is used for battery-operated applications, the standby current should be less than 1 µA over a wide range of temperature (0-70 °C).

Figure 6.62 Detail of word-driver with voltage shifter.

Fig. 6.65 shows a schematic of the VDC structure for a DRAM, used to convert 5 V to 3.3 V. It is composed of a Reference Voltage ($V_{ref}$) Generator (RVG), a driver circuit and a time-dependent load. The buffer circuit consists of a differential amplifier [Fig. 6.66] and a common-source drive PMOS transistor $P_b$. The load current has a peak, for the memory array, of more than 100 mA in 10-30 ns, and of more than 100 mA in a few ns for the periphery circuit. To deliver such a large current, the width of the PMOS $P_b$ of the output stage should be large. Moreover, when the output current changes rapidly, the output voltage $V_{DD}$ decreases by $\Delta V_{DD}$. To minimize $\Delta V_{DD}$, the gate control voltage, $V_G$, has to change quickly. This is possible by increasing the differential amplifier tail current, $I_s$. The current source, $I_b$, is needed to clamp the output voltage $V_{DD}$ when the load current becomes almost zero.


Figure 6.66 Schematic of the differential amplifier.

A VDC circuit is one of the keys to achieving a DRAM with a data-retention current low enough for battery-based applications. The requirements for low power are the following:

- The standby current must be less than 1 µA over a wide range of temperature, process and power supply voltage variations; and

- The output impedance of the VDC should be low.

6.3.1 Driver Design Issues

The internal voltage generated by the VDC can have several sources of fluctuation: DC changes in the reference voltage due to process and temperature variations, and transient variations caused by noise in the external power supply and by the load current. The variation of the internal voltage with respect to the reference voltage should be less than 3%. The variation with respect to the load has to be less than 10%, and with respect to the power supply less than 1%.

The stability of this circuit is essential for the operation of the VDC. To study the stability, ac small-signal analysis is carried out. Fig. 6.67 shows the simplified equivalent circuit using the MOS small-signal techniques [47]. The gate capacitance of the output PMOS, $C_G$, is large and is taken into account. $g_{m1}$ and $g_{m2}$ are the transconductances of the differential amplifier and the output stage, respectively. $r_1$ and $r_2$ are their respective equivalent output resistances. $C_L$ is the output load capacitance, composed of the wire capacitance $C_w$¹ and the switched capacitance of the memory core $C_m$². The frequency response of this circuit is expressed by

$$A(s) = \frac{g_{m1} r_1 \, g_{m2} r_2}{(1 + s C_G r_1)(1 + s C_L r_2)} \qquad (6.20)$$

The circuit has two poles: $p_1 = 1/(C_G r_1)$ for the differential amplifier and $p_2 = 1/(C_L r_2)$ for the output stage. The two poles must be sufficiently separated from each other to assure a good phase margin [48]. For a DRAM application, the pole $p_2$ varies drastically because of the load variation. Thus, the circuit can fail to ensure a sufficient phase margin, and hence it can generate ringing or oscillation. Therefore, phase compensation has to be applied. One possible compensation technique, shown in Fig. 6.68(a), is called the Miller compensation technique. The compensation capacitor $C_c$ is connected between the input and the output of the second stage. It shifts the pole $p_1$ towards a lower frequency $p_1'$, as shown in Fig. 6.68(b). Thus, the phase margin is improved.

¹ A typical value of $C_w$ is 100 pF.
² A typical value of $C_m$ is 1200 pF.

The stabilization condition is defined at the point of 0 dB loop gain, where the phase margin must be larger than 45 degrees. Using small-signal analysis with the compensation capacitor $C_c$, the condition can be extracted. This capacitor is a function of $g_{m2}$, $g_{m1}$, $C_L$ and $C_G$. To determine it, $g_{m2}$ has to be known, using large-signal analysis. The PMOS driver $P_b$ has to be sized to satisfy the condition on $\Delta V_{DD}/V_{DD}$ (less than 10%) under the transient load current variation. Hence $g_{m2}$ can be determined from the size of $P_b$. For a 16-Mb DRAM, the width of the output PMOS $P_b$ can be as high as 30,000 µm, with $C_c$ equal to 200 pF. This is for 3.3 V internal power supply generation from 5 V.
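The 45-degree criterion can be explored numerically with a generic two-pole loop model. The sketch below is an illustration under stated assumptions, not the analysis of the cited papers: the loop gain is taken as $a_0 / ((1 + s/p_1)(1 + s/p_2))$, the DC gain and pole frequencies are hypothetical, and the compensation is modeled only as a lowering of the first pole (pole splitting of $p_2$ and the Miller zero are ignored).

```python
import math


def phase_margin_deg(a0, p1, p2):
    """Phase margin of the loop gain a0 / ((1 + s/p1)(1 + s/p2)).

    a0 is the DC loop gain (> 1); p1 and p2 are pole frequencies in the
    same unit (rad/s or Hz). Only their ratio matters for the result.
    """
    r = p2 / p1
    # Unity-gain condition with u = w / p1:
    #   (1 + u^2) * (1 + u^2 / r^2) = a0^2, a quadratic in x = u^2.
    qa = 1.0 / (r * r)
    qb = 1.0 + 1.0 / (r * r)
    qc = 1.0 - a0 * a0
    x = (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)
    u = math.sqrt(x)
    return 180.0 - math.degrees(math.atan(u) + math.atan(u / r))


# Hypothetical values: 40 dB DC gain, poles at 1 MHz and 2 MHz.
uncompensated = phase_margin_deg(100.0, 1.0e6, 2.0e6)
# Same loop with the first pole moved down by a factor of 100.
compensated = phase_margin_deg(100.0, 1.0e4, 2.0e6)
print(f"uncompensated: {uncompensated:.1f} deg, compensated: {compensated:.1f} deg")
```

With closely spaced poles the margin falls well below 45 degrees, while separating them restores it, which is the qualitative point made in the text.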

The tail current of the differential amplifier can be high (a few mA) in active mode. In standby mode, the driver can be deactivated by the Chip Select (CS) signal so that it consumes only a very small current. In this case, the internal voltage can be supplied by a low-power voltage follower [46]. The voltage follower has the same configuration as the driver, but its tail current is in the sub-µA range.

6.3.2 Reference Voltage Generator

The Reference Voltage Generator (RVG) must provide high accuracy over a wide variation of $V_{DD}$, process, and temperature. So far, RVGs have been based on the band-gap reference and on the threshold voltage generator. The former consumes a DC current which is not low enough for low-power applications. The latter is more suitable for a CMOS technology.

Fig. 6.69(a) shows a PMOS-$V_T$ difference generator with an output voltage $\Delta V_T = |V_{Tp1}| - |V_{Tp2}|$ ($V_{Tp1} < V_{Tp2} < 0$). The equivalent circuit is shown in Fig. 6.69(b). This circuit needs a PMOS device with a high threshold voltage. A typical value for the threshold voltage difference is around 1 V. PMOS transistors are chosen for the threshold voltage difference generator because they are in N-wells, and therefore the difference is independent of the back-bias ($V_{BB}$). The circuit of Fig. 6.69(a) does not suffer much from $V_{DD}$ bounce. The temperature dependency of the $V_T$ difference is expressed by [49]

(6.21)

where $N_{s1}$ and $N_{s2}$ are the surface impurity concentrations of $P_1$ and $P_2$, respectively. For a temperature-stable design, the concentration ratio $N_{s1}/N_{s2}$, and therefore the threshold voltage difference, should not be excessively large. A typical value of the temperature dependency is 0.4 mV/°C, which is small for the VDC circuit.

Since $\Delta V_T$ is around 1 V, the circuit of Fig. 6.70 is used to convert this difference to the required internal supply voltage. The voltage-up converter amplifies $\Delta V_T$ to:

$$V_{ref} = \Delta V_T \left(1 + \frac{R_2}{R_1}\right) \qquad (6.22)$$

The mismatch between the two PMOS devices $P_1$ and $P_2$ of Fig. 6.69 can be minimized by using large channel widths and lengths. Still, the deviation in $V_T$ due to the fabrication process has to be eliminated. This can be done by using a fuse trimming technique to control the ratio of the resistors $R_1$ and $R_2$. The total current consumed by this RVG circuit is

$$I_{total} = 3I + I_s + I_R \qquad (6.23)$$

where $3I$ is the current consumed by the voltage regulator [see Fig. 6.69(a)] and $I_s$ is the current of the differential amplifier. $I_R = V_{ref}/(R_1 + R_2)$ is the current of the output stage. $I$ can be made smaller than 1 µA; however, $I_s$ and $I_R$ cannot be made similarly small. The resistor is implemented, for example, using doped polysilicon. Typical values of the resistances are of the order of 100 kΩ. They cannot be increased excessively; otherwise the area of the RVG becomes significantly large. Moreover, the substrate noise can affect the reference voltage through the coupling capacitances of the resistors. The total current of this type of RVG is of the order of a few to tens of µA.

To reduce the current of the RVG to the sub-µA range for battery-operated DRAMs, the concept of a dynamic RVG can be used [50], as shown in Fig. 6.71. A PMOS transistor with a low $|V_T|$ is used. During the sampling period ($\phi_1$ is high), all switches $S_1$-$S_4$ are closed. The threshold voltage difference, $\Delta V_T$, between the two PMOS devices, $P_1$ and $P_2$, appears across the resistor $R_R$. If the transistor dimensions of the pairs $P_1$ and $P_2$, and $N_1$ and $N_2$, are identical, the current is given by

$$I_s = \frac{\Delta V_T}{R_R} \qquad (6.24)$$

This current is mirrored to the output node. If the dimension of the output transistor $P_3$ is identical to that of $P_2$, the output voltage $V_{ref}$ is given by

$$V_{ref} = \Delta V_T \, \frac{R_L}{R_R} \qquad (6.25)$$

This shows that the reference voltage can be adjusted to any voltage. Moreover, with a trimming technique, $V_{ref}$ can be adjusted against process variation effects ($\Delta V_T$ variation). The output voltage is sampled on the hold capacitor $C_H$. When $\phi_1$ is low, the circuit is in hold mode. Clock $\phi_2$ is delayed with respect to clock $\phi_1$ to minimize fluctuation of the output voltage. These clocks are generated from the self-refresh clock circuit in a DRAM. The circuit consumes a DC current only when $\phi_1$ is applied. The average current consumed by this circuit is

$$I_{av} = 3 I_s \gamma_\phi = 3\,(\Delta V_T / R_R)\,\gamma_\phi \qquad (6.26)$$

where $\gamma_\phi$ is the duty ratio of $\phi_1$. The current of this circuit can be reduced to a low level in the sub-µA range by controlling the duty ratio. For example, to generate a reference voltage of 2.4 V from an external power supply voltage of 3.3 V, $R_R$ and $R_L$ are 9 kΩ and 72 kΩ, respectively. $\Delta V_T$ has a typical value of 0.3 V. The total DC current is 100 µA. So with a duty ratio lower than 1/100, the average current can be reduced below 1 µA. It can easily be shown that this circuit has a low sensitivity to power supply voltage and temperature variations.
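The numerical example can be checked with a few lines of arithmetic. The sketch below simply evaluates Eqs. (6.24)-(6.26); the function name is hypothetical, and the load resistor is taken as 72 kΩ, the value that yields 2.4 V from $\Delta V_T$ = 0.3 V and $R_R$ = 9 kΩ.

```python
def dynamic_rvg(delta_vt, r_r, r_l, duty):
    """Return (I_s, V_ref, I_avg) for the sampled reference generator."""
    i_s = delta_vt / r_r            # Eq. (6.24): current set by delta-V_T across R_R
    v_ref = delta_vt * r_l / r_r    # Eq. (6.25): mirrored current flowing in R_L
    i_avg = 3.0 * i_s * duty        # Eq. (6.26): three branches, on only while sampling
    return i_s, v_ref, i_avg


i_s, v_ref, i_avg = dynamic_rvg(delta_vt=0.3, r_r=9e3, r_l=72e3, duty=1 / 100)
print(f"I_s = {i_s * 1e6:.1f} uA, V_ref = {v_ref:.2f} V, I_avg = {i_avg * 1e6:.2f} uA")
```

At a duty ratio of exactly 1/100 the average current sits at the 1 µA boundary, so any lower duty ratio brings it into the sub-µA range.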

6.4 CHAPTER SUMMARY

Low-power architectures and circuit techniques for SRAMs, DRAMs and VDCs were reviewed. The obvious technique to reduce the power dissipation is voltage scaling. The reduction of the power supply voltage to the 1-V and sub-1-V range requires new circuit innovations and breakthroughs, particularly when low-threshold-voltage devices are used. It was shown that not only power supply voltage scaling contributes to the reduction of power consumption, but also the reduction of capacitances and DC currents using sophisticated techniques. Many of the techniques presented for memories can be useful in other applications such as ASICs, DSPs, etc. Design issues for stable operation of a VDC and low-standby-current techniques were investigated.

REFERENCES

[1] H. Tran et al., "An 8-ns 1-Mb ECL BiCMOS SRAM with a Configurable Memory Array Size," International Solid-State Circuits Conf. Tech. Dig., pp. 36-37, February 1989.

[2] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International Solid-State Circuits Conf. Tech. Dig., pp. 38-39, February 1989.

[3] Y. Maki et al., "A 6.5-ns 1 Mb BiCMOS ECL SRAM," International Solid-State Circuits Conf. Tech. Dig., pp. 136-137, February 1990.

[4] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.

[5] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free Redundancy," in Symp. VLSI Circuits Conf. Tech. Dig., pp. 41-42, May 1990.

[6] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing Circuit," International Solid-State Circuits Conf. Tech. Dig., pp. 54-55, February 1991.

[7] K. Sasaki et al., "A 7-ns 140-mW 1-Mb CMOS SRAM with Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1511-1518, November 1992.

[8] T. Ootani et al., "A 4-Mb CMOS SRAM with a PMOS Thin-Film Transistor Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1082-1092, October 1990.

[9] S. Murakami et al., "A 21-mW 4-Mb CMOS SRAM for Battery Operation," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1563-1570, November 1991.

[10] K. Sasaki et al., "16-Mb CMOS SRAM with a 2.3-µm² Single-Bit-Line Memory Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1125-1130, November 1993.


[11] M. Matsumiya et al., "A 15-ns 16-Mb CMOS SRAM with Interdigitated Bit-Line Architecture," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1497-1503, November 1992.

[12] K. Seno et al., "A 9-ns 16-Mb CMOS SRAM with Offset-Compensated Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1119-1124, November 1993.

[13] E. Seevinck, F. J. List, and J. Lohstroh, "Static-Noise Margin Analysis of MOS SRAM Cells," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 748-754, October 1987.

[14] H. Kato et al., "Consideration of Poly-Si Loaded Cell Capacity Limits for Low-Power and High-Speed," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 683-685, April 1992.

[15] K. Sasaki et al., "A 23-ns 4-Mb CMOS SRAM with 0.2-µA Standby Current," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1075-1081, October 1990.

[16] K. Ishibashi, T. Yamanaka, and K. Shimohigashi, "An α-Immune, 2-V Supply Voltage SRAM using a Polysilicon PMOS Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 55-60, February 1990.

[17] K. Sasaki et al., "A 15-ns 1-Mbit CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1067-1072, October 1988.

[18] K. Sasaki et al., "A 9-ns 1-Mbit CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1219-1225, October 1989.

[19] K. Ishibashi, K. Takasugi, T. Yamanaka, T. Hashimoto, and K. Sasaki, "A 1-V TFT-Load SRAM using a Two-Step Word-Voltage Method," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1519-1524, May 1992.

[20] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S. Kayano, and T. Nakano, "A Divided Word-Line Structure in the Static RAM and its Application to a 64K Full CMOS RAM," IEEE Journal of Solid-State Circuits, vol. SC-18, no. 5, pp. 479-485, October 1983.

[21] T. Hirose, H. Kuriyama, S. Murakami, K. Yuzuriha, T. Mukai, K. Tsutsumi, Y. Nishimura, Y. Kohno, and K. Anami, "A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1068-1074, October 1990.


[22] A. Sekiyama, T. Seki, S. Nagai, A. Iwase, N. Suzuki, and M. Hayasaka, "A 1-V Operating 256-Kb Full-CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 776-782, May 1992.

[23] T. Yabe et al., "High-Speed and Low-Standby-Power Circuit Design of 1 to 5 V Operating 1 Mb Full CMOS SRAM," Symposium on VLSI Circuits Tech. Dig., pp. 107-108, May 1993.

[24] G. Kitsukawa et al., "256-Mb DRAM Circuit Technologies for File Applications," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1105-1113, November 1993.

[25] T. Hasegawa et al., "An Experimental DRAM with a NAND-Structured Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1099-1104, November 1993.

[26] T. Sugibayashi et al., "A 30-ns 256-Mb DRAM with a Multidivided Array Structure," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1092-1099, November 1993.

[27] M. Aoki, J. Etoh, K. Itoh, S-I. Kimura, and Y. Kawamoto, "A 1.5-V DRAM for Battery-Based Applications," IEEE Journal of Solid-State Circuits, vol. 24, no. 6, pp. 1206-1212, October 1989.

[28] Y. Nakagome et al., "An Experimental 1.5-V 64-Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 465-471, April 1991.

[29] H. Yamauchi et al., "A Circuit Technology for High-Speed Battery-Operated 16-Mb CMOS DRAMs," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1084-1091, November 1993.

[30] N. C. C. Lu, "Advanced Cell Structures for Dynamic RAMs," IEEE Circuits and Devices Magazine, no. 1, pp. 21-36, January 1989.

[31] M. Takada, "DRAM Technology for Giga-bit Age," International Conf. Solid State Devices and Materials, Tech. Dig., pp. 874-876, 1993.

[32] K. Itoh et al., "An Experimental 1-Mb DRAM with On-Chip Voltage Limiter," in International Solid-State Circuits Conf., Tech. Dig., pp. 282-283, 1984.

[33] N. C-C. Lu and H. H. Chao, "Half-VDD Bit-Line Sensing Scheme in CMOS DRAMs," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 451-454, August 1984.


[34] H. Kawamoto, T. Shinoda, Y. Yamaguchi, S. Shimizu, K. Ohishi, N. Tanimura, and T. Yasui, "A 288K CMOS Pseudostatic RAM," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 619-625, October 1984.

[35] Y. Tsukikawa et al., "An Efficient Back-Bias Generator with Hybrid Pumping Circuit for 1.5 V DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 85-86, May 1993.

[36] Y. Konishi et al., "A 38-ns 4-Mb DRAM with a Battery-Backup (BBU) Mode," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1112-1117, October 1990.

[37] T. Ooishi et al., "A Well-Synchronized Sensing/Equalizing Method for Sub-1 V Operating Advanced DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 81-82, May 1993.

[38] M. Asakura et al., "An Experimental 256-Mb DRAM with Boosted Sense-Ground Scheme," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1303-1309, November 1994.

[39] T. Sakata et al., "Subthreshold-Current Reduction Circuits for Multi-Gigabit DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 45-46, May 1993.

[40] T. Furuyama et al., "A New On-Chip Voltage Converter for Submicrometer High-Density DRAMs," IEEE Journal of Solid-State Circuits, vol. 22, no. 3, pp. 437-441, June 1987.

[41] M. Takada et al., "A 4-Mb DRAM with Half Internal Voltage Bit-Line Precharge," IEEE Journal of Solid-State Circuits, vol. 21, no. 5, pp. 612-617, October 1986.

[42] M. Horiguchi et al., "Dual-Operation-Voltage Scheme for a Single 5-V 16-Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1128-1132, October 1988.

[43] G. Kitsukawa et al., "A 1-Mb BiCMOS DRAM Using Temperature-Compensation Circuit Techniques," IEEE Journal of Solid-State Circuits, vol. 24, no. 3, pp. 597-602, June 1989.

[44] M. Horiguchi et al., "A Tunable CMOS-DRAM Voltage Limiter with Stabilized Feedback Amplifier," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1129-1135, October 1990.


[45] M. Horiguchi et al., "Dual-Regulator Dual-Decoding-Trimmer DRAM Voltage Limiter for Burn-in Test," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1544-1549, November 1991.

[46] K. Ishibashi, K. Sasaki, and H. Toyoshima, "A Voltage Down Converter with Submicroampere Standby Current for Low-Power Static RAMs," IEEE Journal of Solid-State Circuits, vol. 27, no. 6, pp. 920-926, June 1992.

[47] P. E. Allen and D. R. Holberg, "CMOS Analog Circuit Design," Holt, Rinehart and Winston, 1987.

[48] P. R. Gray and R. G. Meyer, "Analysis and Design of Analog Integrated Circuits," 2nd Edition, Wiley, 1984.

[49] R. A. Blauschild et al., "A New NMOS Temperature-Stable Voltage Reference," IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 767-774, December 1978.

[50] H. Tanaka, Y. Nakagome, J. Etoh, E. Yamasaki, M. Aoki, and K. Miyazawa, "Sub-1-µA Dynamic Reference Voltage Generator for Battery-Operated DRAMs," in Symp. VLSI Circuits, Tech. Dig., pp. 87-88, May 1993.

7 VLSI CMOS SUBSYSTEM DESIGN

In this chapter, we study the application of the circuit techniques developed through Chapter 4 to the implementation of CMOS building blocks such as adders, multipliers, ALUs, data-paths, and regular structures. The power dissipation constraint is also included through the several options presented for each circuit. The use of a Phase-Locked Loop (PLL) in high-speed CMOS systems for deskewing the internal clock is also examined. Low-power issues of the circuits presented are also discussed.

7.1 PARALLEL ADDERS

Parallel adders are the most important elements used in arithmetic operations of microprocessors, DSPs, etc. As in any logic design, they are constrained by parameters such as speed, area, and power dissipation. The adder cell is also an element of multipliers, dividers, multiplier-accumulators (MACs), etc. Among the various adder implementations used in many designs, we can cite the following classes:

- Ripple Carry Adders (RCA);
- Carry Look-Ahead Adders (CLA);
- Carry Select Adders (CS); and
- Conditional Sum Adders (CSA).

This section is devoted to describing all these adder classes.

7.1.1 Ripple Carry Adders

In Chapter 4, a description of the functionality of an adder cell was presented. In an n-bit adder, a propagation of the carry always occurs. This propagation limits the speed of the adder. The simplest way to construct an n-bit adder is to cascade n 1-bit adders, as shown in Fig. 7.1. This adder is called a Ripple Carry Adder (RCA). Because the carry ripples through the n stages, the sum of the nth bit cannot be computed until the carry $C_{n-1}$ is evaluated. The delay of an n-bit addition is given by

$$t_{add} = (n - 1)\,t_c + t_s \qquad (7.1)$$

where $t_c$ is the carry delay and $t_s$ is the sum delay. Since the carry propagation path is a critical stage for the delay, the full-adder cell should be optimized. The sum and carry out are given by

$$S = A \oplus B \oplus C_{in} \qquad (7.2)$$

$$C_{out} = A \cdot B + (A + B) \cdot C_{in} \qquad (7.3)$$

The schematic of Fig. 7.2 can be generated to implement the adder cell efficiently. Compared to the conventional CMOS full-adder implementation, there is no inverter stage. Therefore, the carry delay is reduced. To optimize the cell, the transistors in the carry path, of widths $W_p$ and $W_n$, can be sized up [see Fig. 7.2]. The other devices can be kept small to reduce the load on the carry and the power dissipation. The transistors driven by the carry in, $C_{in}$, are placed close to the output. This reduces the body effect, since the carry signal is the latest one in an adder chain. The schematic of Fig. 7.2 is symmetrical and leads to a better layout and a small area. Since the outputs are complemented, the configuration of Fig. 7.3 can be used to implement an RCA circuit. In this case, many cells use inverted inputs.
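The logic of Eqs. (7.2) and (7.3) and the delay expression of Eq. (7.1) can be captured in a small behavioral model. The sketch below is only a functional illustration with hypothetical helper names; it ignores the complemented outputs and transistor sizing discussed above.

```python
def full_adder(a, b, c_in):
    """One-bit full adder following Eqs. (7.2) and (7.3)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | ((a | b) & c_in)
    return s, c_out


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists; returns (sum_bits, carry_out)."""
    sums = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # the carry ripples to the next stage
        sums.append(s)
    return sums, carry


def rca_delay(n, t_c, t_s):
    """Worst-case delay of Eq. (7.1): (n - 1) carry delays plus one sum delay."""
    return (n - 1) * t_c + t_s


def to_bits(x, n):
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(b << i for i, b in enumerate(bits))


sums, carry = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(sums), carry)
```

Because the delay grows linearly with n, doubling the word length roughly doubles the worst-case addition time, which motivates the faster organizations described in the following sections.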

Note that an n-bit RCA circuit is subject to the glitching problem. Fig. 7.4 shows a static simulation of a 4-bit adder, with the inputs $A_i$ set to zero (0), and the inputs $B_i$ and $C_{in}$ rising from 0 to 1. The outputs $S_i$ should stay at 0; however, due to the delay of the carry signal through the chain of full-adders, the outputs exhibit spurious transitions (glitching). These dynamic transitions dissipate extra power and can represent an important portion of the total power. With careful design this glitching problem can be minimized. One advantage of the RCA is its low-power characteristic. However, its speed is very limited, particularly when the adder is wide.
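The origin of these spurious transitions can be reproduced with a very simple timing model. The sketch below assumes, for illustration only, that every sum and carry output updates exactly one time step after its inputs change; real gate delays differ, so the transition counts are indicative rather than a prediction of Fig. 7.4.

```python
def simulate_rca_glitches(a_bits, b_bits, c_in, steps=None):
    """Unit-delay simulation of a ripple-carry adder from an all-zero state.

    Returns the final sum bits and the number of transitions observed on
    each sum output. Bit lists are little-endian.
    """
    n = len(a_bits)
    steps = steps if steps is not None else n + 2
    carries = [0] * (n + 1)       # carries[i] is the carry into bit i
    sums = [0] * n
    toggles = [0] * n
    carries[0] = c_in             # primary inputs switch at t = 0
    for _ in range(steps):
        # Every output is computed from the values of the previous step.
        new_sums = [a_bits[i] ^ b_bits[i] ^ carries[i] for i in range(n)]
        new_carries = carries[:]
        for i in range(n):
            a, b, c = a_bits[i], b_bits[i], carries[i]
            new_carries[i + 1] = (a & b) | ((a | b) & c)
        for i in range(n):
            toggles[i] += int(new_sums[i] != sums[i])
        sums, carries = new_sums, new_carries
    return sums, toggles


# A_i = 0, B_i and C_in rise from 0 to 1, as in the case described above.
final_sums, toggles = simulate_rca_glitches([0, 0, 0, 0], [1, 1, 1, 1], 1)
print(final_sums, toggles)
```

In this model the sum bits settle back to zero, but every bit above the least significant one rises and falls once while waiting for its carry, and each such pair of transitions charges and discharges the output node for no useful purpose.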

Another efficient full-adder cell is based on Transmission Gates (TGs). Fig. 7.5 shows an optimized version of the full-adder cell using TGs, already discussed in Chapter 4. The carry signal propagates through only one TG. Hence, an n-bit RCA would be faster and more compact than the conventional one. Fig. 7.6 shows the construction of an n-bit adder. In practice, an inverter is added every four stages to reduce the degradation of the carry signal due to the distributed RC effect. When the carry signal is inverted after four 1-bit stages, complementary carry path adders are used for the next 4-bit stages. This adder structure is sometimes called a Manchester adder. This circuit is faster than the RCA and may have lower power dissipation.

7.1.2 Carry Look-Ahead Adders

To avoid the linear growth of the carry delay, we use a Carry Lookahead Adder (CLA), in which the carries can be generated in parallel. The carry of each bit is generated from the propagate and the generate signals ($P_i$, $G_i$) as well as the input carry ($C_0$). The propagate and the generate signals ($P_i$, $G_i$) are derived from the operands $A_i$ and $B_i$ by

$$G_i = A_i \cdot B_i \qquad (7.4)$$

$$P_i = A_i + B_i \qquad (7.5)$$




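The carries of the four stages are given by

$$C_1 = G_0 + P_0 C_0 \qquad (7.6)$$

$$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0 \qquad (7.7)$$

$$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0 \qquad (7.8)$$

$$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0 \qquad (7.9)$$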

Fig. 7.7 shows the block diagram of a 4-bit CLA adder. The carry generator blocks (CLG1 to CLG4) generate the carries $C_1$ to $C_4$, in parallel, from the carry-in signal $C_0$. The different $P_i$ and $G_i$ signals are implemented following the expressions given by Equations (7.4) and (7.5). The sum generator blocks (SG1 to SG4) generate the sums. The sum, $S_i$, is generated by

$$S_i = C_{i-1} \oplus A_i \oplus B_i \qquad (7.10)$$

or

$$S_i = C_{i-1} \oplus P_i \qquad (7.11)$$

if the propagate signal is given by

$$P_i = A_i \oplus B_i \qquad (7.12)$$

In general, an n-bit CLA adder can be implemented efficiently using 4-bit blocks.
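The lookahead equations can be checked against ordinary addition with a short behavioral model. The sketch below takes $G_i = A_i \cdot B_i$ and $P_i = A_i + B_i$ and evaluates Eqs. (7.6)-(7.9) directly. As an indexing assumption of this sketch, bits are numbered 0 to 3 and each sum bit uses the carry entering its own position; the helper names are hypothetical.

```python
def to_bits(x, n=4):
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def cla_4bit(a_bits, b_bits, c0):
    """4-bit carry lookahead addition on little-endian bit lists."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate signals
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate signals
    # Eqs. (7.6)-(7.9): every carry depends only on p, g and c0.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries_in = [c0, c1, c2, c3]                 # carry entering each bit position
    sums = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carries_in)]
    return sums, c4


sums, c4 = cla_4bit(to_bits(9), to_bits(7), 0)
print(sums, c4)
```

No carry expression refers to another carry, so all four can be evaluated at the same time; the price is the growing number of product terms, which is what limits the block width to four bits in practice.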

Fig. 7.8(a) and 7.8(b) show the first and the fourth CMOS carry lookahead generator circuits, respectively. The generate and propagate signals are generated in parallel and are fed to all carry generators together with the input carry signal $C_0$. The carry signals are generated simultaneously. However, because the number of stacked MOS transistors increases, the delay of the fourth carry is greater than that of the first and limits the adder speed. The sum generator of the CMOS adder of Fig. 7.2 can be used in this case. The same circuit is used for all four bits. This implementation is slow because of the large number of stacked MOS transistors, which represent a high equivalent resistance in the pull-up and pull-down paths.

Another CLA circuit implementation in static CMOS design, which improves the critical carry path delay, is shown in Fig. 7.9(a). In this circuit, the number of stacked devices is reduced. The same cell of Fig. 7.9(a) can be used to generate each carry within a 4-bit block. P and G are the global propagate and generate signals, respectively. The inverter of the circuit of Fig. 7.9(a) is used to reduce the load on the fourth carry, $C_{i+4}$, when it is used to drive the next fourth CLG circuit. The output of this inverter, I, drives many blocks such as the next first-bit, the next second-bit, the next third-bit CLGs, and the next sum blocks. For the fourth bit stage, P and G are given by

$$P = P_{i+3} P_{i+2} P_{i+1} P_i \qquad (7.13)$$

$$G = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i \qquad (7.14)$$

The circuits of Fig. 7.9(b) and Fig. 7.9(c) show the implementations of the global functions P and G. Similarly, the P and G signals for the third, second and first bit stages can be constructed. For an n-bit adder, all the P and G signals are computed in parallel. Hence, the critical path is the carry path $C_i$ - $C_{i+4}$, except for the first 4-bit adder block, where the critical path can be from one of the inputs ($A_0$ or $B_0$) to the carry out $C_4$.

The sum generator is implemented using the propagate signals, $P_i$ and $\bar{P}_i$. Fig. 7.10(a) illustrates one possible circuit using a static CMOS implementation.
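The role of the global signals of Eqs. (7.13) and (7.14) can be illustrated with a short model. The sketch below assumes the block carry out is formed as $C_{i+4} = G + P \cdot C_i$, which follows from Eq. (7.9) once the terms are grouped; the function names are hypothetical and the per-bit propagate is taken as the OR of the operand bits.

```python
def group_pg(p, g):
    """Global propagate and generate of a 4-bit block (little-endian lists)."""
    big_p = p[3] & p[2] & p[1] & p[0]                         # Eq. (7.13)
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))                   # Eq. (7.14)
    return big_p, big_g


def block_carry_out(a, b, c_in):
    """Carry out of a 4-bit block, computed as G + P * C_in."""
    p = [((a >> k) & 1) | ((b >> k) & 1) for k in range(4)]
    g = [((a >> k) & 1) & ((b >> k) & 1) for k in range(4)]
    big_p, big_g = group_pg(p, g)
    # The incoming carry passes through a single AND-OR level.
    return big_g | (big_p & c_in)


print(block_carry_out(0b1111, 0b0000, 1), block_carry_out(0b1000, 0b1000, 0))
```

Since P and G depend only on the operands, they are ready before the incoming carry arrives, and the carry then crosses the block through one gate level, which is why the path from $C_i$ to $C_{i+4}$ is the critical one.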

Figure 7.10 Sum generator circuits: (a) static CMOS; (b) transmission gate version.

Another circuit, more compact and faster, is shown in Fig. 7.10(b). It uses transmission gates and needs only 6 transistors.

Many circuit techniques for high-speed carry lookahead adders have been proposed. One of them uses a pseudo-NMOS-like style [1]. The adder was used in a multiplier and achieved high-speed static operation. However, it consumes a DC current and is not suitable for low-power applications.

Other CLA implementations that improve the carry path delay are based on the transmission gate and CPL families. In this section we present the one based on CPL. The TG version is left to the reader to design. Fig. 7.11 shows the block diagram of a 32-bit PMOS latch CPL carry lookahead adder using 4-bit blocks. The carry generators (CLGs) of each 4-bit block generate the carries $C_{i+1}$ through $C_{i+4}$ in parallel from the carry in, $C_i$. The different $P_i$ and $G_i$ signals required by each 4-bit block are not shown, for clarity. When the carry $C_{i+4}$ is fed to the next 4-bit block, it uses a buffer to distribute this carry to the other CLGs and SGs. Therefore, the carry path is not significantly loaded. This results in fast operation. Fig. 7.12 shows the CPL implementation of the CLG of the fourth bit. This circuit is located in the critical path of the carry signal. It is compact and uses only NMOS pass transistors. P and G are the global propagate and generate signals, respectively. The fourth carry is generated from the carry in or the G signal through only one NMOS device. The P signal block is implemented using the AND/NAND CPL style. After every 4 CLG blocks of the critical path, the carry is buffered and restored using PMOS latch buffers. The PMOS latch restores the reduced high level to full swing to avoid any DC leakage current, as shown in Fig. 7.11. Fig. 7.13 shows the G signal block for the fourth-bit CLG as an example. The same circuit style can be used to generate the G signal for the third-bit, the second-bit, and the first-bit CLGs. In addition, the output inverter uses a PMOS latch to restore the swing. The PMOS latch circuit is incorporated only when dual-rail signals are available. For a single-ended signal, however, a feedback PMOS transistor is added to restore the full-rail high level, as in the case of the sum generator of Fig. 7.14.

7.1.3 Carry-Select Adder

Another adder implementation which improves the speed of the RCA is the Carry Select adder (CS). It provides a regular layout, as in the case of an RCA. A CS adder basically consists of blocks, each executing two additions. One assumes that the carry in is "1"; the other assumes the carry in is "0". The real carry in is computed from the previous block and selects one of the two sum outputs with a simple TG multiplexer. Fig. 7.15 shows an example of an 8-bit carry select adder implementation with 4-4 staging. The carry signal, $C_4$, selects the next four sums and the carry $C_8$. The 4-bit adder blocks usually use an RCA with a transmission gate implementation. For a 32-bit adder, the use of the normal staging 4-4-4-4-4-4-4-4 does not lead to an optimum delay. This is due to the multiplexing delay of the next carry. The optimal staging depends on the technology. For example, for the 0.8 µm CMOS device parameters presented in Chapter 3, simulations show that the optimal staging of a 32-bit CS adder using TGs is 4-4-7-9-8 at a 3.3 V power supply voltage. This implementation is regular and easy to lay out; however, it occupies a larger area than the RCA.

Figure 7.13 G block in CPL logic.
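The selection mechanism can be sketched behaviorally as follows. This is a functional model only: it says nothing about delay, and it assumes the staging list is ordered from the least significant block upward, which the text does not state explicitly. The function names are hypothetical.

```python
def block_add(a, b, c_in, width):
    """Reference adder for one block; returns (sum, carry_out)."""
    total = a + b + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, staging, c_in=0):
    """Add a and b with a carry-select organization.

    staging lists the block widths, least significant block first
    (an assumption of this sketch), e.g. [4, 4, 7, 9, 8] for 32 bits.
    """
    result, shift, carry = 0, 0, c_in
    for width in staging:
        mask = (1 << width) - 1
        a_blk, b_blk = (a >> shift) & mask, (b >> shift) & mask
        # Both candidate results are prepared without waiting for the carry.
        sum0, cout0 = block_add(a_blk, b_blk, 0, width)
        sum1, cout1 = block_add(a_blk, b_blk, 1, width)
        # The real carry in acts as the multiplexer select signal.
        blk_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= blk_sum << shift
        shift += width
    return result, carry


total, c_out = carry_select_add(0xFFFFFFFF, 1, [4, 4, 7, 9, 8])
print(hex(total), c_out)
```

Each block duplicates its adder, which is the source of the area penalty, while the carry only has to traverse one multiplexer per block.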

7.1.4 Conditional Sum Adders

In 1960, Sklansky considered the Conditional Sum Adder (CSA) to be the fastest one from a theoretical point of view [2, 3]. The concept behind this architecture is explained using the basic circuit of Fig. 7.16. This example is for a 4-bit conditional sum adder. It uses two types of cells: i) the conditional cell, and ii) the multiplexer. For each bit there is one conditional cell circuit. It computes two sums and two carries: $S^0$ and $C^0$ are calculated for a carry in of zero, and $S^1$ and $C^1$ are calculated for a carry in of one. The selection of the true sum is done with the first carry in and the previous carries. The true final carry out ($C_4$ in Fig. 7.16) is also selected.

A possible implementation ofthe conditional som adder is shown in Fig. 7.17 for the c s e of B 4-bit adder [4]. The conditional cell can be implemented vith the compact logic elements of Fig. 7.17(b). The different sign& ofthe conditional cell ate constructed using the following relations

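$$S_i^0 = A_i \overline{B_i} + \overline{A_i} B_i \qquad (7.15)$$

$$S_i^1 = A_i B_i + \overline{A_i}\,\overline{B_i} \qquad (7.16)$$

$$C_i^0 = A_i B_i \qquad (7.17)$$

$$C_i^1 = A_i + B_i \qquad (7.18)$$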


The adder mainly uses transmission gates for the multiplexers, as shown in Fig. 7.17(c). Note that the architecture uses the signals and their complements (dual-rail architecture) to avoid the use of inverters for the multiplexers. Otherwise the delay of the carry path would be penalized by the addition of inverters.

To design an n-bit (e.g., 32-bit) adder, one possible technique for fast operation is to use staged blocks of constant width or variable width. In this case, all the conditional sum blocks compute their respective double sums and double output carries in parallel. The true sum and carry out signals of each block are then selected by the carry in generated by the previous stage. The architecture at the block level uses a carry-select like technique where the carry in of each block is the true carry out of the previous block. The optimal staging can be determined from circuit simulation. The architecture has two critical delay paths within a block. One is from the carry in to the carry out, which is affected by the layout routing since the carry in of a block is distributed to all the final multiplexers. The other critical delay path is the one from the LSB input of a block to the carry out.
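As an illustration of the conditional cell and multiplexer levels, the following Python sketch models a conditional sum adder behaviorally. The function names are hypothetical, and the recursive pairing of groups is one common way to arrange the multiplexer levels; it is not claimed to match the exact wiring of Fig. 7.16.

```python
def conditional_cell(a, b):
    """Per-bit cell: ((sum bits, carry) for carry in 0, (sum bits, carry) for carry in 1)."""
    s0, c0 = a ^ b, a & b
    s1, c1 = 1 - s0, a | b
    return ([s0], c0), ([s1], c1)


def merge(low, high):
    """One multiplexer level joining two adjacent groups.

    Each group holds its result for carry in 0 (index 0) and carry in 1 (index 1).
    """
    merged = []
    for sums_low, carry_low in low:
        # The low group's carry out selects the matching result of the high group.
        sums_high, carry_high = high[carry_low]
        merged.append((sums_low + sums_high, carry_high))
    return tuple(merged)


def conditional_sum_add(a, b, width=4, carry_in=0):
    """Add two unsigned width-bit integers; returns (sum, carry out)."""
    groups = [conditional_cell((a >> i) & 1, (b >> i) & 1) for i in range(width)]
    while len(groups) > 1:
        paired = [merge(groups[i], groups[i + 1]) for i in range(0, len(groups) - 1, 2)]
        if len(groups) % 2:
            paired.append(groups[-1])  # odd group waits for the next level
        groups = paired
    sums, carry_out = groups[0][carry_in]  # the first carry in makes the final selection
    return sum(bit << i for i, bit in enumerate(sums)), carry_out
```

All conditional cells work in parallel, and the real carry in is only used at the very last selection, which is the property that makes the scheme fast.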

To reduce the power dissipation and the delay of the CSA adder, a CPL-like circuit style can be used. Fig. 7.18 shows the different circuit cells needed to implement such an adder. In Fig. 7.18(a), the conditional cell schematic is shown. The output signals have a high level voltage equal to VDD - VT. Fig. 7.18(b) shows the compact multiplexer using NMOS pass-transistors. The control signals of the multiplexers should have full-rail swing. When using these reduced swing circuits in the adder, whenever a full-rail swing is needed it can be generated with the double-rail swing restoring circuit of Fig. 7.18(c). The output inverter of the sum signal is shown in Fig. 7.18(d). The feedback PMOS transistor is needed to restore the high level when only a single rail exists. The layout of such an adder is regular. Only three cells, those of the first, second and third bits, have to be drawn. Fig. 7.19 illustrates the layout of a 4-bit block in 0.8 µm design rules.

7.1.5 Adder’s Architectures Comparison

The ripple adder has the smallest area compared to the other classes and the lowest power in many cases. So it should be limited to applications where the area and/or the power must be minimized, while the speed is not important. For fast adders, the CLA circuit is usually used; however, its power dissipation can be relatively high. The carry select adders are widely used as the optimum compromise between the high-speed operation of the CLAs and the small area of the RCAs. The conditional sum adder, with variable block staging, combined with a carry-select like style can result in the fastest adder if well optimized. The power dissipation of this adder can be comparable to or maybe less than that of the RCA because it uses a reduced internal swing and a relatively small transistor count if the CPL-like style is used. When considering all the criteria such as the power, the area and the speed, a tool can be developed to select the adder class which satisfies the specified requirements.

Figure 7.19 4-bit conditional sum adder layout.

For wide adders, having an operand size of more than 64 bits, the different architectures can still be utilized. However, to optimize the speed and power of such a wide adder, several additional algorithms can be combined. Examples of wide adders can be found in [5, 6].

7.2 PARALLEL MULTIPLIERS

High-speed parallel multipliers are becoming one of the keys in RISCs (Reduced Instruction Set Computers), DSPs (Digital Signal Processors), graphics accelerators and so on. Parallel multipliers are used in data processors as well as digital signal processors. For example, for multi-media applications fast 16 × 16 multipliers are needed. For a floating-point unit using double-precision multiplication (IEEE-754 standard), the mantissa data has 52 bits. Then 54 × 54 multipliers are required for such an operation. The two added bits are the sign bit and the guard bit. In this section we discuss several parallel multiplier algorithms which have been used in VLSI. The reader can consult references [7, 8] for more details on array multiplication algorithms.

7.2.1 Braun Multiplier

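Consider two unsigned numbers X = X_{n-1}...X_1X_0 and Y = Y_{n-1}...Y_1Y_0

$$X = \sum_{i=0}^{n-1} X_i 2^i \qquad (7.19)$$

$$Y = \sum_{j=0}^{n-1} Y_j 2^j \qquad (7.20)$$

The product P = P_{2n-1}...P_1P_0, which results from multiplying the multiplicand X by the multiplier Y, can be written in the following form

$$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (X_i Y_j)\, 2^{i+j} \qquad (7.21)$$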

Each of the partial product terms P_k = X_i Y_j is called a summand. Fig. 7.20(a) shows an example of a 4 × 4 multiplication. The summands are generated in parallel with AND gates. Fig. 7.20(b) shows Braun's array multiplier [7]. Such an n × n multiplier requires n(n - 1) adders and n^2 AND gates. The adders can be implemented efficiently by arranging the array for a regular layout. Fig. 7.21 shows a regular 4 × 4 array implementation of the multiplier of Fig. 7.20 using three different cells. The first cell contains an AND gate [Fig. 7.21(b)]. The second cell, shown in Fig. 7.21(c), contains a full-adder and an AND gate. The routing lines are also illustrated in these cells. The last cell represents a full-adder composing the final carry propagate adder. The multiplier array uses what are called carry-save adders.

The delay of such a multiplier depends on the delay of the full-adder cell and of the final adder in the last row. In the multiplier array, an adder with balanced carry and sum delays is desirable because the sum and carry signals are both on the critical path. This is different from the case of a parallel adder, where the carry path should be optimized and sped up compared to the sum path. For large arrays, the speed and power of the full-adder are very important. The CPL-like styles discussed in Chapter 4 can result in reduced power dissipation and high-speed operation. The final adder in the last row can use the techniques presented in Section 7.1.
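The carry-save principle used by the array can be illustrated with a word-level Python model. This is a sketch under assumptions: the function name is hypothetical, each row of the array is modeled as one bitwise operation on whole words, and the final carry-propagate adder is represented by an ordinary integer addition.

```python
def braun_multiply(x, y, n=4):
    """Unsigned n x n multiplication with carry-save accumulation of the summand rows."""
    # AND gates: row j holds the summands X_i * Y_j, weighted by 2^j.
    rows = [(x if (y >> j) & 1 else 0) << j for j in range(n)]
    sum_vec, carry_vec = rows[0], 0
    for row in rows[1:]:
        # Carry-save row: three vectors in, two out, no carry rippling along the row.
        new_sum = sum_vec ^ carry_vec ^ row
        new_carry = ((sum_vec & carry_vec) | (sum_vec & row) | (carry_vec & row)) << 1
        sum_vec, carry_vec = new_sum, new_carry
    # Final carry-propagate adder in the last row.
    return sum_vec + carry_vec


def braun_cell_count(n):
    """Adder and AND gate counts quoted in the text for an n x n array."""
    return {"adders": n * (n - 1), "and_gates": n * n}
```

Because every row only passes its sum and carry vectors downward, both signals lie on the critical path, which is why balanced sum and carry delays matter in the array.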


7.2.2 Baugh-Wooley Multiplier

It was noted that the Braun multiplier performs multiplication of unsigned numbers. The Baugh-Wooley technique [7] was developed to design regular direct multipliers for two's complement numbers. This direct approach does not need any two's complementing operations prior to multiplication. Let us consider two numbers X and Y with the following form

$$X = -X_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} X_i 2^i \qquad (7.22)$$

$$Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i \qquad (7.23)$$

The product P = XY is given by the following equation

$$P = XY = X_{n-1} Y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} - X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} - Y_{n-1} \sum_{i=0}^{n-2} X_i 2^{n+i-1} \qquad (7.24)$$

In order to avoid the use of subtractor cells and use only adders, the negative terms should be transformed. So

$$-X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} = X_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} \right) \qquad (7.25)$$

Using this property in Equation (7.24), the product P becomes

$$P = XY = -2^{2n-1} + \left( \overline{X_{n-1}} + \overline{Y_{n-1}} + X_{n-1} Y_{n-1} \right) 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} + \left( X_{n-1} + Y_{n-1} \right) 2^{n-1} + X_{n-1} \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} + Y_{n-1} \sum_{i=0}^{n-2} \overline{X_i}\, 2^{n+i-1} \qquad (7.26)$$
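The adder-only form of the Baugh-Wooley product can be checked numerically. The sketch below evaluates the expression obtained by replacing each negative term of the two's complement product with complemented bits plus constants, and the names `to_signed` and `baugh_wooley_product` are hypothetical helpers, not part of the original design.

```python
def to_signed(value, n):
    """Interpret an n-bit pattern as a two's complement integer."""
    return value - (1 << n) if (value >> (n - 1)) & 1 else value


def baugh_wooley_product(x, y, n):
    """Evaluate the adder-only product form for n-bit two's complement patterns x and y."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    xs, ys = xb[n - 1], yb[n - 1]                              # sign bits
    p = -(1 << (2 * n - 1))                                    # constant -2^(2n-1)
    p += ((1 - xs) + (1 - ys) + xs * ys) << (2 * n - 2)        # terms weighted 2^(2n-2)
    p += (xs + ys) << (n - 1)                                  # terms weighted 2^(n-1)
    for i in range(n - 1):
        for j in range(n - 1):
            p += (xb[i] * yb[j]) << (i + j)                    # positive summands
        # Complemented bits replace the subtracted summands.
        p += (xs * (1 - yb[i]) + ys * (1 - xb[i])) << (n + i - 1)
    return p
```

Only additions of single-bit products remain, apart from the constant -2^(2n-1), which in hardware amounts to an adjustment in the most significant column.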

Using the above relation, an n × n multiplier using only adders can be implemented. The schematic circuit diagram of a 4 × 4 two's complement multiplier based on the Baugh-Wooley algorithm is shown in Fig. 7.22. The different cells composing the array are also shown. In this scheme n(n - 1) + 3 full-adders are required. So for the case of n = 4 the array needs 15 adders. When n is relatively large, the final adder stage in the multiplier array can be implemented with the techniques discussed in Section 7.1.

Figure 7.22 (a) 4 × 4 Baugh-Wooley two's complement array (FA: Full-Adder).

This type of multiplier is suitable for applications where operands with less than 16 bits are to be processed. Applications for such a multiplier are, for example, digital filters where small operands are used (e.g., 6, 8 and 12 bits). For low-power and high-speed operation, the array uses a CPL-like adder as mentioned previously in Section 7.2.1, while a CSA scheme, combined with carry select, can be utilized in the final adder. For operands equal to or greater than 16 bits, the Baugh-Wooley scheme becomes too area-consuming and slow. Hence, techniques to reduce the size of the array, while maintaining the regularity, are required.

7.2.3 The Modified Booth Multiplier

For operands equal to or greater than 16 bits, the modified Booth algorithm [8] has been used in almost all the designed multipliers. It is based on recoding the two's complement operand (i.e., the multiplier) in order to reduce the number of partial products to be added. This makes the multiplier faster and lets it use less hardware (area). For example, the modified radix-4 Booth algorithm is based on partitioning the multiplier into overlapping groups of 3 bits, and each group is decoded to generate the correct partial product.


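Let us write the multiplier, Y, in two's complement

$$Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i \qquad (7.27)$$

It can be rewritten as follows

$$Y = \sum_{i=0}^{n/2-1} \left( Y_{2i-1} + Y_{2i} - 2 Y_{2i+1} \right) 2^{2i}, \qquad Y_{-1} = 0 \qquad (7.28)$$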

In this equation, the terms in brackets have values in the set {-2, -1, 0, +1, +2}. The recoding of Y, using the modified Booth algorithm, generates another number with the following five signed digits: -2, -1, 0, +1, +2. Each recoded digit in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 7.1.

Table 7.1 Partial product selection.

Y_{2i+1}  Y_{2i}  Y_{2i-1}    Recoded digit    Operation on X
   0        0        0             0              0 × X
   0        0        1            +1             +1 × X
   0        1        0            +1             +1 × X
   0        1        1            +2             +2 × X
   1        0        0            -2             -2 × X
   1        0        1            -1             -1 × X
   1        1        0            -1             -1 × X
   1        1        1             0              0 × X

So the bits of the multiplier are partitioned into overlapping groups of 3 bits, and each group permits generation of a certain partial product. The five possible multiples of the multiplicand are relatively easy to generate following the explanation given in Table 7.2.

The generated partial product is related to the multiplicand for each recoded digit by the relationships presented in Table 7.3. PP_i is bit i of the partial product and PP_n is the sign bit of the partial product, with PP_n = PP_{n-1} when no shifting of the partial product is performed. Note that the partial product is represented on n + 1 bits.


Table 7.2

Recoded digit    Operation on X
     0           Add 0 to the partial product
    +1           Add X to the partial product
    +2           Shift left X one position and add it to the partial product
    -1           Add two's complement of X to the partial product
    -2           Take two's complement of X and shift left one position

Table 7.3 Partial product generation relations.

Recoded digit    Operation on X                                   Added to LSB
     0           PP_i = 0 for i = 0, ..., n                            0
    +1           PP_i = X_i for i = 0, ..., n                          0
    +2           PP_i = X_{i-1} for i = 0, ..., n                      0
    -1           PP_i = \overline{X_i} for i = 0, ..., n               1
    -2           PP_i = \overline{X_{i-1}} for i = 0, ..., n           1

To clarify this algorithm, an example is presented in Fig. 7.23. Let X = 10010101 and Y = 01101001. The recoded digits of Y are

01101001 (0):  +2  -1  -2  +1

The bits are grouped into 3-bit groups overlapped by one bit, and a bit with a value of zero is added on the right side of Y as Y_{-1}. So the multiplication of two 8-bit numbers generates only 4 partial products. The number of partial products is thus reduced by half. The partial product in this example is represented on 9 bits. For a correct addition of the partial products, the signs are extended as shown in Fig. 7.23. The shape of the multiplier is then trapezoidal due to the sign extension.


[Figure 7.23: worked example with X = 10010101 (-107), Y = 01101001 (+105) and P = 1101010000011101 (-11235). Each row lists the operation (+1, -2, -1, +2), the recoded bits (010, 100, 101, 011) and the sign-extended partial product.]
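The recoding step of this example can be reproduced with a short Python sketch. The function name `booth_recode` is hypothetical; the digit rule is the bracketed term Y_{2i-1} + Y_{2i} - 2Y_{2i+1} with Y_{-1} = 0, and n is assumed to be even.

```python
def booth_recode(y, n):
    """Recode the n-bit two's complement pattern y into signed digits in {-2, ..., +2}.

    Digits are returned least significant first; digit i has weight 4**i.
    """
    digits, previous = [], 0  # previous plays the role of Y_{2i-1}, starting with Y_{-1} = 0
    for i in range(0, n, 2):
        low = (y >> i) & 1          # Y_{2i}
        high = (y >> (i + 1)) & 1   # Y_{2i+1}
        digits.append(previous + low - 2 * high)
        previous = high
    return digits
```

For Y = 01101001 the function returns `[1, -2, -1, 2]`, that is +2 -1 -2 +1 when read from the most significant digit, and the weighted sum of the digits gives back 105.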

In order to make the array rectangular, and thus more regular for VLSI implementation, the problem of sign extension must be addressed. This problem is more crucial when the operand lengths are wide, where each partial product must be sign-extended to the length of the product. In this section we will not deal with all the techniques to solve the problem of the sign extension, but we will discuss one technique which is shown in Fig. 7.24 for the example of Fig. 7.23. The basic idea is to use two extra bits in the partial product. For the first partial product, the two additional bits, PP_{n+1} and PP_{n+2}, are equal to the sign bit of the partial product

$$PP_{n+2} = PP_{n+1} = PP_n \qquad (7.29)$$

For the second partial product, if the first partial product was positive, then the two additional bits for this second partial product are given by the expression above; otherwise we have two cases

$$PP_{n+2} = PP_{n+1} = 1 \quad \text{if } PP_n = 0 \qquad (7.30)$$

and

$$\overline{PP_{n+1}} = PP_{n+2} = 1 \quad \text{if } PP_n = 1 \qquad (7.31)$$

So it is more interesting to use a third bit, F, as a flag to indicate whether there is, from the previous partial products, a negative sign bit to be propagated. F_1 is the flag generated by the first partial product for the next one. For the example of Fig. 7.24, F_0 = 0 (no PP before the first one), and F_1 = F_2 = F_3 = 1. So from the first partial product there is a sign propagation to all the others. This flag is expressed by the following Boolean equation

$$F_{j+1} = F_j + PP_{n,j} \qquad (7.32)$$

where PP_{n,j} is the sign bit of the jth partial product.

Figure 7.24 The previous example of Figure 7.23 with simplified sign extension. Legend: additional bits to be generated (sign extension); additional bits generated from the previous sign and the present sign.
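The two-extra-bit scheme and the flag can be exercised with a behavioral Python model. This is a sketch under assumptions: the helper names are hypothetical, negative digits are modeled as complemented rows with a 1 added at the LSB, and the extra-bit rules are taken as: both extra bits equal the sign when the flag is 0; both equal 1 when the flag is 1 and the sign is 0; PP_{n+1} = 0 and PP_{n+2} = 1 when both are 1.

```python
def booth_digits(y, n):
    """Signed recoded digits of the n-bit pattern y, least significant first."""
    digits, previous = [], 0
    for i in range(0, n, 2):
        low, high = (y >> i) & 1, (y >> (i + 1)) & 1
        digits.append(previous + low - 2 * high)
        previous = high
    return digits


def booth_multiply(x, y, n):
    """Multiply n-bit two's complement patterns (n even).

    Returns the 2n-bit product pattern and the list of flags F_1, F_2, ...
    """
    x_signed = x - (1 << n) if (x >> (n - 1)) & 1 else x
    mask = (1 << (n + 1)) - 1                      # partial products use n + 1 bits
    total, flag, flags = 0, 0, []
    for j, digit in enumerate(booth_digits(y, n)):
        magnitude = (abs(digit) * x_signed) & mask
        # Negative digit: complemented bits in the row, 1 added at the LSB.
        row, lsb_add = ((~magnitude) & mask, 1) if digit < 0 else (magnitude, 0)
        sign = row >> n                            # PP_n
        if flag == 0:
            pp_n1 = pp_n2 = sign                   # no earlier negative sign to propagate
        elif sign == 0:
            pp_n1 = pp_n2 = 1                      # propagate the earlier negative sign
        else:
            pp_n1, pp_n2 = 0, 1                    # earlier and present signs both negative
        row |= (pp_n1 << (n + 1)) | (pp_n2 << (n + 2))
        total += (row + lsb_add) << (2 * j)        # each row is shifted by two positions
        flag |= sign                               # F_{j+1} = F_j OR PP_n
        flags.append(flag)
    return total & ((1 << (2 * n)) - 1), flags
```

With X = 10010101 and Y = 01101001 the model returns the product pattern 1101010000011101 and flags equal to 1 after every row, in agreement with the example.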

Let us now see the implementation of the n × n modified Booth multiplier. Fig. 7.25 shows the block diagram of the multiplier. It also gives an idea of the floorplan of this subsystem. It is composed of the following blocks:

■ The multiplier array containing the partial product generators and 1-bit adders;

■ The Booth encoder and the sign extension bits (PP_{n+2}, PP_{n+1}, F). The Booth encoder generates the five signals (0, +1x, +2x, -1x, and -2x) for each group of 3 bits of Y; and

■ The final stage adder, which performs a 2n-bit addition.

For the sake of simplicity, we treat the case of a 6 × 6 multiplier. All the cells described in this example are the basic cells of any multiplier size. Fig. 7.26 shows the implementation of such a multiplier. Four types of cells are used plus the final adder. These cells are:

Figure 7.25 Block diagram of the n × n multiplier using the modified Booth algorithm.

■ The ADD cell, which generates 0 or 1 [see Table 7.3]. The schematic circuit of this cell is shown in Fig. 7.27(a). Two implementations are possible: one using pass-transistors controlled by the five signals defining the recoded digit code, and the other one an AND2 gate of the two signals -1x and -2x.

■ The partial product MUX (PP-MUX), which generates the partial product. Fig. 7.27(b) shows the schematic of the PP-MUX using CPL type logic. The feedback PMOS, P_f, in this figure or in the one of Fig. 7.27(a) is used to restore the high level to eliminate any DC current. This implementation permits fast and low-power operation.

■ The PP-FA (PP-HA) cells. They merge the PP-MUX circuit and a full-adder (half-adder), respectively. A CPL-like adder can be utilized for fast and low-power operation.

■ The Booth Encoder (BE). It generates the five control signals 0x, +1x, +2x, -1x, and -2x from a group of three bits of the multiplier Y. Fig. 7.28 shows the schematic of the different circuits involved in the BE block. The additional circuits of the two bits PP_{n+1,j} and PP_{n+2,j} of the jth PP are also illustrated. F_j and F_{j+1} are the previous and the next flags, respectively. PP_{n,j} is the sign bit of the jth PP. Note that F_0 is 0.

Figure 7.27 Booth multiplier cells: (a) ADD; (b) PP-MUX; (c) PP-FA (or PP-HA). (*) not connected for PP-HA.

The Booth multiplier exhibits a lot of unnecessary glitches. The main reason for the glitches is the race condition between the multiplicand and the multiplier due to the Booth encoder. The power dissipation associated with the glitches can be an important portion of the total power and hence it needs to be reduced by some techniques of signal synchronization.

7.2.4 Wallace Tree

By applying the Booth algorithm, the number of partial products is halved. However for large multipliers, 32-bit and over, the number of partial products is 16 or more. In this case, the performance of the modified Booth algorithm is limited. One technique to improve the performance of these multipliers is to adopt the Wallace tree using 4-2 compressors. A 4-2 compressor accepts 4 numbers and a carry in, and sums them to produce 2 numbers and a carry out (really it is a 5-3 compressor). Fig. 7.29(a) shows an example of such a tree on the partial products of an unsigned 8 × 8 multiplier. Eight partial products are produced. Using 4-2 compressors, two levels of addition (stages) are needed. The final two summands are added using a fast 16-bit adder. Some zeros are added to the array. This example shows that the bits which are not used in the first stage (level) jump to the next one to be combined with the ones produced by the compressors. Fig. 7.29(b) shows the architecture of the 8 × 8 multiplier. For the first stage of the tree, two blocks, A and B, are required. The blocks A and B of compressors group the first and the last four partial products, respectively.

Figure 7.28 Logic schematic of the Booth encoder including the sign extension logic.

Fig. 7.30 shows how the 4-2 compressor can be implemented by 2 full-adders or by custom static CMOS logic [9]. Four bits, I_1, ..., I_4, are added to produce two outputs, S and C. Hence, 4 bits of the partial products are compressed to produce two new partial product bits. The compressor is implemented, using a carry-save adder construction, by two cascaded full-adders as shown in Fig. 7.30(b). Notice that the carry out is never generated by the carry in. Fig. 7.31 shows the 4-2 compressor circuit using a compact structure of multiplexers [10]. This structure is faster than the static complementary version. Fig. 7.32 shows the interconnection of the 4-2 compressors for block A of the example of Fig. 7.29. C_out is connected to the next carry-in C_in. Since these signals are independent, the carry is not propagated through the row.

[Fig. 7.33 labels: PPG: generator of partial products; 1st stage: blocks A to D; 2nd stage: blocks E and F; 3rd stage: block G.]
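The two-full-adder construction of the 4-2 compressor can be written as a small Python sketch. The function names are hypothetical, and the ordering of the inputs on the two full-adders is an assumption; the arithmetic identity it satisfies does not depend on that ordering.

```python
def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(i1, i2, i3, i4, carry_in):
    """4-2 compressor built from two cascaded full-adders.

    Returns (S, C, carry_out) with i1 + i2 + i3 + i4 + carry_in == S + 2 * (C + carry_out).
    """
    # The first full-adder never sees carry_in, so carry_out is independent of it.
    first_sum, carry_out = full_adder(i1, i2, i3)
    s, c = full_adder(first_sum, i4, carry_in)
    return s, c, carry_out
```

Since `carry_out` depends only on the first three inputs, chaining compressors along a row through `carry_out` and `carry_in` does not create a ripple path.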

To further enhance the Wallace tree multiplier, the modified Booth algorithm can be used to reduce the number of partial products by half in a carry-save adder array. One example of such a combined construction is the architecture of the 32 × 32 multiplier shown in Fig. 7.33. It consists of four functions: the Booth encoder, the partial product generator, the compressor blocks, and the final 64-bit adder. The Wallace tree is constructed with 3 stages (levels). The first stage has 4 blocks (A to D), with each block summing up 4 partial products among 16. The second stage sums up the 8 newly generated partial products from the first stage. Hence, two blocks are needed, E and F. Finally, block G of the third stage of the tree generates two other new partial products for the final adder. This architecture exhibits some irregularities in the layout since it has a complicated interconnection scheme. Hence, the interconnection wires affect the speed and power dissipation of the multiplier.
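The number of compressor stages quoted for the 8 × 8 and 32 × 32 examples can be recovered with a small counting sketch. It assumes, hypothetically, that every stage groups the rows by four into 4-2 compressor blocks and pads incomplete groups with zero rows; the function name is illustrative.

```python
def compressor_stages(partial_products):
    """Count the 4-2 compressor stages needed to reduce the rows to two summands."""
    stages = 0
    while partial_products > 2:
        groups = -(-partial_products // 4)   # ceiling division: blocks of four rows
        partial_products = 2 * groups        # each block outputs two new rows
        stages += 1
    return stages
```

Eight partial products give two stages, as in the unsigned 8 × 8 example, and the 16 partial products left after Booth recoding of a 32-bit multiplier give three stages.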

7.2.5 Multiplier’s Comparison

The basic array multipliers, like the Baugh-Wooley scheme, consume low power and have relatively good performance. However, their use can be limited to processing operands with less than 16 bits (e.g., 8 bits). For operands of 16 bits and over, the modified Booth algorithm reduces the number of partial products by half. Therefore, the delay of the multiplier is reduced. Its power dissipation is comparable to that of the Baugh-Wooley multiplier due to the circuitry overhead in the Booth algorithm. However, circuit techniques can enhance this multiplier to have low-power characteristics. The fastest multipliers adopt the Wallace tree with modified Booth encoding. A Wallace tree would lead, in general, to larger power dissipation and area, due to the interconnect wires. Hence, it is not recommended for low-power applications. Dynamic multipliers are not discussed in this section since they introduce problems of control and timing. Hence extra area and power dissipation are added to the design.

7.3 DATAPATH

A VLSI chip can be partitioned into two parts: the data path (or execution unit) and the control unit. Data paths are often used in digital signal processors, microprocessors and application specific ICs (ASICs). The data path consists of a combination of an Arithmetic Logic Unit (ALU), a shifter, a register file, I/O ports, a multiplier, an adder, a magnitude comparator, data busses, etc. It performs many operations on the data in the register file, to which the results are sent back. The data busses are the communication means for the data transfer between the different units of the data path, such as the ALU, the shifter and the register file. These busses have a heavy load (a few pF). In CMOS design, dynamic techniques are used to allow fast operation. One way to reduce the power dissipation due to the precharging transistors is to use static busses [11].


Figure 7.34 Arithmetic Logic Unit (ALU).

The control unit delivers the instructions to the data path. These instructions determine the operations that the data path has to perform. The control unit can be implemented using random logic, micro-ROM (Read Only Memory), PLA (Programmable Logic Array) or a combination of these three implementations. Other macrocells, such as a TLB (Translation Lookaside Buffer), cache memory, etc., can be added to the data path and the control unit. In this section, several blocks of a data path are discussed.

7.3.1 Arithmetic Logic Unit

The ALU is an important part of a data path. It is a macrocell which executes arithmetic operations such as multiplication, addition, subtraction and negation, and logic operations such as AND, OR, XOR, comparison, etc. It performs the operation on two operands stored in latch A and latch B and puts the result in latch C as shown in Fig. 7.34. The operation code (op code) selects the operation of the ALU to be executed. The flags indicate the status of the ALU, such as overflow, zero-result, carry generation, etc. The input latches A and B are, in general, connected to two parallel data busses. Sometimes, the input latches are merged with MUXs to select among many input sources to the ALU. The result latch is connected to one of the busses or to a third one. The ALU described in this section is static for low-power applications.

The maximum clock frequency of a VLSI circuit may be limited by the ALU operations, especially the arithmetic ones. The critical delay of an arithmetic operation is due mainly to the carry propagation along the width of the ALU. There are many types of ALU, depending on the number of operations to be performed. Fig. 7.35 shows the block diagram of a 1-bit slice of an ALU. It has exactly the same structure as the adder, except that the P and G blocks are programmable. Fig. 7.35(a) shows the P block with 4 control signals (OP1 ... OP4). The feedback PMOS transistor, P_f, permits restoration of the high level from VDD - VTn to VDD. Hence the DC current of the first inverter, due to the reduced high level, is eliminated. Fig. 7.35(b) shows the G block with 4 op code signals (OP5 ... OP8). The P and G blocks use the pass-transistor style. The techniques discussed in Section 7.1 can be applied to achieve low-power and fast operation. The carry and result (sum) blocks are shown in Fig. 7.35(c) and (d), respectively. Table 7.4 summarizes some of the functions that can be implemented with these blocks. Several other operations can be realized with this ALU.

Table 7.4 Examples of ALU operations (w. means with).

Operation      LSB C_in   P function           G function          Op code (OP1 ... OP8)   Result
Add w. carry      0       P = A xor B          G = A or B          10011101                A + B
Subtraction       1       P = A xor (not B)    G = A or (not B)    10011101                A + (not B) + 1
Bit-wise AND      0       P = A and B          G = 0               01110000                A and B
Bit-wise OR       0       P = A or B           G = 0               00010000                A or B
Not A             0       P = not A            G = 0               10100000                not A

To implement an n-bit ALU, all the techniques discussed for carry speed-up in adders can be applied. Drivers are needed to distribute the op code signals for an n-bit ALU. For low-power design, the busses which communicate with the ALU are in general not precharged, unlike in many data paths.

7.3.2 Absolute Value Calculator

The Absolute Value Calculator (AVC) is, in general, used in the data paths of video processors to compare the data of two pictures. Fig. 7.36 shows the architecture of the AVC. This parallel circuit performs two subtractions simultaneously, A - B and B - A. Using the most significant bit of these two operations, the MUX circuit selects the positive one. Then the output gives the absolute value |A - B|.

Figure 7.36 Absolute value calculator.

To reduce the power dissipation and the area of an n-bit AVC, the logic of the two n-bit adders required can be reduced by merging the common functions of both operations. Also, the techniques described in Section 7.1 for n-bit addition should be used.
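The parallel-subtraction idea can be sketched behaviorally in Python. The operands are assumed here to be unsigned n-bit values and each subtractor is assumed to be n + 1 bits wide so that its most significant bit acts as a sign; the source does not state these details, and the function name is hypothetical.

```python
def absolute_difference(a, b, n=8):
    """Return |a - b| for unsigned n-bit operands using two parallel subtractions."""
    width = n + 1                      # one extra bit so the MSB acts as a sign bit
    mask = (1 << width) - 1
    # Each subtraction is an addition of the complemented operand with carry in = 1.
    a_minus_b = (a + (~b & mask) + 1) & mask
    b_minus_a = (b + (~a & mask) + 1) & mask
    negative = (a_minus_b >> n) & 1    # MSB of A - B drives the output multiplexer
    return b_minus_a if negative else a_minus_b
```

Both differences are always computed; only the final multiplexer depends on the sign, which is what allows the two subtractions to run simultaneously.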


7.3.3 Comparator

A magnitude comparator is used in many DSP applications. It permits comparison of the magnitudes of two numbers A and B by indicating whether A < B, A = B, or A > B. Fig. 7.37(a) shows an example of a two-bit comparator which requires two types of cells, C1 and C2. The cell C1 is constructed by the circuit of Fig. 7.37(b). Table 7.5 shows the truth table for this cell.

Table 7.5 Truth table for cell C1.

Let us explain how a 2-bit comparator works. When A1 < B1, then C1 = D1 = 0, and A1A0 < B1B0 regardless of the magnitudes of the lower bits. Similarly, for A1 > B1, then C1 = 1, D1 = 0, and A1A0 > B1B0 regardless of the magnitudes of the lower bits. When A1 = B1, the relative magnitude of the two 2-bit numbers depends on A0 and B0. In this situation, there are three different cases:

1. A1A0 < B1B0 for A0 < B0 (i.e., C0 = D0 = 0). Then we can set E0 = F0 = 0.

2. A1A0 = B1B0 for A0 = B0 (i.e., C0 = 0, D0 = 1). Then we can set E0 = 0 and F0 = 1.

3. A1A0 > B1B0 for A0 > B0 (i.e., C0 = 1, D0 = 0). Then we can set E0 = 1 and F0 = 0.

These relations can easily be used to implement the second cell, C2, of the comparator as shown in Fig. 7.37(c).

This technique, for the two-bit comparator, can be extended to an n-bit comparator. It can be constructed by using a parallel tree of the cells C1 and C2. A 4-bit comparator could, for example, be constructed with two 2-bit comparators connected in parallel, and at the output the 4 generated E and F signals are fed to an added C2 cell. In this architecture, the glitching is reduced by equalizing the delay paths of each cell.
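The cell behavior and the parallel tree can be modeled in Python as follows. The encoding (0, 0) for "less than", (0, 1) for "equal" and (1, 0) for "greater than" follows the cases listed for the two-bit comparator; the function names are hypothetical and the tree pairing is one possible arrangement, not necessarily the one drawn in Fig. 7.37.

```python
def cell_c1(a, b):
    """1-bit compare: returns (C, D) = (0, 0) if a < b, (0, 1) if a == b, (1, 0) if a > b."""
    return a & (1 - b), 1 - (a ^ b)


def cell_c2(high, low):
    """Combine the codes of an upper and a lower group into one (E, F) code."""
    _, d_high = high
    # If the upper bits are equal (D = 1), the lower group decides; otherwise the upper one does.
    return low if d_high else high


def compare(a, b, n):
    """Compare two unsigned n-bit numbers with a tree of C1 and C2 cells."""
    codes = [cell_c1((a >> i) & 1, (b >> i) & 1) for i in range(n)]  # LSB first
    while len(codes) > 1:
        merged = [cell_c2(codes[i + 1], codes[i]) for i in range(0, len(codes) - 1, 2)]
        if len(codes) % 2:
            merged.append(codes[-1])  # unpaired most significant group
        codes = merged
    return codes[0]
```

For 4-bit operands the tree has two levels of C2 cells, which corresponds to two 2-bit comparators in parallel followed by one added C2 cell.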

7.3.4 Shifter

Another macrocell of the data path is the shifter. It performs shift or rotate operations on the data. If the number of bits to be shifted is arbitrary, then a barrel shifter is used [12, 13]. Fig. 7.38 shows the CMOS implementation of a 4-bit barrel shifter. NMOS transistors are used as switches in the array. The input bus (D0 - D6) can be connected to the output bus (R0 - R3) via the pass transistors. The control signals S0 - S3 select the pass transistors to be switched. These signals determine the amount of shift and they are generated by a 2-bit decoder. Since the outputs have a high level of VDD - VT, due to the pass transistors, the output buffer uses a feedback PMOS device, P_f, to restore the high level to VDD. This eliminates any DC current in the first inverter of the buffer.

Table 7.6 shows the values of the output bus as a function of the input data. Depending on the values of D<6:0>, several shift operations can be performed. For example, if D<6:4> = "0" and D<3:0> is the 4-bit input data, then a logical shift is realized. However, if D<6:4> = "1" and D<3:0> is the input data, then an arithmetic shift operation is performed.

Table 7.6 Output bus as a function of the shifting amount.

The barrel shifter is not a critical unit for the delay. Low-power operation is achieved by using a static implementation. This shifter can also be implemented with transmission gates, in which case the feedback PMOS devices are not required. However, for low power, the use of an NMOS array is more efficient. The feedback PMOS should be of minimum size.
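A behavioral Python sketch of the pass-transistor array is given below. Since the contents of Table 7.6 are not reproduced here, the mapping R_i = D_{i+k} when select line S_k is active is an assumption, and the function name is hypothetical.

```python
def barrel_shift(d, shift, width=4):
    """Model a barrel shifter with input bus D<2*width-2:0> and output bus R<width-1:0>.

    Assumed mapping: when select line S_shift is active, R_i is connected to D_{i+shift}.
    """
    if not 0 <= shift < width:
        raise ValueError("shift amount out of range")
    select = [int(k == shift) for k in range(width)]   # decoder output: one-hot S0 ... S3
    out = 0
    for i in range(width):
        for k in range(width):
            if select[k]:                              # pass transistor at (i, k) is ON
                out |= ((d >> (i + k)) & 1) << i
    return out
```

Under this assumption, driving the upper input bits D<6:4> with zeros gives a logical shift of D<3:0>, while copying D<2:0> onto D<6:4> gives a rotation, for example `barrel_shift(0b0111011, 1)` for the data 1011.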

7.3.5 Register File

A register file is a set of registers which store data. It consists of a small array of static memory cells. Register files are used by microprocessors and DSPs, and they permit multiple read and write ports [14, 15, 16, 17]. A typical array is 32 registers of 32 bits. For example, an ALU needs two pieces of data from the register file. The array then has a dual-read single-write architecture.

Fig. 7.39 shows the schematic of the single-ended memory cell with 2 read ports and 1 write port (2R-1W). The read ports are the read bit-lines BL_R1 and BL_R2. The memory cell, composed of two cross-coupled inverters I1 and I2, is addressed by two read word-line signals, WL_R1 and WL_R2. The NMOS transistor N1 is controlled by the Write Enable (WE) signal. N1 is connected serially to the write access transistor N2. The transistor N2 is controlled by the write word-line (WL_W) signal. The transistor N1 isolates the stored data from the write bit-line (BL_W). To write the data in the storage node A from the write bit-line, the inverters I1 and I2 should be sized carefully. The β ratio (defined in Chapter 4) of the inverter I1 should be larger than 1 (e.g., 5) to set the threshold voltage of I1 to a low level. This is due to the fact that N1 and N2 transfer only a weak high level (VDD - VTn). Moreover, to ensure a correct write operation, the feedback inverter I2 should be weak so that the access transistors N1 and N2 can change the state of node A. For example, the NMOS and PMOS of I2 should be of minimum size, except that the length of the NMOS is twice the minimum. Also, the access transistors should have a higher β compared to the transistors of I2. For a given technology, the sizes should be determined by circuit simulation for a correct write operation. The inverter I3 is a buffer for the storage node.

Figure 7.39 (2R-1W) register file cell.

A pair of three-port memory cells is shown in Fig. 7.40. This structure has a shared access transistor N2 and a shared write bit-line, BL_W. To read and write the memory cell, the simplified schematic of Fig. 7.41 is used. This schematic uses the column multiplexing scheme. For low power, the register file uses a static design and avoids the use of the conventional sense amplifier for bit-line sensing, since the sense amplifier consumes DC power. For a three-port register file, two read row decoders and one write row decoder are required. Also, Write Enable (WE) and column addresses are needed to produce the column write enable for writing the data to the specified storage node. For fast operation, AND gates can be used with a maximum of 5-bit inputs.

During the read operation, if for example Na is asserted, then the data is put on the bit-line, BL.Rl. The bit-line is selected through the pass-transistor N,. The data is then senred by the inverter I , in Fig. 7.41. During this period, the

Figure 7.40 A pair of three-port memory cells (2R-1W).

read enable signal, RE, is asserted, N2 is OFF and only the feedback PMOS Pf is activated when a one (VDD - VTn) is on the data-line. In this situation, the feedback PMOS charges up the data-line to VDD. Also, the DC current that can be generated due to the reduced high level on the data-line is completely eliminated. The β ratio of the inverter I1 should be higher than one (e.g., 5) to achieve a symmetrical read access time for a zero and a one. When RE = 0, the data-lines are isolated from the bit-lines and the NMOS transistor N2 is ON. Therefore, the latch formed by the pair of inverters I1 and I2 latches the old data.

The operation of such a register file is fully static and does not dissipate any static power in any mode of operation. Furthermore, the read and write operations are asynchronous. This type of register file is suitable for low-power applications.

7.4 REGULAR STRUCTURES

In this section we examine the design of large regular structures such as Programmable Logic Arrays (PLAs), Read Only Memories (ROMs) and Content Addressable Memories (CAMs). The ROMs and PLAs are not only used to implement controllers in a regular manner but can also be applied to signal processing. RAMs are treated separately in Chapter 6. These large structures

Figure 7.41 Simplified read/write schematic of the register file with column multiplexing (signals: WE, Write Enable; RE, Read Enable).

are usually dynamic circuits for fast operation. These dynamic circuits can be shut down with a power management unit for power savings. If, for example, the clock is turned OFF, all dynamic circuits go into a precharge mode with all PMOS precharge devices ON.

7.4.1 Programmable Logic Array

Logic functions such as those used in the control units of VLSI processors, or in finite-state machines, are hard to implement in random logic. One way of implementing these functions in a regular structure is the use of a Programmable Logic Array (PLA) [18, 19].

PLAs have a regular architecture divided mainly into two planes, as shown in Fig. 7.42. Each plane performs a specific function, such as AND or OR. CMOS PLAs can be implemented in both static and dynamic styles. The style is chosen depending on the timing strategy in the chip. Other factors such as speed, power dissipation, and the allowed area play an important role in the PLA design style. A CMOS PLA example, using a pseudo-NMOS like style, is shown in Fig. 7.43. The output OR functions are realized with NOR gates. From Fig. 7.43(a), we have

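$$P_1 = \overline{A + B + C} = \bar{A}\,\bar{B}\,\bar{C} \qquad (7.33)$$

$$P_2 = \overline{A + C} = \bar{A}\,\bar{C} \qquad (7.34)$$

$$P_3 = \overline{B + C} = \bar{B}\,\bar{C} \qquad (7.35)$$

$$P_4 = \overline{A + \bar{C}} = \bar{A}\,C \qquad (7.36)$$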

The buffers are used when the load on the bit-line is large. They consist, in general, of two inverter stages. The OR plane is in principle similar to the AND plane [Fig. 7.43(b)]. From Fig. 7.43(b), we have

$$X = P_1 + P_2 + P_3 \qquad (7.37)$$

$$Y = P_1 + P_4 \qquad (7.38)$$

For this pseudo-NMOS PLA, the NOR-NOR logic gate style is used. This example shows that the PLA organization is useful for implementing Sum Of Products (SOP) functions. Hence any SOP function can be realized by programming the array with the AND and OR cells. Any type of latch or register can be used at the input and output. This design style of PLAs has a small area and

Figure 7.42 AND-OR PLA architecture.

it is simple to implement. However, it is not suitable for low-power applications due to the high DC power dissipation, particularly when the PLA is large. Moreover, it has a speed problem.
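The NOR-NOR organization described above can be sketched behaviorally. The following is a minimal sketch, assuming a hypothetical three-input, two-output programming that is not taken from Fig. 7.43; the function names `nor`, `pla_nor_nor` and `sop_reference` are illustrative only. It shows that a NOR of literals in the first plane acts as an AND of their complements, and that a NOR followed by an output inverter in the second plane acts as an OR.

```python
from itertools import product

def nor(*bits):
    """NOR of any number of 0/1 values."""
    return 0 if any(bits) else 1

def pla_nor_nor(a, b, c):
    """Evaluate a 3-input, 2-output NOR-NOR PLA (hypothetical programming)."""
    na, nc = 1 - a, 1 - c          # complemented inputs from the input buffers
    # AND plane: a NOR of literals equals the AND of their complements.
    p1 = nor(a, b, c)              # a'.b'.c'
    p2 = nor(na, c)                # a.c'
    p3 = nor(b, nc)                # b'.c
    # OR plane: NOR of the product lines, then an output inverter.
    x = 1 - nor(p1, p2)            # x = p1 + p2
    y = 1 - nor(p2, p3)            # y = p2 + p3
    return x, y

def sop_reference(a, b, c):
    """The same two functions written directly as sums of products."""
    x = ((not a) and (not b) and (not c)) or (a and not c)
    y = (a and not c) or ((not b) and c)
    return int(bool(x)), int(bool(y))

# The PLA model and the SOP expressions agree on all eight input vectors.
for bits in product((0, 1), repeat=3):
    assert pla_nor_nor(*bits) == sop_reference(*bits)
```

The model captures only the logic function; the DC power and speed issues of the pseudo-NMOS style are circuit-level effects that a Boolean model does not represent.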

In dynamic CMOS style, the circuit shown in Fig. 7.44 can be used. It is a self-timed PLA, where the AND and OR planes are both realized using a precharged NOR configuration. In this structure, only a single clock phase is needed. When the clock, clk, is high, the bit-lines are precharged in both planes. The NMOS transistors NA and NO are OFF, guaranteeing that there is no path to ground. Tracking lines in both planes are used to generate a delayed clock for the OR plane. When the clock is low, the precharge PMOS transistors in the AND plane turn OFF, NA turns ON and the products are evaluated. The tracking lines ensure that NO turns ON only when the inputs to the OR plane are stable. Otherwise the outputs can be spuriously discharged. This PLA is fast, but it has a lot of wasted dynamic power. The wasted power has several sources, such as:

Figure 7.43 Pseudo-NMOS CMOS PLA: (a) AND plane; (b) OR plane.

Figure 7.44 Self-timed dynamic PLA using NOR-NOR style.

- The virtual ground lines are charged and discharged every cycle. The total capacitance of the virtual ground is important, particularly for large PLAs, because for layout compactness the ground lines are in diffusion. This capacitance can be reduced by using a metal level in a multi-metal technology;

- The number of inverters forming the buffers is important, and during the evaluation several of them switch; and

- The switching activity of the dynamic NOR implementation is high [see Chapter 4].

Consider now the PLA shown in Fig. 7.45 with an AND-NOR structure. The OR plane is still the same as in the PLA of Fig. 7.44. However, the AND plane is considerably simplified because:

- The virtual ground lines disappear; and

- The number of inverters for buffering is reduced by half.

Figure 7.45 Self-timed dynamic PLA using AND-NOR style.

The switching activity of the NAND implementation is also lower than that of the NOR implementation, resulting in lower power in the AND plane. One problem associated with this structure is that the use of NAND may result in a large discharge time.

Another dynamic PLA combines the pseudo-NMOS and dynamic logic design styles [19]. Fig. 7.46 shows an example of such a structure. The AND plane uses a predischarged pseudo-NMOS NOR style, while the OR plane uses a conventional dynamic precharged style. During the precharge phase, the clock signal is high and the bit-lines in the AND plane are predischarged to ground. In the OR plane, the bit-lines are precharged to VDD. The input signals to the OR plane are low. During the evaluation phase (clk = 0), the PMOS loads in the AND plane are ON, and the plane behaves as pseudo-NMOS logic. In this case, the PMOS device should be sized correctly to ensure safe operation when the output stays at a low level. The product terms are evaluated and then the outputs. During this evaluation phase, the PLA dissipates static power, mainly in the AND plane. The total power is therefore increased by this DC component.



This PLA does not need the self-timing technique used previously. It was also shown that this PLA has a fast operation [19].

When implementing smaller controllers, it is sometimes more attractive to use random logic. The implementation consists of two or more levels of logic gates using a standard cell library. It is much less regular than a PLA structure and it can have lower power dissipation.

7.4.2 Read Only Memory

Read Only Memory (ROM) is used in many applications. In DSPs, for example, it can be used as a lookup table to store coefficients. It is also often used in VLSI processors as a microcode controller. In this case, the ROM contains the microprogram instructions. A typical micro-ROM size is 2k words of 64 bits. The read-out cycle of the ROM limits the speed of the processor. Conceptually, the structure of a ROM is quite similar to that of a PLA.

Fig. 7.47 shows a simple ROM circuit architecture using NOR logic design. The state of the memory array is retained even if the ROM is not powered.

Figure 7.48 Layout of a ROM memory cell (bit-line in metal 1; word-line in metal 2 and polysilicon; diffusion).

The ROM can be implemented in both styles: static and dynamic. In static style, the pseudo-NMOS logic, similar to that of the static PLA, can be used. Fig. 7.49 shows an example of a small ROM using the pseudo-NMOS circuit style. The conditioning circuits use PMOS devices with their gates grounded, and the sense amplifier circuit is simply an inverter. The column decoder is also shown. One of the column decoder outputs selects one of the two bit-lines. Node A is initially at VDD. If the selected bit-line is discharged, then node A is discharged and the output is pulled up to VDD. The pseudo-NMOS ROM is easy to design and does not need careful design; however, the power dissipation may be significant due to the DC current. For a relatively large ROM, like the one used in microcontrollers, the power dissipation can be significantly reduced using the low-power techniques of SRAMs*. They include pulse mode operation using address transition detection, small swing sensing, etc.

*These techniques are discussed in more detail in Chapter 6.

Figure 7.49 Pseudo-NMOS ROM circuitry.

A dynamic version of the ROM is shown in Fig. 7.50. During the precharge phase, clk = 1 and the bit-lines are precharged to VDD - VT, where VT is subject to the body effect. Node A is also precharged by the PMOS transistor Pp. The select lines Sel1 and Sel2 are controlled by a column decoder. All the word-lines are predischarged to ground. During evaluation, clk = 0 and if the bit-line is discharged to ground, node A is also discharged. Then the output of the inverter I is pulled up. If node A is not discharged, the feedback PMOS transistor Pf maintains the high level at VDD. Since the swing on the high-load bit-line is reduced, the power dissipation on this line is reduced by a factor VDD/(VDD - VT).
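The reduction factor quoted above follows from the usual expression for the energy drawn from the supply when a capacitance is charged through a limited swing. The short derivation below is a sketch under stated assumptions: $C_{BL}$ (bit-line capacitance) and $f$ (switching rate) are symbols introduced here for illustration, and the numerical values are hypothetical.

```latex
% Energy drawn from V_DD per charge of the bit-line capacitance C_BL
E_{\text{full}} = C_{BL}\, V_{DD}\, V_{DD}, \qquad
E_{\text{reduced}} = C_{BL}\, V_{DD}\, (V_{DD} - V_T)

% Ratio of the two powers at the same switching rate f
\frac{P_{\text{full}}}{P_{\text{reduced}}}
  = \frac{C_{BL}\, V_{DD}^{2}\, f}{C_{BL}\, V_{DD}\,(V_{DD} - V_T)\, f}
  = \frac{V_{DD}}{V_{DD} - V_T}

% Hypothetical example: V_DD = 3.3 V, V_T = 1.0 V (including body effect)
\frac{3.3}{3.3 - 1.0} \approx 1.43
```

With these assumed values the bit-line power would drop by roughly 30%; the benefit grows as the supply voltage approaches the threshold voltage.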

7.4.3 Content Addressable Memory

A Content Addressable Memory (CAM) is an important macrocell of the Translation Lookaside Buffer (TLB) [20] and cache memory [21] circuits of computer systems. The TLB permits the translation of the virtual address of a CPU to the physical address, and the cache memory maps the physical address to the memory data.


Figure 7.50 Dynamic ROM circuitry.

A CAM stores tags which can be compared against an input address word (A0 ... Am-1), as shown in Fig. 7.51(a). A match detection signal is sent by the CAM if the values stored in the CAM array match the input address word. A CMOS implementation of the CAM cell is illustrated in Fig. 7.51(b). It can be read and written just like an ordinary memory cell. The read/write and decoder circuits are similar to those of a RAM.

A tag word is formed by identical cells which are repeated in a horizontal array. The write lines are used to write data in the array. The comparison process is described as follows. During the precharge phase, the bit-lines are predischarged low. All the write lines are low. The Match Line (ML) is precharged high. During the evaluation phase, suppose that a "1" is stored at node A. Assume that the CBL line is held high and the complementary CBL line is held low. In this case, the transistors N3 and N1 are OFF, hence the ML line remains high, indicating a match at this bit location. Assume now that CBL is driven low and the complementary CBL high. The transistor N3 is OFF, but N1 and N2 are ON. Then the ML line is discharged, indicating a mismatch at this bit location.

For an array of n tags, there are n match lines (ML(0) ... ML(n)). Each match line is common to m cells. If there is a mismatch in any bit of the tag word, the match line is discharged. If all the m bits match, the common match line remains high. To detect the match signal in any of the match lines a dynamic

Figure 7.51 (a) CAM array; (b) CMOS CAM cell.


NOR circuit is used, as shown in Fig. 7.52. When the clock is low, the NOR gate is precharged along with the match lines. The inputs to the NOR gate are predischarged to ground. When the clk signal is high (evaluation phase), one of the match lines, ML(i), stays high and the others are discharged to ground. When the match lines are stable, the eval signal is asserted with clk using self-timing (similar to the PLA case). This keeps the dynamic NOR gate from falsely discharging. The inputs to the NOR gate must not go high until the data is stable. If one of the match lines stays high, then the NOR gate is discharged and the output match signal goes high.
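The match-line behavior described above can be summarized with a small behavioral model. This is a minimal sketch, not the circuit itself: the function name `cam_lookup` and the example tags are hypothetical, and timing (precharge, self-timed eval) is abstracted away. Each match line starts high, any mismatching cell discharges it, and the global match signal is asserted if at least one line stays high.

```python
def cam_lookup(tags, address):
    """Behavioral CAM search over a list of equal-length bit tuples.

    Returns (match, index, match_lines): the global match flag, the index
    of the first matching tag (or None), and the per-tag match lines.
    """
    match_lines = []
    for tag in tags:
        ml = 1                              # match line precharged high
        for stored, compare in zip(tag, address):
            if stored != compare:
                ml = 0                      # one mismatching cell discharges the line
        match_lines.append(ml)
    match = 1 if any(match_lines) else 0    # dynamic NOR plus output inversion
    index = match_lines.index(1) if match else None
    return match, index, match_lines

# Hypothetical 4-bit tags
tags = [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 1, 1)]
hit = cam_lookup(tags, (0, 0, 1, 0))
miss = cam_lookup(tags, (0, 0, 0, 0))
```

In a TLB, the index of the matching line would select the stored physical address; that selection step is outside this sketch.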

7.5 PHASE LOCKED LOOPS

Phase Locked Loops (PLLs) have many applications in digital and analog systems. In digital systems, on-chip PLLs are needed for the following reasons:

- To reduce clock skew due to clock distribution. As systems continue to demand higher clock frequencies, clock skew associated with input buffers and clock distribution becomes a significant design problem, as shown in Fig. 7.53(a). The internal clock drives the output register, which in turn delivers the data to the output pad (with a buffer). The skew between the external and internal clocks is due to the clock tree. The output data is significantly delayed compared to the external clock, and one main contribution is the clock skew. In Fig. 7.53(b), the internal clock is deskewed via the use of a PLL. The PLL should reduce this skew over a wide range of process, temperature and voltage variations;

- To synchronize data between chips, as shown in Fig. 7.54. The PLL solves the problem of clock skew from chip to chip. An example of such an application is discussed in [22]; and

- To generate internal clocks with higher frequencies than the external clock (system clock).

There are other applications of PLLs, such as clock recovery in serial data communications, which are not discussed in this section. Several theoretical references on PLLs can be found in [23, 24, 25]. This section provides an introduction to the PLL. The CMOS circuit design of the PLL, for low-power applications, is then discussed.

7.5.1 Charge-Pumped PLL

One interesting configuration of the PLL is the charge-pumped loop shown in Fig. 7.55. It is a PLL-based frequency multiplier which consists of a Phase Frequency Detector (PFD), a Charge Pump (CP)*, a Loop Filter (LF), a Voltage Controlled Oscillator (VCO), and a programmable frequency divider. The feedback of the internal clock is compared to the external clock for phase and frequency error. The outputs of the phase/frequency detector are two digital signals called U (for Up) and D (for Down). The charge pump and loop filter convert these digital signals into an analog signal (control) suitable for the VCO. The VCO generates an oscillation frequency that is a function of the control signal level. If the PLL generates multiples of the external clock frequency, then a frequency divider is inserted between the generated clock and the phase detector.

A simplified diagram of the charge pump and loop filter is shown in Fig. 7.56. It consists of two switchable current sources driving an impedance (LF). The pulses generated by the PFD block are used to switch the charge pump, to charge or discharge the impedance. The loop filter filters these pulses and has an analog output signal to control the VCO.

*The charge pump for the PLL should not be confused with the one used to generate different voltages.


Figure 7.53 PLL clock generation for deskewing: (a) a chip without PLL; (b) a chip with PLL.

Figure 7.55 Block diagram of the PLL.

7.5.2 PLL Circuit Design

This section presents the design of the PLL components. Fig. 7.57 shows the logic diagram of the PFD circuit. It uses mainly static-CMOS NAND gates, which results in good performance and low power dissipation. The operation of this circuit, using the state diagram of Fig. 7.57(c), is as follows. The circuit has three states: 1) UP, where the up signal U is asserted when the external clock clk_ext falls, 2) DOWN, where the down signal D is asserted when the internal clock clk falls, and 3) NOP, where the detector does not change the output control signals. In this last state both U and D signals are at zero level. The state changes whenever clk or clk_ext falls. In no case are U and D both activated.

Consider that clk and clk_ext have the same frequency but the falling edges of clk_ext (clk) lead the falling edges of clk (clk_ext), respectively. Then, U (D) is asserted with a certain duty cycle, while D (U) is never asserted. In this case, the PFD is characterized as a phase detector.

Consider now the case where clk_ext has a higher frequency than clk. U is asserted most of the time, since there are more falling edges of the clk_ext signal than of clk. A similar situation occurs when clk has a higher frequency than clk_ext, and D is then asserted most of the time. In this case, the PFD is characterized as a frequency detector.
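The two operating regimes of the PFD can be illustrated with an event-driven model. This is a sketch under an assumption: the transitions used here are those of the standard three-state phase/frequency detector (a clk_ext falling edge moves DOWN to NOP, NOP to UP, and keeps UP; a clk falling edge does the opposite), which is consistent with the description above but is not copied from Fig. 7.57(c). The function name `pfd_run` and the edge times are hypothetical.

```python
def pfd_run(ext_edges, int_edges, t_end):
    """Return (time with U asserted, time with D asserted) up to t_end.

    ext_edges / int_edges are falling-edge times of clk_ext and clk.
    """
    events = sorted([(t, "ext") for t in ext_edges] +
                    [(t, "int") for t in int_edges])
    on_ext = {"DOWN": "NOP", "NOP": "UP", "UP": "UP"}
    on_int = {"UP": "NOP", "NOP": "DOWN", "DOWN": "DOWN"}
    state, t_prev = "NOP", 0.0
    time_in = {"UP": 0.0, "DOWN": 0.0, "NOP": 0.0}
    for t, source in events:
        if t > t_end:
            break
        time_in[state] += t - t_prev        # accumulate time in the old state
        t_prev = t
        state = on_ext[state] if source == "ext" else on_int[state]
    time_in[state] += t_end - t_prev
    return time_in["UP"], time_in["DOWN"]

# Phase detection: same period (10), clk_ext leads clk by 2 time units.
phase_case = pfd_run([10.0 * k for k in range(10)],
                     [10.0 * k + 2.0 for k in range(10)], 100.0)

# Frequency detection: clk_ext period 4, clk period 10.
freq_case = pfd_run([4.0 * k for k in range(25)],
                    [10.0 * k + 5.0 for k in range(10)], 100.0)
```

In the first case U is high for a fraction of each cycle equal to the phase lead and D is never asserted; in the second case U is high most of the time, which is the behavior attributed above to the frequency-detector regime.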

The U and D signals, generated by the PFD, are connected to the charge pump circuit of Fig. 7.58(a). When the signal U (D) is asserted, the pull-up PMOS (pull-down NMOS) transistor charges (discharges) the output, respectively. Another variation of the charge pump circuit is shown in Fig. 7.58(b). Two transistors are added as current sources biased by a current mirror circuit. In this situation, the output current of the charge pump can be adjusted through the control of the current mirror.

The monolithic implementation of the filter of Fig. 7.56 is shown in Fig. 7.59. The two capacitors C1 and C2 are on the order of tens of pF and are made with the NMOS transistors NC1 and NC2. The resistor is made with a transmission gate in the closed state. It can also be implemented with an N-well implant available in the CMOS process. The capacitor C2 is added in parallel to the simple RC (R-C1) low-pass filter to form a second-order filter. In this case, the stability of the system is maintained even with the process variation of these on-chip components. Note that these capacitors can occupy a large portion of the PLL.

The charge pump and filter generate a control voltage for the VCO. One important parameter of the VCO is the VCO gain. When considering the frequency versus control voltage characteristic, the VCO gain is the slope of this characteristic. A linear characteristic is, in general, desirable. In general the VCO is implemented using a ring oscillator, as shown in Fig. 7.60. A series connection of delay inverter cells forms a tapped delay line which oscillates with a frequency determined by the delay time of the cell and the odd number of stages. The delay of the cell is controlled by a current which in turn is controlled by the control voltage Vc. Vc modulates the ON resistance of the pull-down N1 and, through the current mirror, the pull-up P1. All the devices of the VCO should be oriented in the same direction and have redundant contacts to reduce the jitter due to process variations. In the VCO of Fig. 7.60, maximum frequency is achieved at maximum control voltage. Typical values of the VCO gain at low power supply voltage can range from 10 MHz/V to 100 MHz/V depending on the number of stages and technology. Note that the bandwidth of the VCO presented previously is limited.
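The dependence of the oscillation frequency on the cell delay and the number of stages can be made concrete with the standard first-order ring oscillator relation f = 1/(2 N td), where N is the odd number of stages and td is the average delay per stage. This relation is a textbook approximation brought in here for illustration, and the 200 ps stage delay below is a hypothetical value, not a figure from the text.

```python
def ring_osc_freq(n_stages, stage_delay_s):
    """First-order ring oscillator frequency, f = 1 / (2 * N * td).

    n_stages must be odd (and at least 3) for the loop to oscillate.
    """
    if n_stages < 3 or n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages >= 3")
    return 1.0 / (2 * n_stages * stage_delay_s)

# Hypothetical 200 ps stage delay: fewer stages give a higher frequency.
f_short = ring_osc_freq(7, 200e-12)     # about 357 MHz
f_long = ring_osc_freq(61, 200e-12)     # about 41 MHz
```

In a current-controlled cell the control voltage changes td, so the VCO gain is the sensitivity of this frequency to Vc through the stage delay.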

The VCO of Fig. 7.61 has an excellent bandwidth characteristic, where a wide range of frequencies can be generated [26]. It is used for video signal processors and covers a wide range of applications. The frequency range can change by about one order of magnitude, from 50 MHz to 350 MHz. In Fig. 7.61, by turning ON and OFF 8 CMOS TGs with control signals, the number of ring oscillator stages can be selected among eight values, from 7 to 61. Each stage of the ring oscillator combines an inverter in parallel with a current-controlled inverter. The inverter increases the frequency of oscillation of the VCO, whereas the current-controlled inverter permits tuning of the frequency of the VCO.

The generated clock frequency can be N times the external clock frequency (reference frequency). This clock then feeds the clock driver and tree. Since the PLL discussed here is intended to be integrated on-chip, it is sensitive to the noise generated on the power lines (called power-supply-induced clock jitter). If the power supply changes by 100 mV the skew or phase error will

Figure 7.60 VCO using current controlled CMOS ring oscillator.

Figure 7.61 VCO with selectable characteristics.

be important before the PLL has time (tens of clock cycles) to correct this error [27]. One way to reduce the effect of this problem is to dedicate an analog power supply pin to the VCO and the charge pump. At the circuit level, a new VCO delay cell was proposed by Young [27] to reduce the phase error.

Another VCO alternative is shown in Fig. 7.62. It is similar to the Voltage-Controlled Delay Line (VCDL) [28]. The control voltage, Vc, is used to vary the amount of the effective load seen by each inverter output. The frequency versus control voltage characteristic of this VCO has a negative slope. The minimum frequency of oscillation is then limited by the maximum VDD. Therefore, the minimum frequency is increased with reduced VDD. A positive slope is, in general, desirable so that the minimum frequency is not set by VDD.

The frequency divider can be implemented using toggle flip-flops. Fig. 7.63 shows an example of a divider with division ratios of 1, 1/2, 1/4, and 1/8. The PLL discussed so far is not completely digital. Only the PFD, the charge pump and the frequency divider are digital, while the LF and VCO are analog and operate as continuous-time systems.
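The toggle flip-flop divider can be checked with a short waveform-level sketch. The function names `divide_clock` and `count_rising` are hypothetical, and the model assumes ideal flip-flops that toggle on each falling edge of their input; each additional stage halves the frequency, giving the ratios 1, 1/2, 1/4 and 1/8.

```python
def divide_clock(clk, stages):
    """Pass a sampled 0/1 clock waveform through `stages` toggle flip-flops."""
    out = list(clk)
    for _ in range(stages):
        q, prev, nxt = 0, out[0], []
        for level in out:
            if prev == 1 and level == 0:    # toggle on each falling edge
                q ^= 1
            nxt.append(q)
            prev = level
        out = nxt
    return out

def count_rising(waveform):
    """Number of 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for a, b in zip(waveform, waveform[1:]) if a == 0 and b == 1)

clk = [0, 1] * 32                            # 32 input clock cycles
edges = [count_rising(divide_clock(clk, k)) for k in range(4)]
```

Here `edges` lists the number of rising edges seen at the divider taps for 0, 1, 2 and 3 stages, which halves at each tap.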

7.5.3 Low-Power Design

In sleep mode, the on-chip PLL may be controlled for low-frequency operation, or it may be disabled to reduce its power dissipation to that of the leakage currents.

Figure 7.64 A VCO controlled by enable signal for low-power mode.

As an example, one way to disable the PLL is to shut down the VCO and disable the external clock. Fig. 7.64 shows the same VCO as Fig. 7.62 but with one inverter transformed into a two-input NAND gate. One of the inputs is controlled by the Enable signal, which shuts down the PLL when it is low. The NAND gate can be used for any of the VCOs presented previously. The enable signal can also be used to disable any current source used in the PLL to eliminate any DC current. A typical power dissipation of a PLL, at 3.3 V, is in the range of tens of mW depending on the frequency.

7.6 CHAPTER SUMMARY

This chapter has presented the design of several subsystems used in VLSI chips. Many circuit alternatives are discussed which trade area, speed and power. The reader can construct these options and compare their performance in terms of power, delay and area. The power dissipation issue is stressed more. Also, several building blocks of VLSI chips using advanced circuit techniques have been investigated. These include

- High-speed addition.

- Multiplication techniques.

- PLL and clock deskewing technique.

REFERENCES

[1] J. Mori, et al., "A 10-ns 54 x 54-b Parallel Structured Full Array Multiplier with 0.5-µm CMOS Technology," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 600-606, April 1991.

[2] J. Sklansky, "An Evaluation of Several Two-Summand Binary Adders," IRE Transactions on Electronic Computers, vol. EC-9, pp. 213-226, June 1960.

[3] J. Sklansky, "Conditional-Sum Addition Logic," IRE Transactions on Electronic Computers, vol. EC-9, pp. 226-231, June 1960.

[4] I. S. Abu-Khater, R. H. Yan, A. Bellaouar, and M. I. Elmasry, "A 1-V Low-Power High-Performance 32-b Conditional Sum Adder," IEEE Symposium on Low-Power Electronics, Tech. Dig., San Diego, pp. 66-67, October 1994.

[5] T. Sato, et al., "An 8.5-ns 112-b Transmission Gate Adder with a Conflict-Free Bypass Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 657-659, April 1992.

[6] K. Ueda, H. Suzuki, K. Suda, Y. Tsujihashi, and H. Shinohara, "A 64-bit Adder by Pass Transistor BiCMOS Circuit," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.2.1-12.2.4, May 1993.

[7] K. Hwang, "Computer Arithmetic: Principles, Architecture, and Design," John Wiley and Sons, 1979.

[8] J. J. F. Cavanagh, "Computer Science Series: Digital Computer Arithmetic," McGraw-Hill Book Co., 1984.

[9] M. Nagamatsu, S. Tanaka, J. Mori, T. Noguchi, and K. Hatanaka, "A 15-ns 32x32-bit CMOS Multiplier with an Improved Parallel Structure," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 10.3.1-10.3.4, May 1989.

[10] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4-ns CMOS 54x54-b Multiplier using Pass-Transistor Multiplexer," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 599-602, May 1994.

[11] R. Bechade, et al., "A 32b 66MHz Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 208-209, February 1994.

[12] C. A. Mead and L. A. Conway, "Introduction to VLSI Systems," Addison-Wesley, 1980.

[13] R. W. Sherburne, et al., "Data path Design for RISC," Proc. Conf. Advanced Research in VLSI, pp. 53-62, 1982.

[14] R. W. Sherburne, et al., "A 32-bit NMOS Microprocessor with a Large Register File," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 682-689, October 1984.

[15] K. J. O'Connor, "The Twin-Port Memory Cell," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 712-720, October 1987.

[16] R. D. Jolly, "A 9-ns, 1.4 Gigabyte/s, 17-Ported CMOS Register File," IEEE Journal of Solid-State Circuits, vol. 26, no. 10, pp. 1407-1412, October 1991.

[17] H. Shinohara, et al., "A Flexible Multiport RAM Compiler for Data Path," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, pp. 343-349, March 1991.

[18] A. R. Linz, "A Low-Power PLA for a Signal Processor," IEEE Journal of Solid-State Circuits, vol. 26, no. 2, pp. 107-115, February 1991.

[19] G. M. Blair, "PLA Design for Single-Clock CMOS," IEEE Journal of Solid-State Circuits, vol. 27, no. 8, pp. 1211-1213, August 1992.

[20] H. Kadota, et al., "A 32-bit Microprocessor with On-Chip Cache and TLB," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 800-807, October 1987.

[21] A. J. Smith, "Cache Memories," Computing Surveys, vol. 14, pp. 473-530, September 1982.

[22] L. Ashby, "ASIC Clock Distribution using a Phase Locked Loop (PLL)," IEEE International ASIC Conference and Exhibit, Tech. Dig., pp. P1.6.1-P1.6.3, September 1991.

[23] F. M. Gardner, "Phaselock Techniques," John Wiley and Sons, 1979.

[24] F. M. Gardner, "Charge-Pump Phase-Locked Loops," IEEE Transactions on Communications, vol. COM-28, no. 11, pp. 1849-1858, November 1980.

[25] R. E. Best, "Phase-Locked Loops," McGraw-Hill, 1984.

[26] J. Goto, et al., "A Programmable Clock Generator with 50 to 350 MHz Lock Range for Video Signal Processors," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 4.4.1-4.4.4, May 1993.

[27] I. A. Young, J. K. Greason, and K. L. Wong, "A PLL Clock Generator with 5 to 110 MHz of Lock Range for Microprocessors," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1599-1607, November 1992.

[28] M. G. Johnson and E. L. Hudson, "A Variable Delay Line PLL for CPU-Coprocessor Synchronization," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1218-1223, October 1988.

8 LOW-POWER

VLSI DESIGN METHODOLOGY

This chapter presents Low-Power (LP) design methodologies at several abstraction levels, such as the physical, logical, architectural, and algorithmic levels. All the power reduction techniques discussed are related to the dynamic power dissipation. It is shown that LP techniques at the high level (algorithmic and architectural) of the design lead to power savings of several orders of magnitude. Many examples are included to give the reader a quantitative picture of LP issues. Several LP techniques, particularly at the circuit level, have already been discussed in Chapters 4, 6, and 7, including those related to static power considerations; they are not reconsidered in this chapter. The power estimation techniques at the circuit, logical, architectural and behavioral levels are overviewed. Power analysis at high level allows an early prediction and optimization of the power of a system. The LP concepts such as switching activity, glitching, etc., discussed in Chapter 4 are used throughout this chapter.

8.1 LP PHYSICAL DESIGN

There are several techniques to reduce the power at the physical design (layout) level. Some of these issues have been discussed in Chapter 4 for full-custom and semi-custom designs. In this section we present two approaches for low-power physical design.


8.1.1 Floorplanning

Floorplanning of a circuit is the first step in VLSI layout design. It permits the allocation of space on a chip for a given set of modules. A module can be rigid, e.g., the module is in the library and its dimensions and power dissipation are known, or flexible, e.g., it has not been designed and has a list of parameters such as different shapes and power consumptions for feasible implementations. A floorplanner for low-power design should choose a suitable implementation for each module such that the total power and area of the chip are optimized [1].

8.1.2 Placement and Routing

The placement and routing of a VLSI circuit is performed on standard cells, gate arrays, functional blocks, etc. All the different modules are already laid out and well characterized in the library. Traditionally, placement refers to the process of placing modules to minimize area and delay. Placement for low power uses the switching activity-capacitance products as the function to be minimized, in contrast with delay minimization, where the wire capacitance has to be minimized. After placement, routing permits connection of the modules with wires. Wires with high switching activity should be kept short and routed on the layer with the lower parasitic capacitance. A CAD tool for placement has already been developed [4].

8.2 LP GATE-LEVEL DESIGN

The low-power design methodology should also be applied to logic design. To achieve this goal, power is traded for speed and area. In this section, we discuss a number of techniques to reduce the switching activity and internal capacitances during the technology-independent and technology-dependent phases of logic design.

8.2.1 Logic Minimization and Technology Mapping

The area and power optimization of logic structures (both combinational and sequential) have matured considerably. The power optimization task benefits from these techniques. The objective of logic minimization is to reduce the boolean function. For low-power design, the signal switching activity is minimized by restructuring a logic circuit during the technology-independent phase [3]. It is assumed that at the higher level of abstraction, decisions regarding the power supply voltage and the clock frequency have already been made. The power minimization is constrained by the delay; however, the area may increase. During this phase of logic minimization, the function to be minimized is

$$\sum_i P_i\,(1 - P_i)\,C_i \qquad (8.1)$$

where P_i is the probability of node i being a "1", (1 - P_i) is the probability that node i is a "0", and C_i is the capacitance of this node. For more information on this model see Section 8.5.2.1. To minimize the above equation, one has to first evaluate the current value of P_i and then change it by making P_i close to 0 or close to 1. Also in [3], a zero-delay approximation is assumed. This implies that the glitching power is neglected.
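The cost function described above is easy to evaluate numerically. The sketch below is illustrative only: the function name `switching_cost` and the node values (probabilities and capacitances in fF) are hypothetical. It shows why pushing a node probability away from 0.5 toward 0 or 1 lowers the cost.

```python
def switching_cost(nodes):
    """Sum of P_i * (1 - P_i) * C_i over (probability, capacitance) pairs."""
    return sum(p * (1.0 - p) * c for p, c in nodes)

# Hypothetical two-node circuit, capacitances in fF.
before = switching_cost([(0.5, 10.0), (0.1, 20.0)])   # 2.5 + 1.8
after = switching_cost([(0.9, 10.0), (0.1, 20.0)])    # 0.9 + 1.8
```

The term P(1 - P) peaks at P = 0.5, so a node that is equally likely to be "0" or "1" contributes the most for a given capacitance.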

To minimize the switching activity, some techniques that can be used are:

- Use don't-cares to minimize the probability $P_i$ of a function. Indeed, the signal probability of a gate can be changed by altering the ON-set or the OFF-set by adding points from the don't-care set.

- Collapse nodes that are not on the critical path. The intermediate signal lines are then implemented as a single node. The delay may increase; however, this does not affect the overall performance of the circuit.
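The effect of the don't-care technique can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of [3]: it assumes equally likely input patterns, uses the activity measure P(1 - P), and exhaustively tries every way of moving don't-care points into the ON-set. The 3-input function, its minterm sets, and the function names are all hypothetical.

```python
from itertools import product

def signal_probability(on_set, n_inputs):
    """Probability that the output is 1 when all input patterns are equally likely."""
    return len(on_set) / 2 ** n_inputs

def best_dont_care_assignment(on_set, dc_set, n_inputs):
    """Try every assignment of don't-care points to the ON-set and return
    (activity, probability, added_points) with the lowest P * (1 - P)."""
    best = None
    dc_points = sorted(dc_set)
    for choice in product([False, True], repeat=len(dc_points)):
        added = {pt for pt, take in zip(dc_points, choice) if take}
        p = signal_probability(set(on_set) | added, n_inputs)
        activity = p * (1 - p)
        if best is None or activity < best[0]:
            best = (activity, p, added)
    return best

# Hypothetical 3-input function given as minterm numbers.
on_set = {3, 4, 5, 6, 7}
dc_set = {0, 1}
activity, p, added = best_dont_care_assignment(on_set, dc_set, 3)
print(f"add {sorted(added)} to the ON-set: P = {p}, P(1 - P) = {activity}")
```

The exhaustive loop is only practical for a handful of don't-care points; it is meant to show why pushing the signal probability toward 0 or 1 lowers the activity term.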

Power dissipation can be improved by as much as 60%, at the expense of an 8% area increase [3] and with no delay degradation. A more typical power reduction would be in the range of 10-20% [4].

The technology mapping step for low-power refers to the process of transforming a logic function into a technology-dependent (e.g., CMOS) circuit with minimized power consumption. This technology-dependent step uses a target technology. The first step in technology mapping is to decompose each logic function into two-input gates. The objective of this decomposition is to minimize the total power dissipation by reducing the total switching activity. Fig. 8.1 shows an example of a four-input AND gate decomposed into two different implementations. The probabilities of the inputs being at logical "1" are also shown in Fig. 8.1. Primary inputs are assumed to be uncorrelated. The switching activity at each internal node is also shown in Fig. 8.1. The switching activity at the output of a two-input (i, j) AND gate is given by

$$\alpha = (1 - P_i P_j)\, P_i P_j \qquad (8.2)$$
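To see how the decomposition changes the total activity, Eq. (8.2) can be applied to each two-input AND gate of a chain and of a balanced tree. The sketch below is illustrative only: the input probabilities are made up (the values of Fig. 8.1 are not reproduced here), so the totals will not match the 0.888 and 1.056 quoted for that figure, and the function names are hypothetical.

```python
def and_activity(p_i, p_j):
    """Switching activity at the output of a 2-input AND gate, Eq. (8.2)."""
    p_out = p_i * p_j
    return (1.0 - p_out) * p_out

def chain_activity(probs):
    """Total activity of the chain decomposition ((a.b).c).d, inputs uncorrelated."""
    total = 0.0
    p = probs[0]
    for q in probs[1:]:
        total += and_activity(p, q)
        p *= q          # signal probability of the intermediate node
    return total

def tree_activity(a, b, c, d):
    """Total activity of the balanced decomposition (a.b).(c.d)."""
    return and_activity(a, b) + and_activity(c, d) + and_activity(a * b, c * d)

# Assumed input probabilities, chosen only for illustration.
a, b, c, d = 0.3, 0.4, 0.7, 0.5
chain_total = chain_activity([a, b, c, d])
tree_total = tree_activity(a, b, c, d)
print(f"chain: {chain_total:.6f}  tree: {tree_total:.6f}")
```

With these particular probabilities the chain that combines the low-probability inputs first has the lower total; a different input ordering or probability set can reverse the ranking, which is why the decomposition is treated as an optimization problem.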

[Fig. 8.1: Decomposition of a four-input AND gate into Implementation 1 and Implementation 2.]

We also assume that the gate delays are zero, so as to ignore the power due to the glitching phenomenon. The total switching activities for implementations 1 and 2 are 0.888 and 1.056, respectively. Therefore, implementation 1 is better than implementation 2. This problem of decomposition was addressed by [5, 6]. In [5], the power dissipation associated with glitching is neglected, while in [6] it is not. Taking into account the power dissipation of glitches is very important, as is discussed in Section 8.2.2.

The concept of technology mapping in logic optimization is an important step for standard cell and gate array (or sea of gates) circuit design. All the cells in the library are characterized in terms of area and speed. Another parameter to be added for low-power design is the characterization of the internal power of the gate and its output parasitic capacitance. Hence the process of technology mapping is to search, using a target library, for the best possible implementation under constraints such as power, area and delay.

In this section we do not consider the algorithms for technology mapping. The reader can consult references [5, 7]. We illustrate this concept of technology mapping by the following example. Fig. 8.2 shows an example of implementing the logic circuit of Fig. 8.2(a) in two ways. The first implementation [Fig. 8.2(a)] is for a minimal area design using an OAI (OR-AND-INVERT) gate. The second implementation [Fig. 8.2(b)] is for a minimal power design where the high switching node N of Fig. 8.2(a) is hidden using a more complex gate.

Thus the process of technology mapping is first to decompose the logic function such that the total switching activity is minimized, and then to hide any high switching activity node within complex gates so that the capacitance of that node is minimized. However, making a gate too complex can trade delay for low power. The typical reduction in power dissipation is on the order of 20% without any degradation in performance but at the expense of a small area penalty.

The quality of the targeted cell library can considerably impact the results of mapping [8]. For example, the availability of cells with different drive strengths and double-rail outputs (signal and its complement) gives more flexibility for logic optimization. A good library can result in a 20-50% reduction of the power dissipation.

8.2.2 Spurious Transitions Reduction

Due to the finite delays of logic gates, signal races in static logic designs can result in dynamic hazards. Hence, a node can have transitions in one clock cycle before stabilizing to the correct logic level. These unnecessary switching transitions (glitches) can account for on the order of 20-40% of the power dissipation [9, 10, 11].

To reduce this power, the first approach is to balance the path delays by changing the logic structure (e.g., tree) as explained in Section 4.5.5. Another technique is to balance the delay of the paths by sizing down the gates in the fast paths [12]. However, this approach can increase the delay of the circuit. Also, insertion of buffers (delay elements) in the fast paths can balance the delay. However, the added buffers increase the power dissipation.


Another technique employs self-timing to reduce the logic depth and hence the glitching power [9, 11]. The self-timed circuit should save more power than it introduces. An adder is one example of a circuit that exhibits spurious transitions. The sum signals can have false transitions before they are stable. If the load capacitances on the outputs are relatively large, then the power due to the glitches can be significant.

A conventional self-timed method for an adder is shown in Fig. 8.3. A Transition Detector (TD) similar to the one discussed for SRAMs in Chapter 6 is used. For each set of inputs ($A_i$ and $B_i$) there is one transition detector. If A and B are both n bits wide, then n TDs are required for the parallel adder. For any transition at the inputs, the TD generates a pulse for the self-timed function. This self-timed circuit delays the pulse by an amount equal to the critical path of the adder. The delayed pulse then feeds the clock of a D-Flip-Flop (DFF) or a gated circuit for the sum function. Consequently, the output sums are not switched until they are evaluated. The additional circuitry in the conventional approach can consume more power than it may save.

[Fig. 8.3: Conventional self-timed parallel adder, with the self-timed function and the gated parallel-adder function.]

Another approach based on self-timing to reduce the spurious transitions was proposed by [11]. Fig. 8.4 shows a parallel adder using simple self-timed circuitry. When input signals are written into the registers A and B, a single register bit is used to generate an "Input Valid" signal to the self-timed function. For an n-bit parallel adder, only a one-bit register is required, as shown in Fig. 8.4. The self-timed function is implemented using a series of inverters with dual-rail. Two enable signals, E and $\bar{E}$, are generated by the self-timed circuit. They feed the gated sum XOR gates. Also the enable signal E controls the one-bit register to disable the input valid signal. This technique has resulted in a 25% power reduction [11].

[Fig. 8.4: Self-timed parallel adder with gated output XOR gates.]

8.2.3 Precomputation-Based Power Reduction

Consider the original circuit of Fig. 8.5(a). R1 and R2 are two registers at the input and output of a combinational logic block. The idea of precomputation is to pre-evaluate the output values of the circuit one clock cycle before they are required, to disable a part of the input register R1, and thus to reduce the internal switching activity in the succeeding clock cycle [13]. Fig. 8.5(b) illustrates a simplified architecture of the precomputing concept. This technique can be applied to several circuits such as Finite State Machines (FSMs), pipeline circuits, etc.

To illustrate this technique, consider the example of an n-bit comparator that compares two n-bit numbers A and B and computes the function F that indicates that A > B. Fig. 8.6 shows the application of the precomputing technique to the comparator. If the most significant bits, $A_{n-1}$ and $B_{n-1}$, are different, then F can be computed from the 1-bit MSB comparator and the registers R2 and R3 are disabled. Therefore, the (n-1)-bit comparator is shut down. If the inputs have a uniform probability equal to 0.5, the enable signal has a probability of 0.5 of being at the logical level "1" or "0". Therefore, for a relatively large n the power saving can be quite significant even if we include the power due to the additional circuitry.
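The comparator example can be modelled functionally in Python. The sketch below only captures the decision logic (when the lower-order registers would be enabled), not the one-cycle-early timing of the hardware; the function names and the random-input experiment are assumptions made for illustration.

```python
import random

def precomputed_greater_than(a, b, n):
    """Return (F, lower_bits_enabled) for F = (a > b) on n-bit operands.
    When the MSBs differ, F is decided by the 1-bit MSB comparator alone and
    the registers feeding the (n-1)-bit comparator stay disabled."""
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    if msb_a != msb_b:
        return msb_a > msb_b, False
    mask = (1 << (n - 1)) - 1          # keep the n-1 low-order bits
    return (a & mask) > (b & mask), True

def enable_rate(n, trials=20000, seed=1):
    """Fraction of uniformly random input pairs that enable the lower bits."""
    rng = random.Random(seed)
    enabled = 0
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        f, en = precomputed_greater_than(a, b, n)
        assert f == (a > b)            # precomputation must not change F
        enabled += en
    return enabled / trials

print(f"lower comparator enabled in about {enable_rate(16):.0%} of cycles")
```

For uniformly distributed operands the large comparator is active in roughly half of the cycles, which is the basis of the saving argued above.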

This technique of precomputation can be synthesized for logic optimization. The selection of the subset of input signals for which the output is precomputed is critical for power savings. Otherwise, the additional circuitry can dissipate a relatively significant amount of power. Note that this added logic slightly increases the area of the circuit and may also increase the clock cycle. The precomputation technique can be applied to a multiple-output function. However, if the logic has a large number of outputs, then it may be worthwhile to selectively apply the precomputation technique to a small number of complex outputs. This selective partitioning will add a duplication of combinational logic and registers, and this may offset the power savings.


8.3 LP ARCHITECTURE-LEVEL DESIGN

In this section, architecture also means Register Transfer Level (RTL). The architecture uses a set of primitives such as adders, multipliers, ROMs, register files, etc. RTL synthesis programs are used to convert an RTL description to a set of registers and combinational logic. The impact of low-power techniques at the architecture level can be more significant than at the gate level, as will be shown in this section. The techniques discussed to reduce the power dissipation are: parallelism, pipelining, distributed processing and power management.

8.3.1 Parallelism

Parallelism can be used to reduce the power dissipation at the expense of area while maintaining the same throughput [10]. To illustrate this, the quantitative example of Fig. 8.7 is considered. In Fig. 8.7(a), a register supplies two 16-bit operands to a 16 x 16 multiplier. We refer to this architecture as the reference one and we use the ref notation for frequency, power supply voltage, power dissipation, etc. This register is clocked at a maximal frequency $f_{ref}$ = 50 MHz. We assume that the worst case delay of the multiplication is 20 ns at $V_{ref}$ = 3.3 V power supply voltage. It is clear that we cannot reduce $V_{ref}$ to reduce the



throughput as in the case of Fig. 8.7(a). The input registers are clocked at $f_{ref}/2$ = 25 MHz. Therefore, the power supply can be reduced to achieve a worst case delay of 40 ns. With the same 16 x 16 multiplier, the power supply can be reduced from $V_{ref}$ = 3.3 V to 1.8 V ($V_{ref}/1.83$). This value can be determined from the simulation of the two architectures. The effective capacitance would increase by a factor of 2 due to the duplication. However, due to the extra routing to both multipliers, the effective capacitance is around 2.2 $C_{ref}$. Thus, the estimated power dissipation is given by

$$P_{par} = C_{par} V_{par}^2 f_{par} = (2.2\, C_{ref}) \left(\frac{V_{ref}}{1.83}\right)^2 \frac{f_{ref}}{2}$$

Hence $P_{par} = 0.33\, P_{ref}$

Thus, the power dissipation is significantly reduced.
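The arithmetic behind the 0.33 figure can be checked with a few lines of Python. This is a minimal sketch assuming that dynamic power scales as C x V^2 x f; the function name and its parameters are illustrative and not taken from the text.

```python
def relative_power(cap_factor, voltage_divisor, freq_factor):
    """P / P_ref for P = C * V^2 * f, with C = cap_factor * C_ref,
    V = V_ref / voltage_divisor and f = freq_factor * f_ref."""
    return cap_factor * freq_factor / voltage_divisor ** 2

# Two multipliers in parallel: 2.2x capacitance, 3.3 V -> 1.8 V, half the clock rate.
parallel = relative_power(cap_factor=2.2, voltage_divisor=1.83, freq_factor=0.5)
print(f"P_par / P_ref = {parallel:.2f}")
```

The same helper can be reused to explore other capacitance overheads or voltage scalings, keeping in mind that the achievable voltage for a given delay has to come from circuit simulation, as noted above.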

The key to this power saving is the duplication of the hardware in a parallel configuration. In general, N processors can be parallelized by duplication, with each processor running with a slower clock (by a factor of N). In this case, for the same throughput, the power dissipation can be reduced as N increases. Therefore, the power supply voltage ($V_{DD}$) can be aggressively reduced to meet a worst case delay almost equal to the reference delay multiplied by N, as in the example above where the allowed delay grows from 20 ns to 40 ns for N = 2. To exploit this power supply reduction, the threshold voltage ($V_T$) should also be reduced to limit the degradation of the delay as $V_{DD}$ approaches $V_T$. Keep in mind that the scaling of $V_T$ is also limited by static current considerations.

When the number N is relatively large, parallelism can lead to several problems. A highly parallelized configuration can result in a drastic increase of the occupied area. In addition, there is routing overhead to distribute the input and output signals. This also increases the area and the wiring capacitance. Therefore, the power dissipation also tends to increase, which then limits the utility of parallelism.

8.3.2 Pipelining

Pipelining is another architecture that can reduce the power dissipation [10]. As an example, let us consider the case of the 16 x 16 multiplier presented in Section 8.3.1. The 50 MHz multiplier is broken into two equal parts as shown in Fig. 8.8. A set of pipeline registers (or latches) is inserted, resulting in a 2-stage pipelined version of the multiplier. Architectures with more pipeline stages can be realized. Since the hardware between the pipeline stages is reduced, the reference voltage $V_{ref}$ = 3.3 V can be reduced to 1.8 V ($V_{ref}/1.83$) while maintaining a worst case delay of 20 ns (50 MHz). The estimated power dissipation is given by

$$P_{pipe} = C_{pipe} \left(\frac{V_{ref}}{1.83}\right)^2 f_{ref}$$

where $C_{pipe}$ is the switching capacitance of the pipelined multiplier.

The switching capacitance has increased slightly due to the pipelining. Thus, the power dissipation is reduced by a factor of almost 2.8, which is approximately the same as in the parallel case. Also the area increase is relatively low and the area penalty is due only to the additional registers (or latches). As the pipeline registers reduce the logic depth, the power dissipation due to glitches is also reduced.
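As a rough consistency check, and assuming that dynamic power scales as $C V^2 f$ with the clock frequency unchanged, the quoted reduction factor of about 2.8 together with the voltage scaling of 1.83 implies the size of the capacitance overhead. The symbol $C_{pipe}$ is introduced here only for this estimate, and the resulting value is inferred rather than stated in the text:

$$\frac{P_{pipe}}{P_{ref}} = \frac{C_{pipe}}{C_{ref}} \cdot \frac{1}{1.83^2} \approx \frac{1}{2.8} \quad\Rightarrow\quad \frac{C_{pipe}}{C_{ref}} \approx \frac{1.83^2}{2.8} \approx \frac{3.35}{2.8} \approx 1.2$$

An overhead of roughly 20% is in line with the statement that the switching capacitance increases only slightly.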

In general, if a processor is pipelined with N stages of registers, then the delay between the pipeline stages is reduced by almost a factor of N while the clock frequency is maintained. Then, the power supply voltage can be scaled aggressively. Consequently, the power saving is large.

Note that, as in the case of parallelism, an architecture with a large number of pipeline stages can result in an overhead in power and area. The added registers must be clocked and hence the load on the clock network can become significant with increased pipelining. One drawback of the use of pipelining is that more latency is added to the output signal.

The combination of pipelining and parallelism can result in further power reduction, because the power supply voltage can be reduced aggressively. Also the frequency of operation is reduced. However, the area would increase significantly. For low-voltage operation, the threshold voltage should also be reduced to reduce the power dissipation; otherwise the power supply voltage reduction is limited. Indeed, at low voltage, $V_{DD}$ approaches $V_T$ and the delay increases drastically. To maintain the throughput with parallelism/pipelining, the threshold voltage should be reduced compared to $V_{DD}$.

8.3.3 Distributed Processing

To reduce the power dissipation of a centralized processor, a distributed processing technique can be utilized. This concept of distributed processing is explained by the example of the Vector Quantized (VQ) image encoder [14] presented in reference [15]. First we review the VQ algorithm for video compression; then, in the next section, the power reduction at the algorithm level of the VQ is discussed.

A video image, represented by a group of pixels, is vector quantized by breaking it into blocks (vectors) of pixels that are mapped to a codebook of probable vectors using Mean Square Error (MSE) as the distortion measure. For the example given in [15], the image is segmented into 4 x 4 pixel vectors (vector size is 16). The VQ employs a codebook of 256 levels. The input data is represented on 16 x 8 bits and the output (8 bits) represents the index of the best match as shown in Fig. 8.9 [15]. The compression ratio is then 16:1. To process 30 frames/s, a vector must be compressed every 17.3 µs (each frame is 128 x 240 pixels). The MSE (distortion metric) between a vector X of 16 pixels and a codebook vector C is given by

$$MSE = \sum_{i=0}^{15} (C_i - X_i)^2 \qquad (8.8)$$
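The full-search use of Eq. (8.8) can be sketched as follows. This is a minimal illustration with a tiny made-up codebook (4 code vectors instead of 256); the function names are hypothetical, and the distortion follows the text in being a sum of squared differences.

```python
def mse(code_vector, x):
    """Distortion of Eq. (8.8): sum of squared differences over the vector."""
    return sum((c - xi) ** 2 for c, xi in zip(code_vector, x))

def full_search(codebook, x):
    """Return the index of the code vector with the smallest distortion."""
    return min(range(len(codebook)), key=lambda k: mse(codebook[k], x))

# Hypothetical codebook of four flat 16-pixel vectors and one input vector.
codebook = [[0] * 16, [10] * 16, [20] * 16, [30] * 16]
x = [12] * 16
index = full_search(codebook, x)
print(f"best match: index {index}, distortion {mse(codebook[index], x)}")
```

Every code vector is visited once, so the work grows linearly with the codebook size; this is the cost that the tree-structured search discussed next is meant to reduce.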

To compute this algorithm, a large number of memory accesses to the codebook and arithmetic operations is needed (see Section 8.4). The number of computations can be reduced by using a differential search, with a priori computed terms, combined with Tree Search (TS) between two vectors a and b at the same level of the tree. The distortion difference between the two vectors a and b at the same level of the tree is given by

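$$MSE_{ab} = MSE_a - MSE_b \qquad (8.9)$$

Then,

$$MSE_{ab} = \sum_{i=0}^{15} (C_{ai} - X_i)^2 - \sum_{i=0}^{15} (C_{bi} - X_i)^2 \qquad (8.10)$$

$$MSE_{ab} = \sum_{i=0}^{15} (C_{ai}^2 - C_{bi}^2) + 2 \sum_{i=0}^{15} X_i (C_{bi} - C_{ai}) \qquad (8.11)$$

The two terms $(C_{ai}^2 - C_{bi}^2)$ and $2(C_{bi} - C_{ai})$ are computed a priori and stored in a memory to reduce the number of operations.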
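The saving from the a priori terms can be made concrete with a short sketch. Expanding the squares in the difference of the two distortions leaves a constant plus one multiply-accumulate per component; the code below checks this against the direct computation. The function names and the 4-element test vectors are illustrative assumptions (the text uses 16-element vectors).

```python
def precompute_pair(c_a, c_b):
    """A priori terms for a pair of code vectors: the constant sum of
    (C_ai^2 - C_bi^2) and the per-component weights 2 * (C_bi - C_ai)."""
    constant = sum(a * a - b * b for a, b in zip(c_a, c_b))
    weights = [2 * (b - a) for a, b in zip(c_a, c_b)]
    return constant, weights

def differential_distortion(constant, weights, x):
    """MSE_a - MSE_b using one multiplication per vector component."""
    return constant + sum(w * xi for w, xi in zip(weights, x))

def direct_difference(c_a, c_b, x):
    """Reference computation: evaluate both distortions and subtract."""
    mse_a = sum((a - xi) ** 2 for a, xi in zip(c_a, x))
    mse_b = sum((b - xi) ** 2 for b, xi in zip(c_b, x))
    return mse_a - mse_b

c_a, c_b, x = [1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]
constant, weights = precompute_pair(c_a, c_b)
diff = differential_distortion(constant, weights, x)
print(diff, direct_difference(c_a, c_b, x))   # a negative value means a is the closer vector
```

Only the sign of the result is needed to choose a branch of the tree, so the two full distortions never have to be formed at run time.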

Fig. 8.10(a) shows the centralized implementation of the VQ. It has a centralized memory, processing element, and controller. This architecture is time-multiplexed, performing operations sequentially over a large number of clock cycles. In TSVQ, each level of the tree has specific code vectors that are found only at that level. Therefore, the memory can be partitioned into separate memories for each level of the tree. Fig. 8.10(b) shows the distributed implementation of the VQ. The memory size increases from one module to the next. The architecture is pipelined, allowing the clock frequency and supply voltage to be reduced.

The distributed memory architecture has lower switched capacitance when reading the code vectors than the centralized case. This distributed implementation has eight controllers and processing elements, but since they are clocked at a lower frequency, with a low supply voltage, the energy dissipated per vector does not change [15]. Through this partitioning, the power dissipated by the centralized implementation was reduced by a factor of ~ 11 at the expense of an area increase by a factor of ~ 2.



From this example we can learn that a proper design of the architecture, through distributed processing, is more power-efficient than the centralized processor. In the distributed implementation, the different local hardware resources can be optimized more efficiently than the global hardware in the centralized implementation. The application of this technique depends on whether the executed algorithm can be partitioned. Keep in mind that the power saving is traded against the occupied area, while the throughput is maintained.

8.3.4 Power Management

In old designs of microprocessors, DSPs, ASICs, etc., there was wasted power due to the clocking of blocks which are idle for a significant period of time. Recently, power management methodologies have been playing an important role in avoiding wasted power in normal and standby modes of operation [17, 18, 19, 20]. In this section, only some of the power management techniques are discussed.

There are two types of power management: i) dynamic and ii) static. Dynamic Power Management (DPM) allows selective shut-down of different blocks of the chip based on the level of activity required to run a particular application. Different blocks of the chip may be idle for a certain period of time when running different applications. For example, the floating point unit can have 100% idle time when the processor is executing integer applications. The DPM requires additional logic on the chip. This logic is controlled by signals that indicate idle periods.

In the PowerPC¹ 603 [21], the DPM mode is enabled by software. The DPM logic automatically stops the clock switching of a specific unit, generated by clock regenerators. The clock regenerators produce two clocks, C1 and C2, which feed master and slave latches. Two "freeze" input signals control the clocks, C1 and C2, as shown in the timing diagrams of Fig. 8.11. The logic needed for DPM does not introduce any performance degradation and it consumes 0.3% of the total die area in the PowerPC. The DPM provides a power saving of 10-20% depending on the application to be executed. The DPM can be implemented at either a high level (e.g., an execution unit) or a low level (e.g., a block inside a unit) of hardware.

Static Power Management (SPM) permits the saving of the power dissipation in the standby mode. In this case, the activity of the entire system is monitored rather than that of a specific unit (or block). When the system remains idle for a significant period of time, then the entire chip is shut down². The SPM may have several modes depending on whether the entire chip is shut down or a part of it. For example, the PowerPC 603 has three modes which are programmable through hardware but controlled by software (operating system). In this microprocessor, one mode is called sleep mode, which allows maximum power savings by disabling the clock to all units. In this mode the PLL and external input clock are disabled to bring the power dissipation down to the leakage levels. The power of the PowerPC 603, in the sleep mode, is as low as 1.8 mW [20].

[Fig. 8.11: Timing diagrams of the clocks C1 and C2 and their freeze signals.]

¹PowerPC 603 is from IBM Corp.


8.4 ALGORITHMIC-LEVEL POWER REDUCTION

Algorithm optimization can have a significant impact on the power consumption of a system. Design decisions made at this level, combined with the architecture level, may lead to a large power saving. In this section, we discuss two approaches that reduce the power dissipation at the algorithm level. The first one is based on the reduction of the switched capacitance, by minimizing the complexity of the system. The second method exploits data coding for the purpose of low switching activity.

8.4.1 Switched Capacitance Reduction

The power dissipated by an algorithm can be measured, for example, by counting the number of operations required to execute such an algorithm. To reduce the power of an algorithm, the number of primitive operations such as memory accesses, ALU operations, etc., should be minimized. The different types of operations do not consume the same amount of power. For example, a multiplication operation consumes more power than an addition operation. Thus, when minimizing the number of operations of an algorithm, the type of operation should be taken into account. Keep in mind that high performance systems use complex algorithms that require a large number of operations.

To illustrate this consideration, the computation complexity of three methods of the VQ algorithm is presented. Remember that the distortion metric between the input data (vector X) and a codebook vector C is given by Equation (8.8). One method to evaluate the distortion and find the best match is to use a full search through the codebook. Thus, the distortion is computed for the 256 levels of the codebook. Each level requires 16 memory accesses to perform 16 subtractions, 16 multiplications, 15 additions, etc. Hence a large number of primitive operations are needed.

In the binary TSVQ already presented in Section 8.3.3, the codebook is organized into a tree structure as shown in Fig. 8.12. The input vector is compared with two code vectors at each node. Based on this comparison, one of the two branches is chosen and the codebook search space is reduced compared to the full search, since a reduced number of code vectors (16) is utilized. For each comparison, at a specific level, an index bit is generated as shown in Fig. 8.12. The process of comparison through the tree is repeated until a leaf node is reached. For a codebook of 256 levels, the tree has a depth of 8 (d = 7). Compared to the full search, the number of memory accesses and executed operations is reduced considerably since only 16 code vectors are used in the TSVQ algorithm. One VLSI implementation of the TSVQ algorithm uses systolic arrays [22].

[Fig. 8.12: Tree-structured codebook with levels d = 0 to d = 7.]

The number of computations can be further reduced by using the differential search of the TSVQ [see Equation (8.11)]. At each level (i) of the tree the differential distortion between the left (vector a) and right (vector b) code vectors connected to the level (i - 1) is computed. Therefore, the number of operations is reduced. Table 8.1 [15] shows the computation complexity of the three methods of the VQ. The differential TSVQ results in a lower number of operations to be executed for each type.
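A back-of-envelope count of memory accesses and multiplications for the three methods can be written down from the parameters given in the text (16-element vectors, 256 code vectors, 8 tree levels). The sketch below is an interpretation, not the accounting used in [15]: in particular, the assumption that the differential search reads 16 stored weights plus one stored constant per level is ours, and add/subtract counts are left out because the text does not give enough detail to derive them. The results should be compared against Table 8.1.

```python
VECTOR_SIZE = 16        # 4 x 4 pixels per vector
CODEBOOK_LEVELS = 256   # code vectors in the full codebook
TREE_DEPTH = 8          # binary decisions from root to leaf (2**8 = 256)

# Full search: every code vector is read and multiplied component by component.
full_search = {
    "memory": CODEBOOK_LEVELS * VECTOR_SIZE,
    "multiply": CODEBOOK_LEVELS * VECTOR_SIZE,
}

# Tree search: two code vectors are examined at each of the 8 levels.
tree_vectors = 2 * TREE_DEPTH
tree_search = {
    "memory": tree_vectors * VECTOR_SIZE,
    "multiply": tree_vectors * VECTOR_SIZE,
}

# Differential tree search (assumed breakdown): per level, 16 precomputed
# weights and 1 precomputed constant are read, with 16 multiplications.
differential = {
    "memory": TREE_DEPTH * (VECTOR_SIZE + 1),
    "multiply": TREE_DEPTH * VECTOR_SIZE,
}

for name, counts in [("full", full_search), ("tree", tree_search), ("differential", differential)]:
    print(f"{name:13s} memory={counts['memory']:5d} multiply={counts['multiply']:5d}")
```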

8.4.2 Switching Activity Reduction

Minimizing the switching activity, at a high level, is one way to reduce the power dissipation of digital processors. This can have an influence on the power reduction, especially when the switching signals have a large capacitance. One method to minimize the switching activity, at the algorithmic level, is to use an appropriate coding for the signals rather than straight binary code.


Table 8.1 Computation complexity of the three VQ methods.

Algorithm                  Memory Access   Multiplication   Add / Subtract
Full Search                4096            4096             8448
Tree Search                256             256              520
Differential Tree Search   136             128              136

In [23], Gray coding has been used for the address lines of a microprocessor, for both instruction and data accesses, to reduce the switching activity of the nets. The advantage of Gray code over binary code is that Gray code changes by only one bit as it sequences from one number to the next. In other words, if the memory access pattern is a sequence of consecutive addresses, then each memory access changes only one bit of its address. Due to instruction locality, during program execution, most of the memory accesses are sequential. Therefore the Gray code eliminates the simultaneous switching of a significant number of bits.

Table 8.2 shows a comparison of the 3-bit representations of the binary and Gray codes. Note that the Gray code has only one transition for each sequential change while the binary code may have many transitions.

Table 8.2 Binary and Gray-code representation.

Binary Code   Gray Code   Decimal Equivalent
000           000         0
001           001         1
010           011         2
011           010         3
100           110         4
101           111         5
110           101         6
111           100         7
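The difference in switching can be counted directly. The following sketch converts addresses to the reflected binary Gray code and counts the bit switches over a sequential run of 3-bit addresses; the helper names are illustrative.

```python
def to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def bit_switches(sequence):
    """Total number of bits that change between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))

addresses = list(range(8))                      # sequential 3-bit addresses
binary_switches = bit_switches(addresses)
gray_switches = bit_switches([to_gray(a) for a in addresses])
print(f"binary: {binary_switches} switches, Gray: {gray_switches} switches")
```

For purely sequential accesses the Gray-coded bus switches exactly one bit per access; for non-sequential patterns, such as typical data accesses, the advantage shrinks, which is consistent with the smaller saving reported for data addresses.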

In [23], the switching property of the address coding was measured using the number of bit switches per executed instruction. For instruction accesses, both the Gray and binary coding were compared using benchmark programs. The maximum reduction in bit switches was found to be as high as 58% and the average reduction was equal to 31%. The same study was also carried out for data addresses. The average reduction of bit switches was ~ 8%.

8.5 POWER ESTIMATION TECHNIQUES

Power estimation refers, in general, to the techniques of estimating the average power dissipation of circuits. The goal of this section is to present an overview of power analysis techniques and tools at the circuit, gate, architectural, and behavioral levels of abstraction. Measuring the power consumption is critical for low-power design as it permits the designer to optimize power, meet requirements, and know the power distribution through the chip.

8.5.1 Circuit-Level Tools

The most straightforward method of power estimation is by circuit simulation: perform a circuit simulation of the design and measure the average current drawn from the supply. From this, the average power can be estimated. The disadvantage of this approach is that the results are strongly dependent on the input patterns to the circuit (pattern-dependent technique), also called dynamic³ power simulation. If the circuit has a large number of inputs, then the circuit simulation would be time consuming and even impractical.

The most accurate power simulator to date is still SPICE. However, it can handle only very small circuits (e.g., hundreds of transistors). SPICE accurately takes into account non-linear capacitances (junction and gate) which cannot be captured by higher level tools. Also, it can accurately measure short-circuit and leakage currents. The latter is very important for low-$V_T$ applications. SPICE cannot be used to estimate the power of large circuits or chips, due to the time-consuming nature of the simulator. It is a pattern-dependent power analysis tool.

³Dynamically computed power should not be confused with dynamic power.


Another transistor-level power simulator/analyzer is PowerMill⁴ [24]. It applies an event-driven simulation algorithm to increase the computation speed by two to three orders of magnitude over SPICE, with an acceptable level of accuracy (within 10%). Also, it uses table lookup to determine the terminal current of the device from the applied voltages.

PowerMill can also identify the hot spots (which consume more dynamic power) and trouble spots (which consume unexpectedly large amounts of leakage current). Moreover, elements with excessive short-circuit current are detected. This allows the designer to resize the circuit to reduce the rise/fall time. Static reduced-swing nodes are detected as shown in the example of Fig. 8.13. The node A is charged to $V_{DD} - V_T$ when the input is low.

Another approach for power estimation is the use of statistical techniques. The work in [25] suggested the use of Monte Carlo simulation to estimate the total average power of the circuit. Basically, this statistical technique is based on applying randomly generated input patterns at the primary inputs and monitoring the convergence of the power dissipation. The simulation is stopped when the measured power is close enough to the true average power.

This approach, based on the Monte Carlo method, requires simulation over a large number of measurements. The advantage of the statistical techniques is that they can be built around existing simulation tools.
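The idea of simulating random patterns until the estimate converges can be sketched on a toy gate-level model. Everything specific below is an assumption made for illustration: the 3-input circuit, the node capacitances, the batch size, and the stopping rule (a normal-approximation confidence interval on the batch means); the procedure of [25] may differ in its details. The quantity estimated is the average switched capacitance per cycle, to which the dynamic power is proportional.

```python
import random
import statistics

def nodes(a, b, c):
    """Hypothetical 3-input circuit: n1 = a AND b, out = n1 OR c."""
    n1 = a & b
    return (n1, n1 | c)

CAPS = (1.0, 2.0)   # assumed capacitances of n1 and out, arbitrary units

def sample_batch(rng, cycles=50):
    """Average switched capacitance per cycle over one batch of random patterns."""
    prev = nodes(*(rng.getrandbits(1) for _ in range(3)))
    switched = 0.0
    for _ in range(cycles):
        cur = nodes(*(rng.getrandbits(1) for _ in range(3)))
        switched += sum(cap for cap, p, q in zip(CAPS, prev, cur) if p != q)
        prev = cur
    return switched / cycles

def monte_carlo(rel_error=0.02, z=1.96, min_batches=30, max_batches=100000, seed=0):
    """Add batches until the confidence half-width is within rel_error of the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < max_batches:
        samples.append(sample_batch(rng))
        if len(samples) >= min_batches:
            mean = statistics.fmean(samples)
            half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
            if half_width <= rel_error * mean:
                break
    return statistics.fmean(samples), len(samples)

estimate, batches = monte_carlo()
print(f"switched capacitance per cycle ~ {estimate:.3f} after {batches} batches")
```

Because the stopping rule only looks at the spread of the batch results, the same loop could be wrapped around any existing simulator that returns a power figure per batch.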

⁴PowerMill is from EPIC Design Technology.


8.5.2 Gate-Level Techniques

In order to overcome the shortcomings of power analysis tools at the circuit level, several gate-level estimation tools have recently been proposed. In this section, we present two techniques for power estimation at the gate level. The first approach relies on the probabilistic method, while the second one is based on event-driven simulation.

8.5.2.1 Probabilistic Power Estimation

The power dissipation can be analyzed using a pattern-independent approach when the signals are represented with probabilities (also called static techniques). This approach overcomes the shortcomings of simulation-based techniques. The user supplies the probabilities of the primary inputs to a logic network. The average power dissipation of a logic network is estimated as

$$P = V_{DD}^2\, f \sum_{i=1}^{N} \alpha_i C_i \qquad (8.12)$$

where N is the number of nodes in the network, each with a total physical capacitance $C_i$. $\alpha_i$ is the switching activity (also called the transition probability, $P_t$) given by

$$\alpha_i = P_i (1 - P_i) \qquad (8.13)$$

where $P_i$ is the probability that node i is at the high level. In this expression of the activity it is assumed that the circuit input and internal nodes are independent (spatial independence). Also the values of the same signal, in two consecutive clock cycles, are assumed independent (temporal independence).

If the input probabilities to a network are provided, then they are propagated through the circuit to evaluate the transition probability at each node. For example, for a 2-input AND function $y = x_1 \cdot x_2$, the probability of the output being at the high level is given by $P_y = P_{x_1} \cdot P_{x_2}$. The computation of the probabilities for different gates is discussed in Chapter 4.

One tool (LTIMES), based on probabilities, was first proposed in [26]. In this work, the temporal and spatial independence of signals are assumed. Practically, the signals may be correlated. Also a zero-delay model was assumed, which leads to an error in estimating the power, since the glitching power is not accounted for.
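The propagation of signal probabilities and the evaluation of Eqs. (8.12) and (8.13) can be sketched as follows. This is a minimal illustration under the same independence and zero-delay assumptions as above; the two-gate network, the node capacitances, and the function names are hypothetical, and the sketch is not a description of how LTIMES is implemented.

```python
def and_prob(p_a, p_b):
    """Output probability of a 2-input AND with independent inputs."""
    return p_a * p_b

def or_prob(p_a, p_b):
    """Output probability of a 2-input OR with independent inputs."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)

def average_power(vdd, freq, node_probs, node_caps):
    """Eq. (8.12) with the activity alpha_i = P_i * (1 - P_i) of Eq. (8.13)."""
    return vdd ** 2 * freq * sum(p * (1.0 - p) * c for p, c in zip(node_probs, node_caps))

# Hypothetical network: n1 = a AND b, y = n1 OR c, all primary inputs at P = 0.5.
p_a = p_b = p_c = 0.5
p_n1 = and_prob(p_a, p_b)
p_y = or_prob(p_n1, p_c)

# Assumed node capacitances (10 fF and 20 fF), 3.3 V supply, 50 MHz clock.
power = average_power(3.3, 50e6, [p_n1, p_y], [10e-15, 20e-15])
print(f"P(n1) = {p_n1}, P(y) = {p_y}, estimated power = {power * 1e6:.3f} uW")
```

The example is chosen so that the inputs of each gate really are independent; with reconvergent fanout the simple product rules above would no longer be exact.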

Probabilistic power estimation approaches that compute the power due to glitches and apply a real delay model have been proposed [27, 28]. In [27], the switching activity computation is based on the transition density. The assumption made in [27] is the spatial independence of the signals. A power estimator tool, based on the transition density, has been called DENSIM. The transition density of a node is defined as the average number of nodal transitions per unit time. If y is a boolean function with inputs $x_i$, then the boolean difference of y, with respect to $x_i$, is defined by

$$\frac{\partial y}{\partial x_i} = y|_{x_i = 1} \oplus y|_{x_i = 0} \qquad (8.14)$$

It was shown in [29] that if the $x_i$ are spatially independent, then the density of the boolean function is given by

$$D(y) = \sum_{i} P\left(\frac{\partial y}{\partial x_i}\right) D(x_i) \qquad (8.15)$$

where P(x) is the equilibrium probability of the signal over time. Equations (8.14) and (8.15) are used to propagate the density through the boolean network. $\partial y / \partial x_i$ is one if a transition at $x_i$ will cause a simultaneous transition at y. As an example, consider the case of a 2-input AND gate with $y = x_1 x_2$. In this case, $\partial y / \partial x_1 = x_2$ and $\partial y / \partial x_2 = x_1$, so that $D(y) = P(x_2) D(x_1) + P(x_1) D(x_2)$. Hence, from the probability and density values at the primary inputs of a logic network, the density at the output can be computed. The boolean differences of a logic network are calculated using Binary Decision Diagrams (BDDs) [30].

Note that the average power dissipation is computed by

$$P = \frac{1}{2} V_{DD}^2 \sum_{i=1}^{N} C_i\, D(x_i) \qquad (8.16)$$

The factor 1/2 is added to account for the double transition per clock period.
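The density propagation for simple gates can be sketched in Python. The AND rule is the one worked out above; the OR rule follows from the same boolean-difference definition. The power helper assumes the commonly used form P = 1/2 V_DD^2 sum(C_i D_i); the input probabilities, densities, capacitances, and function names are made up for illustration, and the sketch does not describe DENSIM itself.

```python
def and_density(p1, d1, p2, d2):
    """y = x1 AND x2: dy/dx1 = x2 and dy/dx2 = x1,
    so D(y) = P(x2) * D(x1) + P(x1) * D(x2)."""
    return p2 * d1 + p1 * d2

def or_density(p1, d1, p2, d2):
    """y = x1 OR x2: dy/dx1 = NOT x2 and dy/dx2 = NOT x1."""
    return (1.0 - p2) * d1 + (1.0 - p1) * d2

def power_from_density(vdd, caps, densities):
    """Assumed form of the density-based power: 1/2 * VDD^2 * sum(C_i * D_i)."""
    return 0.5 * vdd ** 2 * sum(c * d for c, d in zip(caps, densities))

# Assumed input statistics: probabilities and densities in transitions per second.
p1, d1 = 0.5, 2e6
p2, d2 = 0.25, 4e6
d_and = and_density(p1, d1, p2, d2)
d_or = or_density(p1, d1, p2, d2)
power = power_from_density(3.3, [10e-15, 10e-15], [d_and, d_or])
print(f"D(AND) = {d_and:.2e}, D(OR) = {d_or:.2e}, power = {power * 1e6:.3f} uW")
```

Note that an input with a high density but gated by a signal that is rarely at the enabling value contributes little to the output density, which is exactly what the boolean-difference probability expresses.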

This model, based on transition density, ignores the spatial correlation of the signals and computes, approximately, the power due to glitches. The work in [28] attempts to handle both spatial and temporal correlations. One disadvantage of the approach in [28] is that the use of BDDs, for the whole circuit, tends to limit the size of the network that can be analyzed.

The probabilistic techniques have the advantage that the user does not have to supply simulation patterns, and they are claimed to have fast computation time. However, they do not account for the internal power of the gates and the static power dissipation. These techniques can be used, for example, as a fast power estimator for logic synthesis. They might also be suited for comparing various subsystem structures.

8.5.2.2 Event-Driven Simulation

Another gate level power analysis approach has been proposed for semi-custom design [31]. The environment of the system is shown in Fig. 8.14. The system uses a cell library that has been characterized for static and dynamic power dissipation with the Entice (ENergy and TIming Characterization Environment) cell characterization system [32]. The dynamic power includes the power due to the short-circuit current and the power due to the load capacitance. Entice characterizes each cell taking into account the following parameters: input signal slope, output capacitive load, operating voltage, temperature, and process parameters. Entice uses SPICE as a circuit simulator to model each cell for power.

A set of power vectors describes all possible events where power can be dissipated by the cell for dynamic and static cases. With SPICE these power events are accurately characterized. There are two types of power vectors: i) dynamic and ii) static. A dynamic power vector describes an event in which power is dissipated due to a signal switching at the cell inputs. For example, for a 2-input (A and B) AND gate, when A = 1 and B makes a transition from 0 to 1, energy is dissipated. A static power vector describes the conditions of logic signals under which leakage power occurs.

The designer creates a design from the cell library at gate level, which is then input to the Aspen (A System for Power EstimatioN) system. The stimulus to drive the logic simulator and the interconnect loads representing the inter-cell connectivity (estimates or actual values provided by back-annotation from layout) are also specified. A logic simulator such as Verilog-XL is used as an event-driven simulator. Upon invocation, Aspen monitors the power event occurrence (node activity) of each cell and computes the total power dissipation as the sum of the power dissipation of all the cells in the power vector paths. Multiple time windows can be specified for simulation to compute the average power over different time periods. Note that Aspen uses the power vectors of a cell to compute the total power.

Entice and Aspen are from Motorola Inc. Verilog-XL is from Cadence Design Systems Inc.


The dynamic power of each cell is computed by multiplying the number of power events (transition count) by the energy dissipation per transition event of the cell. This process is applied to all dynamic power vectors of a cell to obtain the total energy dissipated. The total dynamic power of a cell over a certain time period is equal to the total energy divided by the time period.
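The dynamic power bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Aspen implementation. The vector names, event counts and per-event energies are hypothetical values for a 2-input AND cell.

```python
def dynamic_power(event_counts, energy_per_event, period):
    """Total energy of all dynamic power vectors of a cell, divided by the period."""
    total_energy = sum(count * energy_per_event[vector]
                       for vector, count in event_counts.items())
    return total_energy / period

# Hypothetical characterization data: joules dissipated per event.
and2_energy = {
    "A=1, B: 0->1": 0.2e-12,
    "A=1, B: 1->0": 0.1e-12,
}

# Hypothetical event counts observed by the logic simulator in a 1 us window.
and2_counts = {
    "A=1, B: 0->1": 1000,
    "A=1, B: 1->0": 500,
}

p_dyn = dynamic_power(and2_counts, and2_energy, period=1e-6)
print(p_dyn)  # watts
```

Running the same function over several time windows gives the average power over different time periods, as mentioned for Aspen.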

The static power vector is used to compute the leakage of a cell. Note that the static power of a cell depends on the logic state of the cell, as shown in Fig. 8.15. To compute the static power dissipation, the duration of activation of the corresponding static power vector is measured. A transition of a net signal may cause one static power vector to be activated and another to be deactivated. Vectors are time stamped upon activation and upon deactivation. Then the total length of time during which the vector is active is found. The activation time of the static power vector is multiplied by its power dissipation value (per time unit) to obtain the static energy of the vector. Again, the static power dissipation of all vectors associated with a cell instance is summed to derive the total power dissipation.
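The time-stamping scheme for static power vectors can be sketched as follows. This is a minimal illustration under stated assumptions: the state names, leakage values and activation times are hypothetical, and each state change is taken to deactivate the previous vector at the same instant.

```python
def static_energy(activations, leakage_power, t_end):
    """Sum leakage power times active time over all static power vectors.

    activations: list of (time, state) pairs sorted by time; each entry
    activates `state` and deactivates the previous one.
    """
    energy = 0.0
    closing = activations[1:] + [(t_end, None)]
    for (t_on, state), (t_off, _) in zip(activations, closing):
        energy += leakage_power[state] * (t_off - t_on)
    return energy

# Hypothetical leakage (watts) of a 2-input AND cell for each input state AB.
and2_leakage = {"00": 1e-9, "01": 2e-9, "10": 2e-9, "11": 4e-9}

# Hypothetical time stamps (seconds) recorded during simulation.
trace = [(0.0, "00"), (40e-9, "01"), (70e-9, "11")]
window = 100e-9

e_static = static_energy(trace, and2_leakage, window)
p_static = e_static / window  # average static power over the window
print(e_static, p_static)
```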


The results reported by Aspen, such as the switching activity of nodes, can be used to drive floorplanning, placement and routing tools. Aspen can also handle chips with a complexity of several hundred thousand gates and is four orders of magnitude faster than SPICE. It produces results within 10% of SPICE results. One disadvantage of Aspen is that it cannot handle the power due to glitches.

8.5.3 Architecture-Level Power Estimation

The architecture of a design is represented by functional blocks, and the complexity of the design at this level is relatively low compared to the circuit and gate levels. In this section, several approaches and techniques for power modeling and analysis at the architectural level are reviewed.

8.5.3.1 Gate Count Method

One tool developed for architectural power dissipation estimation is based on equivalent logic count, memory size, logic circuit style (dynamic or static), interconnection busses, clock network and layout style (full-custom or semi-custom) [33]. The complexity of an architecture is described in terms of an average number of logic gates, such as a 3-input AND (buffered NAND) gate connected to three identical AND gates at the output node (i.e., fanin = fanout = 3), as shown in Fig. 8.16. The total power of the logic part is roughly equal to the number of gates multiplied by the power of a gate using a user-specified switching activity. This activity factor is assumed fixed across the design.
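As a rough numerical illustration of the gate count estimate, the following Python sketch multiplies the number of equivalent gates by the power of one gate. The per-gate expression (activity times capacitance times squared supply voltage times frequency) and all numeric values are assumptions made for this example; they are not taken from [33].

```python
def gate_power(activity, c_gate, vdd, freq):
    """Assumed dynamic power of one equivalent gate: a * C * VDD^2 * f."""
    return activity * c_gate * vdd ** 2 * freq

def logic_power(n_gates, activity, c_gate, vdd, freq):
    """Gate count estimate: number of gates times the power of one gate."""
    return n_gates * gate_power(activity, c_gate, vdd, freq)

# Hypothetical design: 100k equivalent gates, fixed activity of 0.25,
# 50 fF switched per gate, 3.3 V supply, 50 MHz clock.
p_logic = logic_power(100_000, 0.25, 50e-15, 3.3, 50e6)
print(p_logic)  # about 0.68 W
```

Because one activity value is applied to every gate, the estimate cannot distinguish a busy datapath from an idle control block, which is the limitation of this method.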


The power of the on-chip memory is modeled for a certain memory architecture. The interconnections are divided into local and intermediate interconnections and global busses. A local interconnection is an interconnection within a logic gate. The intermediate interconnections are used for connections between gates or functional blocks (subsystems). The global bus includes data, control, and address busses. The lengths of local and intermediate interconnections are modeled by Rent's rule [34]. Then the power can be computed from the lengths using a fixed switching activity equal to the one specified for the logic. The global interconnect is determined from the dimensions of the chip and the number of drivers/receivers connected to it.

The power model of the clock network is based on the H-tree [34] and the chip dimensions. The power of the on-chip drivers is also modeled in two components. One is the power used to drive the total off-chip capacitance. The other is the power consumed by the pad driver itself. The activity factor for the pads is assumed fixed and is equal to 1 [33].

The tool developed in [33] is used as a power estimator in the early stage of the design. It requires some technological parameters (feature size, gate oxide thickness, parameters of the interconnection layers, etc.), the supply voltage, the chip area, the switching factor and the gate count. This tool can only be used as a rough estimator of the total power of the chip, since the switching activity is assumed fixed throughout the design. Therefore the power partition between the different units can be incorrectly estimated.


8.5.3.2 The Power Factor Approximation Method

The Power Factor Approximation (PFA) technique is another method to estimate the power dissipation [35]. It has been used for DSP architectures. The total power dissipation of a functional block, such as a multiplier, an adder, a memory, etc., can be modeled by the following approximation

(8.17)

where $G$ is the number of logic gates composing the functional block, $a_i$ is the switching activity of the $i$th gate, $C_i$ is the load of the $i$th gate, $i_{sc,i}$ is the short-circuit component, and $f$ is the frequency. This power equation can be expressed in a more compact form as

$$P_{avg} = \kappa\, G\, f \tag{8.18}$$

where $\kappa$ is the PFA constant and can be related easily to Equation (8.17). $G$ can also be looked at as the hardware complexity factor instead of a number of gates. The parameter $\kappa$ has different values for different blocks. For example, for an $n$-bit multiplier, the factor $G$ can be approximately equal to $n^2$, as shown in Fig. 8.17. This is due to the number of adder cells in the multiplier. Then, we have

$$P_{mult} = \kappa_{mult}\, n^2 f_{mult} \tag{8.19}$$

The power supply voltage is included in the parameter $\kappa$. This parameter is extracted empirically from measured or simulated power values at a fixed power supply voltage.

For a VLSI chip composed of several functional blocks, the total power dissipation can be determined by summing the power of all blocks. We have

$$P_{tot} = \sum_{i \in \text{all blocks}} \kappa_i G_i f_i \tag{8.20}$$

Note that these power expressions are without the static power dissipation.

Thus, the PFA technique is based on modeling precharacterized functional blocks. Each block has a PFA factor independent of the others. Hence this technique provides a more general methodology compared to the gate equivalent model of Svensson and Liu discussed previously. The PFA factor is extracted using independent Uniform White Noise (UWN) inputs (i.e., random inputs). UWN inputs mean that the input bits are uncorrelated in space and time and independent of the data distribution. The signal and transition probabilities of each bit $i$ of the input are given by

$$P_i(1) = 0.5 \quad \text{and} \quad P_i(0 \to 1) = 0.25 \tag{8.21}$$

Consequently, this technique does not account for the strong dependency of power consumption on the statistics of the input data [36]. The next section treats the case of power modeling that takes into account the correlated behavior of the bits.
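The PFA bookkeeping of Equations (8.18) to (8.20) can be sketched in Python as below. The measured power, the block sizes and the frequencies are hypothetical values chosen for illustration, and the function names are not from [35].

```python
def extract_kappa(p_measured, g, freq):
    """Solve P = kappa * G * f for kappa from one characterization run."""
    return p_measured / (g * freq)

def block_power(kappa, g, freq):
    """PFA estimate of one functional block."""
    return kappa * g * freq

def chip_power(blocks):
    """Sum of kappa_i * G_i * f_i over all blocks."""
    return sum(block_power(k, g, f) for k, g, f in blocks)

# Hypothetical characterization: a 16-bit multiplier (G = n^2 = 256)
# dissipates 20 mW at 10 MHz with random (UWN) inputs.
kappa_mult = extract_kappa(20e-3, 16 ** 2, 10e6)

# Estimate for an 8-bit multiplier of the same library running at 20 MHz.
p_mult8 = block_power(kappa_mult, 8 ** 2, 20e6)

# Hypothetical chip: the 8-bit multiplier plus a block with its own kappa.
p_chip = chip_power([(kappa_mult, 8 ** 2, 20e6), (2e-12, 100, 20e6)])
print(kappa_mult, p_mult8, p_chip)
```

Since the constant is extracted at one supply voltage and with random inputs, the estimate is only meaningful for blocks operated under comparable conditions.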

8.5.3.3 Dual Bit Type Model

In digital signal processing, correlation can exist between the values of a temporal sequence of data. The UWN model can lead to an error in estimating the power of a circuit even if the bit-width utilization is maximized. To take the data correlation into account, the Dual Bit Type (DBT) data model has been proposed in [36, 37]. The DBT data model permits accurate estimation of the power dissipation.

[Figure 8.18: transition activity P(0→1) versus bit position for two's complement data streams with temporal correlation ρ = -0.99, -0.80, -0.60, 0.0, 0.60, 0.80 and 0.99.]

Fig. 8.18 shows the transition activity of several different two's complement data streams versus the bit position (for an $n$-bit word). In this figure, each curve corresponds to a different temporal correlation given by

$$\rho = \frac{\operatorname{cov}(X_{t-1}, X_t)}{\sigma^2} \tag{8.22}$$

where $X_{t-1}$ and $X_t$ are successive data (in time) and $\sigma^2$ is the variance. $\rho = 0$ corresponds to the white noise case, where $P(0 \to 1) = 0.25$. From Fig. 8.18 it is evident that the UWN model, while sufficient for describing activity in the Least Significant Bits (LSBs), is inadequate for the Most Significant Bit (MSB) region. The UWN model works correctly for the LSBs up to the break point BP0. The MSB region corresponds to the sign bits and consequently, the signal and transition probabilities of these bits are far from random. $\rho > 0$ corresponds to a lower activity for positively correlated signals, while $\rho < 0$ corresponds to a higher activity for negatively correlated signals. The MSB region starts from the break point BP1. The region between BP0 and BP1 can be modeled by linear interpolation. BP0 and BP1 can be determined from the word-level statistics [37].
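The behavior described above can be reproduced with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions, not the procedure of [36, 37]: it generates a first-order autoregressive Gaussian sequence with a chosen lag-one correlation, rounds it to 16-bit two's complement words, and counts the 0 to 1 transitions of every bit. The word length, standard deviation, sample count and seed are arbitrary choices.

```python
import math
import random

def bit_activity(rho, n_bits=16, sigma=1000.0, samples=20000, seed=1):
    """Estimate P(0->1) for each bit of a correlated two's complement stream."""
    rng = random.Random(seed)
    mask = (1 << n_bits) - 1
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1

    def to_word(value):
        # Round, saturate, and map to an unsigned two's complement pattern.
        return max(lo, min(hi, round(value))) & mask

    x = rng.gauss(0.0, sigma)
    prev = to_word(x)
    rises = [0] * n_bits
    for _ in range(samples):
        # AR(1) process: keeps variance sigma^2 and lag-one correlation rho.
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, sigma)
        cur = to_word(x)
        rising = ~prev & cur  # bits that were 0 and are now 1
        for b in range(n_bits):
            rises[b] += (rising >> b) & 1
        prev = cur
    return [r / samples for r in rises]

act_white = bit_activity(0.0)
act_pos = bit_activity(0.99)
act_neg = bit_activity(-0.99)
for name, act in (("rho=0", act_white), ("rho=0.99", act_pos), ("rho=-0.99", act_neg)):
    print(name, "LSB", round(act[0], 3), "MSB", round(act[-1], 3))
```

In this experiment the LSB activity stays near 0.25 for every correlation value, while the sign bit activity falls well below 0.25 for strong positive correlation and rises toward 0.5 for strong negative correlation.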

The power estimation of the architecture modules is based on a black-box technique of the switched capacitance. Typical modules are adders, multipliers, shifters, RAMs, ROMs, etc. The power dissipation is modeled for each module by

$$P = C V_{DD}^2 f \tag{8.23}$$

where the switched capacitance $C$ is related to the complexity and the activity of the module. For example, for an $n$-bit ripple-carry subtractor, the switched capacitance is modeled by

$$C = C_{eff}\, n \tag{8.24}$$

where $C_{eff}$ is a capacitive coefficient (in fF/bit) determined from the DBT model. $C_{eff}$ can be a single coefficient for the UWN case. The DBT model employs several coefficients for $C_{eff}$, which reflect the data representation and signal statistics. For the case of the subtractor, for example, a table of $C_{eff}$ is generated as a function of all possible data transitions, i.e., sign bit transitions and random LSB transitions.

To extract the capacitive coefficients of each module, the library should be characterized. This operation is performed once for each library. The process of extraction consists of several steps:

Pattern generation. Input patterns to a module are generated based on the DBT data model. Both random (UWN) and sign data streams should be used. The input patterns containing the UWN component must be simulated for several cycles. This allows convergence of the average capacitance.

Simulation. The generated patterns are fed to a simulator (such as a circuit simulator) from which the switched capacitances are extracted.

Capacitive coefficient extraction. The simulation step produces the average effective switched capacitances for the entire series of applied input transitions such as U → U, S → S, etc. The capacitive coefficients are extracted from the effective switched capacitances and the complexity parameters.
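The last step above, together with Equations (8.23) and (8.24), can be sketched as follows. The simulator output is replaced by hypothetical per-cycle switched capacitance values, the transition class labels and function names are illustrative, and only the per-bit normalization for a linear-complexity module is shown. How the coefficients of the different classes are combined for a mixed data stream is not covered by this sketch.

```python
from statistics import mean

def extract_coefficients(switched_caps, n_bits):
    """Average switched capacitance per transition class, normalized per bit."""
    return {cls: mean(caps) / n_bits for cls, caps in switched_caps.items()}

def module_power(c_eff, n_bits, vdd, freq):
    """P = C * VDD^2 * f with C = C_eff * n."""
    return c_eff * n_bits * vdd ** 2 * freq

# Hypothetical simulator results (farads per cycle) for an 8-bit subtractor.
# "UU" holds random-to-random input transitions; "SS" holds sign-to-sign ones.
simulated = {
    "UU": [410e-15, 390e-15, 400e-15],
    "SS": [160e-15, 150e-15, 170e-15],
}

coeffs = extract_coefficients(simulated, n_bits=8)

# UWN-only estimate for a 16-bit instance at 3.3 V and 20 MHz.
p_uwn = module_power(coeffs["UU"], 16, 3.3, 20e6)
print(coeffs, p_uwn)
```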

Based on this methodology, a power analysis tool at the architectural level has been developed [36].

U and S mean the UWN and sign types of the input bits, respectively.

8.5.4 Behavioral-Level Power Estimation

A behavioral representation describes the function of a system versus a set of inputs. The behavior can be specified, for example, by algorithms (in Verilog, VHDL, etc.) or by boolean functions. The power estimation at the behavioral level relates the consumed energy to the execution of an algorithm. Decisions at the system and behavioral levels can influence the final power dissipation of the circuit by several orders of magnitude.

One approach for power estimation at the behavioral level has been proposed in [38]. It is based on the combination of analytical and stochastic power models. In this work, a class of applications such as real-time DSPs is considered for the power estimator. In the behavioral context, the power consumed by a hardware resource is given by

$$P = N_a C V_{DD}^2 f \tag{8.25}$$

where $N_a$ is the number of accesses to the resource over the period of computation, $C$ is the average capacitance switched per access and $f$ is the computation frequency.
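Equation (8.25) can be applied resource by resource once the access counts are known. The sketch below is a minimal illustration: the resource names, access counts, capacitances, supply voltage and frequency are hypothetical, whereas in [38] the access counts come from the CDFG of the design.

```python
def resource_power(n_accesses, c_per_access, vdd, freq):
    """P = N_a * C * VDD^2 * f for one hardware resource."""
    return n_accesses * c_per_access * vdd ** 2 * freq

def total_power(resources, vdd, freq):
    """Sum of the per-resource estimates."""
    return sum(resource_power(n, c, vdd, freq) for n, c in resources.values())

# Hypothetical resources: (accesses per computation period, farads per access).
resources = {
    "adder": (4, 2e-12),
    "multiplier": (2, 10e-12),
    "register": (10, 0.5e-12),
}

# Hypothetical 3 V supply and a 1 MHz computation (sample) frequency.
p_total = total_power(resources, vdd=3.0, freq=1e6)
print(p_total)  # about 0.3 mW
```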

In [38] the power of some hardware resources, such as execution units, registers, etc., is analytically modeled (using Equation (8.25)) from the Control/Data Flow Graph (CDFG) which is used to represent the design. The average capacitance switched per access for a particular hardware resource is estimated from the white noise data model. The power consumed by hardware resources such as controllers, interconnects, and the clock network is difficult to estimate. Statistics from a large number of realized chips are used to estimate the switched capacitance of these hardware resources.

8.6 CHAPTER SUMMARY

Low dynamic power techniques at several levels of abstraction have been presented. Algorithmic and architectural decisions can influence the power dissipation of a circuit by orders of magnitude. Therefore, CAD tools that help the designer to analyze the power of the circuit at these levels are needed. At lower levels of the design, the power reduction techniques offer some savings, but less than the ones expected at higher levels. Several power estimation tools have been discussed at the different levels of the design. Keep in mind that circuit simulators provide high accuracy for power analysis and take into account all power components.

REFERENCES

[1] K-Y. Chao and D. F. Wong, "Low Power Considerations in Floorplan Design," Proc. of the International Workshop on Low Power Design, pp. 45-50, April 1994.

[2] H. Vaishnav and M. Pedram, "PCUBE: A Performance Driven Placement Algorithm for Low Power Designs," Proc. of the EURO-DAC'93, pp. 72-77, September 1993.

[3] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On Average Power Dissipation and Random Pattern Testability of CMOS Combinational Logic Networks," Proc. of the International Conference on Computer-Aided Design, pp. 402-407, November 1992.

[4] K. Keutzer, "The Impact of CAD on the Design of Low Power Digital Circuits," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 42-45, October 1994.

[5] C-Y. Tsui, M. Pedram, and A. M. Despain, "Technology Decomposition and Mapping Targeting Low Power Dissipation," 30th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 68-73, June 1993.

[6] R. Murgai, R. K. Brayton, and A. Sangiovanni-Vincentelli, "Decomposition of Logic Functions for Minimum Transition Activity," Proc. of the International Workshop on Low Power Design, pp. 33-38, April 1994.

[7] V. Tiwari, P. Ashar, and S. Malik, "Technology Mapping for Low Power," 30th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 74-79, June 1993.

[8] K. Scott and K. Keutzer, "Improving Cell Libraries for Synthesis," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 128-131, May 1994.

[9] C. Lemonds and S. Mahant Shetti, "A Low Power 16 by 16 Multiplier using Transition Reduction Circuitry," Proc. of the International Workshop on Low Power Design, pp. 139-142, April 1994.

[10] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-Power CMOS Digital Design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.

[11] U. Ko, P. T. Balsara, and W. Lee, "A Self-timed Method to Minimize Spurious Transitions in Low Power CMOS Circuits," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 62-63, October 1994.

[12] R. I. Bahar, H. Cho, G. D. Hachtel, E. Macii, and F. Somenzi, "An Application of ADD-Based Timing Analysis to Combinational Low Power Re-Synthesis," Proc. of the International Workshop on Low Power Design, pp. 139-142, April 1994.

[13] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-Based Sequential Logic Optimization for Low Power," IEEE Transactions on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 426-436, December 1994.

[14] A. Gersho and R. Gray, "Vector Quantization and Signal Compression," Kluwer Academic Publishers, MA, 1992.

[15] D. B. Lidsky and J. M. Rabaey, "Low-Power Design of Memory Intensive Functions," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 16-17, October 1994.

[16] A. P. Chandrakasan, A. Burstein, and R. W. Brodersen, "A Low-Power Chipset for a Portable Multimedia I/O Terminal," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1415-1428, December 1994.

[17] J. Schutz, "A 3.3 V 0.6 μm BiCMOS Superscalar Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 202-203, February 1994.

[18] N. K. Yeung, Y-H. Sutu, T. Y-F. Su, E. T. Pat, C-C. Chao, S. Akki, D. D. Yau, and R. Lodenquai, "The Design of a 55SPECint92 RISC Processor under 2W," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 206-207, February 1994.

[19] D. Pham, et al., "A 3.0W 75SPECint92 85SPECfp92 Superscalar RISC," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 212-213, February 1994.

[20] G. Gerosa, et al., "A 2.2 W 80 MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, December 1994.


[21] S. Gary, C. Dietz, J. Eno, G. Gerosa, S. Park, and H. Sanchez, "The PowerPC 603 Microprocessor: A Low-Power Design for Portable Applications," Proc. of COMPCON'94, Tech. Dig., pp. 307-315, February 1994.

[22] R. K. Kolagotla, S-S. Yu, and J. F. JaJa, "VLSI Implementation of a Tree Searched Vector Quantizer," IEEE Transactions on Signal Processing, vol. 41, no. 2, pp. 901-905, February 1993.

[23] C-L. Su, C-Y. Tsui, and A. M. Despain, "Low Power Architecture Design and Compilation Techniques for High-Performance Processors," Proceedings of COMPCON'94, Tech. Dig., pp. 489-498, February 1994.

[24] A-C. Deng, "Power Analysis for CMOS/BiCMOS Circuits," Proc. of the International Workshop on Low Power Design, pp. 3-8, April 1994.

[25] C. M. Huizer, "Power Dissipation Analysis of CMOS VLSI Circuits by Means of Switch-Level Simulation," Proc. of the European Solid-State Circuits Conference, pp. 61-64, 1990.

[26] M. A. Cirit, "Estimating Dynamic Power Consumption of CMOS Circuits," IEEE International Conference on Computer Aided Design, pp. 534-537, November 1987.

[27] F. Najm, I. Hajj, and P. Yang, "An Extension of Probabilistic Simulation for Reliability Analysis of CMOS VLSI Circuits," 28th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 644-649, June 1991.

[28] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits," 29th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 253-259, June 1992.

[29] F. N. Najm, "A Survey of Power Estimation Techniques in VLSI Circuits," IEEE Transactions on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 446-455, December 1994.

[30] R. E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation," IEEE Transactions on Computer-Aided Design, pp. 677-691, August 1986.

[31] B. J. George, G. Yeap, M. G. Wloka, S. C. Tyler, and D. Gossain, "Power Analysis for Semi-Custom Design," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 249-252, 1994.


[32] B. J. George, G. Yeap, M. G. Wloka, S. C. Tyler, and D. Gossain, "Power Analysis and Characterization for Semi-Custom Design," Proc. of the International Workshop on Low Power Design, pp. 215-218, April 1994.

[33] D. Liu and C. Svensson, "Power Consumption Estimation in CMOS VLSI Chips," IEEE Journal of Solid-State Circuits, vol. 29, no. 6, pp. 663-670, June 1994.

[34] H. B. Bakoglu, "Circuits, Interconnects, and Packaging for VLSI," Addison-Wesley, Reading, MA, 1990.

[35] S. R. Powell and P. M. Chau, "Estimating Power Dissipation of VLSI Signal Processing Chips: The PFA Technique," VLSI Signal Processing IV, pp. 250-259, 1990.

[36] P. E. Landman and J. M. Rabaey, "Power Estimation for High Level Synthesis," EDAC-EUROASIC, Paris, France, pp. 361-366, February 1993.

[37] P. E. Landman and J. M. Rabaey, "Black-Box Capacitance Models for Architectural Power Analysis," Proceedings of the International Workshop on Low Power Design, Napa, CA, pp. 165-170, April 1994.

[38] R. Mehra and J. Rabaey, "Behavioral Level Power Estimation and Exploration," Proceedings of the International Workshop on Low Power Design, Napa, CA, pp. 197-202, April 1994.

