12
554 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013 Constant Delay Logic Style Pierce Chuang, Student Member, IEEE, David Li, Student Member, IEEE, and Manoj Sachdev, Fellow, IEEE Abstract—A constant delay (CD) logic style is proposed in this paper, targeting at full-custom high-speed applications. The CD characteristic of this logic style regardless of the logic type makes it suitable in implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage is ready. This feature offers performance advantage over static and dynamic domino logic styles in a single-cycle multistage circuit block. Several design considerations including timing window width adjustment and clock distribution are discussed. Using 65-nm general-purpose CMOS technology, the proposed logic demonstrates an average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different logic gates. Simulation results of 8-bit ripple carry adders show that CD logic is 39% and 23% faster than the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64% (22%) energy-delay product (EDP) reduction from static logic at 100% (10%) data activity in 32-bit carry lookahead adders. For 8-bit Wallace tree multiplier, CD logic achieves a similar speedup with at least 50% EDP reduction across all data activities. Index Terms— Adder, constant delay (CD), feedthrough, high-performance logic style, pre-evaluated. I. I NTRODUCTION H IGH-PERFORMANCE energy-efficient logic style has always been a popular research topic in the field of VLSI circuits because of the continuous demands of ever- increasing circuit operating frequency. The invention of the dynamic domino logic [Fig. 1(a)] in the 1980s is one of the answers to this request, as it allows designers to implement high-performance circuit blocks, i.e., arithmetic logic units, at an operating frequency that traditional static and pass transistor CMOS logic styles find difficult to achieve [1]. However, the performance enhancement comes with several costs, including a reduced noise margin, a problem of charge-sharing, and higher power dissipation due to a higher data activity. Several variations of the dynamic domino logic, namely NP domino (NORA domino) [2], zipper domino [3], and data-driven dynamic logic ( D 3 L) [4], [5], have been proposed but they are never widespread in the VLSI industry [6], [7]. Compound domino logic (CDL), where dynamic and static gates alternating between each other, has become the most Manuscript received May 9, 2011; revised October 13, 2011; accepted February 16, 2012. Date of publication March 13, 2012; date of current version February 20, 2013. P. Chuang and M. Sachdev are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L5Y9, Canada (e-mail: [email protected]; [email protected]). D. Li was with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L5Y9, Canada. He is now with Qual- comm Inc., San Diego, CA 92121 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2012.2189423 popular logic style in high-performance circuit blocks, i.e., 64-bit adder, in modern CPUs [8]–[11]. In this design, the output inverter is replaced with a more complex inverting static gate, i.e., NAND, such that the monotonicity require- ment is satisfied while conducting complex logic operations without wasting the one inverter delay [12]. Moreover, all the dynamic stages except the first stage can be footless in CDL. This implementation, however, comes at the expense of: 1) increased power consumption due to the possible direct path current during the precharge period; and 2) a reduced noise margin as a result of unprotected dynamic domino logic’s outputs [6]. A significant research effort has been dedicated to exploring new logic styles that go beyond dynamic domino logic and CDL. In particular, source-coupled logic (SCL) [13] has shown superior performances that are difficult to achieve by any other logic styles. However, it suffers from high power dissipation due to a constant current draw, and its differential nature requires complementary signals. Pseudo-nMOS logic, which uses a single pull-up pMOS transistor, provides both high speed and low transistor count at the expense of high static power consumption as well as reduced output voltage swing. Output prediction logic (OPL) [14] has also shown superior performance in high-speed adders [15]. Nevertheless, OPL requires the generation and distribution of multiphase clock signals with small timing separations and low skews, which are difficult to achieve. While numerous high-speed logic styles have been proposed, dynamic and CDL still remain the most attractive choices when performance is the primary concern. In recent years, a new way of logic operation, also known as feedthrough logic (FTL) [16], [17], has been proposed, which has demonstrated its high-performance capability. Consider a dynamic domino logic [Fig. 1(a)], the critical path consists of nMOS logic transistors. In FTL however, the roles of the clock and logic transistors are interchanged (Section II-A) and the clock transistor is now the critical path. The first generation of FTL exhibits many shortcomings, including excessive power dissipation, and reduced noise margin. To mitigate these prob- lems, we propose a new high-performance logic, which we call “constant delay” (CD) logic. CD logic provides a local window technique and a self-reset circuit which enable robust logic operation with minimized power consumption while maintain- ing FTL’s speed advantage. The most distinct characteristic of CD logic from previously proposed logic styles is that the delay is, on a first-order approximation, not affected by the logic expression. 1 Unlike SCL, CD logic does not require 1 The delay of CD logic is still a function of logic expression when all effects are considered, since a more complicated logic expression implies a larger capacitive load. 1063–8210/$31.00 © 2012 IEEE

Constant Delay Logic Style

Embed Size (px)

DESCRIPTION

For more projects on ARM9 ARM11 please contact us @ www.nanocdac.com

Citation preview

Page 1: Constant Delay Logic Style

554 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013

Constant Delay Logic StylePierce Chuang, Student Member, IEEE, David Li, Student Member, IEEE, and Manoj Sachdev, Fellow, IEEE

Abstract— A constant delay (CD) logic style is proposed inthis paper, targeting at full-custom high-speed applications. TheCD characteristic of this logic style regardless of the logic typemakes it suitable in implementing complicated logic expressionssuch as addition. CD logic exhibits a unique characteristic wherethe output is pre-evaluated before the inputs from the precedingstage is ready. This feature offers performance advantage overstatic and dynamic domino logic styles in a single-cycle multistagecircuit block. Several design considerations including timingwindow width adjustment and clock distribution are discussed.Using 65-nm general-purpose CMOS technology, the proposedlogic demonstrates an average speedup of 94% and 56% overstatic and dynamic domino logic, respectively, in five differentlogic gates. Simulation results of 8-bit ripple carry adders showthat CD logic is 39% and 23% faster than the static anddynamic-based adders, respectively. CD logic also demonstrates39% speedup and 64% (22%) energy-delay product (EDP)reduction from static logic at 100% (10%) data activity in32-bit carry lookahead adders. For 8-bit Wallace tree multiplier,CD logic achieves a similar speedup with at least 50% EDPreduction across all data activities.

Index Terms— Adder, constant delay (CD), feedthrough,high-performance logic style, pre-evaluated.

I. INTRODUCTION

H IGH-PERFORMANCE energy-efficient logic style hasalways been a popular research topic in the field of

VLSI circuits because of the continuous demands of ever-increasing circuit operating frequency. The invention of thedynamic domino logic [Fig. 1(a)] in the 1980s is one of theanswers to this request, as it allows designers to implementhigh-performance circuit blocks, i.e., arithmetic logic units, atan operating frequency that traditional static and pass transistorCMOS logic styles find difficult to achieve [1]. However, theperformance enhancement comes with several costs, includinga reduced noise margin, a problem of charge-sharing, andhigher power dissipation due to a higher data activity. Severalvariations of the dynamic domino logic, namely NP domino(NORA domino) [2], zipper domino [3], and data-drivendynamic logic (D3L) [4], [5], have been proposed but theyare never widespread in the VLSI industry [6], [7].

Compound domino logic (CDL), where dynamic and staticgates alternating between each other, has become the most

Manuscript received May 9, 2011; revised October 13, 2011; acceptedFebruary 16, 2012. Date of publication March 13, 2012; date of current versionFebruary 20, 2013.

P. Chuang and M. Sachdev are with the Department of Electrical andComputer Engineering, University of Waterloo, Waterloo, ON N2L5Y9,Canada (e-mail: [email protected]; [email protected]).

D. Li was with the Department of Electrical and Computer Engineering,University of Waterloo, Waterloo, ON N2L5Y9, Canada. He is now with Qual-comm Inc., San Diego, CA 92121 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2012.2189423

popular logic style in high-performance circuit blocks, i.e.,64-bit adder, in modern CPUs [8]–[11]. In this design, theoutput inverter is replaced with a more complex invertingstatic gate, i.e., NAND, such that the monotonicity require-ment is satisfied while conducting complex logic operationswithout wasting the one inverter delay [12]. Moreover, allthe dynamic stages except the first stage can be footless inCDL. This implementation, however, comes at the expense of:1) increased power consumption due to the possible direct pathcurrent during the precharge period; and 2) a reduced noisemargin as a result of unprotected dynamic domino logic’soutputs [6].

A significant research effort has been dedicated to exploringnew logic styles that go beyond dynamic domino logic andCDL. In particular, source-coupled logic (SCL) [13] has shownsuperior performances that are difficult to achieve by any otherlogic styles. However, it suffers from high power dissipationdue to a constant current draw, and its differential naturerequires complementary signals. Pseudo-nMOS logic, whichuses a single pull-up pMOS transistor, provides both highspeed and low transistor count at the expense of high staticpower consumption as well as reduced output voltage swing.Output prediction logic (OPL) [14] has also shown superiorperformance in high-speed adders [15]. Nevertheless, OPLrequires the generation and distribution of multiphase clocksignals with small timing separations and low skews, which aredifficult to achieve. While numerous high-speed logic styleshave been proposed, dynamic and CDL still remain the mostattractive choices when performance is the primary concern.

In recent years, a new way of logic operation, also known asfeedthrough logic (FTL) [16], [17], has been proposed, whichhas demonstrated its high-performance capability. Consider adynamic domino logic [Fig. 1(a)], the critical path consists ofnMOS logic transistors. In FTL however, the roles of the clockand logic transistors are interchanged (Section II-A) and theclock transistor is now the critical path. The first generation ofFTL exhibits many shortcomings, including excessive powerdissipation, and reduced noise margin. To mitigate these prob-lems, we propose a new high-performance logic, which we call“constant delay” (CD) logic. CD logic provides a local windowtechnique and a self-reset circuit which enable robust logicoperation with minimized power consumption while maintain-ing FTL’s speed advantage. The most distinct characteristic ofCD logic from previously proposed logic styles is that thedelay is, on a first-order approximation, not affected by thelogic expression.1 Unlike SCL, CD logic does not require

1The delay of CD logic is still a function of logic expression when alleffects are considered, since a more complicated logic expression implies alarger capacitive load.

1063–8210/$31.00 © 2012 IEEE

Page 2: Constant Delay Logic Style

CHUANG et al.: CONSTANT DELAY LOGIC STYLE 555

CLK

CLK

PDNIN

Out

(a)

CLK

NMOSPull DownNetwork

Out

CLK

M1

M2IN

(b)

Fig. 1. Schematic of (a) dynamic domino logic with a footer transistor and(b) FTL.

complementary signals and can be easily integrated with staticand dynamic domino logics. Also, CD logic does not havethe problem of constant static power dissipation similar topseudo-nMOS. Furthermore, the clock timing requirement ofCD logic is not as stringent as OPL. CD logic can achieverobust operation with optimal performance as long as clocksignals arrive earlier than the input signals. In this paper, wewill demonstrate that: 1) CD logic outperforms other logicstyles with better energy efficiency and is particularly suitablefor high-performance digital blocks and 2) CD logic is robustunder extreme process and temperature variations.

The rest of this paper is organized as follows. Section IIreviews the history of FTL and introduces the proposed CDlogic. Section III discusses the design considerations of CDlogic and provides a first-order approximation of the unwantedglitch voltage. Section IV analyzes the impacts of windowwidth on robustness and compares CD logic with other logicstyles in different logic expressions. Section V demonstratesCD logic’s performance advantage in single-cycle multistagedigital applications such as 32-bit adders. Finally, Section VIconcludes with some final remarks.

II. EVOLUTION OF CD LOGIC

A. FTL Logic

FTL logic [Fig. 1(b)] in CMOS technology was first intro-duced in [16] and [17]. Its basic operation is as follows: whenCLK is high, the predischarge period begins and Out is pulleddown to GND through M2. When CLK becomes low, M1is on, M2 is off, and the gate enters the evaluation period.If inputs (IN) are logic “1,” Out enters the contention modewhere M1 and transistors in the nMOS pull-down network(PDN) are conducting current simultaneously. If PDN is off,then the output quickly rises to logic “1.” In this case, FTL’scritical path is always a single pMOS transistor.

Despite its performance advantage, FTL suffers fromreduced noise margin, excess direct path current, and nonzeronominal low output voltage, which are all caused by thecontention between M1 and nMOS PDN during the evaluationperiod. Furthermore, cascading multiple FTL stages togetherto perform complicated logic evaluations is not practical.Consider a chain of inverters implemented in FTL cascadedtogether and driven by the same clock, as shown in Fig. 2.When CLK is low, M1 of every stage turns on, and the outputof every stage begins to rise. This will result in false logic

575 600 625 650 675 700 725

1.0

.75

.5

.25

0

(V)

time(ps)

CLKN2 N4 N6 N8

N2 N4

CLK

N4CLK

M1

M2N3M3

N1 N3

CLK

Fig. 2. Simulated unwanted glitch at different logic depths in a chain ofinverters implemented with FTL.

evaluations at even numbered (i.e., 2, 4, 6, etc.,) stages sinceinitially there is no contention between M1 and nMOS PDNbecause all inputs to nMOS transistors are reset to logic “0”during the reset period.

B. CD Logic

To mitigate the above-mentioned problems, CD logic isproposed with a schematic shown in Fig. 3(a). Timing block(TB) creates an adjustable window period to reduce the staticpower dissipation. Logic Block (LB) helps to reduce theunwanted glitch and also makes cascading CD logic feasible.A buffer implemented in CD logic with schematics of TB andLB is shown in Fig. 3(b).

1) CD Logic Operation: Fig. 4 depicts the correspondingCD logic timing diagram and flowchart. For simplicity, weassume that IN come from dynamic domino logic gates. WhenCLK is high, CD logic predischarges both X and Y to GND.When CLK is low, CD logic enters the evaluation period andthree scenarios can take place: namely, the contention, C–Qdelay, and D–Q delay modes. The contention mode happenswhen CLK is low while IN remain at logic “1.” In this case, Xis at a nonzero voltage level which causes Out to experiencea temporary glitch. The duration of this glitch is determinedby the local window width, which is determined by the delaybetween CLK and CLK_d. When CLK_d becomes high, andif X remains low, then Y rises to logic “1,” and turns off M1.Thus the contention period is over, and the temporary glitch atOut is eliminated. C–Q delay mode takes places when IN makea transition from high to low before CLK becomes low. WhenCLK becomes low, X rises to logic “1” and Y remains at logic“0” for the entire evaluation cycle. The delay is measured bythe falling edge of both CLK and Out: hence the name C–Qdelay. D–Q delay mode utilizes the pre-evaluated characteristicof CD logic to enable high-performance operations. In thismode, CLK falls from high to low before IN transit, henceX initially rises to a nonzero voltage level. As soon as INbecome logic “0,” while Y is still low, then X quickly risesto logic “1.” A race condition exists in this case between Xand Y. If CLK_d rises much earlier than X and Y will go tologic “1,” turn off M1, and result in a false logic evaluation. IfCLK_d rises slightly slower than X, then Y will initially rise(thus slightly turns off M1) but eventually settle back to logic“0.” CD logic can still perform the correct logic operationin this case, however, its performance is degraded because ofM1’s reduced current drivability. Therefore, it is important to

Page 3: Constant Delay Logic Style

556 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013

Out

NMOSPull DownNetwork

X

Y

CLKM2

M1

IN

M0CLK

LogicBlock

TimingBlock

(a)

Out

Y

CLKM2

M1

CLK

Y

X

X

CLK

CLK

M3

M4

M5 M6

CLK_dM0

CLK2.8u

2.8u

IN

X

1uM7

NMOS Pull Down NetworkTiming Block

Logic Block

INV1 INV2 INV3

(b)

Fig. 3. CD logic (a) block diagram and (b) buffer.

CLK is low, evaluation periodbegins

PDN is on ?(IN = “1”)

Yes, direct path current betweenPMOS and PDN

X rises to a non-zero voltagelevel, Out experiences a

temporary glitch

C-Q Delay ModeNo X rises to logic “1”, Out

discharges to GND

Predischarge period

X rises to logic “1”, Outdischarges to GND

Contention Mode

PDN is on for the entireevaluation period ?

Yes, remain in contention mode

Y = logic “1” (window closes),eliminates temporary glitch

No, inputs transit to “0”while window is stilltransparent (Y = 0)

D-Q Delay Mode

Contention Mode existsfor the entire evaluation

period

*All inputs are initially at logic “1”,and conditionally transit to “0”prior or during the evaluation

period

CLK is high. X is predischarged to GND andOut is precharged to VDD

D-Q DelayC-Q DelayWindow Width

CLK

CLK_d

IN

X

Y

Out

Pre-evaluated

Contention

TemporaryGlitch

Fig. 4. Timing diagram and flowchart of the proposed CD logic.

TABLE I

SUMMARY OF CD LOGIC’S OPERATION

Mode Scenario Operation

Predischarge CLK is high X and Out are predischarged and precharged toGND and VDD, respectively.

Contention IN = “1” for the entire evaluation period. Direct path current flows from pMOS to PDN.X rises to a nonzero voltage level and Outexperiences a temporary glitch.

C-Q Delay IN goes to “0” before CLK transits to low. X rises to logic “1” and Out is discharged to VDD.Delay is measured from CLK to Out.

D-Q Delay IN goes to “0” after CLK transits to low (whilewindow is still open, i.e., Y is still “0”).

X initially enters contention mode and later rises tologic “1.” Delay is measured from IN to Out.

maintain a sufficient window width under process–voltage–temperature (PVT) variations. Table I presents a summary ofCD logic’s operations.

Compared to FTL, where the contention lasts for theentire evaluation period, TB effectively reduces CD logic’spower consumption during the contention mode. The localwindow technique in the proposed CD gate allows designers tocustomize the window width for different logic expressionsto achieve minimal power dissipation while not sacrificingthe performance. For instance, a multiple input NAND gatewill require a longer window width than a NOR gate becauseof the larger internal capacitance due to the stacked nMOStransistors. Another advantage of CD logic is that the internal

node (X) is always connected to either VDD or GND, thusmaking the robustness of CD logic comparable to static logic,except during the contention mode.

CD logic eliminates the problem of false logic evaluationassociated with cascaded FTL. Consider a cascaded CD logicsystem, in which the inputs to nMOS PDN are always at logic“1” when first entering the evaluation period, because X andOut are always predischarged and precharged to logic “0” and“1,” respectively. Therefore, when CLK is low, CD gates willalways first enter the contention mode and conditionally makea low-to-high transition depending on the inputs. This is notthe case for the first stage CD gate, however, as there is noguarantee that the inputs will always be at logic “1.” In other

Page 4: Constant Delay Logic Style

CHUANG et al.: CONSTANT DELAY LOGIC STYLE 557

VSSP1

N1N2

P2∆V1 VDD - ∆V2

VDD

Fig. 5. Simplified schematic of CD logic during contention mode.

words, designers need to ensure that the input signals to thefirst CD gate arrive earlier than the clock signal, i.e., operatein C–Q delay mode only.

III. CD LOGIC DESIGN CONSIDERATIONS

A. CD Logic Sizing

The sizing of INV1-3 and M3–M6 in Fig. 3(b) is closeto the minimum size so that they do not create a huge areaburden. The length of INV1-3 can be altered to provide therequired timing window duration based on designer’s choices.

1) CD Logic Versus Pseudo-nMOS: Both pseudo-nMOSand CD logic are ratioed circuits which rely on the correctpMOS to nMOS strength ratio to perform correct logic oper-ations. pMOS transistor width is often selected to be aboutone-fourth the strength of the nMOS PDN as a compromisebetween noise margin and speed in pseudo-nMOS [6]. On theother hand, CD logic always discharges X to GND when CLKis high, therefore, CD logic can be optimized for low-to-hightransition only. Hence, pMOS clock transistors in CD logiccan be upsized larger to provide more speedup, as long as theoutput glitch is maintained at an acceptable level.

B. Output Glitch

Fig. 5 depicts a simplified schematic of CD logic duringthe contention mode, where both transistors P1 and N1 are onsimultaneously and induce a glitch voltage �V1, which in turngenerates another smaller glitch �V2. By design, �V1 shouldbe small [i.e., less than the threshold voltage (Vt )]. Hence,P1 operates in the saturation region while N1 is in the linearregion. The current equation is given as

1

2μpCox

Wp1

L p1

(Vgsp1 − Vtp

)2

= μnCoxWn1

Ln1

[(Vgsn1 − Vtn

)Vdsn1 − (Vdsn1)

2

2

](1)

where μp and μn are the hole and electron mobility ofpMOS and nMOS transistors, respectively, Cox is the oxidecapacitance, W and L are the transistor width and length,respectively, and Vgs and Vds are the transistor gate-to-sourceand drain-to-source voltages, respectively. Assuming samelength devices are used, and μp ≈ 0.5 μn , rearranging (1)gives

Wp1

4Wn1

(VDD − Vtp

)2 = (VDD − Vtn) �V1 − (�V1)

2

2

. (2)

�V1 can be found by solving the quadratic equation

�V1 = VDD − Vtn

−√

(VDD − Vtn)2 − Wp1

2Wn1

(VDD − Vtp

)2. (3)

By Taylor expansion, the square root term can be approxi-mated in first order as

√N2 + d = N + d

2N− d2

8N3 + d3

16N5 · · · ≈ N + d

2N. (4)

Assuming Vtp ≈ Vtn, (3) can be approximated as

�V1 ≈ VDD − Vtn − (VDD − Vtn) + Wp1(VDD − Vtp

)2

4Wn1(VDD − Vtn)

≈ Wp1(VDD − Vtp

)

4Wn1. (5)

�V2 can also be found through a similar approach. ConsiderFig. 5 again transistor N2 operates in the subthreshold regionwhile P2 is working in the linear mode. Equating the twocurrent equations yields

Wn2

Ln2It e

Vgsn2−VtnηVT

(1 − e

−Vdsn2VT

)

= μpCoxWp2

L p2

[(Vgsp2 − Vtp

)Vdsp2 − (Vdsp2)

2

2

](6)

where [18], [19]

It = μoCox(VT )2e1.8, η = 1 + 3Tox

Wdm, VT = K T

q(7)

where μo is the zero bias mobility, η is the subthreshold swingcoefficient, Wdm is the maximum depletion layer width, VT

is the thermal voltage, K is the Boltzman constant, T is thetemperature in kelvin, and q is the electron charge. In the caseof nMOS transistors, μo is simply μn . Rearranging (6) gives

Wn2 It e�V1−Vtn

ηVT

μpCoxWp2= (

VDD − �V1 − Vtp)�V2 − (�V2)

2

2(8)

where e(−Vdsn2)/VT ≈ 0 since (VDD − �V2) >> VT . Solving�V2 then yields

�V2 = VDD − �V1 − Vtp

−√

(VDD − �V1 − Vtp

)2 − Ae�V1−Vtn

ηVT (9)

A = 2Wn2 It

μpCoxWp2≈ 4Wn2

Wp2(VT )2e1.8. (10)

Applying Taylor expansion

�V2 ≈ VDD − �V1 − Vtp

−(

(VDD − �V1 − Vtp

) − Ae�V1−Vtn

ηVT

2(VDD − �V1 − Vtp

))

.

(11)

Page 5: Constant Delay Logic Style

558 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013

1.6 1.65 1.7 1.75

1.03

1.0

.975

.95

.925

.9

.875

(V)

Time (ns)

N1N2

N5N4N3

Fig. 6. Simulated output glitch of a five-stage two-input AND gateimplemented with CD logic.

-40 -20 0 20 40 60 80 100 120 1400102030405060708090100

AND3 - Average GlitchAND3 -3σOR3 - Average GlitchOR3 -3σ

Voltage(mV)

Temperature (°C)

Fig. 7. Temporary glitch mean and standard deviation at the output of three-input CD, AND, and OR gate versus temperature in a Monte Carlo simulationwith 7500 iterations.

Finally

�V2 ≈ Ae�V1−Vtn

ηVT

2(VDD − �V1 − Vtp

) . (12)

Equations (5) and (12) provide several first-order designinsights for CD logic. For a given �V1, designers can quicklyestimate the required Wp1 to Wn1 ratio. Moreover, �V1 islinearly proportional to the shift of Vt and transistor width inthe presence of process variations. When �V1 is sufficientlysmall, �V2 is approximately zero. As �V1 increases, boththe numerator and denominator in 12 contribute to �V2’sexponential increase. In a multistage CD logic circuitry, �V1of each stage will slowly increase because of the reduced Vgs

of N1 due to �V2 from the preceding stage. This phenomenonis demonstrated in Fig. 6, where the glitch level aggravates as ittraverses through a series of two-input AND gate implementedwith CD logic. Equation (12) also suggests that �V2 is a strongfunction of temperature. Fig. 7 illustrates the mean and threestandard deviations (3σ ) of the temporary glitch at the outputof a three-input CD, AND, and OR gate versus temperature.Clearly, as the temperature increases from 20 °C to 120 °C, the3σ glitch level of a three-input AND gate raises from 55 mVto approximately 140 mV.

C. Power Consumption

Data activity measures how frequent signals toggle and isdefined as

data activity = # of signal transitions

# of signals × # of clock cycles. (13)

Static logic has an empirical α of 0.1–0.2 [6] and dynamicdomino logic has an activity factor of 0.5. While CD logic’sα is also 0.5, it always consumes power when it enters theevaluation period. During the evaluation period, CD logicalways dissipates power via either dynamic power dissipation(X goes to VDD and Out is discharged to GND) or direct pathcurrent (contention mode). While CD logic consumes morepower, we believe that CD logic is still an attractive choice ina high-performance full-custom design because: 1) CD logicis only intended to replace the critical path; and 2) powermanagement techniques such as clock gating [20], [21], wherethe clock connection to idle module is turned off (gated), willsignificantly reduce CD logic’s dynamic power consumption.

D. CD Logic Family

CD logic’s LB [Fig. 3(a)] can be modified such that theinverter is replaced by a static gate to achieve even higherperformance, since the inverter delay is not wasted. We referto such a variation as “compound CD logic” (CCD), analogousto the case of CDL of the dynamic domino logic. Anotherfamily of CD logic was proposed in [22], where the outputinverter is replaced by a dynamic domino logic. The analysisin [22] shows that a 64-bit parallel-prefix adder employingthis type of logic is superior to its CDL-based counterpart,however, it requires additional design considerations becauseof the degraded noise margin.

IV. CD LOGIC CHARACTERIZATION

Unless otherwise specified, all simulation runs in this paperare done in schematic level (transistor netlists), with extractedparasitic capacitance in the Cadence design environment using65-nm CMOS technology. All the CD logic gates are designedsuch that the worst case glitch level under 6σ deviation withnominal VDD (1.0 V) is less than 300 mV at 110 °C. Themeasured power consumption includes clock trees and databuffers, which are both sized to drive a fanout-of-4 (FO4) load.The outputs of all the logic gates are driving an identical 20 fFload. The window duration (width) is defined as the 50% pointof the falling edge of CLK to the 50% point of the rising edgeof node Y. The delay is measured at the 50% switching pointof either the CLK or data to the 50% switching point of thelatest output.

In this section, all logic transistors have a 1-μm effectivenMOS width. For CD logic, the pMOS CLK transistors’width is 2.4 μm. The transistor sizings are optimized primaryfor delay, because the main objective of this section is toexplore CD logic’s performance advantage. The clock and datafrequencies are set to 2 GHz.

A. Noise Margin Versus Window Width

Noise margin is defined as the dc noise level at the inputgenerating a false logic evaluation at the output of the samegate and can be computed based on the following formula:

Noise Margin = ∣∣Voriginal − Vnoise

∣∣ (14)

where Voriginal is the expected voltage level without any inputnoise interference, and Vnoise is the input dc noise voltage that

Page 6: Constant Delay Logic Style

CHUANG et al.: CONSTANT DELAY LOGIC STYLE 559

0 50 100 150 200 250 300100

200

300

400

500CD OR3 "0" NM

CD OR3 "1" NMCD AND3 "1" NM w keeperCD AND3 "1"-NM w/t keeper

DynamicOR3 NMN

oiseMargin(mV)

Window Duration (ps)

400600800

DynamicAND3 NM

CD AND3 "0" NM w keeperCD AND3 "0"-NM w/t keeper

NoiseMargin(mV)

Fig. 8. Simulated logic “1” and “0” noise margin versus window durationfor a three-input AND and OR gate.

AND2 AND3 OR2 OR3 AOI220.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

Static CD C-Q CD D-Q Dynamic

NormalizedDelay

Fig. 9. Normalized delay of five logic expressions implemented in static,dynamic, and CD logic.

causes the false logic evaluation. For CD logic, two types ofnoise margin are defined: logic “1” and “0” noise margins.Logic “1” noise margin refers to the input dc noise level thatcauses the CD logic to fail to remain in the contention mode.If IN [Fig. 3(b)], which is supposed to be at full VDD, is nowdegraded due to noise, then the glitch level at X may be toohigh such that Out is falsely discharged. In this case, the noisemargin can be calculated as 1V − Vin, where Vin is Vg of M7.Similarly, logic “0” noise margin refers to the input dc noiselevel that causes the CD logic to fail evaluating. In this case, ifan input which is supposed to be at GND is now much higherdue to noise, the contention between M1 and M7 will causeX to settle at an intermediate voltage instead of VDD. WhenCLK_d rises to VDD (window closes), Y will also be chargedup through M3 and M4, since M3 is on and M4 is partially onbecause X is not at VDD. If the voltage level at X is too low,then Y will be charged to VDD through positive feedback andX will be discharged to GND through M7, which is driven bythe noise source.

Fig. 8 shows the simulated worst case logic “1” and“0” noise margin for a three-input AND gate and OR gateimplemented with CD and dynamic domino logic. The CDlogic’s logic “0” noise margin is always much higher thanthe logic “1” noise margin, suggesting that CD logic is morerobust during the C–Q delay and the D–Q delay than thecontention mode. Moreover, as the window width prolongs,

AND2 AND3 OR2 OR3 AOI22

12345678121518212427

NormalizedAveragePower

atVariousDataActivity

Static Dynamic CD

10%

50%

100%

10%

50%

100%

10%

50%

100%

10%

50%

100%

10%

50%

100%

Fig. 10. Normalized average power of five logic expressions at various dataactivities implemented in static, dynamic, and CD logic.

logic “0” noise margin improves while logic “1” noise margindegrades for both logic types. Therefore, reducing the windowduration not only minimizes the power consumption but alsoimproves CD logic’s overall robustness. For noise-margin-sensitive applications, a minimum-size nMOS keeper (gateconnected to Out, drain connected to X) can be added toimprove its overall robustness. In this case, the minimum-size keeper improves logic “1” noise margin by approxi-mately 60 mV with virtually no degradation in logic “0”noise margin.

B. CD Logic Performance

Figs. 9 and 10 illustrate the normalized delay and aver-age power consumption of static, dynamic, and CD logic,respectively, in five logic expressions with various input dataactivities. The average power is calculated by summing up thepower consumption of every possible input vector and thendividing it by the number of input vector combinations.

CD logic demonstrates superior performance, especiallyfor complicated logic expressions, such as Y = AB + C D(AOI22), in the D-Q mode due to the pre-evaluated charac-teristic. This is demonstrated in Fig. 11, where CD logic isapproximately two times faster than dynamic domino logic.This is contributed by: 1) the pre-evaluated characteristic;and 2) the less number of transistors in the critical path(3N1P for dynamic, while only 2P1N for CD logic). On theother hand, CD logic’s performance is only approximatelythe same as or even worse than that of dynamic dominologic during the C–Q mode. Therefore, it is advantageousto implement CD logic in a single-cycle multistage datapathbecause then the pre-evaluated feature (D–Q delay) of CDlogic can be fully utilized. The power consumption of CDlogic at 50% data activity is at least 3 × and 5 × higher thanthat of static logic in AOI22 and the rest of logic expressions,respectively. This suggests that CD logic should be used onlyto replace the critical path in any circuit block, since it isnot energy efficient to implement any system with CD logiconly. Table II summarizes the total transistor width of static,dynamic, and CD logic. Despite CD logic’s additional tran-sistor overhead, the average area of CD logic is 13% smallerand 4.5% larger than that of static and dynamic domino logic,respectively.

Page 7: Constant Delay Logic Style

560 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013

CLK1

CLK1OUT1

B

A

3u

3u

3u

0.5u3u

0.5u

1.1 1.15 1.2 1.25 1.3 1.35

1.251.0.75.5.250

(V)

1.251.0.75.5.250

-.25

(V)

CLK1AB

OUT1M0(27.56ps, 0.0V)M0(27.56ps, 0.0V)M0(27.56ps, 0.0V)27.56ps

time (ns)

20fF

(a)

1.325 1.35 1.375 1.400 1.425 1.45

1.251.0.75.5.250

(V)

1.251.0.75.5.250

-.25

(V)

ACLK2

OUT2_

OUT2

13.27ps

time (ns)

1.251.0.75.5.250

-.25

(V)

B

Pre-evaluated

OUT2

Y

CLK2M2

M1

M7

CLK2

Y

X

X

CLK2

CLK2

M3

M4

M5 M6

CLK_dM0

CLK2

0.3uA

2.8u

2u

2.8u1u

1.5u

BM8

X (OUT2_ )

2u

20fF

(b)

Fig. 11. Schematic and timing waveform of (a) dynamic and (b) CD logic.

TABLE II

STATIC, DYNAMIC, AND CD LOGIC AREA COMPARISON

Total transistor width (μm) Number of transistors

Static Dynamic CD Static Dynamic CD

AND2 11 13.6 14.96 6 7 17

AND3 18 20.9 19.96 8 8 18

OR2 13 10.6 12.96 6 7 17

OR3 24 12.6 13.96 6 7 18

AOI22 27 19.6 18.96 10 9 19

Average 18.6 15.46 16.16 7.2 7.6 17.8

V. PERFORMANCE ANALYSIS

A. 8-bit Ripple Carry Adders (RCAs)

The simulation setup in this section is similar to that ofSection IV. Three 8-bit RCAs using static, dynamic, and CDlogic style are simulated to compare their performances. AnRCA with FTL on the critical path is also implemented,however, our analysis indicates that FTL-based RCA wouldgenerate false outputs at the later bits because of the falseevaluation phenomenon described earlier. NP-FTL (equiva-lent to NP-domino, where nMOS-FTL and pMOS-FTL alter-nate) is also difficult to realize because the output glitchis significant and easily exceeds 500 mV under processvariations.

The basic static full adder (FA) is implemented with 28 tran-sistors with sizing strongly in favor of Cout computation [6].The main purpose of this 8-bit RCA is to demonstrate CDlogic’s performance advantage and to discuss the design con-siderations that should be taken into account when using CDlogic. A more energy-efficient pass-transistor FA design [23]will be implemented in the subsequent analysis to provide amore realistic comparison.

Only the timing-critical carry generation is replaced withdynamic and CD logic, while noncritical sum computationremains static in all three RCAs. Ten-thousand random inputvectors are applied to RCAs to compute the average powerconsumption. The clock timing is designed in such a way thatall the CD logic gates except the first stage are operated in the

D–Q mode with a window duration of approximately 115 ps.2

Fig. 12(a) depicts the RCA block diagram and FA schematic.Fig. 12(b) shows the corresponding worst case timing diagramfor CD logic, which occurs when {Cin, A0 · · · A7, B0 · · · B7} ={0, 0 · · · 0, 1 · · · 1}.

1) Design Considerations in a Multistage System: Fig. 12(b)provides several insights into designing single-cycle multistageCD circuitries. When CD logic is to be used with other logicstyles and when the gate preceding the CD logic is not aprecharge type logic (i.e., dynamic domino logic), then theinputs (Data) can only make transitions when all CD logicgates are in the predischarge mode (i.e., CLK1–4 are high). CDlogic may suffer from additional power consumption due tothe possible direct path current during the predischarge mode.Consider a CD-based RCA with the worst case input vector,FA0-2’s CD logic carry circuitries enter predischarge modewhen CLK1 goes to high. The internal nodes before the outputinverter of all CD logic gates are all discharged to logic “0”by nMOS clock transistors, and the outputs (C1 to C3) arecharged to logic “1.” If C3 becomes logic “1” before FA3enters predischarge mode (i.e., CLK2 is still high), then adirect path takes place.3 To avoid this condition, it is necessaryfor Tdischarge ≥ Tdelta, where Tdischarge is the time that theCD logic takes to charge its output and Tdelta is the delaybetween two adjacent clock signals (CLK2 to CLK1, or CLK3to CLK2, etc.).

2) CD Logic Sizing Strategy: To guarantee a 6σ glitchlevel of 300 mV at 110 °C, the following sizing strategy isemployed.

1) An equally weighted variable is assigned to the widthof CD logic’s pMOS pull-up transistors.

2The window duration is a function of the logic expression, the number ofpreceding stages that are driving by the same phase clock signal, the maximumglitch level constraint, and the robustness of the overall system. Extensivesimulation results have indicated that a window duration of 115 ps providesexcellent delay and power performances while maintaining a sufficient timingmargin against PVT and transistors mismatch variations (Table VII).

3Consider the CD carry generation circuitry shown in Fig. 12(a), underworst case vector condition B = 1 and A = 0. If input C (Cin from thepreceding stage) rises to logic “1” (originally at logic “0”) before CLK ishigh, then both pMOS and nMOS transistors are on.

Page 8: Constant Delay Logic Style

CHUANG et al.: CONSTANT DELAY LOGIC STYLE 561

S

A

B

C

A

B

C 1

1

1

1

1

1A B C

A B C

111

111

1

1Cout

0.5

1 S

2

2A

B

C

BA 2 2

2 2

1 Cout

2.8

2.8CLK

TB

0.3CLK

Sum Generation

1

1A

B

C

BA 3 3

3

3CLK

1

21CLK KeeperCout

Cout

Cout

Carry Generation

CDDynamic

FA0...C1

S0

A0B0

FA1S1

A1B1

FA7C8

S7

A7B7Cin

Critical Path

CLK

CLK1 (FA0~2)CLK2 (FA3~4)

CLK3 (FA5~6) CLK4 (FA7)

CD logic clock generation * Window period≈ 115ps

* Clock and databuffer power isincluded

t ≈ 115ps

CLK1CLK1CLK4

1

1

1

1

A

B

A

BA B

C

C

BA 3 3

3

6

66

Cout1

2 Cout

Static

(a)

CLK1

CLK2

CLK3

CLK4

C1

C2

C3

C4

C5

C6

C7

S8

Data1 Cycle Time

Tdischarge

TdeltaCLK2 needs to arriveearlier than C3

(b)

Fig. 12. RCA (a) block diagram and (b) timing diagram implemented with CD logic.

TABLE III

RCA PERFORMANCE COMPARISON

Static Dynamic CD

Data activity 10% 50% 100% 10% 50% 100% 10% 50% 100%

Delay (ps) 369.4 (1.00) 292.6 (0.79) 224.4 (0.61)

Power (μW) 50.95 254.77 509.54 142.70 336.05 401 231.85 533.96 602.82

Power–delay product (PDP) (fJ) 18.8 94.1 188.2 41.8 98.3 117.3 52 119.8 135

Energy–delay product (EDP) (fJ·ps) 6944 34761 69521 12231 28763 34322 11669 26883 30294

2) The entire circuitry (i.e., 8-bit RCA) is then simu-lated under typical corner at 110 °C to determine theglitch level. Extensive simulation results reveal that, ifthis glitch level is approximately 65 mV, then the 6σglitch level will be less than 300 mV.

3) Iterative simulations are performed by sweeping thisvariable until the glitch level is around 65 mV.

The equally weighted scheme clearly may not be the opti-mal solution. However, different sizing schemes have beenexplored and simulation results indicate that no apparentperformance improvement is achieved compared to the sizingstrategy described above.

3) RCA Performance: Table III compares static, dynamic,and CD logic RCAs with various figures of merits at dif-ferent data activity factors. CD-based RCA is approximately39 and 23% faster than the static and dynamic counterparts,respectively. On the other hand, the power consumption of CDlogic ranges from 4.55 × to 1.18 × higher than that of staticlogic. In terms of the PDP, CD logic is 2.78 × more and 0.72 ×less than static logic at 10 and 100% data activity, respectively.CD logic provides a speed advantage that logic styles such asstatic and dynamic find difficult to reach. Therefore, CD logicis suitable in a system where performance is the most criticalfactor.

B. 32-bit Carry Lookahead Adder (CLA)

We implement 32-bit CLAs to further analyze CD logic’sperformance. The detailed operations of CLA are describedin [6] and the schematic is displayed in Fig. 13. The 32-bitCLA uses eight 4-bit FAs with dedicated circuitry to facilitatecarry generation. The energy-efficient FA used in this analysisutilizes pass transistor logic styles with only 24 transistors forsum generation [23]. For the carry generation, only the criticalpath is replaced with different logic style. The maximum fan-in is limited to four, except in the case of dynamic dominologic due to the footer transistor. In this case, the 4-bit criticalcarry generation path of CLA is

G3:0 = G3 + P3(G2 + P2(G1 + P1(G0))) (15)

where G and P are the generate (A · B) and propagate (A⊕ B)signals, respectively. CDL and CCD logic are implemented toreduce the number of fan-ins. One can utilize the inversionproperty and rearrange 15 to

G1:0 = G1 + P1G0, P3:2 = P3 P2

G3:2 = G3 + P3G2, G3:0 = G3:2(P3:2 + G1:0). (16)

Therefore, a maximum fan-in of two and three can be achievedwith CCD logic (Section III-D) and CDL, respectively. Thecritical pMOS transistors’ width of CCD logic is slightly

Page 9: Constant Delay Logic Style

562 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013

0.5

0.25

G31

2 G3:0

2P3

1G2 2P2

1.5G1 2P1

2G0Pseudo-NMOS

A0:3B0:3S0:3A4:7B4:7S4:7A8:11B8:11S8:11A12:15B12:15S12:15

A16:19B16:19S16:19A20:23B20:23S20:23A24:27B24:27S24:27A28:31B28:31S28:31

Cin

C3C7C11

C19C23

C31G15:0, P15:0

G31:28, P31:28 G27:0, P27:0

C27 C15

Critical Path

0.5

1.4

1.4CLK

TB

0.3CLK

t ≈ 115ps

G33

1 G3:0

2P3

1G2 2P2

1.5G1 2P1

2G0CD logic

1.2

1.2CLK

TB

0.3CLK

t ≈ 115ps

1P3

1P2

P3G2

G3:0

Compound CD(CCD) logic*

1

1CLK

CLK

G30.5

3 G3:0

2.5P3

1.5G2 2.5P2

2G1 2.5P1

2.5G0

2.5

0.3

0.3 CLK

Dynamic logicB

0.5

A

1BA

B

0.5

A

1BA

1

2P A

0.5

B

1AB

A

0.5

B

1AB

1

2P

0.5

C

1

C

0.5

C

1

S

P

P2

1

2

1

2

1

A

B

C

A

B

C

Sum Generation

1

2

0.5

1.2

1.2CLK

TB

0.3CLK

t ≈ 115ps

G1 1P1

1G0

G3+P3G2

3

3

44

4

1.5

G1+P1G0

Non-critical pathimplemented with staticlogic (not shown)

CCD logic replaces the inverter in the logic block with a complex logic gate to providebetter performance, similar to the idea of compound domino logic (CDL) over dynamic logic.*

Fig. 13. 32-bit CLA.

TABLE IV

32-BIT CLA PERFORMANCE COMPARISON

Data activity 100% 50% 25% 10%

Delay Power PDP EDP Power PDP EDP Power PDP EDP Power PDP EDP Worst case

(ps) (mW) (pJ) (pJ·ps) (mW) (pJ) (pJ·ps) (mW) (pJ) (pJ·ps) (mW) (pJ) (pJ·ps) leakage (μA)

Static 448 (1.00) 5.13 2.3 1030 1.94 0.87 390 0.99 0.45 199 0.42 0.19 85.1 32.9

Dynamic 316 (0.71) 5.27 1.67 526 2.22 0.7 222 1.32 0.42 132 0.8 0.25 79.8 32.1

CDL 287 (0.64) 5.29 1.52 437 2.21 0.63 182 1.33 0.38 110 0.81 0.23 66.7 32.7

Pseudo-nMOS 313 (0.70) 5.34 1.67 523 2.21 0.69 216 1.25 0.39 122 0.68 0.21 66.3 551.2

CD logic 272 (0.61) 5.02 1.37 371 2.31 0.63 171 1.43 0.39 106 0.90 0.25 66.8 33.5

CCD logic 239 (0.53) 5.09 1.22 291 2.34 0.56 134 1.46 0.35 84 0.94 0.22 53.5 33.4

smaller than that of CD logic to satisfy the 300-mV glitchconstraint because the pull-up path in the LB now consists oftwo pMOS transistors. Footless dynamic domino logic cannotbe applied in this analysis because not all the circuits areimplemented with dynamic domino logic. Therefore, some ofthe inputs to the dynamic domino logic in the critical pathcan come from noncritical static gates. In order to satisfythe monotonicity requirement and to avoid the possible directpath current, a footer transistor is required for all dynamicdomino logics. On the other hand, if the entire 32-bit CLAis implemented with dynamic domino logic, then the footlessscheme can be applied. However, simulation results indicatethat the power consumption in this case is much higher thanthat of the current setup, thus making it a less attractive design.

Table IV summarizes the simulation results for the 32-bitCLAs. Power consumption is calculated with 5000 randominput vectors. The performance enhancement of CD and CCD

logic is evident in this case, with 39 and 47% speedup overstatic design, respectively. Both CD and CCD logic are alsofaster than pseudo-nMOS, primarily because of the largereffective pMOS width. The worst case leakage of all thedesigns is comparable except the case of pseudo-nMOS, whichis caused by the contention between nMOS PDN and the weakpMOS pull-up transistor. In this case, pseudo-nMOS’s leakage(static power dissipation) is at least 15 × higher than the restof the designs.

Figs. 14–16 show the normalized power, PDP, and EDPof all the CLAs analyzed in this paper, respectively. At 10%(100%) α, the power consumption of CD logic is 2.1× (1.05×)higher than the static logic. Compared to the case of the8-bit RCA, the power consumption discrepancy between staticand CD logic at low data activity is less obvious, becauseCD logic only accounts for approximately 10% of the entirecircuitry. CCD logic achieves the lowest PDP at all α except

Page 10: Constant Delay Logic Style

CHUANG et al.: CONSTANT DELAY LOGIC STYLE 563

100% 50% 25% 10%0.80.91.01.11.21.31.41.51.61.71.81.92.02.12.22.3

NormalizedPower

Data Activity Factor

StaticDynamicCDLpseudoNMOSCD logicCCD logic

Fig. 14. Normalized power of 32-bit CLAs.

100% 50% 25% 10%0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5

NormalizedPDP

Data Activity Factor

StaticDynamicCDLpseudoNMOSCD logicCCD logic

Fig. 15. Normalized PDP of 32-bit CLAs.

100% 50% 25% 10%0.20.30.40.50.60.70.80.91.01.11.21.31.41.51.61.71.8

NormalizedEDP

Data Activity Factor

StaticDynamicCDLpseudoNMOSCD logicCCD logic

Fig. 16. Normalized EDP of 32-bit CLAs.

at 10%. While higher than CCD logic, the PDP of CD logic iscomparable to CDL and pseudo-nMOS and is lower than therest of the designs. At 25% α, EDP reduction of CD and CCDlogic from other designs is at least 4 and 24%, respectively.At 10% data activity, CCD logic achieves the lowest EDP withat least 19% improvement. Notice that the power consumptionof a CLA implemented with CD logic is lower than that of aCLA with dynamic domino logic at 100% data activity. Thisis because the width (1 μm) of CD logic’s pull-up pMOStransistors in this CLA is much smaller than that (2.4 μm,Section IV) of CD logic in a single-stage logic gate. Hencethe power consumption of dynamic domino logic is higherthan that of CD logic due to the higher internal capacitance

11 Bit Carry-bypass Adder

CLKDFF

CLKDFF

CLKDFF

2:1 Mux

4 Bit FA

2:1 Mux

S0S1S2S3S4S5S6S7S8S9S10S11S13S14S15

3 Bit FA

P8P7P6

P12P11P10P9

1 Bit FA

1 Bit FA

1 Bit FA

1 Bit FA

CriticalPath

WallaceTree

Critical path implementedwith various logic styles

Static logic only

S12

Fig. 17. 8-bit Wallace tree multiplier.

100% 50% 25% 10%0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5

NormalizedPower

Data Activity Factor

StaticDynamicpseudoNMOSCD logic

Fig. 18. Normalized power of 8-bit multipliers.

TABLE V

CD AND CCD LOGIC GLITCH IN 32-bit CLAS

CD logic CCD logic

Temp. Mean σ Mean + Mean σ Mean +(°C) (mV) (mV) 6σ (mV) (mV) (mV) 6σ (mV)

85 65.3 23.2 204.5 63.6 26.2 220.8

110 88.5 29.8 267.3 86.8 36.1 303.4

at 100% data activity. Similar reasoning can easily apply to aCLA with CCD logic.

Table V summarizes the mean and σ of CD and CCDlogic’s worst glitch level in a Monte Carlo simulation with2000 samples. The worst case glitch is calculated by summingup the mean and the extrapolated sixσ deviation.4 CCD logicexhibits higher worst case glitch level than CD logic at both85 °C and 110 °C, despite its already lower critical pMOSeffective width. Both designs are approximately within theglitch constraint set earlier, and demonstrate that they arestill able to function properly under extreme process andtemperature conditions.

4It is assumed that the glitch variation as a result of process variationsfollows Gaussian distribution.

Page 11: Constant Delay Logic Style

564 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 3, MARCH 2013

TABLE VI

8-bit MULTIPLIER PERFORMANCE COMPARISON

Data activity 100% 50% 25% 10%

Delay Power PDP EDP Power PDP EDP Power PDP EDP Power PDP EDP Worst case

(ps) (mW) (pJ) (pJ·ps) (mW) (pJ) (pJ·ps) (mW) (pJ) (pJ·ps) (mW) (pJ) (pJ·ps) leakage (μA)

Static 404 (1.00) 3.52 1.42 575 2.29 0.93 375 1.29 0.52 211 0.67 0.27 109 88.78

Dynamic 294 (0.73) 3.57 1.05 309 2.36 0.7 205 1.39 0.41 121 0.80 0.23 69 89.09

Pseudo-nMOS 292 (0.72) 3.82 1.11 325 2.56 0.75 218 1.57 0.46 134 0.96 0.28 82 619.6

CD logic 243 (0.60) 3.59 0.87 212 2.46 0.60 145 1.49 0.36 88 0.88 0.22 52 89.99

TABLE VII

PVT AND MONTE CARLO PERFORMANCE ANALYSIS OF THE CD AND CCD LOGIC-BASED DESIGNS

Corner FF FS SF SSMonte Carlo

Temperature (°C) 110 −30 110 −30 110 −30 110 −30

VDD (V) 1.1 0.9 1.1 0.9 1.1 0.9 1.1 0.9 1.1 0.9 1.1 0.9 1.1 0.9 1.1 0.9 mean σ

8-bit RCA (CD) (ps) 152 199 142 197 182 244 178 265 164 220 152 217 262 369 249 392 226 16.2

32-bit CLA (CD) (ps) 189 238 172 229 222 288 206 289 205 259 186 248 321 439 294 437 274 17.8

32-bit CLA (CCD) (ps) 157 200 147 199 190 250 185 263 169 218 156 214 280 388 266 405 240 18.2

8-bit multiplier (CD) (ps) 211 229 208 228 223 247 219 249 216 236 212 235 259 327 251 347 244 7

100% 50% 25% 10%0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

NormalizedPDP

Data Activity Factor

StaticDynamicpseudoNMOSCD logic

Fig. 19. Normalized PDP of 8-bit multipliers.

C. 8-bit Wallace Tree Multiplier

Single-cycle two-phase 8-bit Wallace tree multipliers areimplemented and analyzed. The first phase (CLK is high) is aWallace tree utilizing 3:2 compressors to reduce the number ofpartial products, and during the second phase final addition iscarried out by a 11-bit carry bypass adder, as shown in Fig. 17.Only the critical path of the final adder is implemented withvarious logic styles with the exception of multiplexors, whilethe rest of the circuits remain static. Simulation setups aresimilar to that of 32-bit CLAs.

Table VI summarizes the performance results of 8-bitmultipliers and Figs. 18–20 illustrate the normalized power,PDP, and EDP of the various multipliers under different dataactivities, respectively. CD logic achieves a similar speedup(40%) over static logic compared to 32-bit adders. At 25% α,CD logic consumes 16% more power, but is 30 and 58%more PDP- and EDP-efficient than static logic, respectively.In 8-bit multipliers, because the critical path is only a smallportion of the entire circuitry, CD logic has the lowest PDPand EDP values among all the logic styles across all dataactivities. CCD and CDL logic are not included in thisanalysis because they are not particularly suitable for this

100% 50% 25% 10%0.30.40.50.60.70.80.91.01.11.21.31.4

NormalizedEDP

Data Activity Factor

StaticDynamicpseudoNMOSCD logic

Fig. 20. Normalized EDP of 8-bit multipliers.

setup. This is because G and P signals are not generatedin this case, instead, direct inputs (A, B) similar to 8-bitRCA are used to compute the carry. Finally, Table VII showsthe PVT analysis and Monte Carlo simulation results (2000iterations, 27 °C) of the proposed CD and CCD logic-baseddesigns. All the designs are functional under extreme PVT andtransistors mismatches and variations, thereby demonstratingthe proposed CD logic’s robustness.

VI. CONCLUSION

A new high-performance logic style with CD characteris-tic and self-reset circuitry was proposed. The pre-evaluatedfeature of CD logic makes it particularly suitable in a circuitblock where a unique critical path exists and performanceis the primary concern. Performance analysis of 8-bit RCAsreveals that CD logic is 39 and 23% faster than static anddynamic domino logic, respectively. Simulation results of32-bit CLAs show similar speed advantage of CD logiccompared to static logic. In this setup, CCD logic achieveslowest PDP at all data activity except at 10%. Also, CCDlogic achieves the best EDP results, with 66% (37%) reductioncompared to static logic at 50% (10%) data activity. The 6σ

Page 12: Constant Delay Logic Style

CHUANG et al.: CONSTANT DELAY LOGIC STYLE 565

worst case glitch for CD and CCD logic at 110 °C are 220.8and 303.4 mV, respectively.

CD logic’s advantages in terms of delay and EDP were alsodemonstrated in 8-bit Wallace tree multipliers. Compared to32-bit adders, CD logic achieves a similar delay improvement,but has an even better EDP reduction, primarily because thefinal adder which makes up the critical path of the multiplieris a relatively small circuit block of the overall circuitry. At25% α, CD logic is 52, 25, and 37% more EDP-efficient thanstatic, dynamic, and pseudo-nMOS logic, respectively.

REFERENCES

[1] R. Zimmermann and W. Fichtner, “Low-power logic styles: CMOSversus pass-transistor logic,” IEEE J. Solid-State Circuits, vol. 32, no. 7,pp. 1079–1090, Jul. 1997.

[2] N. Goncalves and H. De Man, “NORA: A racefree dynamic CMOStechnique for pipelined logic structures,” IEEE J. Solid-State Circuits,vol. 18, no. 3, pp. 261–266, Jun. 1983.

[3] C. Lee and E. Szeto, “Zipper CMOS,” IEEE Circuits Syst. Mag., vol. 2,no. 3, pp. 10–16, May 1986.

[4] R. Rafati, S. Fakhraie, and K. Smith, “A 16-bit barrel-shifter imple-mented in data-driven dynamic logic (D3L),” IEEE Trans. Circuits Syst.I, Reg. Papers, vol. 53, no. 10, pp. 2194–2202, Oct. 2006.

[5] F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, and P. Corsonello, “Low-power split-path data-driven dynamic logic,” Circuits Dev. Syst. IET,vol. 3, no. 6, pp. 303–312, Dec. 2009.

[6] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and SystemsPerspective, 4th ed. Reading, MA: Addison Wesley, Mar. 2010.

[7] K. Bernstein, High Speed CMOS Design Styles, 1st ed. New York:Springer-Verlag, Aug. 1998.

[8] S. Mathew, R. K. Krishnamurthy, M. A. Anders, R. Rios, K. R. Mistry,and K. Soumyanath, “Sub-500-ps 64-b ALUs in 0.18-μm SOI/bulkCMOS: Design and scaling trends,” IEEE J. Solid-State Circuits, vol.36, no. 11, pp. 318–319, Nov. 2001.

[9] S. Mathew, M. Anders, R. Krishnamurthy, and S. Borkar, “A 4 GHz130 nm address generation unit with 32-bit sparse-tree adder core,” inVLSI Circuits Dig. Tech. Papers Symp., 2002, pp. 126–127.

[10] S. K. Mathew, M. A. Anders, B. Bloechel, T. N. Krishnamurthy, andS. Borkar, “A 4-GHz 300-mW 64-bit integer execution ALU with dualsupply voltages in 90-nm CMOS,” IEEE J. Solid-State Circuits, vol. 40,no. 1, pp. 44–51, Jan. 2005.

[11] S. Wijeratne, N. Siddaiah, S. Mathew, M. Anders, R. Krishnamurthy, J.Anderson, S. Hwang, M. Ernest, and M. Nardin, “A 9 GHz 65 nm intelpentium 4 processor integer execution core,” in IEEE Int. Solid-StateCircuits Conf. ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2006,pp. 353–365.

[12] I. Sutherland, R. F. Sproull, and D. Harris. (Feb. 1999). Log-ical Effort: Designing Fast CMOS Circuits [Online]. Available:http://amazon.com/o/ASIN/1558605576/

[13] S. Kiaei, S.-H. Chee, and D. Allstot, “CMOS source-coupled logicfor mixed-mode VLSI,” in Proc. IEEE Int. Circuits Syst. Symp., NewOrleans, LA, May 1990, pp. 1608–1611.

[14] L. McMurchie, S. Kio, G. Yee, T. Thorp, and C. Sechen, “Outputprediction logic: A high-performance CMOS design technique,” in Proc.Comput. Des. Int. Conf., Austin, TX, 2000, pp. 247–254.

[15] K. H. Chong, L. McMurchie, and C. Sechen, “A 64 b adder using self-calibrating differential output prediction logic,” in IEEE Int. Solid-StateCircuits Conf. Dig. Tech. Papers, San Francisco, CA, Feb. 2006, pp.1745–1754.

[16] V. Navarro-Botello, J. A. Montiel-Nelson, and S. Nooshabadi, “Lowpower arithmetic circuit in feedthrough dyanmic CMOS logic,” in Proc.IEEE Int. 49th Midw. Symp. Circuits Syst., Aug. 2006, pp. 709–712.

[17] V. Navarro-Botello, J. A. Montiel-Nelson, and S. Nooshabadi, “Analysisof high-performance fast feedthrough logic families in CMOS,” IEEETrans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 6, pp. 489–493, Jun.2007.

[18] Y. Taur and T. Ning, Fundamentals of Modern VLSI Devices. Cambridge,U.K.: Cambridge Univ. Press, 1998.

[19] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leak-age current mechanisms and leakage reduction techniques in deep-submicrometer cmos circuits,” Proc. IEEE, vol. 91, no. 2, pp. 305–327,Feb. 2003.

[20] N. Kurd, J. S. Barkarullah, R. O. Dizon, T. D. Fletcher, and P. D.Madland, “A multigigahertz clocking scheme for the Pentium(R) 4microprocessor,” IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1647–1653, Nov. 2001.

[21] S. Vangal, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla,H. Wilson, and A. Pangal, “5GHz 32b integer-execution core in 130nm dual-VT CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech.Papers, Nov. 2002, pp. 334–535.

[22] P. Chuang, D. Li, and M. Sachdev, “Design of a 64-bit low-energy high-performance adder using dynamic feedthrough logic,” in Proc. IEEE Int.Circuits Syst. Symp., May 2009, pp. 3038–3041.

[23] M. Aguirre-Hernandez and M. Linares-Aranda, “CMOS full-adders forenergy-efficient arithmetic applications,” IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 19, no. 99, pp. 1–5, Apr. 2010.

Pierce Chuang (S’08) received the B.A.Sc. (hons.)and M.A.Sc. degrees in electrical engineering fromthe University of Waterloo, Waterloo, ON, Canada,in 2008 and 2010, respectively, where he is currentlypursuing the Ph.D. degree in electrical engineering.

His current research interests include low-powerhigh-performance digital CMOS circuits.

David Li (S’05) received the B.A.Sc. (hons.),M.A.Sc., and Ph.D. degrees in electrical engineeringfrom the University of Waterloo, Waterloo, ON,Canada, in 2005, 2007, and 2011, respectively.

He joined Qualcomm Inc., San Diego, CA, in 2011to work on custom memory and digital designs.His current research interests include low-power,high-performance, and reliable digital designs withspecial focus on flip-flop metastability.

Dr. Li received the Best Paper Award at the Inter-national Symposium on Quality Electronic Design

in 2011.

Manoj Sachdev (F’11) received the B.E. degree(hons.) in electronics and communication engi-neering from the University of Roorkee, Roorkee,India, and the Ph.D. degree from Brunel University,Uxbridge, U.K., in 1984 and 1996, respectively.

He was with Semiconductor Complex Ltd.,Chandigarh, India, from 1984 to 1989, where hewas involved in the design of CMOS integratedcircuits. From 1989 to 1992, he was with the ASICDivision, SGS-Thomson, Agrate, Milan. In 1992,he joined Philips Research Laboratories, Eindhoven,

The Netherlands, where he worked on various aspects of VLSI testingand manufacturing. He joined the Electrical and Computer EngineeringDepartment, University of Waterloo, Waterloo, ON, Canada, as a Professor in1998. He serves as the Department Chair and holds the University of WaterlooResearch Chair. He has contributed five books and two book chapters, andhas authored over 170 technical articles published in conference proceedingsand journals. He holds more than 30 granted or pending U.S. patents onvarious aspects of VLSI circuit design, reliability, and testing. His currentresearch interests include low-power and high-performance digital circuitdesigns, mixed-signal circuit designs, and test and manufacturing issues ofintegrated circuits.

Prof. Sachdev was the recipient of the Best Paper Award at the EuropeanDesign and Test Conference in 1998, the Honorable Mentioned Award at theInternational Test Conference in 1999, the Best Panel Award at the VLSITest Symposium in 2004, and the Best Paper Award at the InternationalSymposium on Quality Electronic Design in 2011. He has served on severalconference committees. He was the Technical Program Co-Chair for theIEEE Mixed-Signal Test Workshop in 1999 and the Technical ProgramChair for the IEEE IDDQ Test Workshop in 1999 and 2000. From 2007to 2008, he was an Associate Editor for the IEEE TRANSACTIONS ON

VEHICULAR TECHNOLOGY. From 2008 to 2009, he was the Program Chairfor Microsystems and Nanoelectronics Research Conference, Ottawa, Canada.He serves on the Editorial Board of the Journal of Electronic Testing:Theory and Applications. He is the Chair of the Test, Debug, and ReliabilitySubcommittee of the IEEE Custom Integrated Circuits Conference.