
Fast Interconnect and Gate Timing Analysis for Performance Optimization


IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 12, DECEMBER 2006 1383

s-CDMFF-lay saves 16% power at 50% activity and saves 28% power at 25% activity. Note that the results are different to those in Table II, due to differences in circuit topologies (especially the extra inverter to produce CK1), and s-CDMFF-lay is optimized in terms of power and area instead of power and delay. The s-CDMFF-lay has about 34% increase in size and about 34% less in IRI. The spreads of power, delay, and IRI are better than those of TGFF-lay.

VII. CONCLUSION

A novel conditional data mapping methodology is proposed for eliminating redundant internal transitions of a flip-flop by mapping its inputs to a certain configuration. Using this methodology, we have created two CDMFFs, which have the best power-delay product among the state-of-the-art flip-flops. They have not only comparable small delays among high-performance flip-flops, but also have outstanding power reduction ability at various data activities. CDMFFs also have the best internal race immunities among pulse-triggered flip-flops.

ACKNOWLEDGMENT

The authors would like to thank T. Furuyama, S. Watanabe, T. Aikawa, H. Sasaki, Y. Unekawa, Y. Uchino, T. Yamashita, and F. Tachibana for their support and discussions.

REFERENCES

[1] M. Nogawa and Y. Ohtomo, “A data-transition look-ahead DFF circuitfor statistical reduction in power consumption,” IEEE J. Solid-StateCircuits, vol. 33, no. 5, pp. 702–706, May 1998.

[2] M. Hamada et al., “Flip-flop selection technique for power-delaytradeoff,” in ISSCC Dig. Tech. Papers, 1999, pp. 270–271.

[3] N. Nedovic and V. G. Oklobdzija, “Hybrid latch flip-flop with im-proved power efficiency,” in Proc. Symp. Integr. Circuits Syst. Des.,2000, pp. 211–215.

[4] B.-S. Kong, S.-S. Kim, and Y.-H. Jun, “Conditional-capture flip-flopfor statistical power reduction,” IEEE J. Solid-State Circuits, vol. 36,no. 8, pp. 1263–1271, Aug. 2001.

[5] P. Zhao, T. K. Darwish, and M. A. Bayoumi, “High-performance andlow-power conditional discharge flip-flop,” IEEE Trans. Very LargeScale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 477–484, May 2004.

[6] B. Nikolic et al., “Improved sense-amplifier-based flip-flop: Designand measurements,” IEEE J. Solid-State Circuits, vol. 35, no. 6, pp.876–883, Jun. 2000.

[7] F. Klass, “Semi-dynamic and dynamic flip-flops with embedded logic,”in Symp. VLSI Circuits Dig. Tech. Papers, 1998, pp. 108–109.

[8] H. Partovi et al., “Flow-through latch and edge-triggered flip-flop hy-brid elements,” in ISSCC Dig. Tech. Papers, 1996, pp. 138–139.

[9] J. Tschanz et al., “Comparative delay and energy of single edge-trig-gered and dual edge-triggered pulsed flip-flops for high-performancemicroprocessors,” in Int. Symp. Low Power Electron. Des. Tech. Dig.,2001, pp. 147–152.

[10] S. D. Naffziger, “The implementation of the itanium 2 microprocessor,”IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1448–1460, Nov. 2002.

[11] J. D. Warnock et al., “The circuit and physical design of the POWER4microprocessor,” IBM J. Res. Dev., vol. 46, pp. 27–51, Jan. 2002.

[12] G. Gerosa et al., “A 2.2 W 80 MHz superscalar RISC microprocessor,”IEEE J. Solid-State Circuits, vol. 29, no. 12, pp. 1440–1452, Dec. 1994.

[13] V. Stojanovic and V. G. Oklobdzija, “Comparative analysis of master-slave latches and flip-flops for high-performance and low-power sys-tems,” IEEE J. Solid-State Circuits, vol. 34, no. 4, pp. 536–548, Apr.1999.

[14] N. Nedovic, W. W. Walker, and V. G. Oklobdzija, “A test circuit formeasurement of clocked storage element characteristics,” IEEE J.Solid-State Circuits, vol. 39, no. 8, pp. 1294–1304, Aug. 2004.

[15] D. Markovic, B. Nikolic, and R. W. Brodersen, “Analysis and designof low-energy flip-flops,” in Int. Symp. Low Power Electron. Des. Tech.Dig., 2001, pp. 52–55.

Fast Interconnect and Gate Timing Analysis for Performance Optimization

Soroush Abbaspour, Massoud Pedram, Amir Ajami, and Chandramouli Kashyap

Abstract—Static timing analysis is a key step in the physical design optimization of VLSI designs. The lumped capacitance model for gate delay and the Elmore model for wire delay have been shown to be inadequate for wire-dominated designs. Using the effective capacitance model for the gate delay calculation and model-order reduction techniques for wire delay calculation is prohibitively expensive. In this paper, we present sufficiently accurate and highly efficient filtering algorithms for interconnect timing as well as gate timing analysis. The key idea is to partition the circuit into low and high complexity circuits, whereby low complexity circuits are handled with efficient algorithms, such as the total capacitance algorithm for gate delay and the Elmore metric for wire delay, and high complexity circuits are handled with sign-off algorithms. Experimental results on microprocessor designs show accuracies that are quite comparable with sign-off delay calculators, with more than 65% reduction in the computation times.

Index Terms—Asymptotic waveform, effective capacitance, Elmore delay, gate delay calculation, interconnect delay calculation, static timing analysis.

I. INTRODUCTION

As CMOS process technologies scale down towards nanometer regimes, the accuracy and efficiency of static timing analysis (STA) have become increasingly important to the successful timing closure of an integrated circuit design flow. Most STA tools break the analysis into two parts: 1) gate delay calculation and 2) interconnect or wire delay calculation. It is widely accepted that computing gate delays using a lumped total capacitance and computing the wire delay using the Elmore model are grossly inadequate for the wire dominated designs of today. To address this drawback, various model order reduction techniques such as AWE [2], PRIMA [9], etc., have been proposed to accurately model the interconnect delay. On the other hand, the load, as seen by the driving gate, is modeled by a reduced-order model such as the $\pi$-model [5]. Therefore, an “effective capacitance” technique was proposed [6] which provides a way to map the $\pi$-load to an equivalent capacitance (in the sense of gate propagation delay).

While these approaches exhibit good accuracy and are used at the sign-off level, they can be too computationally expensive to be used in the context of physical design optimization. Recognizing this shortcoming, there has been much research on deriving closed-form formulas or delay metrics for wire delay estimation [3], [4], [8], [12]. However, these delay metrics introduce considerable error into the STA results and are not reliable enough to be used for optimization. On the other hand, not much attention has been given to speeding up the gate delay calculation, which, as we show next, accounts for a significant portion of the overall STA runtime.

We measured the time spent in various parts of a commercial sign-off STA tool on many designs, including two 90-nm technology microprocessor designs, which we call Design#1 and Design#2. Table I presents important statistics for these two designs. It also reports the time that the tool spends on the “gate timing analysis,” the “interconnect timing analysis,” and the whole STA run. We find that, on average, about 60% of the CPU time of STA is spent on the gate timing analysis.

TABLE I. 90-nm MICROPROCESSOR DESIGN SPECIFICATIONS

TABLE II. DESIGN #2 FOM RESULTS

Manuscript received October 6, 2005; revised April 17, 2006.
S. Abbaspour is with the IBM Corporation, Hopewell Junction, NY 12533 USA (e-mail: [email protected]).
M. Pedram is with the Department of Electrical Engineering Systems, University of Southern California, Los Angeles, CA 90089 USA.
A. Ajami is with Magma Design Automation, Santa Clara, CA 95054 USA.
C. Kashyap is with the Intel Corporation, Hillsboro, OR 97124 USA.
Digital Object Identifier 10.1109/TVLSI.2006.887834

1063-8210/$20.00 © 2006 IEEE

To assess accuracy, a figure of merit (FOM) metric is used to measure how poor the distribution of negative slacks is (i.e., worst-negative slack at end points) in the design. “FOM integral” represents the summation of the slacks at all negative-slack endpoints in the design. This metric is chosen because it captures how many paths are timing critical and need to be fixed. In comparison, the worst slack gives a single number that indicates the worst negative slack of the design that needs to be fixed. “FOM Number” is the number of negative slack end points.
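The two FOM quantities defined above can be illustrated with a minimal Python sketch. The function name `fom_metrics` and the sample slack values are hypothetical and are not taken from the paper or from any STA tool; the sketch only assumes that one slack value is available per timing endpoint.

```python
from typing import Iterable, Tuple


def fom_metrics(endpoint_slacks: Iterable[float]) -> Tuple[float, int, float]:
    """Return (FOM integral, FOM number, worst slack) for a list of endpoint slacks.

    FOM integral: sum of the slacks of all negative-slack endpoints.
    FOM number:   count of negative-slack endpoints.
    Worst slack:  the single most negative (smallest) endpoint slack.
    """
    slacks = list(endpoint_slacks)
    negative = [s for s in slacks if s < 0.0]
    fom_integral = sum(negative)
    fom_number = len(negative)
    worst_slack = min(slacks) if slacks else 0.0
    return fom_integral, fom_number, worst_slack


# Hypothetical endpoint slacks (arbitrary time units), for illustration only.
integral, number, worst = fom_metrics([0.5, -1.0, -0.25, 2.0])
print(integral, number, worst)
```

Two designs with the same worst slack can differ widely in FOM integral and FOM number, which is why the text treats the FOM as the more informative accuracy measure.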

We applied different combinations of interconnect timing analysis algorithms (AWE or Elmore) and gate timing analysis algorithms (effective capacitance and lumped capacitance) on many designs, including both Design#1 and Design#2. As an example, we have provided the FOM results for Design#2 in Table II. It can be seen that although the Elmore metric is efficient, it can change the FOM results by orders of magnitude. In addition, $C_{\mathrm{total}}$ can change the FOM results by orders of magnitude with respect to the golden FOM results (i.e., using AWE for interconnect timing analysis and $C_{\mathrm{eff}}$ for gate timing analysis). Thus, it is important to have new interconnect and gate timing analysis algorithms that are capable of accurately and efficiently calculating interconnect and gate delay and slew along a path.

Our first contribution in this paper is to present a filtering technique for speeding up the interconnect timing analysis step in an STA tool, while maintaining a reasonable level of accuracy. As we will see later in this work, Elmore delay-based algorithms could be accurate for some cases of nets or interconnects and we do not need to use higher order moments-based algorithms for delay and slew calculation. For some other cases, we may need to use two moments for the interconnect delay and slew calculation, where a new efficient metric has been proposed. Finally, for other cases, we may need to use an AWE-based algorithm for interconnect delay and slew calculation.

Our second contribution in this paper is to present a filtering technique for speeding up the gate delay calculation step in an STA tool, while maintaining a reasonable level of accuracy. The filtering technique resorts to a necessary condition check to determine if $C_{\mathrm{total}}$ can be used for the gate delay and/or output slew calculations without introducing a significant inaccuracy.

The remainder of this paper is organized as follows. In Section II, we present the threshold-based filtering algorithm for fast interconnect timing analysis. In Section III, the fast gate timing analysis is presented. Section IV presents the concluding remarks.

II. FAST INTERCONNECT TIMING ANALYSIS

In this section, we focus on the interconnect timing analysis. Elmore [1] used the first moment of the impulse response transfer function and approximated the median (the desired delay) by the mean of the impulse response. It is well established that the Elmore delay metric can be off by orders of magnitude in some cases. To overcome this accuracy problem, different delay metrics have been proposed that use higher moments [3], [4], [8], [12]. These delay metrics use a fixed number of moments to find the delay and slew. However, Elmore-based delay is accurate enough for some nets, while for other nets delay metrics based on two or more moments should be used; therefore, none of the previous works gives results that are both accurate and efficient.

This section presents a threshold-based filtering algorithm (TFA) for propagation delay and output slew calculation of high-speed VLSI interconnects. The TFA partitions the circuit nets into three groups based on their top-level characteristics: one group of nets, called low complexity nets, lends itself to accurate delay calculation with the Elmore delay, whereas the second and third groups of nets, called medium and high complexity nets, demand more sophisticated and time-consuming delay calculations based on the first two or higher moments of the impulse response transfer function, respectively. The idea of dividing the circuit nets into different classes for the purpose of minimizing the computational workload of a delay calculation engine while providing reasonable accuracy for the computed delays is quite intuitive and straightforward. The key challenge, however, is in being able to do the examination and classification of the nets accurately. This is precisely what we accomplish in this section by our threshold-based filtering algorithm, as will be shown later.

The remainder of this section is organized as follows. In Section II-A, by using circuit theory, a new analytical closed-form equation for calculating the delay and output slew of an interconnect line under step and ramp inputs is presented. Section II-B uses these analytical equations as a signature function to sort the nets into simple and complex ones. Experimental results are reported in Section II-C.

A. Analysis of the Threshold-Based Filtering Algorithm

The ratio of the voltage of the output node $V_o(s)$ to the input voltage $V_i(s)$ for a linear time-invariant (LTI) system is called the voltage transfer function $H(s)$. For an RC tree, this ratio can be written as

$$H(s) = \frac{V_o(s)}{V_i(s)} = \sum_{k=0}^{\infty} m_k s^k \tag{1}$$

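where $m_i$ is called the $i$th moment of the voltage transfer function. If a unit ramp input with an $\alpha$-$\beta$% rise time of $T_{\mathrm{in}}(\alpha\text{-}\beta)$ is applied to such an RC segment, then the $\alpha$-$\beta$% output transition time can be written as [4], [7], [12]

$$T_{\mathrm{out}}(\alpha\text{-}\beta) = \sqrt{T_{\mathrm{in}}(\alpha\text{-}\beta)^2 + \left(RC \ln\frac{100-\alpha}{100-\beta}\right)^2}. \tag{2}$$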

Based on (2), if the ratio of the input slew to the corresponding RC value for two different RC circuits is the same, then the ratio of their output transition times to the RC values will be the same. Considering the RC value to be an indicator for the Elmore delay of a more general RC tree, this fact implies that Input_slew/Elmore is a key characteristic for the delay calculation and, interestingly, one of the most important factors when determining the degree of accuracy of an Elmore delay calculator. Therefore, for an RC tree, the output slew can be calculated as

$$T_{\mathrm{out}}(\alpha\text{-}\beta) \cong \sqrt{T_{\mathrm{in}}(\alpha\text{-}\beta)^2 + \left(\mathrm{Elmore} \times \ln\frac{100-\alpha}{100-\beta}\right)^2} \tag{3}$$

where Tout and Tin denote the transition times at the output and inputnodes of the RC tree and Elmore denotes the Elmore delay.
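As an illustration of (3), the following Python sketch evaluates the Elmore-based output slew for a given input slew. It assumes the root-sum-square form of (3); the function and parameter names are hypothetical, and the lower and upper transition percentages default to 10 and 90 (a 10%-90% slew).

```python
import math


def elmore_output_slew(t_in: float, elmore: float,
                       lo_pct: float = 10.0, hi_pct: float = 90.0) -> float:
    """Output transition time of an RC tree per (3).

    t_in   : input transition time, measured between lo_pct and hi_pct
    elmore : Elmore delay of the net (same time unit as t_in)
    """
    # Transition time the net would show under an ideal step input.
    step_term = elmore * math.log((100.0 - lo_pct) / (100.0 - hi_pct))
    # Root-sum-square combination of the input slew and the step term.
    return math.hypot(t_in, step_term)


# Two nets with the same Input_slew/Elmore ratio have the same T_out/Elmore ratio.
print(elmore_output_slew(2.0, 1.0) / 1.0)
print(elmore_output_slew(20.0, 10.0) / 10.0)
```

The two printed ratios coincide, which is the scaling property the text uses to argue that Input_slew/Elmore is the key characteristic of a net.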

Fig. 1. $\theta$ and $\varphi$ versus output transition percentage.

For an RC tree, considering only the first-order moment in delay calculation implies that the second-order moment is the square of the first moment, which is not always true due to the shielding effect of the wires. In general, this $m_2/m_1^2$ ratio varies from a number smaller than 1 to almost 50. Therefore, we need to consider the effect of higher moments. By considering the first two moments of the impulse response transfer function, we can approximate $H(s)$ by

$$\tilde{H}(s) = \frac{1}{1 - m_1 s + (m_1^2 - m_2)s^2} = 1 + m_1 s + m_2 s^2 + (2 m_1 m_2 - m_1^3)s^3 + \cdots. \tag{4}$$
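The series expansion in (4) can be checked numerically. The sketch below is an illustrative check rather than part of the original method: it expands $1/(1 - m_1 s + (m_1^2 - m_2)s^2)$ with the standard power-series recurrence for the reciprocal of a polynomial, and the function name and sample moment values are hypothetical.

```python
from typing import List


def two_moment_series(m1: float, m2: float, n_terms: int = 4) -> List[float]:
    """Power-series coefficients of 1 / (1 - m1*s + (m1**2 - m2)*s**2).

    For a denominator 1 + a1*s + a2*s**2 the coefficients obey
    c[0] = 1 and c[n] = -a1*c[n-1] - a2*c[n-2].
    """
    a1 = -m1
    a2 = m1 * m1 - m2
    coeffs = [1.0]
    for n in range(1, n_terms):
        c = -a1 * coeffs[n - 1]
        if n >= 2:
            c -= a2 * coeffs[n - 2]
        coeffs.append(c)
    return coeffs


# Hypothetical moments; the s^3 coefficient should equal 2*m1*m2 - m1**3.
m1, m2 = -3.0, 8.0
coeffs = two_moment_series(m1, m2)
print(coeffs, 2.0 * m1 * m2 - m1 ** 3)
```

The first three coefficients reproduce 1, $m_1$, and $m_2$, and the fourth is the third moment implied by the two-moment model.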

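As a result, we approximate the $\alpha$-$\beta$% output transition as

$$T_{\mathrm{out}}(\alpha\text{-}\beta) \cong \sqrt{T_{\mathrm{in}}(\alpha\text{-}\beta)^2 + \left(\mathrm{Elmore} \times \gamma_{\alpha\text{-}\beta}\right)^2} \tag{5}$$

where $\gamma_{\alpha\text{-}\beta}$ is a function of $m_2/m_1^2$. In addition, by approximating the step response of a second-order system, we calculate the $\gamma$ value in (5) as a linear function of $m_2/m_1^2$ as follows:

$$\gamma = \theta \times \frac{m_2}{m_1^2} + \varphi. \tag{6}$$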

This linear approximation is accurate enough for the analysis and helps us to understand the sensitivity of the delay and slew calculation to the shielding effect. However, one can use higher order terms and get a more accurate value.
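A minimal Python sketch of the two-moment slew metric of (5) and (6) follows. It assumes the root-sum-square form of (5), and the `slope` and `intercept` arguments stand for the two coefficients of the linear fit in (6). Their values depend on the transition percentages and must be supplied by the caller; the numbers used below are placeholders, not values read from Fig. 1.

```python
import math


def two_moment_output_slew(t_in: float, m1: float, m2: float,
                           slope: float, intercept: float) -> float:
    """Output transition time from the first two moments, per (5) and (6)."""
    elmore = abs(m1)                   # Elmore delay is the magnitude of m1
    ratio = m2 / (m1 * m1)             # shielding indicator m2 / m1^2
    factor = slope * ratio + intercept  # linear fit of (6)
    return math.hypot(t_in, elmore * factor)


# Placeholder coefficients, chosen only to make the arithmetic easy to follow.
print(two_moment_output_slew(t_in=3.0, m1=-2.0, m2=8.0, slope=0.5, intercept=1.0))
```

With a zero slope and an intercept of $\ln 9$, the function reduces to the Elmore-based 10%-90% slew of (3), which shows how the second moment enters only through the fitted factor.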

The values of $\theta$ and $\varphi$ are calculated and shown in Fig. 1. From Fig. 1 and (5) and (6), since $\theta$ is multiplied by $m_2/m_1^2$, it is clear that the 10% to 90% transition time is sensitive to changes in $m_2/m_1^2$. Fig. 1 also shows that the point around 70% is not as sensitive to the value of $m_2/m_1^2$ (and thereby to the shielding effect) as the 50% transition or any other point is. Fig. 2 shows this scenario for different values of $m_2/m_1^2$. More precisely, if $m_2/m_1^2$ changes by 20%, the 10% point to 90% point transition time changes by as much as 43%, whereas the 70% point output transition time changes only slightly. Fig. 1 can also help us to understand how much error we can incur in our delay/slew analysis if we do not consider higher moments ($m_2, m_3, \ldots$) for calculating the propagation delay and slew.

Based on (4), considering only the first two moments of the impulse response transfer function is equivalent to assuming that the third moment is equal to $2m_1m_2 - m_1^3$. Interestingly, the output transition times are not as sensitive to $m_3/(2m_1m_2 - m_1^3)$ as they are to $m_2/m_1^2$. However, to have an accurate interconnect timing analyzer, when $m_3/(2m_1m_2 - m_1^3)$ becomes larger than a critical value, the AWE method needs to be used to find the delay and slew.

The advantage of this methodology is that the latter scenario occurs rarely in today’s high frequency digital circuits. Indeed, $m_3/(2m_1m_2 - m_1^3)$ is linearly dependent on $m_2/m_1^2$. Thus, whenever the $m_2/m_1^2$ value exceeds a critical limit, the effect of the third moment should also be taken into account by using the AWE method. This critical limit can change according to the degree of precision needed during the path timing analysis.
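To make the two ratios concrete, the sketch below computes the first three moments at the far end of a simple RC ladder and then evaluates $m_2/m_1^2$ and $m_3/(2m_1m_2 - m_1^3)$. It relies on the standard moment recursion for RC trees, $m_j(i) = -\sum_k R_{ik} C_k\, m_{j-1}(k)$, where $R_{ik}$ is the resistance shared by the paths from the driver to nodes $i$ and $k$. That recursion is an assumption brought in for illustration and is not spelled out in the text; the function names and element values are hypothetical.

```python
from typing import List, Sequence, Tuple


def ladder_moments(r: Sequence[float], c: Sequence[float], order: int = 3) -> List[float]:
    """Moments m_0..m_order of H(s) at the far-end node of an RC ladder.

    r[i] is the series resistance leading into node i and c[i] is the
    grounded capacitance at node i (node 0 is nearest the driver).
    """
    n = len(r)
    # path_r[i]: total resistance between the driver and node i.
    path_r = []
    total = 0.0
    for ri in r:
        total += ri
        path_r.append(total)
    prev = [1.0] * n          # m_0 = 1 at every node
    moments = [1.0]
    for _ in range(order):
        # In a ladder, the resistance shared by nodes i and k is path_r[min(i, k)].
        cur = [
            -sum(path_r[min(i, k)] * c[k] * prev[k] for k in range(n))
            for i in range(n)
        ]
        moments.append(cur[-1])
        prev = cur
    return moments


def moment_ratios(moments: Sequence[float]) -> Tuple[float, float]:
    """Return (m2/m1^2, m3/(2*m1*m2 - m1^3)) from a list [m0, m1, m2, m3, ...]."""
    _, m1, m2, m3 = moments[:4]
    return m2 / (m1 * m1), m3 / (2.0 * m1 * m2 - m1 ** 3)


# A single RC segment: both ratios are 1, so one moment captures everything.
print(moment_ratios(ladder_moments([2.0], [3.0])))
# A two-segment ladder with unit elements.
print(moment_ratios(ladder_moments([1.0, 1.0], [1.0, 1.0])))
```

For the two-segment ladder the second ratio is still 1, since that circuit is exactly second order and the two-moment model of (4) is exact for it; larger trees with side branches are where the ratios move away from 1.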

Fig. 2. Step response of a second-order system for three values of $m_2/m_1^2$.

B. Filtering Algorithm

As observed earlier, Input_slew/Elmore is an extremely important factor in determining the propagation delay and slew. When the value of Input_slew/Elmore becomes greater than a critical limit, there is one dominant pole in the voltage transfer function, and therefore, the first moment is sufficiently accurate for calculating the output delay and transition time. It can be observed that the Elmore delay and Elmore slew errors are functions of Input_slew/Elmore. If Input_slew/Elmore is greater than the critical threshold, the Elmore delay error is quite negligible. However, when Input_slew/Elmore is less than this threshold, the Elmore delay may result in a considerable error. The proposed filtering algorithm makes use of this behavior to classify the stage delays based on the critical value of Input_slew/Elmore. The parameters used in the filtering algorithm are defined as follows.

$\varepsilon$  Elmore threshold value. When the magnitude of the first moment of the voltage transfer function is less than this threshold, the estimation errors of the slew and delay (which are calculated based on the Elmore metric) are of little consequence because the critical path delays are not sensitive to these estimation errors.

$\lambda$  Dominant-pole cut-off ratio. When the value of the input slew over the Elmore delay is greater than $\lambda$, the Elmore-based timing analysis is accurate enough.

$\delta$  Second moment filtering-threshold value. If the value of $m_2/m_1^2$ is less than this threshold, (5) becomes the basis of the timing analysis. For an interconnect line with $m_2/m_1^2$ greater than this threshold, the AWE method should be used to calculate the higher moments. As $\delta$ goes towards 1, the delay and slew calculations become more accurate but the runtime increases.

Therefore, given the input slew $T_r$, the TFA for calculating the stage delay is as follows.

Threshold-Based Filtering Algorithm

1. Calculate the first moment $m_1$;
2. if ($|m_1| \le \varepsilon$ || $T_r/|m_1| \ge \lambda$) {
       Calculate Elmore-based delay and slew;
       Return; }
3. Calculate $m_2$;
4. if ($m_2/m_1^2 \le \delta$) {
       Use (5) to calculate delay and slew;
       Return; }
5. Calculate higher moments;
6. Use AWE to calculate the delay and slew;
7. Return delay and slew values;
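The decision structure of the listing above can be sketched in Python as follows. This is only the classification step, under stated assumptions: the three delay calculators themselves are not implemented, the names `classify_net`, `elmore_threshold`, `cutoff_ratio`, and `m2_threshold` are hypothetical, and the second moment is passed as a callable so that it is computed only for nets that fail the first filter.

```python
from typing import Callable


def classify_net(m1: float, compute_m2: Callable[[], float], t_r: float,
                 elmore_threshold: float, cutoff_ratio: float,
                 m2_threshold: float) -> str:
    """Pick the delay/slew calculator for one net: 'elmore', 'two_moment', or 'awe'."""
    elmore = abs(m1)
    # Step 2: small Elmore delay, or input slew large relative to the Elmore delay.
    # The first test short-circuits, so elmore == 0 never reaches the division.
    if elmore <= elmore_threshold or t_r / elmore >= cutoff_ratio:
        return "elmore"
    # Steps 3 and 4: compute the second moment only when it is needed.
    m2 = compute_m2()
    if m2 / (m1 * m1) <= m2_threshold:
        return "two_moment"
    # Steps 5 and 6: fall back to higher moments and AWE.
    return "awe"


# Thresholds from Section II-C (4 ps, 7, 1.44); the nets are hypothetical.
m1 = -20e-12
print(classify_net(m1, lambda: 1.2 * m1 * m1, 200e-12, 4e-12, 7.0, 1.44))
print(classify_net(m1, lambda: 1.2 * m1 * m1, 50e-12, 4e-12, 7.0, 1.44))
print(classify_net(m1, lambda: 3.0 * m1 * m1, 50e-12, 4e-12, 7.0, 1.44))
```

Ordering the tests from cheapest to most expensive is what gives the filter its runtime benefit: most nets exit after the first moment, and only the remainder pay for the second moment or AWE.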


TABLE III. DESIGN#1 EXPERIMENTAL RESULTS

TABLE IV. DESIGN#2 EXPERIMENTAL RESULTS

C. Experimental Results

To verify the accuracy of the proposed filtering technique, the algorithm was applied to many high-performance designs, including Design#1 and Design#2. The design specifications are shown in Table I. All the experimental runs of the proposed algorithm were done on a 2.0-GHz X86-based PC with 2 GB of RAM. The sign-off FOM results (using AWE for interconnect timing analysis and $C_{\mathrm{eff}}$ for gate timing analysis) are shown in the first row of Table III for Design#1 and Table IV for Design#2. We changed the interconnect timing analysis algorithm from AWE to Elmore and D2M and reported the results in the same tables. As shown, the FOM results change by orders of magnitude when we apply Elmore and D2M, although the runtime decreases significantly. We also applied the TFA using $\varepsilon = 4$ ps, $\lambda = 7$, and $\delta = 1.44$. The proposed filtering algorithm improves the interconnect timing analysis runtime by 65%. In addition, TFA resulted in a very small amount of error in the FOM results compared to the AWE-based delay calculator results. For Design#1, the max/average/min errors are 6%/1%/-2%, while for Design#2 the max/average/min errors are 8%/1%/-3%. Decreasing $\varepsilon$ and $\delta$ and increasing $\lambda$ tends to increase the accuracy at the expense of higher runtime. In fact, the filtering algorithm with $\varepsilon \to 0$, $\lambda \to \infty$, and $\delta \to 0$ simply resorts to the AWE-based timing analysis. Similarly, with $\lambda \to 0$, the proposed filtering algorithm reduces to Elmore-based delay and slew calculation.

Evidently, there is a tradeoff between efficiency and accuracy when choosing the threshold parameter values. As an example, $\varepsilon$ is the threshold value that filters those cases where the Elmore delay calculator returns a small delay value. The definition of “small” is, however, design and technology dependent: 10 ps in 180-nm technology may be a small value, while in 90-nm technology it may not be considered small anymore. One can choose different $\varepsilon$ values in different stages of the design flow, starting with a large $\varepsilon$ value and choosing smaller ones as the design proceeds from earlier design stages toward the sign-off stage.

III. FAST GATE TIMING ANALYSIS

In this section, we present a filtering technique for speeding up the gate delay calculation step in an STA tool, while maintaining a reasonable level of accuracy. The filtering technique resorts to a necessary condition check to determine if $C_{\mathrm{total}}$ can be used for the gate delay and/or output slew calculations without introducing a significant inaccuracy. The motivation for the filtering approach is given in Fig. 3, where it is shown that the distribution of the “actual effective capacitance” over the “total capacitance” in a design is highly skewed towards one. As shown in Fig. 3, for “Design#2,” the mean of the distribution of the $C_{\mathrm{eff}}/C_{\mathrm{total}}$ ratio is equal to 0.97. We have observed similar behavior in many other large industrial designs.

Fig. 3. Distribution of the $C_{\mathrm{eff}}/C_{\mathrm{total}}$ ratio for Design#2.

This section is organized as follows. Section III-A reviews the background and previous work in the area of gate timing analysis. Section III-B describes the filtering technique previously mentioned for speeding up the gate delay and slew analysis. Experimental results are reported in Section III-C.

A. Background

In VDSM technologies, we cannot neglect the effect of interconnect resistances of the output loads. Using the sum of all load capacitances as the capacitive load is simple, but can be quite pessimistic [11]. A more accurate approximation for an $n$th-order load seen by the gate (i.e., a load with $n$ distributed capacitances to ground) is to use a second-order RC-$\pi$ model [5]. Therefore, the “effective capacitance” approach has been proposed [6], [10], [11], whereby the RC-$\pi$ load is approximated by an equivalent capacitance, $C_{\mathrm{eff}}$.

All effective capacitance approaches resort to the iterative calculation of $C_{\mathrm{eff}}$ for the given circuit scenario, which can be costly in the context of physical design optimization tools. In this section, we present a filtering approach that resorts to a necessary condition check to determine if the $C_{\mathrm{total}}$ algorithm is sufficient for evaluating the delay and/or output slew of logic gates, and thereby avoids effective capacitance calculations.

As shown in Fig. 3, for most cases of the gate timing analysis, $C_{\mathrm{eff}}$ is very close to $C_{\mathrm{total}}$. Hence, if we are able to identify these cases, we can use the $C_{\mathrm{total}}$ algorithm for the gate delay and/or output slew calculation in these cases and employ the $C_{\mathrm{eff}}$ algorithm for the remaining cases. To determine which type of STA case applies to a given circuit configuration, we resort to an efficient and accurate condition check.

1) Problem Statement: Given a CMOS driver whose input rise time is $T_{\mathrm{in}}$ and which drives an output RC-π load, the problem is to find a robust and efficient necessary condition check to distinguish between cases that can be accurately handled by using the $C_{\mathrm{total}}$ algorithm and cases that need the iterative $C_{\mathrm{eff}}$ algorithm for gate propagation delay and/or output slew calculation during the physical design optimization process.

B. Proposed Filtering Technique

In our quest for a robust and efficient necessary condition, we start with the effective capacitance definition. By definition, the effective capacitance $C_{\mathrm{eff}}$ is a pure capacitance that replaces an RC-π load and has the property that it stores the same amount of charge as the RC-π load until a certain point of the output voltage transition (e.g., the 50% point of the output transition). We assume that the output voltage waveform for the CMOS driver behaves as a combination of ramp and exponential waveforms and, therefore, that the actual $C_{\mathrm{eff}}$ is obtained as a simple average of the $C_{\mathrm{eff}}$ obtained for a ramp output waveform and the $C_{\mathrm{eff}}$ obtained for an exponential output waveform.

Fig. 4. Using the $k_{\mathrm{Ramp}}$ and $k_{\mathrm{Exp}}$ coefficients for $C_{\mathrm{eff}}$ calculation (one iteration).

In the following, we calculate $C_{\mathrm{eff}}$ for ramp and exponential waveforms of the gate output voltage. Modeling the gate output waveform as an exponential voltage waveform, we have shown that the iterative effective capacitance equation for matching any $\alpha$% point of the gate output transition time can be written as (derivations are omitted for brevity)

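$$
\begin{aligned}
C_{\mathrm{eff}}^{\mathrm{Exp}}(\alpha) &= C_n + k_{\mathrm{Exp}}(\alpha)\,C_f \\
k_{\mathrm{Exp}}(\alpha) &= 1 + \frac{y}{\alpha}\left(e^{\ln(1-\alpha)/y} - 1\right) \\
y &= \ln\!\left(\frac{1-\beta}{1-\gamma}\right)\frac{R_\pi C_f}{T_R(\beta-\gamma)}.
\end{aligned}
\tag{7}
$$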

We have also derived that if the output voltage of a gate is approximated by a ramp voltage waveform with a $\beta$% to $\gamma$% rise time of $T_R(\beta-\gamma)$, the iterative $C_{\mathrm{eff}}$ equation for matching any $\alpha$% point of the output transition can be written as (derivations are omitted for brevity)

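$$
\begin{aligned}
C_{\mathrm{eff}}^{\mathrm{Ramp}}(\alpha) &= C_n + k_{\mathrm{Ramp}}(\alpha)\,C_f \\
k_{\mathrm{Ramp}}(\alpha) &= 1 - \frac{x}{\alpha}\left(1 - e^{-\alpha/x}\right) \\
x &= \frac{(\gamma-\beta)\,R_\pi C_f}{T_R(\beta-\gamma)}.
\end{aligned}
\tag{8}
$$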
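Because the derivations are omitted, the following sketch shows one way the ramp coefficient in (8) can be recovered. It assumes the usual RC-π topology (near-end capacitance $C_n$ at the driver output, series resistance $R_\pi$, far-end capacitance $C_f$), an ideal ramp at the driver output, and percentage points expressed as fractions. The symbols $V_{dd}$, $T$, $v_{\mathrm{out}}$, $v_f$, and $t_\alpha$ are introduced here for illustration only.

```latex
% Sketch under the stated assumptions: C_f is charged through R_\pi by an ideal ramp.
\begin{align*}
v_{\mathrm{out}}(t) &= V_{dd}\,\frac{t}{T}, \qquad
  T = \frac{T_R(\beta-\gamma)}{\gamma-\beta}
  \quad\text{(full-swing ramp time)},\\[2pt]
R_\pi C_f\,\frac{dv_f}{dt} + v_f &= v_{\mathrm{out}}(t),\quad v_f(0)=0
  \;\;\Longrightarrow\;\;
  v_f(t) = \frac{V_{dd}}{T}\Bigl[t - R_\pi C_f\bigl(1 - e^{-t/(R_\pi C_f)}\bigr)\Bigr],\\[2pt]
C_{\mathrm{eff}}\,v_{\mathrm{out}}(t_\alpha) &= C_n\,v_{\mathrm{out}}(t_\alpha) + C_f\,v_f(t_\alpha),
  \qquad t_\alpha = \alpha T
  \quad\text{(equal stored charge at the $\alpha$ point)},\\[2pt]
k_{\mathrm{Ramp}}(\alpha) &= \frac{v_f(t_\alpha)}{v_{\mathrm{out}}(t_\alpha)}
  = 1 - \frac{x}{\alpha}\bigl(1 - e^{-\alpha/x}\bigr),
  \qquad x = \frac{R_\pi C_f}{T} = \frac{(\gamma-\beta)\,R_\pi C_f}{T_R(\beta-\gamma)}.
\end{align*}
```

As $R_\pi \to 0$ we have $x \to 0$ and $k_{\mathrm{Ramp}} \to 1$, so the load reduces to $C_n + C_f$, which is the total capacitance.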

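Now, based on the assumption made, the iterative equation for the actual $C_{\mathrm{eff}}$ calculation for any $\alpha$% point of the output transition time can be represented as

$$
C_{\mathrm{eff}}(\alpha) = C_n + \left[\lambda\,k_{\mathrm{Exp}}(\alpha) + (1-\lambda)\,k_{\mathrm{Ramp}}(\alpha)\right]C_f
\tag{9}
$$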

where $0 \le \lambda \le 1$ is the linear combination factor for the exponential and ramp waveforms. We observed that $\lambda = 0.5$ yields the minimum error between the iterative $C_{\mathrm{eff}}$ equation in (9) and the actual sign-off $C_{\mathrm{eff}}$ value. We refer to a single iteration of (9) as the condition check formula. Fig. 4 compares $C_{\mathrm{eff}}^{\mathrm{Exp}}$, $C_{\mathrm{eff}}^{\mathrm{Ramp}}$, and $C_{\mathrm{eff}}$ for delay calculation, each computed with a single iteration and normalized by $C_{\mathrm{total}}$ (y-axis), against the actual sign-off $C_{\mathrm{eff}}$ for delay calculation normalized by $C_{\mathrm{total}}$ (x-axis). To perform the single iteration of (9), we use the output slew of the gate when the gate sees the total capacitance as its load. Subsequently, we calculate $k_{\mathrm{Ramp}}$, $k_{\mathrm{Exp}}$, and $(k_{\mathrm{Ramp}} + k_{\mathrm{Exp}})/2$. As shown in this figure, the single-iteration $C_{\mathrm{eff}}$ using (9) is reasonably close to the actual sign-off $C_{\mathrm{eff}}$ value.
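The condition check formula can be exercised numerically with a short script. The sketch below is only an illustration: it transcribes the coefficients of (7) and (8) as printed (the exponential coefficient is not re-derived here) and blends them as in (9). The function names, the parameter name `lam`, the defaults of a 50% match point with a 10% to 90% slew, and the load values in the usage example are all assumptions and do not come from the paper.

```python
import math


def k_ramp(alpha, beta, gamma, r_pi, c_f, t_r):
    """Ramp coefficient of (8); alpha, beta, gamma are fractions in (0, 1)."""
    x = (gamma - beta) * r_pi * c_f / t_r
    if x == 0.0:  # no series resistance: the far-end capacitance is fully seen
        return 1.0
    return 1.0 - (x / alpha) * (1.0 - math.exp(-alpha / x))


def k_exp(alpha, beta, gamma, r_pi, c_f, t_r):
    """Exponential coefficient, transcribed from (7) as printed."""
    y = math.log((1.0 - beta) / (1.0 - gamma)) * r_pi * c_f / t_r
    if y == 0.0:  # no series resistance: the far-end capacitance is fully seen
        return 1.0
    return 1.0 + (y / alpha) * (math.exp(math.log(1.0 - alpha) / y) - 1.0)


def c_eff_single_iteration(c_n, c_f, r_pi, t_r,
                           alpha=0.5, beta=0.1, gamma=0.9, lam=0.5):
    """One pass of (9): weight lam on the exponential model, 1 - lam on the ramp."""
    k = (lam * k_exp(alpha, beta, gamma, r_pi, c_f, t_r)
         + (1.0 - lam) * k_ramp(alpha, beta, gamma, r_pi, c_f, t_r))
    return c_n + k * c_f


if __name__ == "__main__":
    # Hypothetical load: 10 fF near end, 20 fF far end, 100 ohm, 100 ps slew.
    c_n, c_f, r_pi, t_r = 10e-15, 20e-15, 100.0, 100e-12
    c_eff = c_eff_single_iteration(c_n, c_f, r_pi, t_r)
    print(f"C_eff / C_total = {c_eff / (c_n + c_f):.3f}")
```

Here `t_r` stands for the output slew obtained with the total capacitance as the load, which is the input that the single iteration requires.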

TABLE V. RESULTS FOR DESIGN#1 USING DETAILED LIBRARY

TABLE VI. RESULTS FOR DESIGN#2 USING DETAILED LIBRARY

Before discussing the filtering algorithm, we define the threshold parameter $\theta$, which separates the cases that use the efficient delay calculation from the cases that employ the iterative $C_{\mathrm{eff}}$ for delay calculation. To determine which type of STA scenario we encounter in practice, we resort to (9). First, we calculate the output slew of the gate for the total capacitive load. Next, we find $C_{\mathrm{eff}}$ by using a single iteration of (9) and the output slew from the previous step. If $C_{\mathrm{eff}}/C_{\mathrm{total}}$ is greater than the prespecified threshold value $\theta$, then we call the gate library and find the gate propagation delay for the obtained $C_{\mathrm{eff}}$. If $C_{\mathrm{eff}}/C_{\mathrm{total}} \le \theta$, then we resort to a more accurate way of calculating $C_{\mathrm{eff}}$ (use of the Thevenin equivalent circuit for the driver) and obtain the gate propagation delay and output slew values. We report the results of the filtering technique for different threshold values for Design#1 and Design#2 in Section III-C.
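The decision flow of the filter can be summarized in a few lines of code. The following is a minimal sketch, not the authors' implementation: the three callables (`single_iteration_c_eff`, `library_lookup`, `accurate_analysis`) are hypothetical stand-ins for the condition check formula, the gate library, and the Thevenin-based analysis. The sketch also assumes that a library call returns both delay and slew.

```python
from typing import Callable, NamedTuple, Tuple


class GateTiming(NamedTuple):
    delay: float
    slew: float
    used_accurate_path: bool


def filtered_gate_timing(
    c_total: float,
    single_iteration_c_eff: Callable[[float], float],     # slew -> C_eff via (9)
    library_lookup: Callable[[float], Tuple[float, float]],  # load -> (delay, slew)
    accurate_analysis: Callable[[], Tuple[float, float]],    # -> (delay, slew)
    threshold: float,
) -> GateTiming:
    # Step 1: output slew when the gate sees the total capacitance as its load.
    _, slew_total = library_lookup(c_total)
    # Step 2: one pass of the condition check formula using that slew.
    c_eff = single_iteration_c_eff(slew_total)
    # Step 3: a ratio above the threshold takes the cheap path.
    if c_eff / c_total > threshold:
        delay, slew = library_lookup(c_eff)
        return GateTiming(delay, slew, False)
    # Otherwise fall back to the more accurate (and slower) analysis.
    delay, slew = accurate_analysis()
    return GateTiming(delay, slew, True)


if __name__ == "__main__":
    # Toy stand-ins with made-up numbers, only to show the control flow.
    lookup = lambda load: (10.0 + 2.0 * load, 20.0 + 3.0 * load)
    check = lambda slew: 29.1
    accurate = lambda: (65.0, 100.0)
    print(filtered_gate_timing(30.0, check, lookup, accurate, threshold=0.95))
    print(filtered_gate_timing(30.0, check, lookup, accurate, threshold=0.98))
```

Passing the expensive analysis as a callable keeps it from being evaluated on the cheap path, which is where the runtime saving of the filter comes from.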

C. Experimental Results

To evaluate the accuracy and performance of the proposed technique, the algorithm was applied to many high-performance industrial designs, including Design#1 and Design#2. Some of the characteristics of these two designs are shown in Table I. The FOM metric is used to assess accuracy. We performed several experiments on Design#1 (cf. Table V) and Design#2 (cf. Table VI). For the gate timing characteristics, we used the sign-off level gate library, which contains detailed and accurate k-factor equations for describing the timing behavior of the logic gates. These equations are functions of the input transition time, the output load, $V_{dd}$, temperature, process parameters, etc. Since we observed that a linear combination factor of $\lambda = 0.5$ introduces the minimum error with respect to the sign-off $C_{\mathrm{eff}}$ calculation, we set $\lambda = 0.5$ in (9) throughout this section.

Experiment 1 is the golden experiment in terms of accuracy since it uses sign-off STA for the timing analysis of the design. Experiments 2–7 apply the proposed filtering approach with different filtering threshold values. As experiment 4 indicates, a threshold of $\theta = 0.95$ gives a reasonable accuracy of within 1% error while substantially improving the runtime. The experimental results indicate that the filtering algorithm improves the runtime of the sign-off $C_{\mathrm{eff}}$ calculation by about 50%, while introducing an error of only 1% in the FOM results.

Experiment 9 makes use of the $C_{\mathrm{total}}$ algorithm. As shown in Table V, the FOM results for experiment 9 suffer from very large errors. The single-iteration effective capacitance is used in experiment 8. As shown, the error in the results is much smaller than that of the $C_{\mathrm{total}}$ algorithm, while the runtime is comparable to that of the $C_{\mathrm{total}}$ algorithm.


As mentioned before, the threshold values in the filtering algorithm depend on the designer, the technology, and the step in the design flow. A designer can choose these threshold parameter values based on his/her own tradeoff between desired accuracy and runtime, starting with a small threshold value and choosing larger ones as he/she proceeds from earlier design stages toward the sign-off stage. One can run a few test cases for each class of designs and in each technology node to obtain the threshold parameter values for the filter. Deriving these parameter values is therefore rather straightforward, but they must be tailored to a particular design and technology.
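One simple way to explore this tradeoff on a handful of test cases is to count how many gates would be sent to the accurate path for each candidate threshold. The sketch below assumes that a list of per-gate $C_{\mathrm{eff}}/C_{\mathrm{total}}$ ratios is already available. The function name and the sample ratios are hypothetical and serve only to illustrate the bookkeeping; they are not measured data.

```python
def accurate_path_fraction(ratios, threshold):
    """Fraction of gates whose C_eff/C_total ratio is at or below the threshold,
    i.e., the share of gates that still needs the accurate C_eff analysis."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return sum(1 for r in ratios if r <= threshold) / len(ratios)


if __name__ == "__main__":
    # Hypothetical ratios for eight gates (illustrative values only).
    sample = [0.99, 0.98, 0.97, 0.96, 0.93, 0.88, 0.99, 1.00]
    for threshold in (0.90, 0.95, 0.98):
        share = accurate_path_fraction(sample, threshold)
        print(f"threshold {threshold:.2f}: {share:.1%} of gates need the accurate path")
```

A larger threshold sends more gates to the accurate path, which is consistent with raising the threshold as the design approaches sign-off.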

IV. CONCLUSION

In this paper, first, a threshold-based filtering algorithm for estimating the interconnect delay and slew in high-performance interconnects was presented. The proposed algorithm filters a set of nets for timing evaluation using the Elmore-based delay and slew calculation engine. Furthermore, a closed-form expression for calculating the delay and slew was provided for those interconnect lines with $m_2/m_1^2$ less than a certain critical threshold. Experimental results on large industrial designs show that the filtering technique resulted in a negligible error of 1% while exhibiting about 65% improvement in the interconnect timing analysis runtime. Next, a threshold-based filtering technique was proposed to speed up the gate delay and slew calculation in VDSM technologies. It was observed that the distribution of the “actual $C_{\mathrm{eff}}$ over $C_{\mathrm{total}}$” ratio in industrial designs is highly skewed toward one, which led us to a novel filtering algorithm. This algorithm utilizes $C_{\mathrm{total}}$ for most circuit scenarios and a $C_{\mathrm{eff}}$ algorithm for the remaining rare scenarios. Experimental results on large industrial designs show that the filtering technique resulted in a negligible error of 1% while exhibiting about 50% improvement in the gate timing analysis runtime.

REFERENCES

[1] W. C. Elmore, “The transient response of damped linear networks with particular regard to wideband amplifiers,” J. Appl. Phys., vol. 19, no. 1, pp. 55–63, Jan. 1948.

[2] L. T. Pillage and R. A. Rohrer, “Asymptotic waveform evaluation for timing analysis,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 9, no. 4, pp. 352–366, Apr. 1990.

[3] C. Alpert, A. Devgan, and C. Kashyap, “A two moment RC delay metric for performance optimization,” in Proc. Int. Symp. Phys. Des., 2000, pp. 69–74.

[4] C. Alpert, A. Devgan, and C. Kashyap, “RC delay metrics for performance optimization,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 20, no. 5, pp. 571–582, May 2001.

[5] P. R. O’Brien and T. L. Savarino, “Modeling the driving-point characteristic of resistive interconnect for accurate delay estimation,” in Proc. Int. Conf. Comput.-Aided Des., 1989, pp. 512–515.

[6] J. Qian, S. Pullela, and L. Pillage, “Modeling the ‘effective capacitance’ for the RC interconnect of CMOS gates,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 13, no. 12, pp. 1526–1535, Dec. 1994.

[7] H. L. Trentelman, A. A. Stoorvogel, and M. Hautus, Control Theory of Linear Systems. New York: Springer, 2001.

[8] F. Liu, C. Kashyap, and C. J. Alpert, “A delay metric for RC circuit based on the Weibull distribution,” in Proc. Int. Conf. Comput.-Aided Des., 2002, pp. 620–624.

[9] A. Odabasioglu, M. Celik, and L. T. Pileggi, “PRIMA: Passive reduced-order interconnect macromodeling algorithm,” in Proc. Int. Conf. Comput.-Aided Des., 1997, pp. 58–65.

[10] F. Dartu, N. Menezes, and L. Pileggi, “Performance computation for precharacterized gates with RC loads,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 15, no. 5, pp. 544–553, May 1996.

[11] S. Abbaspour and M. Pedram, “Calculating the effective capacitance for the RC interconnect in VDSM technologies,” in Proc. Asia South Pacific Des. Autom. Conf., 2003, pp. 43–48.

[12] C. Alpert, C. Kashyap, F. Liu, and A. Devgan, “Closed-form delay and slew metrics made easy,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 23, no. 12, pp. 1661–1669, Dec. 2004.

Low Complexity Bit-Parallel Multipliers Based on a Class of Irreducible Pentanomials

José Luis Imaña, Román Hermida, and Francisco Tirado

Abstract—In this paper, we consider the design of bit-parallel canonical basis multipliers over the finite field $GF(2^m)$ generated by a special type of irreducible pentanomial that is used as an irreducible polynomial in the Advanced Encryption Standard (AES). Explicit formulas for the coordinates of the multiplier are given. The main advantage of our design is that some of the expressions obtained are common to any irreducible polynomial, so our multiplier can be generalized to perform the multiplication over general irreducible polynomials. Moreover, the obtained expressions can be easily converted to parameterizable code using hardware description languages. The theoretical complexity analysis also shows that our bit-parallel multipliers present a reduced number of XOR gates with respect to the best known results found in the literature.

Index Terms—Canonical basis, Galois fields, irreducible pentanomials, matrix decomposition, multiplication.

I. INTRODUCTION

Efficient hardware implementations of arithmetic operations in the Galois field $GF(2^m)$ are highly desirable for several applications, such as coding theory, computer algebra, and cryptography. The efficiency of the hardware implementations is measured in terms of the number of gates (XOR, AND) and of the gate delays of the circuit ($T_X$, $T_A$). The representation of the field elements has a crucial role in determining the space and time complexities of the arithmetic operations, particularly the multiplication, which is considered the most important one because exponentiation, division, and inversion can be computed by repeated multiplications. Many approaches and architectures have been proposed to perform $GF(2^m)$ multiplication in which different basis representations of field elements are used. Among them, the most widely used are the canonical, normal, and dual bases. The complexity of the multiplier greatly depends on the basis used and on the selected irreducible polynomial. For example, when the irreducible polynomial used is a trinomial $f(x) = x^m + x^n + 1$, less space and time complexity is needed for the multiplier [2], [3], [5]–[7]. Unfortunately, an irreducible trinomial does not exist for every value of $m$, although an irreducible pentanomial can be used in this case. For this reason, the design of efficient multipliers using irreducible pentanomials is of great relevance.
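As a software reference point for canonical basis multiplication modulo an irreducible polynomial, the sketch below multiplies two elements of $GF(2^m)$ with a bit-serial shift-and-add loop. It is not the bit-parallel architecture studied in this paper. The function name and the integer encoding (bit $i$ holds the coefficient of $x^i$) are choices made here for illustration. The example uses the well-known AES field polynomial $x^8 + x^4 + x^3 + x + 1$, which is an irreducible pentanomial.

```python
def gf2m_mul(a: int, b: int, poly: int, m: int) -> int:
    """Multiply a and b in GF(2^m) in the canonical (polynomial) basis.

    Elements are integers whose bit i is the coefficient of x^i.
    poly encodes the irreducible polynomial, including its x^m term.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a        # addition in GF(2^m) is bitwise XOR
        b >>= 1
        a <<= 1                # multiply the running operand by x
        if (a >> m) & 1:       # degree reached m: reduce modulo poly
            a ^= poly
    return result


AES_POLY = 0b1_0001_1011  # x^8 + x^4 + x^3 + x + 1

if __name__ == "__main__":
    # Worked example from the AES specification: {57} * {83} = {c1}.
    print(hex(gf2m_mul(0x57, 0x83, AES_POLY, 8)))
```

Each loop iteration corresponds to one XOR-accumulate and one conditional reduction. A bit-parallel multiplier computes the same product combinationally, which is why its cost is counted in XOR and AND gates and in gate delays.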

In this paper, a canonical basis multiplication method named transpositional is used. This method was introduced in [2], but only for fields generated by irreducible trinomials. In this contribution, we generalize the method to irreducible pentanomials and apply it to type 1 irreducible pentanomials, defined in [4] as $f(x) = x^m + x^{n+1} + x^n + x + 1$, where $2 \le n \le \lfloor m/2 \rfloor - 1$. These pentanomials are especially important because they exist for many values of $m$ and because they are used in the Advanced Encryption Standard (AES) [1]. Specifically, the MixColumns and InvMixColumns operations in the AES perform multiplications over the field $GF(2^8)$ generated by $f(x) = x^8 + x^4 + x^3 + x + 1$. Therefore, any reduction of

Manuscript received October 3, 2005; revised January 26, 2006. This work was supported by the Spanish Government under Research Grant CICYT TIN2005-5619.

The authors are with the Department of Computer Architecture and Systems Engineering, Complutense University, 28040 Madrid, Spain.

Digital Object Identifier 10.1109/TVLSI.2006.887835

1063-8210/$20.00 © 2006 IEEE