
VLSI DESIGN, 1998, Vol. 7, No. 1, pp. 31-57
Reprints available directly from the publisher. Photocopying permitted by license only.

(C) 1998 OPA (Overseas Publishers Association) Amsterdam B.V. Published under license under the Gordon and Breach Science Publishers imprint.

Printed in India.

Automated Synthesis of Skew-Based Clock Distribution Networks

JOSÉ LUIS NEVES and EBY G. FRIEDMAN*

Department of Electrical Engineering, University of Rochester, Rochester, NY 14627

In this paper a top-down methodology is presented for synthesizing clock distribution networks based on application-dependent localized clock skew. The methodology is divided into four phases: 1) determination of an optimal clock skew schedule for improving circuit performance and reliability; 2) design of the topology of the clock tree based on the circuit hierarchy and minimum clock path delays; 3) design of circuit structures to implement the delay values associated with the branches of the clock tree; and 4) design of the geometric layout of the clock distribution network. Algorithms to determine an optimal clock skew schedule, the optimal clock delay to each register, the network topology, and the buffer circuit dimensions are presented.

The clock distribution network is implemented at the circuit level in CMOS technology and a design strategy based on this technology is presented to implement the individual branch delays. The minimum number of inverters required to implement the branch delays is determined, while preserving the polarity of the clock signal. The clock lines are transformed from distributed resistive-capacitive interconnect lines into purely capacitive interconnect lines by partitioning the RC interconnect lines with inverting repeaters. The inverters are specified by the geometric size of the transistors, the slope of the ramp shaped input/output waveform, and the output load capacitance. The branch delay model integrates an inverter delay model with an interconnect delay model. Maximum errors of less than 2.5% for the delay of the clock paths and 4% for the clock skew between any two registers belonging to the same global data path are obtained as compared with SPICE Level-3.

Keywords: Clock distribution networks, clock scheduling, clock skew, clock tree, CMOS inverters, repeaters

1. INTRODUCTION

Most existing digital systems utilize fully synchronous timing, requiring a reference signal to control the temporal sequence of operations. Globally distributed signals, such as clock signals, are used to provide this synchronous time reference. These signals can dominate and limit the performance of VLSI-based systems. This characteristic is, in part, due to the continuing reduction of feature size concurrent with increasing chip dimensions. Thus interconnect delay has become of increasing significance, perhaps of greater importance than active device delay. Furthermore, the design of the clock distribution network, particularly in high speed applications, requires significant amounts of time, inconsistent with the high design turnaround of the more common data flow portion of a circuit.

*Corresponding author.

Several techniques have been developed to improve the performance and design efficiency of clock distribution networks, such as buffer insertion [1] to reduce propagation delay and power consumption of clock distribution networks, symmetric distribution networks [2] such as H-tree structures to ensure minimal clock skew, and zero-skew clock routing algorithms [e.g., 3, 4] to automatically lay out high speed clock distribution networks in cell-based designs. A common trait of these approaches is that the clock distribution network is designed so as to minimize the clock skew between each register, not recognizing that localized clock skew [5, 6] can be used to improve synchronous circuit performance and minimize the likelihood of any race conditions. Furthermore, no known techniques exist today that can automatically synthesize high speed and robust clock distribution networks while including distributed buffers along the clock path. A novel methodology is therefore presented in this paper for efficiently synthesizing distributed buffer tree-structured clock distribution networks that exploit non-zero localized clock skew to improve circuit performance. This methodology is represented as part of the overall IC design cycle in Figure 1.

The top-down clock distribution design methodology described in this paper is divided into four major phases. The first phase is the determination of an optimal clock skew schedule [7, 8], specifying the localized clock skew schedule which maximizes circuit performance and reliability. In the second phase, a topological design of the clock distribution network is obtained [9]. The topology is specified in terms of the structure of the clock distribution network, configured as a tree with minimum delay values assigned to each branch of the network. These branch delays satisfy the clock skew specifications derived from the optimal clock skew schedule. In the third phase, circuit elements are designed to implement the delay values determined during the topological design phase. The final phase is the layout design of the clock distribution network, in which the physical layout of the clock distribution network is merged with the overall structure of the VLSI circuit.

FIGURE 1 Block diagram of the clock tree design cycle integrated with standard integrated circuit design flow. (Block labels: System Specification; Clock Tree Design Cycle; Timing Information; Hierarchical Circuit Netlist; Optimal Clock Skew Scheduling; Clock Tree Topological Design; Minimum Clock Delay; Branch Delay Assignment; Design of Circuit Structures; Clock Path Automated Layout; Fabrication; Process Parameter Information.)

The technology and layout affect the design of clock distribution networks in two ways. First, the accuracy of the clock distribution network requires that the design methodology consider the effects of process variations on the implementation of the circuit structures, otherwise unacceptable variations in the values of the local clock skew may be induced. Second, the layout of the clock distribution network is highly dependent upon the design methodology and the physical floorplan of the functional circuit. It is for these reasons that the layout phase is treated separately and is therefore not directly addressed in this paper. However, layout and technology related issues are discussed in the Appendix.

The first three phases of the clock distribution design methodology are addressed here. Assuming the timing information of a circuit, namely the logic, register, and interconnect delays, is known, an optimal clock skew schedule is obtained that provides the minimum clock period of the circuit. This material is presented in Section 2. Furthermore, with the localized clock skew schedule and technology related information, an optimal delay is assigned to each clock path driving each register in the circuit and is presented in Section 3. With the clock delay information, the topology of the clock distribution network can be determined with minimum delay values assigned to each branch of the network [9]. This topic is described in Section 4. Circuit structures implementing these delay values are designed based on the temporal specification derived from the topological design of the clock distribution network [10]. The branch delay values are implemented with CMOS inverters, once the minimum number of inverters required to preserve the polarity of the clock signal is determined. The inverters are described in terms of transistor geometry (width/length ratio), the slope of the input and output signals, and the output capacitive load. The use of distributed inverters placed along the clock path converts a resistive-capacitive network into a purely capacitive network, greatly simplifying the delay models of the network and increasing the accuracy of the circuit implementation. This process is described in Section 5. Delay calculations for each clock path and comparison with SPICE simulations for several circuit examples are presented in Section 6. The accuracy of this clock distribution design methodology is measured by comparing the analytically derived clock skew with the SPICE simulations and the originally specified clock skew schedule. Finally, in Section 7, the primary results of this paper are summarized. Thus, a fully specified circuit level description of a clock distribution network exploiting non-zero localized clock skew is developed using the methodology and algorithms presented in this paper.

2. OPTIMAL CLOCK SKEW SCHEDULING

The first phase in the design of a clock distribution network is the determination of the minimum local clock skews that will increase circuit performance by reducing the maximum clock period while ensuring that no race conditions exist. This phase is called optimal clock skew scheduling and has been extensively studied [6, 7, 8, 11]. In this section, these existing approaches are summarized and applied to the problem of clock distribution network synthesis. Assuming the timing characteristics of the circuit are known, such as the minimum and maximum delay of each combinational logic block and the register delay characteristics, it is possible to determine the local clock skew of each data path and the minimum system-wide clock period (maximum clock frequency). This process is accomplished by formulating the optimal clock scheduling problem as a linear programming problem and solving the set of linear inequalities with standard linear programming techniques [12]. In Section 2.1, a model of a synchronous circuit is presented. In Section 2.2, two types of race conditions that can occur in a synchronous circuit are presented. Furthermore, sufficient timing relationships that will prevent the occurrence of these race conditions are described, relating the delay of the registers, logic blocks, interconnect, and clock skew. Chip-to-chip (e.g., board level) clock skew is discussed in Section 2.3. In Section 2.4, an algorithm to determine the minimum clock period and optimal clock skew schedule is presented. Finally, in Section 2.5, the process of determining an optimal clock skew schedule is illustrated with an example.


Before discussing optimal clock skew scheduling, a definition of clock skew is first presented. Clock skew is manifested by a lead/lag relationship between the clock signals that control a local data path, where a local data path is composed of two sequentially adjacent registers with, typically, combinational logic between them.

DEFINITION 1 Given two sequentially adjacent registers, $R_i$ and $R_j$, the clock skew between these two registers is defined as

$$T_{Skew_{ij}} = T_{CD_i} - T_{CD_j}, \tag{1}$$

where $T_{CD_i}$ and $T_{CD_j}$ are the clock delays from the clock source to the registers $R_i$ and $R_j$, respectively.

Furthermore, for a given local data path, if the clock delay to the initial register is less than the clock delay to the final register, the clock skew is described in this paper as negative. Likewise, if the clock delay to the initial register is greater than the clock delay to the final register, the clock skew is described as positive [13]. Finally, a global data path is a data path consisting of one or more local data paths.

2.1. Model of a Synchronous Circuit

A block diagram of a multi-stage synchronous digital system is depicted in Figure 2. Between each pair of registers there is, typically, a block of combinational logic with possible feedback paths between the registers. Each register is either a multi- or single-bit register, and all the register outputs are assumed to change at the same transition point of the clock signal. Only one single-phase clock signal source is assumed. The registers are composed of edge-triggered flip-flops, each assuming a single output state for each clock cycle. The combinational logic block is described in terms of the maximum and minimum delay values, $T_{Logic(max)}$ and $T_{Logic(min)}$, respectively. These logic delay values are obtained by considering the delay of all possible input to output paths within each combinational logic block. For simplicity and without loss of generality, the block diagram in Figure 2 only considers a single input, single output system. The system is modeled by a chain of registers starting at $R_i$, the logic blocks between each of the registers, and the clock path delay $C_d$ to each register. One pair of registers models a local data path inserted as a feedback path. The registers $R_{in}$ and $R_{out}$ are registers placed at the input and output of the circuit, respectively. In Figure 2, $\delta_1$ and $\delta_2$ are intentional delays inserted into the clock paths. The delay $\delta_1$ is the delay of the clock lines driving the system interface registers, while the delay $\delta_2$ is the delay of the clock line driving the on-chip registers of the system. Although in Figure 2 the clock signal path delays are shown as being independent of each other, in Section 4 a design strategy is presented for determining the hierarchical topology of a tree structured clock distribution network.

FIGURE 2 Multi-stage block diagram of a synchronous digital system.

2.2. Preventing Clock Hazards

To illustrate the existence of clock hazards in a local data path, consider the circuit illustrated in Figure 3a. For an input data signal to be transferred correctly from the input of $R_i$ to the output of $R_j$, the input data should be transferred to the output of $R_i$ upon arrival of a clock pulse and transferred to the output of $R_j$ with the arrival of the following clock pulse. This situation is illustrated in Figure 3b. Race conditions in these synchronous systems may occur for two reasons. First, a race condition occurs if a data signal appearing at the output of $R_i$ upon arrival of a clock pulse is not available at the input of $R_j$ when the following clock pulse arrives at $R_j$. This case is called zero clocking [6] due to the excessive use of positive clock skew [5, 13], and is illustrated in Figure 3c. For example, the data appearing at the output of $R_i$, upon arrival of a clock pulse, is available at the input of $R_j$ in 7 tu (time units), while the following clock pulse arrives at $R_j$ in 6 tu. In the second race condition, the data signal appearing at the output of $R_i$ upon arrival of a clock pulse is available at the input of $R_j$ before the same clock pulse arrives at $R_j$. This case is called double clocking [6] due to the excessive use of negative clock skew [5], and is illustrated in Figure 3d, where the clock pulse that triggered $R_i$ arrives at $R_j$ in 6 tu, while the data signal appearing at the output of $R_i$ arrives at $R_j$ in 5 tu.

FIGURE 3 Example of clock hazards. (a) block diagram of the local data path; (b) correct timing operation; (c) positive clock skew (zero clocking); (d) negative clock skew (double clocking).

To prevent both types of clock hazards, a set of inequalities must be satisfied for each local data path. These inequalities are described in terms of the system clock period $T_{CP}$ and the individual delay components of the local data paths [5, 6, 7, 8, 10]. To avoid negative clock skew creating a race condition between two sequentially adjacent registers, $R_i$ and $R_j$, the following inequality must be satisfied,

$$T_{Skew_{ij}} \geq T_{Hold} - T_{PD(min)}, \tag{2}$$

where

$$T_{PD(min)} = T_{C\text{-}Q_i} + T_{Logic(min)} + T_{int} + T_{Set\text{-}up_j}, \tag{3}$$

and where $T_{Skew_{ij}}$ is the difference in delay between the clock path delays $T_{CD_i}$ and $T_{CD_j}$, $T_{Hold}$ is the amount of time the input data must be stable after the clock signal changes state, $T_{C\text{-}Q_i}$ is the time required for the data signal to leave $R_i$ once it is triggered by a clock pulse $C_i$, $T_{Logic(min)}$ is the minimum propagation delay through the logic block between the registers $R_i$ and $R_j$, $T_{int}$ is the interconnect delay, and $T_{Set\text{-}up_j}$ is the time required to successfully propagate to and latch the data signal within $R_j$. To avoid positive clock skew creating a race condition (by limiting the maximum clock frequency) between two sequentially adjacent registers, $R_i$ and $R_j$, the following inequality must be satisfied

$$T_{CP} \geq T_{Skew_{ij}} + T_{PD(max)}, \tag{4}$$

where

$$T_{PD(max)} = T_{C\text{-}Q_i} + T_{Logic(max)} + T_{int} + T_{Set\text{-}up_j}, \tag{5}$$

and where $T_{Logic(max)}$ is the maximum propagation delay through the logic block between the registers $R_i$ and $R_j$.
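Inequalities (2) and (4) together bound the clock skew of a local data path from below and from above. The following is a minimal Python sketch of that check, not code from the paper: the class `LocalDataPath`, the function names, and the numeric values in the example are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class LocalDataPath:
    """Timing parameters of one local data path R_i -> R_j (hypothetical container)."""
    t_cq: float         # clock-to-Q delay of R_i
    t_logic_min: float  # minimum logic delay between R_i and R_j
    t_logic_max: float  # maximum logic delay between R_i and R_j
    t_int: float        # interconnect delay
    t_setup: float      # set-up time of R_j
    t_hold: float       # hold time of R_j

    def t_pd_min(self):
        # Equation (3)
        return self.t_cq + self.t_logic_min + self.t_int + self.t_setup

    def t_pd_max(self):
        # Equation (5)
        return self.t_cq + self.t_logic_max + self.t_int + self.t_setup


def permissible_skew_range(path, t_cp):
    """Return (lower, upper) bounds on T_Skew_ij from (2) and (4)."""
    return (path.t_hold - path.t_pd_min(), t_cp - path.t_pd_max())


def classify_skew(path, t_cp, t_skew):
    """Name the hazard, if any, for a given skew T_CD_i - T_CD_j."""
    lower, upper = permissible_skew_range(path, t_cp)
    if t_skew < lower:
        return "double clocking"  # (2) violated: too much negative skew
    if t_skew > upper:
        return "zero clocking"    # (4) violated: too much positive skew
    return "ok"


# Illustrative numbers only: logic delay between 7 and 11 tu, ideal registers.
example = LocalDataPath(t_cq=0, t_logic_min=7, t_logic_max=11,
                        t_int=0, t_setup=0, t_hold=0)
print(permissible_skew_range(example, t_cp=8))  # (-7, -3)
print(classify_skew(example, t_cp=8, t_skew=-4))  # ok
```

With these assumed values, a zero skew at an 8 tu clock period falls above the upper bound and is reported as zero clocking, which is why a negative skew is needed to run this path faster than its 11 tu maximum delay.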

2.3. Off-Chip Zero Clock Skew

Although it is possible to have off-chip non-zero clock skew, it is desirable to ensure that the clock skew between the I/O registers of the VLSI circuit approaches zero, in order to avoid complicating the specification of the interface of the circuit with other components also controlled by the same clock source, typically at the circuit board level. To avoid race conditions and limiting clock rates, the delay of the clock path driving the output register is chosen equal to the delay of the clock path driving the input register. Consider, for example, the system illustrated in Figure 4a. In Figure 4b, the delay of the clock path driving $R_{out}$ is less than the delay of the clock path driving $R_{in}$, causing a race condition since any data latched into $R_{out}$ is also latched into $R_{in}$ by the same clock pulse. Nevertheless, observe that a negative clock skew is permissible between registers $R_{out}$ and $R_{in}$ since the amount of skew is less than the clock period of the system. The operation of the system where both clock signals have the same delay is illustrated in Figure 4c.

If a system is composed of more than two circuits, all driven by the same clock source, the same approach to applying zero clock skew between the input and output registers is used. Therefore, a clock skew relationship can be established between the input and output off-chip registers along the internal global data paths of the circuit, and is expressed as

$$T_{Skew_{in,out}} = T_{Skew_{in,1}} + \cdots + T_{Skew_{i-1,i}} + \cdots + T_{Skew_{n,out}} = 0. \tag{6}$$

Note that (6) is only valid if the interface circuitry is controlled by the same clock source. The constraint does not apply to asynchronous circuits or synchronous circuits that communicate asynchronously. Furthermore, restricting the off-chip clock skew to zero does not preclude the VLSI circuit from being optimized using intentional non-zero clock skew. The primary difference is that the performance improvement of the VLSI circuit is less than what would be obtained without this constraint. This concept of applying non-zero clock skew at the off-chip level is further exemplified in Section 2.5.

FIGURE 4 Matching of I/O clock skew between VLSI circuits. (a) block diagram; (b) data transfer without matching clock skew; (c) data transfer with matching clock skew.

FIGURE 5 Circuit example composed of five global data paths.

2.4. Minimum Clock Period

The system-wide clock period is minimized by finding a set of clock skew values that satisfy (2)-(6) for each local data path. Again, the timing characteristics of each local data path are assumed to be known. The minimum clock period is obtained when the problem is formalized as a linear programming problem, such as

minimize:

$$T_{CP}$$

subject to the local and global timing constraints:

$$T_{Skew_{ij}} \geq T_{Hold} - T_{PD(min)}$$

$$T_{CP} \geq T_{Skew_{ij}} + T_{PD(max)}$$

$$T_{PD(max)} = T_{C\text{-}Q_i} + T_{Logic(max)} + T_{int} + T_{Set\text{-}up_j}$$

$$T_{PD(min)} = T_{C\text{-}Q_i} + T_{Logic(min)} + T_{int} + T_{Set\text{-}up_j}$$

$$T_{Skew_{in,out}} = T_{Skew_{in,1}} + \cdots + T_{Skew_{i-1,i}} + \cdots + T_{Skew_{n,out}} = 0.$$
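The formulation above can be explored without a linear programming package. The sketch below is an assumption-laden illustration and not the solver referenced in [12]: it swaps in a different technique, using the fact that for a fixed $T_{CP}$ every constraint is a difference constraint on the clock delays ($T_{Skew_{ij}} = T_{CD_i} - T_{CD_j}$), so feasibility can be tested with Bellman-Ford negative-cycle detection and the smallest feasible $T_{CP}$ located by bisection. The function names are hypothetical, hold, set-up, clock-to-Q and interconnect delays are taken as zero, and the (minimum, maximum) logic delays are the values as read from Table I for the path $R_{11}$ to $R_{14}$.

```python
def schedule_is_feasible(t_cp, paths, zero_skew_pairs):
    """True if some clock delay assignment meets (2), (4) and (6) at period t_cp.

    paths: {(i, j): (logic_min, logic_max)} for each local data path R_i -> R_j.
    zero_skew_pairs: register pairs whose clock delays must be equal, as in (6).
    """
    nodes = {r for pair in paths for r in pair}
    nodes.update(r for pair in zero_skew_pairs for r in pair)
    # An edge (u, v, w) encodes the difference constraint d_v - d_u <= w,
    # where d_r is the clock delay to register r.
    edges = []
    for (i, j), (logic_min, logic_max) in paths.items():
        edges.append((j, i, t_cp - logic_max))  # (4): d_i - d_j <= T_CP - T_PD(max)
        edges.append((i, j, logic_min))         # (2): d_j - d_i <= T_PD(min), T_Hold = 0
    for a, b in zero_skew_pairs:
        edges.append((a, b, 0.0))
        edges.append((b, a, 0.0))
    # Bellman-Ford from an implicit source connected to every node with weight 0.
    dist = dict.fromkeys(nodes, 0.0)
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-9:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True   # converged: no negative cycle, constraints satisfiable
    return False          # still relaxing after |V| passes: negative cycle


def minimum_clock_period(paths, zero_skew_pairs, iterations=60):
    """Bisect on T_CP; zero skew everywhere is feasible at the largest logic_max."""
    lo = 0.0
    hi = float(max(logic_max for _, logic_max in paths.values()))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if schedule_is_feasible(mid, paths, zero_skew_pairs):
            hi = mid
        else:
            lo = mid
    return hi


# Path R11 -> R12 -> R13 -> R14 with zero skew between its end registers.
paths = {(11, 12): (4, 7), (12, 13): (6, 10), (13, 14): (6, 7)}
print(round(minimum_clock_period(paths, [(11, 14)]), 3))  # 8.0
```

For this path the three upper bounds from (4) must sum to at least zero under (6), which gives $3T_{CP} \geq 7 + 10 + 7$ and hence the same 8 tu bound by hand.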

2.5. Circuit Example

A circuit example for determining the minimum clock period of a multi-stage system with feedback paths is illustrated in Figure 5. The numbers inside the logic blocks are the minimum and maximum delays of each block, respectively. Without loss of generality, the register timing parameters are assumed to be constant and equal for each register. The example is intended to represent a small circuit, composed of five global data paths. These data paths illustrate different timing characteristics of a synchronous circuit. The first data path, $R_1$ to $R_3$, illustrates the specification of zero clock skew between each register. The second data path, $R_{15}$ to $R_{20}$, illustrates non-zero clock skew and the effect of feedback paths on the clock period. The third and fourth paths, $R_4$ to $R_9$ and $R_{11}$ to $R_{14}$, illustrate non-zero clock skew and the existence of interconnections between global data paths. The last data path, $R_7$ to $R_{10}$, illustrates the effect of non-zero clock skew on reducing the clock period. In each of these paths it is assumed that the first and last registers of each global data path are off-chip registers. Therefore, the clock skew between those registers must satisfy (6). Each global data path is analyzed independently to obtain the optimal clock skew schedule as well as the minimum clock period. The minimum clock period of the example circuit is the maximum value of the clock periods obtained from each data path.

The linear relationships that determine the minimum clock period and the optimal clock skew schedule for the circuit shown in Figure 5 are described in terms of each independent global data path within the circuit. The constraints of each global data path are listed in Table I, where the numbers represent the maximum and minimum delays of the logic blocks between sequentially adjacent registers. For simplicity, the clock-to-Q and set-up times of the registers are assumed to be zero. The solution of the linear inequalities given in Table I is shown in Table II, where an optimal clock skew schedule as well as the minimum clock period is provided.


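TABLE I Constraint list for the data paths in the example circuit shown in Figure 5

Path $R_1$-$R_3$
  $R_1$-$R_2$ ($C_{d1} - C_{d2}$): $T_{Skew_{1,2}} \geq -4$; $T_{Skew_{1,2}} - T_{CP} \leq -6$
  $R_2$-$R_3$ ($C_{d2} - C_{d3}$): $T_{Skew_{2,3}} \geq -4$; $T_{Skew_{2,3}} - T_{CP} \leq -6$
  $T_{Skew_{1,2}} + T_{Skew_{2,3}} = 0$

Path $R_{15}$-$R_{20}$
  $R_{15}$-$R_{16}$ ($C_{d15} - C_{d16}$): $T_{Skew_{15,16}} \geq -3$; $T_{Skew_{15,16}} - T_{CP} \leq -4$
  $R_{16}$-$R_{17}$ ($C_{d16} - C_{d17}$): $T_{Skew_{16,17}} \geq -7$; $T_{Skew_{16,17}} - T_{CP} \leq -11$
  $R_{17}$-$R_{18}$ ($C_{d17} - C_{d18}$): $T_{Skew_{17,18}} \geq -4$; $T_{Skew_{17,18}} - T_{CP} \leq -7$
  $R_{18}$-$R_{19}$ ($C_{d18} - C_{d19}$): $T_{Skew_{18,19}} \geq -5$; $T_{Skew_{18,19}} - T_{CP} \leq -6$
  $R_{19}$-$R_{20}$ ($C_{d19} - C_{d20}$): $T_{Skew_{19,20}} \geq -5$; $T_{Skew_{19,20}} - T_{CP} \leq -6$
  $R_{19}$-$R_{17}$ ($C_{d19} - C_{d17}$): $T_{Skew_{19,17}} \geq -5$; $T_{Skew_{19,17}} - T_{CP} \leq -6$
  $T_{Skew_{15,16}} + T_{Skew_{16,17}} + T_{Skew_{17,18}} + T_{Skew_{18,19}} + T_{Skew_{19,20}} = 0$
  $T_{Skew_{17,18}} + T_{Skew_{18,19}} = -T_{Skew_{19,17}}$

Paths $R_4$-$R_9$; $R_{11}$-$R_{14}$
  $R_4$-$R_5$ ($C_{d4} - C_{d5}$): $T_{Skew_{4,5}} \geq -3$; $T_{Skew_{4,5}} - T_{CP} \leq -5$
  $R_5$-$R_6$ ($C_{d5} - C_{d6}$): $T_{Skew_{5,6}} \geq -5$; $T_{Skew_{5,6}} - T_{CP} \leq -8$
  $R_6$-$R_9$ ($C_{d6} - C_{d9}$): $T_{Skew_{6,9}} \geq -8$; $T_{Skew_{6,9}} - T_{CP} \leq -10$
  $R_{11}$-$R_{12}$ ($C_{d11} - C_{d12}$): $T_{Skew_{11,12}} \geq -4$; $T_{Skew_{11,12}} - T_{CP} \leq -7$
  $R_{12}$-$R_{13}$ ($C_{d12} - C_{d13}$): $T_{Skew_{12,13}} \geq -6$; $T_{Skew_{12,13}} - T_{CP} \leq -10$
  $R_{13}$-$R_{14}$ ($C_{d13} - C_{d14}$): $T_{Skew_{13,14}} \geq -6$; $T_{Skew_{13,14}} - T_{CP} \leq -7$
  $R_{11}$-$R_6$ ($C_{d11} - C_{d6}$): $T_{Skew_{11,6}} \geq -2$; $T_{Skew_{11,6}} - T_{CP} \leq -5$
  $R_6$-$R_{12}$ ($C_{d6} - C_{d12}$): $T_{Skew_{6,12}} \geq -8$; $T_{Skew_{6,12}} - T_{CP} \leq -10$
  $T_{Skew_{4,5}} + T_{Skew_{5,6}} + T_{Skew_{6,9}} = 0$
  $T_{Skew_{4,5}} + T_{Skew_{5,6}} + T_{Skew_{6,12}} + T_{Skew_{12,13}} + T_{Skew_{13,14}} = 0$
  $T_{Skew_{11,12}} + T_{Skew_{12,13}} + T_{Skew_{13,14}} = 0$
  $T_{Skew_{11,6}} + T_{Skew_{6,12}}$ [remaining terms illegible] $= 0$

Path $R_7$-$R_{10}$
  $R_7$-$R_8$ ($C_{d7} - C_{d8}$): $T_{Skew_{7,8}} \geq -6$; $T_{Skew_{7,8}} - T_{CP} \leq$ [illegible]
  $R_8$-$R_{10}$ ($C_{d8} - C_{d10}$): $T_{Skew_{8,10}} \geq -3$; $T_{Skew_{8,10}} - T_{CP} \leq -4$
  $T_{Skew_{7,8}} + T_{Skew_{8,10}} = 0$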

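TABLE II Optimal clock skew schedule and clock period for the example circuit shown in Figure 5

  $T_{Skew_{1,2}} = 0$; $T_{Skew_{16,17}} = -4$; $T_{Skew_{19,20}} = 0$; $T_{Skew_{5,6}} = 0$; $T_{Skew_{12,13}} = -2$; $T_{Skew_{6,12}} = -2$
  $T_{Skew_{2,3}} = 0$; $T_{Skew_{17,18}} = 0$; $T_{Skew_{19,17}} =$ [illegible]; $T_{Skew_{6,9}} = -3$; $T_{Skew_{13,14}} =$ [illegible]; $T_{Skew_{7,8}} = -2$
  $T_{Skew_{15,16}} = 3$; $T_{Skew_{18,19}} =$ [illegible]; $T_{Skew_{4,5}} = 3$; $T_{Skew_{11,12}} =$ [illegible]; $T_{Skew_{11,6}} = 3$; $T_{Skew_{8,10}} = 2$

  $T_{CP} = 8$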

If zero clock skew between off-chip registers (and on-chip registers) is not required, the minimum clock period is $T_{CP} = 7$ tu. Although the restriction of zero clock skew between system I/O increases the minimum clock period to 8 tu (as shown in Table II), a performance improvement is still obtained by applying localized non-zero clock skew within the circuit. If zero clock skew is assumed throughout the circuit, the minimum clock period becomes $T_{CP} = 11$ tu, due to the worst case path delay of the local data path between registers $R_{16}$ and $R_{17}$. Thus, a 27% improvement in the minimum clock period is achieved by optimally scheduling the on-chip localized clock skews while retaining zero off-chip clock skew.
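The quoted improvement follows directly from the clock periods given above. As a quick arithmetic check (the superscript labels are shorthand introduced here, not notation from the paper):

$$\frac{T_{CP}^{\,zero\ skew} - T_{CP}^{\,scheduled}}{T_{CP}^{\,zero\ skew}} = \frac{11 - 8}{11} \approx 0.27 .$$

By the same calculation, the 7 tu period available without the off-chip constraint would correspond to $(11 - 7)/11 \approx 0.36$, so the zero off-chip skew requirement gives up part, but not all, of the achievable gain in this example.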

3. DETERMINATION OF MINIMUM CLOCK PATH DELAY

Synthesizing a clock distribution network requires a description of the circuit at the register transfer level (RTL) and the local clock skew between each pair of sequentially adjacent registers. The local non-zero clock skew is obtained from the optimal clock scheduling process described in Section 2. In order to determine the minimum delay of each clock path, information describing both the local clock skew and the relationship between the clock skew of each local data path belonging to the same global data path is required. In Section 3.1, this relationship for global data paths with forward and feedback connections is formalized. In Section 3.2, an algorithm is presented for calculating the minimum clock path delays.

3.1. Theoretical Background

To determine the minimum clock delay from the clock source to each register, it is necessary to note any relationships that may exist among the clock skews of sequentially adjacent registers occurring within a global data path. Specifically, the effects of feedback paths within global data paths on clock skew must be considered. The relationship among the clock skews of sequentially adjacent registers in a global data path is called conservation of clock skew. These relationships are formalized below:

THEOREM 1 For any given global data path, clock skew is conserved. Alternatively, the clock skew between any two registers which are not necessarily sequentially adjacent is the sum of the clock skews between each pair of registers along the global data path between those two same registers.

Proof. This theorem is proved by induction. For a global data path with only two registers, the clock skew is defined by Definition 1. Now add an extra register to the global data path as illustrated in Figure 6. The clock skew between registers $R_i$ and $R_j$ is $T_{Skew_{ij}} = T_{CD_i} - T_{CD_j}$, and the clock skew between registers $R_j$ and $R_k$ is $T_{Skew_{jk}} = T_{CD_j} - T_{CD_k}$. Substituting $T_{Skew_{jk}}$ into $T_{Skew_{ij}}$, $T_{Skew_{ij}} + T_{Skew_{jk}} = T_{CD_i} - T_{CD_k}$ is obtained, which is the clock skew between registers $R_i$ and $R_k$. By adding $n$ registers to the global data path, the clock skew between the first and the last register is $T_{Skew_{i,n}} = T_{Skew_{ij}} + \cdots + T_{Skew_{n-1,n}} = T_{CD_i} - T_{CD_n}$. If one register is added to the global data path, the clock skew between registers $R_n$ and $R_{n+1}$ is $T_{Skew_{n,n+1}} = T_{CD_n} - T_{CD_{n+1}}$. Substituting $T_{Skew_{n,n+1}}$ into $T_{Skew_{i,n}}$, $T_{Skew_{i,n+1}} = T_{Skew_{ij}} + \cdots + T_{Skew_{n,n+1}}$ is obtained. Since $n$ can be any value greater than two, this theorem is proved.
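The telescoping sum behind the induction can be checked numerically. The snippet below is only an illustration of conservation of clock skew under Definition 1; the helper name `local_skews` and the clock delay values are assumptions chosen for the example.

```python
def local_skews(clock_delays):
    """Skew of each local data path along a global data path (Definition 1).

    clock_delays lists T_CD for the registers in path order; the skew between
    sequentially adjacent registers R_i and R_j is T_CD_i - T_CD_j.
    """
    return [a - b for a, b in zip(clock_delays, clock_delays[1:])]


# Hypothetical clock delays (in tu) for a six-register global data path.
delays = [4, 1, 5, 5, 4, 4]
skews = local_skews(delays)
print(skews)

# Conservation of clock skew: the local skews sum to the end-to-end skew.
assert sum(skews) == delays[0] - delays[-1]
# The same holds between any two registers on the path, adjacent or not.
assert sum(skews[1:4]) == delays[1] - delays[4]
```

Because the intermediate clock delays cancel in pairs, the result does not depend on the particular values chosen, only on the first and last registers considered.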

Although clock skew is defined between two sequentially adjacent registers, Theorem 1 shows that clock skew can exist between any two registers in a global data path. Therefore, it extends the definition of clock skew introduced by Definition 1 to any two non-sequentially adjacent registers belonging to the same global data path. Theorem 1 also illustrates that the clock skew between any two non-sequentially adjacent registers which do not belong to the same global data path has no physical meaning.

A typical sequential circuit may contain sequential feedback paths, as illustrated in Figure 7. It is possible to establish a relationship between the clock skew in the forward path and the clock skew in the feedback path, because the initial and final registers in the feedback path are also registers in the forward path. As shown in Figure 7, the initial and final registers in the feedback path $R_l$-$R_j$ are the final and initial registers of the forward path $R_j$-$R_k$-$R_l$. This relationship is formalized below:

FIGURE 6 Example of a global data path.


FIGURE 7 Data path with feedback paths.

THEOREM 2 For any given global data path containing feedback paths, the clock skew in a feedback path between any two registers, say $R_l$ and $R_j$, is related to the clock skew of the forward path by the following relationship,

$$T_{Skew_{feedback,lj}} = -T_{Skew_{forward,jl}}. \tag{7}$$

Proof. The theorem will be proved by contradiction by first showing that (7) is valid, and then by showing that if (7) is not valid, Theorem 1 and Definition 1 are not satisfied. Applying Theorem 1 to the forward path between registers $R_j$ and $R_l$, $T_{Skew_{jl}} = T_{CD_j} - T_{CD_l}$ is obtained. Likewise, applying Theorem 1 to the feedback path between registers $R_l$ and $R_j$, $T_{Skew_{lj}} = T_{CD_l} - T_{CD_j}$ is obtained, which is the negative of $T_{Skew_{jl}}$. Consider instead that the clock skew in the feedback path is $T_{Skew_{lj}} = T_{CD_j} - T_{CD_l}$. However, this assumption contradicts Theorem 1, and therefore the assumption is false, validating the theorem. □
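As a small worked instance of (7), take assumed clock delays $T_{CD_j} = 5$ tu and $T_{CD_l} = 4$ tu (illustrative values, not taken from the paper). Definition 1 applied in each direction gives

$$T_{Skew_{forward,jl}} = T_{CD_j} - T_{CD_l} = 5 - 4 = 1, \qquad T_{Skew_{feedback,lj}} = T_{CD_l} - T_{CD_j} = 4 - 5 = -1,$$

so the feedback skew is the negative of the forward skew, and the skews around the loop formed by the forward and feedback paths sum to zero.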

3.2. Clock Path Delay Algorithm

A synchronous circuit is formed by one or moreglobal data paths as shown in the exampleillustrated in Figure 5. For each global data path,the clock delay of the registers is calculated by firstchoosing the local data path with the largest clockskew. If the clock skew is positive (negative), theclock delay of the final (initial) register of the localdata path is assigned a constant K, and the clockdelay of the initial (final) register is assigned theconstant K plus (minus) the clock skew of the final(initial) register, where the constant K is theminimum clock delay of the circuit. The clockdelay of the following register connected to the

final register of the local data path is the clockdelay of the final register minus the clock skewbetween them. Similarly, the clock delay of theprevious register connected to the initial register isthe clock delay of the initial register plus the clockskew between them. Proceeding in this fashion forthe remaining registers, the clock delay to each ofthe registers in the global data path is calculated interms of K. If the clock skew is zero for every localdata path in the global data path, the clock pathdelay is the same for each register.To illustrate how the clock delay is calculated,

consider the synchronous circuit illustrated inFigure 5. This circuit is represented in Figure 8by a graph where a vertex represents a registerinstance, a directed edge represents a physical pathbetween a pair of registers, and an edge weightrepresents the clock skew of a local data path.Consider, for example, the global data path, R15 to

R20. The maximum absolute clock skew is 4 tu forthe local data path (R16, R17) in Figure 8.Therefore, the clock delay to R16 from the clocksource is the constant K, and the remaining clockdelays are given as a function of K plus the local

(k) (k)

(k+3) (k) (k+4) (k+4) (k+3) (k+3)

(k+3) (k+2) (k+4) (k+3)

(k) (k+2) (k)

FIGURE 8 Optimal delay assignment for the example circuitin Figure 5.


values of clock skew. Observe that the constantvalue for the global data path composed ofregisters R1 to R3 may be different from theconstant value for the global data path composedof registers R15 to R20, since there are no commonregisters (i.e., connections) between these twopaths.The procedure to calculate the minimum clock

delay is formalized in an algorithm calledPath_Delay, which is summarized in Figure 9.After determining the local data path with thegreatest clock skew, each of the clock delays tothe other registers is calculated by traversing theglobal data path and attributing to each nodethe clock delay value that enforces the desiredclock skew specification. If the initial clock skewspecification satisfies Definition and Theoremsand 2, each node is visited only once and the timecomplexity of the algorithm is O(n), where n is thetotal number of nodes in the graph. The clockdelay of each register obtained with this algorithmis represented by an expression in terms of Kattached to each vertex, as illustrated in Figure 8.Assuming a single clock source and a clock

distribution network that produces the delayvalues found from the previous procedure, the

Algorithm Path_Delay()
begin
  Circuit G(v, e);
  for each Data path D in Circuit G do
    find MAX(|TSkew_ij|);
    if TSkew_ij < 0
      v_i = K; v_j = K - TSkew_ij
    else
      v_j = K; v_i = K + TSkew_ij
    while (v_i, v_j not visited) and (v_i, v_j in Data path D) do
      Delay v_j+1 = delay v_j - TSkew_j,j+1;
      Delay v_i-1 = delay v_i + TSkew_i-1,i;
      Mark v_i-1, v_j+1 as visited;
end Path_Delay

FIGURE 9 Algorithm to find the optimal clock delay to each register.

clock skew requirements of the circuit are satisfied. Such a network is illustrated in Figure 10 for the example circuit shown in Figure 5, where the numerals inside the rectangles represent clock delays and the value of K is assumed to be 1 tu.
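As a rough illustration of the traversal performed by Path_Delay, the following Python sketch assigns delays along a single global data path treated as a simple chain of registers. The function name `path_delay`, the list representation, and the sign convention (the skew of a local data path is taken as the clock delay of the first register minus that of the second) are assumptions made for this example and are not part of the published algorithm. The returned offsets are relative to the constant K.

```python
def path_delay(skews):
    """Assign clock delay offsets (relative to K) along one global data path.

    skews[i] is the clock skew T_i - T_(i+1) between the sequentially adjacent
    registers i and i+1 (assumed sign convention). The clock delay to
    register i is then K + offsets[i].
    """
    # The local data path with the largest absolute skew anchors the assignment.
    m = max(range(len(skews)), key=lambda i: abs(skews[i]))
    offsets = [None] * (len(skews) + 1)
    if skews[m] < 0:
        offsets[m], offsets[m + 1] = 0, -skews[m]
    else:
        offsets[m + 1], offsets[m] = 0, skews[m]
    # Walk towards the start of the path: T_(i-1) = T_i + skew(i-1, i).
    for i in range(m, 0, -1):
        offsets[i - 1] = offsets[i] + skews[i - 1]
    # Walk towards the end of the path: T_(j+1) = T_j - skew(j, j+1).
    for j in range(m + 1, len(skews)):
        offsets[j + 1] = offsets[j] - skews[j]
    return offsets


# Skews along the global data path R15 to R20, as labeled on the graph edges.
print(path_delay([3, -4, 0, 1, 0]))  # [3, 0, 4, 4, 3, 3]
```

The result corresponds to delays of K+3, K, K+4, K+4, K+3 and K+3 for R15 to R20. Each register is visited once, in line with the O(n) complexity stated for the algorithm.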

Providing independent clock path delays for each register in a circuit, as is done in Figure 10, is impractical, due to the unnecessarily large capacitive load placed on the clock source and the inefficient use of die area. It is preferable to reduce this load by allowing the paths of the faster clock nets to be built from the paths of the slower clock nets of neighboring clock paths. This approach suggests a tree structure for the clock distribution network, where the branching points are selected according to the delay at each individual node and the relative physical location of the clocked registers. Such an approach for determining the structure of a clock distribution network is described in the following section.

4. TOPOLOGY OF CLOCK DISTRIBUTION NETWORK

In this section, a methodology is presented for designing the topology of a clock distribution network that implements the delay values obtained from the algorithm Path_Delay introduced in Section 3. A reasonable initial strategy is to provide an independent path delay for each register, as shown in Figure 10. This approach has the advantage of isolating each clock signal and requires a simple circuit composed of cascaded inverters and interconnect sections, for which delay models have been studied in great detail [e.g., 1, 13, 15]. However, independent clock paths for each register are impractical because a typical VLSI circuit may contain many thousands of registers, expending excessive area.

In this paper the clock distribution network for a specific circuit is designed in a tree structure using the hierarchical information of the original circuit description and the clock delay values obtained from the algorithm Path_Delay. In Section 4.1,


FIGURE 10 Clock distribution network for the example circuit shown in Figure 5.

the construction of the clock tree based on the hierarchical description of the circuit is described. Techniques to calculate the branch delays are presented in Section 4.2. In Section 4.3, a methodology for constructing clock distribution networks with minimal or no hierarchical information is described.

4.1. Clock Network Topology

Almost any large VLSI circuit is composed of several levels of hierarchy, where the number of levels is defined by the size and organization of the system and related factors such as the complexity of the circuit, the number of transistors, and the design methodology. During circuit placement, the elements of each hierarchical module are typically placed physically close to each other. Based on this reasoning, the hierarchical description of the circuit can be used to constrain the structure of the clock distribution network. In Section 4.3, the approach described herein to develop the topology of a clock distribution network is extended to handle flat non-hierarchical databases.

To build the clock distribution network, the hierarchy of the circuit is extracted from the circuit netlist and represented as a tree structure. The root vertex is the clock source, the internal vertices are the branching points derived from the hierarchy, and the leaf vertices are the registers. The hierarchical tree structure of the circuit example shown in Figure 5 is illustrated in Figure 11. For the purpose of calculating the individual clock delays, the branches of the tree are classified as either external or internal. The external branches are the branches connected directly to the registers. All other branches are classified as internal.

4.2. Calculation of Branch Delay

The minimum delay values obtained from the algorithm Path_Delay are insufficient to completely determine the delay of each branch of the clock distribution network due to two characteristics of the hierarchical representation:

1) The clock skew between sequentially adjacent registers may not depend on the delay of the internal branches. Consider, for example, the global data path R1 to R3 in Figure 5. These registers are driven by the same branching point, as depicted in Figure 11, meaning that the clock skew specifications are satisfied solely by the delay of the individual external branches driving R1, R2 and R3.

2) The clock skew specifications can only be satisfied if the delay of some of the internal branches is known. As an example, the global data path formed by R15 to R20 contains three registers, R15 to R17, driven by one branching point, while the remaining three registers, R18 to R20, are driven by a second branching point. The clock skew between registers R17 and R18 can only be satisfied if the delay of the internal and external branches driving these


FIGURE 11 Hierarchical representation of the example circuit shown in Figure 5.

two registers is known. A similar situation exists between dependent global data paths belonging to separate hierarchical levels, such as the clock skew between registers R6 and R12 (see Fig. 8).

Note that the first characteristic does not require that the delay of the internal branches be known, while the second case requires this information. Therefore, the individual delay of each of the external branches must be known, although the delay of the internal branches may not be known. Without placement information, the precise location of the registers is indeterminate; therefore, it is not possible to know the precise delay of some of the internal branches. Since it is necessary to assign delay values to some internal branches, a constant unit value, called Δ, is chosen for this purpose. Once an estimate of the layout induced impedances is determined, more realistic delay values for the internal branches can be calculated to generate a more precise estimate of the path delays throughout the clock distribution network.

Delay of External Branches. When both registers within a local data path are driven by the same branching point, the clock skew specifications are completely satisfied by the delay values assigned to the external branches. These delay values are calculated using the algorithm Path_Delay and are expressed in absolute terms, provided that a value is assigned to the constant K. Since the delay of a branch cannot be zero, unit delay is assigned to K without loss of generality.

Delay of Internal Branches. It is possible, however, to have a global data path driven by more than one branching point. For example, the global data path R15 to R20 is driven by two separate branches of the clock distribution network. Starting with the common vertex that drives the data path (vertex VL2.2), variables are assigned to each of the branches, as illustrated in Figure 12(a).


Clock delay equations are chosen to satisfy the clock skew specifications. The set of equations is

$$c - d = 3, \quad d - e = -4, \quad a + e = b + f, \quad f - g = 1, \quad g = h.$$

This system of equations has multiple solutions, since more unknowns exist than constraints. To obtain a solution, the delay of the external branches is initially calculated, providing a solution for variables c to h, respectively. Rewriting these equations, it is found that a = b. Attributing unit delay to these variables, a solution is found, symbolized by "[x]" and shown in Figure 12(b), which satisfies the original clock skew specifications.

Delay Equalization. Once the clock path delay has been determined for each register in the circuit, it is necessary to match the clock path delays between all the input and output registers. This step is performed by identifying the input/output register with the largest clock delay and equating

FIGURE 12 Calculation of the internal branch delays: (a) variables assigned to the branches and the resulting delay equations; (b) solution; (c) delay shifting.


the clock delay of the other input/output registers with this value, without invalidating the optimal clock skew assignment determined previously. For the circuit example, it was assumed that the initial and final registers of each global data path are also the input and output registers of the circuit. In order to satisfy (6), the clock path delay driving the initial and final registers must be the same and equal to the largest clock path delay driving an I/O register. After the delay of the internal branches is determined, the clock path delay driving the I/O registers R15, R20, R4, R9, R11 and R14 is 7 tu, while the clock path delay driving the I/O registers R1, R3 is 3 tu and the I/O registers R7, R10 is 2 tu. Therefore, the clock path delay of the registers R1, R3, R7, and R10 must be increased to 7 tu.
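The equalization step amounts to a simple calculation, sketched below in Python with the I/O clock path delays quoted above. The function name `equalize_io_delays` and the dictionary layout are illustrative assumptions. The sketch only reports how much delay each I/O register is missing; it is further assumed that the extra delay is applied to the whole global data path containing the register (for instance through its constant K), so that the scheduled skews inside that path are unchanged.

```python
def equalize_io_delays(io_delay):
    """Extra delay (in tu) each I/O register needs to match the slowest one."""
    target = max(io_delay.values())
    return {reg: target - d for reg, d in io_delay.items()}


# Clock path delays (tu) of the I/O registers of the example circuit.
io_delay = {"R15": 7, "R20": 7, "R4": 7, "R9": 7, "R11": 7, "R14": 7,
            "R1": 3, "R3": 3, "R7": 2, "R10": 2}
extra = equalize_io_delays(io_delay)
print(extra["R1"], extra["R7"])  # 4 5
```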

Delay Shifting. It is possible to reorganize the delay of the external branches to reduce the total number of delay units needed to build the clock distribution network, thereby reducing die area. Consider, for example, Figure 12(b). The external branches connected to registers R18 to R20 may each have their branch delay reduced by three, if the delay of the internal branch driving each of these registers is increased by three. Another advantage of shifting the delay is the increased flexibility of the circuit implementation, since the delay can be shifted among branches to accommodate variations in the layout placement. An example of delay shifting is illustrated in Figure 12(c). Extending this procedure to the remaining branches of the clock tree depicted in Figure 11, it is possible to determine the minimum delay value of each branch of the clock distribution network, as illustrated in Figure 13.
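Delay shifting can be sketched in a few lines of Python. The example below uses the branch driving R18 to R20, with external delays of 5, 4 and 4 tu (K+4, K+3 and K+3 with K = 1) and a unit internal delay. The function name `shift_delay` and the rule that every external branch keeps at least one delay unit are assumptions made for this sketch.

```python
def shift_delay(internal, externals, min_branch=1):
    """Move the delay common to all external branches into the shared internal branch.

    internal   -- delay of the internal branch feeding the branching point
    externals  -- dict mapping register name to external branch delay
    min_branch -- smallest delay an external branch may keep (assumed to be 1 unit)
    """
    shift = min(externals.values()) - min_branch
    if shift <= 0:
        return internal, dict(externals)  # nothing can be shifted
    return internal + shift, {r: d - shift for r, d in externals.items()}


internal, externals = shift_delay(1, {"R18": 5, "R19": 4, "R20": 4})
print(internal, externals)  # 4 {'R18': 2, 'R19': 1, 'R20': 1}
```

The delay from the branching point's parent to each register is unchanged (6, 5 and 5 tu), while the number of delay units on these four branches drops from 14 to 8.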

4.3. Reorganization of the Clock Tree

The methodology for constructing the clock distribution network presented in this section assumes that the circuit netlist is described hierarchically. It is possible, however, to have

FIGURE 13 Delay assignment for each branch of the clock distribution tree.


circuit descriptions which are completely or partially flat. For those circuits described non-hierarchically, the result of Section 4.2 is a clock distribution network with independent clock paths driving each register, similar to that shown in Figure 10. In this section, a strategy for transforming these types of inefficient clock distribution networks into tree structures is presented. As an example, the clock distribution network for the flat representation of the circuit in Figure 5 is transformed into a clock tree with several levels of hierarchy, as illustrated in Figure 14.

Without regard to placement information, the clock distribution tree can be built by placing the shorter delay clock paths in branches close to the clock source and the longer delay clock paths in branches farther from the clock source. For this purpose, the branch with the longest delay is partitioned into a series of segments, where each segment emulates a precise quantified delay value, Δ. Between any two segments, there is a branching point to other registers or sub-trees of the clock tree, where several segments with delays greater than or equal to Δ are cascaded to provide the appropriate final delay at each leaf node. The result of this approach is illustrated in Figure 14, where a significant saving of segment delays, as compared to Figure 13, is obtained (from 67Δ to 31Δ).

A further area improvement can be obtained by reorganizing the clock distribution network with the assignment of a separate sub-tree for each global data path that does not contain any register in common with any other global data path. For example, the circuit in Figure 8 has four global data paths with this characteristic, two of them being the global data path formed by the registers R1 to R3 and the global data path formed by the registers R15 to R20. The algorithm Form_Units, summarized in Figure 15, is used to determine those global data paths that contain common registers. This result is accomplished by comparing the set of registers of two global data paths. If the result of the comparison is that two global data paths are not independent, both sets are grouped together into a new set, formed by the union of the registers of the two global data paths. For a module with n global data paths, the algorithm is O(n) in the best case, where all the global data paths have registers in common, and is O(n^2) in the worst case, where all the global data paths have no registers in common.
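The grouping performed by Form_Units can be sketched in Python with sets of register names. The register sets below are hypothetical and only loosely modeled on the example circuit. Repeating the inner scan until no further path merges into the current unit is an assumption of this sketch, added so that paths linked only through an intermediate path end up in the same unit.

```python
def form_units(gdps):
    """Group global data paths (sets of register names) that share registers."""
    remaining = [set(g) for g in gdps]
    units = []
    while remaining:
        unit = remaining.pop(0)      # first global data path still ungrouped
        merged = True
        while merged:                # rescan until nothing else touches the unit
            merged = False
            for g in remaining[:]:   # iterate over a copy while removing items
                if unit & g:         # at least one register in common
                    unit |= g
                    remaining.remove(g)
                    merged = True
        units.append(unit)
    return units


# Hypothetical global data paths; the third and fourth link the second to a larger unit.
gdps = [{"R1", "R2", "R3"},
        {"R4", "R5", "R6"},
        {"R11", "R12", "R13", "R14"},
        {"R6", "R12"},
        {"R15", "R16", "R17", "R18", "R19", "R20"}]
units = form_units(gdps)
print(len(units))  # 3
```

Each resulting unit can then be given its own sub-tree of the clock distribution network.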

5. DESIGN OF CIRCUIT DELAY ELEMENTS

The delay of the circuit structures that emulate the delay values associated with each branch of the network requires high precision, because variations in the delays of the internal branches are

FIGURE 14 Delay assignment of non-hierarchically defined clock distribution network.

Algorithm Form_Units(M)
  gdp_set ← all global data paths ∈ module M;
  unit_set ← ∅;
  while gdp_set not ∅ do
    unit ← first gdp in gdp_set;
    for each gdp_j in gdp_set do
      if (register of gdp_j ∈ unit)
        unit ← unit ∪ gdp_j;
        gdp_set ← gdp_set - gdp_j;
end Form_Units

FIGURE 15 Algorithm to group global data paths.


propagated throughout the network, causing unacceptable variations in the desired clock skew. It is important to note that it is much more difficult and important to satisfy the clock skew between any two clock paths than to satisfy each individual clock path delay.

Delay elements can be composed of either passive or active circuit elements. Implementing the delay elements within the clock distribution network as a passive RC network is unacceptable for several reasons: 1) the delay of each branch is highly dependent on the delay of every other branch, 2) the clock signal waveform would degrade, limiting system performance and reliability, 3) an accurate delay model of a passive clock distribution network for a circuit with thousands of registers is difficult to obtain, and 4) the layout of the passive RC network is highly sensitive to small variations in position or length of the clock lines, producing unacceptable variations in the localized clock skew. Therefore, two criteria must be met to successfully design a clock distribution network. First, the delay of each branch must be implemented such that each branch is independent of the delay of the other branches. Second, the clock branches must be designed such that there are no physical layout constraints that are difficult to implement.

To satisfy both criteria, the strategy adopted is to implement the delay segments with active elements, specifically CMOS inverters. Due to the high input impedance of a CMOS inverter, the inverter effectively isolates each clock branch from each other. Additionally, the interconnect lines are constrained to behave as purely capacitive lines by properly inserting these distributed CMOS inverters as repeaters along the clock signal path [15]. The insertion points are chosen such that the output impedance of each inverter is much greater than the resistance of that portion of the interconnect section being driven. This strategy permits the length of a single interconnect line to be accurately modeled as a lumped capacitance with negligible resistance. However, the strategy also places a maximum constraint on the length of an individual portion of an interconnect line between inverting repeaters, thereby limiting the physical placement of a clock branch within the circuit layout. The maximum interconnect length for two CMOS processes is illustrated in Appendix A, describing the limiting length for which a distributed RC line can be accurately modeled as a lumped capacitor.

5.1. Preserving Clock Signal Polarity

Using a single inverter to produce a specific branch delay may invert the polarity of the clock signal for those clock paths consisting of an odd number of branches. To maintain the proper signal polarity, the tree structured graph representing the topology of the clock distribution network is searched to identify those branches that require two inverters, ensuring that the number of inverters from the clock source to every register remains even (or odd) while utilizing a minimum number of inverters.

The number of internal and external branches in a clock path is called the depth of the path, where the depth of each clock path is obtained by traversing the graph representing the clock tree [17]. If the depth is even, the polarity is preserved and a single inverter is assigned to each branch of the clock path. If the depth is odd, an extra inverter must be inserted to preserve the signal polarity of the clock signal. The choice of which branch should receive the extra inverter is important in determining the minimum number of inverters and for preserving the polarity of the clock paths with an even number of branches. The example circuit illustrated in Figure 4 is used in Figure 16, where the number inside each leaf node represents the depth of the clock signal path at that branching point. As shown in Figure 16, the clock paths that require an extra inverter are those branches which drive R1, R2, R3 and R14, since these paths have an odd number of branches. The procedure to obtain the minimum number of inverters to implement the branch delay of the clock distribution network is presented below and illustrated in Figure 16.


FIGURE 16 Inverter assignment of the branches of the clock distribution network.

Buffer Assignment Procedure. The assignment procedure begins at the leaf node of the clock path with odd depth and proceeds backwards, one node at a time towards the root node, until the branch that is assigned the extra inverter has been determined or the root node is reached. Starting at a leaf node and moving back one node in the path, the depth of each clock path connected to that node is determined, excluding the path being analyzed. If at least one depth value is even, two inverters are assigned to the branch in the present path. If all the depths are odd, only one inverter is assigned to the branch, and another node in the path is chosen. The inverter assignment procedure continues until every clock path with odd depth has been searched and an extra inverter is allocated to that clock signal path.

Consider, for example, the clock path driving register R14 in Figure 16. Going backwards one node in the clock path, node A is reached. Determining the depth of all the leaf nodes directly or indirectly driven by node A requires determining the depth of the clock paths driving registers R11, R12 and R13. Since the depth of these nodes is even, two inverters are assigned to the branch between node A and the leaf node driving register R14. Observe that at this point the procedure terminates searching the clock path driving register R14, because the number of buffers driving the path is now even, and begins searching another clock path with odd depth. Consider the clock path driving register R1, also illustrated in Figure 16. Moving backwards one node in the clock path, node B is reached. The depth of the leaf nodes directly or indirectly connected to node B is odd. Therefore, one inverter is assigned to the branch driving register R1 and the search advances one node in the clock path, moving to node C. Evaluating the depth of all the nodes connected to node C, it is determined that the depth of the nodes driving registers R15 through R20 is even, and therefore two inverters are assigned to the branch between nodes C and B. Observe that solving the inverter assignment problem for the clock path driving register R1 immediately solves the signal polarity problem for the clock paths driving registers R2 and R3.
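The assignment procedure can be sketched in Python as follows. The tree is hypothetical (it is not the topology of Figure 16, although the node names A, B and C are reused for readability), and the function names are assumptions. A branch is identified by its lower node, every branch starts with one inverter, and the parity of the inverter count along a path plays the role of the depth parity in the text.

```python
def leaves_below(children, node):
    """All leaf nodes (registers) reachable from node."""
    if node not in children:
        return [node]
    return [leaf for c in children[node] for leaf in leaves_below(children, c)]


def path_inverters(parent, inverters, leaf):
    """Number of inverters between the root and a leaf."""
    total, node = 0, leaf
    while node in parent:
        total += inverters[node]
        node = parent[node]
    return total


def assign_inverters(parent):
    """parent maps each non-root node to its parent; returns inverters per branch."""
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)
    leaves = [n for n in parent if n not in children]
    inverters = {n: 1 for n in parent}  # start with one inverter per branch
    for leaf in leaves:
        if path_inverters(parent, inverters, leaf) % 2 == 0:
            continue  # polarity already preserved on this path
        branch = leaf
        while True:
            node = parent[branch]
            mine = set(leaves_below(children, branch))
            others = [l for l in leaves_below(children, node) if l not in mine]
            at_root = node not in parent
            if at_root or any(path_inverters(parent, inverters, o) % 2 == 0
                              for o in others):
                inverters[branch] = 2  # the extra inverter goes on this branch
                break
            branch = node  # neighbouring paths are odd as well: move up one node
    return inverters


# Hypothetical clock tree: R1-R3 and R14 have odd depth, the other registers even depth.
parent = {"C": "root", "B": "C", "R1": "B", "R2": "B", "R3": "B",
          "D": "C", "E": "D", "R15": "E", "R16": "E",
          "G": "root", "A": "G", "R14": "A", "H": "A", "R11": "H", "R12": "H"}
inv = assign_inverters(parent)
print(inv["B"], inv["R14"])  # 2 2
```

In this sketch a single extra inverter on the branch between C and B corrects the polarity for R1, R2 and R3 at once, while R14 receives its extra inverter on its own external branch.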

5.2. Implementation of Clock Path Delay

A clock tree is composed of many clock signal paths (branches), each driving a storage element (leaf). Clock signal paths may have branches in common, depending on the topology of the clock tree. The total delay of a clock path $t_{cpd}$ is the summation of the delays of each individual branch along the clock path $\tau_{b_i}$,

$$t_{cpd} = \sum_{i=1}^{n} \tau_{b_i}. \qquad (8)$$

In order to accurately sum the individual delay components along a clock signal path, three issues are of importance: 1) isolation of each branch delay; 2) integration of the inverter and interconnect delay models used to calculate the delay of each clock path; and 3) integration of the waveform shape into the inverter delay model.

The capacitive load at the output of the inverting buffer includes both the capacitance of a single interconnect line and the fanout of each node of the clock distribution network. Therefore, the output capacitive load is composed of the capacitance of an interconnect line plus the input capacitance of each inverter driven by that interconnect line. Multiple inverters being driven by the same inverter occur at the branching points within the clock distribution network. The input gate capacitance of an inverter is modeled as $C_g = C_{ox} W L$, where $C_{ox}$ is the oxide capacitance per unit area and is dependent on the thickness and permittivity of the oxide. Therefore, the load capacitance seen by an inverter is given by

$$C_L = C_{line} + (\text{\# of branches}) \cdot C_g, \qquad (9)$$

assuming that the inverters are of equal size. If the inverters are not of equal size, the second term in (9) becomes a summation among all the branches connected to the driving point.

Due to the high input impedance of a CMOS inverter, the inverter effectively isolates each clock branch from the following branch. Furthermore, each branch can be implemented with one or more repeaters, such that the output impedance of each repeater is much greater than the resistance of the driven interconnect section [15], as illustrated in Figure 17. Under this constraint, each inverter drives a load that can be modeled as a lumped capacitor composed of the capacitance of an interconnect line plus the input capacitance of each inverter connected to that interconnect line, as described in Appendix A. As a consequence of the output impedance of the repeater buffer being much greater than the resistance of the interconnect section, the slope of the input signal of an inverter connected to a branching point can be approximated as the slope of the output signal of the inverter connected to that branching point [10].

FIGURE 17 Modeling an RC network with buffers and lumped capacitors.

Under these conditions, the slope of the output waveform of the driving inverter is equal to the slope of the input waveform of the driven inverter(s). The slope of the input/output waveform can be characterized by two parameters, the geometry and the capacitive output load of the driving inverter. Since the capacitive load includes the interconnect line capacitance and the input gate capacitance of each driven inverter, the transistor size and output load of each branch are each determined such that the branch delay and the total delay of a clock path are both satisfied.
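The bookkeeping behind (8) and (9) can be sketched in Python as shown below. All numerical values (oxide capacitance, transistor dimensions, line capacitance, branch delays) are hypothetical and chosen only to make the arithmetic easy to follow, and the function names are assumptions. Passing one gate capacitance per driven inverter covers both the equal-size case of (9) and the general summation.

```python
def gate_capacitance(c_ox, w, l):
    """Input gate capacitance C_g = C_ox * W * L."""
    return c_ox * w * l


def load_capacitance(c_line, gate_caps):
    """Lumped load per (9): line capacitance plus the gate capacitance of each driven inverter."""
    return c_line + sum(gate_caps)


def clock_path_delay(branch_delays):
    """Total clock path delay per (8): the sum of the branch delays along the path."""
    return sum(branch_delays)


# Hypothetical values: C_ox in fF/um^2, W and L in um, capacitances in fF.
c_g = gate_capacitance(c_ox=2.0, w=10.0, l=1.0)             # 20 fF per inverter input
c_l = load_capacitance(c_line=50.0, gate_caps=[c_g] * 3)    # three equal branches
print(c_g, c_l)                                              # 20.0 110.0
print(clock_path_delay([1, 2, 4]))                           # 7
```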

5.3. Branch Delay Modeling

The delay value assigned to each branch of the clock distribution network during the topological design phase is implemented with one or two CMOS inverters, according to the assignment procedure described in Section 5.1. The delay equations describing an inverting repeater, shown in (10)-(12), are used to determine the geometric dimensions of the transistors and are based on the α-power law MOSFET I-V model developed by Sakurai and Newton [18],

$$I_D = \begin{cases} 0 & (V_{GS} \le V_{th}: \text{cutoff}) \\ \dfrac{I'_{D0}}{V'_{D0}}\, V_{DS} & (V_{DS} < V'_{D0}: \text{linear}) \\ I'_{D0} & (V_{DS} \ge V'_{D0}: \text{saturation}), \end{cases} \qquad (10)$$

where

$$I'_{D0} = I_{D0} \left( \frac{V_{GS} - V_{th}}{V_{DD} - V_{th}} \right)^{\alpha} = P_c \frac{W}{L_{eff}} (V_{GS} - V_{th})^{\alpha}, \qquad (11)$$

$$V'_{D0} = V_{D0} \left( \frac{V_{GS} - V_{th}}{V_{DD} - V_{th}} \right)^{\alpha/2} = P_v (V_{GS} - V_{th})^{\alpha/2}, \qquad (12)$$

and where $I_{D0}$ is the drain current at $V_{GS} = V_{DS} = V_{DD}$, $V_{D0}$ is the drain saturation voltage at $V_{GS} = V_{DD}$, $V_{th}$ is the threshold voltage, $\alpha$ is the velocity saturation index, and $V_{DD}$ is the power supply. The parameters $\alpha$, $V_{D0}$, $I_{D0}$, and $V_{th}$ are determined as explained in [18]. The inverter is driven by a ramp signal with rising and falling slopes, $s_r$ and $s_f$, respectively,

$$V_{in} = \begin{cases} s_r t & \text{rising input} \\ V_{DD} - s_f t & \text{falling input.} \end{cases} \qquad (13)$$
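For readers who want to experiment with the α-power law model, a minimal Python sketch of the drain current is given below. It assumes the standard piecewise form of the Sakurai and Newton model [18] (cutoff, linear and saturation regions with scaled saturation current and saturation voltage); the function name and the parameter values in the example are hypothetical and do not correspond to either CMOS process used in the paper.

```python
def drain_current(v_gs, v_ds, v_dd, v_th, i_d0, v_d0, alpha):
    """Drain current of an n-channel device under the alpha-power law model.

    i_d0 -- drain current at V_GS = V_DS = V_DD
    v_d0 -- drain saturation voltage at V_GS = V_DD
    """
    if v_gs <= v_th:
        return 0.0                                  # cutoff region
    ratio = (v_gs - v_th) / (v_dd - v_th)
    i_sat = i_d0 * ratio ** alpha                   # scaled saturation current
    v_sat = v_d0 * ratio ** (alpha / 2)             # scaled saturation voltage
    if v_ds < v_sat:
        return i_sat / v_sat * v_ds                 # linear region
    return i_sat                                    # saturation region


# Hypothetical device parameters (volts and amperes).
params = dict(v_dd=5.0, v_th=0.8, i_d0=1e-3, v_d0=2.0, alpha=1.3)
print(drain_current(5.0, 5.0, **params))  # 0.001
print(drain_current(5.0, 1.0, **params))  # 0.0005
```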

The slope $s_r$ ($s_f$) is selected such that during discharge (charge), the effects of the PMOS (NMOS) transistor can be neglected [19]. This assumption is possible for fast input ramps, which occur when the crossing point between the input and output waveforms is approximately $V_{DD} - V_{th}$ [19]. The structure of the delay element is illustrated in Figure 18a, with the shape of the input and output waveforms illustrated in Figure 18b.

The time $t_d$ from the 50% point of the input waveform to the 50% point of the output waveform is defined as the delay of the circuit element. Equation (14) is derived from (10)-(12) [18] and describes the load capacitance $C_L$ of a CMOS inverter in terms of its delay,

$$C_L = \frac{2 I_{D0}}{V_{DD}} \left[ t_d - \left( \frac{1}{2} - \frac{1 - v_T}{1 + \alpha} \right) \frac{V_{DD}}{s_r} \right], \quad \text{where } v_T = \frac{V_{th}}{V_{DD}}. \qquad (14)$$

The output waveform of the driving inverter is the input signal to all branches connected to this inverter and is approximated by a ramp shaped waveform. This approximation is achieved by linearizing the output waveform (from 0.1 $V_{DD}$ to 0.9 $V_{DD}$) and is accurate as long as the interconnect resistance is negligible as compared to the inverter output impedance. The slope of the output waveform is expressed as

$$\frac{t_{0.9} - t_{0.1}}{0.8} = \frac{C_L V_{DD}}{I_{D0}} \left( \frac{0.9}{0.8} + \frac{V_{D0}}{0.8\, V_{DD}} \ln \frac{10\, V_{D0}}{e\, V_{DD}} \right). \qquad (15)$$
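The way the load and slope relations feed each other along a clock path can be sketched in Python. The load relation is written here in the Sakurai and Newton form [18], with the input described by its full-swing transition time; this form, the function names, the use of the same device parameters for every stage, and all numerical values are assumptions made for illustration only. The transition time of the output edge follows the ramp approximation of (15).

```python
import math


def load_for_delay(t_d, t_in, v_dd, v_th, i_d0, alpha):
    """Load capacitance that yields delay t_d for an input transition time t_in.

    Assumed form: t_d = (1/2 - (1 - v_T)/(1 + alpha)) * t_in + C_L * V_DD / (2 * I_D0).
    """
    v_t = v_th / v_dd
    return (2 * i_d0 / v_dd) * (t_d - (0.5 - (1 - v_t) / (1 + alpha)) * t_in)


def output_transition_time(c_l, v_dd, i_d0, v_d0):
    """Ramp approximation of the output edge, (t_0.9 - t_0.1) / 0.8, as in (15)."""
    return (c_l * v_dd / i_d0) * (
        0.9 / 0.8 + v_d0 / (0.8 * v_dd) * math.log(10 * v_d0 / (math.e * v_dd)))


def size_loads(branch_delays, v_dd, v_th, i_d0, v_d0, alpha):
    """Walk down one clock path, alternating the load and slope relations."""
    loads, t_in = [], 0.0                     # step input at the trunk
    for t_d in branch_delays:
        c_l = load_for_delay(t_d, t_in, v_dd, v_th, i_d0, alpha)
        if c_l <= 0:
            raise ValueError("input edge too slow for a single inverter")
        loads.append(c_l)
        t_in = output_transition_time(c_l, v_dd, i_d0, v_d0)
    return loads


# Hypothetical parameters: three branches of 1 ns each (seconds, volts, amperes).
loads = size_loads([1e-9, 1e-9, 1e-9], v_dd=5.0, v_th=0.8, i_d0=1e-3, v_d0=2.0, alpha=1.3)
print([round(c * 1e15) for c in loads])  # load capacitances in fF
```

A non-positive load signals that the requested delay cannot be met with a single inverter for the given input edge, which is the situation in which a branch would be split into more than one inverter stage.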

Equations (14) and (15) provide the necessary relationships to design the circuit elements of a clock signal path, as explained below:

Design of clock signal path. For each clock path within the clock tree, the procedure to design the CMOS inverters is as follows: 1) the load of the initial trunk of the clock tree is determined from

FIGURE 18 (a) Delay element; (b) Input/output waveforms of the delay element.


(14), assuming a step input clock signal; 2) the slope of the output signal is calculated from (15); 3) this value is applied in (14) to determine the load of the following branch, and (15) is used again to determine the slope of the output signal; and 4) step 3 is repeated for each subsequent branch of the clock path. Steps 1-4 are applied to the remaining clock paths within the clock tree. Observe that if the slope of the output signal of branch $b_i$ does not satisfy

[Inequality (16), a bound on $s_{ri}$ in terms of $t_{di}$, $V_{DD}$, $C_{L\,i+1}$, $I_{D0}$ and $\alpha$; the expression is illegible in the extracted text.] (16)

(14) is no longer valid. The slope $s_{ri}$ can be reduced by increasing the output current drive of the inverter in branch $b_i$. However, from (14), increasing $I_{D0i}$ would increase the capacitive load $C_{Li}$ in order to obtain the desired propagation delay $t_{di}$ for branch $b_i$ with a single inverter. Therefore, if the propagation delay $t_{di}$ of the branch $b_i$ is too large, the slope associated with branch $b_i$ can only be reduced if the branch $b_i$ is implemented with more than one inverter. The number of inverters required to implement the propagation delay $t_{di}$ is chosen such that (16) is satisfied for each inverter stage and the polarity of the clock signal driving branch $b_{i+1}$ does not change.

Equations (10) and (14)-(16) are sufficient to determine the buffer geometry and load capacitance of each inverter along every branch of the clock distribution network. As an example, the clock paths driving registers R6 and R12 are illustrated in Figure 19, assuming CMOS inverters with a Wp/Wn ratio equal to two. The per cent difference between the analytically derived delay obtained from the topological design and implemented as circuit delay elements and the simulated delay derived from SPICE is 1.5% for the path driving R6 and 2.3% for the path driving R12. Furthermore, the per cent difference for the clock skew between the clock signals driving both registers is 4.0%.

FIGURE 19 Example of clock signal paths driving registers R6 and R12.

6. SIMULATION RESULTS

The efficiency of this top-down design methodology has been verified by applying the first three phases of the methodology to a number of example circuits. In the clock skew scheduling and topological phases, the clock tree is determined in terms of the minimum clock path delay and the number of delay units for both the hierarchical and non-hierarchical implementations. In the circuit design phase, the accuracy of implementing the clock path delays and clock skew is compared with SPICE simulations.

Certain features of the topological phase are illustrated in Table III for several circuit examples. The circuit illustrated in Figure 8 is example cir1, cited in Table III. The second example, cir2, illustrates a clock distribution network in which each of the registers of a global data path is interconnected. The final circuits, cir3 and cir4, exemplify the effects of having a large number of registers driven by the same branching point.

The second column in Table III shows the number of registers within each circuit. The third and fourth columns depict the maximum clock delay of the clock distribution network based on a non-zero clock skew schedule, without and with hierarchy. The final two columns describe the total number of delay units required to implement the circuit without and with using the hierarchical information of the circuit. Although the minimum clock delay is always less for a clock distribution network with individualized clock paths driving each register, it does not reflect the reality of current VLSI circuits with many thousands of registers, requiring excessive area to implement


TABLE III Minimum clock delay and total number of delay units for several example circuits

Circuit  Registers  Minimum clock delay        Total delay units
                    No hierarchy  Hierarchy    No hierarchy  Hierarchy
cir1     20         5             8            56            57
cir2     15         7.2           9.2          57            53
cir3     56         11.1          13.2         184           115
cir4     37         8             10.6         163           124

such a clock network. On the other hand, due to branch sharing between common clock signal paths, less layout area is required in tree structured clock distribution networks since the branch delays are smaller.

The difference between the calculated and simulated clock path delays for the circuit shown in Figure 13 is compared in Table IV. The second column depicts the target delay obtained from the topological and circuit design of the clock distribution network. The third column shows the delay values of each clock path derived from SPICE circuit simulation using Level-3 device models, while the fourth column depicts the per cent error between the calculated and the numerically derived delay. Note that the maximum error for these examples is 2.5%.

The delays of the clock paths in Table IV corroborate two very important characteristics of the delay model presented herein. Keeping the interconnect resistance small as compared with the output impedance of a buffer guarantees that each branch delay can be modeled independently. Also, the accuracy of the total delay of a clock path is

TABLE IV Comparison between calculated and simulated clock path delay

Clock path     Delay (ns)  SPICE (ns)  Error
R1, R2, R3     7.0         6.84        2.3%
R15, R4, R9    7.0         7.11        1.6%
R18            8.0         8.10        1.3%
R19, R20       7.0         7.04        0.6%
R5, R6, R16    4.0         4.06        1.5%
R17            8.0         8.06        0.8%
R14            7.0         7.17        2.4%
R11            7.0         7.16        2.3%
R12            6.0         6.14        2.3%
R13            8.0         8.20        2.5%
R7, R10        7.0         6.90        1.4%
R8             9.0         8.90        1.1%

greatly enhanced by integrating an interconnect delay model into an inverter delay model by including the effects of input/output waveform shape and the input capacitance of all the branches connected to the output of an inverter. For example, the clock path driving register R15 shares branches with the clock paths driving registers R1 and R19. Nevertheless, the maximum error between the calculated and simulated delays is 2.5%. Likewise, the clock path that drives R6 has branches in common with the clock paths that drive registers R12, R14 and R4, exhibiting an error less than 2%.

A more significant measure of the effectiveness of the clock distribution network design methodology presented in this paper is to guarantee that the clock skew between any pair of registers in the same global data path is accurately implemented, rather than the delay of each individual clock path. The clock skew between registers for the circuit illustrated in Figure 4 is listed in Table V. The scheduled clock skew implemented with the design methodology described in this paper for the pair of registers presented in column one is shown in column two. The values obtained from SPICE circuit simulation are listed in column three, while the per cent error between the targeted and simulated values is shown in column four. Note that the maximum error is 4%, a number well within practical and useful limits.

TABLE V Comparison between specified and simulated clock skew values

| Registers | Specified Skew (ns) | Simulated (ns) | Error |
|---|---|---|---|
| R1-R3 | 0.0 | 0.0 | 0.0% |
| R15-R16 | 3.0 | 3.0 | 0.0% |
| R12-R13 | -2.0 | -2.06 | 3.0% |
| R17-R19 | 1.0 | 1.03 | 3.0% |
| R4-R14 | 0.0 | 0.06 | |
| R6-R12 | -2.0 | -2.08 | 4.0% |
| R5-R13 | -4.0 | -4.14 | 3.5% |

The individual data paths have been selected to illustrate several types of clock skew situations, such as non-zero clock skew between registers in the same data path or in separate data paths. More specifically, zero clock skew between registers in the same data path is illustrated by the path between registers R1 and R3. The path between registers R15 and R16 illustrates positive clock skew for registers in the same data path, while the path between registers R12 and R13 illustrates negative clock skew for registers in the same data path. In these three examples, the clock skew is only dependent upon the delay of the external branches, and therefore these clock paths are independent of the delay variations of the internal branches of the clock distribution network.

The following examples in Table V are more illustrative of the possible effects of variations of the internal branch delays within the clock path. The path between registers R17 and R19 illustrates non-zero clock skew in a data path with feedback which is dependent on the delay of its internal branches. The last three examples in Table V illustrate the clock skew between two registers belonging to interconnected data paths. In the first example, a new path is formed between registers R4 and R14 due to the connection between registers R6 and R12. The final example also illustrates the effect of negative clock skew on two registers, R5 and R13, in this path. Observe that in the last examples, the error tolerance of the clock skew is still within acceptable margins, exhibiting good accuracy, even for those paths in which the clock paths driving two sequentially adjacent registers do not share a significant portion of the clock distribution network.
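The error figures discussed in this section can be checked with a few lines of arithmetic. The following Python sketch is an illustration only, not the tool used by the authors. It assumes that the percent error is taken relative to the target value and that the skew of a register pair is the delay of the first register minus the delay of the second, which matches the signs in Table V. The register names follow Tables IV and V. Skews recomputed from the rounded delays of Table IV need not agree digit for digit with the simulated column of Table V.

```python
# Rows of Table IV: (registers on the clock path, target delay in ns, SPICE delay in ns).
TABLE_IV = [
    (("R1", "R2", "R3"), 7.0, 6.84),
    (("R15", "R4", "R9"), 7.0, 7.11),
    (("R18",), 8.0, 8.10),
    (("R19", "R20"), 7.0, 7.04),
    (("R5", "R6", "R16"), 4.0, 4.06),
    (("R17",), 8.0, 8.06),
    (("R14",), 7.0, 7.17),
    (("R11",), 7.0, 7.16),
    (("R12",), 6.0, 6.14),
    (("R13",), 8.0, 8.20),
    (("R7", "R10"), 7.0, 6.90),
    (("R8",), 9.0, 8.90),
]

# Specified skew in ns for the register pairs of Table V.
TABLE_V_SPECIFIED = {
    ("R1", "R3"): 0.0,
    ("R15", "R16"): 3.0,
    ("R12", "R13"): -2.0,
    ("R17", "R19"): 1.0,
    ("R4", "R14"): 0.0,
    ("R6", "R12"): -2.0,
    ("R5", "R13"): -4.0,
}


def percent_error(target, measured):
    """Relative error in percent; None when the target is zero (undefined)."""
    if target == 0.0:
        return None
    return abs(measured - target) / abs(target) * 100.0


# Simulated delay of the clock path driving each register.
simulated = {reg: spice for regs, _, spice in TABLE_IV for reg in regs}

# Error of each clock path delay (Table IV, column four).
path_errors = [percent_error(target, spice) for _, target, spice in TABLE_IV]


def simulated_skew(first, second):
    """Assumed convention: delay to the first register minus delay to the second."""
    return simulated[first] - simulated[second]


# Error of each skew with a non-zero specification (Table V, column four).
skew_errors = {
    pair: percent_error(spec, simulated_skew(*pair))
    for pair, spec in TABLE_V_SPECIFIED.items()
}

if __name__ == "__main__":
    print(f"maximum clock path delay error: {max(path_errors):.1f}%")
    defined = [e for e in skew_errors.values() if e is not None]
    print(f"maximum clock skew error: {max(defined):.1f}%")
```

The zero-skew pairs are reported as `None` because a relative error is undefined when the specified skew is zero, which is presumably why Table V leaves that entry blank for R4-R14.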

7. CONCLUSIONS

VLSI/ULSI-based synchronous systems require the efficient synthesis of high speed clock distribution networks in order to obtain higher levels of circuit performance and improved design turnaround. In this paper, circuit performance and reliability are improved by using non-zero localized clock skew to reduce the minimum clock period and ensure that no race conditions occur. An integrated top-down methodology is presented for synthesizing clock distribution networks. This methodology is composed of four phases: 1) optimal clock scheduling, 2) topological design of the clock distribution network, 3) design and modeling of the circuit delay elements, and 4) layout implementation. In this paper, clock scheduling, topological design, and the design and modeling of the circuit delay elements are presented. The fourth phase, layout implementation, is a well developed design technology and is therefore not discussed in this paper.

The optimal clock schedule, described in terms of the non-zero clock skew between each pair of sequentially adjacent registers, is derived from the circuit timing information, namely, the delay of the combinational logic, interconnect, and registers. The topology of the clock distribution network is a tree structure derived from the hierarchy of the original circuit description. Minimum delay values are assigned to each branch of the clock network, satisfying a specific clock skew schedule. An approach for developing the topology of those clock distribution networks for circuits with minimal or no hierarchical information is also described.

A strategy for choosing and implementing the branch delay values of the clock distribution network is presented, using CMOS inverters for the delay elements. The minimum number of inverters to satisfy the branch delay is obtained, while preserving the polarity of the signal driving the clock input of each register. Delay equations of an inverter, derived from the α-power law I-V model, are described. The inverter delay model accurately determines the delay of each clock path by considering the fanout, interconnect capacitance, and the slope of the input and output waveforms of each branch along the clock path.

Comparisons between expected and simulated delays of each individual clock signal path show a maximum error of less than 2.5%. Furthermore, the maximum error between scheduled and simulated clock skew for any two registers belonging to the same global data path is 4%.

Thus, an integrated methodology for synthesizing tree-structured clock distribution networks for high performance VLSI circuits is presented. This methodology, based on inserted delay elements, accurately synthesizes localized clock skews. Furthermore, this methodology utilizes non-zero clock skew to improve overall system performance and reliability.

Acknowledgement

This research was supported in part by Grant 200484/89.3 from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico-Brasil), by the National Science Foundation under Grant No. MIP-9208165 and Grant No. MIP-9423886, by the U.S. Army Research Office under Grant No. DAAH04-93-G-0323, and by a grant from the Xerox Corporation.

References

[1] Pullela, S., Menezes, N., Omar, J. and Pillage, L. T. (November 1993). "Skew and Delay Optimization for Reliable Buffered Clock Trees," Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 556-562.

[2] Bakoglu, H. B., Walker, J. T. and Meindl, J. D. (October 1986). "A Symmetric Clock-Distribution Tree and Optimized High-Speed Interconnections for Reduced Clock Skew in ULSI and WSI Circuits," Proceedings of the IEEE International Conference on Computer Design, pp. 118-122.

[3] Chao, T.-H., Hsu, Y.-C., Ho, J.-M., Boese, K. D. and Kahng, A. B. (November 1992). "Zero Skew Clock Routing with Minimum Wirelength," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol. CAS-39(11), pp. 799-814.

[4] Tsay, R.-S. (February 1993). "An Exact Zero-Skew Clock Routing Algorithm," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. CAD-12(2), pp. 242-249.

[5] Friedman, E. G. (June 1989). Performance Limitations in Synchronous Digital Systems, Ph.D. Dissertation, University of California, Irvine, California.

[6] Fishburn, J. P. (July 1990). "Clock Skew Optimization," IEEE Transactions on Computers, Vol. C-39(7), pp. 945-951.

[7] Sakallah, K. A., Mudge, T. N. and Olukotun, O. A. (June 1990). "checkTc and minTc: Timing Verification and Optimal Clocking of Synchronous Digital Circuits," Proceedings of the IEEE/ACM Design Automation Conference, pp. 111-117.

[8] Szymanski, T. G. (June 1992). "Computing Optimal Clock Schedules," Proceedings of the IEEE/ACM Design Automation Conference, pp. 399-404.

[9] Neves, J. L. and Friedman, E. G. (August 1993). "Topological Design of Clock Distribution Networks Based on Non-Zero Clock Skew Specifications," Proceedings of the IEEE Midwest Symposium on Circuits and Systems, pp. 461-471.

[10] Neves, J. L. and Friedman, E. G. (May 1994). "Circuit Synthesis of Clock Distribution Networks based on Non-Zero Clock Skew," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 4.175-4.178.

[11] Gray, C. T. (July 1993). Optimal Clocking of Wave Pipelined Systems and CMOS Applications, Ph.D. Dissertation, North Carolina State University, Raleigh, North Carolina.

[12] Papadimitriou, C. H. and Steiglitz, K. (1982). Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall.

[13] Friedman, E. G. (1995). Clock Distribution Networks in VLSI Circuits and Systems, IEEE Press.

[14] Friedman, E. G. and Mulligan, J. H. Jr. (April 1991). "Clock Frequency and Latency in Synchronous Systems," IEEE Transactions on Signal Processing, Vol. SP-39, pp. 930-934.

[15] Dhar, S. and Franklin, M. A. (January 1991). "Optimum Buffer Circuits for Driving Long Uniform Lines," IEEE Journal of Solid State Circuits, Vol. SC-26(1), pp. 32-40.

[16] Sakurai, T. (January 1993). "Closed-Form Expressions for Interconnection Delay, Coupling, and Crosstalk in VLSI's," IEEE Transactions on Electron Devices, Vol. ED-40(1), pp. 118-124.

[17] Tremblay, J.-P. and Sorenson, P. G. (1986). An Introduction to Data Structures with Applications, 2nd Edition, McGraw-Hill.

[18] Sakurai, T. and Newton, A. R. (April 1990). "Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas," IEEE Journal of Solid State Circuits, Vol. SC-25(2), pp. 584-594.

[19] Hedenstierna, N. and Jeppson, K. O. (March 1987). "CMOS Circuit Speed and Buffer Optimization," IEEE Transactions on Computer-Aided Design, Vol. CAD-6(2), pp. 270-281.

[20] Wyatt, J. L. Jr. (1987). "Signal Propagation Delay in RC Models for Interconnect," in Circuit Analysis, Simulation and Design, Chapter 11, Elsevier Publishers.

[21] Chern, J.-H., Huang, J., Arledge, L., Li, P.-C. and Yang, P. (January 1992). "Multilevel Metal Capacitance Models for CAD Design Synthesis Systems," IEEE Electron Device Letters, Vol. EDL-13(1), pp. 32-34.

[22] Liu, S. and Nagel, L. W. (December 1982). "Small-Signal MOSFET Models for Analog Circuit Design," IEEE Journal of Solid State Circuits, Vol. SC-17(6), pp. 983-998.

[23] Sakurai, T. (August 1983). "Approximation of Wiring Delay in MOSFET LSI," IEEE Journal of Solid State Circuits, Vol. SC-18(4), pp. 418-426.

APPENDIX A

An interconnect line can be treated as a lumped capacitance only when the resistive impedance of the buffer driving the interconnect line is much greater than the resistance of the interconnect line [15]. Since the resistance of an interconnect line is proportional to the length of the line, a restriction is imposed on the maximum line length in order to ensure the line resistance is of negligible magnitude.

To exemplify how the resistive impedance of an interconnect line is maintained smaller than the output impedance of the inverting buffer driving that node, the interconnect length for several capacitive loads, given a target fabrication process, is analyzed. Consider, for simplicity, that the delay model is a single pole model, as given in [20], and the delay is measured at the 50% point of the output signal. The delay is

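$$T_{di} = 0.693 \sum_{m=1}^{N} R_m \sum_{n=m}^{N} C_n, \qquad \text{(A.1)}$$

where $R_m$ and $C_n$ are the parasitic resistance and capacitance of the interconnect line, respectively, and $N$ denotes the number of RC sections. The interconnect resistance is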

$$R_{int} = R_{\square} \times (\text{number of } \square\text{'s}), \qquad \text{(A.2)}$$

where $R_{\square}$ is the resistance per square of the interconnect line and the number of $\square$'s is the number of squares of interconnect. The capacitance of an isolated interconnect line consists primarily of two components, the parallel plate capacitance and the sidewall fringing capacitance. In submicrometer technology, the fringing capacitance cannot be neglected, and therefore the total capacitance can be modeled as [21]

$$C_{int} = \frac{C}{A} \, W_{int} L_{int} + C_{fr} \cdot 2 (W_{int} + L_{int}), \qquad \text{(A.3)}$$

where $C/A$ is the parallel plate capacitance per unit area, $C_{fr}$ is the fringing capacitance per unit length, and $W_{int}$ and $L_{int}$ are the width and length of the interconnect line, respectively. For the purpose of calculating the interconnect length, given a desired interconnect capacitance, (A.3) can be rewritten as

$$L_{int} = \frac{C_{int} - 2 C_{fr} W_{int}}{\frac{C}{A} \, W_{int} + 2 C_{fr}}. \qquad \text{(A.4)}$$
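The step from (A.3) to (A.4) is a rearrangement that isolates the line length. It can be written out as follows, using only the symbols defined for (A.3):

```latex
\begin{align*}
C_{int} &= \frac{C}{A}\, W_{int} L_{int} + 2 C_{fr} \left( W_{int} + L_{int} \right) \\
C_{int} - 2 C_{fr} W_{int} &= L_{int} \left( \frac{C}{A}\, W_{int} + 2 C_{fr} \right) \\
L_{int} &= \frac{C_{int} - 2 C_{fr} W_{int}}{\dfrac{C}{A}\, W_{int} + 2 C_{fr}}
\end{align*}
```

When the fringing capacitance is neglected ($C_{fr} = 0$), the expression reduces to $L_{int} = C_{int} / \left( \frac{C}{A} W_{int} \right)$, the parallel plate result.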

The output impedance of an inverter is [22]

$$R_{eff} = \frac{2 L_{TR}}{W_{TR} \, \mu \, C_{ox} \, (V_{DD} - V_{th})}, \qquad \text{(A.5)}$$

where $W_{TR}$ and $L_{TR}$ are the width and length of the MOS transistor, respectively, $C_{ox}$ is the gate oxide capacitance, $\mu$ is the channel mobility, $V_{th}$ is the threshold voltage, $V_{DD}$ is the power supply, and the transistor is assumed to operate in the saturation region.

Equations (A.1), (A.2) and (A.4) are used in Table A.I to estimate the interconnect length and resistance for several values of interconnect capacitance. For a value of interconnect capacitance (column two in Table A.I), the length of a minimum width ($W_{int}$) interconnect line is obtained from (A.4) and shown in column three. In column four, the interconnect resistance is calculated from (A.2), and the percentage ratio between the interconnect resistance and the output impedance of the buffer is presented in column five.

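TABLE A.I Determination of the length of a single interconnect line

| Process parameters | Cint (fF) | Lint (µm) | Rint (Ω) | Rint/Reff |
|---|---|---|---|---|
| 2 µm CMOS | 50 | 520 | 8.3 | 0.14% |
| Reff = 5.8 kΩ | 100 | 1041 | 16.6 | 0.29% |
| R□ = 0.048 Ω/□ | 150 | 1562 | 25.0 | 0.43% |
| C/A = 0.032 fF/µm² | 250 | 2604 | 41.6 | 0.72% |
| Cfr = 0 fF/µm | 350 | 3645 | 58.3 | 1.01% |
| Wint = 3.0 µm | 500 | 5208 | 83.3 | 1.44% |
| 1 µm CMOS | 50 | 349 | 12.8 | 0.43% |
| Reff = 3.0 kΩ | 100 | 700 | 25.7 | 0.86% |
| R□ = 0.055 Ω/□ | 150 | 1051 | 38.5 | 1.28% |
| C/A = 0.035 fF/µm² | 250 | 1753 | 64.3 | 2.14% |
| Cfr = 0.045 fF/µm | 350 | 2455 | 90.0 | 3.0% |
| Wint = 1.5 µm | 500 | 3562 | 128.6 | 4.27% |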
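The construction of a row of Table A.I can be sketched in Python as follows. This is an illustrative calculation rather than the authors' code. The `Process` class and the function names are hypothetical, and the parameter values are the reading of the table header that reproduces the listed lengths and resistances (0.048 Ω/□ and 0.032 fF/µm² for the 2 µm process, 0.055 Ω/□, 0.035 fF/µm² and 0.045 fF/µm for the 1 µm process). The number of squares in (A.2) is assumed to be the line length divided by the line width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Process parameters as read from the first column of Table A.I (assumed values)."""
    name: str
    r_eff: float     # output impedance of the buffer, ohm
    r_sheet: float   # sheet resistance, ohm per square
    c_area: float    # parallel plate capacitance, fF per square micrometer
    c_fringe: float  # fringing capacitance, fF per micrometer
    w_int: float     # interconnect width, micrometer


CMOS_2UM = Process("2 um CMOS", 5800.0, 0.048, 0.032, 0.0, 3.0)
CMOS_1UM = Process("1 um CMOS", 3000.0, 0.055, 0.035, 0.045, 1.5)


def interconnect_length(c_int, p):
    """Length in micrometers for a capacitance c_int in fF, from (A.4)."""
    return (c_int - 2.0 * p.c_fringe * p.w_int) / (p.c_area * p.w_int + 2.0 * p.c_fringe)


def interconnect_resistance(length, p):
    """Resistance in ohm from (A.2), with length / width squares of interconnect."""
    return p.r_sheet * length / p.w_int


def table_row(c_int, p):
    """Return (length in um, resistance in ohm, Rint/Reff in percent)."""
    length = interconnect_length(c_int, p)
    r_int = interconnect_resistance(length, p)
    return length, r_int, 100.0 * r_int / p.r_eff


if __name__ == "__main__":
    for process in (CMOS_2UM, CMOS_1UM):
        print(process.name)
        for c_int in (50, 100, 150, 250, 350, 500):
            length, r_int, ratio = table_row(c_int, process)
            print(f"  {c_int:4d} fF  {length:8.1f} um  {r_int:6.1f} ohm  {ratio:5.2f}%")
```

Under these assumptions the computed lengths agree with the table when the fractional part is dropped. The exception is the last 1 µm row, where the listed resistance corresponds to the computed length rather than to the printed one.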


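TABLE A.II Propagation delay of a single interconnect line with variable load

| Process parameters | Cint (fF) | D1 (ns) | D2 (ns) | Error |
|---|---|---|---|---|
| 2 µm CMOS | 50 | 0.29 | 0.29 | 0.0% |
| Reff = 5.8 kΩ | 100 | 0.58 | 0.58 | 0.0% |
| R□ = 0.048 Ω/□ | 150 | 0.86 | 0.87 | < 1% |
| C/A = 0.032 fF/µm² | 250 | 1.44 | 1.45 | < 1% |
| Cfr = 0 fF/µm | 350 | 2.02 | 2.04 | 1.0% |
| Wint = 3.0 µm | 500 | 2.88 | 2.92 | 1.5% |
| 1 µm CMOS | 50 | 0.15 | 0.15 | 0.0% |
| Reff = 3.0 kΩ | 100 | 0.30 | 0.30 | 0.0% |
| R□ = 0.055 Ω/□ | 150 | 0.45 | 0.46 | 2.1% |
| C/A = 0.035 fF/µm² | 250 | 0.75 | 0.77 | 2.6% |
| Cfr = 0.045 fF/µm | 350 | 1.05 | 1.08 | 2.8% |
| Wint = 1.5 µm | 500 | 1.50 | 1.56 | 3.8% |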

FIGURE A.1 Propagation delay of a single interconnect line. (a) Schematic diagram; (b) delay model without interconnect resistance; (c) delay model considering interconnect resistance.

From Table A.I it is noted that by including fringing capacitance in the calculation of the interconnect capacitance, the maximum interconnect length of the 1 µm process as compared to the 2 µm process is significantly reduced. In Table A.I, it is also shown that as the feature size is reduced, the interconnect resistance increases, reducing the length of interconnect which can be accurately modeled as a lumped capacitor.

The values of resistance and capacitance, illustrated in Table A.I, are used in Table A.II to obtain the signal propagation delay of an inverter driving an interconnect line. In column three, the calculated propagation delay of an inverter (D1), derived from (A.1), driving an interconnect line without considering interconnect resistance is described (illustrated in Figure A.1b). The effective resistance of a minimum size inverter, $R_{eff}$, is calculated from (A.5). Using the same values of interconnect capacitance that are used in Table A.I, the propagation delay through an interconnect line considering interconnect resistance (D2) is illustrated in Figure A.1c and presented in column four. The per cent error between both calculations is listed in column five.

As shown in column five in Table A.II, the interconnect resistance begins to affect the propagation delay for load capacitance values of approximately 150 fF, which correspond to interconnect lengths of approximately 1500 µm for a 2 µm process and 1000 µm for a 1 µm process, as shown in column three of Table A.I. Interconnect line lengths significantly greater than these values require the insertion of additional inverting buffers in order to preserve the accuracy of the branch delay and to minimize any significant signal degradation.
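The comparison between the two delay models of Figure A.1 can be sketched in Python by evaluating the double sum of (A.1) over an RC ladder. This is a hedged illustration, not the authors' procedure. It assumes that the model without interconnect resistance is a single section made of the buffer resistance and the line capacitance, and that the model with interconnect resistance adds the whole line resistance as one lumped series section. The function names are hypothetical. Absolute delays obtained this way need not match Table A.II, which depends on the exact effective resistance and rounding used, so the sketch focuses on the relative difference.

```python
def ladder_delay_ns(resistances_ohm, capacitances_ff, k=0.693):
    """Evaluate (A.1): k * sum over m of R_m times the capacitance at or after section m.

    Resistances are in ohm and capacitances in fF, so R * C is in units of 1e-15 s;
    the result is converted to nanoseconds.
    """
    if len(resistances_ohm) != len(capacitances_ff):
        raise ValueError("each section needs one resistance and one capacitance")
    downstream_ff = sum(capacitances_ff)
    total = 0.0
    for r, c in zip(resistances_ohm, capacitances_ff):
        total += r * downstream_ff   # R_m sees every capacitance from section m onward
        downstream_ff -= c           # the next section no longer sees C_m
    return k * total * 1e-6          # ohm * fF = 1e-15 s = 1e-6 ns


def delay_without_line_resistance(r_eff, c_int):
    """Assumed model for D1: buffer resistance driving a lumped capacitance."""
    return ladder_delay_ns([r_eff], [c_int])


def delay_with_line_resistance(r_eff, r_int, c_int):
    """Assumed model for D2: line resistance lumped in series before the capacitance."""
    return ladder_delay_ns([r_eff, r_int], [0.0, c_int])


def relative_difference_percent(r_eff, r_int, c_int):
    """Difference between the two models, relative to the model with line resistance."""
    d1 = delay_without_line_resistance(r_eff, c_int)
    d2 = delay_with_line_resistance(r_eff, r_int, c_int)
    return 100.0 * (d2 - d1) / d2


if __name__ == "__main__":
    # Last row of the 1 um block of Table A.I: Reff = 3.0 kOhm, Rint = 128.6 Ohm, 500 fF.
    print(f"{relative_difference_percent(3000.0, 128.6, 500.0):.2f}%")
```

With this lumped model the relative difference reduces to Rint / (Reff + Rint), independent of the 0.693 factor and of the load capacitance. This agrees with the observation that the line can be treated as a lumped capacitor as long as its resistance stays small compared with the output impedance of the buffer.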

Authors’ Biographies

José Luis Neves received the B.S. degree in Electrical Engineering and the M.S. degree in Computer Science from the Federal University of Minas Gerais, Brazil in 1986 and 1990, respectively. He also received the M.S. degree in Electrical Engineering from the University of Rochester in 1991. He completed the Ph.D. degree at the University of Rochester in 1995. Dr. Neves is currently employed at IBM Microelectronics in East Fishkill, New York, working on the development of high performance VLSI-based design methodologies and CAD tools.

He received a doctorate fellowship from CNPq (Conselho Nacional de Desenvolvimento Científico-Brasil) from 1990 to 1994 and has been a research assistant since 1994. His research interests and publications are in the areas of high speed and low power digital CMOS IC design techniques; clock distribution design; synchronous and asynchronous timing issues; high-level synthesis; and algorithm design for the development of CAD tools targeted for high performance systems.

Eby G. Friedman received the B.S. degree from Lafayette College in 1979, and the M.S. and Ph.D. degrees from the University of California, Irvine, in 1981 and 1989, respectively, all in electrical engineering.

From 1979 to 1991, he was with Hughes Aircraft Company, rising to the position of manager of the Signal Processing Design and Test Department, responsible for high performance VLSI CMOS digital and analog IC's and the development of supporting design and test methodologies and CAD tools. He has been in the Department of Electrical Engineering at the University of Rochester since 1991, where he is an Associate Professor and Director of the High Performance VLSI/IC Design and Analysis Laboratory. His current research and teaching interests are in the areas of high performance VLSI/IC design and analysis and related system specifications.

He has published in the fields of high speed and low power CMOS design techniques, pipelining and retiming, and the theory and application of synchronous clock distribution networks, and has edited three books on high performance VLSI-based circuits and systems.
