560 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 5, MAY 2007

Variation-Aware Adaptive Voltage Scaling System

Mohamed Elgebaly, Member, IEEE, and Manoj Sachdev, Senior Member, IEEE

Abstract—Conventional voltage scaling systems require a delay margin to maintain a certain level of robustness across all possible device and wire process variations and temperature fluctuations. This margin is required to cover for a possible change in the critical path due to such variations. Moreover, a slower interconnect delay scaling with voltage compared to logic delay can cause the critical path to change from one operating voltage to another. With technology scaling, both process variation and interconnect delay are growing and demanding more margin to guarantee error-free operation. Such margin is translated into a voltage overhead and a corresponding energy inefficiency. In this paper, a critical path emulator architecture is shown to track the changing critical path at different process splits by probing the actual transistor and wire conditions. Furthermore, the voltage scaling characteristics of the actual critical path are closely tracked by programming logic and interconnect delay lines to achieve the same delay combination as the actual critical path. Compared to conventional open-loop and closed-loop systems, the proposed system is up to 39% and 24% more energy efficient, respectively. A 0.18-μm technology test chip is designed to verify the functionality of the proposed system, showing critical path tracking of a 16 × 16 bit multiplier.

Index Terms—Adaptive voltage scaling (AVS), circuit modeling, critical path tracking, deep submicrometer MOSFET.

I. INTRODUCTION

DESIGNING powerful and versatile computing systems is becoming more feasible with technology scaling. Smaller feature size enables more integration and allows more functions to be built within the same area. This leads to an escalation in current density and the associated power dissipation. Power reduction techniques are becoming essential in designing such systems in order to keep power dissipation under control. Dynamic and leakage power are the main contributors to the overall power dissipation and the main drain for energy. The third component is the short circuit power, which is small and can be ignored for most modern CMOS designs [1].

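Dynamic power is considered by far the largest power dissipation component. It can be expressed as

$$P_{\mathrm{dynamic}} = C_{\mathrm{eff}}\, V_{DD}^{2}\, f_{\mathrm{CLK}} \qquad (1)$$

where $V_{DD}$ is the supply voltage and $f_{\mathrm{CLK}}$ is the operating frequency. $C_{\mathrm{eff}}$ is the average switching capacitance and is given by $C_{\mathrm{eff}} = C_{\mathrm{gate}} + C_{\mathrm{diff}} + C_{\mathrm{wire}}$, where $C_{\mathrm{gate}}$, $C_{\mathrm{diff}}$, and $C_{\mathrm{wire}}$ are the average switching gate, diffusion, and wire capacitance for the chip, respectively.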
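As a rough numerical illustration of the quadratic dependence of dynamic power on supply voltage, the following sketch evaluates the standard capacitance-voltage-frequency relation at two operating points. The capacitance, voltage, and frequency values are hypothetical and are not taken from the paper.

```python
def dynamic_power(c_eff, v_dd, f_clk):
    """Dynamic power in watts: switching capacitance * V_DD^2 * clock frequency."""
    return c_eff * v_dd ** 2 * f_clk

# Hypothetical operating points (illustrative values only).
C_EFF = 1e-9                                  # 1 nF average switching capacitance
p_full = dynamic_power(C_EFF, 1.8, 200e6)     # full voltage, full frequency
p_scaled = dynamic_power(C_EFF, 0.9, 100e6)   # half voltage, half frequency

# Halving both voltage and frequency leaves one eighth of the dynamic power.
ratio = p_scaled / p_full
```

Under these assumptions the frequency contributes a factor of two and the voltage a factor of four, which is why supply reduction is the dominant lever.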

Manuscript received November 9, 2005; revised September 4, 2006. This work was supported by a strategic grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

M. Elgebaly is with Montalvo Systems Inc., Santa Clara, CA 95054 USA (e-mail: [email protected]).

M. Sachdev is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2007.896909

Fig. 1. Architecture of a DVS system.

Leakage power is becoming a major roadblock in the way of technology scaling [2], [3]. Different leakage power components drain power and energy while the system is idle. Subthreshold leakage is considered the main leakage mechanism, which is given by

(2)

where , and are technology parameters, is the threshold voltage, is the thermal voltage (26 mV at room temperature), and and are the device dimensions.

Power dissipation control requires design effort on different fronts [4]. System, architectural, circuit, and device level power reduction techniques can be employed to keep power dissipation within the tight power budget. For example, different strategies can be used in converting a high-performance chip to a low-power chip [5]. Voltage supply reduction has been shown to be the most effective among all other power reduction methodologies. Theoretically, the lower limit on supply voltage , required for correct functionality of a static CMOS inverter was derived in [6] and [7] and is given by

(3)

where is a constant between 3 and 4. Recently, a fast Fourier transform (FFT) unit was shown to provide optimal energy efficiency at 350 mV [8]. The FFT unit was also shown to function correctly at a supply voltage of 180 mV.

However, performance degradation is a direct consequence of supply voltage reduction. In order to maintain the required throughput, dynamic voltage scaling (DVS) systems are used to adjust the supply voltage according to throughput requirements. Fig. 1 shows the overall architecture of a generic DVS system. The performance manager uses a software interface to predict performance requirements. Once the performance requirement for the next task is determined, the performance manager sets the voltage and frequency just high enough to accomplish the task. The target frequency is sent to the phase-locked loop (PLL) to accomplish frequency scaling. Based on the target voltage,

1063-8210/$25.00 © 2007 IEEE


the voltage regulator is programmed to scale the supply voltage up/down until the target voltage is achieved [9]–[12].

DVS is also effective in leakage power reduction [13]. Using two supply voltages, one for logic and one for storage elements, leakage power can be reduced. Both the combinational and sequential supply voltages utilize dynamic voltage scaling to save power during the active mode. During standby, the combinational supply voltage is collapsed (shut down) or put into sleep mode using sleep transistors [14]. Meanwhile, the sequential supply voltage is reduced to a level just high enough to retain the state of the system. Retaining the state saves the energy required to store and restore contents. Therefore, optimal power savings can be achieved.

Unlike conventional digital systems, for which characterization is performed at a certain operating voltage and frequency pair, DVS systems require characterization at least at the two ends of the operating range. In fact, characterization of DVS systems depends on the underlying voltage scaling methodology. The conventional approach to perform voltage scaling utilizes a one-to-one mapping of voltage to frequency. In order to guarantee robust operation, the frequency–voltage relationship is determined via chip characterization at worst case conditions. Throughout this paper, the worst case condition refers to the worst case delay at a particular voltage. This technique is utilized in open-loop DVS systems where the frequency–voltage pairs are stored in a lookup table (LUT) with enough built-in margin to cover for temperature and process variations.
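The open-loop lookup described above can be sketched as a small table search. This is only an illustration: the table contents and the function name `open_loop_voltage` are hypothetical, and a real table would come from worst case chip characterization.

```python
import bisect

# Hypothetical worst case characterization table: (frequency in MHz, voltage in V),
# sorted by frequency. The margin is assumed to be built into the voltages.
FV_TABLE = [(50, 0.9), (100, 1.1), (150, 1.4), (200, 1.8)]

def open_loop_voltage(target_mhz):
    """Return the lowest tabulated voltage that supports target_mhz."""
    freqs = [f for f, _ in FV_TABLE]
    idx = bisect.bisect_left(freqs, target_mhz)  # first entry with freq >= target
    if idx == len(FV_TABLE):
        raise ValueError("target frequency exceeds the characterized range")
    return FV_TABLE[idx][1]
```

Because the table holds worst case numbers, a fast chip at a cool temperature receives the same voltage as a slow, hot one, which is the inefficiency a feedback loop tries to recover.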

Such margin required by open-loop systems can be recovered by probing the actual on-chip conditions via a feedback loop mechanism. The closed-loop voltage scaling system utilizes on-chip circuit structures to provide the feedback required to adaptively track the actual silicon behavior. The critical path of the system can be duplicated [15] to form a ring oscillator or can be replaced by a fanout of four (FO4) ring oscillator [16] or a delay line [12]. Such a ring oscillator (or delay line) adapts to environmental and process variations. Since there is a direct relationship between the actual performance of the core and the speed of the ring oscillator, a closed-loop adaptive voltage scaling (AVS) system is formed to automatically adjust supply voltage to nearly the minimum level required to meet performance targets. A safety margin is added to account for any mismatch between the ring oscillator (or the delay line) and the actual critical path.

Different design parameters are involved when selecting between the open-loop and the closed-loop voltage scaling configurations. Stability against temperature fluctuations is a main design parameter. The conventional open-loop system stores the worst case performance numbers. Therefore, worst case process variation is covered and temperature stability is guaranteed. However, the large margin added to compensate for worst case process and temperature variations can reduce energy savings significantly.

The closed-loop system compensates for process and temperature variations by monitoring the activity of the critical path replica. However, using a single reference for the critical path in the feedback mechanism is becoming less feasible in modern deep submicrometer technologies. The large variations spread can cause the critical path to change from one process corner or one parasitic condition to another. A delay margin is required to maintain a fail-safe operation, requiring higher than normal supply voltages and reducing the energy savings achievable via voltage scaling.

II. ROBUSTNESS AND ENERGY EFFICIENT VOLTAGE SCALING

Energy efficiency of voltage scaling systems is often traded for robustness. The higher the margin required for fail-safe operation, the lower the efficiency of the voltage scaling system. When a unique critical path is identified and is guaranteed to remain critical at all conditions, it is sufficient to replicate that path. This replica includes the combination of gates and the interconnection wires forming the critical path. The critical path replica provides the closest behavior to the actual critical path except for intra-die variations and cross-coupling capacitances, which are difficult to duplicate. This difference was somewhat accounted for in [15], where two copies of the critical path are used. One of the two copies has a 3% margin for any mismatch with respect to the actual critical path. The two replicas are inserted in between flip-flops representing the longest single stage delay of the pipeline. A third path includes only two back-to-back flip-flops so that only clock delay is considered. The system operates by adjusting the supply voltage to guarantee that the replica without margin runs without timing violations while the replica with margin fails. As a result, supply voltage is adjusted to be just enough for correct functionality plus less than 3% margin. When supply voltage is too low for correct functionality, both replicas fail timing and a command is issued to a programmable voltage regulator to increase the voltage. On the other hand, when both replicas pass timing, the programmable voltage regulator is instructed to lower the supply voltage until only the replica without margin passes timing.
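The two-replica control rule described above reduces to a three-way decision per control step. The sketch below is a minimal illustration; the function name and the string return values are hypothetical, and the regulator interface is not modeled.

```python
def replica_step(plain_passes, margined_passes):
    """Decide the regulator action for one control step.

    plain_passes:    the replica without margin meets timing.
    margined_passes: the replica with the extra delay margin meets timing.
    """
    if not plain_passes:
        return "raise"   # supply too low for correct functionality
    if margined_passes:
        return "lower"   # both replicas pass, so voltage can still come down
    return "hold"        # only the replica without margin passes: target state
```

The loop settles in the "hold" state, where the supply sits between the voltage needed by the plain replica and the slightly higher voltage needed by the margined one.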

The previously described critical path replica technique relies on the fact that a single critical path remains the most critical at all conditions. The energy savings achievable via voltage scaling outweigh the additional 3%–5% delay margin required for fail-safe operation. However, selecting a unique critical path across all conditions is becoming a challenge as transistor dimensions are scaled. The spread of transistor and wire variations grows from one technology generation to the next. Moreover, the contribution of interconnect delay to the overall system delay increases. When several system paths have nearly the same delay while each one has a different blend of logic and interconnect delay, the process of selecting a unique critical path for the system becomes challenging.

The effect of both the process variations and the logic and interconnect contribution on delay can be illustrated using Fig. 2 for two path delays, one logic-dominated and the other interconnect-dominated, in a CMOS 0.18-μm technology. The interconnect-dominated path represents a global bus with repeaters optimally inserted to reduce the overall delay. The interconnect delay refers to the total delay of the driver plus the RC component of the wire. The logic-dominated path represents a datapath with a small contribution of interconnect delay. At a supply voltage of 1.8 V, the interconnect-dominated path has a longer delay. As voltage is scaled, the pure RC delay portion of the interconnect delay experiences almost no scaling while the driver delay scales normally with voltage. Consequently, logic delay scales faster with voltage than interconnect delay. At low supply voltage, the logic-dominated path becomes critical. Therefore, the critical path selection process should be performed at both ends of the scaled supply voltage range. Furthermore, a delay margin is required to maintain a unique critical path at all conditions. In Fig. 2, a 51% interconnect margin at 1.8 V is added to the interconnect-dominated path in order to make it the most critical across the entire supply voltage range. Alternatively, an approximately 48% delay margin can be added to the logic-dominated path. In both cases, an increase in supply voltage is required to avoid timing violations resulting from the extra margin included. Therefore, the range of supply voltage scaling becomes limited and energy savings decrease significantly.

Fig. 2. Delay margin required by conventional systems to compensate for the difference in voltage scaling behavior between logic and interconnects.
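The change of critical path with supply voltage can be reproduced with a toy model. The sketch below assumes an alpha-power-law form for gate delay and a voltage-independent RC term; the threshold voltage, velocity saturation index, and path delays are hypothetical values chosen only to show the crossover, not figures from the paper.

```python
VTH, ALPHA, V_NOM = 0.5, 1.3, 1.8   # assumed threshold, saturation index, nominal supply

def gate_scale(v):
    """Gate delay at supply v relative to V_NOM (alpha-power-law form)."""
    shape = lambda x: x / (x - VTH) ** ALPHA
    return shape(v) / shape(V_NOM)

def logic_path_delay(v):
    """Logic-dominated path: 0.9 ns at V_NOM, all of it gate delay."""
    return 0.9 * gate_scale(v)

def wire_path_delay(v):
    """Interconnect-dominated path: 0.5 ns fixed RC plus 0.5 ns driver delay at V_NOM."""
    return 0.5 + 0.5 * gate_scale(v)

def critical(v):
    """Name the longer of the two paths at supply v."""
    return "interconnect" if wire_path_delay(v) > logic_path_delay(v) else "logic"
```

With these numbers the interconnect-dominated path is critical at 1.8 V, while the logic-dominated path takes over at 0.9 V because the fixed RC portion does not grow as the supply is lowered.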

Not only the difference in voltage scaling behavior of logic and wire dominated paths but also process variations and environmental conditions add more complexity when designing voltage scaling systems. For example, a critical path at the slow corner may not remain critical at the fast process corner. As a result, conventional systems require enough safety margin to accommodate such variations while reliably scaling the supply voltage without causing a system failure. This margin is translated to a voltage overhead and an associated energy loss.

In order to reduce such margin, the Razor approach has been proposed in [17]. In this approach, an on-chip timing checker is used to check the time margin for a set of potential critical paths as shown in Fig. 3. The timing checker uses a delayed version of the system clock to capture the same data in a shadow latch that the main (master) flip-flop captures. The additional shadow latches are introduced where subcritical paths become critical. As supply voltage is scaled, the value latched in the master flip-flop can be different from that latched by the shadow latch, triggering an Error signal. The Error signals from all shadow latches are gathered to generate a single Error indicator. The Error is corrected by flushing the pipeline and reloading the state before the Error occurred. When the Error rate decreases below a certain limit, supply voltage is reduced till the point where the error is acceptable. However, certain applications may not allow any predictable failures such as those allowed by Razor. Moreover, in order to guarantee a robust operation, system characterization at all conditions is required. This may require an increased number of Razor latches. Therefore, the error probability may increase and the overhead of the error detection circuitry may also increase. The latency resulting from flushing the pipeline may negatively impact the overall performance when the error rate increases. The complexity of Razor also increases when the design has many critical paths close to each other in timing, requiring more shadow latches and resulting in an increased overhead and reduced efficiency.

Fig. 3. Razor approach to reduce the voltage margin dictated by worst case characterization.

This paper describes an AVS system that emulates the actual critical path at different process and parasitic conditions. By closely tracking the actual critical path, the large margin required to guarantee a fail-safe operation in conventional systems is minimized. A test chip is implemented and tested to verify the functionality of the proposed architecture. The rest of this paper is organized as follows. Section III describes the proposed architecture. The efficiency of the proposed architecture compared to the conventional voltage scaling systems is analyzed in Section IV. Experimental results are presented in Section V.

III. CRITICAL PATH EMULATOR ARCHITECTURE

The effect of the growing complexity of critical path selection on margining requirements can be mitigated if the voltage scaling system can track the most critical path at any given time. The critical path emulator (CPE) described in this paper attempts to effectively reduce conventional margining requirements by adapting to process variations and to the resulting effect on critical path timing. Using this information, a customized path delay that has almost the same scaling characteristics as the actual critical path on the chip can be constructed. The customized path is reprogrammed to track the critical paths on the chip. This way, the margin required by conventional systems can be reduced, leading to enhanced energy efficiency.

The fail-safe margin required can be generated through timing characterization of the core running under voltage scaling. The characterization process has to be performed to cover all combinations of the different process and interconnect splits. Furthermore, each corner requires characterization at the two ends of the supply voltage range and different temperatures. Instead of this lengthy and costly process, accurate modeling of both logic and interconnect delays is utilized.


A. Delay Modeling of Logic and Interconnects

Using accurate delay models, the critical path delay at different conditions and different target speeds can be predicted and the delay lines can be programmed accurately. The lengthy characterization process can be replaced by a simple, yet accurate, delay modeling process for logic and interconnects. In this paper, the delay model for both logic and interconnects is based on previously published models [18], [19]. Additionally, accurate modeling of the rising/falling input signals is used to further enhance the accuracy of the delay model.

Since the input ramp to one stage of the delay line reaches full scale voltage ( ) before the output reaches the point, the input ramp is considered a fast ramp. For the fast input ramp case, the output transition time, defined in [19] and [20], is given by

(4)

where is the load capacitance, is the maximum drain current at , is the drain saturation voltage at normalized by , and is the channel length modulation. The subscripts and refer to the pMOS and nMOS parameters, respectively.

Using the fast input ramp definition, the inverter delay model has been developed in [21] based on the alpha-power model [19] and the concept of the inverter step response. The velocity saturation index is considered to be unity in [21]. However, pMOS transistors usually have a velocity saturation index slightly larger than unity and greater than nMOS transistors for current CMOS technologies. In this paper, delay models introduced in [21] are generalized to include the nonunity velocity saturation index, , for better accuracy. Using the rise/fall time given in (4), the rising and falling delay times of an FO4 inverter for the fast input ramp case are expressed as

(5)

where , are the zero body-bias threshold voltage normalized by and and represent the input-to-output coupling capacitance for the pMOS and nMOS transistors, respectively.

HSPICE simulations are compared to (5) for an FO4 delay line implemented in a 0.18-μm CMOS technology. Fig. 4 shows that the maximum error between the delay model and simulations is 4%–5%. This small margin is taken into consideration when designing the emulator.

Fig. 4. Logic delay calculated using (5) versus HSPICE simulations of a 0.18-μm FO4 delay line.

The FO4 inverter delay model described by (5) is used to model the buffered interconnects. For the interconnect-dominated delay line, buffers are inserted at optimal distances to minimize the overall interconnect delay. In this case, the overall delay of the buffered wire is found to be proportional to the square root of the buffer delay [18]. Consequently, the interconnect delay is related to the buffer delay by the following relation:

(6)

where and are the resistance and capacitance per unit length of the wire.
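The square-root dependence stated above implies that an optimally buffered wire slows down less than its buffers as the supply is lowered. A minimal sketch of that proportionality follows; the function name and the reference values are hypothetical, and the wire resistance and capacitance are assumed to be independent of voltage.

```python
import math

def scaled_interconnect_delay(t_int_ref, t_buf_ref, t_buf_new):
    """Scale a buffered-wire delay assuming delay is proportional to sqrt(buffer delay).

    t_int_ref: buffered-wire delay at the reference condition.
    t_buf_ref: buffer delay at the reference condition.
    t_buf_new: buffer delay at the new supply voltage.
    """
    return t_int_ref * math.sqrt(t_buf_new / t_buf_ref)

# If the buffer delay quadruples at low voltage, the buffered wire only doubles.
t_wire_low_v = scaled_interconnect_delay(1.0, 50e-12, 200e-12)
```

This weaker scaling is the reason a path's blend of logic and interconnect delay determines how it tracks voltage.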

Using (5) and (6) to model voltage scaling behavior of both logic and interconnect delays takes into account process and interconnect variations. Using this model, the data required to emulate the critical path can be generated. An algorithm devised to generate the required data is described in detail as follows.

B. Algorithm

The algorithm used to generate the critical path emulation data is depicted in Algorithm 1. Logic speed and interconnect speed are used as indicators of process and interconnect variations, respectively. In order to take process variations into consideration, the entire logic speed range is divided into bins, with each bin equal to . Similarly, the interconnect speed bin is . In order to facilitate the subsequent discussion, a few terms are defined as follows.

• Worst case delay: The path delay at worst case process, 90% of the supply voltage, and worst case temperature (125 °C).

• Potential critical path: A path which becomes critical at a certain voltage or at a certain process, voltage, or temperature (PVT) corner.

• Logic speed: The actual on-chip logic speed. Logic speed is used to indicate how fast the actual process is compared to the slow corner.

• Interconnect speed: The actual on-chip interconnect speed. Interconnect speed is used to indicate the condition of the actual interconnect parasitics compared to the slowest corner.

• Interconnect delay ratio, : Ratio of the delay caused by the buffered interconnect wires in a certain path to the total delay of that path. The interconnect delay ratio for the top timing critical paths can be obtained from the static timing analysis (STA) of the chip. Both the logic and interconnect delays are reported by STA. This delay information is used to compute the interconnect delay ratio for each path.

• Target delay: The delay requirement specified by the system.
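The interconnect delay ratio defined in the list above is a simple quotient of the two delay components reported by STA. The sketch below illustrates it with a hypothetical report; the path names and delay values are made up for the example.

```python
def interconnect_delay_ratio(logic_delay, interconnect_delay):
    """Fraction of a path's total delay caused by buffered interconnect."""
    total = logic_delay + interconnect_delay
    if total <= 0:
        raise ValueError("path delay must be positive")
    return interconnect_delay / total

# Hypothetical STA report: path name -> (logic delay, interconnect delay) in ns.
sta_report = {"global_bus": (0.4, 0.6), "datapath": (0.85, 0.10)}
ratios = {name: interconnect_delay_ratio(logic, wire)
          for name, (logic, wire) in sta_report.items()}
```

A ratio near one marks an interconnect-dominated path and a ratio near zero a logic-dominated one.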

The algorithm is initialized at the worst case scenario (90% of the full supply voltage, worst case process, parasitics, and temperature). In addition to the maximum delay path, a set of potential critical paths is selected from STA and is normalized to the maximum delay. The interconnect delay ratio for each path is recorded. Delay models given by (5) and (6) are used to predict the voltage scaling behavior of each path in the set. The critical path emulator is constructed using multiple logic and interconnect delay elements. The total delay of these elements is chosen such that the total delay of the critical path and that of the emulator are equal and their interconnect delay ratios are equal.

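Algorithm 1 Critical path emulator

START:
    Find a set of potential critical paths
    For each path in the set:
        Compute the interconnect delay ratio
    for each logic speed bin do
        for each interconnect speed bin do
            Find the maximum delay path
            Find the subsequent potential critical paths
            while target delays remain do
                Compute the delay of each path
                Find the critical path when its delay equals the target delay:
                    Record its logic and interconnect delay pair
            end while
        end for
    end for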

The next step is to determine which numbers of logic and interconnect delay elements to use in emulating the actual critical path for the remaining target delays specified by the system's software. Each pair is specified for a given logic speed bin and interconnect speed bin. First, the delay of each path in the set of potential critical paths is computed using (5) and (6). Then, the path which has a delay equal to the target delay is selected. In this case, the delay of all other paths should be less than the target delay. Once the critical path is selected, its logic and interconnect delay pair is stored and copied to the LUT to be used for emulation. The same procedure is repeated for the next delay target. Once the generation of the pairs is completed, the data required for one process and interconnect corner is determined. The information needed for the entire variations matrix is determined by repeating the above procedure for all logic and interconnect splits. The resulting delay of the critical path emulator closely tracks that of the real critical path. More importantly, voltage scaling behavior is nearly the same for both the real critical path and its emulator.
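The table generation procedure can be sketched in a few lines. This is an illustrative reading of Algorithm 1, not the authors' implementation: a simple alpha-power-law stand-in with a voltage-independent RC term replaces the delay models (5) and (6), the search runs over a coarse voltage grid, and all parameter values, names, and the dictionary layout are hypothetical.

```python
VTH, ALPHA, V_NOM = 0.5, 1.3, 1.8   # assumed device parameters

def gate_scale(v):
    """Gate delay at supply v relative to V_NOM (stand-in for the logic delay model)."""
    shape = lambda x: x / (x - VTH) ** ALPHA
    return shape(v) / shape(V_NOM)

def path_delay(path, v, logic_speed, wire_speed):
    """path = (gate_delay, rc_delay) in ns at V_NOM and the slowest corner."""
    gate, rc = path
    return gate * gate_scale(v) / logic_speed + rc / wire_speed

def build_lut(paths, logic_bins, wire_bins, targets, v_grid):
    """For every speed bin and target delay, record the critical path's delay pair."""
    lut = {}
    for i, ls in enumerate(logic_bins):
        for j, ws in enumerate(wire_bins):
            for t in targets:
                # Grid voltages at which every candidate path meets the target.
                ok = [v for v in v_grid
                      if max(path_delay(p, v, ls, ws) for p in paths) <= t]
                if not ok:
                    continue          # target unreachable in this bin
                v_min = min(ok)
                # The slowest path at that voltage is the one to emulate.
                crit = max(paths, key=lambda p: path_delay(p, v_min, ls, ws))
                lut[(i, j, t)] = (v_min, crit)
    return lut

paths = [(0.9, 0.0), (0.5, 0.5)]                     # logic- and wire-dominated
v_grid = [round(0.9 + 0.1 * k, 1) for k in range(10)]  # 0.9 V to 1.8 V
lut = build_lut(paths, [1.0, 1.2], [1.0], [1.03, 2.0], v_grid)
```

In this toy setup the wire-dominated path is recorded for the tight target and the logic-dominated path for the relaxed one, so the stored pair changes with the target delay, and the faster logic bin reaches the relaxed target at a lower grid voltage.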

Fig. 5. Delay of the critical path emulator adapts to the delay of all other pathsfor the entire voltage range at both slow and fast process corners.

Algorithm 1 is verified by applying the different steps to a set of path delays with different logic and interconnect delay contributions. The CMOS 0.18-μm technology parameters are used in the experiment. The worst case delay path with an is selected. The effect of interconnect delay on the selection of a unique critical path is illustrated through the examination of a set of potential critical paths which have lower (more logic delay). Since potential critical path delays scale faster with voltage, a margin proportional to is required. Logic and interconnect speeds are divided into five regions each. Applying Algorithm 1 and using , the logic and interconnect delay information required for the 5 × 5 LUT matrix is generated. The delay lines are programmed using 5 bits for logic and interconnect delays (e.g., ). Considering four target frequencies and 5 bits for programming both the logic and interconnect delay lines, 1000 bits are required to construct the CPE LUT matrix. Additionally, 100 bits are required for the process monitor assuming 10 bits are used to represent each process and interconnect corner. Therefore, approximately 1.1 kbits of memory are required to build all the LUTs.
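The memory estimate can be checked with a short calculation. The breakdown below is one reading consistent with the stated totals; in particular, counting the process monitor storage as ten 10-bit entries (five logic bins plus five interconnect bins) is an assumption.

```python
LOGIC_BINS = 5          # logic speed regions
WIRE_BINS = 5           # interconnect speed regions
TARGET_FREQS = 4        # target frequencies per LUT
BITS_PER_LINE = 5       # programming bits per delay line
DELAY_LINES = 2         # one logic line, one interconnect line
MONITOR_BITS_PER_BIN = 10

# 25 LUTs, each holding 4 entries of 5 + 5 bits.
lut_bits = LOGIC_BINS * WIRE_BINS * TARGET_FREQS * BITS_PER_LINE * DELAY_LINES
# Assumed: one 10-bit reference value per logic bin and per interconnect bin.
monitor_bits = (LOGIC_BINS + WIRE_BINS) * MONITOR_BITS_PER_BIN
total_bits = lut_bits + monitor_bits
```

The total of 1100 bits matches the approximately 1.1 kbits quoted in the text.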

Fig. 5 shows delays of the potential critical paths and the critical path emulator after applying Algorithm 1 at both the slow and fast corners. For both process corners, the critical path emulator, shown as a solid curve, has an approximately 10% safety margin above all the other paths for the entire supply range. Such safety margin corresponds to a voltage margin (Vmargin) since the critical path emulator operates at a higher voltage to achieve the same delay as the actual critical path. This safety margin is a design parameter that can be adjusted according to the design complexity and requirements.

C. Architecture

The CPE system uses an on-chip process variations indicator to program two delay lines, one for logic and one for interconnect, to emulate the real critical path on the chip at different performance points. Probing the on-chip process and interconnect variations is achieved by measuring logic-dominated and interconnect-dominated ring oscillator frequencies. The logic ring oscillator frequency provides insight into the transistor process variations while the interconnect-dominated ring oscillator provides information about the back-end process variations. A low-power high-resolution analog-to-digital (A/D) converter [22], [23] is used to determine the logic speed as shown in Fig. 6. An FO4 inverter is used as the main delay cell since its voltage scaling characteristics are similar to those of most static CMOS logic gates [24]. The frequency of the ring oscillator is sampled using a slow frequency clock (CLKin). When CLKin goes LOW, the ring oscillator is enabled and the counter starts to count the ring oscillator cycles. When CLKin goes HIGH, the ring oscillator is disabled and the counter holds the frequency count. The output of the counter represents the number of edges that propagated through the ring during the sampling period. This represents the high-order bits of the logic speed vector. By latching the internal state of the ring at the end of the sampling period, the converter logic block determines how far an edge has traversed through the ring to determine the lower order bits of the frequency count. Similarly, interconnect speed is also measured using buffered interconnect segments. The logic-dominated frequency measurement is subtracted from the interconnect-dominated frequency to extract the portion related to the RC delay. In order to avoid device mismatch between logic and interconnect buffers, the arrangement shown in Fig. 6 is used. The selection logic is constructed using a NAND–NAND configuration in order to track the FO4 inverter delay.

Fig. 6. Logic and interconnect low-power high-resolution A/D.
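The way the counter output and the latched ring state combine into one speed code can be sketched as follows. The function name and the parameter `positions_per_cycle` (the number of distinguishable edge positions per counted cycle) are hypothetical; the decoding of the latched ring state into a position is not modeled.

```python
def speed_code(cycle_count, edge_position, positions_per_cycle):
    """Combine the coarse counter value with the fine edge position.

    cycle_count:         counted ring oscillator cycles (high-order part).
    edge_position:       how far the edge travelled in the unfinished cycle.
    positions_per_cycle: number of resolvable positions in one cycle.
    """
    if not 0 <= edge_position < positions_per_cycle:
        raise ValueError("edge position out of range")
    return cycle_count * positions_per_cycle + edge_position
```

When `positions_per_cycle` is a power of two, this is the same as concatenating the counter bits with the position bits; for example, five full cycles plus position 3 of 8 give the code 43.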

When measuring the ring oscillator frequency, process variations and temperature are the major parameters affecting the measurement. In order to eliminate the effect of temperature on the estimation process, supply voltage is adjusted such that performance is almost temperature independent [25], [26]. At this voltage, the temperature effect on delay is minimized, leaving process and interconnect variations as the only factors affecting performance [27]. For example, Fig. 7 shows the simulated frequency versus voltage characteristics for a logic path at different splits in a 0.13-μm CMOS process. Process identification is difficult to accomplish at high voltages due to the larger influence of temperature on performance at high voltages. For example, at 1.5 V, performance for the fast process at hot temperature (125 °C) is almost the same as that of the typical process corner at cold temperature (−40 °C). Therefore, it is better to fix the temperature at a certain level in order to identify the process corner during calibration. Temperature adjustment adds extra time and cost to the calibration process. By setting the supply voltage to the temperature insensitive point (approximately 1.0 V in Fig. 7), the extra calibration time required for temperature adjustment can be saved.

Fig. 7. Path frequency scaling with voltage for different process splits and different temperatures.

The CPE architecture is shown in Fig. 8. The process variations are estimated using the logic/interconnect A/D described before. The ring oscillator frequency is directly correlated to the speed of the devices on the chip. For example, when the nMOS devices are 10% faster and the pMOS devices are 10% slower than nominal, the ring oscillator frequency approximates that at nominal performance, which is the actual chip performance. Therefore, there is no need to identify the speed of the nMOS and pMOS devices individually. The logic frequency measurement is compared against prestored values in the logic speeds register bank. Based on this comparison, the appropriate selection line in the logic speed vector ( ), where is the number of logic splits, is activated to enable a row in the LUT matrix . Similarly, the interconnect bin , is determined. The values of and are used to select a single LUT from the LUT matrix that contains the information required to program the delay lines. The selected LUT is a set of storage elements used to store the number of logic delay elements and interconnect delay elements necessary to program the logic and interconnect delay lines, respectively. By using a similar blend of logic and interconnect delay to that of the actual critical path, voltage scaling characteristics become nearly equivalent.
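The selection step described above, comparing a measurement against prestored values and activating one select line, can be sketched as follows. The threshold values, the function names, and the matrix contents are hypothetical placeholders.

```python
import bisect

def speed_bin(measurement, thresholds):
    """Return the bin index of a frequency count given ascending bin boundaries."""
    return bisect.bisect_right(thresholds, measurement)

def one_hot(index, width):
    """Select-line vector with exactly one active line."""
    return [1 if k == index else 0 for k in range(width)]

# Hypothetical prestored boundaries: four thresholds define five bins.
LOGIC_THRESHOLDS = [100, 110, 120, 130]
WIRE_THRESHOLDS = [40, 45, 50, 55]

# Hypothetical 5 x 5 LUT matrix; each entry stands for one stored LUT.
lut_matrix = [[f"LUT[{i}][{j}]" for j in range(5)] for i in range(5)]

i = speed_bin(115, LOGIC_THRESHOLDS)   # logic frequency count falls in bin 2
j = speed_bin(38, WIRE_THRESHOLDS)     # interconnect count falls in bin 0
selected = lut_matrix[i][j]
row_select = one_hot(i, 5)
```

Only one row and one column are active at a time, so a single LUT drives the delay line programming.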

The actual critical path delay is emulated using two programmable delay lines using the configuration shown in Fig. 9. A similar approach was reported in [28] where NAND gates with nominal and long channel transistors were used as the basic logic delay cell. A wire delay line and an edge rate selector were used to emulate the wire delay. However, adapting to a changing critical path as a result of process and parasitic variations was not considered in [28]. In this paper, the basic logic delay element used to emulate the logic delay is the FO4 inverter due to the small difference in voltage scaling behavior of the FO4 inverter and other logic gates [24]. The interconnect delay element is a long, minimum width and spacing interconnect with repeaters. The coupling capacitance of the long wire to its neighbors is maximized by switching the neighboring wires opposite to the main wire. Therefore, worst case coupling is accounted for. The drivers are inserted at an optimal distance to reduce the overall delay [18]. The logic and the interconnect delay lines are programmed using the -bit and -bit vectors, respectively. The appropriate number of delay cells is configured using a multiplexer as shown in Fig. 9. The multiplexers are implemented using a static CMOS configuration in order to maintain approximately the same voltage scaling behavior as the FO4 inverter. Therefore, the multiplexer delay is considered part of the logic delay.

Fig. 8. Critical path emulator architecture.

Fig. 9. Implementation of the programmable logic and interconnect delay lines.

IV. ENERGY EFFICIENCY ANALYSIS

The delay margin required by voltage scaling systems to maintain error-free operation is key to determining the overall energy efficiency. The smaller the required delay margin, the smaller the voltage margin (Vmargin), as shown in Fig. 5, and the greater the energy savings. By closely tracking the real critical path under all conditions, the critical path emulator system offers higher energy efficiency without compromising robustness.

Conventionally, the reference path is selected at the slow process corner and slowest interconnect parasitics. Therefore, conventional open-loop systems require a delay margin to compensate for three factors: process variations, temperature fluctuations, and the difference in voltage scaling characteristics between logic and interconnects. The margin due to global (inter-die) process variations is considered in the following analysis. Energy saved by adapting to global variations can reach up to 24% when considering a three-sigma distribution and five process splits [27]. Conversely, the margin required to cover local (intra-die) variations is considered common to all three voltage scaling systems analyzed below and is factored out of the delay margin calculations. Similarly, the delay margin required to cover variations caused by temperature fluctuations is considered common and is excluded from the delay margin analysis. Nevertheless, these common factors should be accounted for when designing each individual system. For example, given the temperature profile of the chip, closed-loop systems can be made more efficient by placing the performance monitor near the hot spot. When multiple hot spots are separated by physically large distances, multiple performance monitors can be placed to cover the entire temperature profile and minimize the delay margin required for compensation. Counters can be used to sample the frequency of the different performance monitors. In this case, the smallest frequency count is used to adjust the chip voltage to cover the delay degradation resulting from the highest temperature on the chip.
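The multi-monitor policy described above reduces to taking the minimum over the sampled counts. A minimal sketch, with hypothetical function names and count values:

```python
def worst_case_count(monitor_counts):
    """Return the smallest frequency count among all performance monitors.

    The slowest monitor corresponds to the hottest (slowest) region, so it
    sets the voltage requirement.
    """
    if not monitor_counts:
        raise ValueError("at least one monitor count is required")
    return min(monitor_counts)


def needs_voltage_increase(monitor_counts, target_count):
    """True when the slowest monitor falls short of the target count."""
    return worst_case_count(monitor_counts) < target_count


if __name__ == "__main__":
    counts = [210, 198, 205]  # hypothetical counts from three monitors
    print(worst_case_count(counts), needs_voltage_increase(counts, 200))
```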

Utilizing a closed-loop feedback mechanism enables the voltage scaling system to compensate for process variations. Therefore, a replica of the critical path can be sufficient to emulate a logic-dominated path as long as it remains the only critical path. However, as the interconnect delay ratio of the reference path increases, the probability that a logic-dominated path becomes critical at low voltage increases. Similar to Fig. 2, the margin required to cover such a situation can be quantified by noting that the delay of the logic path plus margin at the minimum supply voltage should be at least equal to the interconnect path delay. This relationship can be expressed as

(7)

where the two delay terms are the logic and interconnect delays at the minimum supply voltage, respectively. The numerator and the denominator represent the reference path delay plus margin and the logic-dominated path delay, respectively. Both are divided into a logic delay portion and an interconnect delay portion. The margin is added in terms of logic delay to guarantee that the overall reference delay remains the most critical at the low end of the supply scale. Simplifying (7), the delay margin can be expressed in terms of the interconnect delay ratio as

(8)

For open-loop systems, process variations should be accounted for by adding an extra margin. Therefore, the total delay margin required to cover the worst case becomes

(9)

where the added term is the margin required to cover process variations. Using (8) and (9), the estimated delay margin required by the conventional open-loop and closed-loop systems to accommodate the worst case delay scenario is computed and plotted in Fig. 10 for both the 0.18- and 0.13-µm technologies. The open-loop system requires approximately 48% margin to accommodate transistor and RC variations. For closed-loop systems, including the CPE, such margin is negligible for a pure logic delay. As the interconnect delay ratio increases, the delay margin increases. For the CPE system, the margin increases slightly, while the open-loop and the closed-loop margins increase significantly. Due to the larger impact of process variations at the 0.13-µm node, a larger delay margin than that of the 0.18-µm node is required. For example, when the reference path delay is mainly due to optimally buffered interconnects, the delay margin required is 73% and 78% for the 0.18- and the 0.13-µm technologies, respectively.

The 5% margin required by the CPE system for logic-dominated paths covers the mismatch between the delay model described by (5) and the actual critical path. Additional margin is required with increasing interconnect delay ratio to cover the digitization of process splits. For example, when using three process splits (slow, typical, and fast), a slower-than-typical split is treated as a slow split by the CPE system. Due to such digitization, the CPE system can be programmed to track a certain path while the actual critical path is different. The margin required to cover 10-split digitization is calculated using (8). Fig. 10 shows that the total delay margin required by the CPE is approximately 11% and 12% for the 0.18- and the 0.13-µm technologies, respectively.

Fig. 10. Delay margin required by voltage scaling systems as a function of interconnect delay ratio of the reference path.

Energy efficiency of the three voltage scaling systems is extracted from the delay margin required for robust operation. Using (1) and (2) and the fact that energy dissipation is equivalent to power dissipation per clock cycle, the energy reduction of the CPE system compared to conventional voltage scaling systems is given by

Energy Reduction

(10)

where the two reduction terms are the dynamic and leakage energy reduction of the CPE system with respect to the conventional system, respectively. The two supply voltages in (10) are those required to achieve the target delay in the conventional and CPE systems, respectively, and the remaining technology parameter is taken from [29]. The weighting factor is the ratio of dynamic to leakage power dissipation of the system. Note that this ratio becomes smaller as voltage is scaled, because dynamic power is reduced quadratically while leakage power is reduced linearly with voltage, as can be seen from (1) and (2), respectively. For simplicity, the ratio is taken to be constant across the entire supply range. This assumption is valid for the 0.18- and the 0.13-µm technologies due to the relatively small leakage power, but it may underestimate leakage power for smaller feature sizes.
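The structure of this calculation can be illustrated with a simplified model. The sketch below is not equation (10): it assumes only what the text states, namely that dynamic energy per cycle scales quadratically with supply voltage, that leakage energy per cycle scales linearly, and that the dynamic-to-leakage ratio `k` is constant. The technology parameter from [29] is omitted, and the function name and example voltages are hypothetical.

```python
def energy_reduction(v_conv, v_cpe, k):
    """Fractional energy saved by running at v_cpe instead of v_conv.

    Simplified model (an assumption, not the paper's equation (10)):
      dynamic energy  ~ V^2, leakage energy ~ V,
      k = dynamic / leakage energy at v_conv, held constant.
    """
    if v_conv <= 0 or v_cpe <= 0 or k < 0:
        raise ValueError("voltages must be positive and k non-negative")
    r = v_cpe / v_conv
    dyn_reduction = 1.0 - r ** 2   # dynamic energy reduction
    leak_reduction = 1.0 - r       # leakage energy reduction
    # Weight each reduction by its share of the conventional system's energy.
    return (k * dyn_reduction + leak_reduction) / (k + 1.0)


if __name__ == "__main__":
    # Hypothetical supply voltages; k = 10 means dynamic energy dominates.
    print(f"{energy_reduction(1.8, 1.6, 10.0):.3f}")
```

In this model a larger `k` pushes the result toward the quadratic dynamic term, which is consistent with the remark that growing leakage reduces the benefit of supply scaling.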

Using the 0.18-µm CMOS technology parameters, the energy efficiency of the proposed system compared to conventional open-loop and closed-loop voltage scaling systems is computed using (10) and plotted in Fig. 11. The CPE system is up to 39% and 24% more energy efficient than conventional open-loop and closed-loop systems, respectively. When the energy efficiency obtained with the 0.13-µm technology parameters is compared to that of the 0.18-µm technology, a decrease of about 2%–4% is observed. This is a result of the increased contribution of the leakage power component to the overall power dissipation as the technology scales. Therefore, it can be predicted that the effectiveness of supply scaling in reducing overall power dissipation decreases with technology scaling [30].

Fig. 11. Energy efficiency of the proposed architecture compared to the conventional open-loop and closed-loop systems as a function of interconnect delay ratio of the reference path.

V. EXPERIMENTAL RESULTS

A test chip is implemented in a six-metal 0.18-µm CMOS technology to validate the CPE architecture against two different delay paths: an interconnect-dominated path and a logic-dominated path. For the 0.18-µm technology used, the typical delay of 1 mm of M6 (top layer) wire is estimated to be approximately 50 ps. Since the FO4 inverter delay at the typical process corner is approximately 100 ps, a 5-mm wire length is suitable for estimating the effect of interconnect delay with reasonable accuracy.

The schematic of the CPE test chip is shown in Fig. 12. Due to the lack of accurate precharacterization silicon results, shift registers are used to construct the LUT matrix of the CPE system instead of a ROM. The LUTs are initially loaded with post-layout simulation data, which is fine-tuned using the actual silicon data obtained after measurements. The logic/interconnect estimator includes a 9-bit counter used to measure the logic and interconnect ring oscillator speeds by counting the number of edges every 1 µs. Ten 9-bit registers are used to store the digitized split information for five process and five RC splits. The ring oscillator is first configured to measure logic speed. The frequency count is then compared to the logic speed information stored in the LUT to determine the process split. Similarly, interconnect speed is identified by configuring the ring oscillator into the interconnect speed measurement mode. Using the process and interconnect information, a specific LUT in the 5 × 5 matrix is selected and the delay lines are programmed based on the data stored for each target delay.

Fig. 12. Schematic for the CPE test chip.

Fig. 13. Die photo for the CPE test chip.
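The counter-based speed measurement can be modeled as below. This is a sketch under assumptions: one counted edge per oscillator period, a counter that saturates rather than wraps at its 9-bit limit, and integer arithmetic in hertz and nanoseconds. The function name and the example frequencies are hypothetical.

```python
def ring_oscillator_count(freq_hz, window_ns=1000, bits=9):
    """Number of oscillator edges counted in one measurement window.

    Assumes one counted edge per period and a saturating counter
    (both are modeling assumptions). Integer math avoids rounding error.
    """
    if freq_hz < 0 or window_ns <= 0 or bits <= 0:
        raise ValueError("invalid measurement parameters")
    edges = freq_hz * window_ns // 1_000_000_000
    max_count = (1 << bits) - 1  # 511 for a 9-bit counter
    return min(edges, max_count)


if __name__ == "__main__":
    # A hypothetical 200-MHz ring oscillator sampled for 1 us gives 200 counts.
    print(ring_oscillator_count(200_000_000))
```

Under these assumptions a 9-bit counter with a 1-µs window resolves the oscillator frequency in 1-MHz steps up to 511 MHz, which bounds how finely the process splits can be separated.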

A 16 × 16-bit unsigned multiplier is used as a test vehicle to verify the functionality of the CPE architecture. All 32 inputs of the multiplier are tied together to a single input, which is synchronized with the system clock (CLK). The same input drives the programmable delay line. This input toggles its value every clock cycle. Accordingly, the inputs to the multiplier switch from all zeros to all ones, back to all zeros, and so on, which guarantees that the critical path in the multiplier is exercised. Only two bits of the multiplier output are monitored in the verification process. The least significant bit has the shortest logic delay. An interconnect delay is added to this output bit to mimic an interconnect-dominated path, as shown in Fig. 12. The second output to be emulated is the most significant bit. This path represents a logic-dominated path delay and scales differently with voltage compared to the least significant bit path. Both multiplier outputs and the CPE output are latched using the system clock.


Fig. 14. Measured versus back-annotated logic ring oscillator frequency.

The die photo is shown in Fig. 13. The layout dimensions are 0.9 mm × 0.9 mm excluding the pad ring. The CPE delay lines occupy an area of approximately 0.35 mm × 0.45 mm. Such an area is required to accommodate slow frequency emulation by replicating the logic and interconnect unit delays as necessary to cover the entire frequency range. For medium to high frequency systems, however, the area of the CPE delay lines is expected to go down significantly since the full range of delay emulation will be smaller. Furthermore, the area required by the register banks used for building the LUTs will be significantly reduced when they are replaced by read-only memory (ROM) in production chips.

The result of the process binning step is shown in Fig. 14. The measured frequency of the logic ring oscillator is shown to be faster than the back-annotated typical frequency. On a five-split process space, this frequency is mapped to the fourth process corner. Similarly, the interconnect corner is identified through the measurement of the interconnect-dominated ring oscillator.

The functionality of the CPE architecture is verified at different supply voltage points, and the ability of the CPE output to track the 16 × 16 multiplier is evaluated at each point. At each target supply voltage, the frequency of operation of the multiplier is determined by increasing the system frequency gradually until the output flip-flops latch an incorrect value. The “Error” indicator shown in Fig. 12 detects the timing relationship between the multiplier and the CPE outputs. When “Error” is LOW, the values captured in the multiplier and the CPE latches are the same. “Error” goes HIGH when the latched values are different, indicating that the CPE exhibited a longer than necessary delay and failed to track.
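The behavior of the “Error” indicator can be mimicked with a toy timing model. The sketch assumes that a path latches the correct value only if its delay fits within the clock period (setup time ignored) and that “Error” is the mismatch between the two latched values; the function names and delay values are hypothetical, and the model does not capture the actual circuit in Fig. 12.

```python
def latched_correctly(path_delay_ns, clk_period_ns):
    """A path latches the new value only if it settles within one period."""
    return path_delay_ns <= clk_period_ns


def error_flag(mult_delay_ns, cpe_delay_ns, clk_period_ns):
    """True (HIGH) when the multiplier and CPE latches capture different values.

    With an input that toggles every cycle, a path that misses the clock edge
    holds the stale (opposite) value, so the flag is the XOR of the two
    pass/fail results.
    """
    return (latched_correctly(mult_delay_ns, clk_period_ns)
            != latched_correctly(cpe_delay_ns, clk_period_ns))


if __name__ == "__main__":
    # Hypothetical delays: multiplier 10.0 ns, CPE 10.5 ns.
    for period_ns in (11.0, 10.2, 9.0):
        print(period_ns, error_flag(10.0, 10.5, period_ns))
```

Note that in this toy model the flag returns LOW again once both paths fail, so it is only meaningful while the clock period is swept gradually from the passing side.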

The measurement arrangement is shown in Fig. 15(a) and the timing diagram is shown in Fig. 15(b). The input is toggled every clock cycle and the outputs switch at half the clock frequency. For example, when the system clock is 50 MHz, the outputs switch at 25 MHz. On the test chip, the outputs of the CPE, the logic path, and the interconnect path before the flip-flops are observed off chip. The mismatch between the three different paths is considered in the layout to minimize the sources of error in the delay measurement. The “Error” indicator helps validate the tracking ability of the CPE system as described before. Furthermore, the phase error between the unlatched multiplier outputs and the CPE output, taken as the reference, is measured. A positive phase error indicates that the CPE output is leading (faster than) the multiplier output and that the margin added to the CPE output is not sufficient.

Fig. 15. (a) Schematic for the CPE test chip and (b) the associated waveforms.

The measured results for the CPE output (CH4) and the multiplier output (CH2) are shown in Fig. 16. The CPE delay tracking with the multiplier output is measured at different supply voltages and different operating frequencies. At 1.8 V and 45-MHz switching frequency (CLK frequency is 90 MHz), the phase error between the CPE and the multiplier output signals is measured directly using the digital scope, as shown in Fig. 16(a). By programming the CPE delay lines, the phase error is adjusted to approximately 1.2 degrees. The CPE output has a safety margin over the multiplier output. Therefore, the multiplier output is leading the CPE output in phase. This safety margin is a design parameter and can be controlled using the CPE delay lines. Due to the limitations imposed by the package used for testing the chip, the range of operating frequency is limited to 90 MHz (45-MHz switching frequency), as shown in Fig. 16(a). If the CPE output is leading the multiplier, as in the case shown in Fig. 16(c), the CPE delay lines can be reprogrammed to track the multiplier output by adding more margin to the CPE output. By reprogramming the delay lines of the CPE system, the multiplier delay can be tracked at different supply voltages, as shown in Fig. 16(b) and (d) for 1.4 and 0.9 V. The measurement of the phase error was associated with some jitter, and the snapshots shown in Fig. 16 include a single value for the phase error. The jitter effect was limited to around 1–2 degrees. The actual phase error value can reach up to 4–5 degrees, which is translated to a maximum of 5% frequency margin.
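For readers who prefer time units, a phase reading can be converted to a time offset. The sketch assumes the phase is measured relative to the period of the switching outputs (45 MHz in the 1.8-V case) rather than the CLK period; that reference choice and the function name are assumptions.

```python
def phase_to_time_ps(phase_deg, switching_freq_mhz):
    """Convert a phase error in degrees to a time offset in picoseconds.

    Assumes the phase is referenced to the output switching period.
    """
    if switching_freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    period_ps = 1e6 / switching_freq_mhz  # 1 / (f in MHz) expressed in ps
    return phase_deg / 360.0 * period_ps


if __name__ == "__main__":
    # 1.2 degrees at a 45-MHz switching frequency is roughly 74 ps.
    print(f"{phase_to_time_ps(1.2, 45.0):.1f} ps")
```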

The measured current consumption of the CPE architecture is shown in Fig. 17. The power consumption of the CPE is a function of frequency (the CPE delay lines switch at the same frequency as the main system). For example, at a frequency of 100 MHz, the current dissipated by the CPE system is 3 mA. Using the linear relationship between frequency and power, the current dissipated at a frequency of 1 GHz is expected to be approximately 30 mA at the same voltage. In fact, the 30-mA current dissipation is an upper limit, considering that the delay lines required to emulate the 1-GHz system are smaller than those required to emulate the 100-MHz system. In this example, the CPE overhead for a low-power, 1-GHz system is approximately 3%, assuming that the overall system power dissipation is 2 W at 1.8 V. This overhead becomes smaller for systems with higher power dissipation. Furthermore, the current consumed by the CPE architecture is shown to scale well with the supply voltage. Therefore, the power dissipation overhead remains approximately constant across the entire supply voltage range.

Fig. 16. Measured results of the CPE test chip. In each waveform plot, the top waveform (CH2) and the bottom waveform (CH4) represent the unlatched multiplier output and the CPE output, respectively. (a) V = 1.8 V, frequency = 45 MHz. (b) V = 1.4 V, frequency = 35 MHz. (c) V = 1.0 V, frequency = 15 MHz. (d) V = 900 mV, frequency = 10 MHz.

Fig. 17. Measured current dissipation of the CPE architecture.
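The overhead estimate above can be reproduced with a short calculation. The sketch uses the figures quoted in the text (3 mA at 100 MHz, 1.8 V, 2-W system) and the stated linear scaling of current with frequency; the function name `cpe_overhead` is hypothetical.

```python
def cpe_overhead(i_ref_ma, f_ref_mhz, f_target_mhz, vdd_v, system_power_w):
    """Return (CPE current in mA at the target frequency, power overhead fraction).

    Current is scaled linearly with frequency at constant supply voltage,
    which the text describes as an upper bound.
    """
    i_target_ma = i_ref_ma * f_target_mhz / f_ref_mhz
    p_cpe_w = i_target_ma * 1e-3 * vdd_v
    return i_target_ma, p_cpe_w / system_power_w


if __name__ == "__main__":
    current_ma, overhead = cpe_overhead(3.0, 100.0, 1000.0, 1.8, 2.0)
    print(f"{current_ma:.0f} mA, overhead = {overhead:.1%}")
```

With these inputs the CPE draws 30 mA, or 54 mW at 1.8 V, which is 2.7% of a 2-W system and matches the approximately 3% quoted.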

VI. CONCLUSION

The large safety margin required by conventional voltage scaling systems to guarantee robust operation, even when the critical path changes, translates directly into a voltage overhead and a corresponding energy inefficiency. A critical path emulator is shown to closely track the actual critical path under all conditions, which yields higher energy efficiency than conventional systems. Such close tracking is achievable across different process and interconnect parasitic corners. As a result, the CPE is up to 39% and 24% more energy efficient than conventional open-loop and closed-loop systems, respectively. Experimental results show how the CPE system can be programmed to minimize the margin required when a logic-dominated path and an interconnect-dominated path are to be tracked simultaneously.

ACKNOWLEDGMENT

The authors would like to thank M. Nummer of the University of Waterloo and A. Fahim and I. Kang of Qualcomm Inc. for their enlightening discussions, and L. Chua and D. Kelley for facilitating chip testing.

REFERENCES

[1] A. Chatterjee, M. Nandakumar, and I. Chen, “An investigation of the impact of technology scaling on power wasted as short-circuit current in low voltage static CMOS circuits,” in Proc. Int. Symp. Low-Power Electron. Design, 1996, pp. 145–150.

[2] T. Sakurai, “Perspectives on power-aware electronics,” in Proc. IEEE Solid-State Circuits Conf., 2003, pp. 26–29.

[3] P. Gelsinger, “Gigascale integration for teraops performance: Challenges, opportunities, and new frontiers,” in Proc. Design Autom. Conf., 2004, p. XXV.

[4] A. Chandrakasan and R. Brodersen, “Minimizing power consumption in digital CMOS circuits,” Proc. IEEE, vol. 83, no. 4, pp. 498–523, Apr. 1995.

[5] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703–1714, Nov. 1996.

[6] R. Swanson and J. Meindl, “Ion-implanted complementary MOS transistors in low-voltage circuits,” IEEE J. Solid-State Circuits, vol. SC-7, no. 2, pp. 146–153, Apr. 1972.

[7] J. Meindl, “Low power microelectronics: Retrospect and prospect,” Proc. IEEE, vol. 83, no. 4, pp. 619–635, Apr. 1995.

[8] A. Wang and A. Chandrakasan, “A 180-mV subthreshold FFT processor using a minimum energy design methodology,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 310–319, Jan. 2005.

[9] A. Stratakos, S. Sanders, and R. Brodersen, “A low-voltage CMOS DC-DC converter for a portable battery-operated system,” in Proc. IEEE Power Electron. Specialists Conf. (PESC), 1994, pp. 619–626.

[10] G. Wei and M. Horowitz, “A fully digital, energy-efficient adaptive power-supply regulator,” IEEE J. Solid-State Circuits, vol. 34, no. 4, pp. 520–528, Apr. 1999.

[11] A. Dancy et al., “High-efficiency multiple-output DC-DC conversion for low-voltage systems,” IEEE J. Solid-State Circuits, vol. 8, no. 3, pp. 252–263, Jun. 2000.

[12] J. Kim and M. Horowitz, “An efficient digital sliding controller for adaptive power-supply regulation,” IEEE J. Solid-State Circuits, vol. 37, no. 5, pp. 639–647, May 2002.

[13] B. Calhoun and A. Chandrakasan, “Standby power reduction using dynamic voltage scaling and canary flip-flop structures,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1504–1511, Sep. 2004.

[14] S. Mutoh et al., “1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS,” IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 847–854, Aug. 1995.

[15] T. Kuroda et al., “Variable supply-voltage scheme for low-power high-speed CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, no. 3, pp. 454–462, Mar. 1998.

[16] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A dynamic voltage scaled microprocessor system,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000.

[17] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Proc. Micro Conf., 2003, pp. 7–18.

[18] R. Ho, K. Mai, and M. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.

[19] T. Sakurai and A. Newton, “Delay analysis of series-connected MOSFET circuits,” IEEE J. Solid-State Circuits, vol. 26, no. 2, pp. 122–131, Feb. 1991.

[20] T. Sakurai and A. Newton, “Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas,” IEEE J. Solid-State Circuits, vol. 25, no. 4, pp. 584–593, Apr. 1991.

[21] J. Daga and D. Auvergne, “A comprehensive delay macro modeling for submicrometer CMOS logics,” IEEE J. Solid-State Circuits, vol. 34, no. 1, pp. 42–55, Jan. 1999.

[22] A. Chandrakasan et al., “Data-driven signal processing: An approach for energy-efficient computing,” in Proc. Int. Symp. Low-Power Electron. Design, 1996, pp. 347–352.

[23] G. Wei et al., “A variable-frequency parallel I/O interface with adaptive power-supply regulation,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1600–1610, Nov. 2000.

[24] R. Gonzalez and M. Horowitz, “Supply and threshold voltage scaling for low power CMOS,” IEEE J. Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, Aug. 1997.

[25] A. Bellaouar et al., “Supply voltage scaling for temperature insensitive CMOS circuit operation,” IEEE Trans. Circuits Syst. II, Analog, Digit. Signal Process., vol. 45, no. 3, pp. 415–417, Mar. 1998.

[26] K. Kanda et al., “Design impact of positive temperature dependence on drain current in sub-1-V CMOS VLSIs,” IEEE J. Solid-State Circuits, vol. 36, no. 10, pp. 1559–1564, Oct. 2001.

[27] M. Elgebaly, A. Fahim, I. Kang, and M. Sachdev, “Robust and efficient dynamic voltage scaling architecture,” in Proc. IEEE ASIC/SOC Conf., 2003, pp. 155–158.

[28] M. Nakai et al., “Dynamic voltage and frequency management for a low-power embedded microprocessor,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 28–35, Jan. 2005.

[29] S. Martin, K. Flautner, T. Mudge, and D. Blaauw, “Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads,” in Proc. Design Autom. Conf., 2002, pp. 721–725.

[30] L. Yan, J. Luo, and N. Jha, “Joint dynamic voltage scaling and adaptive body biasing for heterogeneous distributed real-time embedded systems,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1030–1041, Jul. 2005.

Mohamed Elgebaly (S’98–M’05) received the B.Sc. degree in electrical engineering from Minia University, Minia, Egypt, in 1994, and the M.Sc. and Ph.D. degrees from the University of Waterloo, Waterloo, ON, Canada, in 2000 and 2005, respectively, where he focused on energy efficient digital circuits and architectures.

In 2005, he joined Montalvo Systems, San Jose, CA, where he works on low-power and high-performance circuits and architectures for media processors. He was with Qualcomm Inc., San Diego, CA, from 2004 to 2005, working on energy efficient techniques and methodologies for mobile processors. His research interests include low-power, low leakage circuit design and energy efficient digital architectures. He holds three U.S. patents.

Manoj Sachdev (SM’97) received the B.E. degree (with honors) in electronics and communication engineering from University of Roorkee, Roorkee, India, and the Ph.D. degree from Brunel University, London, U.K.

Since 1998, he has been a Professor in the Electrical and Computer Engineering Department, University of Waterloo, Waterloo, ON, Canada. He was with Semiconductor Complex Limited, Chandigarh, India, from 1984 to 1989, where he designed CMOS integrated circuits. From 1989 to 1992, he worked in the ASIC Division, SGS-Thomson, Agrate, Milan, Italy. In 1992, he joined Philips Research Laboratories, Eindhoven, The Netherlands, where he researched various aspects of VLSI testing and manufacturing. His research interests include low-power and high performance digital circuit design, mixed-signal circuit design, and test and manufacturing issues of integrated circuits. He has written three books and three book chapters, and has contributed to over 125 technical articles in conferences and journals. He holds more than 15 granted and several pending U.S. patents in the broad area of VLSI circuit design and test.

Dr. Sachdev was a recipient of several awards including the 1997 European Design and Test Conference Best Paper Award, the 1998 International Test Conference Honorable Mention Award, and the 2004 VLSI Test Symposium Best Panel Award.
