Application of Power-Management Techniques for Low …ece.utah.edu/~kstevens/6770/reports/06-cpu-power-report.pdf · 3 Fig. 2. Block Diagram of the processor IV. IMPLEMENTATION The

1

Application of Power-Management Techniquesfor Low Power Processor Design

Sivaram Gopalakrishnan, Chris Condrat, Elaine LyDepartment of Electrical and Computer Engineering, University of Utah, UT 84112

([email protected], [email protected], [email protected])

ECE 6770, May 06, 2006

Abstract— The growing demand for mobile and hand-held devices requires efficient low-power designs andmethodologies providing acceptable levels of performance.This project studies the effects of power-managementtechniques, specifically clock-gating, for a general-purpose,MIPS-architecture processor. The implementation beginsas a behavioral HDL model, and is converted to a syn-thesized model, for which simulated power-consumptionstatistics are derived; this process is then be repeated,but with low-power optimizations added. Performance isjudged through simulated analysis; however, the complete-ness of this approach is questioned, and recommendationsare provided for a more qualitative analysis, given theresources. Ultimately, this project gives insight into how theproposed low-power technique performs on such a design.

I. INTRODUCTION

General-purpose microprocessor design has tra-ditionally concentrated on maximizing performanceabove other design considerations such as power oreven area. This effort has resulted in great advancesin microprocessor designs, achieving performancethat often exceeds the needs of the user. Thisexcess performance can often be sacrificed withoutmuch loss of functionality and usability, allowingthe processors to be tailored for other applicationswhen performance is not always the most importantconsideration—for example mobile and hand-helddevices.

This project studies the effects of a power-management technique for a general-purpose,MIPS-architecture processor [1].

II. RESEARCH FOCUS

The MIPS-architecture is well suited for thisproject. The architecture is relatively old, built when

performance was not a luxury. In addition, thisarchitecture is well documented, and appropriatelycomplex—incorporating a 32-bit pipelined archi-tecture, with other modern processor features, butnot extremely advanced where more specializedtechniques are required.

The goal of this project is to incorporate power-management techniques [2] on this architecture.We briefly review conventional power-managementtechniques and highlight the one that we shall focusupon in this project:

• Power-down Schemes• Low-voltage Design• Frequency Scaling• Multiple Supply Voltages• Multiple Device Thresholds• Reducing Switching Capacitance

Power-down schemes are useful for powering downlogic blocks which are not in use. This techniquecan be applied to complex arithmetic circuits whichare used infrequently, for example a floating pointunit, or divider circuit.

A reduction in supply voltage results in aquadratic decrease in power [2]; however, the delayof CMOS gates increases inversely with the supplyvoltage. This loss of performance can be compen-sated, to some degree, by architectural and logicoptimizations. For example using a carry-lookaheadadder in place of a ripple-carry adder increasesperformance, but at the same time increases area andpower consumption. The carry-lookahead versioncan operate at a much lower supply voltage, with theequivalent performance of the ripple-carry version.This in turn translates into significant power savings.

2

Frequency scaling enables the processor to usedifferent operating clock frequencies. Reducing thefrequency relaxes the constraints on timing and thusallows for a reduction in voltage supply. The goalof frequency scaling is to achieve an optimal power-performance level.

As an extension to low-voltage design, usingmultiple supply voltages for selective portions ofthe design can result in a decrease of power con-sumption, for the same performance. Paths whichfinish computations early can be identified and theirrespective supply voltages can be suitably reduced.

Many technologies offer two types of n-type andp-type transistors, each with differing thresholds.Lower threshold devices are more appropriate fortime-critical paths, as they can switch faster, whilethe higher-threshold devices can have lower leak-age. Timing-critical paths can be identified and theappropriate threshold-scaling devices can be chosenfor implementation.

To reduce the switching capacitance in a cir-cuit, and therefore decrease power consump-tion, techniques such as transistor-sizing andlogic/architectural optimizations can be applied.

In this project, we limit our optimizations topower-down schemes—where we identify blocksof the processor that are not always utilized, andprevent their operation, appropriately. Specifically,we use a technique called “clock gating,” which pre-vents the clock signal from reaching certain blocksin the processor hierarchy, effectively preventingthem from switching, and therefore saving power.The benefit of such a technique is that performanceis not sacrificed if performance is needed, becausethe blocks can be dynamically turned on and offdepending on computational need. How this “need”will be determined will be the subject of latersections.

III. MIPS ARCHITECTURE

While the processor is based upon a MIPS-basedarchitecture, the design of this processor is specificto a low-power application. Therefore, some aspectsof the MIPS architecture may not be fully imple-mented as they are inappropriate for a low-powerdesign. For example, floating-point instructions are

often not found in many low-power applications,and if necessary, can be emulated in software.

A. Instruction Set

This processor architecture is a simplified versionof the basic MIPS architecture in order to concen-trate on low-power techniques, not so much thecomplexity of the instruction set. The instructionset architecture implements the following classes ofinstructions:

• Basic arithmetic and logic• Load/Store• Constant-manipulation• Branch and Jump

The basic arithmetic instructions comprise of ad-dition, subtraction and multiplication. Division canbe carried out in software, or be incorporated intothe multiplication block. In this case, the divisionunit is built into hardware.

B. Pipelining

Modern processors incorporate pipelining to im-prove the performance of the processor by effec-tively utilizing the hardware. The MIPS uses a five-stage pipeline, starting with an instruction fetchstage, followed by a decode/register-file read stage.Then there is an execution stage in the ALU,connecting to a memory-access stage. The finalstage is the write-back stage, where the resultingcomputation is either written to a register or tomemory. Fig. (1) depicts the pipeline.

Fig. 1. Five-stage Pipeline

Multi-cycle arithmetic instructions, such as thosefor a multiply, are managed at the instruc-tion/architectural level, as well as multi-cycle mem-ory reads/writes, which also require pipeline stallingif necessary. The diagram in Fig. (2) represents thetop-level block-diagram of the processor.

3

Fig. 2. Block Diagram of the processor

IV. IMPLEMENTATION

The MIPS architecture has been implementedin Verilog HDL using parametric instantiations asshown in figure 3. This implementation is a variationof the MIPS verilog core available at [3]. In [3], thecore was targeting an FPGA-based implementation.We have adapted this core to a standard-cell-basedimplementation.

A number of modules were removed, and othersrequired modification for use in the standard-cellimplementation. Communication-interface modulessuch as the UART and FIFOs, were removed fromthe final design, as they consumed a substantialamount of area—and more importantly power. Suchmodules only served to distort the initial statisticsunnecessarily, as they were for the FPGA-specificimplementation. Other modules required modifica-tion, as they were programmed assuming an FPGAenvironment, specifically the presence of FPGAblock-RAMs—large blocks of 1-cycle SRAM foundin FPGAs—which needed to be converted into flip-

flops (registers), or emulated in a test bench.

Emulating block-RAMs was an important part ofthe design modification. The FPGA design incorpo-rated the block-RAMs as the instruction cache/datamemory, preloading the memory banks using mem-ory bitmaps on compilation. This works well foran FPGA design, but not for a standard cell im-plementation, where an instruction cache would beimplemented in a very specialized manner. Further-more, for simulation purposes, the design requiredthat instructions could be fed directly into the designas needed, and therefore the “instruction cache”—found in the decoder module—was removed fromthe design altogether, instead being emulated withinthe testbench.

High-level “routing” for the former instructioncache signals was implemented by passing theblock-RAM signals up through the hierarchy andout the main processor modules for access by thetestbench. This was required as it is not possible—in a mapped design—to directly connect to the

4

uart_read Main uart_write FIFO

PC Pipe_REGfile

mul_div ALU Shifter RegFile

Decoder

Instruction Memory

Fig. 3. Implementation of the processor

sub-module signals that provide the instructions, ascan be done in structural Verilog. Providing theinstructions via these routed signals was also easierusing a single portmap.

The golden model design is therefore composedof the following modules:

• Mul Div unit: This module is exercised by themultiply and divide instructions.

• ALU: This module performs the basic arith-metic and logical operations.

• Shifter: This module is used for bitwise-shifts.• REGFile: This module contains the set of regis-

ters forming the register file of the MIPS core.

The decoder module interacts with the instruc-tions and data and essentially implements the in-struction fetch and the decode stages of the pipeline.The PC module implements the program counter.The PileREGfile module implements the other threestages of the pipeline.

Our focus primarily on the mul div unit forimplementing the optimization using the power-down scheme. Multiply and division operations aregenerally quite rare in many instruction streams;their probability of occurrence is often less than10%. Despite the infrequency of the instructions, theunit used to implement them consumes a significantratio of the total power in the processor; however,to eliminate the unit altogether and perform theoperations in software would incur a severe per-formance penalty. This performance-power trade-off

TABLE I

POWER CONSUMPTION OF INDIVIDUAL BLOCKS

Module PowerPipe REGfile 1.6647 WMUL DIV 549.6277 mWALU 247.4343 mWShifter 230.9180 mW

makes this unit an ideal candidate for our clock-gating optimization.

We choose to apply a clock-gating methodologyto the mul div unit, where the clock is appliedto this unit only when this unit performs a non-redundant operation. We identify the bits from theinstruction register that activate the operations usingthe mul div unit and generate a boolean logic whichcan further be used for clock-gating this unit.

V. PRELIMINARY RESULTS

Some preliminary experimental results are shownin Table I. It can be seen that the modulePipe REGfile which is mainly composed of thethree other modules (mul div, alu and Shifter) con-sumes as much as 1.6647 W. In this module, themul div module itself consumes around 550 mW.So the mul div module consumes almost 30% of thepower in the Pile REGfile. Adding a clock gatingmodule for the mul div can result in significantpower savings in the processor.

5

Fig. 4. Layout of the Golden (left) and Optimized Model (right)

TABLE II

COMPARISON OF THE GOLDEN MODEL AND THE OPTIMIZED MODEL

Design Characteristics Golden Model Optimized Model ImprovementCombinational Area 3852317 3856142 -0.1%Non-combinational Area 1988712 1988712 -Net Interconnect Area 22376476 22442624 -0.3%Cell Area 5841137 5844960 -0.065%Total Area 28217504 28287478 - 0.24%Timing 35.8 36.36 -1.5%Power 1.4681 W 1.3223 W + 10%

VI. OPTIMIZATION

The mul div module is utilized for the MUL andDIV instructions. Various signals are required tocontrol the operation of the mul div module, suchas a function bus which determines the sign, modeand the enable signals for the mul div module.In addition, there is a stop-bit which determineswhether the MUL/DIV operation is completed. Onlywhen the stop-bit is turned off, does the decodermodule fetch the next instruction from the instruc-tion memory.

An enable signal is generated to determinewhether the mul div module is required for thecurrent instruction. We, therefore use this signalalong with the clock to perform the clock-gatingfor this module.

A. Results

After performing the optimization, we ran ran-dom simulations for both the golden model and

the optimized model. The results from the randomsimulations are presented in table II. In this table,we present the design characteristics such as: thecombinational area, non-combinational area, net in-terconnect area, the total areas, timing and power. Itcan be seen that although there is a slight increasein the areas and timing characteristics (as expected),there is significant savings in power. Hence, it canbe seen that power-usage is significantly improvedwith negligible performance loss.

Figure 4 depict the final layout descriptions of thegolden model and the optimized model, respectively.

VII. TESTING AND VERIFICATION

Figure 5 shows the structure of our test benchrepresenting the top-level module for simulationand verification purposes. Clock and reset signals,along with block-RAM simulation-inputs and out-puts, serve as the interface to the design. The block-RAM modules are simulated, in the testbench, usingregister-arrays. This “memory” is preloaded with the

6

TABLE III

POWER CONSUMPTION OF INDIVIDUAL BLOCKS

Benchmarks Total Instructions Mul/Div InstructionsCount 354 10/0

Calculator 460 2/1Pi 145 5/3

Reed Solomon 1219 5/1Dhrystone 394 1/1

data prior to simulation, via Memory InitializationFile (MIF)-format files. In total, the block-RAMmodules simulate four 4K-memory blocks, as werepresent in the original FPGA design.

Five benchmarks were chosen for simulation: asimple arithmetic calculator, incremental counter,π-calculator, Reed Solomon code generator, and aDhrystone integer-programming benchmark. Thesebenchmarks were provided along with the originalFPGA design [3], as a series of test vectors writtenin assembly language. These benchmarks are thencompiled into MIF files, which are then loaded intothe block-RAMs for simulation.

As seen from Table III, multiplication and divi-sion operations represent an insignificant percentageof instructions. However, they do not necessarilyreflect the potential effectiveness of our power op-timization technique. Due to the non-linear natureof instruction streams, involving loops and jumps,instructions involving multiply and division instruc-tions may represent a significant portion of theactual computations. Hence, these test cases aretreated as generic tests to verify that instructionsare handled correctly by the core. For a purpose oftesting the mul-div module, we use this test casesset as a minimum test case. This proves to be alimitation in the analysis of power-usage.

VIII. LIMITATIONS AND FUTURE WORK

One of the important aspects of power calculationis to be able to determine the power consumed oversome specific test-bench cases. A more qualitativepower analysis can only be performed using ac-tual simulation, as a simple analysis of sequenceof instructions is not sufficient for true power-use evaluation. We can then statistically estimatethe power consumption depending on the numberof MUL/DIV instructions occurring in an actual

instruction stream as opposed to an instructiondistribution as seen from simple program analysis.There are many tools available in the industry to au-tomatically perform such estimations. However, inacademia (University of Utah), resources and timeare limited, and two options presented themselves aspossible methods to estimate power-usage at a morequalitative level: extensive analog simulation, andpattern simulation via a dedicated tool (NanoSim).

Fig. 5. Testbench platform.

A. Analog Simulation

Using the Analog Artist tool in the Cadenceenvironment, we can run the simulation using ourtest-vectors and calculate power using the operatingvoltage, and the current drained over a period oftime. However, in our case it would be become anexpensive operation in terms of memory and timesince we have to run these simulations over a 32-bit microprocessor core. Hence, this method did notpresent itself as a viable option.

B. NanoSim Simulations

At the suggestion of another student, in the finaldesign-review, we sought out a tool called NanoSimto provide power estimates. NanoSim is a tool whichallows estimation of power using specific simulationpatterns; however, this tool requires a lot of informa-tion about the library elements and their parametersfor performing this estimation. To the best of ourknowledge, the UofU Digital v1 1 library that isavailable to us—and the library used to map thedesign—does not have these parameter files. Also,

7

NanoSim accepts a vector file format (.vec), whichneeds to be generated from the .vcd file format ofthe testbench—requiring further scripts to performthe conversion. If the parameter file issue is re-solved, NanoSim can ideally be used to perform thepower-estimation.

C. Realistic Benchmarking

The benchmarks presented earlier do not reflect“real world” processor-usage. Most of the bench-marks are performance benchmarks–designed to testthe arithmetic capabilities of the processor design.Benchmarks reflecting the standard use of general-purpose processor, would be useful for gauging theeffectiveness of a low-power design.

IX. CONCLUSION

We have successfully demonstrated that clock-gating provides improved power-reduction perfor-mance for a realistic processor design, while negli-gibly affecting the performance and area character-istics of the chip. It must be noted, however, thatthe testing methods used to gauge the effectivenessof the power-optimization approach do not paint afull picture of the extent by which this approachhas improved power-savings. One method to im-prove the quality of the power-savings is throughsimulation; however, this has limitations as well, asthe type of benchmark is also important. This willbe an area of future research, which due to timeand resource constraints, cannot be performed forthis project. However, our results do show promise,and may serve as a basis for future power-savingtechniques, with more qualitative analysis on howthese techniques perform in realistic environments.

REFERENCES

[1] John L. Hennessy and David A. Patterson, Computer Oorgani-zation & Design, Morgan Kaufmann Publishers, 1998.

[2] Anantha Chandrakasan, William J. Bowhill, and Frank Fox,Design of High-Performance Microprocessor Circuits, IEEEPress, 2001.

[3] T. Sugawara, “YACC—Yet Another CPU CPU”,http://www.opencores.org/projects.cgi/web/yacc/overview/.

Documents

Application of Power-Management Techniques for Low …ece.utah.edu/~kstevens/6770/reports/06-cpu-power-report.pdf · 3 Fig. 2. Block Diagram of the processor IV. IMPLEMENTATION The