Exploring Compiler Optimizations for Enhancing Power Gating

Soumyaroop Roy, Nagarajan Ranganathan, and Srinivas Katkoori

Department of Computer Science and Engineering

University of South Florida, Tampa, FL 33620

{sroy, ranganat, katkoori}@cse.usf.edu

Abstract: Power gating is a circuit level technique for reducing standby leakage in a circuit block by cutting off paths in it between the supply and the ground. A processor architecture that supports power gating of its resources may provide instructions that activate and deactivate those resources as part of the instruction set architecture. Adequate compiler support is then required so that the power gating instructions can be inserted into the code to deactivate the resources that remain idle for long periods of time during program execution. However, the resource usage in a program depends on the code generated by the compiler. Thus, the code transformations performed by the compiler have an influence on the power gating opportunities of the processor resources.

In this work, we explore target independent compiler optimizations that modify the functional unit usage in the loops of a procedure to enhance the opportunities to deactivate functional units in an embedded processor architecture. The optimizations performed on the code are sparse conditional constant propagation, lazy code motion, weak strength reduction, and operator strength reduction. Insertion of power gating instructions is performed by inspecting the idleness of the units in the regions enclosed within loops. We model the processor architecture with power gating support around an ARM core and use the SUIF framework for compiler support. Finally, we use the SimpleScalar-ARM distribution to perform power and performance evaluation with a set of benchmarks from the MiBench and MediaBench suites. Experimental results indicate that the integer multiplier in the processor core can be power gated for up to 99% of its idle cycles for integer benchmarks, and up to 93% for floating point benchmarks, when all the optimizations are performed. Moreover, the leakage energy in the functional units for the code with all the optimizations performed can be up to 51% lower for integer benchmarks, and up to 21% lower for floating point benchmarks, than that for the unoptimized code.

I. INTRODUCTION AND MOTIVATION

The major components of power consumption in VLSI circuits

are dynamic power, short circuit power, and leakage power. In

recent years, due to scaling of threshold voltage and gate-oxide

thickness, the contribution of power due to leakage currents to the

total power of a circuit has increased significantly [1]. Therefore,

reducing leakage power has become a vital aspect of design of

low-power VLSI circuits. One of the common techniques used to

reduce standby leakage current in a circuit is power gating [2]. In

this technique, the path between the supply and ground is cut off

by inserting a sleep transistor between the supply and the circuit or

between the circuit and the ground. Since the deactivation (sleep) and activation (wakeup) of a circuit block result in dynamic energy

overhead, ensuring that the block remains idle for a sufficiently long

time is critical in achieving overall energy savings. Power gating

is also applied at the architecture level to reduce leakage in the

components of a microprocessor during periods of their idleness. The

components are equipped with sleep transistors at the circuit level,

and the controls for the sleep transistors may be provided as special

instructions, called power gating instructions. The compiler is then

extended with adequate support to analyze the program behavior and

insert power gating instructions into the code where those components

are idle such that energy savings can be obtained during program

execution.

Several works have investigated the problem of power gating

functional units in a microprocessor to reduce leakage during their

idle periods at the compiler level. Rele et al. propose compiler level

support in [3] for power gating in superscalar processors. In [4],

Zhang et al. investigate power gating and input vector control in

VLIW architectures. You et al. apply dataflow analysis techniques

in [5] to find regions in the program where functional units are

idle. In [6], [7], Roy et al. present a compiler level framework with architectural support in which the units are first designed based

on the specifications of those of the ARM processor and then

characterized for latency and power. The characterization of the units

is an important task because, due to the dynamic energy overhead

involved in activating and deactivating a circuit, the energy savings

through power gating depend not only on the period for which a

unit remains deactivated but also on the number of times that unit is

activated. Furthermore, the architecture proposed in [6] eliminates the

need for instructions to activate the functional units by automatically

waking them up at the decode stage of the processor pipeline.
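The dependence of the savings on both the gated period and the number of activations can be illustrated with a simple first-order model. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, leakage in sleep mode is taken to be zero, and each wakeup is charged a fixed dynamic energy; it is not the characterization procedure of [6].

```python
def net_leakage_saving(p_leak_w, gated_cycles, clock_period_s, wakeups, e_wakeup_j):
    """Net energy saved (joules) by gating one unit, under a first-order model.

    Assumptions (not from the paper): leakage while asleep is zero and every
    wakeup costs the same dynamic energy e_wakeup_j.
    """
    saved = p_leak_w * gated_cycles * clock_period_s   # leakage avoided while asleep
    overhead = wakeups * e_wakeup_j                    # dynamic cost of the activations
    return saved - overhead


def break_even_cycles(p_leak_w, clock_period_s, e_wakeup_j):
    """Gated cycles per wakeup needed before gating starts to pay off."""
    return e_wakeup_j / (p_leak_w * clock_period_s)


def avg_gated_cycles_per_wakeup(gated_cycles, wakeups):
    """Average length of a sleep period, in cycles."""
    return gated_cycles / wakeups
```

With a 10 ns clock period and purely illustrative values of 1 mW leakage and 0.1 nJ per wakeup, the break-even point in this model is about 10 gated cycles per wakeup; shorter sleep periods would cost more energy than they save.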

The main role of the compiler support in all the works discussed

above has been twofold. First, it identifies regions in the program

during which functional units are idle. Then, power gating instructions

are inserted at the boundaries of such regions to deactivate and activate the units. However, the usage of the functional units in a

program region depends not only on the source code description

of the program but also on the code generated by the compiler.

Consequently, the power gating opportunities of such units are also

dependent on the code transformations performed by the compiler.

This aspect has not been addressed in any of the prior works. This

forms the main motivation for this work. In this paper, we perform

a set of compiler optimizations on the source program that perform

code transformations, thereby modifying the functional unit usage

in the generated code. These code transformations are performed to

enhance the opportunities for power gating of the functional units.

The rest of the paper is organized as follows. Section II describes

the framework for power gating with compiler optimization techniques.

The experimental results are discussed in Section III, followed by conclusions in Section IV.

II. POWER GATING AT COMPILER LEVEL WITH CODE OPTIMIZATIONS

In this work, the SUIF research compiler [8] is used for com-

piler support. Figure 1 describes the compiler level approach for

power gating functional units. The source files of an application

are translated into SUIF intermediate representation (IR) by the

SUIF frontend. The SUIF IR of the code is then converted into the

SUIF Virtual Machine (SUIFVM) representation, which is the IR

978-1-4244-3828-0/09/$25.00 2009 IEEE



[Figure 1: flowchart of the compilation flow: Source files, SUIF IR Translation, SUIFVM Translation, Code Optimizations, ARM Code Generation, Insertion of Power Gating Instructions, and Assembly, Linking and Simulation; side labels: MachineSUIF Framework, Architecture Support.]

Fig. 1. Framework for power gating in an optimizing compiler

of the MachineSUIF framework [9]. All the code optimizations are

performed on the SUIFVM representation of the program. The ARM

code generation library [10] lowers the SUIFVM representation into

equivalent ARM assembly code. Power gating instructions are then

inserted into the assembly code to deactivate the units at the entry of

the regions where they are determined to be idle. This code is then

assembled and linked to generate an ARM executable whose runtime

execution is simulated on a cycle accurate simulator for performance

and energy evaluation. The architecture support details are used by

the assembler and the processor simulator to generate object code

and evaluate performance and energy statistics, respectively. Section II-A describes the details of the architecture and assembler support

for the power gating instructions. Section II-B describes the compiler

optimizations developed and used in this work and, finally, Section

II-C describes the power gating technique used to insert the power

gating instructions into the assembly code.

A. Architecture and Assembler Support for Power Gating

We use the architectural support for power gating described in

[6]. The instruction set architecture (ISA) provides an explicit sleep

instruction, whose argument is the list of functional units that need

to be deactivated. The ISA, however, does not provide any explicit

wakeup instructions. When an instruction is decoded, the units

needed by the instruction are activated by the decode stage of the

pipeline. The library of functional units with power gating support is characterized for a 1-cycle wakeup latency at a clock period of

10 ns (100 MHz clock). The only modification done in this work is

that the barrel shifter is not equipped with any power gating support

because of the frequent usage of shift instructions in the code. Shift

instructions are generated during strength reduction of multiplication

operations with constant operands [11] and during generation of load

and store instructions with complex addressing modes [10].

A sleep control register (SCR) is added to the decode logic which

regulates the sleep controls for the functional units, as shown in

Figure 2. A 0 in the sleep control register indicates that the

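[Figure 2: the ARM decode logic drives the SCR, whose bits control the functional units; SCR contents shown: FPAdder = 1, FPDsqt = 0, FPMult = 1, IntMult = 0.]

Fig. 2. Architecture support for power gating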

functional unit driven by that register bit is active (awake mode),

while a 1 indicates that it is inactive (sleep mode). In Figure 2,

the contents of the SCR indicate that the integer multiplier, and the

FP division and square root unit, are active, while the FP adder and

the FP multiplier are inactive. When a sleep instruction is decoded,

the SCR is modified to deactivate the functional units passed to

the sleep instruction as arguments. When an instruction requiring

a certain functional unit is decoded, the SCR is modified to activate

that functional unit so that it can be used by the time the instruction

enters the execution stage.

[Figure 3:
Assembly format of sleep instruction: slp /* 4 bit argument */
Machine code format of sleep instruction: bits 31-20 = 07F, bits 19-12 = X X, bits 11-8 = Arg, bits 7-0 = F0
Arg bit 0: IntMult; Arg bit 1: FPAdd; Arg bit 2: FPMult; Arg bit 3: FPDSQT]

Fig. 3. Assembly and machine code formats of the sleep instruction

The assembler support for translating the sleep instructions into

machine code is added to the GNU ARM assembler, which is part

of the binutils package. The format of the machine code for the

sleep instruction is chosen from the domain of exceptional opcodes described in the ARM reference manual and is shown in Figure 3.

The assembly opcode for the sleep instruction is slp. The functional

units that need to be deactivated are encoded into a 4-bit integer, and

this is passed to the slp instruction as an argument. Bits 0-3 are

for deactivating the integer multiplier, floating point adder, floating

point multiplier, and floating point division and square root unit,

respectively. The machine code for the slp instruction has bits 7-0

as F0 and bits 31-20 as 07F. Bits 11-8 are used for encoding

the 4-bit argument passed to the sleep instruction.

The decode logic in the SimpleScalar-ARM distribution is also

extended to include the definition of the slp instruction. After the

slp instruction is decoded by the decode logic, the 4-bit argument is

extracted and a logical OR operation is performed with the contents of

the SCR before the result is stored back in the SCR. When any other instruction is decoded, the SCR entry corresponding to the functional

unit needed by the instruction is overwritten with a 0.
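The encoding and the SCR update rules described above can be sketched in a few lines. This is an illustrative model, not the code added to binutils or SimpleScalar-ARM: the function names are hypothetical, the don't-care bits 19-12 are assumed to be zero, and the argument is placed in bits 11-8 following the field boundaries of Figure 3.

```python
# Bit positions of the units inside the 4-bit slp argument (from the text).
INT_MULT, FP_ADD, FP_MULT, FP_DSQT = 0, 1, 2, 3


def encode_slp(units):
    """Build the 32-bit machine word for `slp` deactivating the given units."""
    arg = 0
    for unit in units:
        arg |= 1 << unit
    # Bits 31-20 = 0x07F, bits 11-8 = argument, bits 7-0 = 0xF0.
    # Bits 19-12 are don't-care; this sketch leaves them at zero.
    return (0x07F << 20) | (arg << 8) | 0xF0


def is_slp(word):
    """Recognize the sleep instruction by its fixed fields."""
    return ((word >> 20) & 0xFFF) == 0x07F and (word & 0xFF) == 0xF0


def scr_after_sleep(scr, word):
    """Decoding slp: OR the 4-bit argument into the SCR (1 = asleep)."""
    return scr | ((word >> 8) & 0xF)


def scr_after_use(scr, unit):
    """Decoding an instruction that needs `unit`: clear its SCR bit (0 = awake)."""
    return scr & ~(1 << unit) & 0xF
```

For example, `encode_slp([FP_ADD, FP_MULT])` yields the word `0x07F006F0` in this model; decoding it sets SCR bits 1 and 2, and a later floating point add clears bit 1 again.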

B. Compiler Optimizations

We select four compiler optimizations that either modify arithmetic

instructions in the code or move them across basic blocks, thereby

changing the functional unit usage of the basic blocks. All the opti-

mizations described below are implemented as MachineSUIF passes

that perform code transformations on the SUIFVM representation of

the source program.


1) Sparse Conditional Constant Propagation: Sparse conditional

constant propagation (SCCP) [12] is a global constant propagation

technique in which propagation of constant temporaries is performed

across basic blocks in the presence of conditional branches. This

optimization is performed on the static single assignment (SSA)

representation of the control flow graph (CFG) form of the code. SSA

form of the CFG is a representation in which a target temporary can

be the destination of only one instruction [13]. A constant folding

library is also implemented at the SUIFVM level; it computes the target temporary of an instruction whose operands are identified as

constants during this pass.
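As a rough illustration of the constant folding step only (not of the full SCCP analysis, and not of the SUIFVM library itself), the following sketch folds a binary integer instruction once all of its operands are known constants. The instruction representation, the opcode names, and the 32-bit wraparound are assumptions made for the example.

```python
import operator

# Hypothetical opcode names mapped to their integer semantics.
FOLDERS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def to_int32(value):
    """Wrap a Python integer to a signed 32-bit value (assumed target width)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def fold(op, operands, consts):
    """Return the constant result of `op`, or None if an operand is unknown.

    operands: two entries, each an int literal or the name of a temporary.
    consts:   mapping from temporary name to its known constant value.
    """
    values = []
    for operand in operands:
        if isinstance(operand, int):
            values.append(operand)
        elif operand in consts:
            values.append(consts[operand])
        else:
            return None  # not all operands are constants; nothing to fold
    return to_int32(FOLDERS[op](*values))
```

Folding a multiplication this way removes the instruction entirely, which is one route by which constant propagation can reduce the use of the integer multiplier.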

2) Lazy Code Motion: Code motion optimizations perform

dataflow analyses to identify the instructions that compute the same

value and move such instructions to locations in the code so that

they are executed less frequently. Lazy code motion (LCM) [14] is a

global code motion technique that eliminates redundant instructions

in a procedure of a program. A descendant of partial redundancy

elimination, LCM performs common subexpression elimination along

with loop invariant code motion. For this work, a publicly available

LCM implementation [15] in MachineSUIF is used.

3) Weak Strength Reduction: Strength reduction is a term used

to refer to techniques that replace expensive operations with inexpensive

ones. Weak strength reduction (WSR) refers to replacing an expression like x × 2 with the expression x + x. In this example, one

of the operands (the constant 2) in the multiplication expression has

been identified as a constant by the compiler. As part of this work,

the technique described in [11] is implemented for replacing integer

multiplication operations with a series of addition, subtraction, and

shift operations.
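A minimal sketch of the idea follows. It uses the plain binary expansion of a positive constant, so it emits only shifts and additions; the technique of [11] also uses subtractions and factoring to find shorter sequences, which this example does not attempt. The function names are hypothetical.

```python
def shift_add_plan(c):
    """Shift amounts s such that x * c equals the sum of (x << s), for c > 0."""
    if c <= 0:
        raise ValueError("this sketch handles positive constants only")
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def mul_by_const(x, c):
    """Compute x * c without a multiply, using the plan above."""
    return sum(x << s for s in shift_add_plan(c))
```

For the constant 10 the plan is `[1, 3]`, that is, x × 10 = (x << 1) + (x << 3), so the multiplier is not needed for this operation and only the shifter and the adder are used.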

4) Operator Strength Reduction: A more powerful form of

strength reduction replaces repeated multiplications inside a loop with

repeated additions or subtractions. This is performed by identifying

induction variables, which are temporaries that get incremented or

decremented by a constant value during the execution of a loop,

and replacing multiplication operations involving such variables with

equivalent addition and subtraction operations. For this work, we implement

operator strength reduction (OSR) [16], which is performed on the SSA form of the code. We also perform linear function test

replacement (LFTR) which replaces the uses of original induction

variables in comparison operations (branches) to render series of

computations useless. These useless computations are subsequently

removed by dead code elimination.
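The effect of OSR followed by LFTR can be seen on a small address computation. The two functions below are a hand-written before and after pair, not output of the OSR pass; the names are hypothetical and a positive stride is assumed so that the rewritten loop test is valid.

```python
def addresses_with_multiply(base, stride, n):
    """Original loop: one multiplication by the induction variable per iteration."""
    out = []
    i = 0
    while i < n:
        out.append(base + i * stride)
        i += 1
    return out


def addresses_after_osr(base, stride, n):
    """After OSR and LFTR: the multiply becomes a running addition and the
    loop test is rewritten on the new variable, so `i` is no longer needed."""
    if stride <= 0:
        raise ValueError("this sketch assumes a positive stride")
    out = []
    addr = base
    limit = base + n * stride  # computed once, outside the loop
    while addr < limit:        # test on addr replaces the test on i (LFTR)
        out.append(addr)
        addr += stride         # addition replaces base + i * stride
    return out
```

Both functions produce the same sequence, but the second executes no multiplication inside the loop, so the integer multiplier is idle for the whole loop body.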

C. Insertion of Power Gating Instructions

Since the task of power gating at the compiler level requires

the details of the target architecture support, the insertion of power

gating instructions is not done on the SUIFVM representation of the

code. Instead, it is performed on the ARM assembly code as another

optimization pass in MachineSUIF. However, since the MachineSUIF

framework provides the capability to write optimization passes that

load target specific details during runtime, it is possible to write an abstract optimization pass that attains a concrete structure only during

runtime. This avoids writing the same target dependent optimization

task for each target platform. The dead code elimination pass supplied

with the MachineSUIF distribution illustrates this feature. This approach

is adopted in implementing the power gating pass. The details

of the functional units, and the instruction set, including the format

of the sleep instruction, are obtained from the ARM code generation

library during runtime.

Due to the unavailability of a code instrumentation library for the

ARM backend on MachineSUIF, we do not use dynamic profiling

information for inserting sleep instructions into the code. Instead, we

use a static technique based on the control flow information of the

procedure. A loop tree [13], which is a data structure that maintains

information about all the loops in a function and the basic blocks

contained in those loops, is constructed. For all the functional units

that are not needed in the loop, a sleep instruction deactivating those

units is inserted at the entry to the loop. If the loop entry block has

only one external predecessor, the sleep instruction is inserted at the

end of the predecessor block. An external predecessor of a loop entry block is a predecessor block that is not part of that loop. In case

there is more than one external predecessor of the entry block, a

new basic block is inserted with the sleep instruction and it is set as

the predecessor of the entry block. The original external predecessors

of the entry block are set as predecessors of the new block. This step

is performed first for the parent loop before any of its child loops so

as to ensure that no redundant sleep instructions are inserted.
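The insertion procedure can be sketched on a toy control flow graph. The dictionary-based CFG and loop representation, the `pre_` block naming, and the handling of loops without external predecessors (they are skipped) are assumptions of this example and do not reflect the MachineSUIF data structures.

```python
UNIT_BIT = {"IntMult": 0, "FPAdd": 1, "FPMult": 2, "FPDSQT": 3}
ALL_UNITS = frozenset(UNIT_BIT)


def insert_sleeps(blocks, loops):
    """Insert `slp` at loop entries for units that are idle in the whole loop.

    blocks: name -> {"succs": [names], "units": set of units used, "code": [str]}
    loops:  list of {"entry": name, "body": set of names, "parent": index or None},
            ordered so that every parent loop precedes its child loops.
    """
    asleep = {}  # loop index -> units already gated on entry to this loop
    for idx, loop in enumerate(loops):
        inherited = asleep.get(loop["parent"], frozenset())
        used = set().union(*(blocks[b]["units"] for b in loop["body"]))
        idle = ALL_UNITS - used
        asleep[idx] = inherited | idle
        to_gate = idle - inherited  # skip units an enclosing loop already gated
        if not to_gate:
            continue
        mask = sum(1 << UNIT_BIT[u] for u in to_gate)
        slp = f"slp {mask}"
        entry = loop["entry"]
        external = [name for name, blk in blocks.items()
                    if entry in blk["succs"] and name not in loop["body"]]
        if len(external) == 1:
            # Single external predecessor: place slp at the end of that block.
            blocks[external[0]]["code"].append(slp)
        elif len(external) > 1:
            # Several external predecessors: route them through a new block.
            pre = f"pre_{entry}"
            blocks[pre] = {"succs": [entry], "units": set(), "code": [slp]}
            for name in external:
                blocks[name]["succs"] = [pre if s == entry else s
                                         for s in blocks[name]["succs"]]
    return blocks
```

Because a parent loop is processed before its children, a child loop only emits a sleep instruction for units that become idle inside it, which mirrors the redundancy argument above.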

III. EXPERIMENTAL RESULTS

The simulations of the ARM executables with the sleep instructions

are performed with the SimpleScalar-ARM toolset [17] for a set of

benchmarks from the embedded benchmark suites MiBench [18] and

MediaBench [19]. The benchmarks range in size from 1 source file

and 174 lines of source code (Dijkstra) to more than 15 source files and 8000-9000 lines of source code (Mpeg2E and Mpeg2D). Only

Dijkstra and Sha are integer benchmarks, while the rest are floating

point benchmarks. For leakage energy calculations, the leakage power

characterization of the library of functional units developed in [6] is

used.

TABLE I
OPTIMIZATIONS PERFORMED ON THE BENCHMARKS

Legend    Description
unopt     No optimizations
sccp      SCCP
lcm       SCCP + LCM
wsr       SCCP + LCM + WSR
osr       SCCP + LCM + OSR + WSR

Fig. 4. Percentage of idle cycles for which the integer multiplier is kept deactivated.

The compiler optimizations discussed in Section II-B are per-

formed incrementally generating four optimization levels as enumer-

ated in Table I. The results of power gating with the optimizations are

compared to those with the unoptimized code generated by Machine-

SUIF. Since two of the optimizations explored in this work remove


integer multiplication instructions from the code, the power gating

opportunities are improved significantly for the integer multiplier.

This can be seen in Figure 4, which plots the fraction of idle cycles for

Fig. 5. Average number of cycles for which the integer multiplier is kept turned off before it is woken up.

which the integer multiplier is kept deactivated. Except for SusanS,

the opportunity of power gating this unit improves significantly

in osr in all the benchmarks. This is because SusanS performs

integer multiplications on array members and since they are stored in

memory, these optimizations are not able to remove the multiplication

operations. Although SCCP and LCM hardly improve the power

gating period for the integer multiplier (except for SusanE), they

are important prerequisites for the strength reduction optimizations

to be effective. For the Sha benchmark, the integer multiplier is power

gated for 99% of its idle cycles with the code that is optimized with

osr. Among the floating point benchmarks, the integer multiplier for

Mpeg2D is power gated for 93% of its idle cycles in osr. Figure 5

Fig. 6. Percentage of leakage energy saved with each optimization over that for the unoptimized code

shows the average number of cycles for which the integer multiplier

is power gated each time it is activated at the pipeline decode stage

after it decodes an integer multiply instruction. Comparing the results

in unopt and osr, the integer multiplier is power gated for a longer

period of time in the latter before it is woken up. This translates to

fewer activations of the multiplier unit, thereby lowering

the dynamic energy overhead in activating the multiplier. Finally,

Figure 6 plots the percentage of leakage energy saved with

each optimization level over that in unopt. For the integer benchmarks,

the floating point units are not used in the energy calculations. The

performance overhead of the additional sleep instructions for all the

benchmarks, except for Mpeg2D, is lower than 0.1%. For Mpeg2D,

it ranges from 0.57% to 0.69%.

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we have explored a few compiler transformations

on applications for enhancing opportunities to power gate functional

units in an embedded processor. The optimizations discussed in this work, particularly the strength reduction optimizations, focus on

integer operations. Therefore, the opportunities for power gating the

integer multiplier increase significantly when the optimizations are

performed. The library of compiler optimizations and power gating

developed for this work will be released publicly after we perform

sufficient testing of these passes with even bigger benchmarks. In

the future, compiler transformations to improve power gating for FP

operations will be explored, so that power gating of FP units can

be used to reduce leakage in applications that perform extensive FP

arithmetic operations.

REFERENCES

[1] R.K. Krishnamurthy et al. High-performance and low-voltage challenges for sub-45nm microprocessor circuits. Intl. Conf. ASIC, pages 283-286, 2005.

[2] K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. Proc. ICECS, pages 167-173, 1998.

[3] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing Static Power Dissipation by Functional Units in Superscalar Processors. Proc. 11th Intl. Conf. on Compiler Construction, pages 261-274, 2002.

[4] W. Zhang et al. Compiler Support for Reducing Leakage Energy Consumption. DATE, pages 1146-1147, 2003.

[5] Y. You, C. Lee, and J.K. Lee. Compiler Analysis and Supports for Leakage Power Reduction on Microprocessors. ACM TODAES, pages 147-164, 2006.

[6] S. Roy, S. Katkoori, and N. Ranganathan. A Compiler Based Leakage Reduction Technique by Power-Gating Functional Units in Embedded Microprocessors. Proc. 20th Intl. Conf. VLSI Design, pages 215-220, 2007.

[7] S. Roy, N. Ranganathan, and S. Katkoori. A Framework for Power Gating Functional Units in Embedded Microprocessors. Accepted to Trans. VLSI, 2008.

[8] R. Wilson. The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler. Technical report, Stanford University, 1994.

[9] M.D. Smith and G. Holloway. An Introduction to Machine SUIF and Its Portable Libraries for Analysis and Optimization. http://www.eecs.harvard.edu/hube/software/, 2002.

[10] G. Theoduloz and D.S. Garcia. Machine SUIF Back-end for the ARM Architecture. http://lap2.epfl.ch/dev/ machsuif/arm backend, 2005.

[11] P. Briggs and T.J. Harvey. Multiplication by Integer Constants. Technical report, Rice University, 1994.

[12] M.N. Wegman and F.K. Zadeck. Constant Propagation with Conditional Branches. ACM TOPLAS, pages 231-236, 1991.

[13] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.

[14] J. Knoop, O. Ruthing, and B. Steffen. Optimal Code Motion: Theory and Practice. ACM TOPLAS, pages 1117-1155, 1994.

[15] L. Rolaz. An Implementation of Lazy Code Motion for Machine SUIF. Technical report, Swiss Federal Institute of Technology, 2003.

[16] K.D. Cooper, L.T. Simpson, and C.A. Vick. Operator Strength Reduction. ACM TOPLAS, pages 603-625, 2001.

[17] D. Burger and T. Austin. The SimpleScalar Tool Set, version 2.0. Technical report, TR-97-1342, University of Wisconsin-Madison, 1997.

[18] M.R. Guthaus et al. MiBench: A free, commercially representative embedded benchmark suite. IEEE 4th Annual Workshop on Workload Characterization, pages 3-14, 2001.

[19] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. IEEE/ACM MICRO, page 330, 1997.
