
    Exploring Compiler Optimizations for Enhancing Power Gating

    Soumyaroop Roy, Nagarajan Ranganathan, and Srinivas Katkoori

    Department of Computer Science and Engineering

    University of South Florida, Tampa, FL 33620

    {sroy, ranganat, katkoori}@cse.usf.edu

    Abstract: Power gating is a circuit level technique for reducing standby leakage in a circuit block by cutting off paths in it between the supply and the ground. A processor architecture that supports power gating of its resources may provide instructions that activate and deactivate those resources as part of the instruction set architecture. Adequate compiler support is then required so that the power gating instructions can be inserted into the code to deactivate the resources that remain idle for long periods of time during program execution. However, the resource usage in a program depends on the code generated by the compiler. Thus, the code transformations performed by the compiler have an influence on the power gating opportunities of the processor resources.

    In this work, we explore target independent compiler optimizations that modify the functional unit usage in the loops of a procedure to enhance the opportunities to deactivate functional units in an embedded processor architecture. The optimizations performed on the code are sparse conditional constant propagation, lazy code motion, weak strength reduction, and operator strength reduction. Insertion of power gating instructions is performed by inspecting the idleness of the units in the regions enclosed within loops. We model the processor architecture with power gating support around an ARM core and use the SUIF framework for compiler support. Finally, we use the SimpleScalar-ARM distribution to perform power and performance evaluation with a set of benchmarks from the MiBench and MediaBench suites. Experimental results indicate that the integer multiplier in the processor core can be power gated for up to 99% of its idle cycles for integer benchmarks, and up to 93% for floating point benchmarks, when all the optimizations are performed. Moreover, the leakage energy in the functional units for the code with all the optimizations performed can be up to 51% lower for integer benchmarks, and up to 21% lower for floating point benchmarks, than that for the unoptimized code.

    I. INTRODUCTION AND MOTIVATION

    The major components of power consumption in VLSI circuits

    are dynamic power, short circuit power, and leakage power. In

    recent years, due to scaling of threshold voltage and gate-oxide

    thickness, the contribution of power due to leakage currents to the

    total power of a circuit has increased significantly [1]. Therefore,

    reducing leakage power has become a vital aspect of design of

    low-power VLSI circuits. One of the common techniques used to

    reduce standby leakage current in a circuit is power gating [2]. In

    this technique, the path between the supply and ground is cut off

    by inserting a sleep transistor between the supply and the circuit or

    between the circuit and the ground. Since the deactivation (sleep) and activation (wakeup) of a circuit block result in dynamic energy

    overhead, ensuring that the block remains idle for sufficiently long

    time is critical in achieving overall energy savings. Power gating

    is also applied at the architecture level to reduce leakage in the

    components of a microprocessor during periods of their idleness. The

    components are equipped with sleep transistors at the circuit level,

    and the controls for the sleep transistors may be provided as special

    instructions, called power gating instructions. The compiler is then

    extended with adequate support to analyze the program behavior and

    insert power gating instructions into the code where those components

    are idle such that energy savings can be obtained during program

    execution.

    Several works have investigated the problem of power gating

    functional units in a microprocessor to reduce leakage during their

    idle periods at the compiler level. Rele et al. propose compiler level

    support in [3] for power gating in superscalar processors. In [4],

    Zhang et al. investigate power gating and input vector control in

    VLIW architectures. You et al. apply dataflow analysis techniques

    in [5] to find regions in the program where functional units are

    idle. In [6], [7], Roy et al. present a compiler level framework with architectural support in which the units are first designed based

    on the specifications of those of the ARM processor and then

    characterized for latency and power. The characterization of the units

    is an important task because, due to the dynamic energy overhead

    involved in activating and deactivating a circuit, the energy savings

    through power gating depend not only on the period for which a

    unit remains deactivated but also on the number of times that unit is

    activated. Furthermore, the architecture proposed in [6] eliminates the

    need for instructions to activate the functional units by automatically

    waking them up at the decode stage of the processor pipeline.

    The main role of the compiler support in all the works discussed

    above has been twofold. First, it identifies regions in the program

    during which functional units are idle. Then, power gating instructions

    are inserted at the boundaries of such regions to deactivate and activate them. However, the usage of the functional units in a

    program region depends not only on the source code description

    of the program but also on the code generated by the compiler.

    Consequently, the power gating opportunities of such units are also

    dependent on the code transformations performed by the compiler.

    This aspect has not been addressed in any of the prior works. This

    forms the main motivation for this work. In this paper, we perform

    a set of compiler optimizations on the source program that perform

    code transformations, thereby modifying the functional unit usage

    in the generated code. These code transformations are performed to

    enhance the opportunities for power gating of the functional units.

    The rest of the paper is organized as follows. Section II describes

    the framework for power gating with compiler optimization techniques.

    The experimental results are discussed in Section III, followed by conclusions in Section IV.

    II. POWER GATING AT COMPILER LEVEL WITH CODE OPTIMIZATIONS

    In this work, the SUIF research compiler [8] is used for com-

    piler support. Figure 1 describes the compiler level approach for

    power gating functional units. The source files of an application

    are translated into SUIF intermediate representation (IR) by the

    SUIF frontend. The SUIF IR of the code is then converted into the

    SUIF Virtual Machine (SUIFVM) representation, which is the IR

    978-1-4244-3828-0/09/$25.00 2009 IEEE 1004


    [Fig. 1. Framework for power gating in an optimizing compiler. Boxes shown: source files; SUIF IR translation; SUIFVM translation; code optimizations; ARM code generation; insertion of power gating instructions; assembly, linking, and simulation. Also labeled: MachineSUIF framework; architecture support.]

    of the MachineSUIF framework [9]. All the code optimizations are

    performed on the SUIFVM representation of the program. The ARM

    code generation library [10] lowers the SUIFVM representation into

    equivalent ARM assembly code. Power gating instructions are then

    inserted into the assembly code to deactivate the units at the entry of

    the regions where they are determined to be idle. This code is then

    assembled and linked to generate an ARM executable whose runtime

    execution is simulated on a cycle accurate simulator for performance

    and energy evaluation. The architecture support details are used by

    the assembler and the processor simulator to generate object code

    and evaluate performance and energy statistics, respectively. Section II-A describes the details of the architecture and assembler support

    for the power gating instructions. Section II-B describes the compiler

    optimizations developed and used in this work and, finally, Section

    II-C describes the power gating technique used to insert the power

    gating instructions into the assembly code.

    A. Architecture and Assembler Support for Power Gating

    We use the architectural support for power gating described in

    [6]. The instruction set architecture (ISA) provides an explicit sleep

    instruction, whose argument is the list of functional units that need

    to be deactivated. The ISA, however, does not provide any explicit

    wakeup instructions. When an instruction is decoded, the units

    needed by the instruction are activated by the decode stage of the

    pipeline. The library of functional units with power gating support is characterized for 1 cycle wakeup latency for a clock period of

    10 ns (100 MHz clock). The only modification done in this work is

    that the barrel shifter is not equipped with any power gating support

    because of the frequent usage of shift instructions in the code. Shift

    instructions are generated during strength reduction of multiplication

    operations with constant operands [11] and during generation of load

    and store instructions with complex addressing modes [10].

    A sleep control register (SCR) is added to the decode logic which

    regulates the sleep controls for the functional units, as shown in

    Figure 2. A 0 in the sleep control register indicates that the

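    [Fig. 2. Architecture support for power gating. The ARM decode logic drives the SCR, whose bits control the functional units; the SCR contents shown are FPAdder = 1, FPDsqt = 0, FPMult = 1, IntMult = 0.]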

    functional unit driven by that register bit is active (awake mode),

    while a 1 indicates that it is inactive (sleep mode). In Figure 2,

    the contents of the SCR indicate that the integer multiplier, and the

    FP division and square root unit, are active, while the FP adder and

    the FP multiplier are inactive. When a sleep instruction is decoded,

    the SCR is modified to deactivate the functional units passed to

    the sleep instruction as arguments. When an instruction requiring

    a certain functional unit is decoded, the SCR is modified to activate

    that functional unit so that it can be used by the time the instruction

    enters the execution stage.

    [Fig. 3. Assembly and machine code formats of the sleep instruction]

    Assembly format of sleep instruction: slp /* 4 bit argument */

    Machine code format of sleep instruction (4-bit fields, bit 31 down to bit 0): 0 | 7 | F | X | X | Arg | F | 0

    Arg bit 0: IntMult; Arg bit 1: FPAdd; Arg bit 2: FPMult; Arg bit 3: FPDSQT

    The assembler support for translating the sleep instructions into

    machine code is added to the GNU ARM assembler, which is part

    of the binutils package. The format of the machine code for the

    sleep instruction is chosen from the domain of exceptional opcodes described in the ARM reference manual and is shown in Figure 3.

    The assembly opcode for the sleep instruction is slp. The functional

    units that need to be deactivated are encoded into a 4-bit integer, and

    this is passed to the slp instruction as an argument. Bits 0-3 are

    for deactivating the integer multiplier, floating point adder, floating

    point multiplier, and floating point division and square root unit,

    respectively. The machine code for the slp instruction has bits 7-0

    as F0 and bits 31-20 as 07F. Bits 11-8 are used for encoding

    the 4-bit argument passed to the sleep instruction.
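The encoding described above can be sketched in a few lines. This is only an illustration, not the assembler patch itself: the function names are hypothetical, the argument is assumed to occupy bits 11-8 (the 4-bit field boundary in Figure 3), and the don't-care bits 19-12 are assumed to be zero.

```python
# Hypothetical helpers illustrating the slp machine code layout.
# Assumptions: argument in bits 11-8, don't-care bits 19-12 set to zero.
UNIT_BITS = {"IntMult": 0, "FPAdd": 1, "FPMult": 2, "FPDSQT": 3}


def slp_argument(units):
    """Pack a collection of unit names into the 4-bit slp argument."""
    arg = 0
    for name in units:
        arg |= 1 << UNIT_BITS[name]
    return arg


def encode_slp(units):
    """Build the 32-bit word: bits 31-20 = 07F, bits 11-8 = arg, bits 7-0 = F0."""
    return (0x07F << 20) | (slp_argument(units) << 8) | 0xF0


def decode_slp(word):
    """Return the 4-bit argument if word matches the slp pattern, else None."""
    if (word >> 20) & 0xFFF == 0x07F and word & 0xFF == 0xF0:
        return (word >> 8) & 0xF
    return None
```

Under these assumptions, deactivating the FP adder and FP multiplier gives the argument 0b0110 and the word 0x07F006F0.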

    The decode logic in the SimpleScalar-ARM distribution is also

    extended to include the definition of the slp instruction. After the

    slp instruction is decoded by the decode logic, the 4-bit argument is

    extracted and a logical OR operation is performed with the contents of

    the SCR before the result is stored back in the SCR. When any other instruction is decoded, the SCR entry corresponding to the functional

    unit needed by the instruction is overwritten with a 0.
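The SCR update rules can be modeled as follows. This is a toy sketch rather than the SimpleScalar-ARM code: the class and method names are hypothetical, and the initial state (all units awake) is an assumption.

```python
class SleepControlRegister:
    """Toy model of the SCR: a bit of 1 means asleep, 0 means awake."""

    UNIT_BITS = {"IntMult": 0, "FPAdd": 1, "FPMult": 2, "FPDSQT": 3}

    def __init__(self):
        self.value = 0  # assumed initial state: every unit awake

    def decode_slp(self, arg):
        """slp: OR the 4-bit argument into the SCR."""
        self.value |= arg & 0xF

    def decode_use(self, unit):
        """Any other instruction: clear the bit of the unit it needs."""
        self.value &= ~(1 << self.UNIT_BITS[unit]) & 0xF

    def is_asleep(self, unit):
        return bool((self.value >> self.UNIT_BITS[unit]) & 1)
```

For example, decoding slp with argument 0b0110 reproduces the state of Figure 2 (FP adder and FP multiplier asleep); a later FP add instruction clears only the FP adder bit.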

    B. Compiler Optimizations

    We select four compiler optimizations that either modify arithmetic

    instructions in the code or move them across basic blocks, thereby

    changing the functional unit usage of the basic blocks. All the opti-

    mizations described below are implemented as MachineSUIF passes

    that perform code transformations on the SUIFVM representation of

    the source program.


    1) Sparse Conditional Constant Propagation: Sparse conditional

    constant propagation (SCCP) [12] is a global constant propagation

    technique in which propagation of constant temporaries is performed

    across basic blocks in the presence of conditional branches. This

    optimization is performed on the static single assignment (SSA)

    representation of the control flow graph (CFG) form of the code. SSA

    form of the CFG is a representation in which a target temporary can

    be at the destination of only one instruction [13]. A constant folding

    library is also implemented at the SUIFVM level that computes the target temporary of an instruction whose operands are identified as

    constants during this pass.
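The role of the constant folding step can be illustrated with a minimal sketch. It handles straight-line code only, so it is not SCCP itself, which also reasons about conditional branches. The tuple format, opcode names, and function names are hypothetical and do not correspond to SUIFVM.

```python
import operator

# Hypothetical three-address form: (dest, op, src1, src2). A source is
# either a temporary name (str) or an integer constant.
FOLDABLE = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def value_of(operand, known):
    """Return the constant value of an operand, or None if it is unknown."""
    if isinstance(operand, int):
        return operand
    return known.get(operand)


def fold_constants(instructions):
    """Map every temporary proven constant to its value.

    Each temporary is assumed to be defined exactly once, as in SSA form.
    """
    known = {}
    for dest, op, a, b in instructions:
        va, vb = value_of(a, known), value_of(b, known)
        if op in FOLDABLE and va is not None and vb is not None:
            known[dest] = FOLDABLE[op](va, vb)
    return known
```

A multiplication whose operands both fold to constants no longer needs the integer multiplier at run time, which is how constant propagation can change functional unit usage.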

    2) Lazy Code Motion: Code motion optimizations perform

    dataflow analyses to identify the instructions that compute the same

    value and move such instructions to locations in the code so that

    they are executed less frequently. Lazy code motion (LCM) [14] is a

    global code motion technique that eliminates redundant instructions

    in a procedure of a program. A descendant of partial redundancy

    elimination, LCM performs common subexpression elimination along

    with loop invariant code motion. For this work, a publicly available

    LCM implementation [15] in MachineSUIF is used.

    3) Weak Strength Reduction: Strength reduction is a term used

    to refer to techniques that replace expensive operations with inexpensive

    ones. Weak strength reduction (WSR) refers to replacing an expression like x*2 with the expression x+x. In this example, one

    of the operands (the constant 2) in the multiplication expression has

    been identified as a constant by the compiler. As part of this work,

    the technique described in [11] is implemented for replacing integer

    multiplication operations with a series of addition, subtraction, and

    shift operations.
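As a rough illustration of the idea, the sketch below uses the plain binary method: one shifted copy of x per set bit of the constant. It is simpler than the technique in [11], which also uses subtraction to shorten the sequence, and it handles positive constants only. The function names are hypothetical.

```python
def shift_add_plan(constant):
    """Return the shift amounts whose shifted copies of x sum to x * constant."""
    if constant <= 0:
        raise ValueError("this sketch handles positive constants only")
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_constant(x, constant):
    """Compute x * constant using only shifts and additions."""
    return sum(x << shift for shift in shift_add_plan(constant))
```

For the constant 10 the plan is [1, 3], that is, (x << 1) + (x << 3), so no multiply instruction is issued for that expression.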

    4) Operator Strength Reduction: A more powerful form of

    strength reduction replaces repeated multiplications inside a loop with

    repeated additions or subtractions. This is performed by identifying

    induction variables, which are temporaries that get incremented or

    decremented by a constant value during the execution of a loop,

    and replacing multiplication operations involving such variables with

    equivalent addition and subtraction operations. For this work, we implement

    operator strength reduction (OSR) [16], which is performed on the SSA form of the code. We also perform linear function test

    replacement (LFTR) which replaces the uses of original induction

    variables in comparison operations (branches) to render series of

    computations useless. These useless computations are subsequently

    removed by dead code elimination.
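The effect of these two transformations on a loop can be sketched at the source level. The three functions below are hypothetical before-and-after versions of the same loop, written by hand to show the intended result; they are not output of the OSR pass. The last version assumes a positive stride.

```python
def addresses_with_multiply(base, stride, n):
    """Original loop: one multiplication per iteration."""
    out = []
    for i in range(n):
        out.append(base + i * stride)
    return out


def addresses_strength_reduced(base, stride, n):
    """After strength reduction: i * stride becomes a running sum."""
    out = []
    offset = 0  # new induction variable that tracks i * stride
    for _ in range(n):
        out.append(base + offset)
        offset += stride
    return out


def addresses_with_lftr(base, stride, n):
    """After test replacement: the loop test uses offset, so i is dead.

    Assumes stride > 0 so that the comparison below terminates correctly.
    """
    out = []
    offset = 0
    limit = n * stride  # computed once, outside the loop
    while offset < limit:
        out.append(base + offset)
        offset += stride
    return out
```

In the last version the original counter no longer appears in the loop at all, which is what lets dead code elimination remove its increment.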

    C. Insertion of Power Gating Instructions

    Since the task of power gating at the compiler level requires

    the details of the target architecture support, the insertion of power

    gating instructions is not done on the SUIFVM representation of the

    code. Instead, it is performed on the ARM assembly code as another

    optimization pass in MachineSUIF. However, since the MachineSUIF

    framework provides the capability to write optimization passes that

    load target specific details during runtime, it is possible to write an abstract optimization pass that attains a concrete structure only during

    runtime. This avoids writing the same target dependent optimization

    task for each target platform. The dead code elimination pass supplied

    with the MachineSUIF distribution illustrates this feature. This approach

    is adopted in implementing the power gating pass. The details

    of the functional units, and the instruction set, including the format

    of the sleep instruction, are obtained from the ARM code generation

    library during runtime.

    Due to the unavailability of a code instrumentation library for the

    ARM backend on MachineSUIF, we do not use dynamic profiling

    information for inserting sleep instructions into the code. Instead, we

    use a static technique based on the control flow information of the

    procedure. A loop tree [13], which is a data structure that maintains

    information about all the loops in a function and the basic blocks

    contained in those loops, is constructed. For all the functional units

    that are not needed in the loop, a sleep instruction deactivating those

    units is inserted at the entry to the loop. If the loop entry block has

    only one external predecessor, the sleep instruction is inserted at the

    end of the predecessor block. An external predecessor of a loop entry block is a predecessor block which is not part of that loop. If

    there is more than one external predecessor of the entry block, a

    new basic block is inserted with the sleep instruction and it is set as

    the predecessor of the entry block. The original external predecessors

    of the entry block are set as predecessors of the new block. This step

    is performed first for the parent loop before any of its child loops so

    as to ensure that no redundant sleep instructions are inserted.
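The insertion step can be sketched as a recursive walk over the loop tree. This is a simplified illustration, not the MachineSUIF pass: the data structures, the "_pre" block naming, and the textual slp form with unit names are all hypothetical, and a real pass would place the instruction before the terminating branch of the predecessor block.

```python
from dataclasses import dataclass, field

ALL_UNITS = frozenset({"IntMult", "FPAdd", "FPMult", "FPDSQT"})


@dataclass
class Loop:
    entry: str  # name of the loop entry block
    blocks: set  # names of all blocks in the loop, nested loops included
    children: list = field(default_factory=list)


def insert_sleeps(loop, preds, units_used, code, asleep=frozenset()):
    """Insert sleep pseudo-instructions for a loop, then for its child loops.

    preds:      block name -> list of predecessor block names (mutated)
    units_used: block name -> set of units needed by the block
    code:       block name -> list of instruction strings (mutated)
    asleep:     units already deactivated by an enclosing loop
    """
    needed = set().union(*(units_used.get(b, set()) for b in loop.blocks))
    idle = ALL_UNITS - needed - asleep
    if idle:
        slp = "slp " + ",".join(sorted(idle))
        external = [p for p in preds[loop.entry] if p not in loop.blocks]
        if len(external) == 1:
            # Single external predecessor: append the sleep to that block.
            code.setdefault(external[0], []).append(slp)
        else:
            # Otherwise route the external predecessors through a new block.
            new_block = loop.entry + "_pre"
            code[new_block] = [slp]
            preds[new_block] = external
            internal = [p for p in preds[loop.entry] if p in loop.blocks]
            preds[loop.entry] = internal + [new_block]
    # Parent first: children only gate units that are not already asleep.
    for child in loop.children:
        insert_sleeps(child, preds, units_used, code, asleep | idle)
```

Passing the set of units already put to sleep down to the child loops is one way to obtain the parent-before-child behavior described above, since a unit that is idle in the whole parent loop is never woken inside it.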

    III. EXPERIMENTAL RESULTS

    The simulations of the ARM executables with the sleep instructions

    are performed with the SimpleScalar-ARM toolset [17] for a set of

    benchmarks from the embedded benchmark suites MiBench [18] and

    MediaBench [19]. The benchmarks range in size from 1 source file

    and 174 lines of source code (Dijkstra) to more than 15 source files and 8000-9000 lines of source code (Mpeg2E and Mpeg2D). Only

    Dijkstra and Sha are integer benchmarks, while the rest are floating

    point benchmarks. For leakage energy calculations, the leakage power

    characterization of the library of functional units developed in [6] is

    used.

    TABLE I

    OPTIMIZATIONS PERFORMED ON THE BENCHMARKS

    Legend   Description
    unopt    No optimizations
    sccp     SCCP
    lcm      SCCP + LCM
    wsr      SCCP + LCM + WSR
    osr      SCCP + LCM + OSR + WSR

    Fig. 4. Percentage of idle cycles for which the integer multiplier is kept deactivated.

    The compiler optimizations discussed in Section II-B are per-

    formed incrementally generating four optimization levels as enumer-

    ated in Table I. The results of power gating with the optimizations are

    compared to those with the unoptimized code generated by Machine-

    SUIF. Since two of the optimizations explored in this work remove


    integer multiplication instructions from the code, the power gating

    opportunities are improved significantly for the integer multiplier.

    This can be seen in Figure 4, which plots the fraction of idle cycles for

    Fig. 5. Average number of cycles for which the integer multiplier is kept turned off before it is woken up.

    which the integer multiplier is kept deactivated. Except for SusanS,

    the opportunity of power gating this unit improves significantly

    in osr in all the benchmarks. This is because SusanS performs

    integer multiplications on array members and since they are stored in

    memory, these optimizations are not able to remove the multiplication

    operations. Although SCCP and LCM hardly improve the power

    gating period for the integer multiplier (except for SusanE), they

    are important prerequisites for the strength reduction optimizations

    to be effective. For the Sha benchmark, the integer multiplier is power

    gated for 99% of its idle cycles with the code that is optimized with

    osr. Among the floating point benchmarks, the integer multiplier for

    Mpeg2D is power gated for 93% of its idle cycles in osr. Figure 5

    Fig. 6. Percentage of leakage energy saved with each optimization over that

    for the unoptimized code

    shows the average number of cycles for which the integer multiplier

    is power gated each time it is activated at the pipeline decode stage

    after it decodes an integer multiply instruction. Comparing the results

    in unopt and osr, the integer multiplier is power gated for a longer

    period of time in the latter before it is woken up. This translates to

    fewer activations of the multiplier unit, thereby lowering

    the dynamic energy overhead in activating the multiplier. Finally,

    Figure 6 plots the percentage of leakage energy saved with

    each optimization level relative to unopt. For the integer benchmarks,

    the floating point units are not used in the energy calculations. The

    performance overhead of the additional sleep instructions for all the

    benchmarks, except for Mpeg2D, is lower than 0.1%. For Mpeg2D,

    it ranges from 0.57% to 0.69%.
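For readers who want to reproduce the two quantities plotted in Figures 4 and 5, the sketch below shows one plausible way to compute them from a per-cycle state trace of a single unit. The trace format and function name are hypothetical, and the exact definitions used by the simulator are not given in the text, so this is an interpretation rather than the measurement code.

```python
def gating_metrics(trace):
    """Compute (percent of idle cycles gated, average gated cycles per wakeup).

    trace is a string with one character per cycle for a single unit:
    'B' busy, 'I' idle but powered, 'S' idle and power gated.
    """
    idle = sum(1 for state in trace if state in "IS")
    gated = trace.count("S")
    # A wakeup is counted at every transition out of the gated state.
    wakeups = sum(1 for prev, cur in zip(trace, trace[1:])
                  if prev == "S" and cur != "S")
    pct_gated = 100.0 * gated / idle if idle else 0.0
    avg_gated = gated / wakeups if wakeups else None
    return pct_gated, avg_gated
```

With the trace "BISSSSBIISSB", 6 of the 9 idle cycles are gated and the unit is woken twice, giving an average of 3 gated cycles per wakeup.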

    IV. CONCLUSIONS AND FUTURE WORK

    In this paper, we have explored a few compiler transformations

    on applications for enhancing opportunities to power gate functional

    units in an embedded processor. The optimizations discussed in this work, particularly the strength reduction optimizations, focus on

    integer operations. Therefore, the opportunities for power gating the

    integer multiplier increase significantly when the optimizations are

    performed. The library of compiler optimizations and power gating

    developed for this work will be released publicly after we perform

    sufficient testing of these passes with even bigger benchmarks. In

    the future, compiler transformations to improve power gating for FP

    operations will be explored, so that power gating of FP units can

    be used to reduce leakage in applications that perform extensive FP

    arithmetic operations.

    REFERENCES

    [1] R.K. Krishnamurthy et al. High-performance and low-voltage challenges for sub-45nm microprocessor circuits. Intl. Conf. ASIC, pages 283-286, 2005.

    [2] K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. Proc. ICECS, pages 167-173, 1998.

    [3] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing Static Power Dissipation by Functional Units in Superscalar Processors. Proc. 11th Intl. Conf. on Compiler Construction, pages 261-274, 2002.

    [4] W. Zhang et al. Compiler Support for Reducing Leakage Energy Consumption. DATE, pages 1146-1147, 2003.

    [5] Y. You, C. Lee, and J.K. Lee. Compiler Analysis and Supports for Leakage Power Reduction on Microprocessors. ACM TODAES, pages 147-164, 2006.

    [6] S. Roy, S. Katkoori, and N. Ranganathan. A Compiler Based Leakage Reduction Technique by Power-Gating Functional Units in Embedded Microprocessors. Proc. 20th Intl. Conf. VLSI Design, pages 215-220, 2007.

    [7] S. Roy, N. Ranganathan, and S. Katkoori. A Framework for Power Gating Functional Units in Embedded Microprocessors. Accepted to Trans. VLSI, 2008.

    [8] R. Wilson. The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler. Technical report, Stanford University, 1994.

    [9] M.D. Smith and G. Holloway. An Introduction to Machine SUIF and Its Portable Libraries for Analysis and Optimization. http://www.eecs.harvard.edu/hube/software/, 2002.

    [10] G. Theoduloz and D.S. Garcia. Machine SUIF Back-end for the ARM Architecture. http://lap2.epfl.ch/dev/ machsuif/arm backend, 2005.

    [11] P. Briggs and T.J. Harvey. Multiplication by Integer Constants. Technical report, Rice University, 1994.

    [12] M.N. Wegman and F.K. Zadeck. Constant Propagation with Conditional Branches. ACM TOPLAS, pages 231-236, 1991.

    [13] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.

    [14] J. Knoop, O. Ruthing, and B. Steffen. Optimal Code Motion: Theory and Practice. ACM TOPLAS, pages 1117-1155, 1994.

    [15] L. Rolaz. An Implementation of Lazy Code Motion for Machine SUIF. Technical report, Swiss Federal Institute of Technology, 2003.

    [16] K.D. Cooper, L.T. Simpson, and C.A. Vick. Operator Strength Reduction. ACM TOPLAS, pages 603-625, 2001.

    [17] D. Burger and T. Austin. The SimpleScalar Tool Set, version 2.0. Technical report TR-97-1342, University of Wisconsin-Madison, 1997.

    [18] M.R. Guthaus et al. MiBench: A free, commercially representative embedded benchmark suite. IEEE 4th Annual Workshop on Workload Characterization, pages 3-14, 2001.

    [19] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. IEEE/ACM MICRO, page 330, 1997.
