8/8/2019 Compiler ion
Exploring Compiler Optimizations for Enhancing Power
Gating
Soumyaroop Roy, Nagarajan Ranganathan, and Srinivas Katkoori
Department of Computer Science and Engineering
University of South Florida
Tampa, FL 33620
{sroy, ranganat, katkoori}@cse.usf.edu
Abstract: Power gating is a circuit level technique for reducing standby leakage in a circuit block by cutting off paths in it between the supply and the ground. A processor architecture that supports power gating of its resources may provide instructions that activate and deactivate those resources as part of the instruction set architecture. Adequate compiler support is then required so that the power gating instructions can be inserted into the code to deactivate the resources that remain idle for long periods of time during program execution. However, the resource usage in a program depends on the code generated by the compiler. Thus, the code transformations performed by the compiler have an influence on the power gating opportunities of the processor resources. In this work, we explore target independent compiler optimizations that modify the functional unit usage in the loops of a procedure to enhance the opportunities to deactivate functional units in an embedded processor architecture. The optimizations performed on the code are sparse conditional constant propagation, lazy code motion, weak strength reduction, and operator strength reduction. Insertion of power gating instructions is performed by inspecting the idleness of the units in the regions enclosed within loops. We model the processor architecture with power gating support around an ARM core and use the SUIF framework for compiler support. Finally, we use the SimpleScalar-ARM distribution to perform power and performance evaluation with a set of benchmarks from the MiBench and MediaBench suites. Experimental results indicate that the integer multiplier in the processor core can be power gated for up to 99% of its idle cycles for integer benchmarks, and up to 93% for floating point benchmarks, when all the optimizations are performed. Moreover, the energy due to leakage in the functional units for the code with all the optimizations performed can be up to 51% lower for integer benchmarks, and up to 21% lower for floating point benchmarks, than that for the unoptimized code.
I. INTRODUCTION AND MOTIVATION
The major components of power consumption in VLSI circuits
are dynamic power, short circuit power, and leakage power. In
recent years, due to scaling of threshold voltage and gate-oxide
thickness, the contribution of power due to leakage currents to the
total power of a circuit has increased significantly [1]. Therefore,
reducing leakage power has become a vital aspect of design of
low-power VLSI circuits. One of the common techniques used to
reduce standby leakage current in a circuit is power gating [2]. In
this technique, the path between the supply and ground is cut off
by inserting a sleep transistor between the supply and the circuit or
between the circuit and the ground. Since the deactivation (sleep) and
activation (wakeup) of a circuit block result in dynamic energy
overhead, ensuring that the block remains idle for a sufficiently long
time is critical in achieving overall energy savings. Power gating
is also applied at the architecture level to reduce leakage in the
components of a microprocessor during periods of their idleness. The
components are equipped with sleep transistors at the circuit level,
and the controls for the sleep transistors may be provided as special
instructions, called power gating instructions. The compiler is then
extended with adequate support to analyze the program behavior and
insert power gating instructions into the code where those components
are idle such that energy savings can be obtained during program
execution.
Several works have investigated the problem of power gating
functional units in a microprocessor to reduce leakage during their
idle periods at the compiler level. Rele et al. propose compiler level
support in [3] for power gating in superscalar processors. In [4],
Zhang et al. investigate power gating and input vector control in
VLIW architectures. You et al. apply dataflow analysis techniques
in [5] to find regions in the program where functional units are
idle. In [6], [7], Roy et al. present a compiler level framework
with architectural support in which the units are first designed based
on the specifications of those of the ARM processor and then
characterized for latency and power. The characterization of the units
is an important task because, due to the dynamic energy overhead
involved in activating and deactivating a circuit, the energy savings
through power gating depend not only on the period for which a
unit remains deactivated but also on the number of times that unit is
activated. Furthermore, the architecture proposed in [6] eliminates the
need for instructions to activate the functional units by automatically
waking them up at the decode stage of the processor pipeline.
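The dependence of the savings on both the gated period and the number of activations can be illustrated with a simple first-order model. This is a minimal sketch: the function name, its parameters, and the numbers in the usage note are hypothetical and are not taken from the characterization in [6].

```python
def net_leakage_savings(sleep_cycles, wakeups, leak_per_cycle,
                        gated_leak_per_cycle, transition_energy):
    """First-order estimate of the energy saved by power gating one unit.

    All energies use the same (arbitrary) unit. transition_energy is the
    assumed dynamic overhead of one sleep/wakeup pair.
    """
    # Leakage avoided while the unit is asleep.
    saved = sleep_cycles * (leak_per_cycle - gated_leak_per_cycle)
    # Dynamic overhead paid once per activation.
    overhead = wakeups * transition_energy
    return saved - overhead
```

With these made-up values, `net_leakage_savings(1000, 10, 2.0, 0.5, 50.0)` is positive (1000.0), while the same 1000 gated cycles split over 100 activations give a net loss (-3500.0), which is why long uninterrupted idle periods matter.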
The main role of the compiler support in all the works discussed
above has been twofold. First, it identifies regions in the program
during which functional units are idle. Then, power gating instructions
are inserted at the boundaries of such regions to deactivate
and activate them. However, the usage of the functional units in a
program region depends not only on the source code description
of the program but also on the code generated by the compiler.
Consequently, the power gating opportunities of such units are also
dependent on the code transformations performed by the compiler.
This aspect has not been addressed in any of the prior works, and it
forms the main motivation for this work. In this paper, we apply
a set of compiler optimizations to the source program that perform
code transformations, thereby modifying the functional unit usage
in the generated code. These code transformations are performed to
enhance the opportunities for power gating of the functional units.
The rest of the paper is organized as follows. Section II describes
the framework for power gating with compiler optimization techniques.
The experimental results are discussed in Section III, followed
by conclusions in Section IV.
II. POWER GATING AT COMPILER LEVEL WITH CODE
OPTIMIZATIONS
In this work, the SUIF research compiler [8] is used for com-
piler support. Figure 1 describes the compiler level approach for
power gating functional units. The source files of an application
are translated into SUIF intermediate representation (IR) by the
SUIF frontend. The SUIF IR of the code is then converted into the
SUIF Virtual Machine (SUIFVM) representation, which is the IR
978-1-4244-3828-0/09/$25.00 ©2009 IEEE
[Figure 1 flow: Source files -> SUIF IR Translation -> SUIFVM Translation -> Code Optimizations -> ARM Code Generation -> Insertion of Power Gating Instructions -> Assembly, Linking and Simulation. Side inputs: MachineSUIF Framework, Architecture Support.]
Fig. 1. Framework for power gating in an optimizing compiler
of the MachineSUIF framework [9]. All the code optimizations are
performed on the SUIFVM representation of the program. The ARM
code generation library [10] lowers the SUIFVM representation into
equivalent ARM assembly code. Power gating instructions are then
inserted into the assembly code to deactivate the units at the entry of
the regions where they are determined to be idle. This code is then
assembled and linked to generate an ARM executable whose runtime
execution is simulated on a cycle accurate simulator for performance
and energy evaluation. The architecture support details are used by
the assembler and the processor simulator to generate object code
and evaluate performance and energy statistics, respectively. Section
II-A describes the details of the architecture and assembler support
for the power gating instructions. Section II-B describes the compiler
optimizations developed and used in this work and, finally, Section
II-C describes the power gating technique used to insert the power
gating instructions into the assembly code.
A. Architecture and Assembler Support for Power Gating
We use the architectural support for power gating described in
[6]. The instruction set architecture (ISA) provides an explicit sleep
instruction, whose argument is the list of functional units that need
to be deactivated. The ISA, however, does not provide any explicit
wakeup instructions. When an instruction is decoded, the units
needed by the instruction are activated by the decode stage of the
pipeline. The library of functional units with power gating support
is characterized for 1 cycle wakeup latency for a clock period of
10 ns (100 MHz clock). The only modification made in this work is
that the barrel shifter is not equipped with any power gating support
because of the frequent usage of shift instructions in the code. Shift
instructions are generated during strength reduction of multiplication
operations with constant operands [11] and during generation of load
and store instructions with complex addressing modes [10].
A sleep control register (SCR) is added to the decode logic which
regulates the sleep controls for the functional units, as shown in
Figure 2. A 0 in the sleep control register indicates that the
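[Figure 2 contents: the ARM decode logic drives the SCR, whose bits control the functional units. SCR bits shown: FPAdder = 1, FPDsqt = 0, FPMult = 1, IntMult = 0.]
Fig. 2. Architecture support for power gating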
functional unit driven by that register bit is active (awake mode),
while a 1 indicates that it is inactive (sleep mode). In Figure 2,
the contents of the SCR indicate that the integer multiplier, and the
FP division and square root unit, are active, while the FP adder and
the FP multiplier are inactive. When a sleep instruction is decoded,
the SCR is modified to deactivate the functional units passed to
the sleep instruction as arguments. When an instruction requiring
a certain functional unit is decoded, the SCR is modified to activate
that functional unit so that it can be used by the time the instruction
enters the execution stage.
Assembly format of sleep instruction:
slp /* 4 bit argument */
Machine code format of sleep instruction:
Bits:   31-28  27-24  23-20  19-16  15-12  11-8  7-4  3-0
Value:  0      7      F      X      X      Arg   F    0
Arg bit 0: IntMult
Arg bit 1: FPAdd
Arg bit 2: FPMult
Arg bit 3: FPDSQT
Fig. 3. Assembly and machine code formats of the sleep instruction
The assembler support for translating the sleep instructions into
machine code is added to the GNU ARM assembler, which is part
of the binutils package. The format of the machine code for the
sleep instruction is chosen from the domain of exceptional opcodes
described in the ARM reference manual and is shown in Figure 3.
The assembly opcode for the sleep instruction is slp. The functional
units that need to be deactivated are encoded into a 4-bit integer, and
this is passed to the slp instruction as an argument. Bits 0-3 are
for deactivating the integer multiplier, floating point adder, floating
point multiplier, and floating point division and square root unit,
respectively. The machine code for the slp instruction has bits 7-0
as F0 and bits 31-20 as 07F. Bits 11-8 are used for encoding
the 4-bit argument passed to the sleep instruction.
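The encoding described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the argument is assumed to occupy bits 11-8 (as the bit ruler in Figure 3 suggests), and the don't-care bits 19-12 are assumed to be zero.

```python
# Argument bit positions as listed for the slp instruction.
UNIT_BITS = {"IntMult": 0, "FPAdd": 1, "FPMult": 2, "FPDSQT": 3}


def slp_argument(units):
    """Build the 4-bit slp argument from a list of unit names."""
    arg = 0
    for unit in units:
        arg |= 1 << UNIT_BITS[unit]
    return arg


def encode_slp(arg):
    """Return the 32-bit machine word for 'slp arg'.

    Assumptions: argument in bits 11-8, don't-care bits 19-12 set to 0.
    """
    if not 0 <= arg <= 0xF:
        raise ValueError("slp takes a 4-bit argument")
    return (0x07F << 20) | (arg << 8) | 0xF0
```

For example, deactivating the FP adder and the FP multiplier gives the argument 6, and `hex(encode_slp(6))` yields `0x7f006f0` under these assumptions.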
The decode logic in the SimpleScalar-ARM distribution is also
extended to include the definition of the slp instruction. After the
slp instruction is decoded by the decode logic, the 4-bit argument is
extracted and a logical OR operation is performed with the contents of
the SCR before the result is stored back in the SCR. When any otherinstruction is decoded, the SCR entry corresponding to the functional
unit needed by the instruction, is overwritten with a 0.
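The two SCR update rules can be modeled as follows. This is a minimal sketch rather than the SimpleScalar-ARM code: the function names are hypothetical, and the SCR is assumed to use the same bit positions as the slp argument (which the OR operation implies).

```python
# SCR bit positions, assumed identical to the slp argument bits.
UNIT_BITS = {"IntMult": 0, "FPAdd": 1, "FPMult": 2, "FPDSQT": 3}


def decode_slp(scr, arg):
    """slp decoded: OR the 4-bit argument into the SCR (1 = asleep)."""
    return (scr | arg) & 0xF


def decode_use(scr, unit):
    """Any other instruction decoded: clear the bit of the unit it needs."""
    return scr & ~(1 << UNIT_BITS[unit]) & 0xF
```

Starting from an all-awake SCR of 0, `decode_slp(0, 0b0110)` puts the FP adder and FP multiplier to sleep (SCR = 6), and a subsequent `decode_use(6, "FPAdd")` wakes only the FP adder (SCR = 4).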
B. Compiler Optimizations
We select four compiler optimizations that either modify arithmetic
instructions in the code or move them across basic blocks, thereby
changing the functional unit usage of the basic blocks. All the opti-
mizations described below are implemented as MachineSUIF passes
that perform code transformations on the SUIFVM representation of
the source program.
1) Sparse Conditional Constant Propagation: Sparse conditional
constant propagation (SCCP) [12] is a global constant propagation
technique in which propagation of constant temporaries is performed
across basic blocks in the presence of conditional branches. This
optimization is performed on the static single assignment (SSA)
representation of the control flow graph (CFG) form of the code. SSA
form of the CFG is a representation in which a target temporary can
be at the destination of only one instruction [13]. A constant folding
library is also implemented at the SUIFVM level that computes the
target temporary of an instruction whose operands are identified as
constants during this pass.
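The folding step can be sketched as below. This is not the SUIFVM library: the opcode names and the `fold` function are hypothetical, operands are modeled as Python integers (or `None` when not known to be constant), and results are assumed to wrap to 32 bits.

```python
import operator

# Hypothetical opcode names; division is left out to avoid divide-by-zero cases.
FOLDABLE = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
MASK32 = 0xFFFFFFFF


def fold(opcode, a, b):
    """Return the 32-bit result if both operands are known constants, else None."""
    if a is None or b is None or opcode not in FOLDABLE:
        return None
    return FOLDABLE[opcode](a, b) & MASK32
```

A multiply whose operands both fold to constants, such as `fold("mul", 6, 7)`, disappears from the generated code, which is one way constant propagation can reduce the use of the integer multiplier.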
2) Lazy Code Motion: Code motion optimizations perform
dataflow analyses to identify the instructions that compute the same
value and move such instructions to locations in the code so that
they are executed less frequently. Lazy code motion (LCM) [14] is a
global code motion technique that eliminates redundant instructions
in a procedure of a program. A descendant of partial redundancy
elimination, LCM performs common subexpression elimination along
with loop invariant code motion. For this work, a publicly available
LCM implementation [15] in MachineSUIF is used.
3) Weak Strength Reduction: Strength reduction is a term used
to refer to techniques that replace expensive operations with inexpensive
ones. Weak strength reduction (WSR) refers to replacing an
expression like x × 2 with the expression x+x. In this example, one
of the operands (the constant 2) in the multiplication expression has
been identified as a constant by the compiler. As part of this work,
the technique described in [11] is implemented for replacing integer
multiplication operations with a series of addition, subtraction, and
shift operations.
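As an illustration of the idea, the sketch below decomposes a multiplication by a non-negative constant into shifts and additions using the plain binary expansion of the constant. This is a simpler scheme than the technique of [11], which also uses subtractions to shorten the sequence; the function names are hypothetical.

```python
def shift_add_plan(c):
    """Return shift amounts s such that x * c == sum(x << s) for c >= 0."""
    if c < 0:
        raise ValueError("this sketch handles non-negative constants only")
    shifts = []
    bit = 0
    while c:
        if c & 1:            # this power of two contributes to the product
            shifts.append(bit)
        c >>= 1
        bit += 1
    return shifts


def apply_plan(x, shifts):
    """Evaluate the shift-and-add sequence for operand x."""
    total = 0
    for s in shifts:
        total += x << s
    return total
```

For the constant 10 the plan is `[1, 3]`, so `x * 10` becomes `(x << 1) + (x << 3)`. The multiplier is no longer needed, but the shifter is used more often, which is consistent with the decision not to power gate the barrel shifter.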
4) Operator Strength Reduction: A more powerful form of
strength reduction replaces repeated multiplications inside a loop with
repeated additions or subtractions. This is performed by identifying
induction variables, which are temporaries that get incremented or
decremented by a constant value during the execution of a loop,
and replacing multiplication operations involving such variables with
equivalent addition and subtraction operations. For this work, we implement
operator strength reduction (OSR) [16], which is performed
on the SSA form of the code. We also perform linear function test
replacement (LFTR) which replaces the uses of original induction
variables in comparison operations (branches) to render series of
computations useless. These useless computations are subsequently
removed by dead code elimination.
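The effect of OSR followed by LFTR can be shown on a small source-level example. This is a hand-written before/after pair, not the output of the OSR pass; the function names are hypothetical and the reduced version assumes a non-zero stride.

```python
def addresses_with_multiply(base, stride, n):
    """Original loop: one multiplication per iteration on the induction variable i."""
    return [base + i * stride for i in range(n)]


def addresses_strength_reduced(base, stride, n):
    """After OSR and LFTR: the multiply is replaced by a running addition,
    and the loop test uses the new variable instead of the counter i."""
    if n < 0 or stride == 0:
        raise ValueError("sketch assumes n >= 0 and a non-zero stride")
    out = []
    addr = base
    limit = base + n * stride    # computed once, outside the loop
    while addr != limit:         # LFTR: the test no longer needs i
        out.append(addr)
        addr += stride           # OSR: addition replaces i * stride
    return out
```

Both functions return the same list, for example `[100, 104, 108, 112, 116]` for `base=100, stride=4, n=5`, but the second keeps the multiplier idle for the whole loop, so the counter `i` and its increment become dead code.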
C. Insertion of Power Gating Instructions
Since the task of power gating at the compiler level requires
the details of the target architecture support, the insertion of power
gating instructions is not done on the SUIFVM representation of the
code. Instead, it is performed on the ARM assembly code as another
optimization pass in MachineSUIF. However, since the MachineSUIF
framework provides the capability to write optimization passes that
load target specific details during runtime, it is possible to write an
abstract optimization pass that attains a concrete structure only during
runtime. This avoids writing the same target dependent optimization
task for each target platform. The dead code elimination pass supplied
with the MachineSUIF distribution illustrates this feature. This approach
is adopted in implementing the power gating pass. The details
of the functional units, and the instruction set, including the format
of the sleep instruction, are obtained from the ARM code generation
library during runtime.
Due to the unavailability of a code instrumentation library for the
ARM backend on MachineSUIF, we do not use dynamic profiling
information for inserting sleep instructions into the code. Instead, we
use a static technique based on the control flow information of the
procedure. A loop tree [13], which is a data structure that maintains
information about all the loops in a function and the basic blocks
contained in those loops, is constructed. For all the functional units
that are not needed in the loop, a sleep instruction deactivating those
units is inserted at the entry to the loop. If the loop entry block has
only one external predecessor, the sleep instruction is inserted at the
end of the predecessor block. An external predecessor of a loop entry
block is a predecessor block which is not part of that loop. In case
there is more than one external predecessor of the entry block, a
new basic block is inserted with the sleep instruction and it is set as
the predecessor of the entry block. The original external predecessors
of the entry block are set as predecessors of the new block. This step
is performed first for the parent loop before any of its child loops so
as to ensure that no redundant sleep instructions are inserted.
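The insertion step for one loop can be sketched as follows. This is a simplified model, not the MachineSUIF pass: the `Block` class, the `insert_sleep` function, and the `pre_` naming are hypothetical, blocks hold only non-branch instructions as strings, and the `asleep` parameter models the parent-before-child ordering by carrying the mask already applied by the enclosing loop.

```python
from dataclasses import dataclass, field

# Bit positions follow the slp argument layout of Section II-A.
UNIT_BITS = {"IntMult": 0, "FPAdd": 1, "FPMult": 2, "FPDSQT": 3}


@dataclass
class Block:
    uses: set = field(default_factory=set)    # functional units used by the block
    code: list = field(default_factory=list)  # assembly lines (branches omitted)
    preds: list = field(default_factory=list)
    succs: list = field(default_factory=list)


def insert_sleep(cfg, loop_blocks, entry, asleep=0):
    """Gate the units a loop never uses; return the mask of units idle in the loop."""
    used = set().union(*(cfg[b].uses for b in loop_blocks))
    mask = sum(1 << bit for unit, bit in UNIT_BITS.items() if unit not in used)
    new_mask = mask & ~asleep    # skip units the enclosing loop already gated
    if new_mask:
        external = [p for p in cfg[entry].preds if p not in loop_blocks]
        slp = f"slp {new_mask}"
        if len(external) == 1:
            # Single external predecessor: append to the end of that block.
            cfg[external[0]].code.append(slp)
        else:
            # Otherwise route all external predecessors through a new block.
            pre = f"pre_{entry}"
            cfg[pre] = Block(code=[slp], preds=external, succs=[entry])
            for p in external:
                cfg[p].succs = [pre if s == entry else s for s in cfg[p].succs]
            inside = [p for p in cfg[entry].preds if p in loop_blocks]
            cfg[entry].preds = [pre] + inside
    return mask
```

Calling `insert_sleep` on a parent loop first and passing its return value as `asleep` for each child loop means a child only emits an slp for units that the parent still uses elsewhere, so no redundant sleep instructions are produced.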
III. EXPERIMENTAL RESULTS
The simulations of the ARM executables with the sleep instructions
are performed with the SimpleScalar-ARM toolset [17] for a set of
benchmarks from the embedded benchmark suites MiBench [18] and
MediaBench [19]. The benchmarks range in size from 1 source file
and 174 lines of source code (Dijkstra) to more than 15 source files
and 8000-9000 lines of source code (Mpeg2E and Mpeg2D). Only
Dijkstra and Sha are integer benchmarks, while the rest are floating
point benchmarks. For leakage energy calculations, the leakage power
characterization of the library of functional units developed in [6] is
used.
TABLE I
OPTIMIZATIONS PERFORMED ON THE BENCHMARKS
Legend   Description
unopt    No optimizations
sccp     SCCP
lcm      SCCP + LCM
wsr      SCCP + LCM + WSR
osr      SCCP + LCM + OSR + WSR
Fig. 4. Percentage of idle cycles for which the integer multiplier is kept deactivated.
The compiler optimizations discussed in Section II-B are performed
incrementally, generating four optimization levels as enumerated
in Table I. The results of power gating with the optimizations are
compared to those with the unoptimized code generated by Machine-
SUIF. Since two of the optimizations explored in this work remove
integer multiplication instructions from the code, the power gating
opportunities are improved significantly for the integer multiplier.
This can be seen in Figure 4, which plots the fraction of idle cycles for
Fig. 5. Average number of cycles for which the integer multiplier is kept turned off before it is woken up.
which the integer multiplier is kept deactivated. Except for SusanS,
the opportunity of power gating this unit improves significantly
in osr in all the benchmarks. This is because SusanS performs
integer multiplications on array members and since they are stored in
memory, these optimizations are not able to remove the multiplication
operations. Although SCCP and LCM hardly improve the power
gating period for the integer multiplier (except for SusanE), they
are important prerequisites for the strength reduction optimizations
to be effective. For the Sha benchmark, the integer multiplier is power
gated for 99% of its idle cycles with the code that is optimized with
osr. Among the floating point benchmarks, the integer multiplier for
Mpeg2D is power gated for 93% of its idle cycles in osr. Figure 5
Fig. 6. Percentage of leakage energy saved with each optimization over that
for the unoptimized code
shows the average number of cycles for which the integer multiplier
is power gated each time it is activated at the pipeline decode stage
after it decodes an integer multiply instruction. Comparing the results
in unopt and osr, the integer multiplier is power gated for a longer
period of time in the latter before it is woken up. This translates to
fewer activations of the multiplier unit, thereby lowering
the dynamic energy overhead of activating the multiplier. Finally,
Figure 6 plots the percentage of leakage energy saved with
each optimization level over that in unopt. For the integer benchmarks,
the floating point units are not used in the energy calculations. The
performance overhead of the additional sleep instructions for all the
benchmarks, except for Mpeg2D, is lower than 0.1%. For Mpeg2D,
it ranges from 0.57% to 0.69%.
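The two quantities plotted in Figures 4 and 5 can be derived from a per-cycle trace of one unit. The sketch below is one plausible way to compute them and is not the instrumentation used in the simulator: the trace format and the `gating_stats` function are assumptions made for illustration.

```python
def gating_stats(trace):
    """Compute gating metrics for one unit.

    trace: one (busy, asleep) pair of booleans per cycle (assumed format).
    Returns (fraction of idle cycles spent asleep,
             average asleep cycles per wakeup, or None if never woken).
    """
    idle = sum(1 for busy, _ in trace if not busy)
    gated = sum(1 for busy, asleep in trace if asleep and not busy)
    # A wakeup is a transition from asleep to awake between consecutive cycles.
    wakeups = sum(1 for prev, cur in zip(trace, trace[1:]) if prev[1] and not cur[1])
    fraction = gated / idle if idle else 0.0
    average = gated / wakeups if wakeups else None
    return fraction, average
```

For a nine-cycle trace with seven idle cycles, five of them asleep, and two wakeups, the function returns a gated fraction of 5/7 and an average sleep length of 2.5 cycles.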
IV. CONCLUSIONS AND FUTURE WORK
In this paper, we have explored a few compiler transformations
on applications for enhancing opportunities to power gate functional
units in an embedded processor. The optimizations discussed in
this work, particularly the strength reduction optimizations, focus on
integer operations. Therefore, the opportunities for power gating the
integer multiplier increase significantly when the optimizations are
performed. The library of compiler optimizations and power gating
developed for this work will be released publicly after we perform
sufficient testing of these passes with even bigger benchmarks. In
the future, compiler transformations to improve power gating for FP
operations will be explored, so that power gating of FP units can
be used to reduce leakage in applications that perform extensive FP
arithmetic operations.
REFERENCES
[1] R.K. Krishnamurthy et al. High-performance and low-voltage challenges for sub-45nm microprocessor circuits. Intl. Conf. ASIC, pages 283-286, 2005.
[2] K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. Proc. ICECS, pages 167-173, 1998.
[3] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing Static Power Dissipation by Functional Units in Superscalar Processors. Proc. 11th Intl. Conf. on Compiler Construction, pages 261-274, 2002.
[4] W. Zhang et al. Compiler Support for Reducing Leakage Energy Consumption. DATE, pages 1146-1147, 2003.
[5] Y. You, C. Lee, and J.K. Lee. Compiler Analysis and Supports for Leakage Power Reduction on Microprocessors. ACM TODAES, pages 147-164, 2006.
[6] S. Roy, S. Katkoori, and N. Ranganathan. A Compiler Based Leakage Reduction Technique by Power-Gating Functional Units in Embedded Microprocessors. Proc. 20th Intl. Conf. VLSI Design, pages 215-220, 2007.
[7] S. Roy, N. Ranganathan, and S. Katkoori. A Framework for Power Gating Functional Units in Embedded Microprocessors. Accepted to Trans. VLSI, 2008.
[8] R. Wilson. The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler. Technical report, Stanford University, 1994.
[9] M.D. Smith and G. Holloway. An Introduction to Machine SUIF and Its Portable Libraries for Analysis and Optimization. http://www.eecs.harvard.edu/hube/software/, 2002.
[10] G. Theoduloz and D.S. Garcia. Machine SUIF Back-end for the ARM Architecture. http://lap2.epfl.ch/dev/ machsuif/arm backend, 2005.
[11] P. Briggs and T.J. Harvey. Multiplication by Integer Constants. Technical report, Rice University, 1994.
[12] M.N. Wegman and F.K. Zadeck. Constant Propagation with Conditional Branches. ACM TOPLAS, pages 231-236, 1991.
[13] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.
[14] J. Knoop, O. Ruthing, and B. Steffen. Optimal Code Motion: Theory and Practice. ACM TOPLAS, pages 1117-1155, 1994.
[15] L. Rolaz. An Implementation of Lazy Code Motion for Machine SUIF. Technical report, Swiss Federal Institute of Technology, 2003.
[16] K.D. Cooper, L.T. Simpson, and C.A. Vick. Operator Strength Reduction. ACM TOPLAS, pages 603-625, 2001.
[17] D. Burger and T. Austin. The SimpleScalar Tool Set, version 2.0. Technical report, TR-97-1342, University of Wisconsin-Madison, 1997.
[18] M.R. Guthaus et al. MiBench: A free, commercially representative embedded benchmark suite. IEEE 4th Annual Workshop on Workload Characterization, pages 3-14, 2001.
[19] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. IEEE/ACM MICRO, page 330, 1997.