Sifting Out the Mud: a Low-Level Treatment of Reusable Code in C++ Programs ∗

Bjorn De Sutter
Bruno De Bus

Koen De Bosschere
Ghent University, Belgium

[email protected]

ABSTRACT
More and more computers are being incorporated in devices where the available amount of memory is limited. This contrasts with the increasing need for additional functionality and the need for rapid application development. While object-oriented programming languages, supporting code reusability by providing mechanisms such as inheritance and templates, allow fast development of complex applications, they have a detrimental effect on program size. As a result, more and more research is focusing on the automated reduction of program size. Libraries and components, developed with reusability in mind, often offer much more functionality than what the application developer needs. It is very hard, however, to extract only the necessary functionality from a library. We propose applying link-time code compaction techniques to sift out the mud by eliminating useless code and by reusing binary code as much as possible. The code size reductions obtained using our techniques range from 43 to 64% for a set of 7 C++ programs. The main techniques discussed are combined unreachable code and data elimination, aggressive whole-program optimization, and code factoring. They involve an average execution time penalty of 11%.

1. INTRODUCTION
More and more computers are being incorporated in devices where the available amount of memory is limited, such as PDAs, set-top boxes, wearables, and mobile and embedded systems in general. The limitations on memory size result from considerations such as space, weight, power consumption and production cost.

At the same time, ever more sophisticated applications have to be executed on these devices, such as encryption and speech recognition, often accompanied by all kinds of eye candy and fancy GUIs. These applications have to be developed in shorter and shorter design cycles. More complex applications, i.e., applications providing more functionality, generally mean larger applications. Additional functionality is, however, not the only reason why these applications become larger. Another important reason is the use of modern software engineering techniques that aim at the use of components or code libraries. These building blocks are primarily developed with reusability and generality in mind. An application developer often uses only part of a component or a library, but because of the complex structure of these building blocks, the linker often links a lot of useless code and data into the application. Even useful code that is linked with the application will often involve some overhead, since it was not programmed with that specific application context in mind.

∗The work of De Sutter was supported by the Fund for Scientific Research – Flanders under grant 3G001998. De Bus is supported by a grant from the ‘Flemish Institute for the Promotion of Scientific-Technological Research in Industry’ (IWT).

As object-oriented programming languages (OOPLs) encourage the programmer to develop and use reusable code, it is no surprise that the average application written in an OOPL contains large amounts of useless code.

Besides this, the facilities provided by OOPLs to develop reusable code at the source code level, such as templates and inheritance, often result in duplicate code fragments at the low level, thus needlessly increasing program size again. For each instantiation of a template class, e.g., a separate low-level instantiation is generated. While at the source code level the instantiations have different types, their low-level implementations are often the same. Pointers to different types of objects, e.g., have different types at the source code level, but are all simple addresses at the lower level.

Recent years have seen growing interest in research on code and data compaction, i.e., the transformation of programs to reduce their memory footprint while retaining the property that they can be executed directly without requiring any decompression. Work on code compaction has generally focused on identifying repeated instruction sequences within a program and abstracting them into functions [5, 13] or macro-instructions in programmable execution environments such as the Java Virtual Machine [4]. Work on data compaction is limited to simple literal address removal from object files [19]. Whereas program compaction compacts code and data in a program, program extraction identifies those parts of libraries, classes or run-time environments that are needed for a specific application. To our knowledge, all such proposed techniques [1, 20, 21] are language dependent, requiring higher-level descriptions of libraries, classes or run-time environments and, above all, type information. This severely limits their applicability to, e.g., libraries that are available in object format only.

In the past we have proposed applying code compaction [7] on a very general program representation: binary programs. The techniques discussed were limited to code compaction only and were evaluated for C programs only. How data was used by the programs was not analyzed.

It is, however, not difficult to see that there are significant dependences between the code and data components of an executable program. For example, unused library code that is uselessly linked with a program will often be accompanied by useless data (rather dated empirical evidence indicates that 5–10% of the library code linked with a program is unreachable [16, 18]). Code optimizations such as dead and unreachable code elimination can cause data to become unreachable by getting rid of the code referring to that data. Conversely, the elimination of unused data that contains pointers to code, such as jump tables and virtual method tables, can cause code to become unreachable, and potentially eliminable. Indeed, the two optimizations are synergistic: the elimination of data can enable additional elimination of code, which can enable the elimination of even more data, and so on. Since OOPL programming constructs (such as virtual method calls) extensively use procedure pointers in their low-level implementations, an analysis of how data is being used is even more important for OOPL applications than it is for applications written in other languages such as Fortran or C.

In this paper, we analyze the performance of link-time program compaction on C++ programs. Extensions to the previously discussed algorithms are proposed and evaluated. Code size reductions ranging from 43 to 64% are obtained, averaging around 54%. The major contributors to the achieved results are combined unreachable code and data elimination, aggressive whole-program optimization and much improved code factoring. They are discussed in Sections 2, 3 and 4. In Section 5 the optimizations are evaluated on a number of C++ applications. Related work is discussed in Section 6, after which some conclusions are drawn.

2. UNREACHABLE CODE AND DATA ELIMINATION

During the linking process, the search for the library code and data to be included in the binary is guided by symbol information. If a symbol is referred to in an already linked-in object file, another object file defining that symbol will be linked with the binary as well, possibly introducing new unresolved symbol references. This is an iterative process that finishes when all referenced symbols are defined. Whether a reference to a symbol will actually be used by the specific program is not taken into account during this process. As a result, a lot of useless code and data is linked into the program; application programmers often need only part of the functionality provided by the libraries they use. The more complex the structure of the libraries or components the programmer uses, the more useless code and data will be linked with the program. The more general the library code used by an application developer, the more useless code will be linked with his application.

One can argue that libraries that attack this problem from the start, e.g., by providing a separate object file per method or data member, are a possible solution. Although this could be a (partial) solution, it is not viable for everybody:

M := R := {program entry point}
while (M ≠ ∅)
    select some p ∈ M
    M := M \ {p}
    ∀ successor q of p
        if (q ∉ R)
            M := M ∪ {q}
            R := R ∪ {q}

Figure 1: Simple unreachable code elimination.

often code libraries are only available in object format. In that case smarter linkers or post-link-time optimizers are the only viable solution. But even when the programmer has full control over the library source code and over the way the object files are generated, unreachable code elimination after linking can be useful when, e.g., conditional constant propagation is able to determine that certain paths within reachable procedures cannot be executed. Procedure calls on those paths can then be eliminated, possibly resulting in more procedures becoming unreachable.

Unreachable code elimination applied to a whole program after linking can help get rid of large parts of the uselessly linked-in code. For a simple program in which all control flow transfers are direct, unreachable code elimination is straightforward. It suffices to mark the program entry point as reachable and then iteratively mark all program points that are reachable from already marked program points. The pseudo-code for such an unreachable code eliminator is depicted in Figure 1. M is the set of program points marked for the iterative process, R the set of reachable points.

With indirect control flow transfers, this straightforward approach is no longer viable: when analyzing binary programs, with only the code, the data and relocation information at our disposal, it is very hard, if not impossible, to find the exact set of possible targets of an indirect control flow transfer. For program points that can be the target of such a transfer, it is therefore impossible to detect which control flow transfers can or cannot transfer control to them.

The number of possible targets of indirect control flow transfers is, however, limited to those points whose address can be found somewhere in the program (encoded in instructions or stored in the data sections). If this is not the case, their address cannot be loaded or produced, and hence control cannot be transferred to them.

Unreachable code elimination can easily handle this by initially marking all program points that have their address stored somewhere. In practice this is overly conservative, as unreachable code linked with the program will often be accompanied by inaccessible data containing inaccessible code addresses. This is especially the case for OOPLs, where method invocations are very often indirect and where, e.g., code addresses are stored in virtual method lookup tables.

So for optimal performance of unreachable code elimination, we need unreachable data elimination as well. However, as with indirect control flow transfers, it is very hard to analyze all load instructions in a program. Fortunately, one unresolved load does not result in all the data becoming accessible. This is due to the way compilers generate the object files that are linked by the linker.

The object module generated by a compiler from a source module typically consists of several code and data sections; examples of such sections include the code section, the constant data section, the zero-initialized data section, the literal address section, etc. The linker combines a number of such object modules into an executable program: in the process, it puts all the sections in their final order and location. Sections of the same type coming from different object modules are typically combined into a single section of that type in the final executable. To avoid confusion, in the remainder of this paper the original sections in the object files will be called code and data blocks, or blocks for short. A section in an executable file is thus a juxtaposition of blocks from the object modules from which the executable was constructed.

To access a memory location, the address of that location has to be loaded or computed into a register (possibly implicitly, as a displacement from a base address). This register is then used as a source operand to access that location. Now consider the locations that an address computed in this way could possibly refer to. In general, when generating the blocks in one object module, the compiler does not have any information about the blocks in other object modules, such as their size or the order in which they will be linked together. It therefore cannot make any assumptions about the eventual locations of these blocks in the final executable. This means that in the object code, computations on an address pointing into some block can never yield an address pointing into some other block, because the displacement between the two blocks is not known at compile time. This property holds for all the blocks in the final executable program. Consequently, the data in a block is dead unless a pointer to that block is found in some other live block (e.g., a pointer to a data block from a code block, or vice versa) or is explicitly computed in the code.

The unreachable code eliminator is therefore combined with unreachable data elimination, as depicted in Figure 2. R now contains all reachable code and data locations. If a data location is accessed by a reachable instruction, it is added to the set of reachable locations. If a reachable data location contains a code address, the program point at that location becomes reachable. If the data location holds a data address, that address can be used to access its whole block; therefore the whole block is recursively added to the set of reachable locations.

Table 1 gives some insight into the distribution of the sizes of the blocks containing statically allocated data for the programs evaluated in this paper. Note that more than one third of the statically allocated data consists of code or data addresses. For the SPECint2000 benchmark suite, which mainly consists of C programs, this is only one fifth. This validates our intuition that C++ programs contain many more pointers than C programs. About 85% of those addresses are located in read-only data sections. Note how many of

M := R := {program entry point}
while (M ≠ ∅)
    select some p ∈ M
    M := M \ {p}
    if (p is a code address)
        ∀ successor q of p
            if (q ∉ R)
                M := M ∪ {q}
                R := R ∪ {q}
        if (instruction at p accesses data at location l)
            if (l ∉ R)
                M := M ∪ {l}
                R := R ∪ {l}
    if (p is a data address)
        if (data d stored at location p is a code address)
            if (d ∉ R)
                M := M ∪ {d}
                R := R ∪ {d}
        if (data d stored at location p is a data address)
            ∀ locations l in Block(d)
                if (l ∉ R)
                    M := M ∪ {l}
                    R := R ∪ {l}

Figure 2: Pseudo code for the combined unreachable code and data elimination.

the data blocks contain at most one or two addresses. In blocks that are 16 bytes large, the last 8 bytes are very often padding and thus contain no real data or addresses. It is clear that most of the blocks are small enough to put severe restrictions on the possible uses of the data addresses stored in the program.

As for the elimination of useless code linked into a program from libraries, it is obvious that the modular design of software in OOPLs, whereby related code is stored in separate source code modules, results in a strong correlation between reachable data blocks and reachable code: in other words, when a whole useless object file is linked with the program, it is very likely that our combined unreachable code and data elimination will detect this.

3. AGGRESSIVE WHOLE-PROGRAM OPTIMIZATION

Due to the separate compilation of source code modules, the compiler has to make conservative assumptions about procedures and data declared outside the compilation unit. This results in inaccuracies in the analyses performed by the compiler.

3.1 Optimization opportunities
Constant propagation, for example, cannot propagate constants over module boundaries, however complex the constant propagation algorithm may be. To some extent, this may be overcome by collecting summary information about the other source code files used in the application, but even then conservative assumptions are still necessary for source files written in other languages or for (already compiled) object files in libraries. The same holds for many other analyses.

As a result of these inaccuracies in the analyses, certain optimizations obtain poor results or cannot be performed at all:

data                                 3326576 bytes
read-only data                       1587856 bytes
addresses in data                     943556 bytes
read-only addresses in data           791700 bytes
block size = 8 bytes                   34270 blocks
block size = 16 bytes                   3687 blocks
16 bytes < block size ≤ 64 bytes        1480 blocks
64 bytes < block size ≤ 256 bytes       1503 blocks
256 bytes < block size ≤ 1KB             820 blocks
1KB < block size ≤ 4KB                   449 blocks
4KB < block size ≤ 16KB                   91 blocks
16KB < block size ≤ 32KB                   8 blocks
32KB < block size ≤ 64KB                   1 block
64KB < block size ≤ 128KB                  3 blocks
128KB < block size ≤ 256KB                 1 block

Table 1: Some numbers on statically allocated non-zero-initialized data and addresses, summed over the benchmark programs evaluated in this paper.

– When optimizing the stack space used by a procedure, a compiler has to follow calling conventions if the procedure can be referenced from within other compilation modules. Often these conventions are far from optimal.

– Since the compiler does not know the final location of data in the application, the optimization of address calculations is largely limited to calculations inside data blocks.

– Although a profile-guided compiler can optimize code layout with respect to the directions most frequently taken on conditional branches, it cannot organize the code as a whole in such a way that instruction cache misses are minimized. A more sophisticated linker than the ones traditionally used can overcome this.

3.2 Optimizations
As stated, conservative assumptions result in inefficiencies in the final application. To get rid of them, and to facilitate other optimizations, our compacting linker performs some well-known analyses and applies a number of well-known optimizations to the code.

On the one hand, since the linker has the entire program at its disposal, its analyses can be more accurate and, as a result, its optimizations can be more aggressive. On the other hand, the link-time optimizer is hampered by a lack of semantic information because of the low-level language it works on. Without going into details, we give a short overview of the most important optimizations we want to perform. The reader is referred to [7] for more details on these optimizations.

3.2.1 Interprocedural constant propagation
Since the entire program is available for inspection at link time, interprocedural constant propagation is performed on the whole program. This is limited to propagating constant register contents through the code. Values are not propagated through memory. The reason is the conservative assumption that, except for read-only memory, all loads possibly alias with stores. This results from the fact that alias analysis is extremely difficult on binary programs.

The results from the constant propagator are used to guide different optimizations such as constant folding, elimination of conditional branches, elimination of idempotent instructions (i.e., instructions that do not change register contents), and strength reduction [2].

3.2.2 Dead code elimination
Instructions that produce values that are not used by other instructions can be eliminated. The opportunities for dead code elimination result primarily from additional constants found by the constant propagator and from other optimizations.

If a small constant can be encoded in an instruction as an immediate operand, the constant need not occupy a temporary register, and therefore there is no need for an instruction producing that register's value.

If a conditional branch is eliminated after constant propagation has found the branching condition to evaluate to a constant, partially redundant expressions (i.e., expressions that produce values used only on some of the paths following the producer) can become fully redundant.

3.2.3 Address calculation optimization
Indirect control flow transfers are transformed into direct ones if the constant propagator finds the target address to be a constant. For indirect data accesses at constant addresses (i.e., where the address of the data is loaded before the data itself is accessed), the instruction loading the address can often be replaced by an instruction that calculates the address from another address already present in some other register. This is most important for 64-bit architectures, where 64-bit addresses cannot be encoded in one or even two instructions and the generation of addresses is therefore quite expensive: usually loading an address from a directly accessible table is the best option.

But even for architectures with a smaller address space, optimizing address calculations at link time can result in significant gains. Consider two subsequent load/store instructions that access different memory locations. If the displacement between the two locations is small enough, one common base address can be used by both instructions, with two different offsets. The compiler is limited by data block boundaries when applying such base pointer optimizations, since the final displacement between two locations in different blocks is not known to the compiler, even if the two data blocks come from the same compilation module. A link-time optimizer is not hampered by this.

3.2.4 Load/store avoidance
Spill code is eliminated to some extent when a register is freed by other optimizations. More important is the elimination of useless spill code introduced because of the calling conventions the compiler had to obey.

3.2.5 Inlining
Compilers that compile one source code module at a time can only inline procedures at call sites in the same module. If no space trade-offs are to be made, this will only be done for very small procedures or procedures with a single call site. The latter requires that the procedure not be accessible from outside its own module. If the programmer does not explicitly state this (e.g., by making a method private or by declaring a procedure static in C source code), the compiler must assume that there will be other call sites.

The linker is not hampered by this and can therefore apply inlining across the original module boundaries, and in those cases where the compiler chose not to apply inlining because of a lack of information about the other modules.

3.2.6 Instruction scheduling and code layout
After carrying out all these optimizations, the resulting binary differs considerably from the original binary. The original schedule the compiler generated is thus no longer valid, and the code is therefore rescheduled. No no-ops are inserted, however, as they would increase the code size.

Since we have the entire binary at our disposal, we also reorganize the whole basic block layout to optimize instruction cache behavior. A form of profile-guided hot-cold optimization is implemented: frequently executed parts of the program are collected and laid out close to each other to avoid conflict misses in the instruction cache.

Note that instruction rescheduling and code layout do not contribute to code size reductions. They are important for speeding up program execution, however. Together with the other optimizations discussed in this section, they compensate for the execution overhead caused by factoring out code fragments that occur multiple times.

4. CODE FACTORING
Programming constructs such as templates and inheritance often result in multiple low-level implementations of source code fragments that show great similarity. Significant compaction of programs can be obtained by factoring out instruction sequences that occur multiple times: the sequence is put in a separate procedure and all occurrences of the sequence are replaced by calls to that single procedure. This is the inverse of inlining.

Looking for instruction sequences that occur multiple times can be done at various levels of granularity. These are discussed in the following subsections.

4.1 Procedures
The use of templates often results in identical procedures in the final application. This is typically the case for sorting and searching operations in container classes that involve only pointer operations. These pointer operations (at the assembly level) are independent of the type of object for which the template class was instantiated.

Looking for identical procedures is performed by pairwise comparison of the control flow graphs of all procedures. To speed up the comparisons, a fingerprint of each procedure is used, consisting of the number of basic blocks and the number of instructions in the procedure. These are collected in a prepass.

If two or more identical procedures have been found, one is selected and all pointers to the others are converted to point to the selected one. These pointers can be direct procedure calls, interprocedural control flow transfers other than calls, or addresses stored in the data sections of the program. Unreachable code elimination will then take care of the elimination of the now obsolete copies of the procedure.

Note that the search for identical procedures is an iterative process. Consider two identical procedures A and B, of which A is selected. All procedures calling B will now call A, thus possibly becoming identical to other procedures that called A from the start.

Also note that this is a form of code factoring that involves no overhead, since no additional calls are inserted into the program.

4.2 Basic Blocks
The use of inheritance and virtual method calls results in procedures that are often not identical but very much alike, simply because they provide similar functionality. It therefore often happens that the structure of these procedures is alike and that some of the basic blocks in those structures are alike, or at least functionally equivalent. If they are identical, they can be factored out. If they are functionally equivalent, they can possibly be made identical by register renaming, after which they can be factored out.

The basic block factoring discussed in our previous work [7] covers the detection and creation of identical blocks, after which they are factored out. The detection and renaming system consists of the following parts:

– Control flow separation: control flow instructions, such as branches and procedure calls, are separated into basic blocks containing one instruction only. This is done to ensure that functionally equivalent blocks that have different successors in the control flow graph can still be factored out.

– A fingerprint system: fingerprints are calculated for basic blocks based on the opcodes of the instructions in the blocks. The fingerprints are used in hash keys that allow the grouping of basic blocks into buckets in such a way that basic blocks in different buckets are neither identical nor renamable to each other. Using a hash table to collect candidates for basic block factoring significantly speeds up the factoring process.

– A selection mechanism: using liveness information, one basic block in a bucket is selected as the target basic block. The other blocks in the bucket are then candidates for renaming to this target block. Liveness information is used to select the block that allows the greatest freedom for the others to be renamed to it.

– A renaming mechanism: blocks identical to the target block are trivial candidates for factoring, provided that the liveness information permits factoring (e.g., the return address register must not be live over the to-be-renamed block). Blocks that are not identical to the target block can possibly be made identical by applying register renaming and inserting the necessary register copy instructions before and after the block.

– Actual factoring: once all identical blocks have been identified, they are factored out.
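The fingerprint-and-bucket step can be illustrated with the following sketch. The function name and the (opcode, dest, srcs) tuple layout are our own illustration; the real system computes fingerprints over machine-code opcodes, but the principle is the same: the fingerprint ignores registers, so renamable blocks always land in the same bucket.

```python
from collections import defaultdict

def bucket_blocks(blocks):
    """Group basic blocks by an opcode-only fingerprint.  Blocks in
    different buckets can be neither identical nor renamable to each
    other, since renaming changes registers but never opcodes.
    blocks: list of lists of (opcode, dest, src_regs) tuples."""
    buckets = defaultdict(list)
    for block in blocks:
        fingerprint = hash(tuple(op for op, _, _ in block))
        buckets[fingerprint].append(block)
    # only buckets with at least two blocks are factoring candidates
    return [group for group in buckets.values() if len(group) > 1]
```

Two blocks with the same opcode sequence but different registers share a bucket and become renaming candidates; a block with a different opcode sequence is filtered out without any pairwise comparison.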

This is all discussed in more detail in [7]. The most important part of the system (apart from efficiency concerns) is the register renaming. In this paper, we propose a new register renaming algorithm that performs better than the old algorithm in [7].

The new renaming algorithm works as follows:

1. Compare the immediate operands encoded in the instructions. If they are not the same, the two blocks cannot be functionally equivalent and no factoring will be possible.

2. Compare the dependency graphs of the two blocks. We already know that the opcodes of the instructions are the same, since the blocks were in the same bucket. If they also have the same dependency graphs, they are functionally equivalent. If not, renaming is considered impossible.

3. Detect the register copy operations that need to be inserted before the to-be-renamed block. These copies must copy values from live registers on entry to the to-be-renamed block to the corresponding registers in the target block. If a register needs to be the destination of more than one copy operation, no renaming is possible.

4. Detect the register copy operations that need to be inserted after the to-be-renamed block. These are necessary for destination operand registers defined in the to-be-renamed block that reach the end of that block and that differ from the corresponding register in the target block. Again, if a register needs to be the destination of more than one copy operation, no renaming is possible.

5. Detect which register spills (i.e., register copy operations before and after the block) are necessary around the to-be-renamed block: these are registers live after the to-be-renamed block that are defined in the target block or in copy instructions, but not defined in the to-be-renamed block. If there are no free registers that can be used as temporary registers to spill to, renaming is not considered possible.

6. Detect whether the register copy instructions to be inserted before and after the block can be inserted without requiring additional temporary registers.

7. Count the number of register copy operations that need to be inserted. If this number plus one (for the call instruction inserted by factoring) is greater than or equal to the number of instructions in the block itself, no renaming is applied, since renaming and factoring together would not result in a code size reduction.

8. Perform the actual renaming and insert the necessary moves.
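Steps 1 to 4 and the profitability test of step 7 can be sketched as follows. Spill handling (step 5) and copy-cycle detection (step 6) are deliberately omitted, and the (opcode, immediate, dest, srcs) instruction tuples are a toy encoding of our own, so this is a simplified illustration rather than the full algorithm.

```python
def renaming_copies(target, cand, live_out):
    """Simplified steps 1-4 and 7: spills (step 5) and copy-cycle
    detection (step 6) are omitted.  An instruction is a tuple
    (opcode, immediate, dest_reg, src_regs); live_out is the set of
    candidate registers live after the block.  Returns the entry and
    exit copy maps, or None when renaming is not possible."""
    if len(target) != len(cand):
        return None
    entry = {}     # target_reg: cand_reg -> insert 'target_reg := cand_reg'
    local = {}     # cand_reg -> target_reg for values defined in the block
    tdefined = set()
    for (top, ti, tdst, tsrc), (cop, ci, cdst, csrc) in zip(target, cand):
        if top != cop or ti != ci:       # step 1 (opcodes match per bucket)
            return None
        for ts, cs in zip(tsrc, csrc):
            if cs in local:              # step 2: local use-def edges match?
                if local[cs] != ts:
                    return None
            else:
                if ts in tdefined:       # step 2: dependency graphs differ
                    return None
                if entry.setdefault(ts, cs) != cs:
                    return None          # step 3: same copy destination twice
        local[cdst] = tdst
        tdefined.add(tdst)
    # step 4: candidate registers live after the block that ended up in a
    # different target register need a copy back after the block
    exit_ = {cs: ts for cs, ts in local.items() if cs in live_out and cs != ts}
    entry = {ts: cs for ts, cs in entry.items() if ts != cs}
    # step 7: renaming must actually shrink the program (spills ignored here)
    if len(entry) + len(exit_) + 1 >= len(cand):
        return None
    return entry, exit_
```

On the blocks of Figure 3, the sketch finds the entry copies 'r6 := r2' and 'r7 := r2' and the exit copy 'r7 := r5', and rejects renaming in the opposite direction because two copies would share the destination r2.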

Consider the example basic blocks in Figure 3, in which only the first two instructions need renaming.

Step 1 of the renaming process is trivial and needs no further explanation. Step 2 needs some clarification. The dependency graphs of both blocks are identical in the example: the local use-definition relationships inside both blocks are identical. Use-definition relationships to the outside of the block (i.e., involving registers defined prior to the block) are not identical (r2 is used twice in the to-be-renamed block, while r6 and r7 are used in the target block), but so far we do not have to bother about that.

Step 3 identifies two copy operations that need to be inserted before the block: ‘r6 := r2’ and ‘r7 := r2’. Note that renaming these blocks vice versa would not be possible, since two copy operations (‘r2 := r6’ and ‘r2 := r7’) with the same destination operand cannot be inserted: one of the two copied values would not reach the block entry point.

Since register r4 is not live after the to-be-renamed block (it is a local temporary register), step 4 identifies only one copy operation that needs to be inserted after the block: ‘r7 := r5’.

There are two registers (r5 and r6) whose liveness ranges span the whole to-be-renamed block. Since r5 is defined in the target block, factoring out the blocks can only be done if the value of r5 is temporarily spilled to some other register: two spill instructions need to be inserted, ‘r0 := r5’ before the block and ‘r5 := r0’ after the block. Step 5 takes care of this. Since the value of r6 must also stay unchanged, its value has to be spilled as well, say to register r11.

Step 6 of the algorithm detects no problems for the example code fragments. Imagine, however, a case where two copy operations ‘r6 := r2’ and ‘r2 := r6’ are to be inserted before the block. Since this requires exchanging two values rather than just copying them, additional temporary registers would be required. Cases like these are detected by looking for cycles in the graph that has registers as its nodes and copy operations as its edges. Our algorithm does not yet handle the most general case.

The number of instructions added is 7 + 1 = 8, while 9 instructions can be eliminated if the renamed block is factored out, so the renaming is applied and the necessary instructions are inserted.

The algorithm described in [7] differs from this approach in two ways:

– Spill code is not considered, and thus blocks around which spill code is necessary are not renamed.

– Steps 2, 3 and 4 are performed in a single pass over the block's instructions. Using liveness information during this single pass, the required register copy operations are detected; this is equivalent to our steps 3 and 4. At each instruction, register

target block:        to-be-renamed block:  renamed:

                                           r11 := r6;
                                           r6 := r2;
                                           r7 := r2;
                                           r0 := r5;
r3 := r1 + r6;       r4 := r1 + r2;        r3 := r1 + r6;
r5 := r3 + r7;       r7 := r4 + r2;        r5 := r3 + r7;
r9 := r8 * r8;       r9 := r8 * r8;        r9 := r8 * r8;
r10 := r9 + r8;      r10 := r9 + r8;       r10 := r9 + r8;
r8 := r10 - 0x10;    r8 := r10 - 0x10;     r8 := r10 - 0x10;
r9 := r10 - r8;      r9 := r10 - r8;       r9 := r10 - r8;
r10 := r8 * r9;      r10 := r8 * r9;       r10 := r8 * r9;
r10 := r10 + 0x10;   r10 := r10 + 0x10;    r10 := r10 + 0x10;
r9 := r10 - r8;      r9 := r10 - r8;       r9 := r10 - r8;
                                           r7 := r5;
                                           r5 := r0;
                                           r6 := r11;

live registers:
r5,r8,r9,r10         r5,r6,r7,r8,r9,r10    r5,r6,r7,r8,r9,r10

Figure 3: Example basic blocks with the live register sets after the blocks. On the right, the renamed block with the inserted register rename instructions is depicted.

operands in the to-be-renamed block are mapped to the corresponding register in the target block. At the same time, all subsequent uses of those registers get the same mapping. (A register can be mapped to itself if no renaming of that register is necessary.) If during this process a register operand is ever to be mapped to two different registers, no renaming is applied, since the dependency graphs are not the same.

Consider our example program for this algorithm. Because the old algorithm did not introduce spill code, renaming was not applied. But even if inserting spill code had been an option, no renaming would have been done.

After the first instruction is evaluated, register operand r2 of both the first and the second instruction is mapped to r6. Evaluating the second instruction requires mapping the register operand r2 of the second instruction to r7. Since two mappings are needed for register operand r2, no renaming is applied.

The basic problem with the old algorithm, which allows only one mapping per register operand, is that this constraint is a sufficient condition not only for the local use-definition relationships to be identical, but also for the global relationships. Global use-definition relationships should, however, not play any role in the renaming of code blocks that have unique entry and exit points.

Cooper and McIntosh [5] have proposed renaming basic blocks by globally renaming registers instead of inserting register copy operations. This has the disadvantage that renaming to make one pair of blocks identical can affect (or even undo) the renaming for another pair of blocks. We feel that inserting copy operations is preferable, especially since a separate copy elimination phase after the code factoring will be able to eliminate most of the copy instructions in those cases where global renaming could have been applied to make the blocks identical.

4.3 Code Regions

It should be no surprise that multiple code fragments consisting of more than one basic block can be identical. Such fragments can be factored out as well, provided they have a unique entry point and a unique exit point. How this can be done efficiently is discussed in detail in [7]. In this paper, it suffices to note that the fingerprinting system implemented for basic blocks is adapted to include information on the structure of code regions, and that one has to be much more careful about the preservation of the return address value throughout the region, as the region may contain procedure calls for which it is often very hard to prove that a register's value is preserved over the call.

4.4 Partially Matched Basic Blocks

Our older research on factoring partially matched basic blocks involved two important special cases: saves and restores of callee-saved registers. Most of the time the saves occur in the entry block of a procedure and the restores occur in blocks ending with a return instruction. While factoring register save sequences involves some overhead because of the extra call and return, this is not the case for register restores, as tail-call optimization can often eliminate the overhead. How this is done is discussed in more detail in [7].

Whereas we previously found that a more general factoring of partially matched blocks is computationally quite expensive without offering significant code size reductions for C programs, this is not the case for C++ programs. The reason is again that C++ programs involve much higher degrees of code reuse.

Potentially, all instruction sequences of all possible lengths have to be compared to each other. To keep the computational cost of doing so reasonable, we restrict the general factoring of partially matched basic blocks to identical instruction sequences. No renaming is attempted.

Even then an enormous number of instruction sequences has to be compared. The most efficient way we found to do this is as follows:

1. In a single pass over all instructions, all occurring sequences of up to 32 instructions are collected in an array. Each element of the array is a structure holding a 32x32-bit number and an instruction number: the 32 machine codes are concatenated into a single number, and the number of the starting instruction is included as a link between the element in the array and the actual instruction in the program.

For most instructions, however, the possible sequences starting with that instruction are at most 4 to 5 instructions long, since this is the average size of the basic blocks. In such a case, the remainder of the 32 instruction slots is filled with no-operations whose machine code representation is numerically the highest possible value.

2. The array of instruction sequences is sorted. As a result of choosing the highest possible numerical value for the padding, the longest sequences will precede the shorter ones in the array.

3. A single pass over the array suffices to detect which partial basic blocks are identical and can be factored. Such partially matched blocks are converted into real basic blocks by splitting the basic block containing the instruction sequence.

4. The actual factoring is performed by applying the basic block factoring algorithm again.
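The collect-sort-scan idea above can be sketched as follows. This is a toy Python version: real Squeeze concatenates 32-bit machine words and respects basic block boundaries, while this sketch uses strings as stand-in instructions and a sentinel that sorts highest in place of the all-ones no-op padding.

```python
def find_repeated_sequences(instructions, max_len=32):
    """Collect every instruction sequence of up to max_len instructions,
    pad short ones with a sentinel that sorts highest, sort, and scan
    neighbouring entries for common prefixes.  Because the padding sorts
    highest, longer sequences precede their own prefixes in the array."""
    PAD = '\uffff'                       # sorts after any real instruction
    seqs = sorted(
        (tuple(instructions[s:s + max_len])
         + (PAD,) * max(0, max_len - (len(instructions) - s)), s)
        for s in range(len(instructions)))
    matches = []
    for (k1, s1), (k2, s2) in zip(seqs, seqs[1:]):
        n = 0
        while n < max_len and k1[n] == k2[n] != PAD:
            n += 1
        if n >= 2:                       # a one-instruction match never pays
            matches.append((n, s1, s2))
    matches.sort(reverse=True)           # greedy: longest matches first
    return matches
```

Sorting reduces the all-pairs comparison to a comparison of neighbours in the sorted array, which is what makes a single pass in step 3 sufficient.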

Note that this method is not optimal, since splitting a basic block during the walk over the array might prohibit other, possibly more beneficial splits. The largest possible identical sequences are greedily chosen first. We found this to be a good compromise between computational efficiency and code size reduction.

The upper bound of 32 instructions for partial basic block instruction sequences was chosen as a compromise between memory footprint, computational complexity and code size reduction performance. The number of basic blocks containing more than 32 instructions is very limited.

4.5 Data Factoring

If low-level code can be reused by factoring it out, why wouldn't data be a possible candidate for factoring as well? Multiple instances of a template class will not only result in multiple instances of the code fragments, but also in multiple instances of the statically allocated data. At least for the read-only part of that data, only one copy is needed.

One approach is comparable to factoring procedures: from several identical data blocks, one is chosen and all pointers to the other blocks are transformed into pointers to the selected block.

We implemented this program transformation, and it turned out to be almost completely useless: the data blocks that could be removed after this transformation are most of the time unreachable in the first place, or become unreachable after evaluating load instructions during constant propagation. The granularity of the data blocks is not fine enough: it rarely happens that whole blocks are identical. This does not come as a surprise: while object files contain several procedures that are easily recognized as separate entities and therefore as separate factoring candidates, this is not the case for data blocks, as we have not yet found a conservative method to divide the blocks into smaller entities. Since data blocks can only be removed if the whole block becomes unreachable, data factoring of whole data blocks is infeasible.

There is a finer granularity in some cases, though, as with instructions loading data from constant addresses in the read-only data sections. If there are several locations where the same data is stored, we can transform all such loads to load from the same location. Unfortunately, we have found that this again does not allow data to be removed from the program.

There are some other consequences, however: consider a template class with multiple different instances. It often happens that perfectly functionally equivalent code sequences cannot be factored out. The reason is that they load data from different instances of the statically allocated data, located at different addresses. If constant propagation is able to determine that the same constant is loaded, albeit from a different location, the instructions loading the data can perfectly well be transformed into identical instructions loading from the same address.

Detecting whether or not load instructions should be transformed is done independently of the actual factoring, as it is very hard to combine the heuristics for factoring with those for transforming load instructions. We have not found one case, however, where transforming a load instruction had negative effects. In general, code factoring profits significantly from this transformation.
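The load canonicalization can be sketched as follows. The function name and the dict-based representation are hypothetical; in reality the rewrite is applied to load instructions whose target addresses constant propagation has resolved, but the core idea of redirecting every load of the same constant to one canonical address is the same.

```python
def canonicalize_constant_loads(load_addresses, rodata):
    """load_addresses: the address each load reads from.
    rodata: address -> constant value, as resolved by constant
    propagation.  Rewrite every load of the same value to use one
    canonical address, so that otherwise functionally equivalent code
    sequences become textually identical and thus factorable."""
    canonical = {}            # value -> first address at which it was seen
    rewritten = []
    for addr in load_addresses:
        value = rodata[addr]
        rewritten.append(canonical.setdefault(value, addr))
    return rewritten
```

After this rewrite, two template instantiations that load the same constant from different instances of their static data load it from the same address, allowing the basic block factoring described earlier to merge them.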

4.6 Summary

Factoring multiple instances of identical code fragments can be done at various levels of granularity: procedures, code regions with unique entry and exit points, basic blocks, and partially matched basic blocks.

Looking at the ratio between computational cost and code size reduction, certain trade-offs have to be made. Register renaming to create identical code fragments is applied at the basic block granularity level only. Rescheduling (i.e., moving instructions upwards or downwards to create identical code fragments) is applied for callee-saved register stores and restores only. Transforming load instructions that load identical data to have them load that data from a unique address is performed without actually measuring its influence on code size reduction.

It is obvious that the different factoring levels influence each other: without procedure factoring, all the blocks or regions in procedures will still be factored out if basic block and region factoring is applied. The introduced overhead will then be larger, however, since a procedure is replaced by a number of calls to factored procedures instead of being replaced by an identical procedure.

5. EXPERIMENTAL EVALUATION

The algorithms described in this paper are implemented in Squeeze, a binary rewriting tool aiming at code compaction for the Alpha architecture. The Alpha architecture was chosen because its clean nature eases the implementation of algorithms, in particular of the Squeeze back-end. We firmly believe that the algorithms implemented in Squeeze are generally applicable and not architecture specific. The only exception is the optimization of address calculations; parts of it are of lesser importance for architectures whose instructions are wide enough to include addresses.

We evaluated the code compaction performance of Squeeze as a whole, and of some of its algorithms specifically, on a number of C++ benchmarks. Some properties of those programs are summarized in Table 2. Libraries other than the standard C++ libraries with which the applications are built are indicated.

The compiler we used to generate the binaries is Compaq's C++ V6.3-002. All binaries were compiled with the -O2 flag, resulting in base binaries that are optimized for space and time. For linking, we used the vendor-supplied linker with flags to produce statically linked executables containing symbol and relocation information, and to dump a map indicating where the blocks of the object files are located in the final binary. It is this map that we use to divide the data section into blocks.

Overall code size reduction results are given in Table 3. The compacted binaries are on average 54% smaller than the original statically linked ones. The code size reductions vary from 43 to 64%. Table 4 shows the contributions of factoring and of the combined unreachable code and data elimination. While an average 12% of the code size reductions comes from factoring (at various granularity levels), about 6% comes from unreachable data elimination combined with unreachable code elimination.

The average code size reduction (54%) almost doubles previously reported results (30%) [7]. Partially this is due to enhancements to Squeeze, such as those described in this paper. On the C programs from the SPECint2000 benchmark suite, we obtained an average code size reduction of 35% with the latest version of Squeeze.

The performance gap between compacting C++ and C programs has two reasons: on the one hand, C++ programs exhibit more high-level code reuse that leads to low-level factorable code; on the other hand, the nature of the libraries used by C++ application developers leads to more useless code being linked with the application.

It is no surprise that code factoring contributes more to the overall code size reduction for programs that use more than just the standard C++ libraries. GUI libraries are used in addressbook and blackbox, while gtl is a test program manipulating graphs, used to evaluate whether or not the GTL library was built correctly. It is precisely these kinds of C++ libraries that make extensive use of class hierarchies and templates. Part of the larger contribution of factoring is also due to the fact that these are the largest programs: the more code, the higher the chances that identical pieces of code are present or can be generated.

Table 5 shows the execution times of different versions of some binaries. These binaries were selected because they are the only ones for which inputs are available that lead to significant timing results. The other programs are either GUI-interactive, or the only input files at our disposal resulted in execution times of less than a second. The experiments were run on a 667 MHz Compaq Alpha 21264 EV67 processor with a split primary cache (64 KB each of instruction and data cache), 8 MB of off-chip secondary cache, and 1.5 GB of main memory, running Tru64 Unix 5.1.

An average slowdown of 11% is observed. As factoring is the only program transformation that inherently introduces overhead into the program, we have also measured the execution times of compacted code to which no factoring was applied. On average, these binaries show a slowdown of 2%. We believe this is due to Squeeze's back-end: the quality of our code scheduler is definitely not comparable to that of the vendor-supplied compiler. Still, we can conclude that the slowdown due to code compaction in general, and code factoring in particular, is relatively limited.

What about dynamically linked applications? It is obvious that dynamically linked library code cannot be specialized for one application; if it could, using a dynamic library would be useless. Besides this, one would tend to believe that dynamically linked programs are always smaller than statically linked programs. This need not be the case. As a proof of concept, we have also dynamically linked our evaluation programs. It is obvious that the ratio of functionality in application programmer code to that in library code has a major influence on the ratio between the sizes of dynamically and statically linked programs.

For addressbook, which involves almost no application functionality, the dynamically linked binary is an order of magnitude smaller than the statically linked one. Code compaction does not change this.

For other programs, such as 252.eon and gtl1, things are completely different. For 252.eon, the stripped dynamically linked binary is 560 Kbytes large. The statically linked binary is 976 Kbytes large. The code compacted version of the statically linked binary, however, is only 448 Kbytes large, i.e. 20% smaller than the dynamically linked one! For gtl, the dynamically linked binary is 944 Kbytes large. This compares to 1272 Kbytes for the statically linked binary and 568

1 When we dynamically linked gtl, the vendor-supplied libraries were dynamically linked. The GTL library was still statically linked with the application, as it is not the type of general library that would be used by multiple applications.

Program      Description                                            Non-standard     Nr. of
                                                                    libraries used   instructions

richards     Simple operating system simulator                      -                  58048
deltablue    Incremental dataflow constraint solver                 -                  73632
lcom         "L" hardware description language compiler             -                 139772
blackbox     Fully functional window manager                        -                 324444
gtl          Test program of the Graph Template Library             GTL               200912
252.eon      Probabilistic ray tracer (from the SPECint2000 suite)  -                 186004
addressbook  Simple addressbook application                         Qt               1005716

Table 2: Description of the programs on which our code compaction was evaluated.

Program      Base      Squeezed   Reduction
richards       58048     25184    57%
deltablue      73632     36032    51%
lcom          139772     62160    56%
blackbox      324444    184752    43%
gtl           200912     72176    64%
252.eon       186004     75392    59%
addressbook  1005716    517392    49%
average                           54%

Table 3: Numbers of instructions in the different binaries: the base binaries are the ones compiled with -O2; the squeezed binaries are the ones to which all our code compaction algorithms are applied by Squeeze. The rightmost column shows the code size reduction ratios.

Program      Total     Factor   CC&DE   Factor/Total   CC&DE/Total
richards      32864      1952     2976      6%             9%
deltablue     37600      2656     3008      7%             8%
lcom          77612      5824     4464      8%             6%
blackbox     139692     20704     3632     15%             3%
gtl          128736     26816    10416     21%             8%
252.eon      110612     13872     7952     13%             7%
addressbook  488324     76960    22144     16%             4%
average                                    12%             6%

Table 4: Absolute numbers of instructions eliminated from the binaries. The numbers given are (1) the total number of eliminated instructions, (2) the number of instructions that are not eliminated when code factoring is not applied, and (3) the number of instructions that are not eliminated when simple unreachable code elimination is applied instead of the combined unreachable code and data elimination (CC&DE).

Program      Base   Squeezed   NoFactor   Squeezed/Base   NoFactor/Base
richards       87      106         91         122%            105%
deltablue      77       82         80         106%            104%
252.eon       294      309        283         105%             96%
average                                       111%            102%

Table 5: Execution times in seconds for the base binaries, the fully compacted binaries, and the binaries compacted without applying code factoring.

Kbytes for the squeezed version thereof. Here the squeezed version is 40% smaller than the dynamically linked version. One of the major reasons is that the dynamically linked programs consist for a large part of a dynamic string and symbol table.

6. RELATED WORK

There is a considerable body of work on code compression, but much of it focuses on compressing executable files as much as possible in order to reduce storage or transmission costs [8, 10, 11, 12, 14, 15, 17]. These approaches generally produce compressed representations that are smaller than those obtained using our approach, but have the drawback that they must either be decompressed to their original size before they can be executed [8, 10, 11, 12] (which can be problematic for limited-memory devices) or require special hardware support for executing the compressed code directly [14, 15]. By contrast, programs compacted using our techniques can be executed directly without any decompression or special hardware support.

Most of the previous work on code compaction yielding smaller executables treats an executable program as a simple linear sequence of instructions [3, 5, 13]. These approaches use suffix trees to identify repeated instruction sequences in the program and abstract them out into functions. The size reductions they report are modest, averaging about 4-7%. Clausen et al. [4] applied minor modifications to the Java Virtual Machine to allow it to decode macros that combine several bytecode instructions. They report code size reductions of 15% on average. Our techniques do not rely on changing the underlying architecture on which a program is executed and are not language dependent.

We have recently shown that an alternative approach, using the conventional control flow graph representation of a program and based by and large on aggressive interprocedural compiler optimizations aimed at eliminating code, can achieve significant reductions in code size, averaging around 30% [7]. However, that work does not take into account the removal of dead data, and the synergistic effect this has on the removal of unnecessary code. The work reported in this paper yields code size reductions that are about 5% higher than those reported in our earlier work [7] for the same C benchmark programs, this improvement coming from the removal of dead data and other improvements to Squeeze. The recent results for C++ programs are much higher, averaging around 54%.

The elimination of unused data from a program has been considered by Srivastava and Wall [19] and by Sweeney and Tip [20]. Srivastava and Wall, describing a link-time optimization technique for improving the code for subroutine calls in Alpha executables, observe that the optimization allows the elimination of most of the global address table entries in the executables. However, their focus is primarily on improving execution speed, and they do not investigate the elimination of data areas other than the global address table. The work of Sweeney and Tip is restricted to eliminating dead data members in C++ programs, and so is not applicable to non-object-oriented programs; by contrast, our approach, which works on executable programs, can be applied to programs written in any language. Neither of these works addresses the close relationship between the elimination of data and the elimination of code. Sweeney reports a run-time memory footprint reduction of 4.4% on average.

For object-oriented programming languages, several techniques have been proposed for application extraction, where only the necessary parts of libraries and/or run-time environments are linked with the programmer's code. For Self [1], a dynamically typed language, such systems obtain higher compaction levels than our system. They are, however, programming-language specific and start from programs containing the whole run-time environment of Self applications. For Java [21], results similar to ours are achieved, but again the techniques used are language specific.

MLD [9] and Vortex [6] are two whole-program optimizers for object-oriented languages. They focus on reducing the overhead created by virtual method invocation. By looking at the whole class hierarchy of a program, some of the virtual method invocations can be replaced by direct ones. These systems also reduce the performance penalty due to polymorphism by using profile information to optimize the method calls for the most frequently occurring object types.

7. CONCLUSIONS

Generally applicable program compaction, applied at link time, is able to achieve significant code size reductions for applications developed in object-oriented programming frameworks.

The main opportunities for achieving these results come from the use of reusable code: libraries developed with reuse and general applicability in mind at the higher level, and programming constructs such as templates and inheritance at the lower level.

Our prototype link-time code compactor, named Squeeze, achieved average code size reductions of 54% on a set of 7 C++ applications, ranging from 43 to 64%. This involves an execution slowdown of about 11% on average.

8. REFERENCES

[1] O. Agesen and D. Ungar. Sifting out the gold: Delivering compact applications from an exploratory object-oriented environment. In Proc. OOPSLA, pages 355-370, Oct. 1994.

[2] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

[3] B. S. Baker and U. Manber. Deducing similarities in Java sources from bytecodes. In Proc. USENIX Annual Technical Conference, pages 179-190, Berkeley, CA, June 1998. Usenix.

[4] L. Clausen, U. Schultz, C. Consel, and G. Muller. Java bytecode compression for low-end embedded systems. ACM TOPLAS, 22(3):471-489, May 2000.

[5] K. Cooper and N. McIntosh. Enhanced code compression for embedded RISC processors. In Proc. PLDI, pages 139-149, May 1999.

[6] J. Dean, G. DeFouw, D. Grove, V. Litvinov, and C. Chambers. Vortex: an optimizing compiler for object-oriented languages. In Proc. OOPSLA '96, pages 83-100, San Jose, California, October 1996.

[7] S. Debray, W. Evans, R. Muth, and B. De Sutter. Compiler techniques for code compaction. ACM TOPLAS, 22(2):378-415, March 2000.

[8] J. Ernst, W. Evans, C. Fraser, S. Lucco, and T. Proebsting. Code compression. In Proc. PLDI, pages 358-365, June 1997.

[9] M. Fernandez. A Retargetable, Optimizing Linker. PhD thesis, Princeton University, January 1996.

[10] M. Franz. Adaptive compression of syntax trees and iterative dynamic code optimization: Two basic technologies for mobile-object systems. In J. Vitek and C. Tschudin, editors, Mobile Object Systems: Towards the Programmable Internet, number 1222 in LNCS, pages 263-276. Springer, Feb. 1997.

[11] M. Franz and T. Kistler. Slim binaries. Commun. ACM, 40(12):87-94, Dec. 1997.

[12] C. Fraser. Automatic inference of models for statistical code compression. In Proc. PLDI, pages 242-246, May 1999.

[13] C. Fraser, E. Myers, and A. Wendt. Analyzing and compressing assembly code. In Proc. ACM SIGPLAN Symposium on Compiler Construction, volume 19, pages 117-121, June 1984.

[14] T. M. Kemp, R. M. Montoye, J. D. Harper, J. D. Palmer, and D. J. Auerbach. A decompression core for PowerPC. IBM J. Research and Development, 42(6), November 1998.

[15] K. D. Kissell. MIPS16: High-density MIPS for the embedded market. In Proc. Real Time Systems '97 (RTS97), 1997.

[16] R. Muth, S. Debray, S. Watterson, and K. De Bosschere. alto: A link-time optimizer for the Compaq Alpha. Software Practice and Experience, 31(1):67-101, January 2001.

[17] W. Pugh. Compressing Java class files. In Proc. PLDI, pages 247-258, May 1999.

[18] A. Srivastava. Unreachable procedures in object-oriented programming. ACM Letters on Programming Languages and Systems, 1(4):355-364, December 1992.

[19] A. Srivastava and D. Wall. Link-time optimization of address calculation on a 64-bit architecture. In Proc. PLDI, pages 49-60, June 1994.

[20] P. Sweeney and F. Tip. A study of dead data members in C++ applications. In Proc. PLDI, pages 324-332, June 1998.

[21] F. Tip, C. Laffra, and P. Sweeney. Practical experience with an application extractor for Java. In Proc. OOPSLA, pages 292-305, Nov. 1999.