
Overview of the IBM Java Just-in-Time Compiler

by T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani

We present the design and implementation of several optimizations and techniques included in the latest IBM Java Just-in-Time (JIT) Compiler. We first discuss some of the modifications we have applied to Sun Microsystems' reference implementation of the Java Virtual Machine (JVM) Specification to increase the performance, including a change in the object layout. We then describe each of the optimizations, referring to what had to be taken into account because of both the just-in-time nature of the compiler and the requirements of the Java language specification, such as exception checking. We also present code generation techniques targeting Intel architectures, describing the register allocation schemes, exception handling, and code scheduling. Finally we report on the performance of the IBM JIT compiler, showing both the effectiveness of the individual optimizations and the competitive overall performance of the JIT compiler in comparison with a competitor, using industry-standard benchmarking programs. All the techniques presented here are included in the official product (JIT Compiler version 3.0), which has been integrated into the IBM Developer Kit for Windows, Java Technology Edition, Version 1.1.7.

The Java** language [1] has rapidly been gaining importance as a standard object-oriented programming language since its advent in late 1995. Java source programs are first converted into an architecture-neutral distribution format, called Java bytecode, and the bytecode sequences are then interpreted by a Java virtual machine (Jvm) [2] for each platform. Although its platform-neutrality, flexibility, and reusability are all advantages for a programming language, the execution by interpretation imposes an unacceptable performance penalty, mainly on account of the run-time overhead of the bytecode instruction fetch and decode. One means of improving the run-time performance is to use a just-in-time (JIT) compiler, which converts the given bytecode sequences "on the fly" into an equivalent sequence of the native code of the underlying machine. It significantly improves the performance, but the overall program execution time, in contrast to that of a conventional static compiler, now includes the compilation overhead of the JIT compiler. It is therefore very important for the JIT compiler to be fast and lightweight, as well as to generate high-quality native code.

There are several reasons for the poor execution performance of programs written in the Java language.


First of all, because of standard object-oriented programming practices, there tend to be many relatively small methods, which lead to more frequent method invocations than is the case in other languages. For example, there may be a method solely for accessing a private field variable. An object constructor method is also automatically created in Java even if it is not explicitly written in the program, and consequently there are many empty object constructor methods. Polymorphic method invocations [3] create another performance problem because of the overhead of indirectly calling through a dynamic method table lookup. This problem is aggravated by the reusability of object-oriented code, since programmers are encouraged to incorporate general class libraries or existing frameworks designed for general use into new programs to increase their productivity, and this practice has in general resulted in the creation of many polymorphic method invocation sites. Even if a class is actually not overridden in a particular program, the polymorphic method invocation overhead is still imposed, and this occurs in many cases. All these problems are inherent in object-oriented programming and can be an important factor in the overall performance.

Second, the Java language specification requires run-time exception checking to ensure the validity of accesses to objects and arrays. If a reference or an operation is invalid, an exception must be thrown, and the environment, such as local variables, has to be preserved and made available to an exception handler that catches the exception. This requirement imposes a large penalty on program execution and prevents the application of conventional loop optimization techniques. The type inclusion test is another source of run-time overhead for checking type conformance, which not only results from an explicit request by the programmer using instanceof operations, but is also implicitly required by a statement assigning an object to another type.

Third, synchronized methods and synchronized blocks, which are used to ensure the atomicity of a set of operations in a region, cause run-time overhead by locking a given object for the duration of the execution. Again, the reusability of object-oriented code makes this problem worse, since general-purpose class libraries are designed for use in a multithreaded environment, and synchronized methods or blocks are heavily used where thread safety must be guaranteed. When these libraries are used by single-threaded programs, there will be substantial performance degradation caused by unnecessary synchronization.

We address all these problems in the IBM JIT compiler by optimizing the original bytecode sequences and applying various code generation techniques, such as method inlining, loop versioning, fast type inclusion testing, and others, while keeping the original program semantics. Special care must be taken to ensure that any optimization allows any exceptions that would have been thrown in the original program to still be thrown in the optimized code. That is to say, the exceptions must be thrown in precisely the same order with respect to the rest of the program execution.

The next section of this paper describes the overall structure of the IBM JIT compiler and some of the modifications we applied to Sun Microsystems' reference implementation of the Java Virtual Machine (JVM**) Specification. The third section discusses each of the bytecode-level optimizations. The code generation technique, including the register allocation scheme, is described in detail in the fourth section. The fifth section presents some measurements obtained by using industry-standard benchmarking programs and shows both the effectiveness of each of the optimizations presented in this paper and the overall performance of the compiler in comparison with that of a competitor on platforms based on Intel processors. The sixth section summarizes related work, and the final section concludes the paper.

Overview

We first present an overview of the JIT compiler before some of the details are described.

JIT compiler structure. The overall flow diagram of the IBM JIT compiler is shown in Figure 1. The structure is basically similar to that used in typical static compiler environments. First, in the flow analysis phase, the basic blocks and the loop structures are generated by performing a linear-time traversal of the given bytecode sequences. The given bytecode is then converted into an internal representation, called extended bytecode, with some new opcodes introduced to represent operations resulting from the optimizations. An example of the new opcodes, which is produced by common subexpression elimination, is one to obtain the array object interior pointer for a common effective address. It allows consecutive array elements to be accessed by simple load-and-store instructions without indexing each element [4]. The extended bytecode, in which stack semantics are still retained as in the original bytecode, is chosen as the internal representation in order to avoid conversion cost in the compilation process. The optimization phase then starts, and several techniques are applied to this internal representation: method inlining, exception check elimination, common subexpression elimination, and loop versioning, in that order (all described in detail in the next section). Optimizations based on dataflow analysis, such as constant propagation and dead code elimination, are also applied to the internal representation. The variables in stack computation are then separately mapped to each of the logical registers for integer and floating-point calculation by traversing the bytecode stack operations. The regions for register allocation of local variables are also defined, and the number of accesses to local variables in each region is counted.

Finally, native code is generated on the basis of the optimized sequences of extended bytecode by allocating physical registers for each stack and local variable. To reduce the compilation time, there is no independent single pass for the register allocation; that is, the register allocation runs synchronously with the code generation. Simple code scheduling within a basic block is applied to reorder the instructions so that they best fit the requirements of the underlying machine. The JIT compiler has a potential advantage over traditional compilation techniques in that it can identify the type of machine it is running on, and we make use of this information in both code generation and code scheduling.

Base Jvm modifications. Among the number of enhancements included in the IBM Jvm that have been applied to Sun's reference implementation, two major changes were introduced to improve the overall performance of the JIT compiler: a change in the object layout and a change in the execution of the static initializer.

Figure 2 shows the change in the object layout for both ordinary objects and array objects. An object was originally pointed to through a handle, causing an extra level of indirection every time object fields and array elements are accessed. In the new object layout, handles are removed, and each object has a two-word header instead of a handle. With this change, we can directly access instance fields simply by adding an extra offset to the object pointer. Furthermore, the element size now occupies the first word of the array object header, which allows the JIT compiler to generate array index out-of-bounds checking code by issuing just one instruction, whereas several instructions, including load and shift operations, were required with the original object layout. This is a great advantage in terms of code generation efficiency, since the array bound exception checking has to be done every time an array element is accessed, unless it has been proved to be safe. The space overhead per object is unchanged by this modification, two words for a header and one word for heap maintenance, and the space reserved for the handle array is now totally unnecessary.

The other important change we have made to the reference implementation of the JVM specification is to separate the resolution of a class from the execution of its static initializer. In the original Jvm implementation, whenever a class is resolved, an attempt is always made to run its static initializer. This behavior is fine for interpreter execution, because the attempt to resolve a class is always made at run time, which is exactly the time at which the static initializer should be run. But from the standpoint of the JIT compiler, a class needs to be resolved at compile time as much as possible in order to obtain the addresses of methods and object fields and thus to generate better code. Otherwise, code has to be generated for calling the run-time class resolution routine, and dynamic code patching is required at run time. Since running a static initializer entails side effects, it cannot be run speculatively at compile time. By separating class resolution from the execution of the static initialization, the JIT compiler has more opportunity to generate faster code, using run-time calls if necessary to run the static initializer. Thus, the use of class resolution without static initialization is a very important modification for high-performance JIT compilers.

Other enhancements in the IBM version of the Jvm include object allocation, monitor optimization [5], and class libraries. The IBM JIT compiler also takes advantage of these enhancements. The monitor optimization dramatically improves performance for the majority of object synchronization operations, namely locking an unlocked object or locking an object already locked by the current thread a small number of times (shallow nested locking), by performing the operation with just a few machine instructions. This optimization takes advantage of the new object layout just described and uses a portion of the second word of the object header for encoding various information, such as the thread identifier, the nested lock count, and the inflation bit. Inflated locks support the general case of heavyweight locks, where contention can occur between multiple threads.
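
A minimal sketch of the fast path, assuming an invented lock-word layout (thread identifier in the upper bits, nested count in the lower bits) and ignoring inflation, might look as follows; the real product encodes this in the second header word and a few machine instructions:

    import java.util.concurrent.atomic.AtomicInteger;

    final class ThinLock {
        private static final int COUNT_BITS = 8;              // invented layout
        private static final int COUNT_MASK = (1 << COUNT_BITS) - 1;
        private final AtomicInteger lockWord = new AtomicInteger(0);

        boolean tryLock(int threadId) {
            int owner = threadId << COUNT_BITS;
            if (lockWord.compareAndSet(0, owner)) {
                return true;                                   // unlocked -> locked
            }
            int w = lockWord.get();
            if ((w & ~COUNT_MASK) == owner && (w & COUNT_MASK) < COUNT_MASK) {
                lockWord.set(w + 1);                           // shallow nested lock
                return true;
            }
            return false;    // contended: would inflate to a heavyweight lock
        }

        void unlock(int threadId) {                            // caller must own the lock
            int w = lockWord.get();
            lockWord.set((w & COUNT_MASK) == 0 ? 0 : w - 1);
        }
    }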

Selective compilation. Since JIT compilation occupies a part of the application run time, it is not necessarily beneficial to compile all the methods being invoked. For example, when a method is executed only once and does not contain any loops, the overall performance might be degraded if it is JIT-compiled. The cost of JIT compilation needs to be offset by the performance gain achieved by running the native code, in terms of both time and space. The IBM JIT compiler and Jvm allow efficient mixed-mode execution of interpreted and JIT-generated code by sharing the execution stack and exception-handling mechanism. By adopting an appropriate way of identifying and choosing "hot" methods that deserve JIT compilation, we expect to achieve high performance in running real applications as well as benchmarking programs.

The notion of mixed execution of interpretation and compiled code was considered as a continuous compiler or smart JIT approach by Plezbert and Cytron [6]. In deciding when a given compilation unit should be translated to native code, they propose a model for making the choices on the fly based on some experiments. Our current strategy for selecting hot methods to be compiled is quite simple. A method invocation count is provided for each method and initialized to a certain threshold value. Whenever the method is executed by the interpreter, the invocation count is decremented. When the count reaches zero, it is determined that the method has been invoked frequently enough, and JIT compilation starts for the method. If the method includes a loop, however, it is considered to be very JIT-sensitive and is handled in a different way. When the interpreter detects a loop backedge, it "snoops" the loop iteration count on the basis of a simple bytecode pattern-matching sequence, such as iload/sipush/if_icmplt, and then adjusts the amount by which the invocation count is decremented, depending on the iteration count. In the extreme case, where the iteration count is large enough, control is immediately transferred to the JIT compiler from the partially interpreted code by dynamically changing the frame structure for JIT use and jumping to specially generated padding code. The JIT compilation for such methods can be done without sacrificing any optimization features.
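
In outline, the countdown mechanism can be sketched as follows; the threshold value and method names here are invented for illustration:

    final class MethodEntry {
        private static final int THRESHOLD = 2000;     // invented value
        private int countdown = THRESHOLD;
        private boolean compiled;

        // Called by the interpreter on method entry. snoopedIterations is the
        // loop trip count guessed from the bytecode pattern, or 1 if no loop
        // backedge has been seen for this method.
        void onInterpretedInvocation(int snoopedIterations) {
            countdown -= Math.max(1, snoopedIterations);
            if (!compiled && countdown <= 0) {
                compiled = true;
                // hand the method to the JIT compiler here; in the extreme
                // case the transfer happens mid-execution via padding code
            }
        }
    }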

The run-time trace information is another benefit of mixed-mode execution for the JIT compiler. For any conditional branch encountered for the first time, the interpreter records whether the branch is taken or not taken, to provide the JIT compiler with a guide for what direction to take at basic block boundaries. The branch instruction is then converted to the corresponding quick instruction so that this operation, including the detection of the loop backedge, is not executed a second time, minimizing the performance penalty. The trace information is used by the JIT compiler for ordering basic blocks so that the code can be generated in a straight-line manner between basic blocks, guided by the taken branch. This information also makes a difference in how a program is cut into tiles, within which registers are allocated to variables, as described in detail later.

Optimizations on extended bytecode

In this section, we describe each of the optimizations for transforming the extended bytecode, namely, the internal representation. All the optimizations applied at this level are transparent to the final code-generation phase.

Method inlining. As explained earlier, the method invocation overhead is one of the major reasons for the poor performance of Java. We deal with this problem by method inlining (which replaces calls to methods by copies of their bodies). At the bytecode level, the interpreter in Sun's Java Development Kit (JDK**) reference implementation does inline some simple methods, if the bytecode they contain fits into the space for the method invocation, or converts calls to empty constructor methods into the invokeignored_quick instruction. The JIT compiler can apply the technique in a more aggressive way. However, excessive application of this optimization causes an unacceptable level of code expansion, which can itself degrade performance because of instruction cache inefficiency. Our basic strategy is to apply method inlining only to hot spots, to reduce the invocation overhead while avoiding explosive growth of the code size. By a hot spot, we mean a method invocation within a loop or an indirect method invocation from a loop.

In the case of static and nonvirtual method invocations, the problem is typically caused by empty methods, many of which come from object constructors, and by small access methods, where the invocation and frame allocation costs outweigh the execution time of the method body. Since inlining these methods does not cause a code size problem, they are always inlined regardless of the hot spot. In other cases, several conditions are set as inlining criteria on the basis of (1) the nest level of invocations in the inlining analysis, (2) the total number of bytecode instructions to be inlined, (3) the number of local variables and stack variables added in the inlining method, (4) the existence of exception tables in the inlined method, and (5) no native or virtual methods found in the call tree in the inlining analysis. We also optimize recursive method invocations. A method call is replaced by the body of the method itself, reducing the number of method calls and the frame allocation cost. Tail recursion elimination, a special case of tail call optimization, converts the method invocation into a branch to the beginning of the method, eliminating the method call overhead. An example is shown in Figure 3A.
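
At the source level, tail recursion elimination corresponds to the following transformation (our own example, not the paper's Figure 3A):

    // Before: the last action is the recursive call.
    static int sum(int n, int acc) {
        if (n == 0) return acc;
        return sum(n - 1, acc + n);       // tail call
    }

    // After: the call becomes a branch back to the top of the method,
    // so no new frame is allocated per iteration.
    static int sumIterative(int n, int acc) {
        while (true) {
            if (n == 0) return acc;
            acc += n;
            n -= 1;
        }
    }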

For virtual method invocations, we take advantage of the fact that the target class is frequently not overridden and that some virtual call sites actually execute only one method; that is, they are monomorphic rather than polymorphic. We inline the target virtual method by versioning a fast path of the inlined code and a slow path of the original virtual method call, in order to decrease the indirect call overhead of the method table lookup, as illustrated in Figure 3B. Run-time checking code is provided to guard the inlined code, to ensure that it is correct to execute it for the current receiver object. Namely, the fast-path inlined code is executed if the method in the receiver object's method table matches the inlined target method; otherwise, a normal virtual invocation is performed. If the target method is the only one for the virtual call site at compile time, and the class is not overridden during the course of execution, then the fast-path inlined code can always be executed. The guard code tests against the target method, rather than the target class, because of its efficiency: the method test can cover even those receiver objects whose classes are extensions of the target class, as long as the inlined method is not overridden by those classes. A detailed discussion and evaluation of method tests versus class tests for guarding the inlining of virtual invocations is provided by Detlefs and Agesen [7].
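
The shape of the guarded call site can be sketched in Java as below. The compiler compares method-table entries in native code; here reflection stands in for that comparison purely for illustration, and all class and method names are ours:

    import java.lang.reflect.Method;

    class Base {
        public int f() { return 1; }
    }
    class NonOverriding extends Base { }          // inherits Base.f()

    final class GuardedCallSite {
        static int call(Base receiver) throws NoSuchMethodException {
            Method actual = receiver.getClass().getMethod("f");
            // Method test: also passes for subclasses that do not override
            // f(), which is why it is preferred over a class test.
            if (actual.getDeclaringClass() == Base.class) {
                return 1;                // fast path: inlined body of Base.f()
            }
            return receiver.f();         // slow path: normal virtual invocation
        }
    }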

We also optimize the interface method invocation, which is provided for classes with multiple inheritance. Again, we take advantage of the fact that the number of inherited classes is only one in many cases, and therefore we can skip searching the table for the implemented class. When the inlined fast path is executed, the method invocation cost for the interface class will be almost the same as that for virtual method invocations.

Exception check elimination. One of the performance problems in executing Java programs results from the requirement of run-time exception checking to ensure the safety of program execution. Exception checking, both null pointer checking and array index out-of-bounds checking, can be eliminated if it can be proved that an exception cannot occur, namely, that the exception has already been tested somewhere along the data flow, or that the array access can be guaranteed to be within its bounds. On Intel-based platforms, the null pointer exception can be detected by a hardware trap, as described in the later subsection on exception handling, for most of the instructions that require it; however, for those bytecode instructions where the object will not be touched in the generated code, such as invokenonvirtual and athrow, explicit null pointer checking code still has to be generated to check the current object dynamically. Thus the elimination of null pointer checking is beneficial for Intel-based platforms as well.

The JIT compiler can eliminate null pointer checking by solving a simple data flow problem. Whenever an object is null-pointer-checked, regardless of whether the checking is done explicitly or implicitly in the actual code generation, a flag indicating that it has already been tested is simply propagated along the data flow. For array bound checking, in contrast, we developed a new algorithm by extending Gupta's algorithm [8], and it eliminates a broader range of array access checks. Gupta's method basically propagates the checked index set forward and backward, using data flow analysis; however, when an index variable is updated, it is treated as KILL [9,10], and hence the checked index set for the variable has to be reset. In the algorithm we developed, the exact range of the checked index set can be determined, including the maximum and minimum constant offsets from the index variable, and this set of information is then propagated along the data flow. Even when the access index is updated by adding or subtracting a constant, it is not treated as KILL; instead the checked index set is updated by offsetting the constant value. Consequently, the new algorithm can eliminate more exception checks for array accesses, especially those with a constant index value. The example in Figure 4 shows one advantage of the algorithm, where each array index exception check required is explicitly indicated by an italicized statement [11]. The number of exception checks, which was originally 11 (the number of array accesses), was reduced to six by the existing method and is further reduced to only three by the new method. The detailed algorithm and experimental results will be reported in a future publication.
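
As a small illustration (not the paper's Figure 4), consider three accesses with constant offsets from the same index; tracking the offsets lets one widened check cover all of them, whereas treating every index update as KILL would not:

    static int smooth(int[] a, int i) {
        // A single widened test, i - 1 >= 0 && i + 1 < a.length, proves all
        // three accesses below in-bounds, so the per-access checks can be
        // dropped. An update such as i += 2 would merely shift the checked
        // index set by 2 rather than discarding it.
        return a[i - 1] + a[i] + a[i + 1];
    }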

Common subexpression elimination. We apply three techniques for common subexpression elimination (CSE) to reduce the overhead of array and instance variable accesses: scalar replacement, common effective address generation, and partial redundancy elimination. Other techniques, such as global value numbering [12,13], which heavily rely on the static single assignment (SSA) form, are considered too expensive in our just-in-time compilation environment.

Scalar replacement replaces subscripted variables by local variable references and makes them available for register allocation when the array object and the index variable are not updated in the scope in which it is being applied. Common effective address generation produces an instruction pointing to an internal element of the array for consecutive array references, a pattern that typically appears in many sorting programs. This method is quite effective for reducing register pressure, but the array object pointer needs to be retained so that the object is not subject to garbage collection. In both cases, the code can be moved out of the loop if it is loop-invariant. Since a dependence DAG (directed acyclic graph) has to be constructed in order to apply both of these techniques, and this is quite expensive, we limit their application to loops, in view of the costs and benefits.
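
Expressed at the source level, scalar replacement performs a transformation like the following (an illustrative example of ours):

    // Before: three subscripted accesses to the same element.
    static void scaleBefore(int[] a, int i) {
        a[i] = a[i] * 3 + a[i];
    }

    // After: the element is loaded once into a local, which can then be
    // allocated to a register; legal because neither a nor i changes here.
    static void scaleAfter(int[] a, int i) {
        int t = a[i];
        a[i] = t * 3 + t;
    }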

Partial redundancy elimination [14] is also used to reduce the number of accesses to instance variables. It eliminates all but one identical access on some execution path through a flowgraph and also moves invariant accesses out of loops. The instance variable accesses are then mapped to local variables, which can be allocated to physical registers in the code generation phase. An example of how redundancy elimination is applied is given later.

Loop versioning. Loop versioning is a technique for hoisting individual array index exception checks outside a loop by providing two copies of the loop: a safe loop and an unsafe (original) loop. Exception checking code is first created at the entry of the loop by examining the whole range of the index against the bounds of the arrays accessed within the loop. Two versions of the target loop are then generated: the safe loop, which no longer requires exception checks within the loop, and the original unsafe loop, with all exception checks retained, as shown in Figure 5. Depending on the result of the index range test at the time of entry, either the safe loop or the unsafe loop is selected for execution at run time. For this transformation to be applied, the order of any exceptions that might be raised during the loop execution and any side effects that might occur at the time of an exception must be guaranteed to be unchanged. The current criteria for applying this optimization are as follows: (1) no method invocation exists within the loop; (2) there is a loop induction variable whose initial value and final value are both loop-invariant and whose stride is constant; and (3) for all the array accesses within the loop, the array object must be a local or a class variable that is loop-invariant, and the array indices must be constants, loop-invariant variables, or loop induction variables with constant offsets. Application of this optimization is also limited by the size of the target loop body, in view of the increase in code size.
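
In source terms, the transformation produces something like the following (our own sketch; the compiler of course performs it on extended bytecode):

    static void fill(int[] a, int n) {
        if (n <= a.length) {              // entry test covering indices 0..n-1
            for (int i = 0; i < n; i++) {
                a[i] = i;                 // safe loop: generated code omits
            }                             // the per-access bounds check
        } else {
            for (int i = 0; i < n; i++) {
                a[i] = i;                 // unsafe loop: all checks retained,
            }                             // so the exception order is preserved
        }
    }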


The safe loop created by this optimization not only eliminates the exception checking code itself but also imposes fewer constraints on register allocation and code scheduling, thus producing better code. The safe loop will also allow the JIT compiler to apply other loop optimizations in the future, because array exception checking and the array-of-arrays implementation of multidimensional arrays in Java are considered the two major factors preventing the application of traditional loop optimizations and thus slowing down the execution of loops. Therefore, this technique, together with the use of rectangular multidimensional dense arrays, can pave the way for more aggressive loop optimizations, such as loop unrolling, loop interchange, and loop blocking, which should be effective for programs with numerically computation-intensive loops [15].

Code generation details

Prior to the code generation phase, the given program is divided into several pieces, called tiles, each of which is used as a block for allocating registers. The tile cutting is done on the basis of the loop structure existing in the program: a loop is identified as a tile, and code between loops is treated as another tile. This heuristic is simple and appropriate for slicing a program in the absence of information on the relative strength of edges between basic blocks. Information on local variable usage is then collected within each tile, and a local variable table sorted in decreasing order of access count is generated. For a nested loop, a local variable table is prepared for each level of the loop nest, so that the local variables with the highest access counts in each nest level can be cached in registers.
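
A minimal sketch of this per-tile bookkeeping follows; the class and method names are ours:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Each tile counts accesses per local-variable slot, then produces the
    // table sorted in decreasing order of access count.
    final class Tile {
        private final int[] accessCount;

        Tile(int numLocals) { accessCount = new int[numLocals]; }

        void recordAccess(int localSlot) { accessCount[localSlot]++; }

        List<Integer> sortedLocalTable() {
            List<Integer> slots = new ArrayList<>();
            for (int i = 0; i < accessCount.length; i++) slots.add(i);
            slots.sort(Comparator.comparingInt(s -> -accessCount[s]));
            return slots;   // most frequently accessed locals first
        }
    }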

In the code generation, the several modes of addressing register operands, memory operands, and immediate operands provided by the IA-32 architecture [16] are exploited. Different machine instructions are selected for some bytecode instructions, depending on the underlying processor type. In general, memory operand instructions are preferentially used with the Pentium Pro** family of processors to exploit its capability of out-of-order instruction execution and hardware register renaming. Instruction cache efficiency is also taken into account by generating slow and rare paths at the bottom of the code and compacting the fast execution path as much as possible.

Register allocation. As explained earlier, no independent single pass is used for global register allocation, in order to reduce the compilation time. We consider expensive register allocation algorithms, such as graph coloring [17], to be inappropriate, owing to the just-in-time nature of the JIT compiler. Instead, we use a simple and fast algorithm for allocating registers that does not require an extra phase in the compilation process.

The register allocator, better called the register manager, allocates registers for stack variables as a first priority and then allocates the remaining available registers to local variables based on the usage count within each tile. During the course of code generation, other local variables may also be left in registers by assigning stack computation results to them. Thus we have three types of registers: stack variable registers, permanent cached local variable registers, and temporary cached local variable registers. In deciding which of several available registers to allocate, the register manager adopts a circular allocation policy, returning the least recently allocated register. This policy tends to make code sequences less dependent on the surrounding code with respect to register usage and, hence, creates more opportunities for code scheduling. When the code generator requires an additional register and none is available, the register manager searches for one, first among the temporary local variable registers, and then among the permanent local variable registers and stack variable registers, in that order. This simple heuristic is based on the assumption that stack variables are generally short-lived and therefore most likely have the shortest interval to their next references. It is used to avoid the expensive computation needed to choose the optimal spill candidate.
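
The circular policy amounts to treating the free registers as a queue, as in this simplified sketch (register names and the spill search are omitted or invented):

    import java.util.ArrayDeque;

    final class RegisterManager {
        private final ArrayDeque<String> free = new ArrayDeque<>();

        RegisterManager(String... regs) {
            for (String r : regs) free.addLast(r);
        }

        // Returns the least recently allocated/freed register, so nearby code
        // sequences tend to use different registers, which keeps false
        // dependences low and helps the code scheduler.
        String allocate() {
            return free.pollFirst();   // null: search for a spill candidate,
        }                              // temporaries first, as described above

        void release(String reg) {
            free.addLast(reg);         // goes to the back of the circular list
        }
    }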

There are some conventions regarding register usage for method call arguments and return value passing. In order to prevent unnecessary copy instructions from being generated, designated registers can be allocated for those instructions that set method parameters or return values. Liveness information on local variables, obtained by data flow analysis, is also used to avoid generating unnecessary spill code. Whenever a reference to a cached local variable is known to be the last use, the register is invalidated immediately after generation of the relevant code and is given back to the register manager to be added to the list of available registers.

Idioms in bytecode sequences. We provide a table of frequently appearing bytecode sequences as idioms to mitigate the inefficiency in code generation caused by stack semantics. The purpose of the idiom table is to locally eliminate the stack semantics and to exploit the local variable cache registers by avoiding unnecessary move instructions between local and stack variables. Many of the idioms work around the stack manipulation bytecode instructions to reduce the stack height of expression evaluation, thus reducing the number of stack variable registers needed and allowing more local variables to be cached. The total number of idioms provided is more than 80, and some examples, with their corresponding Java source statements, are shown in Table 1. A similar approach is discussed by Proebsting [18], where frequently appearing bytecode patterns are synthesized as superoperators and a heuristic method for inferring a good set of superoperators is proposed.

The bytecode sequences generated for a specific code fragment in a Java source, of course, depend on the Java source compiler. A different Java compiler may produce different bytecode sequences for the same source program. Therefore, some of the idioms provided may not be applicable to some class files if those class files were produced by a different Java compiler.
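
A table-driven matcher in the spirit of this mechanism is sketched below; the opcode spellings and the internal operation names are illustrative only:

    import java.util.List;
    import java.util.Map;

    final class IdiomMatcher {
        // idiom pattern -> internal operation emitted for the whole sequence
        private static final Map<List<String>, String> IDIOMS = Map.of(
            List.of("dup", "getfield", "iload", "iadd", "putfield"), "FIELD_ADD",
            List.of("aload", "iload", "iinc", "iaload"), "LOAD_POST_INC");

        // Returns the internal operation if an idiom starts at pos, else null,
        // in which case the single bytecode at pos is translated normally.
        static String matchAt(List<String> bytecode, int pos) {
            for (Map.Entry<List<String>, String> e : IDIOMS.entrySet()) {
                List<String> pattern = e.getKey();
                int end = pos + pattern.size();
                if (end <= bytecode.size()
                        && bytecode.subList(pos, end).equals(pattern)) {
                    return e.getValue();
                }
            }
            return null;
        }
    }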

Type inclusion test. The type inclusion test is a procedure for checking whether two given types are related by a subclass relationship; it results from either an explicit or an implicit requirement by the programmer. Explicit type checking can be specified by using the construct instanceof in the source program, and the type conformance test is implicitly imposed by a statement assigning an instance to a different type of local variable or array element. This run-time overhead can be a significant performance bottleneck in many Java programs.

In the past, several techniques have been proposed for conducting efficient type inclusion tests [19] by encoding class hierarchies in a small table, which is referred to by the inlined checking code. The speed and space efficiency vary depending on the encoding of the subclass relations. However, new classes and interfaces can be loaded in Java at arbitrary points in the program execution, and the table has to be recomputed for a new encoding when classes are loaded dynamically. Although we did not explore the performance benefit and recomputation cost of this scheme in our environment, we chose to implement a simple and effective caching mechanism for type inclusion testing and inlined the fast path of the code, as shown in the pseudocode for the checkcast instruction in Figure 6.

First, the given object is tested to determine whether it is null. Since a null object can be cast to any type of object, no operation is required in this case.

Table 1   Examples of bytecode idioms

    Bytecode Idiom                                  Corresponding Java Source Statement
    dup / getfield / iload / iop2 / putfield        Obj.field += var;
    dup2 / iaload / iload / iop2 / iastore          Array[i] += var;
    dup_x1 / putfield                               if ((Obj.field = val) ...)
    aload / iload / iinc / iaload                   ... = Array[i++] + ...


The object class is then compared with the value of a cache that holds the class that succeeded in the previous test. If these tests fail, then the Jvm-supplied run-time library is called to traverse the class hierarchy and check the subclass relation, and if this check succeeds, the cache value is updated with the new object class. In the code generation, only the first and second tests are inlined in the fast execution path, and the remaining operations are placed at the bottom of the code. We also maintain another cache, holding the failing class, for the instanceof instruction, and the object class is compared with the value of that cache as well. This is done because the instanceof instruction returns either true or false depending on the result of the test, instead of just throwing the class-cast exception as in the case where the checkcast instruction fails.
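
Following the structure of the Figure 6 pseudocode, the cached test can be sketched in Java as follows; the field and helper names are ours, and isSubclassSlow stands in for the Jvm run-time library call:

    final class CheckcastCache {
        private Class<?> lastSuccess;           // class that passed last time

        void checkcast(Object obj, Class<?> target) {
            if (obj == null) return;            // null casts to any type
            Class<?> c = obj.getClass();
            if (c == lastSuccess) return;       // inlined fast path: cache hit
            if (isSubclassSlow(c, target)) {    // slow path at the bottom of
                lastSuccess = c;                // the code: walk the hierarchy
                return;                         // and update the cache
            }
            throw new ClassCastException(c.getName());
        }

        private static boolean isSubclassSlow(Class<?> c, Class<?> target) {
            return target.isAssignableFrom(c);
        }
    }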

Exception handling. Since using exceptions can be part of normal programming practice in Java, the exception-handling mechanism must be efficient in the JIT implementation. For example, a program may contain an infinitely running loop that exits only through an array index out-of-bounds exception. We rely on the hardware trap and the system-level exception-handling mechanism to detect and handle some Java exceptions, namely the null pointer exception, the arithmetic (division by zero) exception, and the stack overflow exception, achieving high performance by eliminating the code for explicitly checking the exception conditions. Once an exception of this type has been raised, the system-level exception handler directly calls the user-level handler, which is registered through the exception registration record. The exception registration record is created in the execution frame only for those methods that have an exception table, namely a try-catch block, and is registered and deregistered with the system during the course of program execution. Thus an exception chain is maintained for each thread, always indicating the list of exception records with the topmost one as the anchor. The user-level handler then traverses this exception chain to find an appropriate address at which to resume, instead of unwinding the entire stack frame linearly.

The try region in effect at the time of an exception has to be identified by the exception handler in order to determine the corresponding catch block where execution may be resumed. This is originally given as a range in the bytecode sequence; however, that is not appropriate for JIT compilation, since the code generation does not take place in the order of the given bytecode sequence, but instead takes advantage of the basic block ordering that results from the various bytecode-level optimizations and the run-time trace information from the mixed-mode interpreter. We therefore provide an extra local variable to keep the current try region identification: whenever execution enters or exits a basic block and the corresponding try region identification changes, the local variable is updated so that the exception handler can locate the corresponding catch blocks correctly and check whether execution can be resumed.

For those exceptions that require code to be generated explicitly for checking exception conditions, such as array index out-of-bounds, we generate a conditional instruction that jumps to the bottom of the code when the exception condition is met, so that execution falls through the jump instruction in normal cases [20]. The actual exception-throwing instructions are placed at the bottom of the code. This placement avoids instruction cache misses caused by the loading of exception-throwing code that is unlikely to be executed, and it also takes advantage of the processors' static prediction that forward conditional branches are not taken [21].

Code scheduling. Code scheduling is a well-known technique for optimizing code by scheduling or reordering the given instructions to best fit the requirements imposed by the underlying machine's architectural characteristics. Since many machines nowadays, including Intel processors [21], have multiple pipelines and expose instruction-level parallelism to the user, it is essential that the code for such machines be organized in a way that takes best advantage of the pipelines present in an architecture or implementation. The code scheduling technique was developed for static compilation, where compilation time was not a major consideration. Even with the simple list scheduling technique over a basic block range, it is necessary to construct a dependence DAG. The worst-case running time is O(n²), where n is the number of machine instructions in the basic block [6]. In the IBM JIT compiler, in view of the compilation time constraint, we have implemented a different way of scheduling code within a basic block, running in O(C·n), where C is the size of the lookahead window to be scheduled.

The code scheduler works synchronously with the code generator within a basic block range. A table, whose columns represent the pipelines and whose rows represent clock cycles, is provided to serve as a buffer for reordering instruction sequences. When a native instruction is generated, it is given to the scheduler together with some attributes of the generated code, such as the referenced registers, an updated register, the address and type of memory access, restrictions on executable pipelines, and a flag indicating whether an exception can be raised by this instruction, all represented by bit vectors. The scheduler then considers all the requirements and dependencies between the newly generated instruction and the existing instructions in the buffer, including the number of clocks for address generation interlock (AGI), and places the instruction in a slot with the earliest possible clock number in the buffer. When there are no available slots in the buffer or the code scheduling scope has ended, the instructions in the buffer are emitted into the actual code space in the scheduled order.
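
The reordering buffer can be pictured as a two-dimensional table of issue slots, filled first-fit, as in this simplified sketch (sizes are arbitrary, and real entries carry the bit-vector attributes described above):

    final class ScheduleBuffer {
        private final String[][] slot;          // [clock][pipeline]

        ScheduleBuffer(int clocks, int pipelines) {
            slot = new String[clocks][pipelines];
        }

        // Place instr at the earliest clock >= readyClock (the clock at which
        // its dependences, including any AGI delay, are satisfied).
        boolean place(String instr, int readyClock) {
            for (int clock = readyClock; clock < slot.length; clock++) {
                for (int pipe = 0; pipe < slot[clock].length; pipe++) {
                    if (slot[clock][pipe] == null) {
                        slot[clock][pipe] = instr;   // first fit
                        return true;
                    }
                }
            }
            return false;   // buffer full: emit contents in scheduled order
        }
    }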

The type of memory access given to the scheduler is categorized on the basis of the language specification to ensure that there is no interference between memory access instructions of different types. Even when two instructions have the same type of memory access, different offsets mean that the accesses are to different memory locations, since generated instructions should not hold an interior pointer of an object unless common subexpression elimination has been applied. Our code scheduler takes advantage of this fact to reorder instructions.

The scheduler cannot reorder any two instructions that may raise exceptions or produce side effects when executed, in order to guarantee that the correct exception can be thrown and that the environment at the time of the exception is preserved. This is why the information indicating whether a given instruction can raise an exception is necessary for the scheduler. For array index out-of-bounds exception checking, we treat the compare and jump instructions as a single complex instruction and include it in the scheduling scope, unlike a traditional basic block range scheduler, in order to make the scheduling scope long enough.

Since all the requirements and dependencies among instructions are resolved by using bit vectors, an appropriate slot in which to place an instruction can be found quickly. The limitation of our approach is that a more appropriate instruction may appear among the succeeding instructions yet to be generated, since the first-fit strategy is used to place the given instruction in the buffer; that is to say, it is placed in the available slot with the earliest possible clock number. The architectural characteristics differ greatly between the Pentium** processor and the Pentium Pro processor family (Pentium Pro, Pentium II, and Pentium III processors), and code scheduling is available only for Pentium processors in the current version of the JIT compiler. It will be enabled for the Pentium Pro processor family as well in the next version of the JIT compiler.

Example. Figure 7 shows an example from the bubble sort sample program. We use this example to illustrate common subexpression elimination and the resulting extended bytecode, native code generation using the bytecode idioms, and register allocation. Through bytecode-level optimization, common subexpression elimination is applied to the original program (Figure 7A), and it is transformed into the equivalent sequence of extended bytecode, whose pseudocode is shown in Figure 7B. The instance variable access for the array object is localized and moved out of the loop. A common effective address is then generated for the array accesses with consecutive indices. The italicized statements in Figure 7B correspond to the extended bytecode added or modified by this optimization.

The resulting extended bytecode for the innermost loop is shown in Figure 7C, with labels showing the basic block boundaries. Each instruction is numbered with its index in the whole sequence of extended bytecode. New bytecode instructions, such as eaddress, eaload, and eastore, are added in our internal representation to express the operations applied by redundancy elimination. These instructions denote generating an effective address, loading a value from the specified effective address, and storing a value to the specified effective address, respectively. The number in parentheses in the figure for eaload and eastore indicates the offset from the effective address to be accessed.


The extended bytecode sequences in Figure 7C are combined by idioms, on the basis of which native code generation is performed. When the innermost loop is entered, the local variables i, j, and la[ ] are assigned permanent local registers according to the access counts for those variables. The numbers shown in the generated native code (Figure 7D) correspond to the numbers in the extended bytecode (Figure 7C). In the code generation for eaddress (numbered 15), array index out-of-bounds checking code is produced for the range of indices accessed through the generated effective address; that is, two checks are generated for the lower and upper offsets of the interval against the lower and upper bounds of the array, respectively. In this case, the interval [0, 1] is given as the index offset to be accessed from the effective address, and the two checking instructions are generated without changing the index variable in the register edi.

Experimental results

To evaluate the effectiveness of the optimizations and techniques presented above, we performed two experiments of different kinds: one focused on the individual optimizations, to check their benefits and contributions to the speedup, and the other measured the total performance, to examine the overall competitiveness of the IBM JIT compiler relative to that of a major competitor.

Evaluation of individual optimizations. We chose three industry-standard client benchmarking programs for the evaluation of individual optimizations: CaffeineMark** 3.0 [22], JMark** 2.0 [23], and SPECjvm98 [24]. For JMark 2.0, only the six computation-intensive tests were selected, since the other tests in the benchmark mainly focus on performance related to AWT (Abstract Window Toolkit), which is not very JIT-sensitive. The measurement for SPECjvm98 was conducted in the test mode with a count of 100 in this set of experiments, by running the test as a local application, not through the Web server as specified in the SPEC-compliant mode. Since these experiments are meant to measure the effectiveness of the JIT optimizations, the selective compilation switch was turned off; that is, all the methods were JIT-compiled (the switch is a development and tuning aid and is not made public). The JIT compilation time may or may not be included, depending on how each benchmark test is organized. For SPECjvm98, we took the best time for each test that resulted from the autorun; the compilation times were not counted in this comparison. All the experiments described below were conducted on a Pentium II 375 MHz processor with 256 MB of RAM, running Windows NT** 4.0 Service Pack 3.

Figure 8 shows the relative performance when we turn off one of the optimizations described above, to examine its effectiveness in each of the benchmark tests. The bar graphs for each test show how much performance is lost when we turn off each optimization, from left to right: method inlining (NO INLINE), common subexpression elimination (NO CSE), loop versioning (NO LOOPVER), exception check elimination (NO EXC), fast type inclusion testing (NO TYPE), and idioms in bytecode sequences (NO IDIOM), compared with the performance when we run the compiler fully optimized. As a reference, the performance with all the optimizations disabled is also shown, as the rightmost bar (NO OPT).

As we can easily see, method inlining is the most effective optimization in the CaffeineMark 3.0 method test, the JMark 2.0 processor test, and several SPECjvm98 tests. Recursive method inlining is the biggest contributor to improving the method test score, whereas virtual method inlining is effective for improving the processor test and SPECjvm98 mtrt test performance. Common subexpression elimination is effective when applied to the CaffeineMark 3.0 loop and float tests and to the JMark 2.0 fast Fourier test, all of which have array-manipulation-intensive loops. Loop versioning also makes a modest contribution in the same set of tests. These two optimizations have almost no effect on any test in SPECjvm98, probably because these tests have no single hot spot where optimizations applied to a few loops would contribute to the total performance. Exception check elimination seems to be effective only for the CaffeineMark 3.0 logic and string tests and for the SPECjvm98 mtrt and mpegaudio tests in this limited set of benchmark tests. Fast type inclusion testing is quite effective in several SPECjvm98 tests, contributing a performance improvement of up to 20 percent for the jess and db tests. Idiom-based code generation is widely effective across many of the benchmark tests.

Total performance evaluation. Figure 9 shows the results of the overall performance measurements for SPECjvm98, comparing the IBM Developer Kit for Windows, Java Technology Edition, Version 1.1.7, which includes the JIT compiler we developed, with Sun's reference implementation of JDK 1.1.7 (JDK 1.1.7B 003 for Windows) [25], which features the Symantec JIT compiler. This measurement was conducted in the SPEC-compliant mode on a machine with a Pentium II 350 MHz processor and 512 MB of memory, running Windows NT Server 4.0 Service Pack 4 and the Apache Web server 1.3.4. The relative performance is computed from the best data (elapsed time) produced by the autorun, so a higher bar means better performance.

The IBM Jvm and JIT compiler outperform Sun's JDK 1.1.7 in all of the tests in this benchmark, and for some tests our score is more than double. An attempt to run the latest version of the Microsoft Java execution environment, Software Development Kit (SDK) 3.2 (build 3186, released August 24, 1999),26 was also made in the same machine environment, but a valid result could not be obtained in the SPEC-compliant mode. In a comparison done under the test mode, the IBM Jvm and JIT compiler perform better than SDK 3.2, although the differences are smaller than those against Sun's JDK 1.1.7. Unfortunately, comparisons between different Java environments based on measurements in the test mode cannot be published under the terms of our SPEC membership. Overall, the IBM Jvm and JIT compiler can be considered a top performer among the major Java environments available for platforms based on Intel processors.

Related work

There have been several reports of optimizations and code generation techniques for fast and efficient compilation in product-level Java compilers. The Intel JIT compiler20 exploits lazy code selection as a fast and effective way of folding Java stack operands by propagating information about operands via an auxiliary data structure called the mimic stack. Since it takes a lightweight approach with no internal representation, its optimizations are limited to extended basic blocks. Both the Microsoft JIT compiler26 and CaCao27 have their own internal representations for optimizations, but the details of their JIT optimizations are not clearly described. Among static compilers, JOVE,28 TowerJ,29 and HPCJ30 are now available as products, and they all exploit traditional optimization techniques such as static single assignment (SSA) representation, SSA-dependent optimizations, and global register allocation based on graph coloring. Marmot31 is a complete system comprising a native compiler and run-time system, implementing standard scalar and object-oriented optimizations, such as call binding based on class hierarchy analysis. Ninja15 is another static compiler, a research prototype that addresses optimizations for technical computing to make Java competitive with FORTRAN and C++.
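
To illustrate the mimic-stack idea, here is a minimal Java sketch of lazy code selection; the Operand representation and the printed instruction form are our own assumptions, not the Intel JIT compiler's actual data structures. Pushes merely record where each value lives, and code is emitted only when an operation consumes its operands, so a bytecode sequence such as iload/iload/iadd folds into a single two-operand instruction:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class MimicStack {
        // A symbolic description of a value: a constant, a local variable
        // slot, or a register, rather than a slot on a real run-time stack.
        record Operand(String kind, int value) {}

        private final Deque<Operand> stack = new ArrayDeque<>();

        void pushConst(int c) { stack.push(new Operand("const", c)); } // iconst
        void pushLocal(int n) { stack.push(new Operand("local", n)); } // iload

        // iadd: pop the two symbolic operands and emit one two-operand
        // instruction instead of code that manipulates a real stack.
        void add() {
            Operand rhs = stack.pop();
            Operand lhs = stack.pop();
            System.out.printf("add %s%d, %s%d%n",
                    lhs.kind(), lhs.value(), rhs.kind(), rhs.value());
            stack.push(new Operand("reg", 0)); // the result now lives in a register
        }
    }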

Jalapeno32 is a Jvm implemented in Java itself, designed to satisfy some critical requirements for servers. It takes a compile-only approach to program execution, with three different dynamic compilers, instead of providing both an interpreter and a JIT compiler as in most other Jvms. Some interesting optimizations, such as profile-directed method inlining and escape analysis, are being implemented in the most aggressive version of the compiler.

A fast and lightweight register allocation method was proposed by Poletto and Sarkar33 and by Traub et al.34; it consists of scanning all the live ranges of local variables and allocating registers to them in a greedy fashion. Its compilation speed is reported to be several times faster than that of the fastest graph-coloring method, yet the quality of the generated code is comparable. Since register allocation is likely to be key to improving JIT compiler performance further, it may be worth investigating a similar idea, or an extension of it, for the IBM environment.
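
As a concrete illustration of that greedy scan, here is a minimal Java sketch under simplifying assumptions of our own: live intervals are precomputed, all registers are interchangeable, and an interval that finds no free register is simply marked spilled, rather than choosing a spill candidate by furthest end point as the cited algorithms do.

    import java.util.*;

    class LinearScan {
        // A live range: the variable is live from start to end.
        record Interval(String var, int start, int end) {}

        static Map<String, String> allocate(List<Interval> intervals, int numRegs) {
            // Visit intervals in order of increasing start point.
            List<Interval> sorted = new ArrayList<>(intervals);
            sorted.sort(Comparator.comparingInt(Interval::start));

            Deque<String> free = new ArrayDeque<>();
            for (int r = numRegs - 1; r >= 0; r--) free.push("r" + r);

            // Intervals currently holding a register, ordered by end point.
            PriorityQueue<Interval> active =
                    new PriorityQueue<>(Comparator.comparingInt(Interval::end));
            Map<String, String> assignment = new HashMap<>();

            for (Interval cur : sorted) {
                // Expire intervals that end before this one starts,
                // returning their registers to the free pool.
                while (!active.isEmpty() && active.peek().end() < cur.start()) {
                    free.push(assignment.get(active.poll().var()));
                }
                if (free.isEmpty()) {
                    assignment.put(cur.var(), "spill"); // naive spill decision
                } else {
                    assignment.put(cur.var(), free.pop());
                    active.add(cur);
                }
            }
            return assignment;
        }
    }

With two registers and intervals a:[1,4], b:[2,6], and c:[5,8], for example, a and b receive registers and a's register is recycled for c when a expires, with no spill. A single linear pass over the intervals suffices, which is the source of the speed advantage over graph coloring.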

The problem of unnecessary synchronization in Java is addressed by Bogda and Holzle,35 who remove synchronization for objects reachable from only a single thread. Their analysis first identifies objects that will be local to a thread, dereferencing the field of the thread-local object if necessary, and then transforms the program by introducing nonsynchronized versions of the optimizable classes. This work is in contrast to our effort of alleviating the synchronization cost by speeding up the majority of synchronization cases. Although the work is done for a static environment in which the whole program is available to be analyzed, the idea behind it should be applicable in the IBM environment. The analysis can also be used to determine objects that can be allocated on the stack instead of the heap, which would reduce the cost of both object allocation and garbage collection.
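
As a hypothetical illustration (our own example, not one from the cited paper) of the opportunity such an analysis exposes, consider the method below. The StringBuffer never becomes reachable from any other thread, so the monitor operations performed by its synchronized methods can never be contended and can be removed; the same property makes the object a candidate for stack allocation.

    // The buffer is reachable only from the creating thread, so the
    // locking done by the synchronized StringBuffer methods is useless.
    static String join(String[] parts) {
        StringBuffer buf = new StringBuffer(); // thread-local: never escapes
        for (int i = 0; i < parts.length; i++) {
            buf.append(parts[i]); // append() is synchronized, but the lock
                                  // can never be contended here
        }
        return buf.toString();
    }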

Concluding remarks

In this paper, we have presented the design and implementation of the IBM Java JIT Compiler version 3.0 for platforms based on Intel processors. Following an overview of the overall structure and the modifications to the base Jvm, we discussed in detail each of the optimizations and code generation techniques included in the JIT compiler, and then presented a performance comparison using three industry-standard client benchmarking programs. It was shown that each of the optimizations is very effective for some types of programs, and that overall the IBM JIT compiler combined with IBM's enhanced Jvm is one of the top performing Java execution environments on Intel-based platforms.

The discussion in this paper has centered on an Intel-based platform; however, most of the optimizations and techniques are common to the other IBM platforms that the JIT compiler supports, such as those based on the PowerPC* and S/390*.36 Specifically, the bytecode-level optimization features described in the section on optimization, and the basic ideas of register allocation and idiom-based code generation in the section on code generation, are all applicable across platforms. Also, all the techniques presented here are applicable regardless of the version of the Java implementation; that is, not only to version 1.1.7 of Java as made generally available, but also to the Java 2 implementation and beyond.

Since the JIT compiler is a critical component of the Jvm for achieving high performance, we will continue to improve it, stretching the limits of Java performance while balancing the compilation time requirement against the quality of the generated code. In the next version of the JIT compiler, we will completely eliminate the stack semantics from the internal representation and move to a register-based representation for further improvement in the quality of JIT-generated code. We also plan some object-oriented optimizations, including class hierarchy analysis and its use in method inlining and in direct binding for virtual invocation sites.36
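
To sketch what class hierarchy analysis enables at a virtual invocation site (the classes below are hypothetical examples of ours, not code from the compiler): if no loaded class overrides length(), the call has exactly one possible target, so it can be bound directly, or inlined, instead of being dispatched through the virtual method table.

    class Point {
        double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        double length() { return Math.sqrt(x * x + y * y); }
    }

    class Demo {
        static double total(Point[] ps) {
            double sum = 0.0;
            for (int i = 0; i < ps.length; i++) {
                // Devirtualizable: class hierarchy analysis finds no
                // subclass of Point that overrides length(), so this
                // virtual call has a single target.
                sum += ps[i].length();
            }
            return sum;
        }
    }

Of course, dynamic class loading can later invalidate such a conclusion, so a dynamic compiler must be prepared to back the optimization out.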

Acknowledgments

We would like to thank Duc Vianney and Akihiko Togami for their cooperation in measuring the performance of several Jvms for SPECjvm98 in the SPEC-compliant mode. We also thank the IBM Network Computing Software Division System Performance group in Austin for their helpful discussion and analysis of possible performance improvements.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Sun Microsystems, Inc., Intel Corporation, Pendragon Software Corporation, Ziff-Davis, Inc., or Microsoft Corporation.

Cited references and notes

1. J. Gosling, B. Joy, and G. Steele, The Java Language Specification, Addison-Wesley Publishing Co., Reading, MA (1996).

2. T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Addison-Wesley Publishing Co., Reading, MA (1996).

3. A given call site may invoke several different actual methods over the course of program execution, depending on the dynamic type of the receiver object.

4. The concrete example appears in the fourth section, in the subsection labeled "Example."

5. D. F. Bacon, R. Konuru, C. Murthy, and M. Serrano, "Thin Locks: Featherweight Synchronization for Java," Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (1998), pp. 258–268.

6. M. P. Plezbert and R. K. Cytron, "Does 'Just in Time' = 'Better Late Than Never'?" Conference Record of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (1997), pp. 120–131.

7. D. Detlefs and O. Agesen, "Inlining of Virtual Methods," Proceedings of the 13th European Conference on Object-Oriented Programming, Lecture Notes in Computer Science 1628, Springer-Verlag (1999), pp. 258–278.

8. R. Gupta, "Optimizing Array Bound Checks Using Flow Analysis," ACM Letters on Programming Languages and Systems 2, No. 1–4, 135–150 (1993).

9. KILL is a term used in data flow analysis. If an instruction redefines a variable, it is said to kill the definition, which means the information collected about the variable cannot be preserved beyond that point in the flow analysis.

10. A. V. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing Co., Reading, MA (1986).

11. Please note that index_check(3 < ub) in Figure 4B can be eliminated in Figure 4C because checking (0 <= i - 2) and (i + 1 < ub) at the top establishes (3 <= i + 1 < ub).

12. C. Click, "Global Code Motion, Global Value Numbering," Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation (1995), pp. 246–257.

13. S. S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, San Francisco, CA (1997).

14. J. Knoop, O. Rüthing, and B. Steffen, "Lazy Code Motion," Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation (1992), pp. 224–234.

15. Ninja: Numerically Intensive Java, IBM Corporation, available at http://www.research.ibm.com/ninja/.

16. Intel Architecture Software Developer's Manual, Order Number 243191-001, Intel Corporation, Santa Clara, CA (1997).

17. F. Chow and J. Hennessy, "The Priority-Based Coloring Approach to Register Allocation," ACM Transactions on Programming Languages and Systems 12, No. 4, 501–536 (1990).

18. T. A. Proebsting, "Optimizing an ANSI C Interpreter with Superoperators," Conference Record of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (1995), pp. 322–332.

19. J. Vitek, R. Horspool, and A. Krall, "Efficient Type Inclusion Tests," Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages & Applications, OOPSLA '97 (1997), pp. 142–157.

20. A.-R. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V. M. Parikh, and J. M. Stichnoth, "Fast, Effective Code Generation in a Just-in-Time Java Compiler," Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (1998), pp. 280–290.

21. Intel Architecture Optimization Manual, Order Number 242816-003, Intel Corporation, Santa Clara, CA (1997).

22. CaffeineMark 3.0 Benchmarks, Pendragon Software Corporation, Libertyville, IL, available at http://www.pendragon-software.com/pendragon/cm3/info.html.

23. JMark 2.0 Benchmarks, Ziff-Davis, Inc., New York, available at http://www.zdnet.com/zdbop/jmark/jmark20/applet/jmdocs/jmarkdoc.htm.

24. SPECjvm98 Benchmarks, Standard Performance Evaluation Corporation (SPEC), Manassas, VA, available at http://www.spec.org/osg/jvm98.

25. JDK 1.1.7B for Win32, Sun Microsystems, Inc., Palo Alto, CA, binary available at http://java.sun.com/products/jdk/1.1/index.html.

26. MS SDK for Java 3.2, Microsoft Corporation, Redmond, WA, binary available at http://microsoft.com/java/vm/dl_vm32.htm.

27. A. Krall and R. Grafl, "CACAO: A 64-bit Java VM Just-in-Time Compiler," Proceedings of the ACM PPoPP '97 Workshop on Java for Science and Engineering Computation (1997).

28. JOVE Technical Report, Instantiations Inc., Tualatin, OR, available at http://www.instantiations.com/jove/jovereport.htm.

29. TowerJ, Tower Technology Corporation, Austin, TX, available at http://www.towerj.com.

30. High Performance Compiler for Java, now integrated in VisualAge for Java, IBM Corporation, available at http://www.software.ibm.com/ad/vajava/.

31. Marmot: An Optimizing Compiler for Java, Microsoft Corporation, Redmond, WA, available at http://www.research.microsoft.com/apl.

32. M. G. Burke, J.-D. Choi, S. Fink, D. Grove, M. Hind, V. Sarkar, M. Serrano, V. C. Sreedhar, H. Srinivasan, and J. Whaley, "The Jalapeno Dynamic Optimizing Compiler for Java," Proceedings of the ACM SIGPLAN Java Grande Conference (1999), pp. 129–141.

33. M. Poletto and V. Sarkar, "Linear Scan Register Allocation," ACM Transactions on Programming Languages and Systems 21, No. 5, 895–913 (1999).

34. O. Traub, G. Holloway, and M. D. Smith, "Quality and Speed in Linear-Scan Register Allocation," Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (1998), pp. 142–151.

35. J. Bogda and U. Holzle, Removing Unnecessary Synchronization in Java, Technical Report TRCS99-10, Department of Computer Science, University of California, Santa Barbara, CA (1999).

36. K. Ishizaki, M. Kawahito, T. Yasue, M. Takeuchi, T. Ogasawara, T. Suganuma, T. Onodera, H. Komatsu, and T. Nakatani, "Design, Implementation, and Evaluation of Optimizations in a Just-in-Time Compiler," Proceedings of the ACM SIGPLAN Java Grande Conference (1999), pp. 119–128.

Accepted for publication September 15, 1999.

Toshio Suganuma IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Mr. Suganuma joined IBM in 1992 as a research member at the Tokyo Research Laboratory, and since then he has worked on compiler optimizations and code generation for the High Performance FORTRAN (HPF) compiler and IBM Java Just-in-Time Compiler projects. His research interests are in the areas of code optimization, parallel processing, and instruction scheduling. He received the B.E. and M.E. degrees, both in applied mathematics and physics, from Kyoto University in 1980 and 1982, respectively, and received the M.S. degree in computer science from Harvard University in 1992. He is currently a research staff member in the Network Computing Platform group.

Takeshi Ogasawara IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Mr. Ogasawara works in the area of optimizing compilers. He designed and implemented the type inclusion test optimization, the method invocation optimization, and the exception-handling mechanism for the IBM Java Just-in-Time Compiler on the IA32 architecture. He also contributed to the fixed-size buffer management of the just-in-time compiler for Java-based thin clients. He joined IBM in 1991 at the Tokyo Research Laboratory after receiving the B.S. and M.S. degrees in computer science from the University of Tokyo. His research interests include code optimization and memory management.

Mikio Takeuchi IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Mr. Takeuchi joined IBM as a researcher at the Tokyo Research Laboratory in 1993. He has been a member of the IBM Java Just-in-Time Compiler project since 1996, where he has been working on the platform-independent fast register manager and the code generator for platforms based on Intel processors. For this work, he received a Division Award in 1998. His current research interests are the implementation of dynamic languages (especially optimizing compilers and programming environments) and object-oriented technology (especially languages, tools, frameworks, and design patterns). He received the B.E. and M.E. degrees in mathematical engineering and information physics from the University of Tokyo in 1990 and 1993, respectively.

Toshiaki Yasue IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Mr. Yasue received the B.S. and M.S. degrees from the School of Science and Engineering, Waseda University, in Tokyo in 1989 and 1991, respectively. He joined IBM in 1995 and is currently a research member in the Network Computing Platform group at the Tokyo Research Laboratory. His primary research interests include compiler optimization and parallel processing.

Motohiro Kawahito IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Mr. Kawahito received the B.S. degree from Waseda University in 1991. He joined IBM in 1991 as a member of the AIX Systems group at the IBM Yamato Laboratory, where he developed the PC simulator, 5080 emulator, WABI, CMDS, and CATIA Viewer. He has been a research member at the Tokyo Research Laboratory since 1997, working on exception check elimination, constant propagation, dead store elimination, and class variable privatization for the Java JIT compiler project.

Kazuaki Ishizaki IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Mr. Ishizaki received the B.S. and M.S. degrees, both in computer science, from Waseda University in 1990 and 1992, respectively. Since joining IBM in 1992 at the Tokyo Research Laboratory, he has worked on the High Performance FORTRAN (HPF) compiler. He is currently involved with the IBM Java Just-in-Time Compiler. His research interests include optimizing compilers and processor architectures.

Hideaki Komatsu IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Dr. Komatsu received the B.S. and M.S. degrees in electrical engineering from Waseda University in 1983 and 1985, and the Ph.D. in computer science from Waseda University in 1998. Since joining IBM in 1983 at the Tokyo Research Laboratory, he has carried out research in the areas of optimizing compilers, the Prolog compiler, the data flow compiler, the FORTRAN 90 compiler, the High Performance FORTRAN compiler, and the IBM Java Just-in-Time Compiler. His research interests include compiler optimization techniques (register allocation, code scheduling, and loop optimizations) for instruction-level parallel architectures and loop optimizations for massively parallel computers. Dr. Komatsu is currently a research staff member in the Network Computing Platform group.

Toshio Nakatani IBM Research Division, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan (electronic mail: [email protected]). Dr. Nakatani received the B.S. degree in mathematics from Waseda University in 1975, and the M.S.E., M.A., and Ph.D. degrees in computer science from Princeton University in 1985, 1985, and 1987, respectively. He joined the Tokyo Research Laboratory as a research staff member in 1987 and is currently manager of the Network Computing Platform group. His research interests include architecture, compilers, and algorithms for parallel computer systems.

Reprint Order No. G321-5722.
