Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
SOFTWARE—PRACTICE AND EXPERIENCESoftw. Pract. Exper. 2007; 37:581–641Published online 24 October 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe.776
An empirical study of Javabytecode programs
Christian Collberg, Ginger Myles∗,† and Michael Stepp
Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.
SUMMARY
We present a study of the static structure of real Java bytecode programs. A total of 1132 Java jar-fileswere collected from the Internet and analyzed. In addition to simple counts (number of methods per class,number of bytecode instructions per method, etc.), structural metrics such as the complexity of control-flow and inheritance graphs were computed. We believe this study will be valuable in the design of futureprogramming languages and virtual machine instruction sets, as well as in the efficient implementation ofcompilers and other language processors. Copyright c© 2006 John Wiley & Sons, Ltd.
Received 18 August 2004; Revised 28 June 2006; Accepted 28 June 2006
KEY WORDS: Java; bytecode; measure; software complexity metrics
1. INTRODUCTION
In a much cited study [1], Knuth examined FORTRAN programs collected from printouts found in a
computing center. Among other things, he found that arithmetic expressions tend to be small, which,
he argued, has consequences for code-generation and optimization algorithms chosen in a compiler.
Similar studies have been carried out for COBOL [2,3], Pascal [4], and APL [5,6] source code.
In this paper we report on a study on the static structure of real Java bytecode programs.
Using information gathered from an automated Google search, we collected a sample of 1132 Java
programs, in the form of jar-files (collections of Java class files). The static structure of these programs
was analyzed automatically using SandMark, a tool which, among other things, performs static
analysis of Java bytecode.
It is our hope that the information gathered and presented here will be of use in a variety of settings.
For example, information about the structure of real programs in one language can be used to design
∗Correspondence to: Ginger Myles, Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.†E-mail: [email protected]
Contract/grant sponsor: National Science Foundation; contract/grant number: 324360
Contract/grant sponsor: Air Force Research Lab; contract/grant number: F33615-02-C-1146
Copyright c© 2006 John Wiley & Sons, Ltd.
582 C. COLLBERG, G. MYLES AND M. STEPP
future languages in the same family. One example is the finally clause of Java exception handlers.
Special instructions (jsr and ret) were added to Java bytecode to handle this construct efficiently.
These instructions turn out to be a major source of complexity for the Java verifier [7]. If, instead,
the Java bytecode designers had known (from a study of MODULA-3 programs, for example) that the
finally clause is very unusual in real programs, they may have elected to keep jsr/ret out of the
instruction set. This would have simplified the Java bytecode verifier while imposing little overhead on
typical programs‡.
There are many types of tools that operate on programs. Compilers are an obvious example, but there
are many software engineering tools which transform programs in order to improve on their structure,
readability, modifiability, etc. Such language processors can benefit from knowing typical and extreme
counts of various aspects of real programs. For example, in our study we have found that while, on
average, a Java class file has 9.0 methods, in the extreme case we found a class with 570 methods.
This information can be used to select appropriate data structures, algorithms, and memory allocation
strategies.
As a further example, we have found that the average Java class has no more than one method that
overrides a method of its superclass. This means that most methods are written ‘from scratch’, and will
not be present in any of the superclasses of that class. Furthermore, it means that most methods written
in a given class are unlikely to be overridden in its subclasses. Thus, aggressive inlining appears to be
a good candidate for optimization. The usual obstacle to inlining in Java is virtual method invocation,
where a single method callsite could have many potential targets. However, given these data, we can see
that often this will not be a problem, because methods are rarely overridden. Combined with the fact
that the average method has 33.2 instructions and is thus quite small, we see that aggressive inlining is
an excellent candidate for optimization. We hope that this study will be useful in providing many other
insights that will facilitate the design of tools to study and improve programs.
Our own research is focused on the protection of software from piracy, tampering, and reverse
engineering, using code obfuscation and software watermarking [8]. Code obfuscation attempts to
introduce confusion in a program to slow down an adversary’s attempts at reverse engineering it.
Software watermarking inserts a copyright notice or customer identification number into a program
to allow the owner to assert their intellectual property rights. An important aspect of these techniques
is stealth. For example, a software watermarking algorithm should not embed a mark by inserting code
that is highly unusual, since that would make it easy to locate and remove. Our study of instruction
frequencies and instruction n-grams directly addresses this concern, by showing us exactly which
instruction sequences are common and which are not. For example, any code that contains the JSR Wor GOTO W instructions would be extremely unstealthy, since not a single one of our 1132 test jars
contain either of these instructions. We believe the information presented in this paper will be useful
in developing future evaluation models for the stealth of software protection algorithms.
The remainder of this paper is structured as follows. In Section 2 we describe how our statistics were
gathered. In Section 3 we give a brief overview of Java bytecode. In Sections 4, 5, 6, and 7, we present
application-level, class-level, method-level, and instruction-level statistics, respectively. In Section 8
we discuss related work, and in Section 9 we summarize our findings.
‡The finally clause can be implemented by copying code.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 583
Table I. Collected jar-file statistics.
Measure Count
Total number of jar-files 1132Total size of jar-files (bytes) 198 945 317Total number of class files 102 688Total number of packages 7682Total number of classes 90 500Total number of interfaces 12 188Total number of constant pool entries 12 538 316Total number of methods 874 115Total number of fields 422 491Total number of instructions 26 597 868
2. EXPERIMENTAL METHODOLOGY
Table I shows some statistics of the applications that were gathered. Figure 1 shows an overview of
how our statistics were collected.
To obtain a suitably random set of sample data, we queried the Google search engine using the key-
phrase ‘"index of" jar’. This query was designed to find Web pages that display server directory
listings that contain files with the extension .jar. In the resulting HTML pages we searched for any
<A> tag whose HREF attribute designated a jar-file. These files were then downloaded.
The initial collection of jar-files numbered in excess of 2000. An initial analysis discarded any files
that contained no Java classes, or were structurally invalid. Static statistics were next gathered using
the SandMark tool.
SandMark [9] is a tool developed to aid in the study of software-based software protection
techniques. The tool is implemented in Java and operates on Java bytecode. Included in SandMark
are algorithms for code obfuscation, software watermarking, and software birthmarking. A variety of
static analysis techniques are included to aid in the development of new algorithms and as a means to
study the effectiveness of these algorithms. Examples of such techniques are: class hierarchy, control-
flow, and call graphs; def-use and liveness analysis; stack simulation; forward and backward slicing;
various bytecode diffing algorithms; a bytecode viewer; and a variety of software complexity metrics.
Not all well-formed jar-files could be completely analyzed. In most cases this was because the jar-file
was not self-contained, i.e. it referenced classes that were not in the jar or in the Java standard library.
Missing class files prevent the class hierarchy from being constructed, for example. In these cases
we still computed as many statistics as possible. For example, while an incomplete class hierarchy
prevented us from gathering accurate statistics of class inheritance depth, it still allowed us to gather
control-flow graph (CFG) statistics. Our SandMark tool is also not perfect. In particular, it is known to
build erroneous CFGs for methods with complex subroutine structures (combinations of the jsr and
ret instructions used for Java’s finally clause). There are few such CFGs in our sample set, so this
problem is unlikely to adversely affect our data.
Owing to our random sampling of jar-files from the Internet, the collection is somewhat
idiosyncratic. We assume that any two jar-files with the same name are in fact the same program, and
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
584 C. COLLBERG, G. MYLES AND M. STEPP
"index of" jar
SCRIPT
classesPERpackage.jgrmethodsPERclass.jgrbasicblocksPERmethod.jgr
mosaic.jar
OligoWiz 1.0.4.jar mysql.jar
OligoWiz 1.0.2.jarOligoWiz 1.0.5.jar
OligoWiz 1.0.1.jar
OligoWiz 1.0.3.jar
⇓
⇒
⇓
⇒
⇐
Figure 1. Overview of how our statistics were gathered.
keep only one. However, we kept those files whose names indicated that they were different versions
of the same program, as shown by the OligoWiz files in Figure 1. Most likely, these files are very
similar and may contain methods that are identical between versions. It is reasonable to assume that
such redundancy will have somewhat skewed our results. An alternative strategy might have been to
guess (based on the file name) which files are versions of the same program, and keep only the higher-
numbered file. A less random sampling of programs could also have been collected from well-known
repositories of Java code, such as sourceforge.net.
Giving an informative presentation of this type of data turns out to be difficult. In many applications
we will only be interested in typical values (such as mode or mean) or extreme values (such as min
and max). Such values can easily be presented in tabular form. However, we would also like to be able
to quickly get a general ‘feel’ for the behavior of the data, and this is best presented in a visual form.
The visualization is complicated by the fact that most of our data have sharp ‘spikes’ and long ‘tails’.
That is, one or a few (typically small) values are very common, but there are a small number of large
outliers which by themselves are also interesting. This can be seen, for example, in Figure 30(b) below,
which shows that out of the 801 117 methods in our data, 99% have fewer than two subroutines but one
method has 29 subroutines.
We have chosen to visualize much of our data using binned bar graphs where extremely tall bars are
truncated to allow small values to be visualized. For example, consider the graph in Figure 2 which
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 585
0
5000
10000
15000 MIN: 4MAX: 26708AVG: 122.1
MODE: 24MEDIAN: 65STD DEV: 215
SAMPLES: 102688TOTAL: 12538316
0%
,1
1
4
0%
,2
86
6
0%
,6
1
7
1%
,1
14
7
8
1%
,3
75
9
11
%,1
02
29
10-19
22
%,1
10
57
20-29
31
%,9
12
8
30-39
38
%,7
49
1
40-49
44
%,6
02
0
50-59
49
%,5
14
7
60-69
54
%,4
57
0
70-79
58
%,4
09
0
80-89
61
%,3
98
5
90-998
3%
,2
24
13
100-1999
1%
,8
05
4200-299
95
%,3
70
4300-399
97
%,1
87
9
400-499
98
%,1
03
5
500-599
98
%,6
76
600-699
99
%,3
57
700-799
99
%,2
25
800-899
99
%,1
64
900-999
99
%,4
96
1000-1999
99
%,7
4
2000-2999
99
%,6
3000-3999
99
%,3
4000-4999
99
%,2
5000-5999
10
0%
,3
20000-29999
Cumulativepercentage
Striped barshave beentruncated
Values havebeen binned
Value
Figure 2. Illustration and explanation of the bar graphs used throughout the paper for data visualization.
shows the number of constants in the constant pool of the Java applications we studied. Most of our
graphs will have the same structure. Along the x-axis we show the bins into which our data have been
classified. On top of each bar the actual count and cumulative percentage are shown. Very tall bars are
truncated and shown striped. In a separate table to the top right of the graph we show the total number
of data points, the minimum, maximum, and average x-values, the mode§, the median (the middle
value), and the standard deviation. The SAMPLES value is the total number of items inspected for the
given statistic, and the TOTAL value is the total number of sub-items counted. For example, in the
above graph, the SAMPLES value will be the number of classes analyzed and the TOTAL value will
be the sum of all of the constant pool entries over all of the classes analyzed. The TOTAL value is
only included where appropriate. The FAILED value gives the number of unsuccessful measurements,
when appropriate.
3. THE STRUCTURE OF JAVA BYTECODE PROGRAMS
A Java application consists of a collection of classes and interfaces. Each class or interface is compiled
into a class file. A program consists of a number of class files which are collected together into a jar-file.
§The mode is the most frequently occurring value. This is often—but because of binning not always—the tallest bar of the graph.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
586 C. COLLBERG, G. MYLES AND M. STEPP
String: "HELLO"Method: "C.M(int)"Field: "int F"
.....
Constant Pool
Flags Name Type Attributes
[pub] "x" int
[priv] "y" C
Fields
ExceptionTable
Code[]= push,add,store...
[]= ...
MaxStack=5, MaxLocals=8
Attributes
Magic Number
Constant Pool
Access Flags
This Class
Super Class
Interfaces
Fields
Methods
Attributes
[pub]
Flags Name AttributesSig
"q" ()int
Methods
Code
Exceptions
Code
Figure 3. A view of the Java class file format.
A jar-file is directly executable by a Java virtual machine interpreter. The Java class file stores all
necessary data regarding the class. There is a symbol table (called the Constant Pool) which stores
strings, large literal integers and floats, and names and types of all fields and methods. Each method is
compiled to Java bytecode, a stack-based virtual machine instruction set. Figure 3 shows the structure
of the Java class file format. The JVM is defined by Lindholm and Yellin [10].
The Java bytecodes can manipulate data in several formats: integers (32 bits), longs (64 bits), floats
(32 bits), doubles (64 bits), shorts (16 bits), bytes (8 bits), Booleans (1 bit), chars (16 bit Unicode),
object references (32 bit pointers), and arrays. The Boolean, byte, char, and short types are compiled
down into integers.
Bytecode instructions have variable widths. Simple instructions such as iadd (integer addition) are
one byte wide, while some instructions (such as tableswitch) can be multiple bytes. Each method
can have up to 65 536 local variables and formal parameters, called slots. The bytecodes reference
slots by number. For example, the instruction ’iload 3’ pushes the third local variable onto the
stack. In order to access high-numbered slots, a special wide instruction can be used to modify load
and store instructions to use 16-bit indexes. The Java execution stack is 32 bits wide. Longs and doubles
take up two stack entries and two slot numbers.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 587
Table II. Notation used to refer to data values in the bytecode.
Notation Explanation
B An 8-bit integer valueS A 16-bit integer valueL A 32-bit integer valueCb An 8-bit constant pool indexCs A 16-bit constant pool indexFb An 8-bit local variable indexFs A 16-bit local variable indexC[i] The ith constant pool entryV[i] The ith variable/formal parameter in the current method
Local variable slots are untyped. In fact, a particular slot can hold different types of data at different
locations in a method. However, regardless of how execution reaches a given location in the method,
the type of data stored in a particular slot at that location will always be the same. A static analysis
known as a stack simulation can compute slot types without executing a method.
Some bytecodes reference data from the class’ constant pool, for example to push large constants
or to invoke methods. Constant pool references are 8 or 16 bits long. To push a reference to a literal
string with constant pool number 4567, the compiler would issue the instruction ’ldc w 4567’.
If the constant pool number instead fits into a byte (such as 123), the shorter instruction ’ldc 123’would suffice.
Some information is stored in attributes in the class file. This includes exception table ranges, and
(for debugging) line-number ranges and local variable names.
Table II explains the notation used in Tables III–VI, which give an overview of the JVM instruction
set.
4. PROGRAM-LEVEL STATISTICS
In this and the following three sections we will present the data collected about applications
(this section), classes (Section 5), methods (Section 6), and instructions (Section 7).
Figures 4–7 visualize application-level data about the programs we gathered.
4.1. Packages
Classes in Java are optionally organized into a hierarchy of packages. For example, Java’s Stringclass is in the package java.lang, and can be referred to as java.lang.String. As can be seen
from Figure 4(a), many Java programs put all classes into the same package. In fact, half of the 1132
applications we gathered have three or fewer packages, and only four have 50 or more.
A package α is counted if there exists some class β such that the fully qualified classname of β is
α.β. Thus, if an application has classes java.pack1.Class1 and java.pack2.Class2 then java.pack1 and
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
588 C. COLLBERG, G. MYLES AND M. STEPP
Table III. The first 87 Java bytecode instructions.
Opcode Mnemonic Args Stack Description
0 nop [] ⇒ []1 aconst null [] ⇒ [null] Push null object.2 iconst m1 [] ⇒ [−1] Push −1.3 . . . 8 iconst n [] ⇒ [n] Push integer constant n, 0 ≤ n ≤ 5.9 . . . 10 lconst n [] ⇒ [n] Push long constant n, 0 ≤ n ≤ 1.11 . . . 13 fconst n [] ⇒ [n] Push float constant n, 0 ≤ n ≤ 2.14 . . . 15 dconst n [] ⇒ [n] Push double constant n, 0 ≤ n ≤ 1.
16 bipush n:B [] ⇒ [n] Push 1-byte signed integer.17 sipush n:S [] ⇒ [n] Push 2-byte signed integer.18 ldc n:Cb [] ⇒ [C[n]] Push item from constant pool.19 ldc w n:Cs [] ⇒ [C[n]] Push item from constant pool.20 ldc2 w n:Cs [] ⇒ [C[n]] Push long/double from constant pool.
21 . . . 25 Xload n:Fb [] ⇒ [V [n]] X ∈{i,l,f,d,a}, Load int, long, float, double, objectfrom local var.
26 . . . 29 iload n [] ⇒ [V [n]] Load local integer var n, 0 ≤ n ≤ 3.30 . . . 33 lload n [] ⇒ [V [n]] Load local long var n, 0 ≤ n ≤ 3.34 . . . 37 fload n [] ⇒ [V [n]] Load local float var n, 0 ≤ n ≤ 3.38 . . . 41 dload n [] ⇒ [V [n]] Load local double var n, 0 ≤ n ≤ 3.42 . . . 45 aload n [] ⇒ [V [n]] Load local object var n, 0 ≤ n ≤ 3.46 . . . 53 Xload [A, I ] ⇒ [V ] X ∈{ia,la,fa,da,aa,ba,ca,sa}. Push the value V (an
int, long, etc.) stored at index I of array A.54 . . . 58 Xstore n:Fb [V [n]] ⇒ [] X ∈{i,l,f,d,a}, Store int, long, float, double, object
to local var.59 . . . 62 istore n [V [n]] ⇒ [] Store to local integer var n, 0 ≤ n ≤ 3.63 . . . 66 lstore n [V [n]] ⇒ [] Store to local long var n, 0 ≤ n ≤ 3.67 . . . 70 fstore n [V [n]] ⇒ [] Store to local float var n, 0 ≤ n ≤ 3.71 . . . 74 dstore n [V [n]] ⇒ [] Store to local double var n, 0 ≤ n ≤ 3.75 . . . 78 astore n [V [n]] ⇒ [] Store to local object var n, 0 ≤ n ≤ 3.79 . . . 86 Xstore [A, I, V ] ⇒ [] X ∈{ia,la,fa,da,aa,ba,ca,sa}. Store the value V (an
int, long, etc.) at index I of array A.
java.pack2 would be counted, but java would not. Also, the default or ‘null’ package is counted exactly
once, if there is a class in that package.
Figure 5(a) shows that while a small number of programs have packages with hundreds of classes,
the typical package will have only one, and the average is about 11.8.
Packages can be nested inside of other packages, allowing for the easy creation of unique names.
While it is possible to create a package hierarchy of arbitrary depth, Figure 4(b) shows that the
maximum depth for an application is 8, with an average depth of 3.9.
A Java interface is a special type of class that only contains constant declarations or method
signatures. A class which implements an interface must provide implementations of the methods.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 589
Table IV. Java bytecode instructions 87 to 169.
Opcode Mnemonic Args Stack Description
87 pop [A] ⇒ [] Pop top of stack.
88 pop2 [A, B] ⇒ [] Pop 2 elements.
89 dup [V ] ⇒ [V, V ] Duplicate top of stack.
90 dup x1 [B, V ] ⇒ [V, B, V ]
91 dup x2 [B, C, V ] ⇒ [V, B, C, V ]
92 dup2 [V, W ] ⇒ [V, W, V, W ]
93 dup2 x1 [A, V, W ] ⇒ [V, W, A, V, W ]
94 dup2 x2 [A, B, V, W ] ⇒ [V, W, A, B, V , W ]
95 swap [A, B] ⇒ [B, A] Swap top stack elements.
96 . . . 99 Xadd [A, B] ⇒ [R] X ∈{i,l,d,f}. R = A + B .
100 . . . 103 Xsub [A, B] ⇒ [R] X ∈{i,l,d,f}. R = A − B .
104 . . . 107 Xmul [A, B] ⇒ [R] X ∈{i,l,d,f}. R = A ∗ B .
108 . . . 111 Xdiv [A, B] ⇒ [R] X ∈{i,l,d,f}. R = A/B .
112 . . . 115 Xrem [A, B] ⇒ [R] X ∈{i,l,d,f}. R = A%B .
116 . . . 119 Xneg [A] ⇒ [R] X ∈{i,l,d,f}. R = −A.
120 . . . 121 Xshl [A, B] ⇒ [R] X ∈{i,l}. R = A << B .
122 . . . 123 Xshr [A, B] ⇒ [R] X ∈{i,l}. R = A >> B .
124 . . . 125 Xushr [A, B] ⇒ [R] X ∈{i,l}. R = A >>> B .
126 . . . 127 Xand [A, B] ⇒ [R] X ∈{i,l}. R = A&B .
128 . . . 129 Xor [A, B] ⇒ [R] X ∈{i,l}. R = A|B .
130 . . . 131 Xxor [A, B] ⇒ [R] X ∈{i,l}. R = AxorB .
132 iinc V :Fb ,B:B [] ⇒ [] V + = B .
133 . . . 144 X2Y [F ] ⇒ [T ] Convert F from type X to T of type Y . X ∈{i,l,f,d}, Y ∈{i,l,f,d}.145 . . . 147 i2X [F ] ⇒ [T ] X ∈{b,c,s}. Convert integer F to byte, char, or short.
148 lcmp [A, B] ⇒ [V ] Compare long values. A > B ⇒ V = 1, A < B ⇒ V = −1, A =B ⇒ V = 0.
149,151 Xcmpl [A, B] ⇒ [V ] Compare float or double values. X ∈{f,d}. A > B ⇒ V = 1, A <
B ⇒ V = −1, A = B ⇒ V = 0. A = NaN ∨ B = NaN ⇒ V =−1
150,152 Xcmpg [A, B] ⇒ [V ] Compare float or double values. X ∈{f,d}. A > B ⇒ V = 1, A <
B ⇒ V = −1, A = B ⇒ V = 0. A = NaN ∨ B = NaN ⇒ V = 1
153 . . . 158 if� L:S [A] ⇒ [] �={eq,ne,lt,ge,gt,le}. If A � 0 goto L + pc.
159 . . . 164 if icmp� L:S [A, B] ⇒ [] �={eq,ne,lt,ge,gt,le}. If A � B goto L + pc.
165 . . . 166 if acmp� L:S [A, B] ⇒ [] �={eq,ne}. A, B are object refs. If A � B goto L + pc.
167 goto I :S [] ⇒ [] Jump to I + pc.
168 jsr I :S [] ⇒ [A] Jump subroutine to instruction I + pc. A = the address of the
instruction after the jsr.
169 ret L:Fb [] ⇒ [] Return from subroutine. Address in local var L.
Interfaces are often used to compensate for Java’s lack of multiple inheritance. Figure 5(b) shows
that over 70% of Java packages contain 0 or 1 interface.
4.2. Protection
A Java class can be declared as abstract (it serves only as a superclass to classes which actually
implements its methods) or final (it cannot be extended). These declarations are, however, fairly
unusual. Figures 6(a) and 6(b) show that over 70% of all packages contain no abstract or final classes.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
590 C. COLLBERG, G. MYLES AND M. STEPP
Table V. Java bytecode instructions 170 to 195.
Opcode Mnemonic Args Stack
170 tableswitch D:L,l, h:L,oh−l+1 [K] ⇒ []JuLongDescrmp through the K :th offset. Else goto D.
171 lookupswitch D:L,n:L,(m, o)n [K] ⇒ []If, for one of the (m, o) pairs, K = m, then goto o. Else goto D.
172 . . . 176 Xreturn [V ] ⇒ []X ∈{i,f,l,d,a}. Return V .
177 return [] ⇒ []Return from void method.
178 getstatic F :Cs [] ⇒ [V ]Push value V of static field F .
179 putstatic F :Cs [V ] ⇒ []Store value V into static field F .
180 getfield F :Cs [R] ⇒ [V ]Push value V of field F in object R.
181 putfield F :Cs [R, V ] ⇒ []Store value V into field F of object R.
182 invokevirtual P :Cs [R, A1, A2, . . .] ⇒ []Call virtual method P , with arguments A1 · · · An, through object reference R.
183 invokespecial P :Cs [R, A1, A2, . . .] ⇒ []Call private/init/superclass method P , with arguments A1 · · · An, through object reference R.
184 invokestatic P :Cs [A1, A2, . . .] ⇒ []Call static method P with arguments A1 · · · An.
185 invokeinterface P :Cs ,n:S [R, A1, A2, . . .] ⇒ []Call interface method P , with n arguments A1 · · · An, through object reference R.
187 new T :Cs [] ⇒ [R]Create a new object R of type T .
188 newarray T :B [C] ⇒ [R]Allocate new array R, element type T , C elements long.
189 anewarray T :Cs [C] ⇒ [A]Allocate new array A of reference types, element type T , C elements long.
190 arraylength [A] ⇒ [L]Determines the length L of array A.
191 athrow [R] ⇒ [?]Throw exception.
192 checkcast C:Cs [R] ⇒ [R]Ensures that R is of type C.
193 instanceof C:Cs [R] ⇒ [V ]Push 1 if object R is an instance of class C, else push 0.
194 monitorenter [R] ⇒ []Get lock for object R.
195 monitorexit [R] ⇒ []Release lock for object R.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 591
Table VI. Java bytecode instructions 196 to 201.
Opcode Mnemonic Args Stack
196 wide C:B,I :Fs [] ⇒ []Perform opcode C on variable V[I]. C is one of the load/store instructions.
197 multianewarray T :Cs ,D:Cb [d1, d2, . . .] ⇒ [R]Create new D-dimensional multidimensional array R. d1, d2, . . . are the dimension sizes.
198 ifnull L:S [V ] ⇒ []If V = null goto L.
199 ifnonnull L:S [V ] ⇒ []If V �= null goto L.
200 goto w I :L [] ⇒ []Goto instruction I .
201 jsr w I :L [] ⇒ [A]Jump subroutine to instruction I . A is the address of the instruction right after the jsr w.
0
100
200
300
400 MIN: 1MAX: 74AVG: 6.8
MODE: 1MEDIAN: 2STD DEV: 9
SAMPLES: 1132TOTAL: 7682
36
%,
40
8
1
47
%,
12
6
2
53
%,
66
3
59
%,
73
4
65
%,
63
5
69
%,
53
6
73
%,
42
7
75
%,
23
8
78
%,
39
9
89
%,
12
4
10-19
96
%,
76
20-29
98
%,
21
30-39
99
%,
14
40-49
99
%,
2
50-59
99
%,
1
60-69
10
0%
,1
70-79
(a)
0
50
100
150
200
250MIN: 1
MAX: 8AVG: 3.9
MODE: 5MEDIAN: 3STD DEV: 1
SAMPLES: 1132
15
%,
17
2
1
18
%,
40
2
37
%,
21
3
3
59
%,
24
6
4
83
%,
27
2
5
97
%,
15
6
6
99
%,
26
7
10
0%
,7
8
(b)
Figure 4. Program-level statistics: (a) number of packages per application; (b) package depth per application.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
592 C. COLLBERG, G. MYLES AND M. STEPP
0
500
1000
MIN: 0MAX: 310AVG: 11.8
MODE: 1MEDIAN: 5STD DEV: 21
SAMPLES: 7682TOTAL: 90500
3%
,267
0
17%
,1077
1
27%
,778
2
36%
,682
3
45%
,692
4
52%
,502
5
57%
,432
6
61%
,306
7
64%
,248
8
67%
,217
9
84%
,1256
10-19
91%
,558
20-29
94%
,232
30-39
96%
,146
40-49
97%
,78
50-5997%
,43
60-6998%
,40
70-79
98%
,28
80-89
98%
,22
90-99
99%
,52
100-199
99%
,24
200-299
100%
,2
300-399
(a)
0
500
1000
1500
2000 MIN: 0MAX: 160AVG: 1.6
MODE: 0MEDIAN: 0STD DEV: 4
SAMPLES: 7682TOTAL: 12188
57%
,4449
0
75%
,1327
1
83%
,619
2
88%
,395
3
91%
,245
4
93%
,149
5
94%
,98
6
95%
,71
7
96%
,55
8
96%
,38
9
99%
,177
10-19
99%
,24
20-29
99%
,11
30-39
99%
,6
40-49
99%
,13
50-59
99%
,3
60-69
100%
,2
100-199
(b)
Figure 5. Program-level statistics: (a) number of classes per package; (b) number of interfaces per package.
4.3. Inheritance graphs
In addition to a class implementing an interface, a class can also extend another class. In this case
the subclass inherits all of the variables and methods of the class which it extends (the superclass),
thus creating an inheritance relationship. An inheritance graph can be constructed to represent the
superclass/subclass relationship. An inheritance graph is a rooted, directed, acyclic graph where the
nodes are classes and interfaces. There is an edge from node A to node B iff node B directly extends
or implements node A. Thus, classes will have edges to their direct subclasses, and interfaces will
have edges to the classes that implement them and to the interfaces that extend them. In order to make
this a rooted graph, we assert that all interfaces extend the class java.lang.Object (and, hence,
java.lang.Object becomes the root). Using this definition, we can see that node B is reachable
from node A iff it is possible to assign a value of type B to a variable of type A. The inheritance graph
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 593
0
100
200
300
400
500 MIN: 0MAX: 37AVG: 0.6
MODE: 0MEDIAN: 0STD DEV: 1
SAMPLES: 7682TOTAL: 4768
73%
,5679
0
87%
,1006
1
92%
,428
2
95%
,185
3
96
%,
14
2
4
97%
,83
5
98%
,51
6
99%
,47
7
99%
,13
8
99%
,21
9
99%
,22
10-1999%
,4
20-29
100%
,1
30-39
(a)
0
200
400
600
MIN: 0MAX: 270AVG: 1.3
MODE: 0MEDIAN: 0STD DEV: 6
SAMPLES: 7682TOTAL: 10299
74%
,5736
0
83%
,654
1
87%
,347
2
90%
,252
3
92
%,
14
9
4
94%
,1
04
5
95%
,77
6
96%
,65
7
96%
,35
8
97
%,
39
9
98%
,135
10-19
99%
,46
20-29
99%
,15
30-39
99%
,12
40-49
99%
,6
50-59
99
%,
3
60-69
99
%,
1
70-79
99
%,
4
100-199
100%
,2
200-299
(b)
Figure 6. Program-level statistics: (a) number of abstract classes per package; (b) number of final classes package.
height for a given application is the maximum number of superclasses that any class in the application
has. This will include some but not all of the Java library classes.
Figure 7(a) shows that on average the height of an inheritance graph for an application is 4.5 and
that over 90% of all applications have an inheritance graph with a height of less than 7.
It is important to note that for 303 of our applications, the inheritance graph construction failed due
to the jar-file not being self-contained. This means that some class in the application had a superclass
that was unavailable for analysis, therefore we could not place it correctly in the inheritance graph.
This is typical of applications that rely on external libraries which are not packaged with them.
Figure 7(b) shows the number of classes in each application that extend other classes in the same
application, as opposed to Java library classes. Some classes in each application must directly extend
a Java library class (most often java.lang.Object), but it is interesting to note that only about
one-third (32 899/90 500) extend other classes inside the same application. Table VII shows that most
of the classes in each application extend java.lang.Object directly.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
594 C. COLLBERG, G. MYLES AND M. STEPP
0
50
100
150
200 MIN: 1MAX: 10AVG: 4.5
MODE: 5MEDIAN: 4STD DEV: 1
SAMPLES: 1132FAILED: 303
10
%,8
6
1
14
%,3
5
2
24
%,8
6
3
45
%,1
71
4
69
%,2
00
5
92
%,1
86
6
97
%,4
7
7
99
%,1
5
89
9%
,2
9
10
0%
,1
10-19
(a)
0
50
100
150
200 MIN: 0MAX: 641AVG: 29.1
MODE: 0MEDIAN: 5STD DEV: 67
SAMPLES: 1132TOTAL: 32899
33
%,3
78
0
37
%,4
6
1
43
%,6
4
2
45
%,2
3
3
48
%,4
0
4
52
%,3
9
5
55
%,3
4
6
56
%,1
5
7
58
%,1
9
8
59
%,2
1
9
72
%,1
40
10-19
78
%,6
7
20-29
84
%,7
2
30-39
86
%,1
6
40-49
88
%,2
3
50-59
89
%,1
4
60-69
89
%,6
70-79
90
%,1
0
80-89
91
%,1
4
90-99
95
%,4
4
100-199
97
%,2
2
200-299
99
%,2
1
300-399
99
%,1
400-499
99
%,2
500-599
10
0%
,1
600-699
(b)
Figure 7. Inheritance graphs: (a) height per application; (b) number of user-class extenders per application.
5. CLASS-LEVEL STATISTICS
In this section we present data regarding the top-level structure of class files. This includes the number,
type/signature, and protection of fields and methods, and the class’ or interface’s position in the
application’s inheritance graph.
5.1. Fields
A Java class can contain data members, called fields. Fields are either class variables (they are declared
static and only one instance exists at runtime) or instance variables (every instantiation of the class
contains a unique copy).
Figures 8–10 show field statistics. In Figure 8(a) we see that 60% of all classes have two or fewer
fields, but in one extreme case a class declared almost a thousand fields. Instance variables are more
common than class variables. On average, a class will contain 2.8 instance variables and 1.6 class
variables, and 44% of all classes have more than one instance variable but only 17% have more than
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 595
Table VII. Most common standard classes to be extended by application classes.
Class Count %
java.lang.Object 42 629 47.1user class 34 805 38.5java.lang.Exception 1089 1.2javax.swing.AbstractAction 893 1.0java.lang.Thread 738 0.8javax.swing.JPanel 691 0.8java.lang.RuntimeException 464 0.5java.awt.event.WindowAdapter 341 0.4java.awt.Panel 313 0.3java.awt.event.MouseAdapter 309 0.3java.util.ListResourceBundle 276 0.3java.util.EventObject 248 0.3java.io.FilterInputStream 232 0.3org.omg.CORBA.portable.ObjectImpl 226 0.2org.omg.CORBA.SystemException 217 0.2org.xml.sax.helpers.DefaultHandler 203 0.2java.awt.Dialog 203 0.2java.io.FilterOutputStream 202 0.2java.applet.Applet 202 0.2java.awt.Canvas 197 0.2java.io.OutputStream 196 0.2java.awt.Frame 194 0.2java.io.IOException 192 0.2java.io.InputStream 183 0.2javax.swing.JFrame 149 0.2javax.swing.JDialog 135 0.1org.omg.CORBA.UserException 126 0.1java.lang.Error 120 0.1java.beans.SimpleBeanInfo 119 0.1java.awt.event.KeyAdapter 118 0.1javax.swing.table.AbstractTableModel 104 0.1java.awt.event.FocusAdapter 101 0.1java.util.AbstractSet 94 0.1java.security.Signature 80 0.1javax.swing.beaninfo.SwingBeanInfo 79 0.1java.security.GeneralSecurityException 78 0.1org.xml.sax.SAXException 70 0.1javax.swing.JComponent 70 0.1javax.swing.event.InternalFrameAdapter 60 0.1java.util.Hashtable 57 0.1java.lang.IllegalArgumentException 56 0.1java.io.Writer 54 0.1java.util.AbstractList 51 0.1java.util.Properties 50 0.1java.io.Reader 49 0.1
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
596 C. COLLBERG, G. MYLES AND M. STEPP
0
2000
4000
6000
8000 MIN: 0MAX: 968AVG: 4.4
MODE: 1MEDIAN: 2STD DEV: 11
SAMPLES: 90500TOTAL: 396816
23%
,21153
0
47%
,22033
1
60%
,11968
2
69%
,7669
3
75%
,5217
4
79%
,4143
5
82%
,2919
6
85%
,2411
7
87%
,1911
8
89%
,1535
9
96%
,6216
10-19
98%
,1683
20-29
98%
,682
30-39
99%
,358
40-49
99%
,127
50-59
99%
,97
60-69
99%
,94
70-7999%
,57
80-8999%
,40
90-99
99%
,139
100-199
99%
,23
200-299
99%
,18
300-399
99%
,5
400-499
99%
,1
500-599
100%
,1
900-999
(a)
0
5000
10000
15000 MIN: 0MAX: 742AVG: 2.8
MODE: 0MEDIAN: 1STD DEV: 6
SAMPLES: 90500TOTAL: 248930
34%
,31039
0
56%
,20177
1
69%
,11632
2
77%
,7128
3
82%
,4861
4
86%
,3533
5
89%
,2457
6
91%
,2042
7
93%
,1371
8
94%
,1238
9
98%
,3606
10-19
99%
,829
20-29
99%
,298
30-39
99%
,120
40-49
99%
,40
50-59
99%
,40
60-69
99%
,27
70-79
99%
,14
80-89
99%
,8
90-99
99%
,38
100-199
99%
,1
200-299
100%
,1
700-799
(b)
0
1000
2000
3000
4000
5000 MIN: 0MAX: 555AVG: 1.6
MODE: 0MEDIAN: 0STD DEV: 9
SAMPLES: 90500TOTAL: 147886
71%
,64397
0
83%
,10822
1
88%
,4442
2
90%
,2456
3
92%
,1517
4
93%
,1092
5
94%
,812
6
95%
,748
7
95%
,544
8
96%
,490
9
98%
,1940
10-19
99%
,596
20-29
99%
,200
30-39
99%
,111
40-49
99%
,76
50-59
99%
,63
60-69
99%
,35
70-79
99%
,20
80-89
99%
,13
90-99
99%
,80
100-199
99%
,22
200-299
99%
,18
300-399
99%
,5
400-499
100%
,1
500-599
(c)
Figure 8. Field declarations in classes: (a) number of fields per class; (b) number of instance variables per class;(c) number of class (static) variables per class.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 597
0
2000
4000
6000 MIN: 0MAX: 553AVG: 1.5
MODE: 0MEDIAN: 0STD DEV: 7
SAMPLES: 102688TOTAL: 152468
70%
,72849
0
81%
,11355
1
87%
,5579
2
90%
,2817
3
92%
,2275
4
93%
,1499
5
94%
,1055
6
95%
,895
7
96%
,701
8
96%
,487
9
98%
,1956
10-19
99%
,551
20-29
99%
,240
30-39
99%
,171
40-49
99%
,46
50-5999%
,42
60-6999%
,35
70-79
99%
,9
80-89
99%
,31
90-99
99%
,67
100-199
99%
,20
200-299
99%
,7
300-399
100%
,1
500-599
(a)
0
5000
10000
15000 MIN: 0MAX: 968AVG: 2.6
MODE: 0MEDIAN: 1STD DEV: 8
SAMPLES: 102688TOTAL: 270023
35%
,35966
0
60%
,26492
1
73%
,12769
2
81%
,8079
3
85%
,4618
4
88%
,3354
5
91%
,2314
6
92%
,1795
7
94%
,1261
8
95%
,975
9
98%
,3527
10-19
99%
,900
20-29
99%
,238
30-39
99%
,104
40-49
99%
,67
50-59
99%
,40
60-69
99%
,43
70-79
99%
,22
80-89
99%
,8
90-99
99%
,85
100-199
99%
,13
200-299
99%
,10
300-399
99%
,6
400-499
99%
,1
500-599
100%
,1
900-999
(b)
0
1000
2000
3000
4000
5000 MIN: 0MAX: 967AVG: 1.4
MODE: 0MEDIAN: 0STD DEV: 8
SAMPLES: 90500TOTAL: 126861
65%
,59481
0
84%
,17059
1
89%
,4517
2
92%
,2548
3
93%
,1390
4
94%
,914
5
95%
,660
6
96%
,617
7
96%
,480
8
97%
,374
9
98%
,1535
10-19
99%
,420
20-29
99%
,163
30-39
99%
,106
40-49
99%
,48
50-59
99%
,59
60-69
99%
,19
70-79
99%
,14
80-89
99%
,6
90-99
99%
,60
100-199
99%
,15
200-299
99%
,12
300-399
99%
,1
400-499
99%
,1
500-599
100%
,1
900-999
(c)
Figure 9. Field declarations in classes: (a) number of primitive fields per class or interface; (b) number of referencefields per class or interface; (c) number of final fields per class.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
598 C. COLLBERG, G. MYLES AND M. STEPP
0
100
200
300 MIN: 0MAX: 52AVG: 0.0
MODE: 0MEDIAN: 0STD DEV: 0
SAMPLES: 90500TOTAL: 3213
98
%,
88
94
6
0
99
%,
10
18
1
99
%,
20
6
2
99
%,
12
2
3
99
%,
99
4
99
%,
21
5
99
%,
19
6
99
%,
14
7
99
%,
7
8
99
%,
7
9
99
%,
37
10-19
99
%,
3
20-29
10
0%
,1
50-59
(a)
0
20
40
60
80
100 MIN: 0MAX: 6AVG: 0.0
MODE: 0MEDIAN: 0STD DEV: 0
SAMPLES: 90500TOTAL: 256
99
%,
90
35
4
0
99
%,
80
1
99
%,
40
2
99
%,
13
3
99
%,
9
4
99
%,
3
5
10
0%
,1
6
(b)
Figure 10. Field declarations in classes: (a) number of transient fields per class;(b) number of volatile fields per class.
one static variable. It is also more common for a class to have fields of reference types rather than
primitive types. On average, a class will have 1.5 fields of primitive type, but 2.6 fields of reference
type.
Fields may also be declared final, transient, or volatile. A final field is one whose value cannot be
altered after it is first assigned in the instance or class initializer. A transient field is one that is not part
of the persistent state of its parent object. A volatile field is one that cannot be internally cached by
the JVM, since it is assumed to be accessed by multiple threads. Figures 9(c), 10(a), and 10(b) report
on the number of final, transient, and volatile fields per class, respectively. We see that 98% of all
classes have no transient fields, 99% have no volatile fields, and more than half of all classes have no
final fields. The rareness of these modifiers makes the outliers in these graphs particularly interesting.
While in general there are very few transient fields, Figure 10(a) shows us that one class had 52 of
them. Also, when we compare Figures 9(b) and 9(c), we see that they have very similar MAX values.
On closer inspection, this is due to a single class in the file ‘kawa-1.7.jar’ which has 968 fields,
all of which are reference types, and only one of which is not final.
Table VIII gives a breakdown of the declared types of fields. Only primitive types and types exported
from the Java standard library are shown. Our data also contained some user defined types with high
usage counts. This is due to idiosyncrasies of our collected programs, such as a program declaring
vast numbers of fields of one of its classes. Table VIII shows that the vast majority of types are ints,
Strings, and booleans. We note that, somewhat surprisingly, java.lang.Class (Java’s notion
of a class) is a frequent field type, and doubles are more frequent than floats.
5.2. Constant pool
Figure 11 shows the number of entries in the constant pool (the class file’s symbol table) per class.
While small literal integers are stored directly in the bytecode, large integers as well as Strings and
real numbers are, instead, stored in the constant pool. Figures 12–14 show the relative distribution of
literal types.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 599
Table VIII. Most common field types.
Field type Count %
int 153 861 21.8java.lang.String 105 787 15.0boolean 44 914 6.4java.lang.Class 24 355 3.4long 16 556 2.3java.lang.Object 14 472 2.0byte[] 10 229 1.4int[] 8157 1.1java.util.Vector 7601 1.0java.util.Hashtable 7095 1.0short 7048 1.0byte 6464 0.9java.lang.String[] 6412 0.9java.util.Map 5692 0.8double 5256 0.7java.util.List[] 4971 0.7float 3115 0.4java.io.File 3019 0.4char[] 2995 0.4java.math.BigInteger 2782 0.3java.lang.StringBuffer 2472 0.3java.sql.Connection 2443 0.3javax.swing.JLabel 2066 0.3java.util.HashMap 2064 0.3java.awt.Color 2058 0.3char 1987 0.3java.util.ArrayList 1748 0.2
The difference between Figures 14(a) and 14(b) is that ‘string constants’ are user-defined strings,
such as all literal strings that appear in the source code. The ‘UTF8 strings’ include all user strings,
but also include strings used internally by the classfile format, as well as the names of all referenced
classes, interfaces, methods, fields, etc. It is interesting to note that UTF8 strings comprise over half of
the total constants counted, whereas the sum of all numeric constants is less than 2% of the total.
5.3. Methods
Figures 15–17 give statistics of methods. Of interest is that 73% of all classes have nine or fewer
methods (Figure 15(a)), and that the vast majority of classes have no abstract or native methods
(Figures 15(b) and 16(a)). Almost all classes have at least one virtual method, with an average of
7.7 methods per class (Figure 17(a)). Static methods are quite rare: 80% of all classes have at most one
static method, with an average of 1.3 methods per class (Figure 16(b)).
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
600 C. COLLBERG, G. MYLES AND M. STEPP
0
5000
10000
15000 MIN: 4MAX: 26708AVG: 122.1
MODE: 24MEDIAN: 65STD DEV: 215
SAMPLES: 102688TOTAL: 12538316
0%
,1
1
4
0%
,2
86
6
0%
,6
1
7
1%
,1
14
7
8
1%
,3
75
9
11
%,
10
22
9
10-19
22
%,
11
05
7
20-29
31
%,
91
28
30-39
38
%,
74
91
40-49
44
%,
60
20
50-59
49
%,
51
47
60-69
54
%,
45
70
70-79
58
%,
40
90
80-89
61
%,
39
85
90-99
83
%,
22
41
3
100-199
91
%,
80
54
200-299
95
%,
37
04
300-399
97
%,
18
79
400-499
98
%,
10
35
500-599
98
%,
67
6
600-699
99
%,
35
7
700-799
99
%,
22
5
800-8999
9%
,1
64
900-9999
9%
,4
96
1000-1999
99
%,
74
2000-2999
99
%,
6
3000-3999
99
%,
3
4000-4999
99
%,
2
5000-5999
10
0%
,3
20000-29999
(a)
Figure 11. Number of constant pool entries per class or interface.
0
1000
2000
3000
4000 MIN: 0MAX: 2818AVG: 1.9
MODE: 0MEDIAN: 0STD DEV: 42
SAMPLES: 102688TOTAL: 190363
89%
,91653
0
92%
,3814
1
94%
,1684
2
95%
,1202
3
96%
,866
4
97%
,562
5
97%
,406
6
97%
,354
7
98%
,234
8
98%
,180
9
99%
,902
10-19
99%
,273
20-29
99%
,91
30-39
99%
,88
40-49
99%
,55
50-59
99%
,22
60-69
99%
,29
70-79
99%
,41
80-89
99%
,24
90-99
99%
,89
100-199
99%
,36
200-299
99%
,1
300-399
99%
,6
400-499
99%
,20
500-599
99%
,22
1000-1999
100%
,34
2000-2999
(a)
0
200
400
600
800
1000 MIN: 0MAX: 1033AVG: 0.3
MODE: 0MEDIAN: 0STD DEV: 13
SAMPLES: 102688TOTAL: 33383
95%
,98232
0
98%
,3031
1
99%
,663
2
99%
,221
3
99%
,92
4
99%
,102
5
99%
,68
6
99%
,28
7
99%
,68
8
99%
,19
9
99%
,64
10-19
99%
,16
20-29
99%
,6
30-39
99%
,3
40-49
99%
,1
50-59
99%
,1
60-69
99%
,2
70-79
99%
,16
80-89
99%
,32
100-199
99%
,6
200-299
99%
,2
300-399
100%
,15
1000-1999
(b)
Figure 12. Literal constants in classes: (a) number of integer entries per class or interface; (b) numberof long entries per class or interface.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 601
0
50
100
150
200 MIN: 0MAX: 287AVG: 0.1
MODE: 0MEDIAN: 0STD DEV: 3
SAMPLES: 102688TOTAL: 7278
99
%,
10
18
35
0
99
%,
41
1
1
99
%,
13
9
2
99
%,
84
3
99
%,
52
4
99
%,
42
5
99
%,
19
6
99
%,
18
7
99
%,
8
8
99
%,
5
9
99
%,
42
10-19
99
%,
7
20-29
99
%,
7
30-39
99
%,
170-79
99
%,
2
100-199
10
0%
,16
200-299
(a)
0
100
200
300
400
500 MIN: 0MAX: 1292AVG: 0.1
MODE: 0MEDIAN: 0STD DEV: 5
SAMPLES: 102688TOTAL: 12148
97%
,100185
0
98%
,1365
1
99%
,435
2
99%
,240
3
99%
,117
4
99%
,64
5
99%
,64
6
99%
,36
7
99%
,32
8
99%
,12
9
99%
,85
10-19
99%
,16
20-29
99%
,9
30-39
99%
,2
40-49
99%
,1
50-59
99%
,1
60-69
99%
,1
70-79
99%
,1
80-89
99%
,3
90-99
99%
,7
100-199
99%
,10
200-299
99%
,1
300-399
100%
,1
1000-1999
(b)
Figure 13. Literal constants in classes: (a) number of float entries per class or interface; (b) numberof double entries per class or interface.
5.4. Member protection
Figures 18 and 19 show the frequency of visibility restrictions of class members (fields and methods).
A member can be package private, private, protected, or public. Figure 19(c)
summarizes the information by giving average numbers of members with a particular protection.
5.5. Inheritance
Figure 20 shows information about class inheritance. Figure 20(a) shows the number of immediate
subclasses of a class, i.e. the number of classes that directly extend a particular class. Figure 20(b)
shows the number of classes that directly or indirectly extend a particular class. We found that 97%
of all classes have two or fewer direct subclasses. One of the classes in our collection is extended
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
602 C. COLLBERG, G. MYLES AND M. STEPP
0
5000
10000
15000 MIN: 0MAX: 13266AVG: 7.1
MODE: 0MEDIAN: 1STD DEV: 73
SAMPLES: 102688TOTAL: 726412
45
%,
47
13
9
0
58
%,
12
54
9
1
64
%,
69
16
2
70
%,
55
03
3
73
%,
38
58
4
76
%,
30
34
5
79
%,
2393
6
81
%,
2245
7
83
%,
1913
8
84
%,
1596
9
92
%,
75
84
10-19
95
%,
34
28
20-29
97
%,
1488
30-39
97%
,920
40-49
98%
,516
50-59
98%
,363
60-69
98%
,168
70-79
99%
,187
80-89
99%
,159
90-99
99%
,437
100-19999%
,100
200-29999%
,59
300-399
99%
,62
400-499
99%
,24
500-599
99%
,22
600-699
99%
,2
700-799
99%
,2
800-899
99%
,2
900-999
99%
,12
1000-1999
99%
,4
2000-2999
100%
,3
10000-19999
(a)
0
5000
10000
15000MIN: 2
MAX: 13335AVG: 64.1
MODE: 14MEDIAN: 38STD DEV: 101
SAMPLES: 102688TOTAL: 6582750
0%
,11
2
0%
,50
3
0%
,247
4
0%
,364
5
1%
,1205
6
2%
,888
7
4%
,1529
8
5%
,1628
9
21%
,16562
10-19
37%
,15828
20-29
49%
,12120
30-39
57%
,8881
40-49
65%
,8089
50-59
71%
,5946
60-69
76%
,4731
70-79
79%
,3835
80-89
82%
,3065
90-99
95%
,12889
100-199
98%
,3048
200-299
99%
,978
300-399
99%
,352
400-499
99%
,166
500-599
99%
,153
600-699
99%
,59
700-799
99%
,22
800-899
99%
,9
900-999
99%
,25
1000-1999
99%
,5
2000-2999
100%
,3
10000-19999
(b)
Figure 14. Literal constants in classes: (a) number of string entries per class or interface; (b) numberof UTF8 string constants per class.
by 187 classes. In addition, 48% of classes are at depth 1 in the inheritance graph, i.e. they extend
java.lang.Object, the root of the inheritance graph (Figure 20(c)). The average depth of a class
is low (only 2.1), although six of our classes are at depth 30–39. In many cases we failed to build
the inheritance hierarchy due to the program containing references to classes not in the jar-file or the
standard Java library.
Figure 21 shows the same information for interfaces.
Table VII was computed by looking at which classes each application class extended. Every interface
is considered to extend java.lang.Object. Similarly, Table IX looks at which interfaces were
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 603
0
2000
4000
6000
8000
10000 MIN: 0MAX: 570AVG: 9.0
MODE: 2MEDIAN: 5STD DEV: 15
SAMPLES: 90500TOTAL: 814603
0%
,3
83
0
6%
,5
93
9
1
25
%,
16
60
1
2
36
%,
99
48
3
45
%,
82
39
4
53
%,
71
88
5
58
%,
50
92
6
64
%,
51
81
7
69
%,
42
10
8
73
%,
35
58
9
90
%,
15
25
4
10-19
95
%,
46
54
20-29
97
%,
19
77
30-39
98
%,
93
5
40-49
99
%,
463
50-59
99
%,
240
60-699
9%
,171
70-799
9%
,97
80-89
99
%,
100
90-99
99
%,
225
100-199
99
%,
12
200-299
99
%,
4
300-399
99
%,
11
400-499
10
0%
,18
500-599
(a)
0
100
200
300
400
500 MIN: 0MAX: 123AVG: 0.1
MODE: 0MEDIAN: 0STD DEV: 1
SAMPLES: 90500TOTAL: 12519
96%
,87477
0
98%
,1234
1
98%
,472
2
98%
,391
3
99%
,238
4
99%
,125
5
99%
,166
6
99%
,98
7
99%
,21
8
99%
,51
9
99%
,112
10-19
99%
,61
20-29
99%
,21
30-39
99%
,26
40-49
99%
,4
80-89
100%
,3
100-199
(b)
Figure 15. Method declarations in classes: (a) number of methods per class;(b) number of abstract methods per class.
extended by other interfaces. There is a bit of ambiguity here, because in Java source code an interface
uses the extends keyword to extend another interface, although technically the interface is really being
implemented. Table X shows which interfaces were implemented by any application class, including
other interfaces.
Method overriding occurs when a method in a class has the same name and signature as a method
in its superclass. This is a technique used to provide a more specialized implementation of a particular
method. Figure 22 shows that the majority of classes have at most one overridden method.
6. METHOD-LEVEL STATISTICS
In this section we present method-level statistics. This includes information about method signatures,
local variables, CFGs, and exception handlers.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
604 C. COLLBERG, G. MYLES AND M. STEPP
0
20
40
60
80
100 MIN: 0MAX: 179AVG: 0.0
MODE: 0MEDIAN: 0STD DEV: 0
SAMPLES: 90500TOTAL: 1676
99
%,
90
27
1
0
99
%,
60
1
99
%,
31
2
99
%,
30
3
99
%,
21
4
99
%,
15
5
99
%,
3
6
99
%,
27
7
99
%,
7
8
99
%,
17
10-19
99
%,
8
20-29
99
%,
3
30-39
99
%,
3
40-499
9%
,1
50-59
99
%,
260-69
10
0%
,1
100-199
(a)
0
1000
2000
3000 MIN: 0MAX: 533AVG: 1.3
MODE: 0MEDIAN: 0STD DEV: 8
SAMPLES: 90500TOTAL: 114225
67%
,61138
0
80%
,12150
1
89%
,8093
2
92%
,2145
3
93%
,1278
4
94%
,1039
5
95%
,973
6
96%
,856
7
97%
,427
8
97%
,323
9
99%
,1437
10-19
99%
,351
20-29
99%
,120
30-39
99%
,76
40-49
99%
,29
50-59
99%
,5
60-69
99%
,4
70-79
99%
,4
80-89
99%
,8
90-99
99%
,25
100-199
99%
,11
400-499
100%
,8
500-599
(b)
Figure 16. Method declarations in classes: (a) number of native methods per class;(b) number of static methods per class.
6.1. Method sizes
Figure 23 shows the sizes in bytes and instructions of bytecode methods. The maximum size allowed
by the JVM is 65 535 bytes, but only one of our methods (63 019 bytes long) approached this limit.
6.2. Local variables and formal parameters
Figure 24 shows the maximum number of slots used by a method. All instance methods will use at
least one slot (for the this parameter). No method used more than 157 slots, indicating that the wideinstruction (used to access up to 65 536 slots) will be rarely used.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 605
0
5000
10000
15000
MIN: 0MAX: 569AVG: 7.7
MODE: 2MEDIAN: 4STD DEV: 12
SAMPLES: 90500TOTAL: 700378
0%
,4
54
0
13
%,
12
19
0
1
33
%,
17
66
8
2
45
%,
10
46
8
3
53
%,
72
04
4
60
%,
68
19
5
65
%,
46
06
6
70
%,
43
93
7
74
%,
36
53
8
77
%,
27
76
9
91
%,
12
83
5
10-19
96
%,
39
48
20-29
97
%,
16
63
30-39
98
%,
758
40-49
99
%,
356
50-59
99
%,
187
60-699
9%
,150
70-799
9%
,92
80-89
99
%,
66
90-99
99
%,
195
100-199
99
%,
9
200-299
10
0%
,10
500-599
(a)
0
1000
2000
3000 MIN: 0MAX: 558AVG: 0.5
MODE: 0MEDIAN: 0STD DEV: 9
SAMPLES: 90500TOTAL: 43710
92%
,83935
0
95%
,2688
1
97%
,1219
2
97%
,678
3
98%
,418
4
98%
,218
5
98%
,321
6
99%
,163
7
99%
,115
8
99%
,65
9
99%
,418
10-19
99%
,122
20-29
99%
,62
30-39
99%
,25
40-49
99%
,5
50-59
99%
,1
60-69
99%
,3
70-79
99%
,9
80-89
99%
,1
90-99
99%
,3
100-199
99%
,2
200-299
99%
,12
400-499
100%
,17
500-599
(b)
Figure 17. Method declarations in classes: (a) number of non-static methods per class;(b) number of final methods per class.
Table XI gives a breakdown of slot types. Note that Java’s short, byte, char, and booleantypes are compiled into integers in the bytecode, and thus will not show up as distinct types. Also, a
slot may contain more than one type within a method, although at any one particular location it must
always have the same type. Table XI shows that ints and Strings make up the majority of slot types.
Only 3.8% of slots contain two types, and only 0.6% contain three types. This indicates that the design
of the JVM could have been simplified by requiring each slot to contain exactly one type throughout
the body of a method, without much adverse effect.
Slots are not explicitly typed in the bytecode. Instead, slot types have to be computed using a static
analysis known as stack simulation. This involves simulating the behavior of each instruction on the
stack and the local variable slots, while following all possible paths of control flow within the method.
A similar algorithm is used in the Java bytecode verifier.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
606 C. COLLBERG, G. MYLES AND M. STEPP
0
2000
4000
6000
8000
10000 MIN: 0MAX: 743AVG: 2.0
MODE: 0MEDIAN: 0STD DEV: 9
SAMPLES: 90500TOTAL: 177960
51
%,
46
75
1
0
74
%,
20
79
3
1
82
%,
73
35
2
87
%,
47
48
3
90
%,
23
63
4
92
%,
20
35
5
94
%,
11
68
6
94
%,
76
2
7
95
%,
64
1
8
96
%,
508
9
98
%,
22
14
10-19
99
%,
53
3
20-29
99
%,
249
30-39
99
%,
131
40-49
99
%,
52
50-59
99
%,
32
60-69
99
%,
52
70-79
99
%,
21
80-899
9%
,21
90-99
99
%,
58
100-199
99
%,
21
200-299
99
%,
8
400-499
99
%,
3
500-599
10
0%
,1
700-799
(a)
0
2000
4000
6000
8000
10000 MIN: 0MAX: 119AVG: 0.9
MODE: 0MEDIAN: 0STD DEV: 3
SAMPLES: 90500TOTAL: 81574
78%
,70806
0
86%
,7828
1
90%
,3497
2
93%
,2247
3
94%
,1263
4
95%
,920
5
96%
,679
6
97%
,618
7
97%
,417
8
97%
,321
9
99%
,1395
10-19
99%
,311
20-29
99%
,89
30-39
99%
,50
40-49
99%
,11
50-59
99%
,7
60-69
99%
,21
80-89
100%
,20
100-199
(b)
Figure 18. Protection of class members: (a) number of package private members per class;(b) number of private members per class.
Figure 24(b) shows the maximum stack depth required by a method. This is stored as an attribute in
the class file, and could thus be larger then the actual stack size needed at runtime.
The number of slots used by a method in Figure 24(a) includes those slots reserved for method
parameters. Figure 25 breaks out the number of formal parameters per method. This is the number
of parameters, not the number of slots those parameters would consume (i.e. longs and doubles
count as one). As expected, the average is low (1.0), with 90% of all methods having two or
fewer formals. Table XII shows the most common method signatures. The signatures are presented
with the method’s parameter type list first, followed by the method’s return type. So, for example,
a method that takes two integer parameters and returns a String would have a signature of
‘(int,int)java.lang.String’. We have also abstracted away any reference types that do not
appear in the standard Java libraries (i.e. user-defined classes). If a user-defined class is a parameter
or the return type of a method, we replace it with ‘user class’. One reason that ‘()void’ is so
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 607
0
2000
4000
6000
8000
10000 MIN: 0MAX: 487AVG: 3.0
MODE: 0MEDIAN: 1STD DEV: 10
SAMPLES: 90500TOTAL: 268810
41
%,3
78
76
0
62
%,1
88
55
1
73
%,9
46
0
2
79
%,5
45
1
3
83
%,3
97
7
4
86
%,2765
5
88
%,2080
6
90%
,1701
7
92%
,1143
8
93%
,862
9
97
%,4
10
4
10-19
98%
,1259
20-2999%
,494
30-39
99%
,175
40-49
99%
,69
50-59
99%
,94
60-69
99%
,27
70-79
99%
,15
80-89
99%
,16
90-99
99%
,47
100-199
99%
,1
200-299
100%
,29
400-499
(a)
0
5000
10000
15000 MIN: 0MAX: 556AVG: 7.5
MODE: 1MEDIAN: 4STD DEV: 13
SAMPLES: 90500TOTAL: 683075
4%
,3949
0
20%
,15042
1
34%
,11790
2
44%
,9565
3
53%
,7837
4
59%
,5872
5
66%
,5958
6
71%
,4607
7
75%
,3803
8
78%
,2949
9
92%
,12121
10-19
96%
,3626
20-29
97%
,1569
30-39
98%
,717
40-49
99%
,355
50-59
99%
,199
60-69
99%
,116
70-79
99%
,67
80-89
99%
,91
90-99
99%
,222
100-199
99%
,26
200-299
99%
,16
300-399
99%
,2
400-499
100%
,1
500-599
(b)
Protection %
Package private 14.7
Private 6.7
Protected 22.2
Public 56.4
(c)
Figure 19. Protection of class members: (a) number of protected members per class; (b) number of public methodsper class; (c) average of class members with particular protection.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
608 C. COLLBERG, G. MYLES AND M. STEPP
0
500
1000
1500
2000 MIN: 0MAX: 187AVG: 0.4
MODE: 0MEDIAN: 0STD DEV: 2
SAMPLES: 90500FAILED: 13724
89
%,
68
45
9
0
94
%,
40
22
1
96
%,
17
92
2
97
%,
78
1
3
98
%,
42
9
4
98
%,
308
5
98
%,
182
6
99
%,
134
7
99
%,
102
8
99
%,
74
9
99
%,
295
10-19
99
%,
122
20-299
9%
,38
30-399
9%
,10
40-49
99%
,15
50-59
99%
,1
60-69
99%
,9
70-79
99%
,1
80-89
100%
,2
100-199
(a)
0
1000
2000
3000
4000
5000 MIN: 0MAX: 346AVG: 0.6
MODE: 0MEDIAN: 0STD DEV: 5
SAMPLES: 90500FAILED: 13724
89%
,68459
0
93%
,3546
1
95%
,1648
2
96%
,812
3
97%
,472
4
97%
,284
5
98%
,228
6
98%
,134
7
98%
,182
8
98%
,136
9
99%
,460
10-19
99%
,193
20-29
99%
,58
30-39
99%
,42
40-49
99%
,10
50-59
99%
,32
60-69
99%
,10
70-79
99%
,9
80-89
99%
,9
90-99
99%
,46
100-199
99%
,3
200-299
100%
,3
300-399
(b)
0
10000
20000
30000
MIN: 1MAX: 10AVG: 2.1
MODE: 1MEDIAN: 1STD DEV: 1
SAMPLES: 90500FAILED: 14383
48%
,37275
1
70
%,
165
99
2
84%
,1
02
78
3
91%
,5
529
4
96
%,
38
75
5
99
%,
19
93
6
99
%,
45
7
7
99
%,
66
8
99
%,
39
9
10
0%
,6
10-19
(c)
Figure 20. (a) Number of immediate subclasses per class; (b) total number of subclasses per class;(c) inheritance depth of a class.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 609
0
100
200
300
400
500 MIN: 0MAX: 69AVG: 0.4
MODE: 0MEDIAN: 0STD DEV: 2
SAMPLES: 12188FAILED: 1862
87
%,
90
55
0
95
%,
80
8
1
97
%,
20
8
2
98
%,
87
3
98%
,27
4
98%
,18
5
99%
,20
6
99%
,17
7
99%
,2
8
99%
,10
9
99
%,
44
10-19
99%
,4
20-29
99%
,11
30-39
99%
,13
50-59
100%
,2
60-69
(a)
0
50
100
150
200 MIN: 0MAX: 105AVG: 0.7
MODE: 0MEDIAN: 0STD DEV: 5
SAMPLES: 12188FAILED: 1862
87%
,9055
0
94%
,705
1
96%
,194
2
97%
,101
3
97%
,48
4
98%
,28
5
98%
,18
6
98%
,12
7
98%
,7
8
98%
,5
9
99%
,85
10-19
99%
,7
20-29
99%
,13
30-39
99%
,4
40-49
99%
,22
50-59
99%
,6
60-69
99%
,9
80-89
100%
,7
100-199
(b)
0
500
1000
1500
2000 MIN: 0MAX: 16AVG: 0.9
MODE: 0MEDIAN: 0STD DEV: 1
SAMPLES: 12188FAILED: 1862
61%
,6376
0
79%
,1839
1
84%
,4
61
2
95%
,117
0
3
97
%,
22
5
4
98
%,
12
7
5
99
%,
39
6
99
%,
13
7
99
%,
30
8
99
%,
2
9
10
0%
,4
4
10-19
(c)
Figure 21. (a) Number of immediate subinterfaces per interface; (b) total number of subinterfaces per interface;(c) interface extends depth of a class.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
610 C. COLLBERG, G. MYLES AND M. STEPP
Table IX. Most common standard interfaces to be extended byapplication interfaces.
Interface Count %
user interface 3359 57.7org.w3c.dom.html.HTMLElement 676 11.6java.util.EventListener 362 6.2java.io.Serializable 251 4.3org.w3c.dom.Node 225 3.9java.lang.Cloneable 118 2.0org.omg.CORBA.Object 96 1.6java.security.PrivateKey 43 0.7org.w3c.dom.CharacterData 42 0.7org.w3c.dom.events.EventTarget 39 0.7java.security.PublicKey 39 0.7org.omg.CORBA.portable.IDLEntity 38 0.7org.omg.CORBA.IDLType 36 0.6org.w3c.dom.Element 29 0.5org.w3c.dom.Document 28 0.5java.rmi.Remote 24 0.4org.w3c.dom.css.CSSRule 23 0.4java.security.Key 23 0.4org.w3c.dom.events.Event 22 0.4org.w3c.dom.DOMImplementation 22 0.4org.w3c.dom.Text 21 0.4org.xml.sax.XMLReader 20 0.3org.omg.CORBA.IRObject 18 0.3org.xml.sax.ContentHandler 16 0.3java.lang.Comparable 16 0.3javax.crypto.interfaces.DHKey 14 0.2java.util.Map 12 0.2java.sql.ResultSet 11 0.2java.util.List 10 0.2java.sql.Connection 9 0.2java.lang.Runnable 9 0.2org.w3c.dom.css.CSSValue 8 0.1org.xml.sax.Locator 7 0.1org.xml.sax.DTDHandler 7 0.1java.util.Collection 7 0.1java.sql.ResultSetMetaData 7 0.1org.xml.sax.ext.LexicalHandler 6 0.1org.omg.CORBA.DynAny 6 0.1org.xml.sax.DocumentHandler 5 0.1org.omg.CORBA.Policy 5 0.1javax.xml.transform.SourceLocator 5 0.1java.sql.PreparedStatement 5 0.1org.w3c.dom.events.UIEvent 4 0.1org.w3c.dom.events.DocumentEvent 4 0.1
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 611
Table X. Most common standard interfaces to be implemented by application classes.
Interface Count %
user interface 21 955 55.9java.io.Serializable 3534 9.0java.awt.event.ActionListener 2880 7.3java.lang.Runnable 1447 3.7java.lang.Cloneable 1009 2.6org.omg.CORBA.portable.Streamable 793 2.0java.awt.event.ItemListener 302 0.8java.lang.Comparable 266 0.7java.util.Iterator 262 0.7java.util.Enumeration 216 0.6java.util.Comparator 215 0.5javax.swing.event.ChangeListener 211 0.5java.awt.event.MouseListener 187 0.5org.xml.sax.EntityResolver 173 0.4java.security.PrivilegedAction 145 0.4org.xml.sax.ErrorHandler 130 0.3java.security.spec.AlgorithmParameterSpec 130 0.3java.beans.PropertyChangeListener 114 0.3java.awt.event.MouseMotionListener 113 0.3org.xml.sax.ext.LexicalHandler 109 0.3java.awt.event.KeyListener 109 0.3org.xml.sax.ContentHandler 100 0.3javax.swing.event.ListSelectionListener 99 0.3java.io.Externalizable 99 0.3java.security.spec.KeySpec 87 0.2org.xml.sax.DocumentHandler 83 0.2org.xml.sax.DTDHandler 82 0.2java.awt.event.AdjustmentListener 81 0.2javax.sql.DataSource 80 0.2java.awt.event.WindowListener 80 0.2java.awt.image.ImageObserver 76 0.2java.awt.image.renderable.RenderedImageFactory 74 0.2javax.naming.spi.ObjectFactory 72 0.2java.sql.Connection 71 0.2java.awt.event.FocusListener 71 0.2org.w3c.dom.NodeList 70 0.2org.xml.sax.AttributeList 59 0.2javax.naming.Referenceable 58 0.1java.io.FilenameFilter 55 0.1org.xml.sax.Locator 52 0.1java.util.Map$Entry 52 0.1java.lang.reflect.InvocationHandler 52 0.1javax.swing.event.DocumentListener 51 0.1java.awt.event.ComponentListener 50 0.1org.xml.sax.Attributes 48 0.1
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
612 C. COLLBERG, G. MYLES AND M. STEPP
0
1000
2000
3000
4000 MIN: 0MAX: 107AVG: 0.7
MODE: 0MEDIAN: 0STD DEV: 2
SAMPLES: 90500TOTAL: 59638
FAILED: 2767
76
%,
67
07
1
0
89
%,
11
15
2
1
92
%,
32
95
2
95
%,
25
22
3
96
%,
91
4
4
97
%,
55
5
5
97
%,
40
3
6
98
%,
37
0
7
98
%,
229
8
98
%,
181
9
99
%,
73
7
10-19
99
%,
164
20-29
99
%,
84
30-399
9%
,30
40-499
9%
,5
50-59
99
%,
560-69
99
%,
9
70-79
99
%,
3
80-89
99
%,
2
90-99
10
0%
,2
100-199
Figure 22. Number of method overrides per class.
common is that this is the signature of default constructors, especially that for java.lang.Object,
which must be called in the constructors of all classes that directly extend it.
6.3. CFGs
A method body can be converted into a CFG, where the nodes (the basic blocks) are straight-line
pieces of code. Control always enters the top of the basic block and exits at the bottom, either through
an explicit branch or by falling through to another block. There is an edge from basic block A to basic
block B if control can flow from A to B.
Building CFGs for Java bytecode is not straightforward. A major complication is how to deal with
exception handling. Several instructions in the JVM can throw exceptions implicitly. This includes the
division instructions (which may throw a divide-by-zero exception), and the getfield, putfield,
and invokevirtual instructions (which may throw null-reference exceptions). These changes in
control flow can be represented by adding exception edges to the CFG, which connect a basic block
ending in an exception-throwing instruction to the CFG’s sink node. If every such instruction (which are
very common in real code) is allowed to terminate a basic block, blocks become very small. Since some
analyses can safely ignore implicit exceptions, SandMark supports building the CFGs both with and
without implicit exception edges. The jsr and ret instructions used for Java’s finally clause
also cause problems. In general, a data flow analysis is necessary in order to correctly build CFGs
in the presence of complex jsr/ret combinations. SandMark currently does not support this and,
as a consequence, will sometimes introduce spurious edges out of blocks ending in ret instructions.
Since there are few such CFGs in our sample set this problem is unlikely to significantly affect our
data.
Figure 26 shows the number of basic blocks per method body (we make the distinction method
body to rule out methods with no instructions, such as native or abstract methods). We can see that
97% of all method bodies have fewer than 100 basic blocks. We can corroborate this information with
Figures 27(a) and 23(b), to see that the average basic block size is 2.0, and that 97% of all method
bodies have fewer than 200 instructions.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 613
0
10000
20000
30000
40000
50000
MIN: 1MAX: 63019AVG: 66.2
MODE: 5MEDIAN: 17STD DEV: 281
SAMPLES: 801117TOTAL: 53057748
1%
,1
3971
1
3%
,1
4661
2
4%
,7584
3
5%
,7300
4
18
%,
10
07
11
5
23
%,
45
47
9
6
25
%,
17
620
7
29
%,
32
37
1
8
33
%,
27
79
3
9
52
%,
15
10
21
10-19
61
%,
76
61
5
20-29
68
%,
51
78
5
30-39
73
%,
38
45
0
40-49
76
%,
28
33
9
50-59
79
%,
24
20
6
60-69
81
%,
18
70
0
70-79
83
%,
15
552
80-89
85
%,
12881
90-99
93
%,
63
39
2
100-199
96
%,
22
14
6
200-299
97
%,
10596
300-39998%
,5811
400-499
98%
,3782
500-599
99%
,2551
600-699
99%
,1527
700-799
99%
,1108
800-899
99%
,951
900-999
99%
,2699
1000-1999
99%
,757
2000-2999
99%
,361
3000-3999
99%
,113
4000-4999
99%
,44
5000-5999
99%
,46
6000-6999
99%
,87
7000-7999
99%
,32
8000-8999
99%
,1
9000-9999
99%
,57
10000-19999
99%
,14
20000-29999
99%
,2
40000-49999
100%
,1
60000-69999
(a)
0
20000
40000
60000
80000 MIN: 1MAX: 42725AVG: 33.2
MODE: 3MEDIAN: 9STD DEV: 156
SAMPLES: 801117TOTAL: 26597868
1%
,13971
1
5%
,29170
2
18%
,103439
3
27%
,76971
4
34%
,52415
5
40%
,45790
6
43%
,28719
7
46%
,21379
8
49%
,23837
9
66%
,140920
10-19
76%
,72423
20-29
81%
,46637
30-39
85%
,29709
40-49
88%
,21440
50-59
90%
,15808
60-69
91%
,11776
70-79
92%
,10059
80-89
93%
,7075
90-99
97%
,31320
100-199
98%
,9045
200-299
99%
,3720
300-399
99%
,1551
400-499
99%
,778
500-599
99%
,654
600-699
99%
,369
700-799
99%
,455
800-899
99%
,216
900-999
99%
,878
1000-1999
99%
,320
2000-2999
99%
,77
3000-3999
99%
,96
4000-4999
99%
,27
5000-5999
99%
,3
6000-6999
99%
,10
7000-7999
99%
,24
8000-8999
99%
,1
9000-9999
99%
,32
10000-19999
99%
,2
20000-29999
100%
,1
40000-49999
(b)
Figure 23. Method sizes: (a) size in bytes; (b) number of instructions per method body.
As can be seen from Figure 27(a), the average number of instructions in a basic block is very small,
only 2.0, and 98% of all blocks have fewer than six instructions. Figure 28(a) shows the average
out-degree of a basic block node is, predictably, low, only 1.2. Out-degrees higher than two can
only be achieved either when an instruction is inside an exception handler’s try block, or with the
JVM’s tableswitch and lookupswitch instructions. In the first case, edges are added from
each basic block inside a try block to the first basic block of the handler code. Therefore, if a
basic block is inside multiple nested try blocks its out-degree may be high. In the second case,
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
614 C. COLLBERG, G. MYLES AND M. STEPP
0
50000
100000
150000
200000
MIN: 0MAX: 157AVG: 2.9
MODE: 1MEDIAN: 2STD DEV: 3
SAMPLES: 801117TOTAL: 2297561
3%
,3
07
84
0
34
%,
24
39
62
1
61
%,
21
96
70
2
75
%,
10
99
30
3
83
%,
65
22
8
4
88
%,
42
11
3
5
92
%,
26
44
2
6
94
%,
17
401
7
95
%,
12
107
8
96
%,
8155
9
99
%,
22
483
10-19
99
%,
2066
20-29
99
%,
487
30-399
9%
,145
40-49
99
%,
69
50-59
99
%,
23
60-69
99
%,
10
70-79
99
%,
13
80-89
99
%,
18
90-99
10
0%
,11
100-199
(a)
0
50000
100000
150000
200000
MIN: 0MAX: 262AVG: 2.9
MODE: 2MEDIAN: 2STD DEV: 2
SAMPLES: 8011171%
,14138
0
23%
,175747
1
52%
,228656
2
71%
,155015
3
84%
,103273
4
92%
,60850
5
95%
,28811
6
97%
,17043
7
98%
,6487
8
99%
,3728
9
99%
,6629
10-19
99%
,506
20-29
99%
,94
30-39
99%
,52
40-49
99%
,27
50-59
99%
,21
60-69
99%
,19
70-79
99%
,6
80-89
99%
,3
90-99
99%
,10
100-199
100%
,2
200-299
(b)
Figure 24. Local variables: (a) number of max locals per method body; (b) numberof max stack weights per method body.
the tableswitch and lookupswitch instructions are Java’s implementation of switch-
statements, which may have many possible branch targets. Higher in-degrees can occur when a trycatch block has many instructions inside it that could potentially trigger the exception. Each of these
instructions will end its block, and have an edge going from it to the handler block. Thus, the in-degree
of the handler block will become large.
Figure 27(b) shows the number of instructions per basic block when implicit exception edges have
not been generated. As can be seen, this increases the average number of instructions per block to 7.7.
A node x in a directed graph G with a single exit node dominates node y in G if every path from
the entry node to y must pass through x. The dominator set of a node y is the set of all nodes which
dominate y. Dominator information is used in code optimizations such as loop identification and code
motion. Figure 29 shows the number of dominator blocks per basic block.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 615
Table XI. Most common slot types.
Register type Count %
int 614 910 16.2java.lang.String 365 915 9.62 types 144 145 3.8java.lang.Object 76 764 2.0byte[] 50 658 1.3long 49 903 1.3java.lang.Throwable 38 046 1.0double 25 541 0.63 types 23 426 0.6java.lang.StringBuffer 21 716 0.6java.lang.String[] 20 600 0.5java.util.Iterator 16 036 0.4float 15 595 0.4java.lang.Class 15 129 0.4java.util.Vector 14 795 0.4int[] 14 604 0.4java.lang.Exception 14 149 0.4java.io.File 13 334 0.4java.io.InputStream 11 686 0.3java.util.List 11 615 0.3java.lang.ClassNotFoundException 11 331 0.3java.util.Enumeration 10 732 0.3char[] 9534 0.3java.lang.Object[] 9417 0.2
6.4. Subroutines and exception handlers
Java subroutines are implemented by the instructionsjsr and ret. They are chiefly used to implement
the finally clause of an exception handler. This clause can be reached from multiple locations.
For example, a return instruction within the body of a try block will first jump to the finallyclause before returning from the method. Similarly, before returning from within an exception handler,
the finally block must be executed. To avoid code duplication (inlining thefinally block at every
location from which it could be called) the designers of the JVM added the jsr and ret instructions
to jump to and return from a block of code. This has caused much complication in the design of the
JVM verifier. See, for example, Stata et al. [7]. Figure 30 shows that 98% of methods have no more
than two exception handlers, and 98% of all methods have no subroutines. Figure 30(c) shows that
the average size of a subroutine is 7.5 instructions. The length of a subroutine was computed as the
number of instructions between a jsr’s target and its corresponding ret. Together, our data indicate
that jsr and ret could have been left out of the JVMs instruction set without much code increase
from finally clauses being implemented by code duplication.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
616 C. COLLBERG, G. MYLES AND M. STEPP
0
10000
20000
30000
40000
50000 MIN: 0MAX: 27AVG: 1.0
MODE: 0MEDIAN: 1STD DEV: 1
SAMPLES: 874115TOTAL: 861689
40
%,
35
32
35
0
77
%,
32
56
17
1
90
%,
11
55
11
2
96
%,
46
04
0
3
98
%,
17
95
2
4
99
%,
81
63
5
99
%,
38
84
6
99
%,
1676
7
99
%,
918
89
9%
,401
9
99
%,
714
10-19
10
0%
,4
20-29
Figure 25. Number of formal parameters per method.
6.5. Interference graphs
An interference graph models the variables and live range interferences of a method. The live ranges
of a local variable are the locations in a method between where the variable is first assigned and where
it is last used. Since method parameters and the ‘this’ reference are in local variable slots, they
are considered to have their first assignment before the first instruction of the method. The graph has
one vertex per local variable and an edge between two vertices when the corresponding variables’
live ranges overlap (or interfere). As an example consider the sample code in Figure 31(a) and the
corresponding interference graph in Figure 31(b). Since the code has 5 variables, the graph has 5 nodes.
The graph has an edge v1 → v2 since variables v1 and v2 are live at the same time. An interference
graph is often used during the code generation pass of a compiler to perform register allocation.
Two variables with intersecting live ranges cannot be assigned to the same register. Figure 32 shows that
95% of the methods have nine or fewer nodes in their interference graphs. This means that methods
typically will need very few local variable slots. This analysis agrees with the data in Figure 24(a),
which show that methods declare their maximum number of slots to be small.
After examining these data, it appears that the designers of the Java instruction set were wise to make
the typical instruction use only 1 byte to refer to a local variable index. The wide prefix allows such an
instruction to use a 2-byte index, but as we can see this will almost never be necessary. Thus, had the
designers simply made all instructions use 2-byte indices, there would have been much wasted space
in the bytecode.
7. INSTRUCTION-LEVEL STATISTICS
In this section we present information regarding the frequency of individual instructions and patterns of
instructions. We also show the most common subexpressions and constant values found in the bytecode.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 617
Table XII. Most common method signatures.
Method signature Count %
()void 120 997 13.8(user class)void 57 762 6.6()java.lang.String 53 047 6.1()user class 44 098 5.0(java.lang.String)void 43 810 5.0()boolean 39 772 4.5()int 35 064 4.0(int)void 18 959 2.2(boolean)void 11 461 1.3(user class)user class 10 332 1.2(user class,user class)void 9652 1.1(java.lang.String)user class 7781 0.9(java.lang.String)java.lang.String 7777 0.9(user class)boolean 6880 0.8(user class)java.lang.Object 6812 0.8()java.lang.Object 6461 0.7(java.lang.String)java.lang.Class 6258 0.7(java.lang.String,java.lang.String)void 5561 0.6(java.lang.Object)boolean 5373 0.6(int)int 4776 0.5(java.lang.Object)void 4697 0.5(java.awt.event.ActionEvent)void 4479 0.5(int)user class 4270 0.5(int)boolean 4116 0.5(java.lang.String[])void 4044 0.5(java.lang.String)boolean 3933 0.4(int,int)void 3726 0.4(int)java.lang.String 3473 0.4()byte[] 3380 0.4()user class[] 3322 0.4(user class,int)void 3251 0.4()java.util.List 2970 0.3(user class,java.lang.String)void 2821 0.3(byte[])void 2697 0.3()java.lang.String[] 2292 0.3(user class,user class)user class 2289 0.3(int,int)int 2023 0.2(java.awt.event.MouseEvent)void 2008 0.2(user class)int 1998 0.2(java.lang.String)int 1993 0.2(user class)java.lang.String 1951 0.2()org.omg.CORBA.TypeCode 1941 0.2(java.lang.String,user class)void 1900 0.2(java.lang.Object)user class 1873 0.2
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
618 C. COLLBERG, G. MYLES AND M. STEPP
0
20000
40000
60000 MIN: 1MAX: 10699AVG: 16.7
MODE: 2MEDIAN: 4STD DEV: 63
SAMPLES: 801117TOTAL: 13416883
3%
,3
16
26
1
26
%,
18
11
03
2
40
%,
10
96
04
3
46
%,
52
71
7
4
51
%,
37
22
6
5
55
%,
33
72
5
6
58
%,
25
96
3
7
61
%,
22
51
2
8
64
%,
21
22
5
9
80
%,
13
17
86
10-19
87
%,
54
15
7
20-29
91
%,
29
66
0
30-39
93
%,
17
950
40-49
95
%,
11939
50-59
96%
,7904
60-69
96%
,6152
70-79
97%
,4257
80-89
97%
,3338
90-999
9%
,12788
100-19999%
,2499
200-299
99%
,1067
300-399
99%
,548
400-499
99%
,381
500-599
99%
,195
600-699
99%
,192
700-799
99%
,75
800-899
99%
,93
900-999
99%
,322
1000-1999
99%
,56
2000-2999
99%
,9
3000-3999
99%
,41
4000-4999
99%
,1
5000-5999
99%
,3
6000-6999
99%
,2
7000-7999
100%
,1
10000-19999
Figure 26. Number of basic blocks (in CFGs with implicit exception edges) per method body.
7.1. Instruction counts
There are 200 usable JVM instruction opcodes. Table XIII shows the frequency of each of those
bytecode instructions. The most frequently occurring instruction is aload 0 which is responsible for
pushing the local variable 0, the this reference of non-static methods. Even though this is the most
frequently occurring instruction it only has a frequency of 10%. The invokevirtual instruction
which calls a non-static method is also common, as is getfield, dup, and invokespecial, the
last two being used to implement Java’s new operator. These five instructions account for 33.8% of
all instructions. Our data indicate that the majority of the remaining instructions each occur with a
frequency of at most 1%, and that the jsr w and goto w instructions (used for long branches) do not
occur at all.
7.2. Instruction patterns
A k-gram is a contiguous substring of length k which can be comprised of letters, words, or, in our case,
opcodes. The k-gram is based on static analysis of the executable program. To compute the unique set
of k-grams for a method, we slide a window of length k over the static instruction sequence as it is laid
out in the class file.
We computed data for k-grams where k = 2, 3, 4, which is shown in Tables XIV, XV, and XVI,
respectively. These tables show that as the value of k increases the percentages of the most frequently
occurring sequences decrease. For example, the most frequently occurring 2-gram, aload 0,getfield, has a frequency of only 4.7%. For 3- and 4-grams, the most frequently occurring
sequence is less than 1%. This indicates that these sequences become quite unique for each individual
application.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 619
0
50000
100000
150000
200000 MIN: 1MAX: 209AVG: 2.0
MODE: 1MEDIAN: 1STD DEV: 1
SAMPLES: 13416883TOTAL: 26597868
45
%,
60
99
22
2
1
73
%,
37
37
11
8
2
89
%,
22
33
95
4
3
97
%,
96
01
96
4
98
%,
19
13
08
5
99
%,
87
22
5
6
99
%,
40
29
9
7
99
%,
25
945
8
99
%,
11
739
9
99
%,
27
17
1
10-19
99
%,
2112
20-29
99
%,
324
30-39
99
%,
103
40-499
9%
,67
50-59
99
%,
21
60-69
99%
,6
70-79
99
%,
28
80-89
99
%,
15
90-99
99
%,
26
100-199
10
0%
,4
200-299
(a)
0
200000
400000
600000MIN: 1
MAX: 42725AVG: 7.7
MODE: 3MEDIAN: 3STD DEV: 67
SAMPLES: 3463565TOTAL: 26597868
10%
,364196
1
25%
,528134
2
44%
,632798
3
56%
,421403
4
66%
,343282
5
72%
,230377
6
77%
,170226
7
80%
,113225
8
83%
,96463
9
95%
,399132
10-19
97%
,87611
20-29
98%
,31321
30-39
99%
,14105
40-49
99%
,7659
50-59
99%
,5509
60-69
99%
,3391
70-79
99%
,2402
80-89
99%
,1735
90-99
99%
,5568
100-199
99%
,1783
200-299
99%
,981
300-399
99%
,363
400-499
99%
,251
500-599
99%
,215
600-699
99%
,125
700-799
99%
,180
800-899
99%
,72
900-999
99%
,518
1000-1999
99%
,286
2000-2999
99%
,60
3000-3999
99%
,94
4000-4999
99%
,31
5000-5999
99%
,2
6000-6999
99%
,10
7000-7999
99%
,23
8000-8999
99%
,1
9000-9999
99%
,31
10000-19999
99%
,1
20000-29999
100%
,1
40000-49999
(b)
Figure 27. Number of instructions per basic block in CFGs (a) with and (b) without implicit exception edges.
7.3. Expressions
Figure 33 shows the size (number of nodes in the tree) and height (length of longest path from root
to leaf) of expression trees in our samples. As reported already by Knuth [1], expressions tend to be
small. We found that 61% of all expressions only have one node.
Expressions are constructed by performing a stack simulation over each method. For each instruction
that will produce a result on the stack the simulator determines which instructions may have put its
operands on the stack. This information is used to build up a dependency graph with instructions and
operands as nodes, and an edge from node a to node b if b is used by a. If the program contains certain
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
620 C. COLLBERG, G. MYLES AND M. STEPP
0
1000
2000
3000
4000
5000 MIN: 0MAX: 536AVG: 1.2
MODE: 1MEDIAN: 1STD DEV: 1
SAMPLES: 13416883
0%
,15
0
83
%,
11
17
19
42
1
98
%,
19
89
69
5
2
99
%,
17
04
55
3
99
%,
43
05
0
4
99
%,
20
56
2
5
99
%,
95
23
6
99
%,
26
00
7
99
%,
21
83
8
99
%,
91
5
9
99
%,
30
26
10-19
99
%,
88
3
20-29
99
%,
434
30-39
99
%,
141
40-49
99
%,
358
50-599
9%
,196
60-699
9%
,122
70-79
99
%,
322
80-89
99
%,
241
90-99
99
%,
177
100-199
99%
,33
200-299
99%
,6
400-499
100%
,4
500-599
(a)
0
10000
20000
30000
40000
50000 MIN: 0MAX: 1877AVG: 1.2
MODE: 1MEDIAN: 1STD DEV: 2
SAMPLES: 13416883
0%
,997
0
94%
,12643958
1
98%
,564832
2
99%
,86581
3
99%
,35498
4
99%
,16301
5
99%
,9563
6
99%
,6310
7
99%
,5623
8
99%
,3918
9
99%
,20892
10-19
99%
,9102
20-29
99%
,4241
30-39
99%
,2705
40-49
99%
,1484
50-59
99%
,1238
60-69
99%
,755
70-79
99%
,545
80-89
99%
,408
90-99
99%
,1499
100-199
99%
,289
200-299
99%
,64
300-399
99%
,56
400-499
99%
,8
500-599
99%
,9
600-699
99%
,1
700-799
99%
,2
800-899
99%
,2
900-999
100%
,2
1000-1999
(b)
Figure 28. CFGs: (a) out-degree of basic block nodes; (b) in-degree of basic block nodes.
types of loops these graphs might have cycles, in which case they are discarded. The following code
segment is an example of such a loop:
0: ICONST_21: ICONST_32: IADD3: DUP4: IFEQ <1>
In this example, the IADD instruction may become its own child. If the branch at offset 4 is taken,
then the result from the first iteration of the IADD will be used as the first operand in the second
iteration of the IADD.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 621
0
200000
400000
600000
MIN: 0MAX: 10698AVG: 105.3
MODE: 1MEDIAN: 11STD DEV: 437
SAMPLES: 13416883
5%
,8
02
36
3
0
12
%,
85
28
36
1
18
%,
78
48
08
2
23
%,
66
58
00
3
27
%,
62
60
05
4
31
%,
55
18
35
5
35
%,
51
94
29
6
39
%,
48
35
76
7
42
%,
44
59
74
8
45
%,
40
94
36
9
65
%,
26
17
86
6
10-19
74
%,
12
15
10
0
20-29
79
%,
64
80
96
30-39
82
%,
39
20
92
40-49
84
%,
26
36
57
50-59
85
%,
18
7572
60-69
86
%,
142886
70-79
87
%,
112374
80-8988%
,92959
90-999
1%
,4
79
24
9100-199
93
%,
22
0598
200-299
94
%,
146686
300-399
95
%,
109706
400-499
95%
,83629
500-599
96%
,70436
600-699
96%
,57647
700-799
97%
,50364
800-899
97%
,44180
900-999
98
%,
18
8378
1000-1999
99%
,64405
2000-2999
99%
,50667
3000-3999
99%
,22138
4000-4999
99%
,5854
5000-5999
99%
,3900
6000-6999
99%
,1683
7000-7999
99%
,1000
8000-8999
99%
,1000
9000-9999
100%
,699
10000-19999
Figure 29. Number of dominator blocks per basic block (in CFGs with implicit exception edges).
In Table XVII we show the most common subexpressions found in our sample method bodies.
Table XVIII explains the abbreviations used. L, for example, represents a local variable, d a double
constant, (α+α) addition, etc. Since we are counting subexpressions, the same piece of an expression
will be counted more than once. For example, if x and y are local variables, then the expression x+y*2will generate subexpressions L, L, i, (L*i), and (L+(L*i)), each of which will increase the count
of its respective expression class.
To compute subexpressions we convert each expression tree into a string representation, classify
each subexpression into equivalence classes according to Table XVIII, and count each subexpression
individually.
What we see from Table XVII is that, unsurprisingly, local variable references, method calls,
integer constants, and field references make up the bulk of expressions. Somewhat more surprising
is that the expression ((Class)M()) is very frequent. Most likely this is the result of references
to ‘generic’ methods (particularly Java library functions such as java.util.Vector.get())
returning java.lang.Objects which then have to be cast into a more specific type.
7.4. Constant values
Tables XIX, XX, and XXI show the most common literal constants found in the bytecode. Constants
can occur in three different ways: as references to entries in the constant pool (instructionsldc, ldc w,
and ldc2 w), as arguments to bytecode instructions (bipush n, sipush n, and iinc n,c), or
embedded in special instructions (iconst n, etc.).
Figure 34 shows the distribution of integer constant values. It is interesting to note that 63% of
all literal integers are 0, powers of two, or powers of two plus/minus one. This has implications for,
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
622 C. COLLBERG, G. MYLES AND M. STEPP
0
2000
4000
6000
MIN: 0MAX: 105AVG: 0.1
MODE: 0MEDIAN: 0STD DEV: 0
SAMPLES: 801117TOTAL: 116627
91
%,
72
95
54
0
97
%,
49
97
8
1
98
%,
11
18
3
2
99
%,
59
57
3
99
%,
19
01
4
99
%,
914
5
99
%,
563
6
99
%,
356
7
99
%,
199
8
99
%,
141
9
99
%,
320
10-19
99%
,33
20-2999%
,10
30-3999%
,5
40-49
99%
,1
50-59
100%
,2
100-199
(a)
0
20
40
60
80
100 MIN: 0MAX: 29AVG: 0.0
MODE: 0MEDIAN: 0STD DEV: 0
SAMPLES: 801117TOTAL: 9501
FAILED: 2
98%
,792483
0
99%
,8049
1
99%
,446
2
99%
,89
3
99%
,30
4
99%
,4
5
99%
,4
6
99%
,2
7
99%
,2
8
99%
,1
9
99%
,3
10-19
100%
,2
20-29
(b)
0
200
400
600
800
1000 MIN: 2MAX: 125AVG: 7.5
MODE: 4MEDIAN: 4STD DEV: 5
SAMPLES: 9501TOTAL: 71634
0%
,1
2
2
0%
,16
3
27%
,2592
4
43%
,1540
5
51%
,724
6
59%
,75
0
7
65
%,
585
8
73%
,73
4
9
97%
,2326
10-19
99
%,
168
20-29
99%
,2
4
30-39
99
%,
12
40-49
99
%,
8
50-59
99
%,
6
60-69
10
0%
,4
100-199
(c)
Figure 30. (a) Number of exception handlers per method body; (b) number of subroutines per method body;(c) number of instructions per subroutine.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 623
v1 = a×a
v2 = a×b
v3 = 2× v2
v4 = v1 + v2
v5 = b× v3
(a)
v3
v2
v1
v4
v5
(b)
Figure 31. (a) Sample code and (b) corresponding interference graph.
0
50000
100000
MIN: 0MAX: 342AVG: 3.1
MODE: 1MEDIAN: 2STD DEV: 4
SAMPLES: 801117TOTAL: 2516760
3%
,30858
0
34%
,244128
1
61%
,219542
2
74%
,101653
3
82%
,65441
4
87%
,40603
5
90%
,24532
6
92%
,17245
7
94%
,12274
8
95%
,8768
9
99%
,28411
10-19
99%
,4902
20-29
99%
,1409
30-39
99%
,453
40-49
99%
,335
50-59
99%
,166
60-69
99%
,160
70-79
99%
,54
80-89
99%
,44
90-99
99%
,113
100-199
99%
,6
200-299
100%
,20
300-399
(a)
0
10000
20000
30000
40000
50000MIN: 0
MAX: 5694AVG: 6.2
MODE: 0MEDIAN: 1STD DEV: 42
SAMPLES: 801117TOTAL: 4982823
38%
,312328
0
64%
,202327
1
67%
,2
2946
2
76%
,79144
3
78
%,
14
37
5
4
80
%,
126
04
5
84%
,3
662
1
6
85
%,
729
7
7
86
%,
72
30
8
87
%,
720
4
9
93%
,50679
10-19
96%
,1
89
72
20-29
97%
,8
528
30-39
98%
,5
076
40-49
98%
,3
368
50-59
98
%,
26
59
60-69
98
%,
15
31
70-79
99
%,
122
7
80-89
99%
,9
16
90-99
99
%,
37
13
100-199
99
%,
11
05
200-299
99
%,
39
7
300-399
99
%,
22
5
400-499
99
%,
154
500-599
99
%,
130
600-699
99
%,
66
700-799
99
%,
65
800-899
99%
,2
6
900-999
99
%,
12
9
1000-1999
99
%,
33
2000-2999
99
%,
40
3000-3999
99
%,
1
4000-4999
10
0%
,1
5000-5999
(b)
Figure 32. (a) Number of interference graph nodes per method body; (b) number ofinterference graph edges per method body.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
624 C. COLLBERG, G. MYLES AND M. STEPP
Table XIII. Instruction frequencies.
Opcode Count % Opcode Count %
aload 0 2 672 134 10.0 aaload 108 016 0.4invokevirtual 2 360 924 8.9 anewarray 106 780 0.4dup 1 521 855 5.7 putstatic 105 900 0.4getfield 1 447 792 5.4 astore 1 99 436 0.4invokespecial 1 003 439 3.8 isub 93 852 0.4ldc 936 890 3.5 if icmplt 89 901 0.3aload 1 909 356 3.4 if icmpne 89 452 0.3aload 876 138 3.3 iconst 3 85 851 0.3bipush 865 346 3.3 arraylength 81 083 0.3new 665 727 2.5 iaload 70 687 0.3iconst 0 634 481 2.4 iconst m1 67 600 0.3iload 601 808 2.3 ldc2 w 66 717 0.3putfield 552 241 2.1 iconst 4 64 544 0.2goto 507 322 1.9 istore 3 58 529 0.2iconst 1 495 114 1.9 istore 2 57 750 0.2aload 2 494 004 1.9 iand 55 782 0.2invokestatic 457 014 1.7 instanceof 50 049 0.2getstatic 438 851 1.6 if acmpne 48 379 0.2return 433 081 1.6 if icmpeq 47 866 0.2astore 395 436 1.5 newarray 44 390 0.2sipush 383 115 1.4 iconst 5 39 857 0.1areturn 351 978 1.3 baload 39 482 0.1aastore 332 112 1.2 castore 38 065 0.1aload 3 314 398 1.2 istore 1 37 068 0.1invokeinterface 300 563 1.1 if icmpge 35 921 0.1ifeq 286 898 1.1 sastore 30 690 0.1iastore 285 979 1.1 ixor 29 811 0.1ldc w 281 190 1.1 if icmple 28 212 0.1pop 270 894 1.0 imul 27 174 0.1istore 264 341 1.0 iload 0 26 837 0.1ireturn 259 627 1.0 dload 26 640 0.1iload 2 200 600 0.8 lastore 23 738 0.1iload 1 197 241 0.7 ifle 23 025 0.1checkcast 193 243 0.7 monitorexit 22 023 0.1aconst null 178 499 0.7 jsr 20 074 0.1iload 3 172 820 0.6 lconst 0 19 617 0.1bastore 171 902 0.6 nop 18 136 0.1iadd 171 637 0.6 lload 17 515 0.1ifne 167 878 0.6 ifge 17 494 0.1iconst 2 163 348 0.6 i2b 17 432 0.1athrow 151 515 0.6 ishl 17 233 0.1astore 2 144 741 0.5 fload 16 966 0.1iinc 132 890 0.5 ior 15 500 0.1astore 3 121 477 0.5 ishr 15 363 0.1ifnull 121 318 0.5 lcmp 15 033 0.1ifnonnull 110 290 0.4 dstore 14 261 0.1
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 625
Table XIII. Continued.
Opcode Count % Opcode Count % Opcode Count %
dup x1 14 202 0.1 fconst 0 4316 0.0 fneg 496 0.0dmul 14 077 0.1 f2d 4086 0.0 pop2 378 0.0idiv 13 833 0.1 laload 3781 0.0 fstore 1 374 0.0if icmpgt 12 477 0.0 dload 3 3538 0.0 d2l 273 0.0iflt 11 944 0.0 lconst 1 3515 0.0 dup2 x1 263 0.0caload 11 871 0.0 dload 2 3286 0.0 lneg 208 0.0if acmpeq 11 595 0.0 fload 1 3261 0.0 l2f 187 0.0dastore 11 325 0.0 lsub 3212 0.0 dstore 0 168 0.0astore 0 10 400 0.0 fload 2 3130 0.0 dup2 x2 164 0.0tableswitch 10 197 0.0 dcmpg 3040 0.0 lstore 0 159 0.0land 10 047 0.0 dup x2 2984 0.0 drem 55 0.0monitorenter 9961 0.0 fdiv 2930 0.0 f2l 42 0.0daload 9782 0.0 saload 2867 0.0 fstore 0 20 0.0fastore 9708 0.0 ineg 2577 0.0 frem 12 0.0ret 9670 0.0 multianewarray 2500 0.0 jsr w 0 0.0dconst 0 9295 0.0 d2f 2402 0.0 goto w 0 0.0i2l 8832 0.0 freturn 2360 0.0fstore 8765 0.0 fload 3 2357 0.0lookupswitch 8738 0.0 lshl 2195 0.0lload 1 8723 0.0 fconst 1 2122 0.0faload 8634 0.0 dload 0 2093 0.0iushr 8468 0.0 lload 0 1982 0.0dadd 8407 0.0 fcmpl 1917 0.0i2d 8403 0.0 istore 0 1787 0.0ifgt 7976 0.0 lstore 3 1727 0.0fmul 7674 0.0 lmul 1664 0.0lstore 7512 0.0 lor 1520 0.0dup2 7332 0.0 lstore 2 1475 0.0lload 2 7125 0.0 f2i 1391 0.0ddiv 6644 0.0 lshr 1386 0.0lload 3 6514 0.0 lstore 1 1352 0.0dsub 5882 0.0 fcmpg 1243 0.0i2c 5839 0.0 dneg 1230 0.0l2i 5680 0.0 dstore 3 1215 0.0dload 1 5548 0.0 ldiv 1194 0.0fadd 5515 0.0 lxor 1146 0.0irem 5508 0.0 dstore 2 1085 0.0dreturn 5159 0.0 lushr 993 0.0dconst 1 5146 0.0 dstore 1 908 0.0dcmpl 5065 0.0 l2d 878 0.0i2f 5003 0.0 swap 849 0.0lreturn 4935 0.0 fconst 2 839 0.0ladd 4621 0.0 fstore 3 695 0.0fsub 4463 0.0 fstore 2 650 0.0i2s 4338 0.0 lrem 539 0.0d2i 4331 0.0 fload 0 510 0.0
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
626 C. COLLBERG, G. MYLES AND M. STEPP
Table XIV. Most common 2-grams.
Opcode Count %
aload 0,getfield 1 219 837 4.7new,dup 664 718 2.6ldc,invokevirtual 353 412 1.4invokevirtual,invokevirtual 332 487 1.3dup,bipush 330 887 1.3putfield,aload 0 311 038 1.2iastore,dup 250 744 1.0invokevirtual,aload 0 235 924 0.9dup,sipush 226 520 0.9aload 1,invokevirtual 223 958 0.9aload,invokevirtual 222 692 0.9getfield,invokevirtual 219 107 0.8aload 0,aload 1 214 369 0.8aastore,dup 208 247 0.8dup,invokespecial 202 840 0.8aload 0,invokevirtual 200 872 0.8invokevirtual,pop 193 105 0.7aload 0,invokespecial 159 742 0.6astore,aload 146 309 0.6bastore,dup 141 779 0.5ldc,aastore 133 994 0.5getfield,aload 0 129 300 0.5invokespecial,aload 0 122 168 0.5ldc,invokespecial 120 935 0.5dup,ldc 116 043 0.4invokespecial,athrow 115 994 0.4aload 2,invokevirtual 115 394 0.4goto,aload 0 115 340 0.4putfield,return 113 044 0.4dup,iconst 0 109 473 0.4invokevirtual,ldc 109 060 0.4invokevirtual,return 106 765 0.4invokevirtual,ifeq 103 093 0.4bipush,bastore 102 969 0.4invokevirtual,astore 100 355 0.4ifeq,aload 0 99 667 0.4bipush,bipush 98 715 0.4ldc w,iastore 98 376 0.4iconst 0,ireturn 98 199 0.4invokevirtual,aload 93 325 0.4aload 0,new 90 040 0.3anewarray,dup 81 992 0.3dup,aload 0 80 579 0.3aload 0,aload 0 80 329 0.3aload,aload 78 718 0.3
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 627
Table XV. Most common 3-grams.
Opcode Count %
new,dup,invokespecial 202 836 0.8aload 0,getfield,invokevirtual 194 765 0.8iastore,dup,bipush 132 759 0.5invokevirtual,aload 0,getfield 125 019 0.5new,dup,ldc 115 036 0.5aload 0,getfield,aload 0 111 950 0.4getfield,aload 0,getfield 111 002 0.4iastore,dup,sipush 102 667 0.4bipush,bastore,dup 100 197 0.4invokevirtual,ldc,invokevirtual 98 964 0.4ldc w,iastore,dup 97 826 0.4dup,ldc,invokespecial 91 303 0.4aload 0,new,dup 90 029 0.4dup,bipush,bipush 83 402 0.3ldc,aastore,dup 82 970 0.3anewarray,dup,iconst 0 81 984 0.3aastore,dup,bipush 80 740 0.3new,dup,aload 0 80 524 0.3invokevirtual,invokevirtual,invokevirtual 80 161 0.3invokespecial,ldc,invokevirtual 69 626 0.3aload 0,getfield,aload 1 68 922 0.3bastore,dup,sipush 67 634 0.3aload 0,invokespecial,aload 0 66 723 0.3dup,sipush,bipush 64 736 0.3aload 0,aload 1,putfield 60 661 0.2bastore,dup,bipush 60 580 0.2goto,aload 0,getfield 60 205 0.2aload 0,aload 0,getfield 58 350 0.2dup,invokespecial,ldc 57 764 0.2dup,sipush,ldc w 57 139 0.2dup,bipush,ldc w 56 309 0.2aastore,dup,iconst 1 56 004 0.2putfield,aload 0,getfield 55 324 0.2aastore,aastore,dup 55 149 0.2aload 0,getfield,areturn 55 073 0.2ldc,invokevirtual,invokevirtual 53 465 0.2new,dup,aload 1 52 185 0.2ldc,invokevirtual,aload 0 51 342 0.2invokespecial,putfield,aload 0 50 827 0.2sipush,bipush,bastore 50 642 0.2dup,bipush,ldc 50 439 0.2dup,iconst 0,ldc 49 992 0.2aload 0,getfield,getfield 49 974 0.2iconst 0,ldc,aastore 49 056 0.2sipush,ldc w,iastore 48 252 0.2
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
628 C. COLLBERG, G. MYLES AND M. STEPP
Table XVI. Most common 4-grams.
Opcode Count %
aload 0,getfield,aload 0,getfield 95 199 0.4new,dup,ldc,invokespecial 91 302 0.4new,dup,invokespecial,ldc 57 764 0.2dup,invokespecial,ldc,invokevirtual 57 239 0.2bipush,bastore,dup,sipush 50 663 0.2dup,sipush,bipush,bastore 50 642 0.2bastore,dup,sipush,bipush 50 642 0.2sipush,bipush,bastore,dup 50 392 0.2anewarray,dup,iconst 0,ldc 48 862 0.2dup,iconst 0,ldc,aastore 48 697 0.2iastore,dup,sipush,ldc w 48 252 0.2dup,sipush,ldc w,iastore 48 252 0.2ldc w,iastore,dup,sipush 48 198 0.2sipush,ldc w,iastore,dup 47 875 0.2iastore,dup,bipush,ldc w 47 528 0.2dup,bipush,ldc w,iastore 47 528 0.2ldc w,iastore,dup,bipush 47 520 0.2bipush,ldc w,iastore,dup 47 384 0.2dup,bipush,bipush,bastore 44 682 0.2bastore,dup,bipush,bipush 44 680 0.2bipush,bastore,dup,bipush 44 674 0.2bipush,bipush,bastore,dup 44 209 0.2aload 0,new,dup,invokespecial 43 141 0.2new,dup,new,dup 42 678 0.2invokevirtual,ldc,invokevirtual,invokevirtual 42 594 0.2ldc,aastore,aastore,dup 41 430 0.2aastore,aastore,dup,bipush 40 443 0.2dup,ldc,invokespecial,athrow 40 441 0.2new,dup,aload 0,getfield 36 325 0.1new,dup,invokespecial,putfield 34 800 0.1ldc,aastore,dup,iconst 1 34 705 0.1iconst 0,ldc,aastore,dup 34 705 0.1aastore,dup,iconst 1,ldc 34 585 0.1putfield,aload 0,new,dup 34 499 0.1dup,iconst 1,ldc,aastore 34 191 0.1invokevirtual,aload 0,getfield,invokevirtual 34 185 0.1aload 0,iconst 0,putfield,aload 0 33 108 0.1aload 0,aload 1,putfield,return 32 147 0.1aload 0,aconst null,putfield,aload 0 31 719 0.1aload 0,getfield,aload 1,invokevirtual 31 472 0.1iconst 2,anewarray,dup,iconst 0 30 710 0.1ldc,invokevirtual,aload 0,getfield 30 470 0.1putfield,aload 0,iconst 0,putfield 27 739 0.1iastore,dup,bipush,ldc 26 735 0.1dup,bipush,ldc,iastore 26 735 0.1
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 629
0
200000
400000
600000 MIN: 0MAX: 105AVG: 0.8
MODE: 0MEDIAN: 0STD DEV: 1
SAMPLES: 19285712
61
%,
11
87
56
83
0
83
%,
42
87
70
1
1
91
%,
15
74
03
8
2
94
%,
54
38
08
3
96
%,
26
59
24
4
96
%,
15
92
72
5
97
%,
90
657
6
97
%,
80
228
7
98
%,
10
54
87
8
99
%,
13
21
41
9
99
%,
16
42
25
10-19
99
%,
3299
20-299
9%
,1617
30-399
9%
,1087
40-49
99
%,
342
50-59
99
%,
131
60-69
99%
,36
70-79
99%
,20
80-89
99%
,10
90-99
100%
,6
100-199
(a)
0
100000
200000
300000
400000
500000 MIN: 1MAX: 1545AVG: 3.6
MODE: 1MEDIAN: 0STD DEV: 10
SAMPLES: 19285712
61%
,11875683
1
76%
,2891306
2
84%
,1525188
3
89%
,892267
4
91%
,482602
5
93%
,297245
6
94%
,234464
7
95%
,147304
8
95%
,122818
9
97%
,353167
10-19
97%
,44736
20-29
97%
,19117
30-39
97%
,11737
40-49
98%
,21951
50-59
98%
,134921
60-69
99%
,187613
70-79
99%
,39684
80-89
99%
,1009
90-99
99%
,2014
100-199
99%
,623
200-299
99%
,159
300-399
99%
,26
400-499
99%
,12
500-599
99%
,8
600-699
99%
,23
700-799
99%
,5
800-899
99%
,5
900-999
100%
,25
1000-1999
(b)
Figure 33. (a) Height and (b) size of expression trees.
for example, software watermarking algorithms such as that by Cousot and Cousot [11], which hides a
watermark in unusual constants. Figure 34 tells us that in real programs most constants are small (93%
are less than 1000) or very close to powers of two, and hence hiding a mark in unusual constants is
likely to be unstealthy.
7.5. Method calls
Table XXII reports the most frequently called Java library methods. To collect these data, we looked at
every INVOKE instruction to see what method it named. No attempt at method resolution was made.
Figure 35 measures the size of receiver sets of method calls. That is, for a virtual method invocation
o.M() we count the number of methods M() that might potentially be called. This depends on the
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
630 C. COLLBERG, G. MYLES AND M. STEPP
Table XVII. Most common subexpressions.
Expression Count % Expression Count %
L 6 574 522 34.9 L.F.F.F 5087 0.0M() 4 119 506 21.9 (L.F+L) 4607 0.0i 2 967 193 15.7 (L.F&i) 4572 0.0L.F 1 366 327 7.3 (M()+i) 4511 0.0" 1 039 672 5.5 (L.F+L.F) 4326 0.0N 665 727 3.5 (L&l) 4149 0.0S 438 851 2.3 ((L+M())+L.F[]) 4128 0.0A 153 670 0.8 (M()+L) 3997 0.0L[] 121 966 0.6 (L.length-i) 3994 0.0((Class)M()) 115 363 0.6 ((L>>i)&i) 3917 0.0null 95 366 0.5 (i*L) 3792 0.0L.F[] 83 259 0.4 ((L&l)<>l) 3734 0.0l 67 088 0.4 (L/i) 3689 0.0L.F.F 54 078 0.3 (L.F>>i) 3654 0.0L.length 51 938 0.3 (((L+M())+L.F[])+i) 3392 0.0((Class)L) 49 937 0.3 L.F[].F 3367 0.0(L+i) 41 182 0.2 S[][] 3351 0.0(L instanceof Class) 39 081 0.2 (L.F instanceof Class) 3342 0.0d 37 202 0.2 ((L>>>i)&i) 3162 0.0S[] 29 776 0.2 (L+L.F) 3051 0.0(L+L) 24 534 0.1 ((Class)L[]) 2995 0.0(L.F+i) 24 296 0.1 (L.F-L) 2977 0.0L.F.length 23 898 0.1 (S[]&i) 2898 0.0(L-i) 19 852 0.1 (LˆL) 2818 0.0f 17 746 0.1 (L.F[]&i) 2773 0.0((Class)S) 14 614 0.1 ((long)L) 2748 0.0(L-L) 13 495 0.1 ((byte)L) 2699 0.0(L&i) 13 436 0.1 (L.F*L.F) 2690 0.0(L.F-i) 10 759 0.1 S.length 2626 0.0(L>>i) 9050 0.0 (l&L) 2618 0.0(L[]&i) 8285 0.0 ((l&L)<>l) 2612 0.0((Class)L.F) 7715 0.0 ((double)L.F) 2509 0.0M().F 7518 0.0 ((L[]&i)<<i) 2473 0.0(L+M()) 7451 0.0 L.F.F[] 2437 0.0(M()-i) 7181 0.0 (L-L.F) 2423 0.0(L>>>i) 7181 0.0 -(L) 2423 0.0(M() instanceof Class) 6305 0.0 (L<>l) 2322 0.0(L<<i) 6013 0.0 (L&L) 2285 0.0(L*i) 5818 0.0 ((L.F>>i)&i) 2275 0.0L[][] 5757 0.0 S.F 2214 0.0(L.F-L.F) 5509 0.0 ((char)L) 2201 0.0((double)L) 5300 0.0 (S+i) 2132 0.0(L*L) 5234 0.0 ((float)L) 2129 0.0L.F[][] 5230 0.0 ((int)M()) 2081 0.0
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 631
Table XVIII. Abbreviations used in Table XVII.
null ≡ ACONST NULL ((τ)α) ≡ typecast, τ is a primitive-(α) ≡ negation type or Class(α+α) ≡ addition A ≡ create new array(α-α) ≡ subtraction S ≡ static field(α*α) ≡ mult F ≡ non-static field(α/α) ≡ div M() ≡ method call(α%α) ≡ mod/rem (α instanceof κ) ≡ instanceof(α&α) ≡ and " ≡ string constant(α|α) ≡ or f ≡ float constant(αˆα) ≡ xor d ≡ double constant(α<<α) ≡ left shift l ≡ long constant(α>>α) ≡ signed right shift N ≡ NEW(α>>>α) ≡ IUSHR or LUSHR (α<α) ≡ DCMPL or FCMPLα[] ≡ array element (α>α) ≡ DCMPG or FCMPGα.length ≡ ARRAYLENGTH (α<>α) ≡ LCMPi ≡ int constant L ≡ load local variable
static type of o, and the number of methods in type(o)’s subclasses that override o’s M(). A static
class hierarchy analysis [12] is used to compute the receiver set.
The size of the receiver set has implications for, among other things, code optimization. A virtual
method call that has only one member in its receiver set can be replaced with a direct call. Furthermore,
if, for example, o.M()’s receiver set is {Class1.M(), Class2.M()}, then to expand o.M()inline, the code if o instanceof Class1 then Class1.M() else Class2.M() has
to be generated. The larger the receiver set, the more type tests will have to be inserted.
To compute receiver sets for an INVOKEVIRTUAL instruction, we first resolve the method
reference. We then gather all of the subclasses of the resolved method’s parent class (including itself)
and for each one look to see whether it contains a non-abstract method with the same name and
signature as the resolved method. If so, we check to see whether the resolved method is accessible
from the given subclass. If this is true, then the INVOKEVIRTUAL instruction could possibly execute
the subclass’ method, and it is added to the receiver set for the INVOKEVIRTUAL instruction.
For an INVOKEINTERFACE instruction, we perform the same test but we look instead at
all implementors of the resolved method’s parent interface. This set will contain all classes
that directly implement the interface, as well as subclasses of those classes, and classes that
implement any subinterfaces of the interface (i.e. anything that could be cast to the interface type).
The INVOKESPECIAL and INVOKESTATIC instructions do not use dynamic method invocation;
the method they will call can always be determined statically. Thus, they all have receiver sets of
size 1.
Since we count only method bodies in the receiver sets, it is possible to have receiver sets of size 0.
This can occur if an abstract class has no subclasses to implement its abstract methods, yet code is
written to call its abstract methods with future subclasses in mind. Similarly, an INVOKEINTERFACEcall may have no receivers if no classes implement the given interface.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
632 C. COLLBERG, G. MYLES AND M. STEPP
Table XIX. Common integer constants.
Most common int constants Most common long constants
Value Count % Value Count %
0 634 484 20.5 0 19 617 29.21 611 382 19.7 1 3515 5.22 165 656 5.3 −1 2320 3.53 86 253 2.8 1000 1114 1.7−1 78 187 2.5 287948901175001088 740 1.14 65 209 2.1 2 722 1.18 45 619 1.5 255 669 1.05 40 047 1.3 3 387 0.610 31 762 1.0 100 347 0.5255 31 249 1.0 5 344 0.56 29 798 1.0 8388608 343 0.57 28 356 0.9 4294967295 331 0.59 25 497 0.8 10 323 0.516 24 931 0.8 7 304 0.532 19 401 0.6 71776119061217280 270 0.412 17 889 0.6 4 269 0.413 17 228 0.6 60000 217 0.311 15 763 0.5 9 199 0.315 15 008 0.5 541165879422 196 0.314 13 733 0.4 9223372036854775807 190 0.324 13 607 0.4 64 182 0.320 10 844 0.3 9007199254740992 170 0.348 9759 0.3 8 167 0.217 9513 0.3 -9223372036854775808 160 0.263 8963 0.3 500 159 0.246 8540 0.3 60 156 0.247 8214 0.3 36028797018963968 148 0.218 8115 0.3 2147483647 144 0.234 8029 0.3 67108864 139 0.231 7681 0.2 3600000 134 0.264 7602 0.2 17179869184 132 0.240 7187 0.2 144115188075855872 132 0.221 7044 0.2 140737488355328 130 0.2100 6984 0.2 1024 129 0.245 6970 0.2 10000 125 0.223 6860 0.2 137438953504 123 0.219 6855 0.2 43980465111040 122 0.241 6631 0.2 1099511627776 122 0.230 6621 0.2 562949953421312 118 0.258 6551 0.2 17592186044416 118 0.2128 6547 0.2 33554432 117 0.222 6510 0.2 268435456 117 0.225 6441 0.2 16384 109 0.2
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 633
Table XX. Common real constants.
Most common float constants Most common double constants
Value Count % Value Count %
0.0 4316 24.3 0.0 9295 25.01.0 2122 12.0 1.0 5146 13.82.0 839 4.7 2.0 1920 5.20.5 573 3.2 0.5 1296 3.5255.0 319 1.8 100.0 710 1.9−1.0 311 1.8 10.0 689 1.94.0 177 1.0 5.0 585 1.6100.0 164 0.9 -Infinity 467 1.310.0 151 0.9 −1.0 463 1.20.75 146 0.8 1000.0 454 1.264.0 124 0.7 3.0 409 1.13.0 124 0.7 3.141592653589793 (π) 364 1.01000.0 114 0.6 NaN 333 0.920.0 109 0.6 4.0 311 0.890.0 79 0.4 0.25 285 0.83.1415927 (π) 73 0.4 Infinity 206 0.6NaN 68 0.4 8.0 198 0.557.29578 (180/π ) 68 0.4 1.797693· · · 7E308 (MAX) 182 0.550.0 68 0.4 180.0 162 0.46.2831855 (2π) 64 0.4 1.5 156 0.46.0 64 0.4 0.1 152 0.43.4028235E38 (MAX) 62 0.3 360.0 145 0.41.0E-4 61 0.3 20.0 120 0.3180.0 60 0.3 6.283185307179586 (2π) 118 0.35.0 58 0.3 −2.0 112 0.30.85 58 0.3 0.01 107 0.30.1 58 0.3 255.0 105 0.30.01 52 0.3 0.2 104 0.3−10.0 51 0.3 6.0 103 0.30.0010 47 0.3 0.6 92 0.28.0 45 0.3 7.0 91 0.21.5 45 0.3 9.0 88 0.20.8 45 0.3 1.25 83 0.20.3 45 0.3 16.0 77 0.20.25 45 0.3 60.0 75 0.2-Infinity 44 0.2 31.0 72 0.2Infinity 44 0.2 26.0 71 0.2100000.0 42 0.2 0.75 71 0.21.5707964 (π/2) 40 0.2 12.0 69 0.2
0.70710677 (1/√
2) 40 0.2 0.05 67 0.2−100.0 38 0.2 0.3 66 0.2200.0 37 0.2 15.0 65 0.265536.0 36 0.2 645.0 64 0.2
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
634 C. COLLBERG, G. MYLES AND M. STEPP
Table XXI. Most common string constants.
Value Count %
empty string 36 456 3.5" " 9003 0.9newline 5281 0.5")" 4860 0.5"." 4718 0.5"S" 4540 0.4"’" 4201 0.4"Q" 4139 0.4":" 4083 0.4"," 3885 0.4"R" 3796 0.4"P" 3663 0.4", " 3562 0.3"/" 3481 0.3""" 3113 0.3"0" 3024 0.3"name" 2725 0.3"(" 2561 0.2"false" 2536 0.2"true" 2461 0.2"]" 2115 0.2": " 2093 0.2"Center" 1931 0.2"-" 1699 0.2"BC" 1658 0.2"PvQ" 1649 0.2">" 1634 0.2"id" 1450 0.1"P->Q" 1373 0.1"P&Q" 1370 0.1"java.lang.String" 1314 0.1"line.separator" 1313 0.1";" 1307 0.1"W" 1237 0.1"=" 1210 0.1"shortDescription" 1207 0.1tab 1159 0.1"}" 1151 0.1"RvS" 1110 0.1"null" 1107 0.1"*" 1084 0.1"A" 1074 0.1"[" 1046 0.1"class" 1012 0.1"˜P" 996 0.1
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 635
0
20000
40000
60000
MIN: -9.22337e+18MAX: 9.22337e+18AVG: 467327912110143.2
MODE: 0MEDIAN: 2STD DEV: 4.20786e+17
SAMPLES: 3167173
6%
,2
02
48
0
0
61
%,
17
57
91
4
[0- 10)
80
%,
60
34
02
[10- 100)
93
%,
40
09
73
[100- 1000)
95
%,
61
84
4
[1000- 10000)
96
%,
31
37
2
[10000- 100000)
96%
,6412
[100000- 1e6)
96%
,6646
[1e6- 1e7)
97
%,
10825
[1e7- 1e8)
98
%,
30
47
6
[1e8- 1e9)
99
%,
37
64
1
[1e9- 1e10)
99%
,468
[1e10- 1e11)
99%
,907
[1e11- 1e12)
99%
,570
[1e12- 1e13)
99%
,682
[1e13- 1e14)
99%
,734
[1e14- 1e15)
99%
,975
[1e15- 1e16)
99%
,1078
[1e16- 1e17)
99%
,2720
[1e17- 1e18)
10
0%
,9054
[1e18- 1e19)
(a)
Value Count %
0 654101 20.71 695404 22.02 169760 5.42n,n > 1 205877 6.52n −1,n > 1 198280 6.32n
+1,n > 1 93544 3.0other 1150207 36.3
(b)
Figure 34. Constant values: (a) distribution of integers (int and long); (b) integers(int and long) close to powers of two.
Figure 35(a) shows that 88% of all virtual method calls have a receiver set with size at most 2,
with the average size being 4.5. It is interesting to note the large number of methods with a receiver
size between 20 and 29. As can be expected, the average receiver set size is significantly larger for an
interface method call. Figure 35(b) shows an average set size of 16.5.
7.6. Switches
Figure 36(a) measures the number of case labels for each tableswitch and lookupswitchinstruction. We had to treat the tableswitch instruction specially, since it uses a contiguous range
of label values. Not all of the labels in the tableswitch instruction necessarily appeared in the
source code for the program. As a result, some of the branch targets for the cases will be the same as
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
636 C. COLLBERG, G. MYLES AND M. STEPP
Table XXII. Most common calls to methods in the Java library.
Method Count %
java.lang.StringBuffer.append(String)StringBuffer 340 044 15.8StringBuffer.toString()String 143 985 6.7StringBuffer.<init>()void 93 837 4.3Object.<init>()void 52 597 2.4StringBuffer.<init>(String)void 48 408 2.2String.equals(Object)boolean 46 645 2.2java.util.Hashtable.put(Object,Object)Object 42 629 2.0java.io.PrintStream.println(String)void 42 594 2.0StringBuffer.append(int)StringBuffer 31 702 1.5StringBuffer.append(Object)StringBuffer 27 284 1.3String.length()int 25 505 1.2String.valueOf(Object)String 20 146 0.9java.lang.IllegalArgumentException.<init>(String)void 15 737 0.7StringBuffer.append(char)StringBuffer 15 116 0.7String.substring(int,int)String 14 441 0.7java.util.Vector.size()int 13 817 0.6java.util.Vector.addElement(Object)void 12 705 0.6java.util.Vector.elementAt(int)Object 12 087 0.6java.lang.System.arraycopy(Object,int,Object,int,int)void 11 969 0.6java.util.Iterator.hasNext()boolean 11 891 0.6String.charAt(int)char 11 831 0.5java.util.Iterator.next()Object 11 800 0.5java.lang.Integer.<init>(int)void 11 658 0.5java.lang.Throwable.getMessage()String 11 216 0.5java.util.Vector.<init>()void 10 434 0.5java.util.List.add(Object)boolean 10 166 0.5Object.getClass()java.lang.Class 9641 0.4java.util.Hashtable.get(Object)Object 9584 0.4String.equalsIgnoreCase(String)boolean 9391 0.4java.util.List.size()int 9226 0.4java.util.Map.put(Object,Object)Object 8830 0.4java.util.List.get(int)Object 8797 0.4java.lang.Class.forName(String)java.lang.Class 8641 0.4java.util.Map.get(Object)Object 8313 0.4java.awt.Container.add(java.awt.Component)java.awt.Component 8270 0.4String.substring(int)String 7862 0.4java.io.PrintWriter.println(String)void 7767 0.4java.util.Enumeration.nextElement()Object 7539 0.3java.lang.Class.getName()String 7288 0.3String.startsWith(String)boolean 7186 0.3String.indexOf(String)int 6960 0.3java.util.ArrayList.<init>()void 6705 0.3java.lang.Integer.parseInt(String)int 6667 0.3java.util.Enumeration.hasMoreElements()boolean 6526 0.3java.lang.NullPointerException.<init>(String)void 6403 0.3
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 637
0
10000
20000
30000
40000
50000MIN: 0
MAX: 917AVG: 4.5
MODE: 1MEDIAN: 1STD DEV: 39
SAMPLES: 2360924FAILED: 413186
0%
,684
0
82
%,
16
15
66
6
1
88
%,
10
54
71
2
90
%,
49
70
5
3
93
%,
51
19
7
4
94
%,
15
175
5
94
%,
9597
6
95
%,
9783
7
96
%,
15
214
8
96
%,
5769
9
97
%,
15
74
4
10-19
98
%,
30
19
4
20-29
99
%,
7370
30-39
99%
,4587
40-49
99%
,2874
50-5999%
,475
60-69
99%
,32
70-79
99%
,120
80-89
99%
,6
90-99
99%
,251
100-199
99%
,1577
300-399
99%
,2503
400-499
99%
,518
500-599
99%
,1451
700-799
99%
,1703
800-899
100%
,72
900-999
(a)
0
10000
20000
30000
MIN: 0MAX: 652AVG: 16.5
MODE: 1MEDIAN: 7STD DEV: 24
SAMPLES: 300563FAILED: 82205
1%
,2491
0
14%
,29488
1
24%
,20433
2
30%
,13673
3
34%
,10078
4
39%
,9186
5
46%
,15637
6
51%
,10785
7
54%
,6299
8
59%
,12267
9
75%
,34378
10-19
83%
,17108
20-29
84%
,1881
30-39
88%
,8768
40-49
91%
,8367
50-59
97%
,12029
60-69
99%
,4143
70-79
99%
,625
80-89
99%
,193
90-99
99%
,145
100-199
99%
,252
200-299
99%
,93
400-499
99%
,36
500-599
100%
,3
600-699
(b)
Figure 35. Number of receiver set sizes per (a) virtual method call (invokevirtual) and(b) interface method call (invokeinterface).
the default case target. Therefore, when computing the label set size and density of a tableswitchinstruction, we ignore all of the labels whose branch targets are the same as the default case’s target.
The figure shows that the average number of labels per switch is 12.8 and that 89% of the switches
contain fewer than 30 labels.
Figure 36(b) shows the density of switch labels, computed as
number of case arms
max label − min label + 1(1)
This measure is important for selecting the most appropriate implementation of switch
statements [13,14]. In the JVM, the tableswitch instruction is used when the density is high and
the lookupswitch is used when the density is low.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
638 C. COLLBERG, G. MYLES AND M. STEPP
0
1000
2000
3000 MIN: 0MAX: 500AVG: 8.7
MODE: 2MEDIAN: 4STD DEV: 18
SAMPLES: 18935TOTAL: 164325
0%
,1
86
0
14
%,
26
23
1
30
%,
30
21
2
44
%,
26
84
3
56
%,
22
02
4
63
%,
12
69
5
69
%,
12
21
6
73
%,
66
4
7
76
%,
57
8
8
78
%,
45
9
9
90
%,
22
16
10-19
95
%,
94
0
20-29
96
%,
19
7
30-39
97
%,
24
5
40-49
98
%,
92
50-59
98
%,
72
60-69
98
%,
34
70-799
8%
,30
80-89
99
%,
91
90-99
99
%,
84
100-199
99
%,
18
200-299
99
%,
1
300-399
99
%,
6
400-499
10
0%
,2
500-599
0
500
1000
1500 MIN: 0MAX: 1AVG: 0.7
MODE: 1MEDIAN: 1018STD DEV: 0
SAMPLES: 18935FAILED: 186
5%
,1061
[0,0.05)
11%
,1135
[0.05,0.1)
16%
,885
[0.1,0.15)
20%
,755
[0.15,0.2)
23%
,557
[0.2,0.25)
27%
,713
[0.25,0.3)
30%
,674
[0.3,0.35)
32%
,354
[0.35,0.4)
35%
,545
[0.4,0.45)
36%
,205
[0.45,0.5)
39%
,535
[0.5,0.55)
40%
,127
[0.55,0.6)
41%
,285
[0.6,0.65)
43%
,411
[0.65,0.7)
45%
,237
[0.7,0.75)
46%
,270
[0.75,0.8)
48%
,300
[0.8,0.85)
49%
,162
[0.85,0.9)
49%
,130
[0.9,0.95)
50%
,103
[0.95,1)
100%
,9305
[1,1.05)
(b)
(a)
Figure 36. Switching statements: (a) number of case arms in tableswitch and lookupswitch; (b) labeldensity of tableswitch and lookupswitch.
8. RELATED WORK
In a widely cited empirical study, Knuth conducted an analysis of 440 FORTRAN programs [1].
The study was conducted in an attempt to understand how FORTRAN was actually being used by
typical programmers. By understanding how the language was being used, a better compiler could be
designed. Each of the programs were subjected to static analysis in order to count common constructs
such as assignment statements, ifs, gotos, do loops, etc. In addition, dynamic analysis was performed on
25 programs which examined the frequency of the constructs during a single execution of the program.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 639
The final analysis studied the effects of various local and global optimizations on the inner loops of
17 programs.
Knuth’s study was the first attempt to understand how programmers actually wrote programs.
Since that initial study, many similar explorations have been conducted for a variety of languages.
Salvadori et al. [3] and Chevance and Heidet [2] both examined the profile of Cobol programs.
Salvadori et al. looked at the static profile of 84 Cobol programs within an industrial environment.
In addition to examining the frequency of specific constructs, they also studied the development
history by recording the number of runs per day and the time interval between the runs. Chevance
and Heidet studied the static nature of Cobol programs through the number of occurrences of source-
level constructs in more than 50 programs. The authors took their study a step further by computing
the frequency of the constructs as the program executed. In this study, for categories of data were
examined: constants, variables, expressions, and statements.
Other than Chevance and Heidet [2], most studies of programmer behavior have concentrated on
the static structure of programs. Of equal importance is to examine how programs change over time.
Collberg et al. [15] showed how to visualize the evolution of a program by taking snapshots of its
development from a CVS repository and presenting these data using a temporal graph-drawing system.
Cook and Lee [4] undertook a static analysis of 264 Pascal programs to gain an understanding
of how the language was being used. The analysis was conducted within 12 different contexts, e.g.
procedures, then-parts, else-parts, for-loops, etc. In addition, they compared their results with those of
other language studies. Cook [16] conducted a static analysis of the instructions used in the system
software on the Lilith computer. An analysis of APL programs was conducted by Saal and Weiss [5,6].
Antonioli and Pilz [17] conducted the first analysis of the Java class file. The goal of their study was
to answer three questions. (1) What is the size of a typical class file? (2) How is the size of the class file
distributed between its different parts? (3) How are the bytecode instructions used? To answer these
questions, they examined six programs with a total of 4016 unique classes. In contrast to the present
study, they examined the size in bytes of each of the five parts of a class file (i.e. header, constant,
class, field, and method). They also examined instruction frequencies to see what percentage of the
instruction set was actually being used. They found that on average only 25% of the instruction set was
used by any one program. Our analysis does not focus on the frequency of a particular instruction per
program but instead looks at the frequency over all programs. Overall, their study is different from ours
in that they were interested in answering a few very specific questions, where our analysis is focused
on obtaining a complete understanding of JVM programs.
Gustedt et al. [18] conducted a study of Java programs that measures the tree width of CFGs. The tree
width is effected by such constructs as goto usage, short-circuit evaluation, multiple exits, breakstatements, continue statements, and returns. The authors examined both Java API packages as well
as Java applications obtained through Internet searches.
O’Donoghue et al. [19] performed an analysis of Java bytecode bigrams. Their analysis was
performed on 12 benchmark applications. The only similarity between their data and ours is that
we both found aload 0, getfield to be the most frequently occurring bigram. We attribute the
differences to the small sample size used in their study.
One of the byproducts of our analysis is a large repository of publicly available data on Java
programs. Appel [20] maintains a collection of interference graphs which can be used in studying
graph-coloring algorithms. The availability of such repositories is highly useful in the study of compiler
implementation techniques.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
640 C. COLLBERG, G. MYLES AND M. STEPP
9. DISCUSSION AND SUMMARY
In this paper we have performed a static analysis of 1132 Java programs obtained from the Internet.
Through the use of SandMark, we were able to analyze the structure of the Java bytecode. Our analysis
ranged from simple counts, such as methods per class, instructions per method, and instructions per
basic block, to structural metrics such as the complexity of CFGs.
Our main goal in conducting the study was to use the data in our research on software protection,
however we believe these data are useful in a variety of settings. These data could be used in the
design of future programming languages and virtual machine instruction sets, as well as in the efficient
implementation of compilers.
It would be interesting to perform a similar study of Java source code. Even though Java bytecode
contains much of the same information as in the source from which it was compiled, some aspects
of the original code are lost. Examples include comments, source code layout, some control structures
(when translated to bytecode, for and while loops may be indistinguishable), some type information
(Booleans are compiled to JVM integers), etc.
Owing to our random sampling of code from the Internet, it is possible that our set of Java jar-
files is somewhat skewed. It would be interesting to further validate our results by comparing against
a different set of programs, such as standard benchmark programs (for example, SpecJVM [21]), or
programs collected from standard source code repositories (for example, sourceforge.net).
We would also welcome studies for other languages. It would be interesting to validate our results
by performing a similar study for MSIL, the bytecode generated from C# programs, since MSIL and
JVM (and C# and Java) share many common features. It would also be interesting to compare our
results with languages very different from Java, such as functional, logic, and procedural languages.
It might then be possible to derive a set of ‘linguistic universals’, programming behaviors that apply
across a range of languages. Such information would be invaluable in the design of future programming
languages.
Our experimental data and the SandMark tool that was used to collect it can be downloaded from
http://sandmark.cs.arizona.edu/download.html.
REFERENCES
1. Knuth DE. An empirical study of FORTRAN programs. Software—Practice and Experience 1971; 1:105–133.2. Chevance RJ, Heidet T. Static profile and dynamic behavior of COBOL programs. SIGPLAN Notices 1978; 13(4):44–57.3. Salvadori A, Gordon J, Capstick C. Static profile of COBOL programs. SIGPLAN Notices 1975; 10(8):20–33.4. Cook RP, Lee I. A contextual analysis of Pascal programs. Software—Practice and Experience 1982; 12:195–203.5. Saal HJ, Weiss Z. Some properties of APL programs. Proceedings of 7th International Conference on APL. ACM Press:
New York, 1975; 292–297.6. Saal HJ, Weiss Z. An empirical study of APL programs. International Journal of Computer Languages 1977; 2(3):47–59.7. Stata R, Abadi M. A type system for Java bytecode subroutines. Conference Record of POPL 98: The 25th ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages, San Diego, CA, 1998. ACM Press: New York, 1998;149–160.
8. Collberg CS, Tomborson C. Watermarking, tamper-proofing, and obfuscation—tools for software protection. IEEETransactions on Software Engineering 2002; 8(8):735–746.
9. Collberg C, Myles M, Huntwork A. SANDMARK—A tool for software protection research. IEEE Magazine of Security
and Privacy 2003; 1(4):40–49.10. Lindholm T, Yellin F. The Java Virtual Machine Specification (2nd edn). Addison-Wesley: Reading, MA, 1999.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe
AN EMPIRICAL STUDY OF JAVA BYTECODE PROGRAMS 641
11. Cousot P, Cousot R. An abstract interpretation-based framework for software watermarking. Proceedings of the ACM
Conference on Principles of Programming Languages. ACM Press: New York, 2004.12. Dean J, Grove D, Chambers C. Optimization of object-oriented programs using static class hierarchy analysis. Proceedings
of the 9th European Conference on Object-Oriented Programming. Springer: Berlin, 1995; 77–101.13. Bernstein R. Producing good code for the case statement. Software—Practice and Experience 1985; 15(10):1021–1024.14. Kannan S, Proebsting TA. Correction to ‘producing good code for the case statement’. Software—Practice and Experience
1994; 24(2):233.15. Collberg C, Kobourov S, Nagra J, Pitts J, Wampler K. A system for graph-based visualization of the evolution of software.
Proceedings of the ACM Symposium on Software Visualization, June 2003. ACM Press: New York, 2003.16. Cook RP. An empirical analysis of the Lilith instruction set. IEEE Transactions on Computers 1989; 38(1):156–158.17. Antonioli DN, Pilz M. Analysis of the Java class file format. Technical Report, University of Zurich, 1998. Available at:
ftp://ftp.ifi.unizh.ch/pub/techreports/TR-98/ifi-98.04.ps.gz.18. Gustedt J, Mæhle OA, Telle JA. The treewidth of Java programs. Proceedings of the 4th International Workshop on
Algorithm Engineering and Experiments (ALENEX) (Lecture Notes in Computer Science, vol. 2409). Springer: Berlin,2002.
19. O’Donoghue D, Leddy A, Power J, Waldron J. Bigram analysis of Java bytecode sequences. Proceedings of the 2ndWorkshop on Intermediate Representation Engineering for the Java Virtual Machine. National University of Ireland:Ireland, 2002; 187–192.
20. Appel A. Sample graph coloring problems. http://www.cs.princeton.edu/∼appel/graphdata/.21. SpecJVM98. http://www.specbench.org/osg/jvm98.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2007; 37:581–641DOI: 10.1002/spe