14

Click here to load reader

Intermediate Languages

Embed Size (px)

Citation preview

Page 1: Intermediate Languages

Intermediate LanguagesIntermediate languages are with us for quite a long time. Most of the compilers employ them in one-way or another. The way of using intermediate languages have also sometimes contributed to the success of the languages like Pascal, Java and recently C# (with .NET initiative in general). Even some tools like MS-Word makes use of an intermediate language (p-code). Intermediate languages have made a big difference in achieving portability of code. This article explores the concept of intermediate languages and their use in achieving complete portability and language interoperability.

In 1950s itself, this idea of having an intermediate language was experimented with the introduction of UNCOL [1]. Later Pascal became widely available because of p-code. Not until recently did Java employed the same technology with its promise of portability to become a big success. Now Microsoft's .NET initiative is also based on this approach, but the keyword is language interoperability. A closer look reveals that all are based on same idea: use of intermediate code, and that is the current trend in the language translation technology.

Approaches in Language Translation

Two major approaches towards language translation are compilation and interpretation. In compilation, since the native code is generated, different compilers are needed for different platforms. An interpreter is also particular to the platform, but its specifics are abstracted making only the behavior of the program transparent as the code is translated and executed 'on-the-fly'.

Neither of them is ‘the’ best approach - both boasts of significant advantages and also suffer from some serious disadvantages.

The advantages of compilation mostly overweigh its disadvantages and that is why it is preferred in most of the languages. The translation and optimization needs to be done only once, and native code ensures efficiency of execution. However as native code is generated, it is platform specific.

Interpretation has some very good advantages that compilation doesn’t have. In particular, they can achieve full portability. Also it can provide better interactive features and diagnostic (error) messages. But one problem that cripples interpretation is performance. Since

Page 2: Intermediate Languages

translation needs to be done every time (and for the same reason, optimization is less attractive) execution becomes generally very slow (10 to 100 times) compared to the equivalent compiled code.

You can think of an interpreter as the one that implements a virtual machine. So, it can be inferred that the 'machine language' of that virtual machine is the high-level language for which the interpreter is written!

Interpreters provide many advantages including greater flexibility, better diagnostics and interactive features like debugging. True power of interpretation is generally underestimated. To illustrate, you can think of printf library routine in C as a tiny interpreter for formatted output - depending on the requirement it parses the format string, gets the arguments, computes the format on the fly, prints it to the output along with facilities like padding, aligning, specific notations etc.

Interpretation allows the code to be generated or modify itself "on the fly" and execute it. For example, interpreted languages like Lisp or even partly interpreted languages like Java or C# provides such facilities (although not recommended to use such facilities for practical programming). Note that few language features like reflection are only possible with interpretation. This is the reason why compilers cannot be written for some languages and interpretation is the only option.

So, instead of restricting to pure compilation or interpretation, you can infer that a combination of the both is better to enjoy the advantages of these approaches.

It seems to be very simple to use both compilation and interpretation. For that itself, various languages use different approaches.

Few languages provides options for both interpretation and compilation and can be used depending on the requirement. For example, for development and debugging, Microsoft Visual Basic uses an interpreter and for deployment, it employs a compiler.

As already stated, writing compilers for the languages that are meant for interpretation is sometimes not possible. In such cases, it is possible to bundle the source code with the whole interpreter as an executable file and distribute it. Java Native Interface (JNI) follows a similar approach by embedding a Java interpreter in an executable code created by a C/C++ source code, and run the Java code through

Page 3: Intermediate Languages

it. This solution is feasible because the Java interpreter’s size is not very big.

A very interesting approach towards combining compilation and interpretation through hardware support was widely used in as early as 1960s in IBM chips. IBM released chips that incorporated the interpreter software burnt into chips (nothing but firmware) that supported a general instruction set. The main advantage being that the same instruction set (of the interpreter) shall be used across different processor architectures. Since, the compilers would generate the code just for that one instruction set, the effort is minimized. Later, the advent of VLSI design with RISC architecture obviated this microprogramming approach.

The very popular and widely used approach that combines the compilation and interpretation is making use of intermediate code directly. Since its use in Java this approach seems to have received widespread acceptance, even though its idea was used long back in Pascal and in many other languages. Languages following this approach are sometimes referred to as p-code languages since p-code of Pascal is one of the first intermediate codes to follow this approach successfully. Few of important p-code languages are Python, Perl, Java and now, C# (.NET framework). The idea is to use a compiler to generate the intermediate language code. The intermediate code can be moved on to other platforms or even across the network where an interpreter (also called as virtual machine) for that platform will take over to interpret the code.

The usage of intermediate code in compilers is not a new idea. There are various phases in a compiler and they can be grouped generally as the front-end and the back-end. The analysis and translation of the code (in general 'parsing') is done by the front end and the synthesis of the code is taken care by the back-end. The information in the form of intermediate languages is represented that is generally independent any particular architecture and thus serves as a link to transfer the information from front-end to the back-end. The advantage with this approach is that, the front-end can be same and only the back-end needs to be changed to support new architectures and hence the effort is limited only in rewriting the back-end of the compiler. Intermediate languages also help in general optimization of the code that are not particular to any platform and provides many related advantages. So most of the compilers employ intermediate code.

Page 4: Intermediate Languages

If intermediate languages have been in usage for a long time in compilers, why is this renewed interest?

The difference is that previously most of the compilers hid the use of intermediate languages as ‘implementation details’. But now, to make the best use of them, they are extended, exposed and used in conjunction with interpretation.

An advantage with the intermediate language approach is ‘binary code portability’, i.e., the next level of portability promised by languages like C/C++ that provides source code portability. As long as an interpreter is there to understand and execute the intermediate code, no recompilation is necessary to move the code to a new platform. It also provides the ability to verify code and thus type-safety can be ensured and as a result, ability to build security systems on that.

Intermediate languages can also provide the advantage of ‘language interoperability’. As more than one language can be converted to the same intermediate language, the code compiled through two different high-level languages can interact at intermediate code level agnostic of the different possible source languages. A related benefit is that the number of translators that needs to be written is also drastically reduced. This is a significant benefit as the cost of retargeting the translator is high and writing a new one is very tedious. As a result only a few translators need to be developed.

To illustrate, in traditional compilation approach, for N high-level languages and M target machines, N * M translators are required. With intermediate language approach, where the IL is capable of supporting N high-level languages and M target platforms, then the number of translators required effectively reduces to just N + M.

Apart from portability and interoperability, there are many features that make the use of intermediate languages attractive. One of the advantages that seems to be insignificant at the first look is that the file size tends to be small compared to the equivalent native code. That is one of the reasons why intermediate code approach was popular around 1980s when disk and memory space was precious. The small size of the code is advantageous even now: in the networked environment, the small size of applets (nothing but Java class files) makes their transmission across the network faster.

Page 5: Intermediate Languages

Approaches in Representing Intermediate Code

A challenge for an intermediate language is to satisfy seemingly two contradicting goals:

1) To represent the source code without significant loss of information

2) To keep only the necessary information so that the source code can be translated directly and efficiently to the target machine code later.

Depending on the level of abstraction of the intermediate language, we can classify them as high-level, medium-level or low-level ILs.

Usually the high-level ILs are represented as trees or DAGs (Directed Acyclic Graphs) and sometimes as stack based. The middle-level ILs typically use approaches like triples and quadruples. The low-level ILs generally use code closer to the assembly language of the target machine.

High-level ILs takes more effort to get the code converted to the native code, but are sufficiently high-level that it can represent rich features of various source languages directly. So it is possible to retarget new languages to high-level ILs. However, more effort is needed in converting the intermediate code to native code. On another extreme, low-level ILs make the translation to native code easier, but it is generally hard to retarget other high-level languages than that it is originally intended for.

Another design approach is to abstract the programming language (for example, ANDF) or the machine (for example, RTL) and an IL has to strike a balance between the two extremes.

There are many methods used to represent the information at intermediate level. Some widely used approaches are discussed here.

Three-address code

In most of the language compilers, three address codes are generated as intermediate code and later code-generators generate code for that particular target machine. They are mostly register based, and are fairly low-level, so target-code generation is straightforward. A similar

Page 6: Intermediate Languages

approach is Register Transfer Language (RTL) widely used in GNU C compilers. Following is an example for three address code:

R1 := R2 + R3 And the general syntax is. X := Y op Z

Object files

An interesting approach to represent information at intermediate level is object files. With each platform, the format of object code file and the information stored differs. This makes them platform dependent. An effort has been made to combine the information that is required in various formats. It resulted in creating a huge format that contains all such details, and is referred to as 'fat binaries' effectively used in NExtStep platforms. On the other hand, 'slim binaries' are also available that generates the portable object code 'on-the-fly'.

Tree representations

ANDF [2,3] stands for Architectural Neutral Distribution Format abstracts high-level language constructs. It aims at converting various language constructs in a programming language to an intermediate form that shall retain all the information available in the source. It takes care to represent different language semantics, platform dependencies etc. Languages including C, C++, Fortran, Ada has ANDF producers (high-level compilers). This is achieved through an intermediate representation of the language constructs (keeping its variants in various programming languages). For example, expressions in the source language are converted into EXPs, identifiers as TAGs, parameterization as TOKENs etc. ANDF follows a tree representation of constructs called CAPSULEs. It also has installers (low-level compilers) in wide variety of platforms including Dec-Alpha, Sun-Sparc, PowerPC and Intel and executed as native code.

Using another High Level Language!

Instead of creating a new intermediate language, this approach is towards using a high-level language as a intermediate language that is already available, highly portable, efficient and sufficiently low-level. As you would have guessed, C is an interesting high-level language with ability to write low-level code. Owing to this property, it is sometimes referred to as a portable assembler or 'middle-level' language. The beauty of C is that, in spite of its low-level abilities, the

Page 7: Intermediate Languages

code is highly portable. Due to this nature, C has effectively served as a form of a portable intermediate language for other high-level programming languages. In fact, in initial days of UNIX, translators generating C code as target code was written so that re-writing them for every platform were not required. Many languages follow this approach, including Mercury, Scheme, APL, Haskell etc. Also GNU Fortran and Pascal compilers (f2c and p2c) generate C code and native C compilers take over to compile and execute them.

Stack-based machines

This representation assumes the presence of a run-time stack and generates code for that. It uses the stack for evaluation of expressions by making use of stack itself to store the intermediate values. Thus the underlying concept is very simple and that is its main strength. Also, least assumptions are made about the hardware and support available in the target architecture. When a virtual machine (runtime interpreter) simulates a stack-based machine and provides necessary resources, this approach can effectively provide platform-independence. In other words, the virtual machine becomes a platform of its own that shall be simulated in possibly any hardware platform (and thus needs to be written for those platforms). One problem with this approach however is that the code optimization is difficult. Java virtual machine is a very good example for such stack-based representation.

Other major efforts include Oberon Module Interchange (OMI) [4], 'Juice'[5] that is based on OMI, Dynamic binding and VP code of TAO operating system and U-Code.

Implementations based on stack approach

Since stack based machines are one of the most successful ones and is gaining widespread acceptance in recent times, let us discuss few implementations based on that in detail here.

P-code:

P-system was popular in 1970-80s as an operating system that made use of p-code [6]. Compilers would emit p-code, an architectural neutral one that was executed by a virtual machine. UCSD Pascal was the most popular compiler that generated p-code and also the whole p-system was written in UCSD Pascal itself making it easier to port to new architectures with much ease.

Page 8: Intermediate Languages

Pascal implementations contained a compiler generating p-code and later an interpreter will take over to execute that p-code. The interpreter was very simple and so it was easy to write an interpreter for a new architecture. With that interpreter, a Pascal compiler in the form of p-code can be used to compile Pascal programs that can in-turn be run on the interpreter.

However, this steps need to be done only in initial steps. Later the compiler can be modified to generate native code, and once, the Pascal compiler itself is given as input for the Pascal compiler in p-code, the compiler becomes available in native code. This process is referred to as 'bootstrapping', that is generally used in porting compilers to new environments. But the point here is that this 'bootstrapping' process becomes easy and elegant with the p-code approach. This let to the popularity and widespread use of Pascal compilers.

Java bytecodes:

Java[7] was originally intended to be used in embedded systems and set-top boxes and thus a heterogeneous environment. Internet was becoming very popular at that time and there was no language that best suited that heterogeneous environment. Java suited for that purpose very well and thus became a sort of lingua franca of Internet.

The technology is as we already saw of intermediate languages. Java compiler converts the source into intermediate code - bytecode and pack with related information in class files. These class files (applets or applications) can be distributed across the network and executed in any platform provided that a Java Virtual Machine [8] is available there. The advantage of this technology is, as everybody knows now, 'write once, run anywhere'. Programmers need not worry about the target platform. They just concentrate on the problem and develop the code.

To illustrate, for each and every class or interface defined, a class file is generated. Every method is compiled into byte codes - the instructions for the hypothetical microprocessor. For example:

int a = b + 2;

this may be converted to the bytecodes as,

iload_1 // push the variable 1 (int type) to evaluation stack

Page 9: Intermediate Languages

iconst_2 // push the constant value 2 (int type) on the stack

iadd // pop two integer values from stack, add them, push the result back.

istore_2 // pop the int value from the stack and store it in variable 2 // (int type)

All the information that you give inside the class definition are converted and get stored to form an intermediate class file.

The intermediate class file format is a specification given by Sun. It is a complex format and holds the information like the methods declared along with the byte codes, the class variables used, initializing values (if any), super class info, signatures of variables and methods, etc. It has a constant pool - a per class runtime data structure, which is similar to the symbol table created by the compiler.

Virtual machine plays a major role in Java technology. All the Java virtual machines have the same behavior, as defined by SUN, but the implementation differs. It forms a layer over the physical machine in the memory.

Even though the Java technology provides the basis for achieving full portability, Java’s design aids to make it architectural neutral. To illustrate, lets compare it with a C design consideration with that of Java. C gives much importance to efficiency and for that, it leaves size of its data types (except char) to be implementation-dependent. But Java primitive types are of predetermined size, independent of the implementation. In general, Java improves upon C by removing the constructs having various behavioral types in C by having well defined behavior instead.

JVMs are available in very wide variety of platforms. However, Java class file format and bytecodes are closely tied with the Java language semantics. So, it becomes tough to retarget other language compilers to generate Java class files. Few of the languages including Eiffel, Ada and Python have compilers generating class files to be executed by JVMs. Since the class files and bytecodes are not versatile enough to accommodate wide range of languages, class files (and bytecodes) doesn’t fare as a feasible universal intermediate language.

.NET Architecture

.NET architecture addresses an important need - language interoperability, the concept that can change the way programming is done and is significantly different from what Java offers. If Java came

Page 10: Intermediate Languages

as a revolution providing platform independence, .NET has language interoperability to offer.

To illustrate, JVM is implemented for multiple platforms including Windows, Linux, Solaris, OS/2 etc. But it has only few languages targeting at JVM to generate class files. On the other hand, .NET supports multiple languages generating code (MSIL – Microsoft Intermediate Language) targeting Common Language Runtime (CLR). This list includes, C#, VB, C, C++, COBOL, etc. and nevertheless this list is an impressive one. CLR, the runtime interpreter of .NET platform (equivalent of JVM of Java) is currently implemented only for Windows platform and efforts are underway to implement it for Linux and other platforms. A Java program can be compiled on a PC and can be executed by a Mac or a Mainframe. Whereas with .NET, we can write a code in COBOL, extend it in VB and deploy it as a component. Thus the Java and .NET have fundamentally different design goals. However, you may have noticed that they partially satisfy the two different requirements for universal intermediate language: to support different source languages and target architectures.

In .NET, the unit of deployment is the PE (Portable Executable) file - a predefined binary standard (similar to class files of Java). It is made up of collection of modules, exported types and resources and is put together as an .exe file. It is very useful in versioning, deploying, sharing, etc. The modules in PE file are known as assemblies.

The IL is designed such that it can accommodate wide range of languages. Also at the heart of the .NET is the type system where the types are declared and the metadata of how to use the types is also stored. One important difference between MSIL and bytecodes is that MSIL is type-neutral. For example, iconst_0 is a bytecode that tells to push integer value 0 on the stack, meaning that the type information is kept in the bytecodes. But the similar IL code just tells to push four bytes, meaning that no type information is passed on.

With .NET you can write a class in C#, extend it in VB and instantiate it in managed C++. This gives you the ability to work in diversified languages depending on your requirement and still benefit from the unified platform to which the code for any .NET compliant language is compiled into.

SummaryThe benefits of using intermediate languages are well known, and the current trend is towards achieving fully portable code and language

Page 11: Intermediate Languages

interoperability. To understand the concept of portable code compilation, the understanding of the conventional translation technology and their advantages/disadvantages needs to be known. There are many issues involved in intermediate code design and understanding them shall enable us to appreciate the various efforts to generate portable code.

Way back in 1950s itself, the first effort towards achieving portability was taken through UNCOL initiative. Later, during 1970-80s, Pascal in the form of p-code picked up the idea. In 1995s Java came as a revolution to capture the imagination of the language world. Now it is .NET’s turn with the age-old concept of language interoperability. Thanks to such technological advances, achieving 100% portable code with language interoperability is no more a dream. It is clear that the concept of intermediate languages is here to stay for a long time and we are not far from having a universal intermediate language.

References:

[1] see ftp://riftp.osf.org/pub/andf/andf_coll_papers/UNCOLtoANDF.ps[2] see ANDF home page: http://www.andf.net/[3] refer 'Architecture Neutral Distribution Format (XANDF)', Xopen Co. Ltd., 1996[4] see http://huxley.inf.ethz.ch/~marais/OMI.html[5] see http://www.ics.uci.edu/~juice/[6] see http://www.threedee.com/jcm/psystem/[7] see http://java.sun.com/ [8] refer James Gosling, Bill Joy, Guy Steele, "The Java Language Specification", Addison-Wesley, 1996[9] refer Tim Lindolm, Frank Yellin, "The Java Virtual Machine Specification", Addison-Wesley, 1997[10] ECMA C# specification, Microsoft Corporation, 2001

[11] "The Development of the C language", Dennis M. Ritchie, Second History of programming Languages conference, Cambridge, Mass. 1993

Sidebar 1:

Approaches in improving the performance of the intermediate code in Java/.NET

One very obvious disadvantage of the combination of compilation and interpretation is the speed factor. Although the code is compiled, still

Page 12: Intermediate Languages

an interpreter needs to be employed to execute the code and typically the execution speed is reduced to the factor of 10 to 30 times to the equivalent of the native C/C++ code. There are many suggested ways to improve this speed to achieve the speed of "native code". The remaining part of the article explores the approaches that are made to overcome this problem with Java and .NET in particular.

An interesting approach is to avoid the interpretation at the target platform and do compilation again. This can avoid the slowness inherent in the interpretation approach, as translation is required each and every time the code needs to be executed. So an alternative approach of compiling the intermediate code again into native code to execute can be thought of. This makes the life of the virtual machine tougher but can improve the performance and it should be noted that the approach should make sure that the platform independence is not lost for this performance gain.

One such approach is Just-In-Time compilation. The aim is to compile the code when the code is called and 'boot-strap' that the compiled native code be used the next time the code is called. Since, in general, 90% of the time is spent in executing 10% of the code, this approach can reap rich dividends. JIT compilers are also referred to as "load and go" compilers as they don’t write the target code into a file. When a method is called for the first time, it is compiled into native code and kept in memory that the compiled code may be used next time the method is called again. Java uses this approach of compilation on demand extensively and many of commercial JVMs are JITters. Note that, JVMs are allowed to follow the interpretative approach or JIT and hence it is an optional feature for performance improvement. For .NET, the use of JIT is mandatory and thus all MSIL code is JITted to native code before execution.

Aggressive optimizations are also possible to make that are not possible with static optimization techniques, since:

i) precious runtime information is available for doing optimization

ii) depending on the situation and user requirement, only particular parts of the code will be called and doing aggressive optimizations on that part of the code and bootstrapping it shall improve the performance considerably.

Page 13: Intermediate Languages

iii) since optimization is done the frequently called code, it shall provide improved performance in general than the general optimization without any profiler information.

However it should be noted that, the performance of JITters is still not on par with the equivalent C/C++ code, but it is considerably better than the equivalent interpreted code. Also, in environments like embedded systems where RAM is precious, it may not be possible to bootstrap the methods for later use.

Providing ‘Ahead of time’ recompilation is yet another approach for improving the performance through native code. A compiler reads the information in the Java class files and generates a file that contains the original bytecodes and the native code equivalent for various platforms. The resultant file is sometime referred to as "fat" class file (refer to the "fat" binaries in the main text) and depending on the platform, the native code shall be invoked. This can effectively avoid the necessity to recompile every time the code needs to be executed and at the same time not losing portability.

Another approach is Ahead of time compilation. When the file is loaded into the memory itself, all the code gets compiled into native code. The advantage with this approach is that the code shall be compiled before any method is called, as opposed to JIT approach and hence is sometimes referred to as PreJIT. So, it doesn't suffer the problem of overhead with JIT compilation. This can thus effectively reduce the time for start-up and execution, as the code is ready of execution in form of native code.

So it seems that Ahead of time compiling is a very good alternative for JIT, but that is not the end of the story. The virtual machine still needs the meta-data about the classes, for example for supporting features like reflection. If no meta-data is maintained, then reflection cannot be supported which would be a serious disadvantage. Moreover, if the environment or security settings change, then the code needs to be compiled again. Thus PreJIT is not much promising approach. .NET supports PreJIT with its support with its NGEN tool in Visual Studio.NET

One of the interesting ideas to overcome the performance problem is implementing the interpreter in hardware. That is what the JavaChip intends to do - the bytecodes shall be directly executed by a microprocessor designed specifically for that. Since Java bytecode are not too low level to be implemented at chip level, still a JVM runs over

Page 14: Intermediate Languages

that, with the difference that the bytecodes are now directly executed by the hardware. This idea too, is reminiscent of the old idea in IBM machines that is discussed in the main text.

Sidebar - 2

C - A retrospect on its evolution and design decision

C is known for its high performance through generation of native code. Its interesting to note that its predecessor B didn't generate native code, instead it generated a stack based intermediate code called as threaded code. When Ritchie wanted to write his operating system with his new language, he found that it was handicapped to use the interpreted approach. To overcome this problem, Thomson tried by offering a 'virtual B' compiler that still had an interpreter, but it was too slow for practical purposes. So, to make C sufficiently low level that it can be used to write operating systems and compilers, Ritchie abandoned threaded code approach and instead generated native code. So this drastic change in the translation approach itself was influenced by two main problems with interpretation: the runtime overhead of interpretation and performance. In retrospect, had C continued the B's tradition of interpretive approach, would it have become such a huge success?

All rights reserved. Copyright Jan 2004.