
Interpreters, computers, and compilers

Course 02131, week 10, SW

Jørgen Steensgaard-Madsen

November 16, 2005

Abstract

Shells, graphical user interfaces, pocket calculators, and domain specific systems (e.g. Mathematica) can all be described as interpreters. Also computers can be described that way, and so can even compilers.

These notes describe the characteristics of interpreters, their internal structure, and how to construct them, possibly using some kind of tool support. They are intended just as an appetiser for courses that go into the details, e.g. compiler construction, language design, or formal semantics.

Contents

Interpreter characteristics

Recursive descent parsing

Interpreters vs. compilers

Breakpoint interpretation

Translation into FSM-like control

LIJVM and its compiler


Interpreter characteristics

An interpreter is a tool to perform computations according to given directions. Often an interpreter iterates the actions of accepting an instruction and performing the requested computation, but some just terminate after the reaction to their first instruction (e.g. the Gezel interpreter).

Use of interpreters

• General programming languages (e.g. ML, Scheme, LISP, Prolog, Matlab)

• Transformation (of text), e.g. LaTeX, HTML browsers, tth, program ⇒ basic blocks + control

• Scripting languages (e.g. PERL, Tcl/Tk, shell command languages, make)

• Virtual machine implementation (e.g. JVM, .NET, P-code)

• Microprogrammed computers (e.g. Mic-1, Pentium, GIER)

• Preprocessors and ‘macro languages’

• Debuggers (e.g. gdb)

With interpreters, use of compilers is avoided.

A compiler does translation, possibly as an interpreter.

Organisation of an interpreter

• Any interpreter decodes input and finds a most significant operation

It then dispatches control to the definition of the operation

The dispatching can be considered similar to a routine call

An interpreter can perform a finite number of operations

Example: matching a while(...){...} construct, there is an operation defined in terms of the construct's condition and body

• A common code structure for interpreters is

    while (more) {
        switch (get_operation()) {        /* analyse and dispatch */
        case ... : ...; break;            /* perform operation */
        ...
        case exit: more = 0; break;
        }
    }

In simple interpreters get_operation() may be a parser routine.


An FSMD is an interpreter

[Figure: an FSMD drawn as an interpreter loop with a Decode part and a Dispatch part, connected by the signals "action" and "c".]

    while (1) {
        action = A[S];
        c = datapath(action);
        S = N[c,S];
    }

    switch(action) {
    case 0: x = a; y = b; x != y; break;
    case 1: x < y; break;
    case 2: y = y - x; break;
    case 3: x = x - y; break;
    case 4: x != y; break;
    case 5: write(x); break;
    }

Looking ahead: Coupled interpreters ∼ coupled FSMDs

Possible opcodes could be: (FSMD number, Action number)

Claim: this fits nicely with control structures of high-level languages, not including the notion of general subroutines
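The decode and dispatch loop of the FSMD can be made concrete for the GCD actions listed above. The following is only a sketch under assumptions: the state names, the contents of the tables A and N, the extra halt state, and the use of a variable `result` in place of write(x) are all hypothetical choices made for illustration.

```c
#include <assert.h>

/* Hypothetical state numbering; S_HALT is added so that the loop can stop. */
enum { S_INIT, S_LESS, S_SUBY, S_SUBX, S_TEST, S_OUT, S_HALT };

static int x, y, a, b, result;

/* Datapath: performs one action and returns a status bit c. */
static int datapath(int action) {
    switch (action) {
    case 0: x = a; y = b; return x != y;
    case 1: return x < y;
    case 2: y = y - x; return 0;
    case 3: x = x - y; return 0;
    case 4: return x != y;
    case 5: result = x; return 0;        /* stands in for write(x) */
    }
    return 0;
}

/* A[S]: the action invoked in state S. */
static const int A[6] = { 0, 1, 2, 3, 4, 5 };

/* N[c][S]: the next state for status c in state S. */
static const int N[2][6] = {
    /* c == 0 */ { S_OUT,  S_SUBX, S_TEST, S_TEST, S_OUT,  S_HALT },
    /* c == 1 */ { S_LESS, S_SUBY, S_TEST, S_TEST, S_LESS, S_HALT },
};

/* Runs the FSMD until it halts; both inputs must be positive. */
int fsmd_gcd(int a_in, int b_in) {
    int S = S_INIT;
    assert(a_in > 0 && b_in > 0);
    a = a_in;
    b = b_in;
    while (S != S_HALT) {
        int action = A[S];               /* decode */
        int c = datapath(action);        /* dispatch */
        S = N[c][S];                     /* choose the next state */
    }
    return result;
}
```

A call such as fsmd_gcd(12, 18) steps through the states until S_OUT has stored the result. Only the tables would change for a different control structure; the loop itself stays the same.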


Recursive descent parsing

An interpreter, like a compiler, handles a formal language, i.e. a set of texts that are called expressions (or programs) of the language. Both typically depend on a scanner to collect characters into tokens, which are presented to a parser.

The syntax of a language requires that tokens come in certain sequences, and the parser checks a seen sequence against the requirements. Error reporting and error recovery (skipping seen or subsuming forgotten tokens) are also done in a parser.

A grammar states language syntax in a suitable notation, of which several exist and are in actual use. The relation between grammars and parsers for their languages has been extensively studied. Ideally, one wants to generate a parser automatically from a grammar, and several tools of this kind exist.

A semantics for a language is a machine independent description of the computation represented by an expression. An implementation encodes the semantics in some implementation language, often the same as used for the language parser, e.g. C. Developers typically write an implementation such that the encoded semantics is integrated with the parser.

This section presents a simple way to develop an interpreter from a grammar expressed in a notation known as EBNF (Extended Backus-Naur Form), where BNF refers to a similar language used to describe an early programming language called Algol 60.

An expression interpreter

An example formal grammar

    expression = product { ('+' | '-') product }
    product    = factor { ('*' | '/' | '%') factor }
    factor     = numeral | '(' expression ')'

A language is a set of strings

e.g. C programs are elements of the set known as the C language.

A syntax category ∼ a sublanguage ∼ a subset

e.g. product, factor, and numeral are sublanguages of expression.

EBNF: a language of formal grammars is a meta language

The names of sublanguages are sometimes called non-terminals.

Symbols (terminals) of the target language are quoted, e.g. '(' and '+'.

Unquoted symbols, like = | { and ), are special symbols of the meta language.


A recursive descent parser has a routine for each syntax category

A grammar rule like:

    expression = product { ('+' | '-') product }

can be transliterated to a corresponding routine in a C program:

    myint expr(symset follow) {            // never skips a token in follow
        myint tmp, res;                    // prepares for use of XLong?
        token_t op;
        res = product(follow | adder);
        while ((op = get_present(adder,follow))) {   // { adder product }
            tmp = product(follow | adder);
            switch (op) {
            case ADD: res += tmp; break;
            case SUB: res -= tmp; break;
            }
        }
        return res;
    }

Error detection and recovery

Recursive descent parsers can behave fairly well w.r.t. error handling:

• Each recognizer has a parameter representing a set of possible follower tokens.

• Error handling depends on expected and acceptable sets of symbols:

    static token_t get_present(symset expected, symset follow);
    /* Returns current token if in expected, and advances.
     * Otherwise NONE==0 is returned, perhaps after an error message.
     */

    static int skip_absent(symset start, symset follow);
    /* Any token not in 'start | follow' will be skipped.
     * Returns 0 when the resulting current token is in start, 1 otherwise
     * (this is the convention the factor routine relies on).
     */

Users must consider whether a token should be read when get_present returns 0.

Pure parsers (with no semantics) may be generated from a grammar, and appropriate sets may be (automatically?) calculated.
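The two helper routines can be sketched with a symbol set represented as a bit mask, one bit per token kind. Everything below is an assumption made for illustration: the token numbering, the pre-tokenised input array that replaces a real scanner, the counter that stands in for error messages, and the return convention of skip_absent (non-zero when no start token is present, which is how the factor routine uses it).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned symset;                   /* one bit per token kind */
typedef enum { NONE = 0, NUM, ADD, SUB, LPAR, RPAR, STOP, EOFTOK } token_t;
#define SET(t) (1u << (t))

static const token_t *input;               /* hypothetical pre-tokenised input */
static token_t token;                      /* the current token */
static int parse_err;                      /* number of errors reported */

/* Advances to the next token; EOFTOK is never passed. */
static void get_token(void) { if (token != EOFTOK) token = *input++; }

static void start_scan(const token_t *toks) {
    assert(toks != NULL);
    input = toks;
    token = NONE;
    parse_err = 0;
    get_token();
}

static int member(symset s, token_t t) { return (s & SET(t)) != 0; }

/* Skips tokens that are neither in start nor in follow.
 * Returns 1 when no start token is present afterwards, 0 otherwise. */
static int skip_absent(symset start, symset follow) {
    while (!member(start | follow, token) && token != EOFTOK) {
        parse_err++;                       /* stands in for an error message */
        get_token();
    }
    return !member(start, token);
}

/* Returns the current token and advances if it is in expected.
 * Otherwise returns NONE, counting an error unless the token is in follow. */
static token_t get_present(symset expected, symset follow) {
    token_t t = token;
    if (member(expected, t)) { get_token(); return t; }
    if (!member(follow, t)) parse_err++;
    return NONE;
}
```

With this representation, a set union such as follow | adder is a single bitwise operation, which is why the recognizers can pass extended follower sets around cheaply.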


Using a library for long arithmetic of unbounded precision integers

Gezel uses a public mp-library for long arithmetic. We could use it as well:

    void expr(myint res,symset follow) {   // product { adder product }
        myint tmp;
        token_t op;
        mpz_init(tmp);                     // required discipline
        product(res,follow | adder);
        while ((op = get_present(adder,follow))) {
            product(tmp,follow | adder);
            switch (op) {
            case ADD: mpz_add(res,res,tmp); break;   // result parameters!
            case SUB: mpz_sub(res,res,tmp); break;
            }
        }
        mpz_clear(tmp);                    // releases memory
    }

expr has been adjusted to use result parameters, as do the library routines. The interpreter routines get marginally longer, but no more complex.

The product routine

    myint product(symset follow) {         // never skips a token in follow
        myint res,tmp;
        token_t op;                        // product = factor { multiplier factor }
        res = factor(follow | multiplier);
        while ((op = get_present(multiplier,follow))) {
            tmp = factor(follow | multiplier);
            switch(op) {
            case MUL: res *= tmp; break;
            case DIV:
                if (tmp) res /= tmp; else bye("Division by zero\n");
                break;
            case MOD:
                if (tmp) res %= tmp; else bye("Division by zero\n");
                break;
            }
        }
        return res;
    }


The factor routine

    myint factor(symset follow) {
        myint res=0;
        if (skip_absent(primer,follow)) {  // current token is in follow
            fprintf(stderr,"Missing factor at %s\n",symbol_string[token]);
            parse_err++;
        } else                             // current token is in primer
            switch (token) {
            case LPAR:
                get_token();
                res = expr(follow | 1<<RPAR);
                get_present(1<<RPAR,follow);
                return res;
            case NUM:
                res = atoi(numeral);
                get_token();
                return res;
            }
        return res;                        // zero when the factor is missing
    }

... and its mp-counterpart

    void factor(myint res,symset follow) { // constant or (expression)
        if (skip_absent(primer,follow)) {
            fprintf(stderr,"Missing factor at %s\n",symbol_string[token]);
            parse_err++;
            mpz_init(res);                 // similar to a zero return
        } else
            switch (token) {
            case LPAR:
                get_token();
                expr(res,follow | 1<<RPAR);
                get_present(1<<RPAR,follow);
                break;
            case NUM:
                mpz_init_set_str(res,numeral,10);
                get_token();
                break;
            }
    }


The interpreter’s main routine

    int main(){
        myint res;
        printf("Ready for expressions -- Use Ctr-d to exit\n>\n");
        get_token();
        do {
            skip_absent(primer,1<<EOFTOK);
            if (token == EOFTOK) break;
            parse_err = 0;
            res = expr(1<<STOP);
            if (parse_err == 0) printf("%i\n",res);
            printf(">\n"); get_present(1<<STOP,primer | 1<<EOFTOK);
        } while (1);
        return 0;
    }

Illustration: use of the mp-based interpreter

    Ready for expressions -- Use Ctr-d to exit
    >
    9999999999999999999+8888888888;
    10000000008888888887
    >
    -14;
    Skipping unexpected symbols: -
    14
    >
    0-14;
    -14
    >
    8888888888888888 % 777777777;
    342222221
    >
    ...

The interpreter needs improvements for actual use, but this is easy:

– first modify syntax, then the corresponding routines

– use another library, readline, to support line-editing.
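The routines of this section can be condensed into one self-contained sketch that evaluates the same grammar directly from a string. It is a simplification under stated assumptions, not the course code: characters play the role of tokens, values are of type long instead of myint, errors are merely counted (no follower sets, no skipping), and the names match, eval and err are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

static const char *p;                      /* next unread character */
static int err;                            /* number of errors seen */

static long expression(void);

static void skip_blanks(void) { while (isspace((unsigned char)*p)) p++; }

/* Consumes c if it is the next non-blank character. */
static int match(char c) {
    skip_blanks();
    if (*p != c) return 0;
    p++;
    return 1;
}

/* factor = numeral | '(' expression ')' */
static long factor(void) {
    long res = 0;
    if (match('(')) {
        res = expression();
        if (!match(')')) err++;            /* missing right parenthesis */
    } else if (isdigit((unsigned char)*p)) {
        char *end;
        res = strtol(p, &end, 10);
        p = end;
    } else {
        err++;                             /* missing factor */
    }
    return res;
}

/* product = factor { ('*' | '/' | '%') factor } */
static long product(void) {
    long res = factor();
    for (;;) {
        if (match('*')) res *= factor();
        else if (match('/')) { long t = factor(); if (t) res /= t; else err++; }
        else if (match('%')) { long t = factor(); if (t) res %= t; else err++; }
        else return res;
    }
}

/* expression = product { ('+' | '-') product } */
static long expression(void) {
    long res = product();
    for (;;) {
        if (match('+')) res += product();
        else if (match('-')) res -= product();
        else return res;
    }
}

/* Evaluates text; *errors receives the error count (0 on success). */
long eval(const char *text, int *errors) {
    long res;
    assert(text != NULL && errors != NULL);
    p = text;
    err = 0;
    res = expression();
    skip_blanks();
    if (*p != '\0') err++;                 /* unexpected trailing text */
    *errors = err;
    return res;
}
```

Each routine still corresponds to one grammar rule, and the loops mirror the { ... } repetitions of the EBNF. Error recovery with follower sets, as in the routines above, is what this shortened version leaves out.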


Interpreters vs. compilers

A compiler translates programs of a programming language into some other language, in principle into a (binary) machine language program, but in practice into some intermediate representation, e.g. assembly code or object code programs. A binary machine language program can be executed by a computer, and should have an effect as stated by the semantics of the programming language.

An interpreter likewise accepts programs of a programming language, but directly has the effects stated by the semantics of the language. This means that the concepts of interpreters and compilers are intimately related, and this can be used in tools to construct compilers.

Types play an important role in some programming languages and help programmers be aware of mistakes at an early stage. They can play a similar role in definition of concepts. In physics it is common to check equations by formally computing 'dimensions' (i.e. units of measurements) of both sides, which should give the same, and this is closely related to use of types.

One kind of type is somewhat special in this context: the type of functions. In programming languages such types correspond to interfaces (or C-prototypes), and some languages like ML have a type sublanguage that includes function types. Whether to include function types in a type sublanguage of a programming language is an important design issue, but for discussions about programming languages the notion of function types is of great value.

Types of interpreters and compilers

• The type of interpreters: I ≡ E → V, where E is the type of expressions and V of values

• An expression may be p(d) (of type E, of course), with d a constant of type D.
Given an appropriate interpreter I of type E → V, we can obtain I(p(d)) of type V.
We may strive for semantic compositionality such that I(p(d)) = I(p)(I(d)), i.e. I(p) : D → V and I(d) : D, so that V = D → V | D, which is more complicated than it may look. Often one is sloppy and says p : D → V and d : D.

• A simplified type of compilers is: C ≡ E → D → V with programs p ∈ E, where D → V is the type of object programs (and D is the type of inputs)

The complexity associated with separate compilation is disregarded

• A physical machine is essentially an interpreter for (object programs, inputs):

M(C(p), d) = I(p(d))
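The equation can be checked on a toy example. The sketch below assumes a hypothetical miniature language of expressions over a single input d; interpret plays the role of I, compile the role of C (producing stack code as the object program), and machine the role of M. None of these names or representations come from the course material.

```c
#include <assert.h>
#include <stddef.h>

/* E: expressions over one input d (a hypothetical miniature language). */
typedef enum { CONST, INPUT, ADD, MUL } kind_t;
typedef struct expr {
    kind_t kind;
    int value;                             /* used by CONST only */
    const struct expr *left, *right;       /* used by ADD and MUL */
} expr_t;

/* I: evaluates p applied to d directly. */
static int interpret(const expr_t *p, int d) {
    switch (p->kind) {
    case CONST: return p->value;
    case INPUT: return d;
    case ADD:   return interpret(p->left, d) + interpret(p->right, d);
    case MUL:   return interpret(p->left, d) * interpret(p->right, d);
    }
    return 0;
}

/* Object programs: code for a small stack machine. */
typedef enum { PUSH, LOAD, IADD, IMUL, HALT } op_t;
typedef struct { op_t op; int arg; } instr_t;
#define CODE_MAX 64

/* C: emits postfix code for p at code[n...]; returns the next free index. */
static size_t compile(const expr_t *p, instr_t *code, size_t n) {
    if (p->kind == ADD || p->kind == MUL) {
        n = compile(p->left, code, n);
        n = compile(p->right, code, n);
    }
    assert(n < CODE_MAX - 1);              /* leave room for HALT */
    code[n].arg = (p->kind == CONST) ? p->value : 0;
    switch (p->kind) {
    case CONST: code[n].op = PUSH; break;
    case INPUT: code[n].op = LOAD; break;
    case ADD:   code[n].op = IADD; break;
    case MUL:   code[n].op = IMUL; break;
    }
    return n + 1;
}

/* M: the machine interprets (object program, input). */
static int machine(const instr_t *code, int d) {
    int stack[CODE_MAX], sp = 0;
    for (;; code++) {
        switch (code->op) {
        case PUSH: stack[sp++] = code->arg; break;
        case LOAD: stack[sp++] = d; break;
        case IADD: sp--; stack[sp - 1] += stack[sp]; break;
        case IMUL: sp--; stack[sp - 1] *= stack[sp]; break;
        case HALT: return stack[sp - 1];
        }
    }
}

/* Checks M(C(p), d) == I(p(d)) for one program p and one input d. */
int agrees(const expr_t *p, int d) {
    instr_t code[CODE_MAX];
    size_t n = compile(p, code, 0);
    code[n].op = HALT;
    code[n].arg = 0;
    return machine(code, d) == interpret(p, d);
}
```

Note that compile never sees d: its result is an object program of type D → V, and only the machine combines that program with an input.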


Conceptual parts of interpreters

• Typical tasks performed by an interpreter can be broken down into:

– Scanner: lexical analysis (identify tokens like numerals and names)

– Parser: syntax analysis (build an AST: an abstract syntax tree)

– Static semantic analysis (e.g. type checking)

– Code generation (possibly just use the AST)

– Execution: dynamic semantics (decode and dispatch over the ’code’)

• Construction

– Tools may be too complex for simple parsers and scanners.

– Recursive descent parsers are easy to write. An abstract syntax makes it easy to build an AST. An abstract syntax is typically a union that matches the concrete syntax.

– A scanner is essentially an FSM and may be easy to write directly.

– Type checking can be quite complex, but should not be dropped.
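That a scanner is essentially an FSM can be illustrated by a hand-written one. The sketch below is an assumption-laden example, not the scanner used in the course: the token kinds, the three states, and the accepted operator characters are chosen only to fit the expression examples in these notes.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { T_EOF, T_NUM, T_NAME, T_OP, T_ERR } tok_t;
typedef enum { START, IN_NUM, IN_NAME } state_t;

/* Scans one token from *src, advances *src past it, and stores the lexeme
 * in text (capacity cap). A lexeme longer than cap-1 characters is split. */
tok_t scan(const char **src, char *text, int cap) {
    const char *s = *src;
    state_t state = START;
    tok_t kind = T_EOF;
    int n = 0, done = 0;
    assert(cap > 1);
    while (!done) {
        unsigned char c = (unsigned char)*s;
        switch (state) {
        case START:
            if (c == '\0')       { kind = T_EOF; done = 1; }
            else if (isspace(c)) { s++; }                      /* skip blanks */
            else if (isdigit(c)) { state = IN_NUM;  kind = T_NUM; }
            else if (isalpha(c)) { state = IN_NAME; kind = T_NAME; }
            else {                                   /* one-character token */
                kind = strchr("+-*/%();", c) ? T_OP : T_ERR;
                text[n++] = (char)c;
                s++;
                done = 1;
            }
            break;
        case IN_NUM:                                 /* digits of a numeral */
            if (isdigit(c) && n < cap - 1) { text[n++] = (char)c; s++; }
            else done = 1;
            break;
        case IN_NAME:                                /* letters and digits */
            if (isalnum(c) && n < cap - 1) { text[n++] = (char)c; s++; }
            else done = 1;
            break;
        }
    }
    text[n] = '\0';
    *src = s;
    return kind;
}
```

The state variable and the switch are the whole machine: each case is a state, and each branch is a transition on a class of input characters.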

Tools for construction of interpreters

• lex and yacc, or flex and bison, are quite complex

– These require internal actions to be written in C

– Similar tools may exist if C is not your language of choice (Java, ML,...)

– Error reporting and error recovery are often problematic.

– Invariably interpreters and semantics depend on another language

– Compiled languages likewise depend on assembly/machine code

• A locally developed tool, called dulce, uses a different, new approach:

– A scanner, parser, and type checking exist for a suitable pre-language, i.e. a language with little semantics, but a fixed pattern of syntax.

– Actual languages are defined relatively as sublanguages with semantics, so sublanguages share syntax and type checking, but vary in operations.

– Semantics consists of highly independent components written as C-routines, all with substantial support by use of make.

– Semantics may also be defined by combination of known operations.

– One might use Lisp in a similar way for both prelanguage and semantics.


Breakpoint interpretation

Programmers often use a debugger during program development. A debugger depends on breakpoints inserted in a program by the developer. With an appropriate compiler, the meaning of a breakpoint is an interpreter executing in the context of the position of the breakpoint, i.e. variable and operation names are recognised by the interpreter and associated with their meaning in the context.

The language accepted by the interpreter at a breakpoint is most likely the same as the language accepted by the compiler, or at least a substantial subset of it. If the notion of a breakpoint is combined with an interpreter for a programming language, the interpreter at a breakpoint will appear as being an extension of the original language. It implies that the notion of a breakpoint is not necessarily associated with debugging of programs, but can be seen as a means to implement language extensions. Users of a language extended in this way may perceive the interpreter as any other interpreter for a language.

A tool, dulce, exists for construction of interpreters that accept languages with breakpoints as described. It has been developed to demonstrate that interpreters (and compilers) may be constructed without the need for developers to care about scanners and parsers, as well as a number of other aspects of language implementation. It has been used in applications that have been presented in the course, and will be detailed in the next section.

A very small dulce interpreter

Example: extend a predefined set of operations and allow all to be used.

    { DEF auxdiv(x:[int,int,int]):int { ... };
      DEF fibdiv(Dd:int,Dr:int):int { auxdiv((Dd,Dr,1)) };
      DEF fibmod(Dd:int,Dr:int):int { auxdiv((Dd,Dr,0)) };
      loop{INTERPRET};   # semantics of INTERPRET is like a breakpoint
    };

The above is itself a program to be interpreted by a dulce interpreter. It defines the semantics of three new operations in terms of some predefined ones. The form of auxdiv prepares for a later version, which is able to translate into IJVM assembly code. Semantics of predefined operations is ultimately written as C routines. The tool provides extensive help to write them. A possible user's session is:

    fib> fibdiv(109,22);
    4
    fib> fibmod(109,22);
    21
    fib> auxdiv((109,22,1));
    4


Translation into FSM-like control

A compiler can be broken down into essentially the same components as an interpreter. The difference is that code generation has to be perceived in a more concrete way: it may be text that an assembler (which is just another kind of compiler) has to manipulate further. Eventually, the output will have a form that can be interpreted by a computer considered as an interpreter of its machine operations (i.e. the specific bit patterns that cause the computer to perform one of its native operations).

Programs are composed of computations and control, i.e. arithmetic and branching. The split is reflected in the FSMD model of hardware: computations are performed by the Datapath (D) component, the control by a Finite State Machine (FSM) component. In assembly code the computations are expressed in terms of a subset of operations that contains no jumps other than routine calls, whereas control is expressed in terms of jumps that may depend on a recorded status (from comparisons, for instance).

Companion notes describe issues related to translation of computations into various kinds of 'code': stack operations, RISC-type assembly instructions, and specialised hardware described as data flow. Here the focus is on control, and for illustration a simple language will be used that abstracts away from the details of computations.

A dulce interpreter to extract control and basic blocks

A more complex interpreter exists that can be used to extract basic blocks from programs written in a language with just if and while statements. Although more complex, it is still described as a program in a language for a dulce interpreter.

‘Syntax’

    [ cmd(S:string):Cmd ]
    [ test(S:string):Test ]
    [ if (Cond:Test) [True:Cmd] [False:Cmd] ]
    [ while (Cond:Test) [Body:Cmd] ]

Sample program text (GCD)

    { cmd("x := a"); cmd("y := b");
      while (test("x != y")) {
        if (test("x < y"))
        then{ cmd("y := y-x"); }
        else{ cmd("x := x-y"); }
      }
    };

The syntax suffices to describe if and while as operations.
The general syntax pattern allows them to be used as in the program.
Expressions appear as strings, because their internals are uninteresting here.
The operations cmd and test just serve to classify expressions.


The interpreter in action

The GCD example above can in a first step be translated into

    FSMD(
      ring([(0,ring(["x := a","y := b","x != y"])),
            (1,ring(["x < y"])),
            (3,ring(["y := y-x"])),
            (4,ring(["x := x-y"])),
            (5,ring(["x != y"])),
            (2,ring(["exit"]))]),
      ring([(0,1,2),(5,1,2),(1,3,4),(4,5,-1),(3,5,-1)]));

This can be read as one composite value consisting of two parts:

1. An enumeration of basic blocks represented as lists of strings (organised to allow easy reversal).

2. A representation of the control graph as a list of nodes, each connected to two others (with -1 denoting absence of a node).

Basic blocks presented as in assembly code

A final step can then produce a more conventional description of an FSMD:

    'Code' for action A_Sk                  Control jumps
    ('invoked' in state Sk)

    A_S0:                                   # entry point
      x := a;
      y := b;
      x != y;                               je  A_S2
    A_S1:
      x < y;                                jgt A_S4
    A_S3:
      y := y-x;                             j   A_S5
    A_S4:
      x := x-y;
    A_S5:
      x != y;                               jne A_S1
    A_S2:
      exit;                                 halt

Labelled basic blocks are presented as in assembly code, to which jumps of various kinds have been added manually to express control.


Control expressed as a FSM

    Operation: (A_Sk) ==> (A_Si)
               with Si = NextState(Sk,Status(A_Sk))
               starting from S0

    NextState:
      S0 ==> S1, when Status(A_S0)=1, else S2
      S5 ==> S1, when Status(A_S5)=1, else S2
      S1 ==> S3, when Status(A_S1)=1, else S4
      S4 ==> S5
      S3 ==> S5
      S2 ==> halt

This is presented as a Moore-type FSM, with actions associated with nodes. However, the same result obtained from the first step can also be translated into a Mealy-type machine, which is the case in a later, and more elaborate, version of the interpreter.

Simple semantics

The interpreter maintains a state that holds a current basic block, bb.r. The current basic block is saved and a new one initialised when savebb is used.

Control structure information is saved in graph.r, which also belongs to the state and holds a list of elements

    (basic block, destination_true, destination_false)

each corresponding to a graph node with two outgoing edges. A variable curlab.r holds an identification of a label (or node) in graph.r.

The dispatcher-operation that matches cmd (and test)

    # cmd [ cmd(S:string):Cmd ]
    { S :: bb.r }   # extend the current basic block

    # test [ test(S:string):Test ]
    { S :: bb.r }   # extend the current basic block

In both cases the operations are defined in terms of S, which denotes a string. The representation of a basic block is thus simply a list of strings.


The extractor’s while-operation

    # while [ while (Cond:Test) [Body:Cmd] ]
    { new (2) S_w;                  # reserves two integers for labels
      Cond;
      with (savebb) init;
      curlab.l := S_w.first;
      Body; Cond;
      savebb;
      (curlab.r,S_w.first,S_w.first+1) :: graph.r;
      (init.val,S_w.first,S_w.first+1) :: graph.r;
      curlab.l := S_w.first+1;
    };

Two labels (or node identifiers) are needed to represent the control structure of a while, so two are reserved 'at entry', i.e. each time a while-operation is seen. This corresponds to a marking of the syntax as proposed in another note, i.e.

    while (Cond / ) { . Body }; .

The code of the condition is included in whatever basic block is current at entry, but then saved and a new one started. Saving a basic block by savebb implies that the value of a current label is returned, so that in this case it can be referenced as init.val. Please remember that the marking patterns for basic blocks to some degree express a person's choice about, for instance, whether to consider the condition of a while as a basic block by itself.

The program for while refers to the condition and the computation to iterate by the names Cond and Body, respectively. Both could be perceived as routines that require no arguments, i.e. a reference to Cond might in an equivalent C program have to be expressed as Cond(). How to write such calls in a language is a minor language decision.

Each reference to Cond, and likewise to Body, will contribute to an internal state and eventually to the resulting code.

Subsequently, the Body and the Cond are analysed, i.e. the state is updated with information about their basic blocks and control structures. Before that, however, one of the two reserved labels is assigned as the current label. After the analysis, the current basic block is saved and graph.r is updated. Finally the current label is set to the other of the two reserved labels.

Updating graph.r adds a triple of labels to a list, which represents a state transition graph of Moore type. The first element of a triple is associated with a basic block that computes a condition for choosing one of the two other labels as 'next state'. In this case the updates tell that after either evaluation of Cond the next state can be the entry to the Body or one that will be determined later.

The association of a label with a basic block is actually done by savebb, which saves the current basic block and associates it with the value of curlab.r when it was initialised by a previous use of savebb.

Note that savebb, like Cond and Body, behaves like a call to a C-routine with no arguments.

The if-operation

This is made slightly more complex to illustrate a first approach at optimisation. A more extensive optimisation has been used in a successor (Zebra) to the present version, so the optimisation is not really justified by practice. Anyway, it is retained and its purpose is to avoid excessive use of labels when at least one of the branches in an application of if is empty.

The optimisation is realised by a variant of savebb, called save_nonempty, which requires two label arguments. Sometimes there is no need to save the current basic block, and one of the two labels given as arguments will be returned accordingly. The ordinary case is to save and return the label given as first argument, so the unoptimised version would just replace the two uses of save_nonempty by savebb.

    # if
    { new(3) S_if;
      Cond;
      with (savebb) init;
      curlab.l := S_if.first;
      True;
      with (save_nonempty(S_if.first,S_if.first+2)) outT;
      curlab.l := S_if.first+1;
      False;
      with (save_nonempty(S_if.first+1,S_if.first+2)) outF;
      curlab.l := S_if.first+2;                  # save start of current BB
      (init.val,outT.val,outF.val) :: graph.r;   # add node to FSM
    }
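The while- and if-operations can be mimicked in C to see how the triples of the GCD example arise. This is a sketch under several assumptions and not the dulce implementation: Cond, Body, True and False are passed as parameterless routines, basic blocks are plain strings, save_nonempty is assumed to add a node with one outgoing edge to the join label, and the final "exit" block is added by hand. Nodes are appended to an array, so they appear in the reverse order of the list built with :: above.

```c
#include <assert.h>
#include <string.h>

#define MAXN 16
#define BBLEN 128

typedef struct { int id, t, f; } node_t;   /* (basic block, dest_true, dest_false) */
typedef void (*part_t)(void);              /* Cond, Body, True, False as routines */

static char blocks[MAXN][BBLEN];           /* saved basic blocks, indexed by label */
static char bb[BBLEN];                     /* the basic block being built */
static node_t graph[MAXN];                 /* nodes in the order they are added */
static int nnodes, curlab, nextlab;

static int new_labels(int n) { int first = nextlab; nextlab += n; return first; }

static void add_node(int id, int t, int f) {
    assert(nnodes < MAXN);
    graph[nnodes].id = id; graph[nnodes].t = t; graph[nnodes].f = f;
    nnodes++;
}

/* cmd and test both just extend the current basic block. */
static void cmd_op(const char *s) {
    assert(strlen(bb) + strlen(s) + 2 <= BBLEN);
    strcat(bb, s);
    strcat(bb, ";");
}
static void test_op(const char *s) { cmd_op(s); }

/* Saves the current block under the current label and starts a new one. */
static int savebb(void) {
    int lab = curlab;
    assert(lab >= 0 && lab < MAXN);
    strcpy(blocks[lab], bb);
    bb[0] = '\0';
    return lab;
}

/* Assumed behaviour: an empty block is not saved and join is returned. */
static int save_nonempty(int lab, int join) {
    if (bb[0] == '\0') return join;
    add_node(savebb(), join, -1);
    return lab;
}

static void while_op(part_t cond, part_t body) {
    int first = new_labels(2), init;
    cond();
    init = savebb();
    curlab = first;
    body(); cond();
    savebb();
    add_node(curlab, first, first + 1);
    add_node(init, first, first + 1);
    curlab = first + 1;
}

static void if_op(part_t cond, part_t yes, part_t no) {
    int first = new_labels(3), init, outT, outF;
    cond();
    init = savebb();
    curlab = first;
    yes();
    outT = save_nonempty(first, first + 2);
    curlab = first + 1;
    no();
    outF = save_nonempty(first + 1, first + 2);
    curlab = first + 2;
    add_node(init, outT, outF);
}

/* The GCD sample program, written as calls. */
static void c_ne(void)   { test_op("x != y"); }
static void c_lt(void)   { test_op("x < y"); }
static void sub_y(void)  { cmd_op("y := y-x"); }
static void sub_x(void)  { cmd_op("x := x-y"); }
static void body(void)   { if_op(c_lt, sub_y, sub_x); }

void extract_gcd(void) {
    memset(blocks, 0, sizeof blocks);
    bb[0] = '\0'; nnodes = 0; curlab = 0; nextlab = 1;
    cmd_op("x := a"); cmd_op("y := b");
    while_op(c_ne, body);
    cmd_op("exit");                        /* final block, added by hand */
    savebb();
}
```

After extract_gcd(), the array graph holds (3,5,-1), (4,5,-1), (1,3,4), (5,1,2), (0,1,2), and blocks[0] to blocks[5] hold the six basic blocks of the example.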

Overall organisation of the interpreter

The interpreter for the basic block extractor is defined in terms of operations for which semantics already exists, so it does not differ much from an ordinary program. Some details, especially about semantics, have been described, but of course there are several others. Initialisation and definition of auxiliary operations used internally, such as savebb, have only been described informally. But the structure of the entire program can now be outlined.

Descriptions of if and while operations differ slightly from those shown above. The T in their descriptions denotes some unknown type, so every if operation must have two branches that result in the same type. In other words: this if-operation combines the qualities of C's if-statement and its ?-expression. The role of T in the description of while is to let the type of the iterated part be any type; it will be discarded anyway.

This can be an advantage sometimes, but is not so for the LIJVM language, for instance. It all depends on the role a language designer decides that types should have.

    program{ # ...
      DEF basicblocks
        [ Syntax OF Cmd, Test
          [ cmd(S:string):Cmd ]
          [ test(S:string):Test ]
          [ if OF T(Cond:Test)[True:T][False:T] ]
          [ while OF T(Cond:Test)[Body:T] ] ]
      { ####### (... initialisation); definition of semantics #############
        var bb; bb.l := ring([]);          # the basic block being built
        var graph; graph.l := ring([]);    # branch nodes: (id,left,right)
        var curlab; curlab.l := S.first;   # label of current BB
        Syntax                             # NB: decode and dispatch
          cmd: { S :: bb.r }               # extend the current BB
          test: { S :: bb.r }              # ditto
          # if ...
          # while ...
        ################ output of the 'state' ############################
        # ...
      };
      ######################### application ##################################
      loop{ basicblocks(INTERPRET); };
    };

This interpreter illustrates actions of a traditional compiler. The interpreted language with sufficiently expressive, predefined operations is used to write a program that builds an internal representation (an abstract syntax tree) of the program and generates code for some target language from it.

Gezel code can be generated from a language that similarly abstracts from the details of computations. This is done by a more elaborate version of the interpreter, called Zebra.

Abstraction from computations can be convenient, but is in general unrealistic. The next section tells about an interpreter that translates programs into IJVM assembly code without depending on such an abstraction.


LIJVM and its compiler

The IJVM computer does not provide the usual support for multiplication, division, and shift operations, which makes it hard to write multiplication and division routines. As shown in a companion note, it can be done by use of Fibonacci numbers. A special high-level language, LIJVM, allows these to be expressed, and a compiler for it can translate into assembly code for IJVM.

The compiler for LIJVM has been written to illustrate yet another implementation method for compilers. It relates to a concept from the theory of compilers, called abstract interpretation, but it is rarely used in actual construction. It is introduced to illustrate the similarities and differences between compilers and interpreters as well as to stimulate interest in theoretical aspects of compiler construction.

Interpretation vs. compilation

Simple interpretation of constants and expressions

demo> 44;

44

demo> 67+22;

89

Semicolon terminates an expression.
A constant denotes its value.
A composite expression is evaluated to a value.

Compilation of constants and expressions

lijvm> { 44; };

.constant

objref 0xCAFE

.end-constant

.main

BIPUSH 44

POP

.end-main

{ 44; }; is a complication to be justified by the next example.
Assembly code is the value of an expression!
BIPUSH 44 is the actual value of 44, and POP is caused by the semicolon!

Types may represent many properties

lijvm> 44;

----- Error: Programs must have type Void

The semicolon in this case just indicates the end of an expression, which might otherwise have continued on the next line. Type semantics can be more complex than just a simple categorisation of values. In LIJVM the type Void is used to ensure stack consistency through a compile-time check. It implies that constants are not accepted as complete programs, but have to be wrapped as illustrated above.


Operators combine assembly code snippets

lijvm> {67+22;};

.constant

objref 0xCAFE

.end-constant

.main

BIPUSH 67

BIPUSH 22

IADD

POP

.end-main

{67+22;}; is an expression to translate.
.constant ... .main is the required opening.
BIPUSH 67 is the value of 67.
BIPUSH 22 is the value of 22.
IADD combines them by the semantics of +.
POP is the semicolon consuming a value from the stack.

If we disregard the wrapping of the essential part of the assembly code, we find that the code agrees completely with the expectations according to Section 5.4.8 in Tanenbaum’s Structured Computer Organization. The association to expression interpretation has been emphasised in the lecture Hardware oriented programming.
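The contrast between the demo and lijvm sessions can be sketched in Python: the same expression tree is given two meanings, one producing a number and one producing assembly code. The tree representation (an int, or a tuple of operator, left, and right) and all function names are assumptions made for this sketch; they are not part of the LIJVM compiler.

```python
import operator

# Hypothetical tree shape: an int constant, or (operator, left, right).
APPLY = {"+": operator.add, "-": operator.sub}
OPCODES = {"+": "IADD", "-": "ISUB"}


def evaluate(expr):
    """Interpretation: the value of an expression is a number."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return APPLY[op](evaluate(left), evaluate(right))


def translate(expr):
    """Compilation: the value of an expression is assembly code.

    Constants are assumed small enough for BIPUSH in this sketch.
    """
    if isinstance(expr, int):
        return f"    BIPUSH {expr}\n"
    op, left, right = expr
    return translate(left) + translate(right) + f"    {OPCODES[op]}\n"


def program(expr):
    """Wrap a snippet as in the sessions above; POP plays the role of ';'."""
    return (".constant\nobjref 0xCAFE\n.end-constant\n"
            ".main\n" + translate(expr) + "    POP\n.end-main\n")


tree = ("+", 67, 22)
print(evaluate(tree))
print(program(tree), end="")
```

Both functions have the same recursive structure; only the domain of values differs, which is the similarity between interpreters and compilers that this section sets out to illustrate.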

Details of compilation

Below is a small section of program text from an LIJVM compiler, focusing on translation of expressions as illustrated. The parts labelled add: and sub: express how the operators + and -, respectively, are translated. They both refer to their operands as left and right.

Members

# ---------- Conversion of integer constants to code

int: { if(X<255) { enc(" BIPUSH " X "\n") }

else {

obj_count.l += 1; objects.l ::= X;

enc(" LDC_W c" obj_count.r "\n")

}

}

# ---------- Arithmetic, infix operators: + -

add: { enc(((dec_1(left)) dec_1(right) " IADD\n")) }

sub: { enc(((dec_1(left)) dec_1(right) " ISUB\n")) }

The enc routine maps strings into the type of assembly code.

dec_1 maps assembly code to strings.

Juxtaposition of strings, "abc" "def", is concatenation (Java: "abc" + "def").
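The handling of constants in the int: member can be sketched in Python as follows. The test x < 255 is copied from the member; the class name, the assumption that numbering starts at c1, and the rendering of the collected entries as "name value" lines are hypothetical and only serve the illustration.

```python
class ConstantPool:
    """Collects the constants that are loaded with LDC_W (hypothetical class)."""

    def __init__(self):
        self.objects = []  # plays the role of `objects` in the member above

    def int_const(self, x):
        """Return the assembly snippet that pushes the constant x."""
        if x < 255:  # same test as in the int: member
            return f" BIPUSH {x}\n"
        self.objects.append(x)  # obj_count corresponds to len(self.objects)
        return f" LDC_W c{len(self.objects)}\n"

    def declarations(self):
        """Assumed rendering of the collected entries, one per line."""
        return "".join(f"c{i} {value}\n"
                       for i, value in enumerate(self.objects, start=1))


pool = ConstantPool()
snippet = pool.int_const(44) + pool.int_const(1000) + " IADD\n"
print(snippet, end="")
print(pool.declarations(), end="")
```

Translating a constant has a side effect on the pool, which is why the member updates obj_count and objects before it produces the LDC_W instruction.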


Corresponding description of syntax

[ Members OF Void, Cond

[ int(X:int): int Class ] # repr. of constants

[ +(int Class,int Class): int Class ] # yes: addition

[ -(int Class,int Class): int Class ] # subtraction

...

[ while (C: Cond) [ Body ]: Void ] # A usual while loop

...

[ ‘;’ (int Class,NONE): Void ] # pops a value

[ ‘;’ (Void,NONE): Void ] # no action

...

]

Syntax is defined like an ‘interface’ of ‘member operations’.

Assembly code that leaves an int on the stack has type int Class.

Assembly code that consumes an int from the stack has type Void.

Hack: int(X:int) is invoked by default on integer constants in the source text!

Hack: semicolon may be defined as an explicit operation.
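How such signatures give a compile-time check of stack consistency can be sketched in Python. The tree shapes, the "block" node standing for the { ... } wrapping, and the names CheckError, type_of, and check_program are assumptions made for this sketch; only the type names and the error message come from the notes.

```python
INT_CLASS = "int Class"  # code that leaves an int on the stack
VOID = "Void"            # code accepted where nothing may be left behind


class CheckError(Exception):
    """Raised when a tree does not type check (hypothetical name)."""


def type_of(node):
    """Compute the type of a tree by following the signatures listed above."""
    if isinstance(node, int):
        return INT_CLASS                  # int(X:int): int Class
    tag = node[0]
    if tag in ("+", "-"):                 # (int Class, int Class): int Class
        if type_of(node[1]) == INT_CLASS and type_of(node[2]) == INT_CLASS:
            return INT_CLASS
        raise CheckError(f"operands of {tag} must have type {INT_CLASS}")
    if tag == ";":                        # accepts int Class or Void, gives Void
        type_of(node[1])
        return VOID
    if tag == "block":                    # assumed form of the { ... } wrapping
        for statement in node[1]:
            if type_of(statement) != VOID:
                raise CheckError("block members must have type Void")
        return VOID
    raise CheckError(f"unknown construct {tag!r}")


def check_program(node):
    """Reject programs that would leave values on the stack."""
    if type_of(node) != VOID:
        raise CheckError("Programs must have type Void")


check_program(("block", [(";", ("+", 67, 22))]))  # accepted
try:
    check_program(44)                             # a bare constant
except CheckError as error:
    print("----- Error:", error)
```

The check never runs the program; it only combines types, so an unbalanced stack is detected before any assembly code is executed.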

Semantics of while

The interpretation of a while statement refers to the condition and the part to iterate as C and Body, respectively. Otherwise it is just slightly more complex than addition. It appears as just another labelled part under Members.

while: { with (lab.r) fst; lab.l += 2;

TFstack.l(TFtop.r) := (fst.val,fst.val+1);

TFtop.l += 1;

enc(

dec_1(C) # Condition snippet (see below)

"L" fst.val "\n" # Iteration label

dec_1{Body; Nop.val} # Body snippet

" GOTO L" fst.val

"\nL" (fst.val+1) "\n" # Termination label

)

}

An assembly code snippet for a condition ends with a conditional jump.

A false condition jumps to the second of a pair of labels on TFstack.

‘Boolean operators’ (e.g. negation) may use the TFstack in a clever way.
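The label bookkeeping of the while: member can be imitated in Python. The order of the snippets (condition, iteration label, body, GOTO, termination label) is copied from the member. The class and method names are hypothetical, and so is the assumption that the condition snippet removes its label pair from the stack when it is generated.

```python
class Translator:
    """Imitates the label and TFstack bookkeeping (hypothetical class)."""

    def __init__(self):
        self.lab = 0        # next unused label number, `lab` in the member
        self.tf_stack = []  # pairs (first label, second label), as TFstack

    def not_equal(self, left, right):
        """Condition snippet for left != right.

        It ends with a conditional jump to the second label of the pair
        on top of the stack, taken when the condition is false.
        """
        _, false_label = self.tf_stack.pop()
        return left + right + f" IF_ICMPEQ L{false_label}\n"

    def while_(self, cond, body):
        """cond is a callable, so the pair is pushed before it is translated."""
        fst = self.lab
        self.lab += 2
        self.tf_stack.append((fst, fst + 1))
        return (cond()              # condition snippet
                + f"L{fst}:\n"      # iteration label
                + body              # body snippet
                + f" GOTO L{fst}\n"
                + f"L{fst + 1}:\n")  # termination label


def push(n):
    return f" BIPUSH {n}\n"


t = Translator()
body = push(1) + push(2) + " IADD\n POP\n" + push(4) + " POP\n"
code = t.while_(lambda: t.not_equal(push(1), push(0)), body)
print(code, end="")
```

Passing the condition as a callable mirrors the order of events in the member: the label pair has to be on the stack before the condition snippet is produced.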


Translation example

lijvm 1> while (1 != 0) {1+2; 4;};
.constant
objref 0xCAFE
.end-constant
.main
// while
BIPUSH 1
BIPUSH 0
IF_ICMPEQ L1
L0: // do
BIPUSH 1
BIPUSH 2
IADD
POP
BIPUSH 4
POP
GOTO L0
L1: // end-while
.end-main

Abstract interpretation in general

• Constants (and computed values) belong to particular ‘abstract domains’.

• Type checking is abstract interpretation with a one-value domain for every type, e.g. 4 + 6 is v_int + v_int, which is interpreted as v_int.

• The values ‘type correct’ and ‘type incorrect’ are program analysis results.

• Other kinds of program analysis results may be obtained by abstract interpretation.
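The points above can be sketched in Python by running one evaluator over several domains. The evaluator, the tuple representation of expressions, and the sign domain are assumptions chosen for illustration; the string "v_int" stands for the single abstract value of type int.

```python
def interpret(expr, constant, plus):
    """Evaluate an expression tree in the domain given by constant and plus."""
    if isinstance(expr, int):
        return constant(expr)
    op, left, right = expr
    if op != "+":
        raise ValueError("this sketch only knows +")
    return plus(interpret(left, constant, plus),
                interpret(right, constant, plus))


# Concrete domain: ordinary integers.
concrete = (lambda n: n, lambda a, b: a + b)

# Type domain: one abstract value for the type int.
V_INT = "v_int"


def type_plus(a, b):
    return V_INT if a == b == V_INT else "type incorrect"


types = (lambda n: V_INT, type_plus)


# Sign domain: another kind of program analysis (hypothetical example).
def sign_const(n):
    return "neg" if n < 0 else "zero" if n == 0 else "pos"


def sign_plus(a, b):
    if a == "zero":
        return b
    if b == "zero":
        return a
    return a if a == b else "unknown"


signs = (sign_const, sign_plus)

tree = ("+", 4, 6)
print(interpret(tree, *concrete))
print(interpret(tree, *types))
print(interpret(tree, *signs))
```

Replacing the pair of functions changes the analysis without touching the evaluator, in the same way that the LIJVM compiler replaces numbers by assembly code.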
