Chapter 11: Language Translation
A program written in any of the high-level languages must be translated into machine language before execution by a special piece of software called a compiler.
Compared to assemblers, compilers are very difficult to design; many person-years have to be spent on them.
One machine language instruction corresponds to one assembly language instruction. Hence, an assembler really just replaces one thing with another, with the help of an OpCode table and a symbol table, as we went through in Lab 9.
1
What will we do here?
We will see here how to translate a high-level language program into one in assembly language.
One high-level language statement may lead to many machine language instructions. For example, the following Java statement
a=b+c-d;
corresponds to the following four instructions
LOAD B
ADD C
SUBTRACT D
STORE A
To generate the corresponding instructions, a compiler must do a thorough analysis of the structure (syntax) and meaning (semantics) of the involved program, which is very complicated and difficult.
We will do some digging here...
2
What should a compiler do?
When performing translation, the foremost goal is to be correct: the generated machine language program must do exactly what the original program does, no more, no less.
For example, the following machine code
LOAD B
ADD C
STORE B
SUBTRACT D
STORE A
does not correctly translate the statement
A=B+C-D
Question: Why?
Answer: It would destroy the original data in
B.
You should not destroy your input.
3
The second goal is that the resulting machine code should be efficient and concise: it has to be fast.
For example, to sum up 2x1 + 2x2 + · · · + 2x50000, consider the following poorly written Java program:
sum=0.0;
for(i=1; i<=50000; i++)
sum=sum+(2.0*x[i]);
the compiler should generate more efficient code, as if it had been based on the following:
sum=0.0;
for (i=1; i<=50000; i++)
sum=sum+x[i];
sum=2.0*sum;
The second version does just one multiplication, while the first would do 50,000 of them.
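The two loops can be compared directly in Java. This is a tiny self-contained sketch; the array and its values are made up for illustration:

```java
public class SumExample {
    // Both methods compute 2*x[1] + ... + 2*x[n]; the second does one
    // multiplication instead of n of them.
    public static double slowSum(double[] x) {
        double sum = 0.0;
        for (int i = 1; i < x.length; i++)
            sum = sum + (2.0 * x[i]);          // one multiply per iteration
        return sum;
    }

    public static double fastSum(double[] x) {
        double sum = 0.0;
        for (int i = 1; i < x.length; i++)
            sum = sum + x[i];
        return 2.0 * sum;                       // a single multiply at the end
    }

    public static void main(String[] args) {
        double[] x = {0.0, 1.0, 2.0, 3.0};      // x[0] unused, as in the 1-based loop
        System.out.println(slowSum(x) + " " + fastSum(x)); // 12.0 12.0
    }
}
```

Both versions produce the same sum, which is exactly why the compiler is allowed to substitute one for the other.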
4
What is the process?
There are generally four phases in the translation process of a compiler.
1. Lexical analysis: The compiler looks at the
individual characters in the source program and
groups them into syntactic units, called tokens.
2. Parsing: The sequence of tokens is checked to see if it forms a syntactically correct program according to the rules of the programming language.
3. Semantic analysis and code generation: The compiler analyzes the meaning of the program and generates the proper code.
4. Code optimization: The compiler tries to make the just-generated code more efficient.
5
I want to see... in two ways
A big picture...
and a smaller one:
6
Lexical analysis
In this first phase, a lexical analyzer, part of the compiler, reads in a sequence of characters and groups them into tokens.
For example, for the following Java statement
a = b + 319 - delta;
based on the individual characters, such as (space), a, =, b, +, 3, 1, 9, -, d, e, l, t, a, and ;, the analyzer forms the following eight tokens:
a, =, b, +, 319, -, delta, ;.
From now on, the compiler can work at the
level of symbols, numbers, and operators.
Some languages, including Java, actually provide a tokenizer for free.
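The grouping step can be sketched in a few lines of Java. This is a simplified tokenizer written just for this example, not the one a real compiler would use:

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    // Group characters into tokens: names, numbers, and single-character operators.
    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                               // skip blanks between tokens
            } else if (Character.isLetter(c)) {    // a name: letters run together
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) i++;
                tokens.add(source.substring(start, i));
            } else if (Character.isDigit(c)) {     // a number: digits run together
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                tokens.add(source.substring(start, i));
            } else {                               // operators and punctuation: one char each
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The statement from the example yields exactly eight tokens.
        System.out.println(tokenize("a = b + 319 - delta;"));
        // prints [a, =, b, +, 319, -, delta, ;]
    }
}
```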
7
Token classification
Besides forming tokens, the analyzer also tries to categorize them, to facilitate later translation.
For example, all names will be assigned category 1, while all numbers will be assigned category 2, etc. We can have the following table.
Token Type    Classification
symbol        1
number        2
=             3
+             4
-             5
;             6
==            7
if            8
else          9
(             10
)             11
8
Why do we classify?
If we look at the bigger picture, we only care about what occurs where, in order to have a correct statement.
For example, the following is a legal assignment, no matter what symbols are used and what value the number has.
‘‘symbol’’ = ‘‘symbol’’ + ‘‘number’’;
To summarize, the input to a lexical analyzer
is a high-level language statement from the
source program. Its output is a list of all the
tokens contained in the program, as well as
their classifications, represented as numbers.
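The classification step can be sketched as a simple lookup against the table on Page 8. A minimal illustration; the category numbers are the ones from that table:

```java
import java.util.Map;

public class Classifier {
    // Classification numbers from the table on Page 8.
    static final Map<String, Integer> FIXED = Map.of(
        "=", 3, "+", 4, "-", 5, ";", 6, "==", 7,
        "if", 8, "else", 9, "(", 10, ")", 11);

    public static int classify(String token) {
        if (FIXED.containsKey(token)) return FIXED.get(token);
        if (token.chars().allMatch(Character::isDigit)) return 2; // number
        return 1;                                                 // symbol (a name)
    }

    public static void main(String[] args) {
        // The running example x = x + y + z; from Page 11.
        for (String t : new String[] {"x", "=", "x", "+", "y", "+", "z", ";"})
            System.out.println(t + " " + classify(t));
    }
}
```

Note that keywords such as if and else are checked against the fixed table before the name rule, so they do not fall into category 1.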
Homework: Exercises 1–3
9
An example
Given the following input statement in Java:
if (a==b) then x=13 else x=20;
the analyzer will generate the following output,
based on the table as given on Page 8:
Token Type    Classification
if            8
(             10
a             1
==            7
b             1
)             11
x             1
=             3
13            2
else          9
x             1
=             3
20            2
;             6
Homework: Exercise 4 (a), (c).
10
A running example
Given the following input statement
x = x + y + z;
the analyzer will generate the following output
of the lexical analysis phase:
Token Type    Classification
x             1
=             3
x             1
+             4
y             1
+             4
z             1
;             6
This statement will be our running example to demonstrate the four-phase translation process as summarized on Page 6, all the way through to the end on Page 37.
Question: Now that the lexical analysis is
done, then what?
11
Parsing
During the parsing phase, a compiler determines whether the recognized tokens fit together in a grammatically correct way, i.e., whether the statement is syntactically legal in the programming language.
For example, the following assignment statement x = y + z; is a legal statement, as we can construct the following parse tree.
Such a tree shows how those tokens are grouped
together.
12
Grammars, languages and BNF
To enable the compiler to parse a program, we have to present a grammar, i.e., the syntax rules of the programming language, usually represented in Backus-Naur Form (BNF).
The following BNF rule
<assignment statement>::=<symbol>=<expression>
corresponds to the following parse tree:
We will talk about this stuff a lot more in
CS3780 Intro. to Computational Theory.
13
A BNF rule consists of two parts:
LHS ::= RHS.
It means that the tokens represented by the RHS produce the token represented by the LHS. For example, the sequence
<symbol>=<expression>;
produces an assignment statement.
In a BNF rule, the LHS is always a nonterminal, a category used to explain and organize the language, e.g., <symbol>; the RHS, on the other hand, may use two different types of objects, terminals as well as nonterminals.
Terminals refer to the actual tokens recognized and returned by the lexical analyzer, e.g., ;, =, if, else, etc., which are not further defined by other rules of the grammar.
14
What is a language?
Any programming language is specified by a fixed set of grammar rules, a set of terminals, and a set of variables, including a start variable, which sits at the root of a parse tree. When such a tree can be produced, the statement is accepted.
The collection of all the accepted statements
is called the language defined by the grammar.
Question: What are signed numbers (Page 9
of Chapter 4 notes)?
<signed integer>::=<sign><number>
<sign>::=+|-
<number>::=<empty string>|<digit><number>
<empty string>::=
<digit>::=0|1|2|...|9
Thus, +45 is a signed number:
<signed integer> → <sign><number> → + <number>
→ + <digit><number> → + 4 <number>
→ + 4 <digit><number> → + 4 5 <number> → + 4 5.
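This grammar can also be checked mechanically: a signed integer is a sign followed by a possibly empty run of digits, since the <number> rule allows the empty string. A minimal Java sketch:

```java
public class SignedInteger {
    // <signed integer> ::= <sign><number>, <sign> ::= + | -,
    // <number> ::= <empty string> | <digit><number>
    public static boolean isSignedInteger(String s) {
        if (s.isEmpty()) return false;
        char sign = s.charAt(0);
        if (sign != '+' && sign != '-') return false;
        // The rest must derive from <number>: any run of digits, possibly empty.
        return s.substring(1).chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(isSignedInteger("+45"));  // true: derived above
        System.out.println(isSignedInteger("45"));   // false: no sign
    }
}
```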
15
Parsing techniques
Given as input the BNF description of a grammar and a sequence of tokens, if, by applying the rules, a parser can reduce the entire sequence of tokens to the start variable, then that sequence is a syntactically valid statement of the language. Otherwise, it is not.
For example, given the following grammar
1. <sentence>::=<noun><verb>
2. <noun>::=bees|dogs
3. <verb>::=buzz|bite
It has four terminals and three variables, and <sentence> serves as the goal statement, or the start variable.
“dogs bite” is a valid statement of this language, but “bees dogs” is not.
Question: Why?
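A direct check of the two rules makes the answer concrete. This is a sketch for this toy grammar only; a real parser would build the tree rather than return a boolean:

```java
import java.util.Set;

public class SentenceParser {
    static final Set<String> NOUNS = Set.of("bees", "dogs");
    static final Set<String> VERBS = Set.of("buzz", "bite");

    // <sentence> ::= <noun><verb>: exactly two tokens, in that order.
    public static boolean isSentence(String[] tokens) {
        return tokens.length == 2
            && NOUNS.contains(tokens[0])
            && VERBS.contains(tokens[1]);
    }

    public static void main(String[] args) {
        System.out.println(isSentence(new String[] {"dogs", "bite"})); // true
        System.out.println(isSentence(new String[] {"bees", "dogs"})); // false: "dogs" is not a <verb>
    }
}
```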
16
Is there a parse tree...?
Given “dogs bite.”, it will be broken, in the lexical analysis, into the following sequence of tokens: “dogs”, “bite”, and “.”.
Then the parser will determine that it is a legal
statement of that language, as the following
tree can be constructed, where the numbers
attached to an edge indicate the grammar rule
used to make that derivation.
On the other hand, “bees”, “dogs” is not a valid sequence of tokens, since no parse tree can be constructed.
Homework: Exercises 5 and 9
17
Another example
Consider the following grammar:
1. <assignment statement>::=<variable>:=<exp>
2. <exp>::=<variable>|<variable>+<variable>
3. <variable>::=x|y|z
Then, e.g., x:=y+z will be deemed a valid statement of this language, because of the following tree, which we saw on Page 12.
Question: Is it always this easy?
18
A potential issue
Consider the following grammar:
1. <t1>::=A B
2. <t2>::=B C
given the following input ...A B C..., both rules are applicable. The question is: which rule should be used?
This situation can also occur in programming
language parsing.
For example, in parsing
x=y+z;
although it is clear that the whole sequence should be recognized as an assignment, it is quite possible for the parser to apply rules so as to obtain <assignment statement> + z, which cannot be further reduced, and we get stuck.
19
A likely solution
Consider the following grammar:
1. <goal>::=<term> C
2. <term>::=A B|B C
With the input A B C, both rules are applicable. But if we look ahead, we find that applying the second alternative of Rule 2 leads nowhere, as we would get A <term>. Hence, we try the first alternative, and quickly reduce the input to <goal>.
In general, there are many look-ahead parsing algorithms, which use this idea of looking a few tokens ahead to see what would happen if a certain choice were made at the current point.
Such a practice helps the parser move in the right direction. It also leads to very efficient parsing, as we no longer try paths that would fail.
20
Another issue
It could also happen that the grammar is simply wrong. For example, with the grammar as shown on Page 18, no matter how hard we try, we cannot accept the following assignment statement: x:=x+y+z.
This is simply because this grammar is too narrowly minded. In general, when presenting a grammar, we have to make sure that the grammar will
• include every valid statement, and
• exclude every invalid statement.
Hence, we want to modify the grammar for assignment statements as follows:
1. <assignment statement>::=<variable>:=<exp>
2. <exp>::=<variable>|<exp>+<exp>
3. <variable>::=x|y|z
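The revised rules can be tried out with a small recognizer. This is a sketch written for this page; it checks membership only, using := as the assignment operator as in the grammar above:

```java
public class AssignParser {
    // <assignment statement> ::= <variable> := <exp>
    // <exp> ::= <variable> | <exp> + <exp>
    // <variable> ::= x | y | z
    // For recognition purposes, an <exp> is simply one or more
    // variables separated by + signs.
    static boolean isVariable(String t) {
        return t.equals("x") || t.equals("y") || t.equals("z");
    }

    public static boolean isAssignment(String[] tokens) {
        if (tokens.length < 3 || !isVariable(tokens[0]) || !tokens[1].equals(":=")) return false;
        // The rest must alternate: variable (+ variable)*
        for (int i = 2; i < tokens.length; i++) {
            boolean expectVariable = (i % 2 == 0);
            if (expectVariable ? !isVariable(tokens[i]) : !tokens[i].equals("+")) return false;
        }
        return tokens.length % 2 == 1;  // must end on a variable
    }

    public static void main(String[] args) {
        System.out.println(isAssignment(new String[] {"x", ":=", "x", "+", "y", "+", "z"})); // true
        System.out.println(isAssignment(new String[] {"x", ":=", "x", "+"}));                // false
    }
}
```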
21
Which parse tree...
... is correct for x=x+y+z?
22
Ambiguity elimination
By the associative law for addition,
x + y + z = (x + y) + z
the first parse tree is the correct one.
Question: How could we come up with a
grammar to reflect this nice feature?
Answer: Here is yet another attempt at the assignment grammar, one that does capture the above associativity.
1. <assignment statement>::=<variable>:=<exp>
2. <exp>::=<variable>|<exp>+<variable>
3. <variable>::=x|y|z
This set of rules not only eliminates the ambiguous interpretation of the statement, but also assigns the intended meaning to it.
Question: Show me...
23
Here it is...
With this last set of revised grammar rules for the assignment statement, the statement x:=x+y+z will be interpreted as x:=(x+y)+z.
Thus, the expression will be interpreted as intended.
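The left grouping that this grammar enforces corresponds to evaluating the variables with a left fold: start with the first variable and add in the remaining ones, one at a time. A sketch, with made-up values for x, y, and z:

```java
import java.util.Map;

public class LeftAssoc {
    // <exp> ::= <variable> | <exp> + <variable> groups from the left,
    // so x+y+z is evaluated as (x+y)+z.
    public static int evaluate(String[] vars, Map<String, Integer> values) {
        int result = values.get(vars[0]);
        for (int i = 1; i < vars.length; i++)
            result = result + values.get(vars[i]);   // ((x + y) + z) ...
        return result;
    }

    public static void main(String[] args) {
        // Sample values, chosen to match the .DATA values used later in the notes.
        Map<String, Integer> values = Map.of("x", 0, "y", 1, "z", 2);
        System.out.println(evaluate(new String[] {"x", "y", "z"}, values)); // 3
    }
}
```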
Question: Anything else?
Homework: Exercises 12 and 15.
24
Rule for the conditions
1. <if statement>::=if(<Boolean exp>)then
<assignment statement><else clause>
2. <Boolean exp>::=<var>|<var> <relation> <var>
3. <relation>::= =|<|>
4. <var>::=x|y|z
5. <else clause>::=else<assignment statement>
Here is a tree for a conditional structure:
if(x==y) x=z else x=y
So, parsing could make you feel dizzy.
25
Semantics and code generation
During parsing, a compiler deals with the syntax of a statement, i.e., its structure. But it is not the case that every syntactically correct sentence makes sense; e.g., “The man bit the dog.” is grammatically correct, but doesn't make sense.
We thus also have to check the semantics, i.e., the meaning, of the statement. The compiler will analyze the meaning of the tokens and determine the actions the statement tries to perform. If it does not make sense, the statement will be rejected; otherwise, it will be translated into an equivalent in machine language.
Given the following statement:
sum=a+b;
Although it is syntactically correct, it still might
not make sense if we know that the types of a
and b are char and integer, respectively.
26
Semantic record
The previous example tells us that we have to add some additional information, such as the type of a data item, to the parse tree. In general, we attach a semantic record to every node in the parse tree. For example, below is a more general parse tree for a+b:
Based on this information, the compiler can easily reject an expression that constructs the following tree, because of inconsistent types.
27
Code generation
Thus, the first step of code generation has to be semantic analysis, which checks every branch of the parse tree to make sure it is semantically meaningful.
If it is not, errors will be reported. Otherwise, we move on to the next phase to produce the code.
Let’s use the following parse tree for x=y+z;
28
First things first
It begins by working with the original input tokens.
For example, it will generate the following code for the branches for the variables x, y, and z.
29
The next step
To accomplish y+z, the compiler has to use some temporary cell, e.g., TEMP, to store the sum first, as it has not yet seen x, where this sum will finally be stored.
LOAD Y
ADD Z
STORE TEMP
TEMP: .DATA 0
Generally, whenever the compiler needs a temporary space, it will create one using the .DATA pseudo-op.
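The generation of code for a + node, including the creation of a temporary with .DATA, can be sketched as follows. The class and method names here are invented for illustration; they are not part of any real compiler:

```java
import java.util.ArrayList;
import java.util.List;

public class CodeGen {
    private final List<String> code = new ArrayList<>();
    private final List<String> data = new ArrayList<>();
    private int tempCount = 0;

    // Create a fresh temporary cell with the .DATA pseudo-op.
    private String newTemp() {
        String name = "TEMP" + (++tempCount);
        data.add(name + ": .DATA 0");
        return name;
    }

    // Generate code for left + right; the result lives in a new temporary.
    public String emitAdd(String left, String right) {
        String temp = newTemp();
        code.add("LOAD " + left);
        code.add("ADD " + right);
        code.add("STORE " + temp);
        return temp;
    }

    // Generate code for target = source.
    public void emitAssign(String target, String source) {
        code.add("LOAD " + source);
        code.add("STORE " + target);
    }

    public List<String> program() {
        List<String> all = new ArrayList<>(code);
        all.addAll(data);   // .DATA cells go after the instructions
        return all;
    }

    public static void main(String[] args) {
        CodeGen gen = new CodeGen();
        String temp = gen.emitAdd("Y", "Z");  // y + z  ->  TEMP1
        gen.emitAssign("X", temp);            // x = TEMP1
        gen.program().forEach(System.out::println);
    }
}
```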
30
The final step
The final branch of the tree builds the nonterminal <assignment statement>, and is translated into machine language by loading the value of the expression and storing it into the variable.
LOAD TEMP
STORE X
The compiler also builds up a semantic record
for the finished product, which is considered
the value of the entire statement.
Notice the whole tree consists of the above
and the one on the previous page.
31
The completed code
To summarize, the following assignment statement
x := y + z;
has been translated into the following machine instructions
LOAD Y
ADD Z
STORE TEMP
LOAD TEMP
STORE X
...
X: .DATA 0
Y: .DATA 1
Z: .DATA 2
TEMP: .DATA 0
Homework: Exercise 23
32
The running example
Given the following line
x = x + y + z;
after lexical analysis (Page 11), the compiler
first generates the following parse tree, using
the revised grammar as shown on Page 23.
33
Now the code...
With the above tree, the compiler can then
generate the following code:
Question: Should you load it into the von
Neumann machine to play with it?
Answer: No. Wait until you read through
Page 37...
34
Code optimization
When compilers came out during the 1950s, they were not that well accepted. The major reason is that the code they generated was not that efficient, even though correct. Hence the need for code optimization, an issue that we already went through back on Page 4 of the notes.
There are two types of optimization: local and global. In local optimization, the compiler looks at a very small block of instructions to see if any improvement can be made to make it run faster.
For example, if an expression can be fully evaluated at compile time, it should be. Hence the following constant evaluation:
LOAD ONE LOAD TWO
ADD ONE =====> STORE X
STORE X
35
Other techniques
We also want to use simpler, and less time-consuming, operations; hence the following strength reduction:
LOAD X LOAD X
MULTIPLY TWO =====> ADD X
STORE X STORE X
Also we want to eliminate unnecessary work:
LOAD Y LOAD Y
STORE X =====> STORE X
LOAD X STORE Z
STORE Z
Global optimization requires the ability to see the “big picture”, which is more difficult and not always done.
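A local optimizer can be sketched as a peephole pass over adjacent instructions. This minimal version handles only the redundant-load pattern shown above (a STORE X immediately followed by LOAD X):

```java
import java.util.ArrayList;
import java.util.List;

public class Peephole {
    // Local optimization: slide a two-instruction window over the code and
    // delete a LOAD that re-reads the value just stored (STORE X; LOAD X).
    public static List<String> optimize(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty()
                    && out.get(out.size() - 1).startsWith("STORE ")
                    && instr.equals("LOAD " + out.get(out.size() - 1).substring(6))) {
                continue;  // the value is already in the register: drop the LOAD
            }
            out.add(instr);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of("LOAD Y", "STORE X", "LOAD X", "STORE Z");
        optimize(code).forEach(System.out::println);
        // prints LOAD Y, STORE X, STORE Z on separate lines
    }
}
```

Constant evaluation and strength reduction can be added as further patterns in the same loop; each one inspects only a tiny window of instructions, which is exactly what makes the optimization "local".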
36
The optimized code
The above code, as we saw on Page 34, can be further optimized as follows:
.begin
LOAD X
ADD Y
ADD Z
STORE X
OUT X
HALT
X: .DATA 0
Y: .DATA 1
Z: .DATA 2
.end
If we run this with the Assembler, we get the correct answer of 3 back. We did a lexical analysis of this statement on Page 11, did a correct parsing on Page 33, generated code on Page 34, and finally provided the optimized code here.
We thus ran through the whole compiler process, as shown on Page 6.
Lab time: Lab 12 on Language Translation
37