Chapter 11: Language Translation
A program written in any of the high-level languages must be translated into machine language before execution by a special piece of software called a compiler.
Compared to assemblers, compilers are very difficult to design; many person-years have to be spent on them.
One machine language instruction corresponds to one assembly language instruction. Hence, an assembler really just replaces one thing with another, with the help of an OpCode table and a symbol table, as we went through in Lab 9.
1
What will we do here?
We will see here how to translate a high-level language program into one in assembly language.
One high-level language statement may lead to many machine language instructions. For example, the following Java statement
a=b+c-d;
corresponds to the following four instructions
LOAD B
ADD C
SUBTRACT D
STORE A
To generate the corresponding instructions, a compiler must do a thorough analysis of the structure (syntax) and meaning (semantics) of the involved program, which is very complicated and difficult.
We will do some digging here...
2
What should a compiler do?
When performing translation, the foremost goal is to be correct: the generated machine language program must do exactly what the original program does, no more, no less.
For example, the following machine code
LOAD B
ADD C
STORE B
SUBTRACT D
STORE A
does not correctly translate the statement
A=B+C-D
Question: Why?
Answer: It would destroy the original data in
B.
You should not destroy your input.
3
The second goal is that the resulting machine code should be efficient and concise: it has to be fast.
For example, to sum up 2x1 + 2x2 + · · · + 2x50000, consider the following poorly written Java program:
sum=0.0;
for(i=1; i<=50000; i++)
sum=sum+(2.0*x[i]);
the compiler should generate more efficient code, as if it had been based on the following:
sum=0.0;
for (i=1; i<=50000; i++)
sum=sum+x[i];
sum=2.0*sum;
The second version does just one multiplication, while the first would do 50,000 of them.
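The two loops can be compared directly in Java. This is a tiny self-contained sketch; the array and its values are made up for illustration:

```java
public class SumExample {
    // Both methods compute 2*x[1] + ... + 2*x[n]; the second does one
    // multiplication instead of n of them.
    public static double slowSum(double[] x) {
        double sum = 0.0;
        for (int i = 1; i < x.length; i++)
            sum = sum + (2.0 * x[i]);          // one multiply per iteration
        return sum;
    }

    public static double fastSum(double[] x) {
        double sum = 0.0;
        for (int i = 1; i < x.length; i++)
            sum = sum + x[i];
        return 2.0 * sum;                       // a single multiply at the end
    }

    public static void main(String[] args) {
        double[] x = {0.0, 1.0, 2.0, 3.0};      // x[0] unused, as in the 1-based loop
        System.out.println(slowSum(x) + " " + fastSum(x)); // 12.0 12.0
    }
}
```

Both versions produce the same sum, which is exactly why the compiler is allowed to substitute one for the other.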
4
What is the process?
There are generally four phases in the translation process of a compiler.
1. Lexical analysis: The compiler looks at the
individual characters in the source program and
groups them into syntactic units, called tokens.
2. Parsing: The sequence of tokens is checked to see if it forms a syntactically correct program according to the rules of the programming language.
3. Semantic analysis and code generation: The compiler analyzes the meaning of the program and generates the proper code.
4. Code optimization: The compiler tries to make the just-generated code more efficient.
5
I want to see... in two ways
A big picture...
and a smaller one:
6
Lexical analysis
In this first phase, a lexical analyzer, part of the compiler, reads in a sequence of characters and groups them into tokens.
For example, for the following Java statement
a = b + 319 - delta;
based on the individual characters, such as (space), a, =, b, +, 3, 1, 9, -, d, e, l, t, a, and ;, the analyzer forms the following eight tokens:
a, =, b, +, 319, -, delta, ;.
From now on, the compiler can work at the
level of symbols, numbers, and operators.
Some languages, including Java, actually provide a tokenizer for free.
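The grouping step can be sketched in a few lines of Java. This is a simplified tokenizer written just for this example, not the one a real compiler would use:

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    // Group characters into tokens: names, numbers, and single-character operators.
    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                               // skip blanks between tokens
            } else if (Character.isLetter(c)) {    // a name: letters run together
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) i++;
                tokens.add(source.substring(start, i));
            } else if (Character.isDigit(c)) {     // a number: digits run together
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                tokens.add(source.substring(start, i));
            } else {                               // operators and punctuation: one char each
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The statement from the example yields exactly eight tokens.
        System.out.println(tokenize("a = b + 319 - delta;"));
        // prints [a, =, b, +, 319, -, delta, ;]
    }
}
```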
7
Token classification
Besides forming tokens, the analyzer also tries to categorize them, to facilitate later translation.
For example, all names will be assigned category 1, while all numbers will be assigned category 2, etc. We can have the following table.
Token Type    Classification
symbol        1
number        2
=             3
+             4
-             5
;             6
==            7
if            8
else          9
(             10
)             11
8
Why do we classify?
If we look at the bigger picture, we only care about what occurs where, in order to have a correct statement.
For example, the following is a legal assignment, no matter what symbols are used and what value the number has.
‘‘symbol’’ = ‘‘symbol’’ + ‘‘number’’;
To summarize, the input to a lexical analyzer
is a high-level language statement from the
source program. Its output is a list of all the
tokens contained in the program, as well as
their classifications, represented as numbers.
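The classification step can be sketched as a simple lookup against the table on Page 8. A minimal illustration; the category numbers are the ones from that table:

```java
import java.util.Map;

public class Classifier {
    // Classification numbers from the table on Page 8.
    static final Map<String, Integer> FIXED = Map.of(
        "=", 3, "+", 4, "-", 5, ";", 6, "==", 7,
        "if", 8, "else", 9, "(", 10, ")", 11);

    public static int classify(String token) {
        if (FIXED.containsKey(token)) return FIXED.get(token);
        if (token.chars().allMatch(Character::isDigit)) return 2; // number
        return 1;                                                 // symbol (a name)
    }

    public static void main(String[] args) {
        // The running example x = x + y + z; from Page 11.
        for (String t : new String[] {"x", "=", "x", "+", "y", "+", "z", ";"})
            System.out.println(t + " " + classify(t));
    }
}
```

Note that keywords such as if and else are checked against the fixed table before the name rule, so they do not fall into category 1.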
Homework: Exercises 1–3
9
An example
Given the following input statement in Java:
if (a==b) then x=13 else x=20;
the analyzer will generate the following output,
based on the table as given on Page 8:
Token Type    Classification
if            8
(             10
a             1
==            7
b             1
)             11
x             1
=             3
13            2
else          9
x             1
=             3
20            2
;             6
Homework: Exercise 4 (a), (c).
10
A running example
Given the following input statement
x = x + y + z;
the analyzer will generate the following output
of the lexical analysis phase:
Token Type    Classification
x             1
=             3
x             1
+             4
y             1
+             4
z             1
;             6
This statement will be our running example to demonstrate the four-phase translation process as summarized on Page 6, all the way through to the end on Page 37.
Question: Now that the lexical analysis is
done, then what?
11
Parsing
During the parsing phase, a compiler determines whether the recognized tokens fit together in a grammatically correct way, i.e., whether the statement is syntactically legal in the programming language.
For example, the following assignment statement x = y + z; is a legal statement, as we can construct the following parse tree.
Such a tree shows how those tokens are grouped
together.
12
Grammars, languages and BNF
To enable the compiler to parse a program, we have to present a grammar, i.e., the syntax rules of the programming language, usually represented in Backus-Naur Form (BNF).
The following BNF rule
<assignment statement>::=<symbol>=<expression>
corresponds to the following parse tree:
We will talk about this stuff a lot more in
CS3780 Intro. to Computational Theory.
13
A BNF rule consists of two parts:
LHS ::= RHS.
It means that the tokens represented by the RHS produce the token represented by the LHS. For example, the sequence
<symbol>=<expression>;
produces an assignment statement.
In a BNF rule, the LHS is always a nonterminal, a category used to explain and organize the language, e.g., <symbol>; the RHS, on the other hand, may use two different types of objects, terminals as well as nonterminals.
Terminals refer to the actual tokens recognized and returned by the lexical analyzer, e.g., ;, =, if, else, etc., which are not further defined by other rules of the grammar.
14
What is a language?
Any programming language is specified by a fixed set of grammar rules, a set of terminals, and a set of variables, including a start variable, which sits at the root of a parse tree. When such a tree can be produced, the statement is accepted.
The collection of all the accepted statements
is called the language defined by the grammar.
Question: What are signed numbers (Page 9
of Chapter 4 notes)?
<signed integer>::=<sign><number>
<sign>::=+|-
<number>::=<empty string>|<digit><number>
<empty string>::=
<digit>::=0|1|2|...|9
Thus, +45 is a signed number:
<signed integer> → <sign><number> → + <number>
→ + <digit><number> → + 4 <number>
→ + 4 <digit><number> → + 4 5 <number> → + 4 5.
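This grammar can also be checked mechanically: a signed integer is a sign followed by a possibly empty run of digits, since the <number> rule allows the empty string. A minimal Java sketch:

```java
public class SignedInteger {
    // <signed integer> ::= <sign><number>, <sign> ::= + | -,
    // <number> ::= <empty string> | <digit><number>
    public static boolean isSignedInteger(String s) {
        if (s.isEmpty()) return false;
        char sign = s.charAt(0);
        if (sign != '+' && sign != '-') return false;
        // The rest must derive from <number>: any run of digits, possibly empty.
        return s.substring(1).chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(isSignedInteger("+45"));  // true: derived above
        System.out.println(isSignedInteger("45"));   // false: no sign
    }
}
```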
15
Parsing techniques
Given as input the BNF description of a grammar and a sequence of tokens, if, by applying the rules, a parser can reduce the entire sequence of tokens to the start variable, then that sequence is a syntactically valid statement of the language. Otherwise, it is not.
For example, given the following grammar
1. <sentence>::=<noun><verb>
2. <noun>::=bees|dogs
3. <verb>::=buzz|bite
It has four terminals and three variables, and <sentence> serves as the goal statement, or the start variable.
“dogs bite” is a valid statement of this language, but “bees dogs” is not.
Question: Why?
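A direct check of the two rules makes the answer concrete. This is a sketch for this toy grammar only; a real parser would build the tree rather than return a boolean:

```java
import java.util.Set;

public class SentenceParser {
    static final Set<String> NOUNS = Set.of("bees", "dogs");
    static final Set<String> VERBS = Set.of("buzz", "bite");

    // <sentence> ::= <noun><verb>: exactly two tokens, in that order.
    public static boolean isSentence(String[] tokens) {
        return tokens.length == 2
            && NOUNS.contains(tokens[0])
            && VERBS.contains(tokens[1]);
    }

    public static void main(String[] args) {
        System.out.println(isSentence(new String[] {"dogs", "bite"})); // true
        System.out.println(isSentence(new String[] {"bees", "dogs"})); // false: "dogs" is not a <verb>
    }
}
```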
16
Is there a parse tree...?
Given “dogs bite.”, it will be broken, in the lexical analysis, into the following sequence of tokens: “dogs”, “bite”, and “.”.
Then the parser will determine that it is a legal
statement of that language, as the following
tree can be constructed, where the numbers
attached to an edge indicate the grammar rule
used to make that derivation.
On the other hand, “bees”, “dogs” is not a valid sequence of tokens, since no parse tree can be constructed.
Homework: Exercises 5 and 9
17
Another example
Consider the following grammar:
1. <assignment statement>::=<variable>:=<exp>
2. <exp>::=<variable>|<variable>+<variable>
3. <variable>::=x|y|z
Then, e.g., x:=y+z will be deemed a valid statement of this language, because of the following tree, which we saw on Page 12.
Question: Is it always this easy?
18
A potential issue
Consider the following grammar:
1. <t1>::=A B
2. <t2>::=B C
given the following input ...A B C..., both rules are applicable. The question is: which rule should be used?
This situation can also occur in programming
language parsing.
For example, in parsing
x=y+z;
although it is clear that the whole sequence should be recognized as an assignment, it is quite possible for the parser to apply rules so as to obtain <assignment statement> + z, which cannot be further reduced, and we get stuck.
19
A likely solution
Consider the following grammar:
1. <goal>::=<term> C
2. <term>::=A B|B C
With the input A B C, both rules are applicable. But if we look ahead, we find that applying the second alternative of Rule 2 leads nowhere, as we would get A <term>. Hence, we try the first alternative, and quickly reduce the input to <goal>.
In general, there are many look-ahead parsing algorithms, which use this idea of looking a few tokens ahead to see what would happen if a certain choice were made at the current point.
Such a practice helps the parser move in the right direction. It also leads to very efficient parsing, as we no longer try paths that would fail.
20
Another issue
It could also happen that the grammar is simply wrong. For example, with the grammar as shown on Page 18, no matter how hard we try, we cannot accept the following assignment statement: x:=x+y+z.
This is simply because this grammar is too narrowly minded. In general, when presenting a grammar, we have to make sure that the grammar will
• include every valid statement, and
• exclude every invalid statement.
Hence, we want to modify the grammar for assignment statements as follows:
1. <assignment statement>::=<variable>:=<exp>
2. <exp>::=<variable>|<exp>+<exp>
3. <variable>::=x|y|z
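The revised rules can be tried out with a small recognizer. This is a sketch written for this page; it checks membership only, using := as the assignment operator as in the grammar above:

```java
public class AssignParser {
    // <assignment statement> ::= <variable> := <exp>
    // <exp> ::= <variable> | <exp> + <exp>
    // <variable> ::= x | y | z
    // For recognition purposes, an <exp> is simply one or more
    // variables separated by + signs.
    static boolean isVariable(String t) {
        return t.equals("x") || t.equals("y") || t.equals("z");
    }

    public static boolean isAssignment(String[] tokens) {
        if (tokens.length < 3 || !isVariable(tokens[0]) || !tokens[1].equals(":=")) return false;
        // The rest must alternate: variable (+ variable)*
        for (int i = 2; i < tokens.length; i++) {
            boolean expectVariable = (i % 2 == 0);
            if (expectVariable ? !isVariable(tokens[i]) : !tokens[i].equals("+")) return false;
        }
        return tokens.length % 2 == 1;  // must end on a variable
    }

    public static void main(String[] args) {
        System.out.println(isAssignment(new String[] {"x", ":=", "x", "+", "y", "+", "z"})); // true
        System.out.println(isAssignment(new String[] {"x", ":=", "x", "+"}));                // false
    }
}
```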
21
Which parse tree...
... is correct for x=x+y+z?
22
Ambiguity elimination
By the associative law for addition,
x + y + z = (x + y) + z
the first parse tree is the correct one.
Question: How could we come up with a
grammar to reflect this nice feature?
Answer: Here is yet another attempt at the assignment grammar, one that does capture the above associativity.
1. <assignment statement>::=<variable>:=<exp>
2. <exp>::=<variable>|<exp>+<variable>
3. <variable>::=x|y|z
This set of rules not only eliminates the ambiguous interpretation of the statement, but also assigns the intended meaning to it.
Question: Show me...
23
Here it is...
With this last set of revised grammar rules for the assignment statement, the statement x:=x+y+z will be interpreted as x:=(x+y)+z.
Thus, the expression will be interpreted as intended.
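The left grouping that this grammar enforces corresponds to evaluating the variables with a left fold: start with the first variable and add in the remaining ones, one at a time. A sketch, with made-up values for x, y, and z:

```java
import java.util.Map;

public class LeftAssoc {
    // <exp> ::= <variable> | <exp> + <variable> groups from the left,
    // so x+y+z is evaluated as (x+y)+z.
    public static int evaluate(String[] vars, Map<String, Integer> values) {
        int result = values.get(vars[0]);
        for (int i = 1; i < vars.length; i++)
            result = result + values.get(vars[i]);   // ((x + y) + z) ...
        return result;
    }

    public static void main(String[] args) {
        // Sample values, chosen to match the .DATA values used later in the notes.
        Map<String, Integer> values = Map.of("x", 0, "y", 1, "z", 2);
        System.out.println(evaluate(new String[] {"x", "y", "z"}, values)); // 3
    }
}
```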
Question: Anything else?
Homework: Exercises 12 and 15.
24
Rule for the conditions
1. <if statement>::=if(<Boolean exp>)then
<assignment statement><else clause>
2. <Boolean exp>::=<var>|<var> <relation> <var>
3. <relation>::= =|<|>
4. <var>::=x|y|z
5. <else clause>::=else<assignment statement>
Here is a tree for a conditional structure:
if(x==y) x=z else x=y
So, parsing could make you feel dizzy.
25
Semantics and code generation
During parsing, a compiler deals with the syntax of a statement, i.e., its structure. But it is not the case that every syntactically correct sentence makes sense; e.g., “The man bit the dog.” is grammatically correct, but doesn't make sense.
We thus also have to check the semantics, i.e., the meaning, of the statement. The compiler will analyze the meaning of the tokens and determine the actions the statement tries to perform. If it does not make sense, the statement will be rejected; otherwise, it will be translated into an equivalent in machine language.
Given the following statement:
sum=a+b;
Although it is syntactically correct, it still might
not make sense if we know that the types of a
and b are char and integer, respectively.
26
Semantic record
The previous example tells us that we have to add some additional information, such as the type of a data item, to the parse tree. In general, we attach a semantic record to every node in the parse tree. For example, below is a more general parse tree for a+b:
Based on this information, the compiler can easily reject an expression that constructs the following tree, because of inconsistent types.
27
Code generation
Thus, the first step of code generation has to be semantic analysis, which checks every branch of the parse tree to make sure it is semantically meaningful.
If it is not, errors will be reported. Otherwise, we move on to the next phase to produce the code.
Let’s use the following parse tree for x=y+z;
28
First things first
It begins by working with the original input tokens.
For example, it will generate the following code for the branches for the variables x, y, and z.
29
The next step
To accomplish y+z, the compiler has to use some temporary cell, e.g., TEMP, to store the sum first, as it has not yet seen x, where this sum will finally be stored.
LOAD Y
ADD Z
STORE TEMP
TEMP: .DATA 0
Generally, whenever the compiler needs a temporary space, it will create one using the .DATA pseudo-op.
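The generation of code for a + node, including the creation of a temporary with .DATA, can be sketched as follows. The class and method names here are invented for illustration; they are not part of any real compiler:

```java
import java.util.ArrayList;
import java.util.List;

public class CodeGen {
    private final List<String> code = new ArrayList<>();
    private final List<String> data = new ArrayList<>();
    private int tempCount = 0;

    // Create a fresh temporary cell with the .DATA pseudo-op.
    private String newTemp() {
        String name = "TEMP" + (++tempCount);
        data.add(name + ": .DATA 0");
        return name;
    }

    // Generate code for left + right; the result lives in a new temporary.
    public String emitAdd(String left, String right) {
        String temp = newTemp();
        code.add("LOAD " + left);
        code.add("ADD " + right);
        code.add("STORE " + temp);
        return temp;
    }

    // Generate code for target = source.
    public void emitAssign(String target, String source) {
        code.add("LOAD " + source);
        code.add("STORE " + target);
    }

    public List<String> program() {
        List<String> all = new ArrayList<>(code);
        all.addAll(data);   // .DATA cells go after the instructions
        return all;
    }

    public static void main(String[] args) {
        CodeGen gen = new CodeGen();
        String temp = gen.emitAdd("Y", "Z");  // y + z  ->  TEMP1
        gen.emitAssign("X", temp);            // x = TEMP1
        gen.program().forEach(System.out::println);
    }
}
```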
30
The final step
The final branch of the tree builds the nonterminal <assignment statement>, and is translated into machine language by loading the value of the expression and storing it into the variable.
LOAD TEMP
STORE X
The compiler also builds up a semantic record
for the finished product, which is considered
the value of the entire statement.
Notice the whole tree consists of the above
and the one on the previous page.
31
The completed code
To summarize, the following assignment statement
x := y + z;
has been translated into the following machine instructions
LOAD Y
ADD Z
STORE TEMP
LOAD TEMP
STORE X
...
X: .DATA 0
Y: .DATA 1
Z: .DATA 2
TEMP: .DATA 0
Homework: Exercise 23
32
The running example
Given the following line
x = x + y + z;
after lexical analysis (Page 11), the compiler
first generates the following parse tree, using
the revised grammar as shown on Page 23.
33
Now the code...
With the above tree, the compiler can then
generate the following code:
Question: Should you load it into the von
Neumann machine to play with it?
Answer: No. Wait until you read through
Page 37...
34
Code optimization
When compilers came out during the 1950s, they were not that well accepted. The major reason is that the code they generated was not that efficient, even though correct. Hence the need for code optimization, an issue that we already went through back on Page 4 of the notes.
There are two types of optimization: local and global. In local optimization, the compiler looks at a very small block of instructions to see if any improvement can be made to make it run faster.
For example, if an expression can be fully evaluated at compile time, it should be. Hence the following constant evaluation:
LOAD ONE LOAD TWO
ADD ONE =====> STORE X
STORE X
35
Other techniques
We also want to use simpler, and less time-consuming, operations; hence the following strength reduction:
LOAD X LOAD X
MULTIPLY TWO =====> ADD X
STORE X STORE X
Also we want to eliminate unnecessary work:
LOAD Y LOAD Y
STORE X =====> STORE X
LOAD X STORE Z
STORE Z
Global optimization requires the ability to see the “big picture”, which is more difficult and not always done.
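A local optimizer can be sketched as a peephole pass over adjacent instructions. This minimal version handles only the redundant-load pattern shown above (a STORE X immediately followed by LOAD X):

```java
import java.util.ArrayList;
import java.util.List;

public class Peephole {
    // Local optimization: slide a two-instruction window over the code and
    // delete a LOAD that re-reads the value just stored (STORE X; LOAD X).
    public static List<String> optimize(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty()
                    && out.get(out.size() - 1).startsWith("STORE ")
                    && instr.equals("LOAD " + out.get(out.size() - 1).substring(6))) {
                continue;  // the value is already in the register: drop the LOAD
            }
            out.add(instr);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of("LOAD Y", "STORE X", "LOAD X", "STORE Z");
        optimize(code).forEach(System.out::println);
        // prints LOAD Y, STORE X, STORE Z on separate lines
    }
}
```

Constant evaluation and strength reduction can be added as further patterns in the same loop; each one inspects only a tiny window of instructions, which is exactly what makes the optimization "local".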
36
The optimized code
The above code, as we saw on Page 34, can be further optimized as follows:
.begin
LOAD X
ADD Y
ADD Z
STORE X
OUT X
HALT
X: .DATA 0
Y: .DATA 1
Z: .DATA 2
.end
If we run this with the Assembler, we get the correct answer of 3 back. We did a lexical analysis of this statement on Page 11, did a correct parsing on Page 33, generated code on Page 34, and finally provided the optimized code here.
We thus ran through the whole compiler process, as shown on Page 6.
Lab time: Lab 12 on Language Translation
37