2008 syllabus PCD unit - 1 notes
Anna University – B.E -VI Sem CSE D. Jagadeesan, B.E., M.Tech., (Ph.D)., MISTE.,
CS2352 – Principles of Compiler Design Asst. Professor in CSE,APCE
Unit – I
Syllabus : Unit – I : LEXICAL ANALYSIS
Introduction to Compiling- Compilers-Analysis of the source program- The phases-
Cousins-The grouping of phases-Compiler construction tools. The role of the lexical
analyzer- Input buffering-Specification of tokens-Recognition of tokens-A language for
specifying lexical analyzer.
Compiler :
A Compiler is a program that reads a program written in one language (the Source
Language, such as C or C++) and translates it into an equivalent program in another language
(the Target Language, such as Machine Language). The Compiler also reports to its user the presence
of errors in the source program.
Classification of Compiler :
1. Single Pass Compiler (narrow) - traverses the source program only once.
Faster, has limited scope of passes, eg. Pascal
2. Multi-Pass Compiler (wide) - processes the source program several times.
Slower, has wide scope of passes, eg. Java
3. Load and Go Compiler - generates machine code and then immediately
executes it.
4. Debugging or Optimizing Compiler - tries to minimize or maximize some
attributes of an executable computer program, such as its running time or size.
Software Tools :
Many software tools that manipulate source programs first perform some kind of
analysis. Some examples of such tools include:
Structure Editors :
A structure editor takes as input a sequence of commands to build a source
program.
The structure editor not only performs the text-creation and modification
functions of an ordinary text editor, but it also analyzes the program text,
putting an appropriate hierarchical structure on the source program.
Example – while …. do and begin….. end.
Pretty printers :
A pretty printer analyzes a program and prints it in such a way that the
structure of the program becomes clearly visible.
Figure: Source Program (High Level Language) -> Compiler -> Target Program (Low Level Language). The Compiler also outputs Error Messages.
Static Checkers : A static checker reads a program, analyzes it, and attempts to discover
potential bugs without running the program.
Interpreters :
Instead of producing a target program as a translation, an interpreter
directly performs the operations implied by the source program (e.g. BASIC).
Interpreters are frequently used to execute command languages, since each
operator executed in a command language is usually an invocation of a
complex routine such as an editor or Compiler.
The analysis portion in each of the following examples is similar to that of
a conventional Compiler.
Text formatters.
Silicon Compiler.
Query interpreters.
Analysis of Source Program :
The analysis phase breaks up the source program into constituent pieces and creates
an intermediate representation of the source program. Analysis consists of three phases:
Linear analysis (Lexical analysis or Scanning) :
The lexical analysis phase reads the characters in the source program and
groups them into tokens, which are sequences of characters having a
collective meaning.
Example : position := initial + rate * 60
Identifiers – position, initial, rate.
Assignment symbol – :=
Operators – + , *
Number – 60
Blanks – eliminated.
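The grouping of characters into tokens can be sketched with regular expressions. This is only an illustration: the token names and patterns in TOKEN_SPEC and the function tokenize are assumptions made for this example, not something the notes prescribe.

```python
import re

# Hypothetical token names and patterns; the notes only list the categories.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OPERATOR", r"[+*]"),
    ("BLANK", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (token, lexeme) pairs for source, dropping blanks."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # Lexical error: the remaining characters form no token.
            raise ValueError(f"no token starts at position {pos}: {source[pos]!r}")
        if match.lastgroup != "BLANK":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("position := initial + rate * 60")
for token, lexeme in tokens:
    print(token, lexeme)
```

The loop yields three identifiers, the assignment symbol, two operators and one number, with the blanks eliminated, matching the breakdown listed above.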
Hierarchical analysis (Syntax analysis or Parsing) :
It involves grouping the tokens of the source program into grammatical
phrases that are used by the Compiler to synthesize output.
Example : position := initial + rate * 60
Parse tree (children are indented under their parent):
Assignment statement
    Identifier
        position
    :=
    Expression
        Expression
            identifier
                initial
        +
        Expression
            Expression
                identifier
                    rate
            *
            Expression
                number
                    60
Semantic analysis :
This phase checks the source program for semantic errors and gathers
type information for the subsequent code generation phase.
An important component of semantic analysis is type checking.
Example : int to real conversion.
Phases of Compiler:
A Compiler operates in phases, each of which transforms the source program from
one representation to another.
Compilation has two parts, made up of six phases:
Analysis Phase ( Three Phases)
Lexical Analysis
Syntax Analysis
Semantic Analysis
Synthesis Phase ( Three Phases)
Intermediate Code Generation
Code Optimizer
Code Generator
Two other activities are
Symbol Table Management
Error Handler
Lexical Analysis :
The lexical analyzer is also called the scanner.
The lexical analysis phase reads the characters in the source program and
groups them into tokens, which are sequences of characters having a
collective meaning, such as an identifier, a keyword, a punctuation character, an
operator or a multi-character operator like ++.
The character sequence forming a token is called the lexeme for the token.
Certain tokens will be augmented by a lexical value.
Example : position := initial + rate * 60 is represented as id1 := id2 + id3 * 60
Blanks – eliminated.
Figure (int to real conversion): in the subtree for rate * 60, the semantic analyzer places an inttoreal node above the number 60.
Syntax analysis:
It processes the string of descriptors (tokens), synthesized by the lexical
analyzer to determine the syntactic structure of an input statement. This
process is known as parsing.
Output of the parsing step is a representation of the syntactic structure of a
statement. A convenient representation is in the form of a syntax tree.
Example : position : = initial + rate * 60
Semantic analysis :
This phase checks the source program for semantic errors and gathers
type information for the subsequent code generation phase.
Figure: Phases of a Compiler. Source Program -> Lexical Analyzer -> Syntax Analyzer -> Semantic Analyzer -> Intermediate Code Generator -> Code Optimizer -> Code Generator -> Target Program. Symbol Table Management and the Error Handler interact with all six phases.
Syntax tree for position := initial + rate * 60 (children are indented under their parent):
:=
    id1
    +
        id2
        *
            id3
            60
An important component of semantic analysis is type checking.
Example : int to real conversion.
Intermediate Code Generation:
The intermediate representation should have two important properties:
It should be easy to produce.
It should be easy to translate into the target program.
Three address code consists of a sequence of instructions, each of which
has at most three operands.
Example : id1 := id2 + id3 * 60
Three address code :
temp1 := inttoreal (60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
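One way to see where these four instructions come from is a post-order walk of the syntax tree that creates a fresh temporary for every interior node. The sketch below is illustrative only; the nested-tuple tree encoding and the function gen_three_address are assumptions made for this example.

```python
from itertools import count


def gen_three_address(tree):
    """Return three-address instructions for an assignment tree.

    tree is a nested tuple (hypothetical encoding):
    (":=", target, expr), (op, left, right) or ("inttoreal", operand).
    """
    code = []
    temps = count(1)

    def new_temp():
        return f"temp{next(temps)}"

    def walk(node):
        # Leaves (identifiers and numbers) can be used as operands directly.
        if not isinstance(node, tuple):
            return str(node)
        if node[0] == "inttoreal":
            operand = walk(node[1])
            temp = new_temp()
            code.append(f"{temp} := inttoreal ({operand})")
            return temp
        op, left, right = node
        left_name = walk(left)    # left subtree first, then right subtree
        right_name = walk(right)
        temp = new_temp()
        code.append(f"{temp} := {left_name} {op} {right_name}")
        return temp

    op, target, expr = tree
    if op != ":=":
        raise ValueError("expected an assignment at the root")
    code.append(f"{target} := {walk(expr)}")
    return code


tree = (":=", "id1", ("+", "id2", ("*", "id3", ("inttoreal", 60))))
code = gen_three_address(tree)
print("\n".join(code))
```

Each instruction has at most three operands because every interior node of the tree has at most two children plus the temporary that receives its result.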
Code Optimization:
This phase attempts to improve the intermediate code, so that faster running machine code will
result.
Example : the three address code after optimization
temp1 := id3 * 60.0
id1 := id2 + temp1
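The two improvements in this example (doing the inttoreal conversion once at compile time, and dropping the final copy) can be sketched as follows. The tuple format (destination, operator, argument 1, argument 2) and the function optimize are assumptions made for illustration, and the sketch handles only this straight-line case, not general optimization.

```python
def optimize(code):
    """Fold inttoreal of integer constants, then remove a trailing copy.

    code is a list of (dest, op, arg1, arg2) tuples (hypothetical format).
    """
    constants = {}  # temporary name -> real constant computed at compile time
    folded = []
    for dest, op, arg1, arg2 in code:
        # Replace operands that are temporaries with known constant values.
        arg1 = constants.get(arg1, arg1)
        arg2 = constants.get(arg2, arg2)
        if op == "inttoreal" and isinstance(arg1, int):
            constants[dest] = float(arg1)
            continue  # the instruction itself is no longer needed
        folded.append((dest, op, arg1, arg2))

    # If the last instruction only copies the temporary computed just before
    # it, let that computation write to the final destination directly.
    if len(folded) >= 2 and folded[-1][1] == "copy":
        dest, _, src, _ = folded[-1]
        prev_dest, prev_op, a, b = folded[-2]
        if prev_dest == src:
            folded[-2:] = [(dest, prev_op, a, b)]
    return folded


code = [
    ("temp1", "inttoreal", 60, None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "copy", "temp3", None),
]
optimized = optimize(code)
for instruction in optimized:
    print(instruction)
```

The sketch does not renumber temporaries, so the surviving temporary is still called temp2 where the notes show temp1; apart from the name, the two remaining instructions are the same.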
Code Generation:
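The final phase of the Compiler is the generation of target code, consisting of
relocatable machine code or assembly code.
Example : target code for a hypothetical machine with registers R1 and R2 (the F in each instruction indicates floating-point operands)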
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
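Syntax tree after semantic analysis, with inttoreal inserted (children are indented under their parent):
:=
    id1
    +
        id2
        *
            id3
            inttoreal
                60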
Symbol Table Management:
A Symbol table is a data structure containing a record for each identifier
with fields for the attributes of the identifier.
When an identifier in the source program is detected by the lexical
analyzer, the identifier is entered into the symbol table. However, the
attributes of an identifier cannot normally be determined during lexical
analysis.
The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways.
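A minimal sketch of such a table, assuming a dictionary keyed by identifier name. The class SymbolTable, its method names and the "type" field are hypothetical and chosen only for this example.

```python
class SymbolTable:
    """One record per identifier; attribute fields are filled in by later phases."""

    def __init__(self):
        self._records = {}

    def enter(self, name):
        """Called by the lexical analyzer when it detects an identifier."""
        # setdefault keeps the existing record if the identifier was seen before.
        return self._records.setdefault(name, {"name": name, "type": None})

    def set_attribute(self, name, field, value):
        """Called by a later phase once the attribute is known."""
        self._records[name][field] = value

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self._records.get(name)


table = SymbolTable()
for lexeme in ["position", "initial", "rate", "rate"]:
    table.enter(lexeme)  # the type is still unknown at this point

# A later phase (for example semantic analysis) records the type.
table.set_attribute("rate", "type", "real")
print(table.lookup("rate"))
```

Entering rate twice creates only one record, and its type field stays empty until a later phase fills it in, which mirrors the division of work described above.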
Error Handler:
Each phase can encounter errors.
The lexical phase can detect errors where the characters remaining
in the input do not form any token of the language.
The syntax analysis phase can detect errors where the token stream
violates the structure rules of the language.
During semantic analysis, the compiler tries to detect constructs that
have the right syntactic structure but no meaning to the operation
involved.
An intermediate code generator may detect an operator whose
operands have incompatible types.
The code optimizer, doing control flow analysis, may detect that
certain statements can never be reached.
While entering information into the symbol table, the book keeping
routine may discover an identifier that has been multiply declared
with contradictory attributes.
Write down the output of each phase for the expression position := initial + rate * 60
Source Program :
position := initial + rate * 60
Output of Lexical Analyzer :
id1 := id2 + id3 * 60
Output of Syntax Analyzer (children are indented under their parent) :
:=
    id1
    +
        id2
        *
            id3
            60
Output of Semantic Analyzer :
:=
    id1
    +
        id2
        *
            id3
            inttoreal
                60
Output of Intermediate Code Generator :
temp1 := inttoreal (60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Output of Code Optimizer :
temp1 := id3 * 60.0
id1 := id2 + temp1
Output of Code Generator (Target Program) :
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Symbol Table Management and the Error Handler interact with every phase.
Cousins of the Compiler (Language Processing System) :
Preprocessors :
A preprocessor produces input to the Compiler. Preprocessors may perform the following functions.
Macro Processing :
A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
File inclusion :
A preprocessor may include header files into the program text.
Rational preprocessors :
These preprocessors augment older languages with more
modern flow of control and data structuring facilities.
Language extensions :
These preprocessors attempt to add capabilities to the language
by what amounts to built-in macros.
Compiler :
It converts the source program (HLL) into the target program (LLL).
Assembler :
It converts an assembly language program (LLL) into machine code.
Loader and Link Editors :
Loader :
The process of loading consists of taking relocatable machine code,
altering the relocatable addresses and placing the altered
instructions and data in memory at the proper locations.
Link Editor :
It allows us to make a single program from several files of
relocatable machine code.
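Figure: A language processing system. Skeletal Source Program -> Preprocessor -> Source Program -> Compiler -> Target Assembly Program -> Assembler -> Relocatable Machine Code -> Load/Link-editor -> Absolute Machine Code. The Load/Link-editor also takes Library and Relocatable Object Files as input.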
Grouping of Phases :
COMPILER CONSTRUCTION TOOLS:
The compiler writer like any programmer can profitably use software tools such as
debuggers, version managers, profilers and so on.
Compiler construction tools are
Parser generators
Scanner generators
Syntax-directed translation engines
Automatic code generators
Dataflow engines
Parser Generators:
These produce syntax analyzers, normally from input that is based on a context-free grammar (CFG). In
early Compilers, syntax analysis consumed not only a large fraction of the running time
of a Compiler, but a large fraction of the intellectual effort of writing a Compiler.
Eg: PIC, EQM
Scanner Generator:
These automatically generate lexical analyzers, normally from a specification
based on regular expressions. The basic organization of the resulting lexical analyzer is in
effect a finite automaton.
Syntax-Directed Translation Engines:
These produce intermediate code with three address format, normally from
input that is based on the parse tree.
Automatic Code Generator:
It takes a collection of rules that define the translation of each operation of the
intermediate language into the machine language for the target machine.
The input specification for these systems may contain:
1. A description of the lexical and syntactic structure of the source language.
2. A description of what output is to be generated for each source language
construct.
3. A description of the target machine.
Dataflow Engines:
Much of the information needed to perform good code optimization involves
“dataflow analysis”, the gathering of information about how values are transmitted from
one part of a program to each other part.
These systems have often been referred to as,
Compiler- compilers.
Compiler-generators
Translator-writing systems
ROLE OF LEXICAL ANALYSER:
Its main task is to read the input characters and produce as output a sequence of
tokens that the parser uses for syntax analysis.
Receiving a “get next token” command from the parser, the lexical analyzer reads
input characters until it can identify the next token.
Its secondary tasks are,
1. One task is stripping out from the source program comments and white
space in the form of blank, tab and new line characters.
2. Another task is correlating error messages from the compiler with the
source program.
Sometimes the lexical analyzer is divided into two phases
1. Scanning
2. Lexical analysis
The scanner is responsible for doing simple tasks, while the lexical
analyzer proper does the more complex operations.
FUNCTIONS:
1. It produces the stream of tokens.
2. It eliminates blanks and comments.
3. It generates the symbol table, which stores information about the identifiers and constants
encountered in the input.
4. It keeps track of line numbers.
5. It reports the errors encountered while generating the tokens.
ISSUES IN LEXICAL ANALYSIS:
There are several reasons for separating the analysis phase of compiling into
lexical analysis and parsing.
Simpler design.
Compiler efficiency is improved.
Compiler portability is enhanced.
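Figure: Interaction of the lexical analyzer with the parser. Source Program -> Lexical analyzer -> Tokens -> Parser. The Parser sends "Get next token" requests to the Lexical analyzer, and both use Symbol table Management.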
TOKEN:
It is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1. Identifiers
2. Keywords
3. Operators
4. Special symbols
5. Constants
PATTERN:
A set of strings in the input for which the same token is produced as output.
This set of strings is described by a rule called a pattern associated with the token.
LEXEME:
It is a sequence of characters in the source program that is matched by the
pattern for a token.
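The relation between the three terms can be illustrated with a small table that maps each token to one pattern and then classifies individual lexemes. The pattern table PATTERNS and the function token_for are assumptions made for this sketch; the notes do not fix a pattern notation.

```python
import re

# Hypothetical patterns, one per token. Keywords come first so that the
# lexeme "float" is not classified as an identifier.
PATTERNS = {
    "keyword": r"float|int|if|else",
    "identifier": r"[A-Za-z_][A-Za-z0-9_]*",
    "constant": r"[0-9]+",
    "operator": r"[+\-*/=]",
    "special symbol": r"[,;]",
}


def token_for(lexeme):
    """Return the first token whose pattern matches the whole lexeme."""
    for token, pattern in PATTERNS.items():
        if re.fullmatch(pattern, lexeme):
            return token
    return None


lexemes = ["float", "a", "b", "A", "2", "+", ";"]
classified = {lexeme: token_for(lexeme) for lexeme in lexemes}
for lexeme, token in classified.items():
    print(f"{lexeme!r} -> {token}")
```

Here a, b and A are three different lexemes, but all of them match the single identifier pattern and therefore produce the same token.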
INPUT BUFFERING :
During the analysis, the scanner scans the input string from left to right one character
at a time to identify tokens. It uses two pointers for doing this analysis
1. Begin pointer (to keep track of first character for each token).
2. Forward pointer(to keep track of next character)
Example input buffer : f l o a t   a , b ; a = A + 2 ;
Steps in Scanning the Input:
1. Initially, both the begin pointer (bp) and the forward pointer (fp) point to the first character of the
lexeme.
2. The fp scans the buffer until a match with a token pattern is found.
3. Once the end of the lexeme is found (at a space or a delimiter), the fp marks the right
end of the lexeme.
Figure: bp stays on the first character of float while fp is at its right end.
4. After processing the lexeme, both pointers are set to point to the character
immediately after the lexeme.
Figure: bp and fp both point to the character immediately after float.
5. This procedure is repeated for the entire source program.
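The five steps can be sketched with two integer indices playing the roles of bp and fp. This is a simplified illustration: the delimiter set DELIMITERS and the function scan are assumptions, every delimiter is treated as a one-character lexeme, and the whole input is held in one string rather than a fixed-size buffer.

```python
DELIMITERS = set(",;=+")  # assumed single-character delimiters for this example


def scan(buffer):
    """Split buffer into lexemes using a begin pointer and a forward pointer."""
    lexemes = []
    bp = 0
    while bp < len(buffer):
        if buffer[bp].isspace():
            bp += 1  # blanks are skipped, not returned
            continue
        fp = bp  # step 1: both pointers start at the first character
        if buffer[fp] in DELIMITERS:
            fp += 1  # a delimiter forms a lexeme on its own
        else:
            # step 2: move fp forward until a space or a delimiter is reached
            while (fp < len(buffer)
                   and not buffer[fp].isspace()
                   and buffer[fp] not in DELIMITERS):
                fp += 1
        # step 3: the lexeme runs from bp up to (not including) fp
        lexemes.append(buffer[bp:fp])
        bp = fp  # step 4: continue immediately after the lexeme
    return lexemes  # step 5: the loop repeats until the input is exhausted


lexemes = scan("float a, b; a = A + 2;")
print(lexemes)
```

In this sketch fp stops one position past the right end of the lexeme, which is a common convention because the slice buffer[bp:fp] then yields the lexeme directly.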
Input strings are usually stored in buffer.
Two Types:
1. One buffer scheme
2. Two buffer scheme
One Buffer Scheme:
Only one buffer of size ‘N’ is used.
The first N characters of the input string are read into the buffer. When the fp
reaches the end of the buffer, the buffer is refilled with the next set of N
characters.
Drawbacks:
The problem with this implementation is that when the size of a token is
greater than ‘N’ this scheme fails to produce the token.
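Two Buffer Scheme:
Two N character buffers are used.
Figure: the two halves, each of size N and each ending with eof.
First half (N Size) : f l o a t eof
Second half (N Size) : a , b ; a = a + 2 eof
The pointers bp and fp move through both halves.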
The first N characters are read into the first half of the buffer. If fewer than N characters
remain in the input, a special character called EOF is inserted after them to indicate the
end.
When the pointer reaches the end of the first half, the second half is
loaded with the next N characters of the same program.
When the pointer reaches the end of the second half, the first half
is loaded with the next N characters of the input.
Algorithm for advancing the fp :
if fp is at the end of first half then
begin
    Load second half;
    Increment fp by 1;
end
else if fp is at the end of second half then
begin
    Load first half;
    Set fp to first character of first half;
end
else
    Increment fp by 1;
With this algorithm, every advance of fp requires checking whether it has reached the end of a half.
To reduce the number of comparisons, a special character called the sentinel character
(usually EOF) is placed at the end of each buffer half.
Algorithm for advancing the fp using Sentinel:
Increment fp by 1;
if character at fp = eof then
begin
    if fp is at the end of first half then
    begin
        Load second half;
        Increment fp by 1;
    end
    else if fp is at the end of second half then
    begin
        Load first half;
        Set fp to first character of first half;
    end
    else
        Terminate lexical analysis;
end
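A runnable sketch of the two buffer scheme with sentinels is given below. Several details are assumptions made for illustration: the class TwoBufferReader and its loads counter are hypothetical, the sentinel is the NUL character (assumed never to occur in the source), and each half is given one extra slot to hold its sentinel.

```python
import io

EOF = "\0"  # assumed sentinel: a character that never occurs in the source


class TwoBufferReader:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n):
        self.stream = stream
        self.n = n
        self.buf = [EOF] * (2 * n + 2)
        self.loads = 0       # number of half-buffer loads performed
        self._load(0)        # fill the first half
        self.fp = -1         # the first advance moves fp to index 0

    def _load(self, start):
        """Read up to n characters into the half that begins at start."""
        chunk = self.stream.read(self.n)
        for i in range(self.n + 1):
            # Unused slots and the slot after the half hold the sentinel.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF
        self.loads += 1

    def advance(self):
        """Move fp forward by one; return the character, or None at end of input."""
        self.fp += 1
        if self.buf[self.fp] == EOF:          # the only test on the common path
            if self.fp == self.n:             # sentinel ending the first half
                self._load(self.n + 1)
                self.fp += 1
            elif self.fp == 2 * self.n + 1:   # sentinel ending the second half
                self._load(0)
                self.fp = 0
            else:
                return None                   # eof inside a half: terminate
            if self.buf[self.fp] == EOF:
                return None                   # the newly loaded half is empty
        return self.buf[self.fp]


source = "float a, b; a = a + 2"
reader = TwoBufferReader(io.StringIO(source), n=8)
# iter(callable, sentinel) keeps calling advance until it returns None.
text = "".join(iter(reader.advance, None))
print(text, reader.loads)
```

For an ordinary character, advance performs a single comparison against the sentinel; the tests for the ends of the halves run only when a sentinel is actually seen. Callers are expected to stop calling advance once it has returned None.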
Refer to the following topics from Theory of Computation:
1. Finite Automata
2. DFA
3. NFA
4. Regular Expression
5. Converting R.E into NFA
6. Converting NFA with ε into NFA without ε and DFA
7. Minimization of DFA.