
PCD Notes_Unit -1


2008 syllabus PCD unit - 1 notes


Anna University – B.E -VI Sem CSE D. Jagadeesan, B.E., M.Tech., (Ph.D)., MISTE.,

CS2352 – Principles of Compiler Design Asst. Professor in CSE,APCE

Unit – I


Syllabus : Unit – I : LEXICAL ANALYSIS

Introduction to Compiling- Compilers-Analysis of the source program- The phases-

Cousins-The grouping of phases-Compiler construction tools. The role of the lexical

analyzer- Input buffering-Specification of tokens-Recognition of tokens-A language for

specifying lexical analyzer.

Compiler :

A Compiler is a program that reads a program written in one language (the Source Language, such as C or C++) and translates it into an equivalent program in another language (the Target Language, such as Machine Language). During translation, the Compiler reports to its user the presence of errors in the source program.

Classification of Compiler :

1. Single Pass Compiler (narrow) - traverses the source program only once. Faster, has limited scope of passes, eg. Pascal

2. Multi-Pass Compiler (wide) - processes the source program several times. Slower, has wide scope of passes, eg. Java

3. Load and Go Compiler - generates machine code and then immediately executes it.

4. Debugging or Optimizing Compiler - tries to minimize or maximize some attributes of an executable computer program.

Software Tools :

Many software tools that manipulate source programs first perform some kind of

analysis. Some examples of such tools include:

Structure Editors :

A structure editor takes as input a sequence of commands to build a source

program.

The structure editor not only performs the text-creation and modification

functions of an ordinary text editor, but it also analyzes the program text,

putting an appropriate hierarchical structure on the source program.

Example – while …. do and begin….. end.

Pretty printers :

A pretty printer analyzes a program and prints it in such a way that the

structure of the program becomes clearly visible.

(Figure: Source Program (High Level Language) -> Compiler -> Target Program (Low Level Language); the Compiler also produces Error Messages.)


Static Checkers : A static checker reads a program, analyzes it, and attempts to discover

potential bugs without running the program.

Interpreters :

Instead of translating a program from a high level language (BASIC, FORTRAN, etc.) into assembly or machine language, an interpreter performs the operations implied by the source program.

Interpreters are frequently used to execute command language, since each

operator executed in a command language is usually an invocation of a

complex routine such as an editor or Compiler.

The analysis portion in each of the following examples is similar to that of

a conventional Compiler.

Text formatters.

Silicon Compiler.

Query interpreters.

Analysis of Source Program : The analysis phase breaks up the source program into constituent pieces and creates

an intermediate representation of the source program. Analysis consists of three phases:

Linear analysis (Lexical analysis or Scanning) :

The lexical analysis phase reads the characters in the source program and groups them into tokens, which are sequences of characters having a collective meaning.

Example : position := initial + rate * 60

Identifiers - position, initial, rate.

Assignment symbol - :=

Operators - + , *

Number - 60

Blanks - eliminated.
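The grouping of characters into tokens can be sketched in a few lines of Python. This is only an illustration, not the method prescribed by the notes: the token names (NUMBER, ID, ASSIGN, OP, SKIP) and the function `tokenize` are assumptions chosen for this example.

```python
import re

# Hypothetical token patterns for this toy example.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OP", r"[+*]"),
    ("SKIP", r"\s+"),  # blanks are recognized and then eliminated
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token kind, lexeme) pairs, dropping blanks."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize("position := initial + rate * 60")
```

For the example statement, `tokens` holds three ID tokens, one ASSIGN, two OP tokens and one NUMBER, in source order.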

Hierarchical analysis (Syntax analysis or Parsing) :

It involves grouping the tokens of the source program into grammatical

phrases that are used by the Compiler to synthesize output.

Example : position := initial + rate * 60

Parse tree (children are indented under their parent):

Assignment statement
    Identifier
        position
    :=
    Expression
        Expression
            identifier
                initial
        +
        Expression
            Expression
                identifier
                    rate
            *
            Expression
                number
                    60
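As a rough sketch of hierarchical analysis, the following Python code groups the tokens of the example into a nested tree, giving * higher precedence than +. The grammar, the function names and the tuple representation of the tree are assumptions made for this illustration, and the tokens are obtained with a plain `split()` rather than a real lexical analyzer.

```python
def parse_assignment(tokens):
    """Parse: identifier := expression. Returns a nested tuple tree."""
    target, assign, *rest = tokens
    if not target.isidentifier() or assign != ":=":
        raise SyntaxError("expected: identifier := expression")
    tree, pos = parse_expr(rest, 0)
    if pos != len(rest):
        raise SyntaxError(f"unexpected token {rest[pos]!r}")
    return (":=", ("identifier", target), tree)


def parse_expr(toks, pos):
    """expression -> term { + term }"""
    left, pos = parse_term(toks, pos)
    while pos < len(toks) and toks[pos] == "+":
        right, pos = parse_term(toks, pos + 1)
        left = ("+", left, right)
    return left, pos


def parse_term(toks, pos):
    """term -> factor { * factor }"""
    left, pos = parse_factor(toks, pos)
    while pos < len(toks) and toks[pos] == "*":
        right, pos = parse_factor(toks, pos + 1)
        left = ("*", left, right)
    return left, pos


def parse_factor(toks, pos):
    """factor -> identifier | number"""
    if pos >= len(toks):
        raise SyntaxError("unexpected end of input")
    tok = toks[pos]
    if tok.isdigit():
        return ("number", tok), pos + 1
    if tok.isidentifier():
        return ("identifier", tok), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")


tree = parse_assignment("position := initial + rate * 60".split())
```

The resulting `tree` has := at the root, the identifier position on the left, and on the right a + node whose right child is the * node for rate * 60.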


Semantic analysis :

This phase checks the source program for semantic errors and gathers type information for the subsequent code generation phase.

An important component of semantic analysis is type checking.

Example : int to real conversion.
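The int to real conversion can be illustrated with a small type-checking sketch. It assumes that position, initial and rate have been declared real and that a number written with digits only is an int; the function `coerce`, the `types` dictionary and the tuple tree shape are hypothetical names introduced for this example.

```python
def coerce(tree, types):
    """Return (new_tree, type), inserting inttoreal where an int meets a real."""
    kind = tree[0]
    if kind == "number":
        return tree, "int"          # assumption: digit-only literals are ints
    if kind == "identifier":
        return tree, types[tree[1]]  # declared type looked up by name
    # Otherwise the node is a binary operator: (op, left, right).
    op, left, right = tree
    left, left_type = coerce(left, types)
    right, right_type = coerce(right, types)
    if left_type == "int" and right_type == "real":
        left, left_type = ("inttoreal", left), "real"
    elif left_type == "real" and right_type == "int":
        right, right_type = ("inttoreal", right), "real"
    return (op, left, right), left_type


declared = {"position": "real", "initial": "real", "rate": "real"}
expr = ("+", ("identifier", "initial"),
             ("*", ("identifier", "rate"), ("number", "60")))
checked, result_type = coerce(expr, declared)
```

After the call, the number 60 is wrapped in an inttoreal node because its sibling rate is real, and the whole expression has type real.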

Phases of Compiler:

A Compiler operates in phases, each of which transforms the source program from

one representation to another.

Compilation has two parts (six phases). They are,

Analysis Phase (Three Phases)
    Lexical Analysis
    Syntax Analysis
    Semantic Analysis

Synthesis Phase (Three Phases)
    Intermediate Code Generation
    Code Optimizer
    Code Generator

Two other activities are
    Symbol Table Management
    Error Handler

Lexical Analysis :

The lexical analyzer is also called the scanner.

The lexical analysis phase reads the characters in the source program and groups them into tokens, which are sequences of characters having a collective meaning, such as an identifier, a keyword, a punctuation character, an operator or a multi-character operator like ++.

The character sequence forming a token is called the lexeme for the token.

Certain tokens will be augmented by a lexical value.

Example : position := initial + rate * 60 becomes id1 := id2 + id3 * 60

Blanks - eliminated.

(Figure: the subtree for rate * 60 after semantic analysis, in which the number 60 is wrapped in an inttoreal conversion.)


Syntax analysis:

This phase processes the string of descriptors (tokens) produced by the lexical analyzer to determine the syntactic structure of an input statement. This process is known as parsing.

The output of the parsing step is a representation of the syntactic structure of a statement. A convenient representation is a syntax tree.

Example : position := initial + rate * 60

Semantic analysis :

This phase checks the source program for semantic errors and gathers type information for the subsequent code generation phase.

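(Figure: Phases of a Compiler. Source Program -> Lexical Analyzer -> Syntax Analyzer -> Semantic Analyzer -> Intermediate Code Generator -> Code Optimizer -> Code Generator -> Target Program. The Symbol Table Management and Error Handler routines interact with all phases.)

(Figure: syntax tree for id1 := id2 + id3 * 60. The root := has children id1 and +; the + node has children id2 and *; the * node has children id3 and 60.)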


An important component of semantic analysis is type checking.

Example : int to real conversion.

Intermediate Code Generation:

The intermediate representation should be easy to produce.

It should be easy to translate into the target program.

Three address code consists of a sequence of instructions, each of which has at most three operands.

Example id1 := id2 + id3 * 60

Three address code as

temp1 := inttoreal (60)

temp2 := id3 * temp1

temp3 := id2 + temp2

id1 := temp3
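One way to obtain such a sequence is to walk the tree bottom-up and create a new temporary for every interior node. The sketch below assumes a tuple-based tree with the hypothetical node kinds "id", "num" and "inttoreal"; the function `gen_tac` is likewise an illustrative name, not part of the notes.

```python
import itertools


def gen_tac(tree):
    """Return the three-address instructions for an assignment tree."""
    code = []
    counter = itertools.count(1)

    def emit(expr):
        # Store the value of expr in a fresh temporary and return its name.
        tmp = f"temp{next(counter)}"
        code.append(f"{tmp} := {expr}")
        return tmp

    def walk(node):
        kind = node[0]
        if kind in ("id", "num"):
            return node[1]  # leaves need no instruction
        if kind == "inttoreal":
            operand = walk(node[1])
            return emit(f"inttoreal({operand})")
        op, left, right = node  # binary operator node
        left_name = walk(left)
        right_name = walk(right)
        return emit(f"{left_name} {op} {right_name}")

    _, target, expr = tree
    result = walk(expr)
    code.append(f"{target[1]} := {result}")
    return code


tree = (":=", ("id", "id1"),
        ("+", ("id", "id2"),
              ("*", ("id", "id3"), ("inttoreal", ("num", "60")))))
tac = gen_tac(tree)
```

For this tree, `tac` contains the same four instructions as the example above, each with at most three operands.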

Code Optimization:

This phase attempts to improve the intermediate code so that faster running machine code will result.

Example: the three address code after optimization is

temp1 := id3 * 60.0

id1 := id2 + temp1
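The two improvements in this example (doing the inttoreal conversion of a constant at compile time, and removing the final copy) can be sketched as a single pass over the instruction list. The tuple format (dest, op, arg1, arg2), the "copy" operator name and the function `optimize` are assumptions for this illustration. The copy removal is only safe under the further assumption that each temporary is used exactly once.

```python
def optimize(code):
    """code: list of (dest, op, arg1, arg2) tuples; returns an improved list."""
    consts = {}  # temporaries known to hold a compile-time constant
    out = []
    for dest, op, a, b in code:
        # Replace operands that are known constants.
        a = consts.get(a, a)
        b = consts.get(b, b)
        if op == "inttoreal" and a.isdigit():
            consts[dest] = f"{a}.0"  # convert the integer constant now
        elif op == "copy" and out and out[-1][0] == a:
            # The copied value was computed by the previous instruction:
            # let that instruction write to dest directly.
            out[-1] = (dest,) + out[-1][1:]
        else:
            out.append((dest, op, a, b))
    return out


code = [
    ("temp1", "inttoreal", "60", None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "copy", "temp3", None),
]
optimized = optimize(code)
```

The result has two instructions, temp2 := id3 * 60.0 and id1 := id2 + temp2. It matches the optimized code in the notes except that this sketch does not renumber the surviving temporary.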

Code Generation:

The final phase of the Compiler is the generation of target code, consisting of relocatable machine code or assembly code.

Example: assembly code for a hypothetical register machine (the F suffix marks floating-point instructions)

MOVF id3, R2

MULF #60.0, R2

MOVF id2, R1

ADDF R2, R1

MOVF R1, id1

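(Figure: tree after semantic analysis. The root := has children id1 and +; the + node has children id2 and *; the * node has children id3 and inttoreal; inttoreal has the child 60.)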


Symbol Table Management:

A Symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier.

When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis.

The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
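A minimal sketch of such a table, assuming a dictionary keyed by lexeme, is given below. The class `SymbolTable`, its method names and the id1, id2, ... naming of records are illustrative assumptions rather than a prescribed design.

```python
class SymbolTable:
    """One record per identifier, with fields for its attributes."""

    def __init__(self):
        self._records = {}  # lexeme -> record (a dict of attributes)

    def enter(self, lexeme):
        """Called by the lexical analyzer; returns the record for lexeme."""
        if lexeme not in self._records:
            index = len(self._records) + 1
            self._records[lexeme] = {"name": f"id{index}", "lexeme": lexeme}
        return self._records[lexeme]

    def set_attribute(self, lexeme, key, value):
        """Called by later phases once an attribute (e.g. the type) is known."""
        self._records[lexeme][key] = value

    def lookup(self, lexeme):
        """Return the record for lexeme, or None if it was never entered."""
        return self._records.get(lexeme)


table = SymbolTable()
for lexeme in ["position", "initial", "rate", "rate"]:
    table.enter(lexeme)  # the second "rate" reuses the existing record
table.set_attribute("rate", "type", "real")  # filled in by a later phase
```

Here the lexical analyzer only creates the records; the type of rate is added afterwards, mirroring the division of work described above.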

Error Handler:

Each phase can encounter errors.

The lexical phase can detect errors where the characters remaining in the input do not form any token of the language.

The syntax analysis phase can detect errors where the token stream violates the structure rules of the language.

During semantic analysis, the compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.

An intermediate code generator may detect an operator whose operands have incompatible types.

The code optimizer, doing control flow analysis, may detect that certain statements can never be reached.

While entering information into the symbol table, the book keeping routine may discover an identifier that has been multiply declared with contradicting attributes.


Write down the output of each phase for the expression position := initial + rate * 60

Source Program:
    position := initial + rate * 60

Lexical Analyzer output:
    id1 := id2 + id3 * 60

Syntax Analyzer output (children are indented under their parent):
    :=
        id1
        +
            id2
            *
                id3
                60

Semantic Analyzer output:
    :=
        id1
        +
            id2
            *
                id3
                inttoreal
                    60

Intermediate Code Generator output:
    temp1 := inttoreal (60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

Code Optimizer output:
    temp1 := id3 * 60.0
    id1 := id2 + temp1

Code Generator output (Target Program):
    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

The Symbol Table Management and Error Handler routines interact with all phases.


Cousins of the Compiler (Language Processing System) :

Preprocessors :

Preprocessors produce input to the Compiler. They may perform the following functions.

Macro Processing :

A preprocessor may allow a user to define macros that are shorthands for longer constructs.

File inclusion :

A preprocessor may include header files into the program text.

Rational preprocessors :

These preprocessors augment older languages with more modern flow of control and data structuring facilities.

Language extensions :

These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.

Compiler :

It converts the source program (HLL) into target program (LLL).

Assembler :

It converts an assembly language (LLL) into machine code.

Loader and Link Editors :

Loader :

The process of loading consists of taking relocatable machine code,

altering the relocatable addresses and placing the altered

instructions and data in memory at the proper locations.

Link Editor :

It allows us to make a single program from several files of

relocatable machine code.

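(Figure: Language Processing System. Skeletal Source Program -> Preprocessor -> Source Program -> Compiler -> Target Assembly Program -> Assembler -> Relocatable Machine Code -> Load/Link-editor -> Absolute Machine Code. The Load/Link-editor also takes Library and Relocatable Object Files as input.)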


Grouping of Phases :


COMPILER CONSTRUCTION TOOLS:

The compiler writer like any programmer can profitably use software tools such as

debuggers, version managers, profilers and so on.

Compiler construction tools are

Parser generators

Scanner generators

Syntax-directed translations engines

Automatic code generators

Dataflow engines

Parser Generators:

These produce syntax analyzers, normally from input that is based on a CFG (context-free grammar). In early Compilers, syntax analysis consumed not only a large fraction of the running time of a Compiler, but also a large fraction of the intellectual effort of writing a Compiler.

Eg: little languages such as PIC and EQN were implemented with the help of a parser generator.

Scanner Generator:

These automatically generate lexical analyzers, normally from a specification based on regular expressions. The basic organization of the resulting lexical analyzer is in effect a finite automaton.

Syntax-Directed Translation Engines:

These produce intermediate code with three address format, normally from

input that is based on the parse tree.

Automatic Code Generator:

It takes a collection of rules that define the translation of each operation of the

intermediate language into the machine language for the target machine.

The input specification for these systems may contain:

1. A description of the lexical and syntactic structure of the source language.

2. A description of what output is to be generated for each source language

construct.

3. A description of the target machine.

Dataflow Engines:

Much of the information needed to perform good code optimization involves

“dataflow analysis”, the gathering of information about how values are transmitted from

one part of a program to each other part.

Systems that help with the compiler-writing process have often been referred to as:

Compiler-compilers

Compiler-generators

Translator-writing systems


ROLE OF LEXICAL ANALYSER:

Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

On receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Its secondary tasks are,

1. One task is stripping out from the source program comments and white space in the form of blank, tab and new line characters.

2. Another task is correlating error messages from the compiler with the source program.

Sometimes the lexical analyzer is divided into two phases

1. Scanning

2. Lexical analysis

The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the more complex operations.

FUNCTIONS:

1. It produces the stream of tokens.

2. It eliminates blanks and comments.

3. It builds the symbol table, which stores information about the identifiers and constants encountered in the input.

4. It keeps track of line numbers.

5. It reports the errors encountered while recognizing the tokens.

ISSUES IN LEXICAL ANALYSIS:

There are several reasons for separating the analysis phase of compiling into

lexical analysis and parsing.

Simpler design.

Compiler efficiency is improved.

Compiler portability is enhanced.

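(Figure: the Source Program is read by the Lexical analyzer, which passes Tokens to the Parser; the Parser sends "Get next token" requests back to the Lexical analyzer; both interact with Symbol table Management.)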


TOKEN:

It is a sequence of characters that can be treated as a single logical entity.

Typical tokens are,

1. Identifiers

2. Keywords

3. Operators

4. Special symbols

5. Constants

PATTERN:

There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

LEXEME:

It is a sequence of characters in the source program that is matched by the pattern for a token.

INPUT BUFFERING :

During the analysis, the scanner scans the input string from left to right one character

at a time to identify tokens. It uses two pointers for doing this analysis

1. Begin pointer (to keep track of first character for each token).

2. Forward pointer(to keep track of next character)

Example input buffer (one character per cell): float a,b; a = A + 2;

Steps in Scanning the Input:

1. Initially, both the begin pointer (bp) and the forward pointer (fp) point to the first character of the lexeme.

2. The fp scans the buffer until a match with the pattern of a token is found.

3. Once the end of the lexeme is found (at a space or a delimiter), the fp marks the right end of the lexeme.


4. After processing the lexeme, both pointers are set to point to the character immediately after the lexeme.


5. This procedure is repeated for the entire source program.
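The scanning steps can be sketched with two integer indices standing in for the begin pointer and the forward pointer. The delimiter set and the function `scan` are assumptions made for this example; in this sketch fp stops one position past the last character of the lexeme.

```python
DELIMITERS = set(" ,;=+")  # hypothetical delimiter set for this example


def scan(buffer):
    """Split buffer into lexemes using a begin pointer and a forward pointer."""
    lexemes = []
    bp = 0
    while bp < len(buffer):
        if buffer[bp] == " ":  # skip blanks between lexemes
            bp += 1
            continue
        fp = bp  # step 1: both pointers start at the same character
        if buffer[fp] in DELIMITERS:
            fp += 1  # a delimiter is a one-character lexeme
        else:
            # step 2: advance fp until a space or a delimiter is seen
            while fp < len(buffer) and buffer[fp] not in DELIMITERS:
                fp += 1
        lexemes.append(buffer[bp:fp])  # step 3: fp marks the right end
        bp = fp  # step 4: continue just after the lexeme
    return lexemes


lexemes = scan("float a,b; a=A+2;")
```

The loop repeats until the whole input has been consumed, which corresponds to step 5.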

Input strings are usually stored in buffer.

Two Types:

1. One buffer scheme

2. Two buffer scheme

One Buffer Scheme:

Only one buffer of size ‘N’ is used.

The first N characters of the input string are read into the buffer. When the fp reaches the end of the buffer, the buffer is refilled with the next set of N characters.

Drawbacks:

The problem with this implementation is that when the size of the token is greater than 'N', this scheme fails to produce the token.

Two Buffer Scheme:

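Two N character buffers are used.

(Figure: a buffer divided into a first half of size N and a second half of size N, each terminated by eof, holding the input float a,b; a = a + 2, with bp and fp marking the current lexeme.)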


The first N characters are read into the first half of the buffer. If fewer than N characters are available to fill the half, a special character called EOF is inserted to indicate the end of the input.

When the pointer reaches the end of the first half, the second half is loaded with the next N characters of the same program.

When the pointer reaches the end of the second half, the first half is loaded with the next N characters of the input.

Algorithm for advancing the fp :

    if fp is at the end of first half then
    begin
        Load second half;
        Increment fp by 1;
    end
    else if fp is at the end of second half then
    begin
        Load first half;
        Set fp to first character of first half;
    end
    else
        Increment fp by 1;

With this algorithm, every advance of fp has to check whether fp has reached the end of a buffer half.

To reduce the number of comparisons, a special character called sentinel character

(usually EOF) is introduced at ends of the buffer halves.

Algorithm for advancing the fp using Sentinel:

    fp = fp + 1;
    if character at fp = eof then
    begin
        if fp is at the end of first half then
        begin
            Load second half;
            Increment fp by 1;
        end
        else if fp is at the end of second half then
        begin
            Load first half;
            Set fp to first character of first half;
        end
        else
            Terminate lexical analysis;
    end
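The sentinel version can be turned into a small runnable sketch. The class `TwoBufferReader`, the helper `read_all`, the choice of "\0" as the eof sentinel and the default half size of 4 are all assumptions made for this illustration; it also assumes that the sentinel character never occurs in the source text.

```python
import io

EOF = "\0"  # sentinel; assumed never to occur in the source text


class TwoBufferReader:
    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [0..n-1] first half, [n] sentinel,
        #         [n+1..2n] second half, [2n+1] sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.fp = -1

    def _load(self, start):
        """Read up to n characters into the half beginning at start."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def advance(self):
        """Move fp forward one character; return it, or None at end of input."""
        self.fp += 1
        if self.buf[self.fp] == EOF:  # the only test on the common path
            if self.fp == self.n:  # end of first half
                self._load(self.n + 1)
                self.fp += 1
            elif self.fp == 2 * self.n + 1:  # end of second half
                self._load(0)
                self.fp = 0
            else:  # eof inside a half: the input is exhausted
                self.fp -= 1
                return None
            if self.buf[self.fp] == EOF:  # the reloaded half was empty
                self.fp -= 1
                return None
        return self.buf[self.fp]


def read_all(text, n=4):
    """Read text through the two-buffer scheme and return what was seen."""
    reader = TwoBufferReader(io.StringIO(text), n)
    chars = []
    while (ch := reader.advance()) is not None:
        chars.append(ch)
    return "".join(chars)
```

For ordinary characters, `advance` performs a single comparison against the sentinel; the tests for the ends of the halves run only when the sentinel is actually seen. This sketch tracks fp only, so it does not protect a lexeme that starts in a half that is about to be reloaded.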


Refer to the following topics from Theory of Computation

1. Finite Automata

2. DFA

3. NFA

4. Regular Expression

5. Converting R.E into NFA

6. Converting NFA with ε-moves into NFA and DFA

7. Minimization of DFA.