CS6660 Compiler Design - Appasami


    COMPILER

    DESIGN

    G. Appasami, M.Sc., M.C.A., M.Phil., M.Tech., (Ph.D.)

    Assistant Professor

    Department of Computer Science and Engineering

Dr. Paul's Engineering College

    Pauls Nagar, Villupuram

    Tamilnadu, India

    SARUMATHI PUBLICATIONS


    Villupuram, Tamilnadu, India


    First Edition: July 2015

    Second Edition: April 2016

    Published By

    SARUMATHI PUBLICATIONS

© All rights reserved. No part of this publication may be reproduced or stored in any form or by any means of photocopy, recording or otherwise without the prior written permission of the author.

    Price Rs. 101/-

    Copies can be had from

    SARUMATHI PUBLICATIONS

    Villupuram, Tamilnadu, India.

    [email protected]

    Printed at

    Meenam Offset


    Pondicherry –  605001, India


    CS6660 COMPILER DESIGN L T P C 3 0 0 3

    UNIT I INTRODUCTION TO COMPILERS 5

Translators - Compilation and Interpretation - Language processors - The Phases of Compiler - Errors Encountered in Different Phases - The Grouping of Phases - Compiler Construction Tools - Programming Language basics.

    UNIT II LEXICAL ANALYSIS 9

Need and Role of Lexical Analyzer - Lexical Errors - Expressing Tokens by Regular Expressions - Converting Regular Expression to DFA - Minimization of DFA - Language for Specifying Lexical Analyzers - LEX - Design of Lexical Analyzer for a sample Language.

    UNIT III SYNTAX ANALYSIS 10

Need and Role of the Parser - Context Free Grammars - Top Down Parsing - General Strategies - Recursive Descent Parser - Predictive Parser - LL(1) Parser - Shift Reduce Parser - LR Parser - LR(0) Item - Construction of SLR Parsing Table - Introduction to LALR Parser - Error Handling and Recovery in Syntax Analyzer - YACC - Design of a Syntax Analyzer for a Sample Language.

    UNIT IV SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT 12

Syntax directed Definitions - Construction of Syntax Tree - Bottom-up Evaluation of S-Attribute Definitions - Design of predictive translator - Type Systems - Specification of a simple type checker - Equivalence of Type Expressions - Type Conversions.

    RUN-TIME ENVIRONMENT: Source Language Issues-Storage Organization-Storage

Allocation - Parameter Passing - Symbol Tables - Dynamic Storage Allocation - Storage Allocation in FORTRAN.

    UNIT V CODE OPTIMIZATION AND CODE GENERATION 9

Principal Sources of Optimization - DAG - Optimization of Basic Blocks - Global Data Flow Analysis - Efficient Data Flow Algorithms - Issues in Design of a Code Generator - A Simple Code Generator Algorithm.

    TOTAL: 45 PERIODS

    TEXTBOOK:

    1. Alfred V Aho, Monica S. Lam, Ravi Sethi and Jeffrey D Ullman, “Compilers  – 

    Principles, Techniques and Tools”, 2nd Edition, Pearson Education, 2007.

    REFERENCES:

    1. Randy Allen, Ken Kennedy, “Optimizing Compilers for Modern Architectures: A

    Dependence-based Approach”, Morgan Kaufmann Publishers, 2002.

2. Steven S. Muchnick, “Advanced Compiler Design and Implementation”, Morgan Kaufmann Publishers - Elsevier Science, India, Indian Reprint 2003.

3. Keith D Cooper and Linda Torczon, “Engineering a Compiler”, Morgan Kaufmann

    Publishers Elsevier Science, 2004.

    4. Charles N. Fischer, Richard. J. LeBlanc, “Crafting a Compiler with C”, Pearson

    Education, 2008.


    Acknowledgement

I am very much grateful to the management of Paul's Educational Trust, respected Principal Dr. Y. R. M. Rao, M.E., Ph.D., cherished Dean Dr. E. Mariappane, M.E., Ph.D., and helpful Head of the Department Mr. M. G. Lavakumar, M.E., (Ph.D.).

I thank my colleagues and friends for their cooperation and their support in my career venture.

    I thank my parents and family members for their valuable support in completion of the book successfully.

I express my special thanks to SARUMATHI PUBLICATIONS for their continued cooperation in shaping the work.

Suggestions and comments to improve the text are very much solicited.

    Mr. G. Appasami


    TABLE OF CONTENTS

    UNIT I INTRODUCTION TO COMPILERS

    1.1 Translators 1.1

    1.2 Compilation and Interpretation 1.1

    1.3 Language processors 1.1

    1.4 The Phases of Compiler 1.3

    1.5 Errors Encountered in Different Phases 1.8

    1.6 The Grouping of Phases 1.9

    1.7 Compiler Construction Tools 1.10

    1.8 Programming Language basics 1.10

    UNIT II LEXICAL ANALYSIS

    2.1 Need and Role of Lexical Analyzer 2.1

    2.2 Lexical Errors 2.3

    2.3 Expressing Tokens by Regular Expressions 2.3

    2.4 Converting Regular Expression to DFA 2.6

    2.5 Minimization of DFA 2.9

    2.6 Language for Specifying Lexical Analyzers-LEX 2.10

    2.7 Design of Lexical Analyzer for a sample Language 2.12

    UNIT III SYNTAX ANALYSIS

    3.1 Need and Role of the Parser 3.1

    3.2 Context Free Grammars 3.1

    3.3 Top Down Parsing -General Strategies 3.9

    3.4 Recursive Descent Parser 3.10

3.5 Predictive Parser 3.11

3.6 LL(1) Parser 3.12

    3.7 Shift Reduce Parser 3.14

    3.8 LR Parser 3.15

    3.9 LR (0) Item 3.17

    3.10 Construction of SLR Parsing Table 3.18

    3.11 Introduction to LALR Parser 3.22

    3.12 Error Handling and Recovery in Syntax Analyzer 3.26

    3.13 YACC 3.27

    3.14 Design of a syntax Analyzer for a Sample Language 3.29


    UNIT IV SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT

    4.1 Syntax directed Definitions 4.1

    4.2 Construction of Syntax Tree 4.2

4.3 Bottom-up Evaluation of S-Attribute Definitions 4.3

4.4 Design of predictive translator 4.6

    4.5 Type Systems 4.7

    4.6 Specification of a simple type checker 4.8

    4.7 Equivalence of Type Expressions 4.10

    4.8 Type Conversions 4.14

    4.9 RUN-TIME ENVIRONMENT: Source Language Issues 4.16

    4.10 Storage Organization 4.19

    4.11 Storage Allocation 4.21

    4.12 Parameter Passing 4.23

4.13 Symbol Tables 4.24

4.14 Dynamic Storage Allocation 4.28

    4.15 Storage Allocation in FORTAN 4.29

    UNIT V CODE OPTIMIZATION AND CODE GENERATION

    5.1 Principal Sources of Optimization 5.1

    5.2 DAG 5.8

    5.3 Optimization of Basic Blocks 5.9

    5.4 Global Data Flow Analysis 5.15

5.5 Efficient Data Flow Algorithms 5.19

5.6 Issues in Design of a Code Generator 5.21

    5.7 A Simple Code Generator Algorithm 5.24


    UNIT I INTRODUCTION TO COMPILERS

    1.1 TRANSLATORS

A translator is a program that takes a program in one form (input) and converts it into another form (output). The input program is written in the source language and the output program in the target language.

The source language can be a low level language like assembly language or a high level language like C, C++, JAVA, FORTRAN, and so on.

The target language can be a low level language (assembly language) or a machine language (the set of instructions executed directly by a CPU).

Source language → Translator → Target language

    Figure 1.1: Translator

    Types of Translators are:

    (1). Compilers

    (2). Interpreters

    (3). Assemblers

    1.2 COMPILATION AND INTERPRETATION

A compiler is a program that reads a program in one language and translates it into an equivalent program in another language. The translation done by a compiler is called compilation.

An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user. An interpreter executes the source program statement by statement. The translation done by an interpreter is called interpretation.

1.3 LANGUAGE PROCESSORS

    (i) Compiler

A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language). Compilation is shown in Figure 1.2.

Source program (input) → Compiler → Target program (output)

    Figure 1.2: A Compiler

An important role of the compiler is to report any errors in the source program that it detects during the translation process. If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.


Input → Target program → Output

    Figure 1.3: Running the target program


    (ii) Interpreter

An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user, as shown in Figure 1.4.

Source program + Input → Interpreter → Output

    Figure 1.4: An interpreter

The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.

A compiler converts the source to the target completely, but an interpreter executes the source program statement by statement. Usually an interpreter gives better error diagnostics than a compiler.

    (iii) Hybrid Compiler

A hybrid compiler is a combination of compilation and interpretation. Java language processors combine compilation and interpretation as shown in Figure 1.5.

A Java source program is first compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine.

A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted on another machine.

Source program → Translator → Intermediate program; Intermediate program + Input → Virtual Machine → Output

    Figure 1.5: A hybrid compiler

    In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time compilers, translate the bytecodes into machine language immediately before they run.

(iv) Language processing system

In addition to a compiler, several other programs may be required to create an executable target program, as shown in Figure 1.6.

Preprocessor: The preprocessor collects the source program, which may be divided into modules and stored in separate files. The preprocessor may also expand shorthands called macros into source language statements, e.g., #include, #define PI 3.14.
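For instance (a sketch of typical preprocessor behavior; the function name and constant are invented), a macro is expanded in place before the compiler proper runs:

/* Before preprocessing: */
#define PI 3.14
float area(float r) {
    return PI * r * r;      /* macro use */
}

/* After preprocessing, the compiler proper sees (roughly): */
float area(float r) {
    return 3.14 * r * r;    /* PI expanded in place */
}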

Compiler: The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug.


Assembler: The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.


Linker: The linker resolves external memory addresses, where the code in one file may refer to a location in another file. Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine.

Loader: The loader then puts together all of the executable object files into memory for execution. It also performs relocation of the object code.

Figure 1.6: A language-processing system

Note: Preprocessors, assemblers, linkers and loaders are collectively called the cousins of the compiler.

1.4 THE PHASES OF COMPILER / STRUCTURE OF COMPILER

The process of compilation is carried out in two parts: analysis and synthesis. The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.

It then uses this structure to create an intermediate representation of the source program. The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.

The analysis part is carried out in three phases: lexical analysis, syntax analysis and semantic analysis. The analysis part is often called the front end of the compiler. The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.

The synthesis part is carried out in three phases: intermediate code generation, code optimization and code generation. The synthesis part is called the back end of the compiler.


    Figure 1.7: Phases of a compiler

    1.4.1 Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning or linear analysis. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.

For each lexeme, the lexical analyzer produces as output a token of the form ⟨token-name, attribute-value⟩.

The first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.

    For example, suppose a source program contains the assignment statement

    position = initial + rate * 60 (1.1)


    Figure 1.8: Translation of an assignment statement

The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens:

(1) position is a lexeme that would be mapped into the token ⟨id, 1⟩, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.

(2) The assignment symbol = is a lexeme that is mapped into the token ⟨=⟩.

(3) initial is a lexeme that is mapped into the token ⟨id, 2⟩.

(4) + is a lexeme that is mapped into the token ⟨+⟩.

(5) rate is a lexeme that is mapped into the token ⟨id, 3⟩.

(6) * is a lexeme that is mapped into the token ⟨*⟩.

(7) 60 is a lexeme that is mapped into the token ⟨60⟩.

Blanks separating the lexemes would be discarded by the lexical analyzer. After lexical analysis, the sequence of tokens produced is:

⟨id, 1⟩ ⟨=⟩ ⟨id, 2⟩ ⟨+⟩ ⟨id, 3⟩ ⟨*⟩ ⟨60⟩        (1.2)
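As a rough C sketch (not the book's code; the type and member names are invented), the token stream (1.2) can be modeled as an array of small structs:

#include <stdio.h>

/* Token names; ID carries a symbol-table index as its attribute,
   NUM carries the numeric value. */
typedef enum { ID, ASSIGN, PLUS, TIMES, NUM } TokenName;

typedef struct {
    TokenName name;   /* abstract symbol used during syntax analysis      */
    int attribute;    /* symbol-table entry for ID, literal value for NUM */
} Token;

int main(void) {
    /* The token stream (1.2) for: position = initial + rate * 60 */
    Token stream[] = { {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0},
                       {ID, 3}, {TIMES, 0},  {NUM, 60} };
    for (int i = 0; i < 7; i++)
        printf("<%d, %d>\n", stream[i].name, stream[i].attribute);
    return 0;
}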


    1.4.2 Syntax Analysis

    The second phase of the compiler is syntax analysis or parsing or hierarchical analysis.

The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.

The hierarchical tree structure generated in this phase is called a parse tree or syntax tree.

In a syntax tree, each interior node represents an operation and the children of the node represent the arguments of the operation.

    Figure 1.9: Syntax tree for position = initial + rate * 60

The tree has an interior node labeled * with ⟨id, 3⟩ as its left child and the integer 60 as its right child. The node ⟨id, 3⟩ represents the identifier rate. Similarly, ⟨id, 1⟩ and ⟨id, 2⟩ are represented in the tree. The root of the tree, labeled =, indicates that we must store the result of this addition into the location for the identifier position.

    1.4.3 Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.

It ensures the correctness of the program; matching of parentheses is also done in this phase.

It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.

The compiler must report an error if a floating-point number is used to index an array. The language specification may permit some type conversions, such as integer to float for float addition; these conversions are called coercions.

The operator * is applied to a floating-point number rate and an integer 60. The integer may be converted into a floating-point number by the operator inttofloat explicitly, as shown in the figure.

    Figure 1.10: Semantic tree for position = initial + rate * 60

    1.4.4 Intermediate Code Generation

After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation.

The intermediate representation has two important properties:

a. It should be easy to produce.

b. It should be easy to translate into the target machine.


Three-address code is one of the intermediate representations; it consists of a sequence of assembly-like instructions with three operands per instruction. Each operand can act like a register.

The output of the intermediate code generator in Figure 1.8 consists of the three-address code sequence for position = initial + rate * 60:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3        (1.3)
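Internally, such three-address instructions are often stored as quadruples: an operator plus up to three operands. A minimal C sketch, with invented field names and sizes:

/* Quadruple form of a three-address instruction: result = arg1 op arg2. */
typedef struct {
    char op[12];     /* e.g. "inttofloat", "*", "+", "="  */
    char arg1[8];    /* first operand                     */
    char arg2[8];    /* second operand, "" if unused      */
    char result[8];  /* destination temporary or variable */
} Quad;

/* The sequence (1.3) as quadruples: */
Quad code[] = {
    {"inttofloat", "60",  "",   "t1"},
    {"*",          "id3", "t1", "t2"},
    {"+",          "id2", "t2", "t3"},
    {"=",          "t3",  "",   "id1"},
};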

    1.4.5 Code Optimization

The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster.

Optimization has to improve the efficiency of the code so that the running time and memory consumption of the target program can be reduced. The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0.

Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform (1.3) into the shorter sequence

    t1 = id3 * 60.0

    id1 = id2 + t1 (1.4)

    1.4.6 Code Generation

The code generator takes as input an intermediate representation of the source program and maps it into the target language.

If the target language is machine code, then registers or memory locations are selected for each of the variables used by the program.

The intermediate instructions are translated into sequences of machine instructions.

For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into the machine code:

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1        (1.5)

The first operand of each instruction specifies a destination. The F in each instruction tells us that it deals with floating-point numbers.

The code in (1.5) loads the contents of address id3 into register R2, then multiplies it with the floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement (1.1).


    1.4.7 Symbol-Table Management

The symbol table, which stores information about the entire source program, is used by all phases of the compiler.

An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.

These attributes may provide information about the storage allocated for a name, its type, and its scope.

In the case of procedure names, such things as the number and types of the arguments, the method of passing each argument (for example, by value or by reference), and the type returned are maintained in the symbol table.

The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.

A symbol table can be implemented in one of the following ways:

o Linear (sorted or unsorted) list
o Binary search tree
o Hash table

Among these, symbol tables are mostly implemented as hash tables, where the source code symbol itself is treated as a key for the hash function and the return value is the information about the symbol (a minimal sketch follows the list below).

A symbol table may serve the following purposes depending upon the language in hand:

o To store the names of all entities in a structured form at one place.
o To verify if a variable has been declared.
o To implement type checking, by verifying assignments and expressions.
o To determine the scope of a name (scope resolution).
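A minimal C sketch of the hash-table implementation just described (names and sizes are invented for illustration; this is not the book's code):

#include <string.h>
#include <stdlib.h>

#define TABLE_SIZE 211            /* a prime number of buckets */

struct entry {                    /* one record per name */
    char name[32];
    char type[16];                /* an attribute of the name, e.g. its type */
    struct entry *next;           /* chaining resolves hash collisions */
};
static struct entry *table[TABLE_SIZE];

/* The source-code symbol itself is the key for the hash function. */
static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Returns the record for name, or NULL if it was never entered. */
struct entry *lookup(const char *name) {
    for (struct entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;
}

/* Enters a new record at the head of its bucket's chain. */
struct entry *insert(const char *name, const char *type) {
    struct entry *e = calloc(1, sizeof *e);
    unsigned h = hash(name);
    strncpy(e->name, name, sizeof e->name - 1);
    strncpy(e->type, type, sizeof e->type - 1);
    e->next = table[h];
    table[h] = e;
    return e;
}

Here collisions are resolved by chaining; a linear list or binary search tree would expose the same lookup/insert interface, only with slower lookup.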

    1.5 ERRORS ENCOUNTERED IN DIFFERENT PHASES

An important role of the compiler is to report any errors in the source program that it detects during the entire translation process.

Each phase of the compiler can encounter errors; after errors are detected, they must be corrected for the compilation process to proceed.

The syntax and semantic phases handle a large number of errors in the compilation process.

The error handler handles all types of errors: lexical errors, syntax errors, semantic errors and logical errors.

    Lexical errors:

The lexical analyzer detects errors in the input characters, such as the name of a keyword or identifier typed incorrectly.

    Example: switch is written as swich.

    Syntax errors:

Syntax errors are detected by the syntax analyzer.

Errors like a missing semicolon or unbalanced parentheses.

Example: ((a+b* (c-d)). In this statement a ) is missing after b.

    Semantic errors:

Data type mismatch errors are handled by the semantic analyzer.

Incompatible data type value assignment.

Example: Assigning a string value to an integer.

    Logical errors:

Code not reachable and infinite loops.

Misuse of operators. Code written after the end of the main() block.


1.6 THE GROUPING OF PHASES

Each phase deals with the logical organization of a compiler.

Activities of several phases may be grouped together into a pass that reads an input file and writes an output file.

The front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped together into one pass. Code optimization might be an optional pass.

A back-end pass consists of code generation for a particular target machine.

Source program (input)
    ↓
Front end (source language dependent, machine independent):
    Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator
    ↓
Intermediate code
    ↓
Back end (machine dependent):
    Code Optimizer (optional) → Code Generator
    ↓
Target program (output)

    Figure 1.11: The Grouping of Phases of compiler

Some compiler collections have been created around carefully designed intermediate representations that allow the front end for a particular language to interface with the back end for a certain target machine.

    Advantages:

With these collections, we can produce compilers for different source languages for one target machine by combining different front ends.

Similarly, we can produce compilers for one source language on different target machines by combining the front end with back ends for different target machines.


1.7 COMPILER CONSTRUCTION TOOLS

The compiler writer, like any software developer, can profitably use modern software development environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses, and so on.

Writing a compiler is a tedious and time-consuming task; there are specialized tools to implement the various phases of a compiler. These tools are called compiler construction tools. Some commonly used compiler-construction tools are given below:

Scanner generators [Lexical Analysis]
Parser generators [Syntax Analysis]
Syntax-directed translation engines [Intermediate Code]
Data-flow analysis engines [Code Optimization]
Code-generator generators [Code Generation]
Compiler-construction toolkits [For all phases]

1. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a language. Unix has a tool for scanner generation called LEX.

2. Parser generators that automatically produce syntax analyzers (parse trees) from a grammatical description of a programming language. Unix has a tool called YACC which is a parser generator.

3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating intermediate code.

4. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of a program to each other part. Data-flow analysis is a key part of code optimization.

5. Code-generator generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.

    1.8 PROGRAMMING LANGUAGE BASICS.

To design an efficient compiler we should know some language basics. Important concepts from popular programming languages like C, C++, C#, and Java are listed below.

Some of the programming language basics which are used in most of the languages are:

The Static/Dynamic Distinction
Environments and States
Static Scope and Block Structure
Explicit Access Control
Dynamic Scope
Parameter Passing Mechanisms
Aliasing

    1.8.1 The Static/Dynamic Distinction

A language uses a static policy if an issue can be decided at compile time. On the other hand, a policy that only allows a decision to be made when we execute the program is said to be a dynamic policy, or to require a decision at run time.

The scope of a declaration of x is the region of the program in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program. Otherwise, the language uses dynamic scope. With dynamic scope, as the program runs, the same use of x could refer to any of several different declarations of x.

Example: consider the use of the term "static" as it applies to data in a Java class declaration. In Java, a variable is a name for a location in memory used to hold a data value. Here, "static" refers not to the scope of the variable, but rather to the ability of the compiler to determine the location in memory where the declared variable can be found. A declaration like

public static int x;

makes x a class variable and says that there is only one copy of x, no matter how many objects of this class are created. Moreover, the compiler can determine a location in memory where this integer x will be held. In contrast, had "static" been omitted from this declaration, then each object of the class would have its own location where x would be held, and the compiler could not determine all these places in advance of running the program.

    1.8.2 Environments and States

As a program runs, the values of its data elements, and the interpretation of the names for that data, change. For example, the execution of an assignment such as x = y + 1 changes the value denoted by the name x. More specifically, the assignment changes the value in whatever location is denoted by x.

The location denoted by x can change at run time. If x is not a static (or "class") variable, then every object of the class has its own location for an instance of variable x. In that case, the assignment to x can change any of those "instance" variables, depending on the object to which a method containing that assignment is applied.

names --(environment)--> locations (variables) --(state)--> values

The association of names with locations in memory (the store) and then with values can be described by two mappings that change as the program runs:

1. The environment is a mapping from names to locations in the store. Since variables refer to locations ("l-values" in the terminology of C), we could alternatively define an environment as a mapping from names to variables.

2. The state is a mapping from locations in the store to their values. That is, the state maps l-values to their corresponding r-values, in the terminology of C.

    Environments change according to the scope rules of a language.

Example: Consider the C program fragment below. Integer i is declared a global variable, and also declared as a variable local to function f. When f is executing, the environment adjusts so that name i refers to the location reserved for the i that is local to f, and any use of i, such as the assignment i = 3 shown explicitly, refers to that location.

  • 8/17/2019 Cs6660 Compiler Design Appasami

    25/189

    CS6660 __ Compiler Design Unit I _____1.12

    Typically, the local i is given a place on the run-time stack.

int i;      /* global i */
...
void f(...) {
    int i;  /* local i */
    i = 3;  /* use of local i */
}
...
x = i + 1;  /* use of global i, in some other function */

Whenever a function g other than f is executing, uses of i cannot refer to the i that is local to f. Uses of name i in g must be within the scope of some other declaration of i. An example is the explicitly shown statement x = i+1, which is inside some procedure whose definition is not shown. The i in i + 1 presumably refers to the global i.

    1.8.3 Static Scope and Block Structure

The scope rules for C are based on program structure; the scope of a declaration is determined implicitly by where the declaration appears in the program. Later languages, such as C++, Java, and C#, also provide explicit control over scopes through the use of keywords like public, private, and protected.

A block is a grouping of declarations and statements. C uses braces { and } to delimit a block; the alternative is the use of begin and end in some languages.

Example: The C++ program in Figure 1.12 has four blocks, with several definitions of variables a and b. As a memory aid, each declaration initializes its variable to the number of the block to which it belongs.

Output:
3 2
1 4
1 2
1 1

    Figure 1.12: Blocks in a C++ program
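The program of Figure 1.12 is not reproduced here; the following C++ reconstruction is consistent with the description above (each declaration initializes its variable to the number of its block) and with the output shown:

#include <iostream>
using namespace std;

int main() {
    int a = 1;                           // block B1
    int b = 1;
    {
        int b = 2;                       // block B2
        {
            int a = 3;                   // block B3
            cout << a << " " << b << endl;   // prints 3 2
        }
        {
            int b = 4;                   // block B4
            cout << a << " " << b << endl;   // prints 1 4
        }
        cout << a << " " << b << endl;       // prints 1 2
    }
    cout << a << " " << b << endl;           // prints 1 1
    return 0;
}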


Consider the declaration int a = 1 in block B1. Its scope is all of B1, except for those blocks nested within B1 that have their own declaration of a. B2, nested immediately within B1, does not have a declaration of a, but B3 does. B4 does not have a declaration of a, so block B3 is the only place in the entire program that is outside the scope of the declaration of the name a that belongs to B1. That is, this scope includes B4 and all of B2 except for the part of B2 that is within B3. The scopes of all five declarations are summarized in Figure 1.13.

    Figure 1.13: Scopes of declarations

    1.8.4 Explicit Access Control

Classes and structures introduce a new scope for their members. If p is an object of a class with a field (member) x, then the use of x in p.x refers to field x in the class definition. The scope of a member declaration x in a class C extends to any subclass C', except if C' has a local declaration of the same name x.

Through the use of keywords like public, private, and protected, object-oriented languages such as C++ or Java provide explicit control over access to member names in a superclass. These keywords support encapsulation by restricting access.

Thus, private names are purposely given a scope that includes only the method declarations and definitions associated with that class and any "friend" classes (the C++ term). Protected names are accessible to subclasses. Public names are accessible from outside the class.

    1.8.5 Dynamic Scope

Technically, any scoping policy is dynamic if it is based on factors that can be known only when the program executes. The term dynamic scope, however, usually refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration.

    Dynamic scoping of this type appears only in special situations.

We shall consider two examples of dynamic policies: macro expansion in the C preprocessor and method resolution in object-oriented programming.

Example: In the C program below, identifier a is a macro that stands for the expression (x + 1). But we cannot resolve x statically, that is, in terms of the program text.

#define a (x+1)
int x = 2;
void b() { int x = 1; printf("%d\n", a); }
void c() { printf("%d\n", a); }
void main() { b(); c(); }

In fact, in order to interpret x, we must use the usual dynamic-scope rule. The function main first calls function b. As b executes, it prints the value of the macro a. Since (x + 1) must be substituted for a, we resolve this use of x to the declaration int x = 1 in function b. The reason is that b has a declaration of x, so the (x + 1) in the printf in b refers to this x. Thus, the value printed is 1.


After b finishes and c is called, we again need to print the value of macro a. However, the only x accessible to c is the global x. The printf statement in c thus refers to this declaration of x, and the value 2 is printed.

    1.8.6 Parameter Passing Mechanisms

All programming languages have a notion of a procedure, but they can differ in how these procedures get their arguments. The actual parameters (the parameters used in the call of a procedure) are associated with the formal parameters (those used in the procedure definition).

In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable). The value is placed in the location belonging to the corresponding formal parameter of the called procedure. This method is used in C and Java.

In call-by-reference, the address of the actual parameter is passed to the callee as the value of the corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented by following this pointer to the location indicated by the caller. Changes to the formal parameter thus appear as changes to the actual parameter.

A third mechanism, call-by-name, was used in the early programming language Algol 60. It requires that the callee execute as if the actual parameter were substituted literally for the formal parameter in the code of the callee, as if the formal parameter were a macro standing for the actual parameter.
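A short C sketch contrasting the first two mechanisms (call-by-name has no direct C equivalent; the function names here are invented):

#include <stdio.h>

void by_value(int x)      { x = 99; }   /* modifies a private copy        */
void by_reference(int *x) { *x = 99; }  /* follows the caller's address   */

int main(void) {
    int a = 1, b = 1;
    by_value(a);               /* a's value is copied into x     */
    by_reference(&b);          /* b's address is passed          */
    printf("%d %d\n", a, b);   /* prints: 1 99                   */
    return 0;
}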

    1.8.7 Aliasing

There is an interesting consequence of call-by-reference parameter passing or its simulation, as in Java, where references to objects are passed by value. It is possible for two formal parameters to refer to the same location; such variables are said to be aliases of one another. As a result, any two variables, which may appear to take their values from two distinct formal parameters, can become aliases of each other.

Example: Suppose a is an array belonging to a procedure p, and p calls another procedure q(x, y) with a call q(a, a). Suppose also that parameters are passed by value, but that array names are really references to the location where the array is stored, as in C or similar languages. Now, x and y have become aliases of each other. The important point is that if within q there is an assignment x[10] = 2, then the value of y[10] also becomes 2.
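In C, the situation of this example looks as follows (array arguments decay to pointers, so x and y alias the same storage; the names are taken from the example):

#include <stdio.h>

void q(int x[], int y[]) {
    x[10] = 2;               /* a write through x ...                  */
    printf("%d\n", y[10]);   /* ... is visible through y: prints 2     */
}

int main(void) {
    int a[20] = {0};
    q(a, a);                 /* x and y become aliases of a            */
    return 0;
}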


    UNIT II LEXICAL ANALYSIS

    2.1 NEED AND ROLE OF LEXICAL ANALYZER

Lexical analysis is the first phase of the compiler. It reads the input characters from left to right, one character at a time, from the source program.

It generates the sequence of tokens for each lexeme. Each token is a logical cohesive unit such as an identifier, keyword, operator or punctuation mark.

It needs to enter each lexeme into the symbol table and also reads from the symbol table. These interactions are suggested in Figure 2.1.

    Figure 2.1: Interactions between the lexical analyzer and the parser

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blanks, newlines, tabs). Another task is correlating error messages generated by the compiler with the source program.

Needs / Roles / Functions of the lexical analyzer:

- It produces a stream of tokens.
- It eliminates comments and whitespace.
- It keeps track of line numbers.
- It reports the errors encountered while generating tokens.
- It stores information about identifiers, keywords, constants and so on into the symbol table.

Lexical analyzers are divided into two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis is the more complex portion, where the scanner produces the sequence of tokens as output.

Lexical Analysis versus Parsing / Issues in Lexical Analysis

1. Simplicity of design: this is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify these tasks; whitespace and comments are removed by the lexical analyzer.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Tokens, Patterns, and Lexemes

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of single lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. Operators, special symbols and constants are also typical tokens.

A pattern is a description of the form that the lexemes of a token may take; a pattern is a set of rules that describe the token. A lexeme is a sequence of characters in the source program that matches the pattern for a token.

Table 2.1: Tokens and Lexemes

TOKEN        INFORMAL DESCRIPTION (PATTERN)              SAMPLE LEXEMES
if           characters i, f                             if
else         characters e, l, s, e                       else
comparison   < or > or <= or >= or == or !=              <=, !=
id           letter followed by letters and digits       pi, score
number       any numeric constant                        3.14159, 0

Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an attribute value. In this example, the token number has been given an integer-valued attribute.


    2.2 LEXICAL ERRORS

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.

Consider the C program statement fi ( a == f(x)). The lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser.

Suppose, however, the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.

    Other possible error-recovery actions are:

    1. Delete one character from the remaining input.

    2. Insert a missing character into the remaining input.

    3. Replace a character by another character.

4. Transpose two adjacent characters.

Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation.

In practice most lexical errors involve a single character. A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes.
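A C sketch of the single-transformation test (is_valid_lexeme is a hypothetical check that would really run the token DFA; only deletion and transposition are shown, insertion and replacement being analogous):

#include <string.h>

/* Hypothetical validity check, stubbed so the sketch is self-contained;
   a real one would run the candidate through the token automaton. */
static int is_valid_lexeme(const char *s) { return strcmp(s, "if") == 0; }

/* Returns 1 and repairs lex in place if some single edit yields a lexeme. */
int repair_by_one_edit(char *lex) {
    size_t n = strlen(lex);
    char buf[64];
    for (size_t i = 0; i < n && n < sizeof buf; i++) {
        memcpy(buf, lex, i);                  /* delete character i    */
        memcpy(buf + i, lex + i + 1, n - i);  /* also copies the '\0'  */
        if (is_valid_lexeme(buf)) { strcpy(lex, buf); return 1; }
    }
    for (size_t i = 0; i + 1 < n && n < sizeof buf; i++) {
        strcpy(buf, lex);                     /* transpose i and i+1   */
        buf[i] = lex[i + 1];
        buf[i + 1] = lex[i];
        if (is_valid_lexeme(buf)) { strcpy(lex, buf); return 1; }
    }
    return 0;                                 /* no single edit works  */
}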

    2.3 EXPRESSING TOKENS BY REGULAR EXPRESSIONS

    Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. While we cannot express all possible patterns with them, they are very effective in specifying those types of patterns that we actually need for tokens.

    Strings and Languages

An alphabet is any finite set of symbols. Examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet.

A string (sentence or word) over an alphabet is a finite sequence of symbols drawn from that alphabet. The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

A language is any countable set of strings over some fixed alphabet. Abstract languages like Φ, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.

    Parts of Strings:

1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.


6. If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x.

    Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined in Table 2.2.

    Table 2.2: Definitions of operations on languages

Example: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. Other languages that can be constructed from the languages L and D:

1. L U D is the set of letters and digits - strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.

2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.

3. L4 is the set of all 4-letter strings.

4. L* is the set of all strings of letters, including ε, the empty string.

5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.

6. D+ is the set of all strings of one or more digits.

    Regular expression

A regular expression can be defined as a sequence of symbols and characters expressing a string or pattern to be searched for.

Regular expressions are a mathematical representation which describes the set of strings of a specific language.

The regular expression for identifiers can be written as letter_ ( letter_ | digit )*. The vertical bar means union, the parentheses are used to group subexpressions, and the star means "zero or more occurrences of".

Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions.

The rules that define the regular expressions over some alphabet Σ:

Basis rules:

1. ε is a regular expression, and L(ε) is {ε}.

2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one.

    Induction rules: Suppose r and s are regular expressions denoting languages L(r) and L(s),  respectively.

    1. (r) | (s) is a regular expression denoting the language L(r) U L(s).

    2. (r) (s) is a regular expression denoting the language L(r) L(s) .

3. (r)* is a regular expression denoting (L(r))*.

4. (r) is a regular expression denoting L(r); i.e., additional pairs of parentheses around an expression do not change its language.

Example: Let Σ = {a, b}.

Regular expression   Language                          Meaning
a|b                  {a, b}                            single 'a' or 'b'
(a|b)(a|b)           {aa, ab, ba, bb}                  all strings of length two over Σ
a*                   {ε, a, aa, aaa, ...}              all strings of zero or more a's
(a|b)*               {ε, a, b, aa, ab, ba, bb, ...}    all strings of zero or more instances of a or b
a|a*b                {a, b, ab, aab, aaab, ...}        the string a and all strings of zero or more a's ending in b

A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. For instance, (a|b) = (b|a), (a|b)* = (a*b*)*, (b|a)* = (a|b)*, (a|b)(b|a) = aa|ab|ba|bb.

    Algebraic laws

    Algebraic laws that hold for arbitrary regular expressions r, s, and t:

LAW                              DESCRIPTION
r|s = s|r                        | is commutative
r|(s|t) = (r|s)|t                | is associative
r(st) = (rs)t                    Concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr   Concatenation distributes over |
εr = rε = r                      ε is the identity for concatenation
r* = (r|ε)*                      ε is guaranteed in a closure
r** = r*                         * is idempotent

    Extensions of Regular Expressions

A few notational extensions that were first incorporated into Unix utilities such as Lex are particularly useful in the specification of lexical analyzers.

1. One or more instances: The unary, postfix operator + represents the positive closure of a regular expression and its language. If r is a regular expression, then (r)+ denotes the language (L(r))+. Two useful algebraic laws: r* = r+|ε and r+ = rr* = r*r.

2. Zero or one instance: The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r|ε, and L(r?) = L(r) ∪ {ε}.

3. Character classes: A regular expression a1|a2|...|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2...an]. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|...|z.

Example: Regular definition for C identifiers

letter_ → [A-Za-z_]
digit   → [0-9]
id      → letter_ ( letter_ | digit )*

Example: Regular definition for unsigned numbers

digit   → [0-9]
digits  → digit+
number  → digits ( . digits )? ( E [+-]? digits )?


Note: The operators *, +, and ? have the same precedence and associativity.


    2.4 CONVERTING REGULAR EXPRESSION TO DFA

To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each definition refers to the syntax tree for a particular augmented regular expression (r)#.

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null" or the empty string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n.

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n.

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there is some string x = a1a2...an in L((r)#) such that for some i, there is a way to explain the membership of x in L((r)#) by matching ai to position p of the syntax tree and ai+1 to position q.

We can compute nullable, firstpos, and lastpos by a straightforward recursion on the height of the tree. The basis and inductive rules for nullable and firstpos are summarized in a table.

The rules for lastpos are essentially the same as for firstpos, but the roles of children c1 and c2 must be swapped in the rule for a cat-node.

    There are only two ways to compute followpos.

1. If n is a cat-node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).

2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).

    Converting a Regular Expression Directly to a DFA

    Algorithm: Construction of a DFA from a regular expression r.

    INPUT: A regular expression r.

    OUTPUT: A DFA D that recognizes L(r).

    METHOD:

    1. Construct a syntax tree T from the augmented regular expression (r)#.

    2. Compute nullable, firstpos, lastpos, and followpos for T.

    3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D,


initialize Dstates to contain only the unmarked state firstpos(n0),

where n0 is the root of syntax tree T for (r)#;

    while ( there is an unmarked state S in Dstates )

    {

    mark S;

    for ( each input symbol a )

    {

    let U be the union of followpos(p) for all p in S that correspond to a;

    if ( U is not in Dstates )

    add U as an unmarked state to Dstates;

    Dtran[S, a] = U

    }

}

By the above procedure, the states of D are sets of positions in T. Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions. The start state of D is firstpos(n0), where node n0 is the root of T. The accepting states are those containing the position for the endmarker symbol #.

    Example: Construct a DFA for the regular expression r = (a|b)*abb

    Figure 2.2: Syntax tree for (a|b)*abb#

    Figure 2.3 : firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#


We must also apply rule 2 to the star-node. That rule tells us positions 1 and 2 are in both followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1,2}. The complete followpos sets are summarized in the table below.

NODE n    followpos(n)
1         {1, 2, 3}
2         {1, 2, 3}
3         {4}
4         {5}
5         {6}
6         {}

    Figure 2.4: Directed graph for the function followpos

nullable is true only for the star-node, and we exhibited firstpos and lastpos in Figure 2.3. The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D; call this set of states A. We must compute Dtran[A, a] and Dtran[A, b]. Among the positions of A, 1 and 3 correspond to a, while 2 corresponds to b. Thus, Dtran[A, a] = followpos(1) U followpos(3) = {1,2,3,4}, and Dtran[A, b] = followpos(2) = {1,2,3}.

    Figure 2.5: DFA constructed for (a|b)*abb#

The latter is state A, and so does not have to be added to Dstates, but the former, B = {1,2,3,4}, is new, so we add it to Dstates and proceed to compute its transitions. The complete DFA is shown in Figure 2.5.
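To make the procedure concrete, the following C sketch (not from the book) hard-codes the followpos table above as bitmasks and runs the marking loop, printing exactly the transitions of Figure 2.5:

#include <stdio.h>

/* Positions 1..6 of the augmented expression (a|b)*abb#:
   1:a  2:b  3:a  4:b  5:b  6:#
   followpos is taken from the table above, encoded as bitmasks
   (bit p-1 stands for position p). */
static const int  follow[7] = {0, 07, 07, 010, 020, 040, 0};
static const char sym[7]    = {0, 'a', 'b', 'a', 'b', 'b', '#'};

static void print_set(int s) {
    printf("{");
    for (int p = 1; p <= 6; p++)
        if (s >> (p - 1) & 1) printf(" %d", p);
    printf(" }");
}

int main(void) {
    int states[16], n = 0, marked = 0;
    states[n++] = 07;                /* start state = firstpos(root) = {1,2,3} */
    while (marked < n) {             /* while there is an unmarked state */
        int S = states[marked++];    /* mark S */
        for (const char *a = "ab"; *a; a++) {
            int U = 0;               /* union of followpos(p), p in S, sym[p]==a */
            for (int p = 1; p <= 6; p++)
                if ((S >> (p - 1) & 1) && sym[p] == *a) U |= follow[p];
            int seen = 0;
            for (int i = 0; i < n; i++) if (states[i] == U) seen = 1;
            if (!seen) states[n++] = U;   /* add U as an unmarked state */
            printf("Dtran["); print_set(S); printf(", %c] = ", *a);
            print_set(U); printf("\n");
        }
    }
    return 0;
}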

Example: Construct an NFA with ε-transitions for (a|b)*abb and convert it to a DFA by subset construction.

Figure 2.6: NFA with ε-transitions for (a|b)*abb


    Figure 2.7: NFA for (a|b)*abb

Figure 2.8: Result of applying the subset construction to Figure 2.6

    2.5 MINIMIZATION OF DFA

There can be many DFA's that recognize the same language. For instance, the DFAs of Figures 2.5 and 2.8 both recognize the same language L((a|b)*abb).

We would generally prefer a DFA with as few states as possible, since each state requires entries in the table that describes the lexical analyzer.

    Algorithm: Minimizing the number of states of a DFA.

INPUT: A DFA D with set of states S, input alphabet Σ, initial state s0, and set of accepting states F.

    OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.

    METHOD:

1. Start with an initial partition Π with two groups, F and S - F, the accepting and nonaccepting states of D.

2. Apply the following procedure to construct a new partition Πnew:

initially, let Πnew = Π;
for ( each group G of Π ) {
    partition G into subgroups such that two states s and t are in the same subgroup
    if and only if for all input symbols a, states s and t have transitions on a
    to states in the same group of Π;
    /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed;
}

3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with Πnew in place of Π.

4. Choose one state in each group of Πfinal as the representative for that group. The representatives will be the states of the minimum-state DFA D'.

5. The other components of D' are constructed as follows:

(a) The start state of D' is the representative of the group containing the start state of D.

(b) The accepting states of D' are the representatives of those groups that contain an accepting state of D.


(c) Let s be the representative of some group G of Πfinal, and let the transition of D from s on input a be to state t. Let r be the representative of t's group H. Then in D', there is a transition from s to r on input a.

    Example: Let us reconsider the DFA of Figure 2.8 for minimization.

    STATE a b

    A B C

    B B D

    C B C

    D B E

    (E) B C

The initial partition consists of the two groups {A, B, C, D} {E}, which are respectively the nonaccepting states and the accepting states.

To construct Πnew, the procedure considers both groups and inputs a and b. The group {E} cannot be split, because it has only one state, so {E} will remain intact in Πnew.

The other group {A, B, C, D} can be split, so we must consider the effect of each input symbol. On input a, each of these states goes to state B, so there is no way to distinguish these states using strings that begin with a. On input b, states A, B, and C go to members of group {A, B, C, D}, while state D goes to E, a member of another group.

Thus, in Πnew, group {A, B, C, D} is split into {A, B, C}{D}, and Πnew for this round is {A, B, C}{D}{E}.

In the next round, we can split {A, B, C} into {A, C}{B}, since A and C each go to a member of {A, B, C} on input b, while B goes to a member of another group, {D}. Thus, after the second round, Πnew = {A, C} {B} {D} {E}.

For the third round, we cannot split the one remaining group with more than one state, since A and C each go to the same state (and therefore to the same group) on each input. We conclude that Πfinal = {A, C}{B}{D}{E}.

Now, we shall construct the minimum-state DFA. It has four states, corresponding to the four groups of Πfinal, and let us pick A, B, D, and E as the representatives of these groups. The initial state is A, and the only accepting state is E.

Table: Transition table of the minimum-state DFA

STATE a b

A B A

B B D

D B E

(E) B A

    2.6 LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS-LEX

There is a wide range of tools for constructing lexical analyzers based on regular expressions. Lex is a tool (computer program) that generates lexical analyzers.

Lex generates a lexical analyzer from regular expressions that describe the patterns for tokens. The input notation is referred to as the Lex language, and the tool itself is the Lex compiler.

Use of Lex

The Lex compiler transforms the input patterns into a transition diagram and generates code.


     

An input file “lex.l” is written in the Lex language and describes the lexical analyzer to be generated. The Lex compiler transforms “lex.l” to a C program, in a file that is always named “lex.yy.c”.

The file “lex.yy.c” is compiled by the C compiler and converted into a file “a.out”. The C compiler's output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.

The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is shared between the lexical analyzer and the parser.

    Figure 2.9: Creating a lexical analyzer with Lex

    Structure of Lex Programs

    A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions

The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions.

The translation rules of a Lex program have the form Pattern { Action }:

Pattern P1 { Action A1 }
Pattern P2 { Action A2 }
…
Pattern Pn { Action An }

Each pattern is a regular expression. The actions are fragments of code, typically written in the C language.

The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.

The lexical analyzer begins reading its remaining input, one character at a time, until it finds the longest prefix of the input that matches one of the patterns Pi. It then executes the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared integer variable yylval to pass additional information about the lexeme found.
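As an illustration of the three-section form just described, a small lex.l might look as follows. This is a sketch, not an example from the book: the token codes NUMBER and ID are assumed constants (in practice they would come from the parser), and yylval is declared locally here for the same reason.

%{
#include <stdlib.h>
#define NUMBER 256   /* assumed token codes; normally supplied by the parser */
#define ID     257
int yylval;          /* attribute value shared with the parser */
%}

digit   [0-9]
letter  [a-zA-Z]

%%
[ \t\n]+                      { /* skip whitespace: no return, keep scanning */ }
{digit}+                      { yylval = atoi(yytext); return NUMBER; }
{letter}({letter}|{digit})*   { return ID; }
.                             { return yytext[0]; /* single-character token */ }
%%

int yywrap(void) { return 1; }  /* no further input files */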


    2.7 DESIGN OF LEXICAL ANALYZER FOR A SAMPLE LANGUAGE

A lexical-analyzer generator such as Lex is architected around an automaton simulator. The implementation of the Lex compiler can be based on either an NFA or a DFA.

    2.7.1 The Structure of the Generated Analyzer

Figure 2.10 shows the architecture of a lexical analyzer generated by Lex. A Lex program is converted into a transition table and actions, which are used by a finite-automaton simulator.

The program that serves as the lexical analyzer includes a fixed program that simulates an automaton; the automaton is deterministic or nondeterministic. The rest of the lexical analyzer consists of components that are created from the Lex program by Lex itself.

    Figure 2.10: A Lex program is turned into a transition table and actions, which are used by a finite-automaton simulator

    These components are:

    1. A transition table for the automaton.

    2. Those functions that are passed directly through Lex to the output.

3. The actions from the input program, which appear as fragments of code to be invoked at the appropriate time by the automaton simulator.

    2.7.2 Pattern Matching Based on NFA's

To construct the automaton for several regular expressions, we combine all the NFA's into one by introducing a new start state with ε-transitions to each of the start states of the NFA's Ni for pattern pi, as shown in Figure 2.11.

    Figure 2.11: An NFA constructed from a Lex program

Example: Consider the patterns


a    { action A1 for pattern p1 }

abb  { action A2 for pattern p2 }

a*b+ { action A3 for pattern p3 }

    Figure 2.12: NFA's for a, abb, and a*b+

    Figure 2.13: Combined NFA

    Figure 2.14: Sequence of sets of states entered when processing input aaba

    Figure 2.15: Transition graph for DFA handling the patterns a, abb, and a*b+
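The sequence of state sets in Figure 2.14 can be reproduced by simulating the combined NFA directly. The C sketch below is a minimal illustration with an assumed state numbering (1-2 for pattern a, 3-6 for abb, 7-8 for a*b+, matching the shape of the combined NFA in Figure 2.13); it prints the set reached after each character of the input aaba.

#include <stdio.h>
#include <string.h>

enum { NSTATES = 9 };

struct edge { int from; char on; int to; };

static const struct edge edges[] = {      /* non-epsilon transitions */
    {1, 'a', 2},                          /* pattern a    */
    {3, 'a', 4}, {4, 'b', 5}, {5, 'b', 6},/* pattern abb  */
    {7, 'a', 7}, {7, 'b', 8}, {8, 'b', 8},/* pattern a*b+ */
};
static const int nedges = sizeof edges / sizeof edges[0];

static void show(const int set[])
{
    printf("{");
    for (int s = 0; s < NSTATES; s++)
        if (set[s]) printf(" %d", s);
    printf(" }\n");
}

int main(void)
{
    const char *input = "aaba";
    int cur[NSTATES] = {0}, nxt[NSTATES];

    cur[1] = cur[3] = cur[7] = 1;  /* epsilon-closure of the new start state */
    show(cur);                     /* {1,3,7} */
    for (int i = 0; input[i]; i++) {
        memset(nxt, 0, sizeof nxt);
        for (int e = 0; e < nedges; e++)
            if (cur[edges[e].from] && edges[e].on == input[i])
                nxt[edges[e].to] = 1;
        memcpy(cur, nxt, sizeof cur);
        show(cur);                 /* {2,4,7}, {7}, {8}, then {} */
    }
    return 0;
}

After aab the simulator is in {8}, an accepting state of a*b+, and the next a leads to the empty set; so the longest matched prefix is aab for pattern a*b+, as in Figure 2.14.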


    UNIT III SYNTAX ANALYSIS

    3.1 NEED AND ROLE OF THE PARSER

The parser takes the tokens produced by lexical analysis and builds the syntax tree (parse tree). The syntax tree can be easily constructed from a context-free grammar. The parser reports syntax errors in an intelligible fashion and recovers from commonly occurring errors to continue processing the remainder of the program.

[Figure: the source program passes through the Lexical Analyzer, which supplies tokens to the Parser on demand (get next token); the Parser builds the parse tree, which the rest of the front end turns into an intermediate representation; both phases consult the Symbol Table.]

    Figure 3.1: Position of parser in compiler model

    Role of the Parser:

     

Parser builds the parse tree.

Parser performs context-free syntax analysis.

Parser helps to construct intermediate code.

Parser produces appropriate error messages.

Parser attempts to correct a few errors.

    Types of parsers for grammars:

Universal parsers

Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar. These general methods are too inefficient to use in production, so they are not commonly used in compilers.

Top-down parsers

Top-down methods build parse trees from the top (root) to the bottom (leaves).

Bottom-up parsers

Bottom-up methods start from the leaves and work their way up to the root.

    3.2 CONTEXT FREE GRAMMARS

    3.2.1 The Formal Definition of a Context-Free Grammar

A context-free grammar G is defined by the 4-tuple G = (V, T, P, S) where

1. V is a finite set of nonterminals (variables).

2. T is a finite set of terminals.

3. P is a finite set of production rules of the form A → α, where A is a nonterminal and α is a string of terminals and/or nonterminals. P is a relation from V to (V ∪ T)*.


4. S is the start symbol (a variable, S ∈ V).
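As a simple illustration (not part of the formal definition), the 4-tuple can be represented directly in a program. In the C sketch below the names production, P, and START are invented; uppercase letters stand for nonterminals (V) and every other character is a terminal (T):

#include <stdio.h>

struct production {
    char head;          /* a nonterminal, i.e., a member of V */
    const char *body;   /* a string over V and T; "" would encode epsilon */
};

static const struct production P[] = {
    {'E', "E+T"}, {'E', "T"},     /* E -> E + T | T        */
    {'T', "T*F"}, {'T', "F"},     /* T -> T * F | F        */
    {'F', "(E)"}, {'F', "i"},     /* F -> ( E ) | id ('i') */
};
static const char START = 'E';    /* S, the start symbol */

int main(void)
{
    for (int i = 0; i < (int)(sizeof P / sizeof P[0]); i++)
        printf("%c -> %s\n", P[i].head, P[i].body);
    printf("start symbol: %c\n", START);
    return 0;
}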


Example 3.1: The following grammar defines simple arithmetic expressions. In this grammar, the terminal symbols are id + - * / ( ). The nonterminal symbols are expression, term and factor, and expression is the start symbol.

expression → expression + term

expression → expression - term

expression → term

term → term * factor

term → term / factor

term → factor

factor → ( expression )

factor → id

3.2.2 Notational Conventions

The following notational conventions for grammars can be used.

    1. These symbols are terminals:

(a) Lowercase letters early in the alphabet, such as a, b, c.

    (b) Operator symbols such as +,*, and so on.

    (c) Punctuation symbols such as parentheses, comma, and so on.

    (d) The digits 0,1,. . . ,9.

    (e) Boldface strings such as id or if , each of which represents a single terminal symbol.

    2. These symbols are nonterminals:

    (a) Uppercase letters early in the alphabet, such as A, B, C.

(b) The letter S, which, when it appears, is usually the start symbol.

(c) Lowercase, italic names such as expr or stmt.

(d) When discussing programming constructs, uppercase letters may be used to represent nonterminals for the constructs. For example, nonterminals for expressions, terms, and factors are often represented by E, T, and F, respectively.

3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.

4. Lowercase letters late in the alphabet, chiefly u, v, …, z, represent (possibly empty) strings of terminals.

5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A → α, where A is the head and α the body.

6. A set of productions A → α1, A → α2, …, A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | … | αk; call α1, α2, …, αk the alternatives for A.

    7. Unless stated otherwise, the head of the first production is the start symbol.

    Example 3.2 : Using these conventions, the grammar of Example 3.1 can be rewritten concisely as

E → E + T | E - T | T

T → T * F | T / F | F

F → ( E ) | id


    3.2.3 Derivations

A derivation uses productions to generate a string (a sequence of terminals). The derivation is formed by repeatedly replacing a nonterminal by the body of one of its productions.

The derivations are classified into two types based on the order in which nonterminals are replaced. They are:

1. Leftmost derivation

If the leftmost nonterminal is replaced by its production at each step of the derivation, then it is called a leftmost derivation.

2. Rightmost derivation

If the rightmost nonterminal is replaced by its production at each step of the derivation, then it is called a rightmost derivation.

    Example 3.3: LMD and RMD for example 3.2

LMD for - ( id + id ):

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )

RMD for - ( id + id ):

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( E + id ) ⇒ - ( id + id )

Example 3.4: Consider the context-free grammar (CFG) G = ({S}, {a, b, c}, P, S) where P = {S → SbS | ScS | a}. Derive the string “abaca” by leftmost derivation and rightmost derivation.

Leftmost derivation for “abaca”:

S ⇒ SbS (using rule S → SbS)
  ⇒ abS (using rule S → a)
  ⇒ abScS (using rule S → ScS)
  ⇒ abacS (using rule S → a)
  ⇒ abaca (using rule S → a)

Rightmost derivation for “abaca”:

S ⇒ ScS (using rule S → ScS)
  ⇒ Sca (using rule S → a)
  ⇒ SbSca (using rule S → SbS)
  ⇒ Sbaca (using rule S → a)
  ⇒ abaca (using rule S → a)

3.2.4 Parse Trees and Derivations

A parse tree is a graphical representation of a derivation. It is a convenient way to see how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree.

Example 3.5: Construction of a parse tree for - ( id + id ).

Derivation:

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )

    Parse tree:


    Figure 3.2: Parse tree for -(id + id)

    3.2.5 Ambiguity

A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence.

A grammar G is said to be ambiguous if it has more than one parse tree, either in LMD or in RMD, for at least one string.

Example 3.6: The arithmetic expression grammar E → E + E | E * E | ( E ) | id permits two distinct leftmost derivations for the sentence id + id * id:

(1) E ⇒ E + E              (2) E ⇒ E * E
       ⇒ id + E                   ⇒ E + E * E
       ⇒ id + E * E               ⇒ id + E * E
       ⇒ id + id * E              ⇒ id + id * E
       ⇒ id + id * id             ⇒ id + id * id


    Figure 3.3: Two parse trees for id+id*id


    3.2.6 Verifying the Language Generated by a Grammar

A proof that a grammar G generates a language L has two parts: show that every string generated by G is in L, and conversely that every string in L can indeed be generated by G.

Example 3.7: Consider the grammar S → ( S ) S | ε. This simple grammar generates all strings of balanced parentheses. To show that every sentence derivable from S is balanced, we use an inductive proof on the number of steps n in a derivation.

BASIS: The basis is n = 1. The only string of terminals derivable from S in one step is the empty string, which surely is balanced.

INDUCTION: Now assume that all derivations of fewer than n steps produce balanced sentences, and consider a leftmost derivation of exactly n steps. Such a derivation must be of the form

S ⇒ ( S ) S ⇒* ( x ) S ⇒* ( x ) y

The derivations of x and y from S take fewer than n steps, so by the inductive hypothesis x and y are balanced. Therefore, the string (x)y must be balanced.

That is, it has an equal number of left and right parentheses, and every prefix has at least as many left parentheses as right parentheses.

Having thus shown that any string derivable from S is balanced, we must next show that every balanced string is derivable from S. To do so, we use induction on the length of the string.

BASIS: If the string is of length 0, it must be ε, which is balanced.

INDUCTION: First, observe that every balanced string has even length. Assume that every balanced string of length less than 2n is derivable from S, and consider a balanced string w of length 2n, n ≥ 1. Surely w begins with a left parenthesis. Let (x) be the shortest nonempty prefix of w having an equal number of left and right parentheses. Then w can be written as w = (x)y where both x and y are balanced. Since x and y are of length less than 2n, they are derivable from S by the inductive hypothesis. Thus, we can find a derivation of the form

S ⇒ ( S ) S ⇒* ( x ) S ⇒* ( x ) y

which proves that w = (x)y is also derivable from S.

3.2.7 Context-Free Grammars versus Regular Expressions

Every regular language is a context-free language, but not vice versa.

Example 3.8: The following grammar describes the language of the regular expression (a|b)*abb:

A → aA | bA | aB

B → bC

C → b

The grammar and the regular expression describe the same language, the set of strings of a's and b's ending in abb, so we can easily recognize this language either by a finite automaton or by a PDA.

On the other hand, the language L = {aⁿbⁿ | n ≥ 1}, with an equal number of a's and b's, is a prototypical example of a language that can be described by a grammar but not by a regular expression. We can say that "finite automata cannot count", meaning that a finite automaton cannot accept a language like {aⁿbⁿ | n ≥ 1} that would require it to keep count of the number of a's before it sees the b's. These kinds of languages (context-free languages) are accepted by PDA's, as a PDA uses a stack as its memory.


    3.2.8 Left recursion

A context-free grammar is said to be left recursive if it has a nonterminal A with two productions of the following form:

A → A α | β

where α and β are sequences of terminals and nonterminals that do not start with A.

Left recursion in top-down parsing can lead to an infinite loop. It creates serious problems, so we have to avoid left recursion.

For example, in expr → expr + term | term

    Figure 3.4: Left-recursive and right recursive ways of generating a string

    ALGORITHM 3.1 Eliminating left recursion.

INPUT: Grammar G with no cycles or ε-productions.

    OUTPUT: An equivalent grammar with no left recursion.

METHOD: Apply the algorithm to G. Note that the resulting non-left-recursive grammar may have ε-productions.

arrange the nonterminals in some order A1, A2, …, An;
for ( each i from 1 to n ) {
    for ( each j from 1 to i - 1 ) {
        replace each production of the form Ai → Aj γ by the
        productions Ai → δ1 γ | δ2 γ | … | δk γ, where
        Aj → δ1 | δ2 | … | δk are all current Aj-productions
    }
    eliminate the immediate left recursion among the Ai-productions
}

Note: Simply modify the left recursive productions A → A α | β to

A → β A'
A' → α A' | ε

    Example 3.9: Consider the grammar for arithmetic expressions.

E → E + T | T

T → T * F | F


F → ( E ) | id


Eliminate the left recursive productions for E and T by applying the left-recursion elimination algorithm.

If A → A α | β then

A → β A'
A' → α A' | ε

The production E → E + T | T is replaced by

E → T E'
E' → + T E' | ε

The production T → T * F | F is replaced by

T → F T'
T' → * F T' | ε

Therefore, finally we obtain

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

Example 3.10: Consider the following grammar; eliminate the left recursive productions.

S → A a | b
A → A c | S d | ε

There is no immediate left recursion in S. To expose the left recursion, substitute the S-productions in A:

A → A c | A a d | b d | ε

This is replaced by

A → b d A' | A'
A' → c A' | a d A' | ε

Therefore, finally we obtain the grammar without left recursion:

S → A a | b
A → b d A' | A'
A' → c A' | a d A' | ε

    Example 3.11: Consider the grammar

A → A B d | A a | a

B → B e | b

    The grammar without left recursion is

A → a A'
A' → B d A' | a A' | ε
B → b B'
B' → e B' | ε

Example 3.12: Eliminate left recursion from the given grammar: A → A c | A a d | b d | b c.

After removing left recursion, the grammar becomes

A → b d A' | b c A'
A' → c A' | a d A' | ε


    3.2.9 Left factoring

Left factoring is the process of factoring out the common prefixes of two or more production alternatives for the same nonterminal.

Algorithm 3.2: Left factoring a grammar.

INPUT: Grammar G.

    OUTPUT: An equivalent left-factored grammar.

METHOD: For each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε - i.e., there is a nontrivial common prefix - replace all of the A-productions A → α β1 | α β2 | … | α βn | γ, where γ represents all alternatives that do not begin with α, by

A → α A' | γ
A' → β1 | β2 | … | βn

Here A' is a new nonterminal. Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix.

Example 3.13: Eliminate left factors from the given grammar: S → T + S | T.

    After left factoring, the grammar becomes,

S → T L
L → + S | ε

Example 3.14: Left factor the following grammar: S → i E t S | i E t S e S | a; E → b.

    After left factoring, the grammar becomes,

S → i E t S S' | a
S' → e S | ε
E → b

    Uses:

Left factoring is used in the predictive top-down parsing technique.


    3.3 TOP DOWN PARSING -GENERAL STRATEGIES

Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first). Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string.

Parsers are generally distinguished by whether they work top-down (start with the grammar's start symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that form the leaves of the parse tree and build the tree from the bottom). Top-down parsers include recursive-descent and LL parsers, while the most common forms of bottom-up parsers are LR parsers.

[Figure: Types of parsers. Top-down parsers divide into backtracking parsers (recursive descent) and predictive parsers (LL(1)); bottom-up parsers divide into shift-reduce parsers and LR parsers (SLR, LALR, and canonical LR).]

    Figure 3.5: Types of parser

    Example 3.15 : The sequence of parse trees for the input id+id*id in a top-down parse (LMD).

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id


    Figure 3.6: Top-down parse for id + id * id


    3.4 RECURSIVE DESCENT PARSER

These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which production to apply for its nonterminal. Terminals in the body of the production are matched to the input at the appropriate time, while nonterminals in the body result in calls to their procedures. Backtracking, in the case when the wrong production was chosen, is a possibility.

void A()
{
    Choose an A-production, A → X1 X2 … Xk;
    for ( i = 1 to k )
    {
        if ( Xi is a nonterminal )
            call procedure Xi();
        else if ( Xi equals the current input symbol a )
            advance the input to the next symbol;
        else /* an error has occurred */;
    }
}

Example 3.16: Consider the grammar

S → c A d
A → a b | a

To construct a parse tree top-down for the input string w = cad, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree of Figure 3.7(a). The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A.

Now, we expand A using the first alternative A → a b to obtain the tree of Figure 3.7(b). We have a match for the second input symbol, a, so we advance the input pointer to d, the third input symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and go back to A to see whether there is another alternative for A that has not been tried, but that might produce a match.


    Figure 3.7: Steps in a top-down parse

The second alternative for A produces the tree of Figure 3.7(c). The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have produced a parse tree for w, we halt and announce successful completion of parsing.
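The trace above can be turned into a small working recursive-descent parser with backtracking. The C sketch below is an illustration (helper names such as match are invented) for the grammar S → c A d, A → a b | a on the input w = cad:

#include <stdio.h>

static const char *input;   /* the string being parsed */
static int pos;             /* current input position */

static int match(char c)    /* consume c if it is the next input symbol */
{
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void)
{
    int save = pos;
    if (match('a') && match('b')) return 1;   /* try A -> a b first */
    pos = save;                               /* backtrack on failure */
    return match('a');                        /* then try A -> a */
}

static int S(void)
{
    return match('c') && A() && match('d');
}

int main(void)
{
    input = "cad";
    pos = 0;
    if (S() && input[pos] == '\0')   /* the whole input must be consumed */
        printf("parse of %s succeeded\n", input);
    else
        printf("parse of %s failed\n", input);
    return 0;
}

Restoring pos before trying the next alternative of A is exactly the backtracking step described in the example: the failed attempt with A → a b is undone before A → a is tried.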


    3.5 PREDICTIVE PARSER (NON RECURSIVE)

A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that

S ⇒*lm w α

The table-driven parser in Figure 3.8 has an input buffer, a stack containing a sequence of grammar symbols, a parsing table, and an output stream. The input buffer contains the string to be parsed, followed by the endmarker $. We reuse the symbol $ to mark the bottom of the stack, which initially contains the start symbol of the grammar on top of $.

The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the current input symbol. If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a] of the parsing table M. Otherwise, it checks for a match between the terminal X and the current input symbol a.

[Figure: the input buffer (e.g., a + b $) feeds a predictive parsing program, which consults a stack of grammar symbols (with $ at the bottom) and the parsing table M, and emits output.]

    Figure 3.8: Model of a table-driven predictive parser

    Algorithm 3.3 : Table-driven predictive parsing. 

    INPUT: A string w and a parsing table M  for grammar G.

    OUTPUT: If  w is in  L(G), a leftmost derivation of  w; otherwise, an error indication.

METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start symbol S of G on top of the stack, above $. The following procedure uses the predictive parsing table M to produce a predictive parse for the input.

set ip to point to the first symbol of w;
set X to the top stack symbol;
while ( X ≠ $ ) { /* stack is not empty */
    if ( X is a ) pop the stack and advance ip;
    else if ( X is a terminal ) error();
    else if ( M[X, a] is an error entry ) error();
    else if ( M[X, a] = X → Y1 Y2 … Yk ) {
        output the production X → Y1 Y2 … Yk;
        pop the stack;
        push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
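A direct C rendering of this procedure is sketched below for the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id (used again in Example 3.17). This is an illustration, not the book's code: E', T', and id are encoded as the single characters e, t, and i, and the function table plays the role of M.

#include <stdio.h>
#include <string.h>

/* M[X, a]: body of the production to apply, or NULL for an error entry */
static const char *table(char X, char a)
{
    switch (X) {
    case 'E': if (a == 'i' || a == '(') return "Te";              break;
    case 'e': if (a == '+')             return "+Te";
              if (a == ')' || a == '$') return "";                break;
    case 'T': if (a == 'i' || a == '(') return "Ft";              break;
    case 't': if (a == '*')             return "*Ft";
              if (a == '+' || a == ')' || a == '$') return "";    break;
    case 'F': if (a == 'i')             return "i";
              if (a == '(')             return "(E)";             break;
    }
    return NULL;
}

int main(void)
{
    const char *w = "i+i*i$";          /* id + id * id, with endmarker $ */
    char stack[64] = "$E";             /* bottom marker $, start symbol E on top */
    int top = 1, ip = 0;

    while (top > 0) {                  /* i.e., while X != $ */
        char X = stack[top], a = w[ip];
        if (X == a) {                  /* X is a terminal that matches a */
            top--; ip++;
        } else if (!strchr("EeTtF", X)) {
            printf("error\n"); return 1;
        } else {
            const char *body = table(X, a);
            if (body == NULL) { printf("error\n"); return 1; }
            printf("%c -> %s\n", X, *body ? body : "epsilon");
            top--;                     /* pop X and push the body reversed */
            for (int k = (int)strlen(body) - 1; k >= 0; k--)
                stack[++top] = body[k];
        }
    }
    printf(w[ip] == '$' ? "accepted\n" : "error\n");
    return 0;
}

Running it prints each production as it is applied, which is exactly the leftmost derivation traced in Example 3.17.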


Example 3.17: Consider the following grammar and parse the input id + id * id using the nonrecursive predictive parser.

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
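The moves in Figure 3.9 are driven by the predictive parsing table M for this grammar. The table below is reconstructed here from the standard FIRST/FOLLOW construction (it is not part of the original example); blank entries denote errors.

NONTERMINAL    id          +            *            (           )           $
E              E → TE'                               E → TE'
E'                         E' → +TE'                             E' → ε      E' → ε
T              T → FT'                               T → FT'
T'                         T' → ε       T' → *FT'                T' → ε      T' → ε
F              F → id                                F → (E)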

E ⇒ TE' ⇒ FT'E' ⇒ idT'E' ⇒ idE' ⇒ id+TE' ⇒ id+FT'E' ⇒ id+idT'E' ⇒ id+id*FT'E' ⇒ id+id*idT'E' ⇒ id+id*idE' ⇒ id+id*id

    Figure 3.9: Moves made by a predictive parser on input id + id * id

    3.6 LL(1) PARSER

A grammar such that it is possible to choose the correct production with which to expand a given nonterminal, looking only at the next input symbol, is called LL(1). These grammars allow us to construct a predictive parsing table that gives, for each nonterminal and each lookahead symbol, the correct choice of production. Error correction can be facilitated by placing error routines in some or all of the table entries that have no legitimate production.

    LL(1) Grammars

Predictive parsers (recursive-descent parsers needing no backtracking) can be constructed for a class of grammars called LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action decisions.


    Transition Diagrams for Predictive Parsers

Transition diagrams are useful for visualizing predictive parsers. To construct the transition diagram from a grammar, first eliminate left recursion and then left factor the grammar. Then, for each nonterminal A,

    1. Create an initial and final (return) state.

2. For each production A → X1 X2 … Xk, create a path from the initial to the final state, with edges labeled X1, X2, …, Xk. If A → ε, the path is an edge labeled ε.

A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a.

2. At most one of α and β can derive the empty string.

3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).

In computing FOLLOW sets: if there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).