3.1 3. Phase 2 : Syntax Analysis Part I Parsing. The symbol table. The abstract syntax tree. Representing statements & expressions. Recursive descent parsing. A bit of theory.


3.1

3. Phase 2 : Syntax Analysis Part I

• Parsing.

• The symbol table.

• The abstract syntax tree.

• Representing statements & expressions.

• Recursive descent parsing.

• A bit of theory.

3.2

Parsing

• Given a grammar and a particular sequence of symbols, either :

– Find how that sequence might be derived from the sentence symbol by using the productions of the grammar.

– Or, show that the sequence could not be derived from the sentence symbol and therefore is not a sentence.

• In programming language terms, for ‘sentence’ read ‘source program’ and for ‘sequence’ read ‘contents of the file being parsed’.

• The parser reads the file and checks that the contents could be derived from the production rules. If not, it produces compile-time error messages.

3.3

Parsing II

• If the source program is syntactically correct then parsing produces :

– Symbol table (ST) from the declarations.

– Abstract syntax tree (AST) from the statements.

• If the source program is syntactically incorrect then parsing produces :

– Error messages.

• The ST and AST data structures and the error messages are declared in syner.h in the phase 2 unit directory :

/usr/users/staff/aosc/cm049icp/phase2

3.4

The Symbol Table

• A data structure recording all the user defined symbols (i.e. identifiers) declared in the program.

• Either a variable sized array or a linked list.

– Built-in C++ arrays, including dynamic arrays allocated with new[], cannot change size once created, so the course code uses a linked list. (The standard library's std::vector can grow, but it is not used here.)

• Each (user defined) identifier is stored in a struct with elements for its name, type and other things as desired.

• Used to determine what identifiers refer to during syntax analysis and code generation.

3.5

The Symbol Table II

• For a real language the ST is a sequence of frames, one per subprogram.

• Frame : Conceptually, a sequence of entries, one per variable / constant.

• C-- has no subprograms so the ST is simply a sequence of entries.

• Example C-- source :

    const string s = "Hello" ;
    int n = 0 ;
    bool b ;
    {
      ...
    }

3.6

Symbol Table Entries

• Schematic diagram :

Stored in reverse of declaration order.

    ident : "b"
    type : BOOLDATA
    constFlag : false
    initialise : NULL
    address : 0
    next : -> entry for n

    ident : "n"
    type : INTDATA
    constFlag : false
    initialise : -> 0
    address : 0
    next : -> entry for s

    ident : "s"
    type : STRINGDATA
    constFlag : true
    initialise : -> "Hello"
    address : 0
    next : NULL

3.7

C++ Data Structures For The Symbol Table

• In syner.h :

    enum DataType { VOIDDATA, BOOLDATA, STRINGDATA, INTDATA } ;

    struct SymTab
    {
      string ident ;
      DataType type ;
      bool constflag ;
      Factor *initialise ;
      int address ;
      SymTab *next ;
    } ;

    struct Factor
    {
      ... // Later.
    } ;
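A minimal sketch of how entries might be added to and looked up in such a list. The helper names addSymbol and findSymbol are assumptions made for illustration and are not taken from syner.h; Factor is only forward declared because its definition comes later.

```cpp
#include <cassert>
#include <string>
using std::string;

enum DataType { VOIDDATA, BOOLDATA, STRINGDATA, INTDATA };

struct Factor;  // Defined later in the lecture; only pointed to here.

struct SymTab {
    string ident;
    DataType type;
    bool constflag;
    Factor *initialise;
    int address;
    SymTab *next;
};

// Hypothetical helper: push a new entry on the front of the list, so the
// list ends up in reverse of declaration order. Returns the new head.
SymTab *addSymbol(SymTab *head, const string &ident, DataType type, bool isConst) {
    assert(!ident.empty());  // every entry needs a name
    return new SymTab{ident, type, isConst, nullptr, 0, head};
}

// Hypothetical helper: linear search from the head of the list.
// Returns nullptr when the identifier has not been declared.
SymTab *findSymbol(SymTab *head, const string &ident) {
    for (SymTab *p = head; p != nullptr; p = p->next)
        if (p->ident == ident) return p;
    return nullptr;
}
```

Calling addSymbol for s, n and b in declaration order leaves b at the head of the list, which matches the reverse ordering shown in the schematic diagram. A failed findSymbol is the point at which a parser could report an undeclared identifier.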

3.8

The Abstract Syntax Tree

• An abstract representation of the program.

• An n-ary tree.

• Example C-- source :

    if (b > 6) { x = 0 ; } else { y = x ; } ;

• Of course it’s nowhere near as simple as this diagram suggests.

                if
          /      |      \
         >       =       =
        / \     / \     / \
       b   6   x   0   y   x

3.9

C++ Data Structures For The Abstract Syntax Tree

• In syner.h :

    enum StatType { NULLST, ASSIGNST, IFST, WHILEST, CINST, COUTST } ;

    struct AST
    {
      StatType tag ;
      AssignSt *assignst ;
      IfSt *ifst ;
      WhileSt *whilest ;
      CinSt *cinst ;
      CoutSt *coutst ;
      AST *next ;
    } ;

3.10

Schematic Diagram Of An Abstract Syntax Tree

• Example C-- program :

    void main()
    {
      x = 23 + 4 ;
      cout << x ;
    }

Built in reverse order. Must then be reversed.

    tag : ASSIGNST
    assignst : -> AST for x = 23 + 4
    next : -> COUTST node

    tag : COUTST
    coutst : -> AST for cout << x
    next : NULL

Unused fields set to NULL.
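Since the statement list is built back to front, it has to be reversed before use. One way this might be done is the standard in-place reversal of a singly linked list, sketched below on a cut-down AST node that keeps only the tag and next fields. The function name reverseStats is an assumption, not a name from syner.h.

```cpp
#include <cassert>

enum StatType { NULLST, ASSIGNST, IFST, WHILEST, CINST, COUTST };

// Cut-down AST node: only the tag and the link matter for reversal.
struct AST {
    StatType tag;
    AST *next;
};

// Hypothetical helper: reverse the list in place and return the new head.
AST *reverseStats(AST *head) {
    AST *reversed = nullptr;
    while (head != nullptr) {
        AST *rest = head->next;   // remember the remainder of the list
        head->next = reversed;    // move the current node to the front of the result
        reversed = head;
        head = rest;
    }
    return reversed;
}
```

The reversal visits each node once and allocates nothing, so it is cheap enough to run once after the whole statement sequence has been parsed.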

3.11

Assignment Statements

• Example C-- source :

x = y + z * 34 ;

• C++ data structure :

    struct AssignSt
    {
      SymTab *target ;
      Expression *expr ;
    } ;

• The Expression data structure is quite complicated.

– Must handle arbitrary expressions.

– Later.

    target : -> ST entry for x
    expr : -> AST for y + z * 34

3.12

if Statements

• Example C-- source :

    if (a > b)
    {
      cout << a ;
    }
    else
    {
      cout << b ;
    } ;

• C++ data structure :

    struct IfSt
    {
      Expression *condition ;
      AST *thenstats ;
      AST *elsestats ;
      int elselabel ;
      int endlabel ;
    } ;

The { and } are obligatory in C--. So is the ;.

elselabel and endlabel are used in code generation.

3.13

if Statements II

• Schematic diagram :

    condition : -> AST for a > b
    thenstats : -> AST for cout << a
    elsestats : -> AST for cout << b
    elselabel : 0
    endlabel : 0

3.14

while Statements

• Example C-- source :

    while (a > 0)
    {
      a = a - 1 ;
      cout << a ;
    } ;

• C++ data structure :

    struct WhileSt
    {
      Expression *condition ;
      AST *stats ;
      int startlabel ;
      int endlabel ;
    } ;

The { and } are obligatory in C--. So is the ;.

startlabel and endlabel are used in code generation.

3.15

while Statements II

• Schematic diagram :

    condition : -> AST for a > 0
    stats : -> AST for a = a - 1 ; cout << a ;
    startlabel : 0
    endlabel : 0

3.16

cin Statements

• Example C-- source :

cin >> a ;

• C++ data structure :

    struct CinSt
    {
      SymTab *invar ;
    } ;

• Schematic diagram :

a must be an int variable in C--.

    invar : -> Symbol table entry for a.

3.17

cout Statements

• Example C-- source :

    cout << a ;

• C++ data structure :

    struct CoutSt
    {
      SymTab *outvar ;
    } ;

• Schematic diagram :

a must be an int variable or constant or a string constant in C--.

    outvar : -> Symbol table entry for a.

3.18

Expressions

• C-- syntax :

    expression ::= basic_exp [ rel_op basic_exp ]
    basic_exp  ::= term { add_op basic_exp }
    term       ::= factor { mul_op term }
    factor     ::= literal | identifier | ‘(‘ expression ‘)’ | ‘!’ factor

• Example expressions :

    22 + 4 < b
    z != y
    a
    !true

3.19

Expressions II

• In syner.h :

    struct Expression
    {
      BasicExp *be1 ;
      string relOp ;
      BasicExp *be2 ;
    } ;

    struct BasicExp
    {
      Term *term ;
      string addOp ;
      BasicExp *bexp ;
    } ;

    struct Term
    {
      Factor *fact ;
      string mulOp ;
      Term *term ;
    } ;

Unused fields set to "" or NULL as appropriate.

3.20

Factors

• In syner.h :

    struct Factor
    {
      bool literal ;        // Tag field
      DataType type ;       // Type field
      string litBool ;      // Boolean literal
      string litString ;    // String literal
      int litInt ;          // Integer literal
      SymTab *ident ;       // Identifier
      Expression *bexp ;    // Bracket expression
      Factor *nFactor ;     // Negated factor
    } ;

• Literal is short for literal constant.

• Note that C-- does not allow negative integer literals.

– Makes things a lot simpler.

• Unused fields are ignored.

– Set to "" or 0 or NULL as appropriate.

3.21

Example Expressions

• Example C-- expression :

42

• Schematic diagram :

    Expression
      be1 : -> BasicExp
      relOp : ""
      be2 : NULL

    BasicExp
      term : -> Term
      addOp : ""
      bexp : NULL

    Term
      fact : -> Factor
      mulOp : ""
      term : NULL

    Factor
      literal : true
      type : INTDATA
      litBool : ""
      litString : ""
      litInt : 42
      ident : NULL
      bexp : NULL
      nFactor : NULL
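The chain in the diagram could be built along the following lines. This is a sketch only: the struct layouts follow the syner.h declarations shown on the earlier slides, but the helper makeIntLiteral is a hypothetical name introduced here, and SymTab is just forward declared.

```cpp
#include <cassert>
#include <string>
using std::string;

enum DataType { VOIDDATA, BOOLDATA, STRINGDATA, INTDATA };

struct SymTab;      // Symbol table entry, declared elsewhere.
struct Expression;  // Needed because Factor points back to Expression.

struct Factor {
    bool literal; DataType type; string litBool; string litString; int litInt;
    SymTab *ident; Expression *bexp; Factor *nFactor;
};
struct Term { Factor *fact; string mulOp; Term *term; };
struct BasicExp { Term *term; string addOp; BasicExp *bexp; };
struct Expression { BasicExp *be1; string relOp; BasicExp *be2; };

// Hypothetical helper: wrap an integer literal in the four-level chain
// Expression -> BasicExp -> Term -> Factor, with unused fields set to "" or null.
Expression *makeIntLiteral(int value) {
    assert(value >= 0);  // C-- does not allow negative integer literals
    Factor *f = new Factor{true, INTDATA, "", "", value, nullptr, nullptr, nullptr};
    Term *t = new Term{f, "", nullptr};
    BasicExp *b = new BasicExp{t, "", nullptr};
    return new Expression{b, "", nullptr};
}
```

Even the simplest expression needs all four levels, because the grammar only reaches a literal by going through basic_exp, term and factor in turn.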

3.22

Example Expressions II

• Example C-- expression :

42 < a

• Schematic diagram :

* as from be1 on last slide.

    Expression
      be1 : -> *
      relOp : "<"
      be2 : -> BasicExp

    BasicExp
      term : -> Term
      addOp : ""
      bexp : NULL

    Term
      fact : -> Factor
      mulOp : ""
      term : NULL

    Factor
      literal : false
      type : INTDATA
      litBool : ""
      litString : ""
      litInt : 0
      ident : -> Symbol table entry for a
      bexp : NULL
      nFactor : NULL

3.23

Recursive Descent Parsing

• Notice that the AST is a recursive data structure.

– Definition : for recursion see recursion.

• This is necessary because C-- has a recursive syntax :

– while and if statements may include other statements.

– Expressions include basic expressions which include terms which include factors which include expressions.

• The simplest way to syntax analyse a recursive syntax is via a method known as recursive descent syntax analysis.

– Usually called recursive descent parsing although technically parsing is syntax analysis plus lexical analysis.

3.24

Recursive Descent Parsing II

• In a recursive descent parser we write a subprogram to recognise each of the different syntactic constructs.

– One subprogram per production rule (more or less).

• Each analysis subprogram calls lexAnal to get tokens as required.

• Each analysis subprogram may call other analysis subprograms to recognise component syntactic constructs.

– synStatement will call synAssign, synIf, synWhile etc.

– synIf and synWhile will call synStatement recursively.

• Lots more on how to do recursive descent parsing in the next lecture.
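The shape of such a parser can be sketched on a deliberately cut-down statement grammar. Everything here is an assumption made for illustration: the Lexer stand-in, the string tokens, and the simplified rules (assignments of one identifier to another, and while loops whose condition is a single identifier). Only the names synStatement, synAssign, synWhile and lexAnal come from the slides, and their real signatures are not shown there.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Stand-in for the lexical analyser: hands out pre-split tokens one at a time.
struct Lexer {
    vector<string> tokens;
    std::size_t pos;
    explicit Lexer(const vector<string> &t) : tokens(t), pos(0) {}
    string peek() const { return pos < tokens.size() ? tokens[pos] : string("<eof>"); }
    string lexAnal() { return pos < tokens.size() ? tokens[pos++] : string("<eof>"); }
};

// Consume one token and check that it is the one the rule requires.
bool expect(Lexer &lx, const string &tok) { return lx.lexAnal() == tok; }

bool isIdent(const string &tok) {
    return !tok.empty() && std::isalpha(static_cast<unsigned char>(tok[0])) && tok != "while";
}

bool synStatement(Lexer &lx);  // forward declaration: the rules are mutually recursive

// assign ::= identifier '=' identifier ';'
bool synAssign(Lexer &lx) {
    return isIdent(lx.lexAnal()) && expect(lx, "=") && isIdent(lx.lexAnal()) && expect(lx, ";");
}

// while ::= 'while' '(' identifier ')' '{' { statement } '}' ';'
bool synWhile(Lexer &lx) {
    if (!(expect(lx, "while") && expect(lx, "(") && isIdent(lx.lexAnal())
          && expect(lx, ")") && expect(lx, "{")))
        return false;
    while (lx.peek() != "}" && lx.peek() != "<eof>")
        if (!synStatement(lx)) return false;   // recursive call for nested statements
    return expect(lx, "}") && expect(lx, ";");
}

// statement ::= while | assign, chosen by looking at the next token only.
bool synStatement(Lexer &lx) {
    return lx.peek() == "while" ? synWhile(lx) : synAssign(lx);
}
```

This sketch only recognises input and returns true or false. A real analysis subprogram would also build the corresponding AST node and report an error message at the point where an expected token is missing.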

3.25

Chomsky Grammars

• Noam Chomsky is an American linguist (among other things).

• Chomsky’s theories about the syntactic structure of languages are very popular with Computists.

– Much less popular with Linguists.

• Chomsky identified 4 types of grammar.

– Increasing levels of restrictions on the production rules.

• In a Chomsky type 0 grammar all production rules are of the form

A ::= alpha

where A is a non-empty string of terminal and/or non-terminal symbols containing at least one non-terminal, and alpha is a (possibly empty) string of terminal and/or non-terminal symbols.

• Type 0 is the most general type of grammar.

– Often called unrestricted grammars.

– Natural language grammars.

3.26

Chomsky Grammars II

• A Chomsky type 1 grammar contains only production rules of the form

beta A gamma ::= beta alpha gamma

where A is a single non-terminal symbol, alpha is a non-empty string of terminal and/or non-terminal symbols, and beta and gamma are (possibly empty) strings of terminal and/or non-terminal symbols.

– i.e. each production replaces a single non-terminal symbol in a particular context.

• Often called context-sensitive grammars.

3.27

Chomsky Grammars III

• A Chomsky type 2 grammar contains only production rules of the form

A ::= alpha

where A is a single non-terminal symbol and alpha is a string of terminal and/or non-terminal symbols.

• Often called context-free grammars.

• A Chomsky type 3 grammar contains only production rules of the forms

A ::= a and A ::= bB

in which A and B are single non-terminal symbols and a and b are single terminal symbols.

• Often called regular grammars.
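The two rule forms of a type 3 grammar map directly onto a finite-state recogniser: each non-terminal becomes a state and each rule becomes a transition. The sketch below assumes, purely for illustration, that an identifier is a letter followed by any number of letters or digits; the slides do not give the actual C-- micro-syntax, and the function name isIdentifier is made up.

```cpp
#include <cassert>
#include <cctype>
#include <string>
using std::string;

// Assumed micro-syntax, written as a type 3 grammar:
//   I ::= letter | letter R
//   R ::= letter | digit | letter R | digit R
bool isIdentifier(const string &text) {
    enum State { IN_I, IN_R };   // one state per non-terminal
    State state = IN_I;
    for (unsigned char c : text) {
        if (state == IN_I) {
            if (!std::isalpha(c)) return false;  // I must start with a letter
            state = IN_R;
        } else {
            if (!std::isalnum(c)) return false;  // R accepts letters and digits
        }
    }
    return state == IN_R;        // at least one letter was consumed
}
```

No stack and no recursion are needed, which is why lexical items can be recognised with far simpler machinery than statements and expressions.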

3.28

Programming Languages And Chomsky Grammars

• In programming languages the syntax of statements, expressions etc. is defined by a type 2 grammar.

– Syntax analysis works on type 2 grammars.

• Type 3 grammars are used to define the micro-syntax of lexical items such as identifiers, literals and strings.

– Lexical analysis works on type 3 grammars.

• Note that a type 3 grammar is also a type 2 grammar.

– A type 2 grammar is also a type 1 grammar.

– A type 1 grammar is also a type 0 grammar.

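• In general, type 0 grammars cannot be analysed by a computer : no algorithm can decide, for every type 0 grammar and every input, whether the input is a sentence.

– For the other 3 types this question can always be decided by a computer.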

3.29

Backtracking

• Consider the following grammar :

S ::= ‘c’ A ‘d’

A ::= ‘a’ ‘b’ | ‘a’

• Suppose the input sequence is “cad”.

– Parser consumes ‘c’.

– Chooses first alternative for A and consumes ‘a’.

– Parser now looking for a ‘b’ which is not present.

– Parser must backtrack and try the second alternative for A.

– Parser consumes ‘a’.

– Parser consumes ‘d’.

• Backtracking is inefficient and makes the parser more complex to write.
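The trace above can be reproduced with a small backtracking recogniser for this grammar. It is a sketch under simplifying assumptions: tokens are single characters in a string, and the function names and the backtracks counter are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
using std::string;

// Try one alternative for A starting at pos; on success advance pos past it.
// alt 0 is 'a' 'b', alt 1 is 'a'.
bool parseA(const string &in, std::size_t &pos, int alt) {
    assert(pos <= in.size());
    const string body = (alt == 0) ? "ab" : "a";
    if (in.compare(pos, body.size(), body) != 0) return false;
    pos += body.size();
    return true;
}

// S ::= 'c' A 'd'. backtracks counts how many alternatives for A were abandoned.
bool parseS(const string &in, int &backtracks) {
    backtracks = 0;
    if (in.empty() || in[0] != 'c') return false;
    std::size_t pos = 1;
    for (int alt = 0; alt < 2; ++alt) {
        std::size_t saved = pos;   // remember where A started
        if (parseA(in, pos, alt) && pos < in.size() && in[pos] == 'd' && pos + 1 == in.size())
            return true;
        pos = saved;               // rewind the input and try the next alternative
        ++backtracks;
    }
    return false;
}
```

For the input "cad" the first alternative is abandoned once before the second succeeds, while "cabd" is accepted with no backtracking at all. The saved position is the extra bookkeeping that a predictive parser avoids.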

3.30

Predictive Parsing

• It is reasonably straightforward to construct a grammar which can be parsed without backtracking.

• Simplest type is the LL(1) grammar :

– An LL(1) grammar can be parsed by reading the input left to right.

– At each step the leftmost derivation is produced. Only the leftmost non-terminal in the sequence is replaced.

– Which production rule to use at each step is uniquely determined by the non-terminal being expanded and the next input token not yet consumed : 1 token of lookahead.

• C-- has an LL(1) grammar.

• Ideal for recursive descent predictive parsing.
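As an illustration, the grammar from the backtracking slide can be rewritten so that one token of lookahead is always enough. Factoring the common 'a' out of the two alternatives gives A ::= 'a' [ 'b' ], which accepts the same strings. The recogniser below is a sketch of that idea, with single characters standing in for tokens and all function names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
using std::string;

// Grammar after left factoring (an equivalent rewrite chosen for this sketch):
//   S ::= 'c' A 'd'
//   A ::= 'a' [ 'b' ]

// Next unconsumed input token, or '\0' at end of input.
char look(const string &in, std::size_t pos) { return pos < in.size() ? in[pos] : '\0'; }

// Consume the next token only if it is the expected one.
bool match(const string &in, std::size_t &pos, char expected) {
    if (look(in, pos) != expected) return false;
    ++pos;
    return true;
}

bool parseA(const string &in, std::size_t &pos) {
    if (!match(in, pos, 'a')) return false;
    if (look(in, pos) == 'b') ++pos;   // the optional 'b' is decided by lookahead alone
    return true;
}

bool accepts(const string &in) {
    std::size_t pos = 0;
    return match(in, pos, 'c') && parseA(in, pos) && match(in, pos, 'd') && pos == in.size();
}
```

The input position only ever moves forward, so no position needs to be saved and restored. That is the practical difference from the backtracking approach.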

3.31

Summary

• Syntax analysis : determining whether or not a file contains a syntactically correct program.

• Recursive descent parsing : One subprogram to syntax analyse each (major) syntactic construct.

• Symbol table holds information about user defined identifiers for use in syntax analysis and code generation.

• Abstract syntax tree holds an abstract representation of the statements for use in syntax analysis and code generation.

• Chomsky grammars are used to define the syntax of programming languages.

• C-- has an LL(1) Chomsky type 2 grammar.