Syntax Analysis
• Programming Language Syntax: Syntax Specifications
• Stages in Translation: Processing Programs, Syntax Analysis, Semantic Analysis, Lexical Analyzer, Code Generation
• Regular Expressions
• Finite Automata
• Grammar Types: Unrestricted, Context-Free, Context-Sensitive, Regular, BNF, EBNF
• Derivation: Parse Tree
• Grammar Issues: Ambiguous Grammars, Grammar Transformations, Syntax Diagram
• Recursive Descent Process, Shift-reduce Parsing
• Concrete and Abstract Syntax
• LL Grammar
• LR Grammar: SLR, LALR
• Programming the Scanner and Parser
Coverage
• Syntax defines the structure of the language
• Syntax helps in:
  − Language design and language comprehension
  − Implementing or writing the compiler, the software specification and the language system as a whole
  − Verifying program correctness
• Definitions
  − Constructs: Strings that belong to the language
  − Syntax: The form or structure of the expressions, statements, and the program unit as a whole
  − Semantics: What happens when a program segment is executed; it provides the meaning of the statements, expressions and program units
  − Pragmatics: Tools provided by the translator to help in debugging and interacting with the operating system
Programming Language Syntax
• Lexeme: Lowest-level syntactic unit of a language (e.g., sum, begin)
• Token: Category of lexemes (e.g., identifiers)
• Any compiler needs recognizers to recognize the syntax of the language
• Notations of Expressions
  − Infix notation: the operator symbol is placed between the operands
  − Prefix or Polish notation: the operator symbol is placed before the operands
  − Postfix, Suffix or Reverse Polish notation: the operator symbol is placed after the operands
  − Mixfix notation: operations that don't fit into the previous notations, like if-then-else
Programming Language Syntax
• Associativity in Expressions
  − Left-associative: Expressions with the same operator, or operators of the same precedence, are grouped from left to right.
    • Example: +, -, * and /
  − Right-associative: Expressions with the same operator, or operators of the same precedence, are grouped from right to left.
    • Example: Assignment and exponentiation
• Expression Trees and their Evaluation
  − An expression is represented as a tree whose root yields the result of the expression
  − Traversing a tree can be done in many ways:
    • In-order traversal: All the nodes in the left subtree are visited first and then the root node is visited. Finally, the nodes in the right subtree are visited.
    • Post-order traversal: All the nodes in the left and right subtrees are visited before the root node is visited.
Programming Language Syntax
• Expression Trees and their Evaluation
  − Traversing a tree can be done in many ways:
    • Pre-order traversal: The root node is visited first and then the nodes of the left and right subtrees are visited.
    • Breadth-first traversal: Traversal proceeds level by level, finishing all nodes at one level before moving to the next. It is also called level-order traversal.
    • Depth-first traversal: Traversal goes down into the depth of one subtree before moving on to the next subtree. The visiting order of a depth-first traversal is similar to that of pre-order traversal.
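The traversal orders described on these slides can be sketched in Python as follows. The Node class, the function names and the sample tree for (a + b) * c are illustrative assumptions, not part of the original slides.

```python
from collections import deque

class Node:
    """Binary expression-tree node: operators inside, operands at the leaves."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def in_order(n):
    # left subtree, root, right subtree
    return [] if n is None else in_order(n.left) + [n.value] + in_order(n.right)

def pre_order(n):
    # root, left subtree, right subtree
    return [] if n is None else [n.value] + pre_order(n.left) + pre_order(n.right)

def post_order(n):
    # left subtree, right subtree, root
    return [] if n is None else post_order(n.left) + post_order(n.right) + [n.value]

def level_order(root):
    # breadth-first: finish one level before moving to the next
    out, queue = [], deque([root])
    while queue:
        n = queue.popleft()
        if n is not None:
            out.append(n.value)
            queue.extend([n.left, n.right])
    return out

# Tree for the infix expression (a + b) * c
tree = Node("*", Node("+", Node("a"), Node("b")), Node("c"))
print(in_order(tree))     # infix order of symbols
print(pre_order(tree))    # prefix (Polish) order
print(post_order(tree))   # postfix (Reverse Polish) order
print(level_order(tree))  # level by level
```

Pre-order and post-order traversal of an expression tree yield the prefix and postfix notations of the expression, which ties the traversals back to the notations listed earlier.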
Programming Language Syntax
• Evaluation of Expressions
  − Applicative Order Evaluation (strict or eager evaluation): The process of evaluation is bottom-up, which means the processing starts from the leaves and moves towards the root
  − Normal Order Evaluation: An expression is evaluated only when it is needed in the computation of the result
    • Call: Addition(5+2)
    • Definition: Addition(Y) {int Y; Y = Y + 2;}
    • Here, Y is replaced with 5+2 in the body instead of doing the addition first
  − Lazy Evaluation (Delayed evaluation): Evaluation is postponed until it is really needed
    • Frequently used in functional languages.
  − Block Order Evaluation: This is the evaluation of an expression that contains a declaration.
    • Example: In Pascal, a function can contain a block expression that includes variable declarations
Programming Language Syntax
• Evaluation of Expressions
  − Short Circuit Evaluation: When evaluating Boolean (logical) expressions, the result can sometimes be determined after evaluating only part of the expression
    • AND (X AND Y): If both X and Y are "1", then the result is "1". Otherwise, the result is "0". If X is "0", the result is "0" and Y need not be evaluated.
    • OR (X OR Y): If either or both X and Y are "1", then the result is "1". Otherwise, the result is "0". If X is "1", the result is "1" and Y need not be evaluated.
    • XOR (X XOR Y): If only one of them (X or Y) is "1", then the result is "1". Otherwise, the result is "0". Both operands are always needed.
    • NOT (X): If X is "1", then the result is "0". If X is "0", then the result is "1".
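A small Python sketch of short-circuit behaviour, assuming a hypothetical helper operand() that records which operands actually get evaluated. Python's and / or short-circuit, while ^ (XOR on booleans) always evaluates both sides.

```python
calls = []  # names of the operands that were actually evaluated

def operand(name, value):
    """Hypothetical helper: record the evaluation, then return the value."""
    calls.append(name)
    return value

# AND: X is False, so Y is never evaluated
calls.clear()
and_result = operand("X", False) and operand("Y", True)
and_calls = list(calls)

# OR: X is True, so Y is never evaluated
calls.clear()
or_result = operand("X", True) or operand("Y", False)
or_calls = list(calls)

# XOR: both operands are always needed
calls.clear()
xor_result = operand("X", True) ^ operand("Y", True)
xor_calls = list(calls)

print(and_result, and_calls)
print(or_result, or_calls)
print(xor_result, xor_calls)
```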
Programming Language Syntax
Compilation Process
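[Figure: compilation pipeline with the stages SOURCE PROGRAM, SCANNER, PARSER, SEMANTIC ANALYSIS, INTERMEDIATE CODE GENERATION, OPTIMIZATION (optional) and CODE GENERATION. The data passed between stages are labeled TOKENS, PARSE TREE, ABSTRACT SYNTAX TREE, INTERMEDIATE CODE and MACHINE CODE. The scanner and parser together are marked SYNTAX ANALYSIS, and a SYMBOL TABLE accompanies the stages.]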
• Syntax analysis has a low-level part and a high-level part.
• Low-level part (scanner or lexical analyzer):
  − Mostly done using finite automata
  − Input symbols are scanned and grouped into meaningful units called tokens.
  − Tokens are formed by the principle of longest substring (maximum match), using a lookahead pointer
• High-level part (parser or syntax analyzer):
  − Described using Backus-Naur Form (BNF) or a context-free grammar
  − Tokens are grouped into syntactic units like expressions, statements and declarations, and checked for whether they conform to the grammatical rules of the language
• Identification of reserved words: Use a lookup table (symbol table)
• if statement: "if" "(" "y" "<" "5" ")" …
  − y is a variable (identifier), < is an operator, …
  − Tokens are classified as keywords, operators, identifiers, literals, etc.
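A minimal scanner sketch in Python for the if-statement example above. The token categories, the KEYWORDS lookup table and the function name scan are assumptions made for illustration. Python's regex alternation takes the first matching alternative rather than the longest, so longer operators are listed before their prefixes to get the maximum-match behaviour.

```python
import re

KEYWORDS = {"if", "else", "while"}  # hypothetical reserved-word lookup table

TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",     r"<=|>=|==|<|>|="),   # longer operators first: maximum match
    ("PUNCT",  r"[(){};]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Group input characters into (category, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue                      # whitespace produces no token
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"              # reserved words found by table lookup
        tokens.append((kind, lexeme))
    return tokens

print(scan("if (y < 5) x = 10;"))
print(scan("iffy<=5"))  # 'iffy' stays one identifier, '<=' one operator
```

Because the identifier pattern is greedy, the whole lexeme iffy is matched before the keyword table is consulted, so it is not split into the keyword if followed by fy.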
Compilation Process
• Parser
  − The parser should find all syntax errors and produce the parse tree
  − Parsing algorithms:
    • Top-down: Recursive descent (a coded implementation) and LL parser (a table-driven implementation)
    • Bottom-up: LR parsers
• Why separate the syntax analysis into scanner and parser?
  − Simplicity: Separating them makes the parser simpler.
  − Efficiency: The separation makes it possible to optimize the lexical analyzer.
  − Portability: Even though parts of the lexical analyzer might not be portable, the parser can always be made portable
Compilation Process
• Semantic analysis (contextual analysis) is required to make sure that the data types match
• Semantic analysis works in synchronization with the syntax analysis
• Contextual analysis is used to answer the following:
  − Has the variable been declared earlier?
  − Does the declared type match the usage type of the variable?
  − Has the variable been initialized in advance?
  − Is the reference to the array within the bounds of the array?
  − …
• Code generation
  − Converting the program into executable machine code
  − Stages: intermediate code generation and code generation
Compilation Process
• Regular expressions are used to represent the information required by the lexical analyzer
• Regular Expression Definitions: The rules of a language L(E) defined over the alphabet of the language are expressed using a regular expression E.
  − Alternation: If a and b are regular expressions, then (a+b) is also a regular expression.
  − Concatenation (or Sequencing): If a and b are regular expressions, then (a.b) is also a regular expression.
  − Kleene Closure: If a is a regular expression, then a* means zero or more repetitions of a.
  − Positive Closure: If a is a regular expression, then a+ means one or more repetitions of a.
  − Empty: The empty expression denotes a language with no strings.
  − Atom: An atom denotes a language containing exactly one string.
Regular Expressions
• Regular expressions to match integers and floating point numbers
  − To match a digit: [0-9]
  − To match one or more digits: [0-9]+
  − To support both signed and unsigned integers: -?[0-9]+
    • -? indicates the presence or absence of a minus sign
  − Floating point representation: an optional integer part before the dot and at least one digit after it
    • ([0-9]* \. [0-9]+)
  − Exponent part: the character "e", in lower or upper case
    • "e" is followed by an optional + or - sign, which is followed by an integer.
    • ([eE][-+]?[0-9]+)?
    • The question mark at the end indicates that the exponent part is not compulsory.
  − Combined: -?(([0-9]+) | ([0-9]* \. [0-9]+) ([eE][-+]?[0-9]+)?)
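The combined pattern can be tried out with Python's re module. This is a sketch: the spaces in the slide's pattern are treated as layout only and dropped, and the name is_number is an assumption. Since alternation binds more loosely than concatenation, the optional exponent attaches only to the floating point alternative, so a string such as 12e5 is not accepted by the pattern as written.

```python
import re

# The slide's pattern with the readability spaces removed
NUMBER = re.compile(r"-?(([0-9]+)|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?)")

def is_number(text):
    """True when the whole string matches the pattern."""
    return NUMBER.fullmatch(text) is not None

for sample in ["42", "-7", "3.14", ".5", "-2.5e10", "1.0E-3", "12e5", "1.", "abc"]:
    print(f"{sample!r:10} {is_number(sample)}")
```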
Regular Expressions
• Finite automata are computing devices that accept or recognize the language represented by a given regular expression
• Finite Automata Definitions
  − Alphabet (Σ): A finite, non-empty set of symbols. Symbols are represented using lower-case Latin letters and are considered atoms that cannot be subdivided further. Ex. Σ = {a,b,c}
  − String or Word: A finite sequence of symbols from a single alphabet.
    • Given the alphabet Σ = {a,b,c}, some of the strings that could be formed are: a, abc, aa, abcabcabc
  − Empty String (ε): A string composed of zero symbols. The empty string is a string over every alphabet.
  − Size of a String: The number of symbols present in the string.
    • Size of the string ab is denoted as |ab| = 2
    • |ε| = 0 and |b| = 1
Finite Automata
• Finite Automata Definitions
  − Concatenation of Strings: Strings can be combined to form a new string.
    • S1 = abc and S2 = def: S1S2 = abcdef and S2S1 = defabc
    • Concatenating the empty string: εS1 = S1ε = εabc = abcε = abc = S1
    • The empty string is the identity element for string concatenation.
  − Languages (L): A language is a set of strings, possibly infinite, over a given alphabet. For Σ = {a,b,c}, an example is L = {aⁿbⁿcⁿ | n ≥ 0}
    • In this example, the numbers of a's, b's and c's are the same.
  − Power of an alphabet:
    • Written Σⁿ, the power of order n
    • The order n is the number of symbols in each string formed from the alphabet
      − For the alphabet Σ = {a,b,c}
      − Σ⁰ = {ε}
      − Σ¹ = {a, b, c}
      − Σ² = {aa, bb, cc, ab, ba, ac, ca, bc, cb}
      − Σ³ = {aaa, bbb, ccc, aab, bba, aac, cca, …}
Finite Automata
• Finite Automata Definitions
  − Closure of an alphabet:
    • Reflexive-transitive closure (Kleene closure):
      − Zero or more symbols in each string.
      − Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ Σ³ ∪ … = {ε, a, b, c, aa, bb, cc, ab, … }
    • Transitive closure (positive closure):
      − One or more symbols in each string.
      − Σ⁺ = Σ¹ ∪ Σ² ∪ Σ³ ∪ … = {a, b, c, aa, bb, cc, ab, … }
    • Any language defined on the given alphabet is a subset of the reflexive-transitive closure of the alphabet.
      − ∀L, L ⊆ Σ*
  − Empty Language:
    • The empty language is one that has no strings in it.
    • L = {} is the empty language.
    • L = {ε} is not an empty language because it contains one string, the empty string.
Finite Automata
• Finite Automata Representation
  − Circle: state; Arrows: transition; Double circle: final state
  − States are indicated using numbers
  − Arrows are labeled with a transition symbol or ε
Finite Automata
Figure 2.2. NFA for ε
Figure 2.3. NFA for t
Figure 2.4. NFA for XY
Figure 2.5. NFA for X|Y
• DFA (Deterministic Finite Automata) vs NFA (Non-deterministic Finite Automata)
  − In a DFA, empty transitions (ε) are not allowed. Also, from any state s there is at most one edge labeled with a given symbol a.
• Convert from NFA to DFA
  − Find the ε-closure of s:
    • Add s (the node itself) to its ε-closure, i.e. ε-closure(s) = {s}
  − Reachable with empty transition: If there is a node t in ε-closure(s), and there exists an edge labeled ε from t to u, then u is also added to ε-closure(s) if it is not there already. Continue until no more nodes can be added to ε-closure(s)
Finite Automata
Figure 2.6. NFA for X*
• Convert from NFA to DFA
  − State transition:
    • From the initial ε-closure, find the transitions on the various terminals present in the given regular expression
    • Example: If there is a node t in the current set, and there exists an edge labeled with a (non-empty) input symbol from t to u, then u is added to the new set if it is not there already. From u, add all the nodes that can be reached using ε-transitions.
  − A transition table is drawn based on the states and inputs.
  − Optimization of the transition table can be done as follows:
    • Partition the set of states into non-final and final states.
    • With the non-final states:
      − A state whose transition goes outside the group is separated from the group.
      − If there are states with the same transitions on all the inputs, keep one of those states and replace the other entries with the preserved one.
      − Check for a dead state. A dead state is a non-final state whose transitions all lead back to itself irrespective of the input.
Finite Automata
• Transitions for (m | n)*mnn
• Find ε-closure: Starting from 0, using ε-transitions, we can reach 0, 1, 2, 4 and 7. A = {0, 1, 2, 4, 7}.
• Transition of m on set A gives {3, 8}.
  − From node 3, we can reach 6, 7, 1, 2 and 4 using ε-transitions. From node 8, no further node can be reached using ε-transitions.
  − So ε-closure({3, 8}) = B = {1, 2, 3, 4, 6, 7, 8}.
• Transition of n on set A, we get C = {1,2,4,5,6,7}
• Transition of n on set B, we get D = {1,2,4,5,6,7,9}
• Transition of n on set D, we get E = {1,2,4,5,6,7,10}
• Applying the transition of m on set C gives B again. So, we stop here because any further transition only repeats the sets already found.
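The ε-closure and state-transition steps can be sketched in Python as below. The NFA edge list is an assumption: it is the usual construction for (m | n)*mnn with states 0 to 10, chosen because it reproduces the sets A to E given above. The function names are also illustrative.

```python
EPS = ""  # label used here for ε-transitions

# Assumed NFA for (m | n)*mnn: state -> {symbol: set of next states}
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"m": {3}}, 3: {EPS: {6}},
    4: {"n": {5}},    5: {EPS: {6}},    6: {EPS: {1, 7}},
    7: {"m": {8}},    8: {"n": {9}},    9: {"n": {10}}, 10: {},
}

def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """States reachable from `states` on one edge labeled `symbol`."""
    return {t for s in states for t in NFA[s].get(symbol, ())}

def subset_construction(start, alphabet):
    """Return the DFA start set and its transition table."""
    start_set = eps_closure({start})
    dfa, worklist = {}, [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue                      # this set was already expanded
        dfa[current] = {}
        for a in alphabet:
            target = eps_closure(move(current, a))
            if target:
                dfa[current][a] = target
                worklist.append(target)
    return start_set, dfa

A, DFA = subset_construction(0, "mn")
B, C = DFA[A]["m"], DFA[A]["n"]
D = DFA[B]["n"]
E = DFA[D]["n"]
for name, s in zip("ABCDE", (A, B, C, D, E)):
    print(name, sorted(s))
```

Set E contains NFA state 10, the accepting state, so E is the only final state of the resulting DFA.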
Finite Automata - Example
[Figure: NFA for (m | n)*mnn with states 0 to 10]
• Transition Table
• Non-final states (ABCD); final state (E).
• With the non-final states:
  − On input m, all of them go to B and so they are in one group.
  − On input n, states A, B, and C move to members of group (ABCD) but D goes to E. So, split (ABCD) into (ABC) and (D).
  − In (ABC), with input n, states A & C go to C but B goes to D. So, split them as (AC) and (B).
  − In (AC), both states have the same transitions. Thus, use only one of them (A).
  − Check for a dead state. In our example, there is no dead state.
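The partitioning steps can be sketched as a small refinement loop in Python. The transition table DELTA is an assumption reconstructed from the sets A to E of the previous slide (the entries for E in particular are not stated in the text), and minimize is an illustrative name.

```python
# Assumed DFA transition table for (m | n)*mnn
DELTA = {
    "A": {"m": "B", "n": "C"},
    "B": {"m": "B", "n": "D"},
    "C": {"m": "B", "n": "C"},
    "D": {"m": "B", "n": "E"},
    "E": {"m": "B", "n": "C"},
}
FINAL = {"E"}

def minimize(delta, final):
    """Split groups until all states in a group behave alike on every input."""
    partition = [set(delta) - final, set(final)]
    changed = True
    while changed:
        changed = False
        refined = []
        for group in partition:
            buckets = {}
            for state in sorted(group):
                # signature: index of the group reached on each input symbol
                sig = tuple(
                    next(i for i, g in enumerate(partition) if delta[state][a] in g)
                    for a in sorted(delta[state])
                )
                buckets.setdefault(sig, set()).add(state)
            refined.extend(buckets.values())
            if len(buckets) > 1:
                changed = True            # this group was split
        partition = refined
    return partition

groups = minimize(DELTA, FINAL)
print(sorted(sorted(g) for g in groups))
```

The loop first separates D from (ABC), then B from (AC), and stops when A and C can no longer be told apart, which mirrors the steps listed above.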
Finite Automata - Example
• Terminal Symbols: Atomic or non-divisible symbols in any language
• Non-terminal Symbols (variable symbols, syntactic categories, syntactic variables or abstractions): A non-terminal symbol can have more than one right-hand side (RHS) alternative, separated by a vertical bar (|).
• Distinguished symbol (start symbol): The basic category that is being defined
• Production or Rewriting Rules: Rules that are used to define the structure of the constructs. They define how to rewrite a variable symbol using terminal and non-terminal symbols. A rule has a left-hand side (LHS) that derives a right-hand side (RHS) made up of terminal and non-terminal symbols.
Grammar Types - Definitions
• Grammar: A grammar is a finite non-empty set of rules.
• Syntactic lists: Lists of a syntactic nature can be represented using recursion.
  − <ident_list> → ident | ident, <ident_list>
• Derivation: This is the process of repeatedly applying the rules, starting from the start symbol, until there are no more non-terminal symbols to expand.
Grammar Types - Definitions
• Unrestricted Grammar:
  − Also called recursively enumerable, phrase-structure or Type 0 grammar.
  − There is no restriction on the right-hand side of the production rule.
  − At least one non-terminal symbol must be present on the left side of the production rule
  − Production rule takes the form α → β, where α ∈ (V ∪ T)⁺ and β ∈ (V ∪ T)*
  − V: finite set of variable symbols.
  − T: finite set of terminal symbols.
  − Example: S → ACaB; Ca → aaC
Grammar Types
• Context-Sensitive Grammar:
  − Also called Type 1 grammar
  − Requires that the right side of a production rule not have fewer symbols than the left side
  − Called context-sensitive because the replacement of a variable depends on what surrounds it
  − Production rule takes the form αAβ → αγβ
    • where A ∈ V; α, β ∈ (V ∪ T)* and γ ∈ (V ∪ T)⁺
  − Example: Things b → b Thing; Thing c → Other b c
Grammar Types
• Context-Free Grammar:
  − Also called Type 2 grammar
  − Developed by Noam Chomsky during the mid-1950s
  − The left side of a production rule is a single variable symbol and the right side is a combination of terminal and variable symbols
  − Production rule takes the form A → α, where A ∈ V and α ∈ (V ∪ T)*
  − Example: Fraction → Digit; Fraction → Digit Fraction
Grammar Types
• Regular Grammar:
  − Also called restrictive grammar or Type 3 grammar
  − Each production rule is restricted to have only one terminal, or one terminal and one variable, on the right side
  − Regular grammars are classified as right-linear or left-linear grammars.
  − Right-linear grammar
    • A → xB or A → x where A ∈ V, B ∈ V, and x ∈ T
  − Left-linear grammar
    • A → Bx or A → x where A ∈ V, B ∈ V, and x ∈ T
  − Regular expressions vs context-free grammar:
    • To represent lexical rules, which are simple in nature, we don't need a notation as powerful as context-free grammar
    • Regular expressions can be used to make recognizers for any regular language.
Grammar Types
• Backus-Naur Form (BNF):
  − Invented by John Backus to describe Algol 58
  − Described as a metalanguage because it is a language that is used to describe another language
  − Considered equivalent to context-free grammar
  − Abstractions are used to represent various classes of syntactic structures, which act like non-terminal symbols.
• To represent a while statement:
  − <while_stmt> → while ( <logic_expr> ) <stmt>
• Reasons for using BNF to describe syntax are:
  − BNF provides a clear and concise syntax description.
  − The parser can be based directly on the BNF.
  − Parsers based on BNF are easier to handle.
Grammar Types
• Extended BNF (EBNF):
  − BNF's notation + regular expressions
  − Different notations persist:
    • Optional parts: Denoted with the subscript opt or enclosed in square brackets.
      − <proc_call> → ident ( <expr_list> )opt
      − <proc_call> → ident [ ( <expr_list> ) ]
    • Alternative parts:
      − Pipe (|) indicates an either-or choice
      − Grouping of the choices is done with square brackets or parentheses.
        <term> → <term> [+ | -] const
        <term> → <term> (+ | -) const
    • Repetitions (0 or more) are put in braces ({ })
      − An asterisk indicates zero or more occurrences of the item.
      − Presence or absence of the asterisk means the same here, as the curly brackets themselves indicate zero or more occurrences of the item.
        <ident> → letter {letter | digit}*
        <ident> → letter {letter | digit}
Grammar Types
• Differences between BNF and EBNF notations
  − BNF:
    • <expr> → <expr> + <term> | <expr> - <term> | <term>
    • <term> → <term> * <factor> | <term> / <factor> | <factor>
  − EBNF:
    • <expr> → <term> {[+ | -] <term>}*
    • <term> → <factor> {[ * | / ] <factor>}*
• The EBNF rule starts from the final replacement of <expr> by <term> and expresses the repetition with braces, so its right-hand side contains no <expr> entry.
Grammar Types
• Apply the grammar to the start symbol <program> and continue to expand until there are no more non-terminal symbols left on the right-hand side
• Methods of Derivation
  − Leftmost derivation: the leftmost non-terminal in each sentential form is expanded
  − Parse tree or derivation tree
    • A top-down parser keeps the start symbol as the root of the tree. Then, it keeps replacing variable symbols using the production rules until only terminal symbols remain.
    • A bottom-up parser begins with the terminal symbols. These are matched with the right-hand side of a production rule and replaced with the variable symbol on the left-hand side of that rule.
    • Parse trees can be used to attach the semantics of a construct to its syntactic structure; this is called syntax-directed semantics
Derivation
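• Given the regular grammar S ::= aS | bS | a | b, check whether the grammar can derive strings of the form aⁿbⁿ.
  − Let's try for a¹b¹: S ⇒ aS ⇒ ab
  − Let's try for a²b²: S ⇒ aS ⇒ aaS ⇒ aabS ⇒ aabb
  − Let's try for a³b³: S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaabS ⇒ aaabbS ⇒ aaabbb
  − We are able to attain the required form using this regular grammar. Note that the grammar also derives strings that are not of this form (e.g., ba), so it does not describe the language aⁿbⁿ exactly.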
Derivation - Example
• Ambiguities in Grammar
  − A grammar is said to be ambiguous if it generates a sentential form that has two or more distinct parse trees.
  − Ex. If statement with dangling else.
Grammar Issues
[Figure: two parse trees for the same nested if statement. Each tree contains one If Statement of the form if ( Expression ) Statement and one of the form if ( Expression ) Statement else Statement. In one tree the else belongs to the inner if, in the other to the outer if.]
• Left Factorization:
  − The initial element of the alternatives on the right side of the given rule is the same
    • N → XY | XZ becomes N → X (Y|Z)
• Elimination of Left Recursion:
  − The first element on the right-hand side is the non-terminal on the left-hand side of the rule
    • N → X | NY becomes N → XY*
  − The expansion of NY can terminate only if we replace N with X.
  − If N → X is used without any use of N → NY, then there will be no Y.
    • N ⇒ NY ⇒ NYY ⇒ XYY
• Substitution of Non-terminal Symbols:
  − A non-terminal symbol on the right-hand side of a rule can be replaced using the rule that defines it.
    • N → X and M → N can be changed to N → X and M → X
Grammar Transformations
• Also called syntax charts or railroad diagrams
• Developed by Niklaus Wirth in 1970
• Used to visualize rules in the form of diagrams
• Used to represent EBNF notations and not BNF notations
• Variables are represented by rectangles and terminal symbols are represented by circles (sometimes ovals)
• Each production rule is represented as a directed graph whose vertices are symbols
Syntax Diagram
• There is a subprogram for each non-terminal in the grammar that parses the sentences generated by that non-terminal
• To proceed with the correct grammatical rule, we match each terminal symbol in the right-hand side with the next input token.
  − If there is a match, we continue further.
  − Otherwise, an error is generated or other rules are tried
• If a non-terminal has more than one RHS, we determine which one to parse as follows:
  − Choose the correct RHS based on the next token (lookahead).
  − The next token is compared with the first token that can be generated by each RHS until a match is found.
  − If there is no match, then it is considered a syntax error.
• Shift-Reduce Parsing: With the given grammar and input string, we repeatedly reduce substrings that match the right-hand side of a production to the non-terminal on its left-hand side, until the start symbol of the grammar is reached
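A recursive descent parser for the EBNF expression rules can be sketched in Python with one method per non-terminal. The rule for factor (parenthesized expression, identifier or number), the tuple-shaped tree and all names are assumptions for illustration. The simplified tokenizer skips characters it does not recognize.

```python
import re

def tokenize(text):
    # identifiers, integers and single-character operators/parentheses
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        """Lookahead: the next token, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):    # <expr> -> <term> {(+ | -) <term>}
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())   # left-associative grouping
        return node

    def term(self):    # <term> -> <factor> {(* | /) <factor>}
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):  # assumed rule: <factor> -> ( <expr> ) | id | number
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expr()
            self.expect(")")
            return node
        if tok is not None and re.fullmatch(r"\w+", tok):
            self.pos += 1
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("a + b * c"))
print(parse("(a + b) * c"))
print(parse("a - b - c"))
```

Each while loop implements one EBNF repetition in braces, and the choice between alternatives in factor is made from the single lookahead token, as described above.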
Recursive Descent Parsing
• Concrete Syntax:
  − Defines the structure of all the parts of a program like arithmetic expressions, assignments, loops, functions, definitions, etc.
  − Context-free grammars, BNF, EBNF, etc. describe concrete syntax.
    • Assignment → Identifier = Expression;
    • Expression → Term | Expression + Term
• Abstract Syntax:
  − Generated by the parser and used to link the syntax and semantics of a program
  − Unlike concrete syntax, abstract syntax provides only the essential syntactic elements and does not describe how they are structured
    • Statement = Assignment | Loop
    • Assignment = Variable target; Expression source
• Ambiguity occurs in concrete syntax but not in abstract syntax
Concrete and Abstract Syntax
• Identification Tables
  − Also called symbol tables.
  − A dictionary-type data structure to store identifier names along with their corresponding attributes
• Organization of the identification table depends on the "block structure" used in different languages
  − Monolithic block structure: e.g. BASIC, COBOL
  − Flat block structure: e.g. Fortran
  − Nested block structure is used in the modern "block-structured" programming languages (e.g. Algol, Pascal, C, C++, Scheme, Java, …)
• Monolithic Block Structure:
  − A single block is used for the entire program
  − Every identifier is visible throughout the entire program
  − The scope of each identifier is the whole program, and an identifier cannot be declared twice
Symbol Table
• Flat Block Structure:
  − The whole program area is divided into several disjoint blocks
  − Declarations can be local or global
  − Identifiers can be redefined in another block
  − A local declaration is given higher priority than a global declaration
• Nested Block Structure:
  − Blocks may be nested one within another
  − The scope of an identifier depends on the level of nesting
  − An identifier cannot be defined more than once at the same level within the same block
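A nested block structure can be sketched as a stack of dictionaries, one per open block. The class and method names below are assumptions chosen for illustration.

```python
class SymbolTable:
    """Identification table for nested blocks: a stack of scopes."""

    def __init__(self):
        self.scopes = [{}]               # the outermost (global) block

    def enter_block(self):
        self.scopes.append({})           # a new, more deeply nested block

    def exit_block(self):
        self.scopes.pop()                # its declarations go out of scope

    def declare(self, name, attributes):
        scope = self.scopes[-1]
        if name in scope:
            # no duplicate declarations at the same level of the same block
            raise KeyError(f"{name!r} already declared in this block")
        scope[name] = attributes

    def lookup(self, name):
        # innermost declaration wins; search outward through enclosing blocks
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

table = SymbolTable()
table.declare("x", "int")
table.enter_block()
table.declare("x", "float")              # redefinition in an inner block
inner = table.lookup("x")
table.exit_block()
outer = table.lookup("x")
print(inner, outer)
```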
Symbol Table
• Unordered list: Data can be stored in an array or a linked list.
• Ordered list:
  − Entries in the list are ordered
  − Searching is faster
  − Insertion of data into the list is an expensive process
• Binary Search Tree:
  − Using a balanced binary search tree, searching takes O(log(n)) time.
• Hash Table:
  − Most commonly used option
  − Data can be accessed in constant time on average
  − Storage of data is not time consuming
Symbol Table Structure
• The first L in LL specifies a left-to-right scan of the input
• The second L specifies that a leftmost derivation is generated
• The first step towards using an LL grammar is the elimination of common prefixes. Note: α and β can match zero or more elements.
  − Form is B → αβ1 | αβ2 | … | αβm | Xm+1 | Xm+2 | … | Xm+n
  − Replace it with
    • B → αB1 | Xm+1 | Xm+2 | … | Xm+n
    • B1 → β1 | β2 | … | βm
• Convert the grammar into an unambiguous one
  − Make sure the rules obey precedence and associativity
  − Start from the terminals and move from high precedence to low precedence
  − Consider the grammar: E → E + E | E * E | (E) | id
  − Select the terminals and name them differently.
    • Factor → (E) | id
  − The * operator has higher priority than the + operator. So, E → E * E is considered next.
LL Grammar
• Convert the grammar into an unambiguous one
  − Consider the grammar: E → E + E | E * E | (E) | id
  − * has higher priority than +. So, select E → E * E next
    • To provide the link between E * E and the Factor, use the pipe (|) operator.
    • With no link, the non-terminal will never become a terminal.
    • Give a new name "Term" for the element.
    • Term → Term * Factor | Factor
  − Then, consider E → E + E and change it also.
    • Expression → Expression + Term | Term
  − So, F → (E) | id; T → T * F | F; E → E + T | T
• Remove left recursion
  − If A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
  − where no βi begins with an A. For E → E + T | T, A is E, α is +T and β is T
  − Replace the above with:
    • A → β1A' | β2A' | … | βnA'
    • A' → α1A' | α2A' | … | αmA' | ε
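The replacement rule for immediate left recursion can be sketched in Python. Productions are represented as lists of symbols; this representation and the function name eliminate_left_recursion are assumptions for illustration, and only immediate (direct) left recursion is handled.

```python
EPSILON = "ε"

def eliminate_left_recursion(nt, alternatives):
    """Rewrite A -> Aα1 | ... | Aαm | β1 | ... | βn without left recursion.

    `alternatives` is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping each non-terminal to its new alternatives.
    """
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    betas = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not alphas:
        return {nt: alternatives}        # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],                      # A  -> βi A'
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[EPSILON]],  # A' -> αi A' | ε
    }

print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))
print(eliminate_left_recursion("T", [["T", "*", "F"], ["F"]]))
print(eliminate_left_recursion("F", [["(", "E", ")"], ["id"]]))
```

Applied to E → E + T | T and T → T * F | F, this yields E → TE', E' → +TE' | ε, T → FT' and T' → *FT' | ε.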
LL Grammar
• Consider the grammar
  − E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id
• FIRST & FOLLOW
  − FIRST:
    • If X is a terminal, then FIRST(X) is {X}.
    • If X is a non-terminal and X → aα is a production, then add a to FIRST(X). If X → ε is a production, then add ε to FIRST(X).
    • If X → Y1Y2…Yk is a production, then for all i such that all of Y1,…,Yi-1 are non-terminals and FIRST(Yj) contains ε for j = 1,2,…,i-1, add every non-ε symbol in FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1,2,…,k, then add ε to FIRST(X).
  − The third rule of FIRST applies to E → TE' where T → FT' and F → (E) | id. Thus, what is in FIRST(F) will be in FIRST(E) and FIRST(T).
    • FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
    • FIRST(E') = {+, ε}
    • FIRST(T') = {*, ε}
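The FIRST rules can be sketched as a fixed-point computation in Python. The dictionary encoding of the grammar and the name compute_first are assumptions for illustration.

```python
EPS = "ε"

# E -> TE'; E' -> +TE' | ε; T -> FT'; T' -> *FT' | ε; F -> (E) | id
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def compute_first(grammar):
    """Apply the FIRST rules repeatedly until no set grows any more."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                before = len(first[nt])
                all_nullable = True
                for sym in rhs:
                    if sym == EPS:
                        break                       # X -> ε: add ε below
                    sym_first = first[sym] if sym in grammar else {sym}
                    first[nt] |= sym_first - {EPS}
                    if EPS not in sym_first:
                        all_nullable = False        # later symbols cannot start X
                        break
                if all_nullable:
                    first[nt].add(EPS)
                changed |= len(first[nt]) != before
    return first

FIRST = compute_first(GRAMMAR)
for nt in GRAMMAR:
    print(f"FIRST({nt}) = {sorted(FIRST[nt])}")
```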
LL Grammar
• FIRST & FOLLOW
  − FOLLOW: (α is any string of grammar symbols; it can also be ε.)
    • $ is in FOLLOW(X), where X is the start symbol.
    • If there is a production A → αBβ, β ≠ ε, then everything in FIRST(β) except ε is in FOLLOW(B).
    • If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
    • In FOLLOW, apply the first rule to the whole grammar, then apply the second rule to the whole grammar, and so on.
    • Note: Refer to notes for verbal explanation of the FIRST & FOLLOW rules
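The three FOLLOW rules can be sketched in Python for the grammar E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id. The FIRST sets are entered by hand from the previous slide. The encoding and the names compute_follow and first_of_string are assumptions for illustration.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
# FIRST sets of the non-terminals, copied from the FIRST slide
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}

def first_of_string(symbols):
    """FIRST of a sequence of grammar symbols (ε if all can vanish)."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})   # a terminal's FIRST is itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result

def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                              # rule 1
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, b in enumerate(rhs):
                    if b not in grammar:
                        continue                        # only non-terminals have FOLLOW
                    beta_first = first_of_string(rhs[i + 1:])
                    before = len(follow[b])
                    follow[b] |= beta_first - {EPS}     # rule 2
                    if EPS in beta_first:
                        follow[b] |= follow[a]          # rule 3
                    changed |= len(follow[b]) != before
    return follow

FOLLOW = compute_follow(GRAMMAR, "E")
for nt in GRAMMAR:
    print(f"FOLLOW({nt}) = {sorted(FOLLOW[nt])}")
```

Unlike the slide's rule-by-rule ordering, this sketch applies rules 2 and 3 together in each pass and repeats until nothing changes; the resulting sets are the same.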
LL Grammar
[Figure: Second rule of FOLLOW: for A → αBβ, FOLLOW(B) receives FIRST(β), except ε. Third rule of FOLLOW: for A → αB, or for A → αBβ under the condition that FIRST(β) contains ε, FOLLOW(B) receives FOLLOW(A).]
• FIRST & FOLLOW
  − FOLLOW(E) = FOLLOW(E') = {), $}
  − FOLLOW(T) = FOLLOW(T') = {+, ), $}
  − FOLLOW(F) = {+, *, ), $}
• Generating the parsing table
  − A grammar whose parsing table has no multiply-defined entries is said to be LL(1). Below, α is any string of grammar symbols; it can also be ε.
  1. For each production A → α of the grammar, do steps 2 & 3.
  2. For each terminal a in FIRST(α), add A → α to M[A,a].
  3. If ε is in FIRST(α), add A → α to M[A,b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$].
     − Note: Here, M[A,b] indicates the corresponding cell in the table, whose row corresponds to the non-terminal A and whose column corresponds to the terminal b.
  4. Make each undefined entry of M an error.
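Steps 1 to 4 can be sketched in Python. The FIRST and FOLLOW sets are entered by hand from the slides; the table is a dictionary keyed by (non-terminal, terminal), and a missing key plays the role of an error entry. The names build_table and first_of_string are assumptions for illustration.

```python
EPS, END = "ε", "$"

PRODUCTIONS = [
    ("E", ["T", "E'"]), ("E'", ["+", "T", "E'"]), ("E'", [EPS]),
    ("T", ["F", "T'"]), ("T'", ["*", "F", "T'"]), ("T'", [EPS]),
    ("F", ["(", "E", ")"]), ("F", ["id"]),
]
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = {
    "E": {")", END}, "E'": {")", END},
    "T": {"+", ")", END}, "T'": {"+", ")", END},
    "F": {"+", "*", ")", END},
}

def first_of_string(symbols):
    """FIRST of a right-hand side; contains ε if the whole string can vanish."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})   # terminals (and ε) stand for themselves
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result

def build_table():
    table = {}
    for head, body in PRODUCTIONS:              # step 1
        body_first = first_of_string(body)
        columns = body_first - {EPS}            # step 2
        if EPS in body_first:
            columns |= FOLLOW[head]             # step 3 (FOLLOW may include $)
        for terminal in columns:
            if (head, terminal) in table:
                raise ValueError(f"multiply-defined entry M[{head},{terminal}]: not LL(1)")
            table[(head, terminal)] = body
    return table                                # step 4: absent keys are errors

M = build_table()
for (nt, terminal), body in sorted(M.items()):
    print(f"M[{nt},{terminal}] = {nt} -> {' '.join(body)}")
```

No ValueError is raised for this grammar, so under these FIRST and FOLLOW sets the grammar is LL(1).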
LL Grammar
• LR: Left-to-right scan of the input, producing a Rightmost derivation in reverse
• Most powerful shift-reduce parsing technique
  − Non-backtracking shift-reduce parsing which can detect a syntactic error as soon as possible
• Represented as LR(k), where k indicates the look-ahead value
• LR(1) means one symbol of look-ahead: only the next input element is considered, and not anything that follows it.
• Can parse all grammars that can be parsed with predictive parsers, like LL(1) grammars
• Types of LR parsers:
  − SLR: Simple LR parser.
  − LR: Most general (canonical) LR parser.
  − LALR: Intermediate LR parser (Look-ahead LR parser).
• All the types use the same algorithm but with different parsing tables
LR Grammar
• LR parser configuration: (S0 X1 S1 ... Xm Sm, ai ai+1 ... an $), which includes the stack contents and the rest of the input
  − Xi is a grammar symbol
  − Si is a state
  − ai is an input symbol
• The initial stack contains just S0
LR Grammar
[Figure 2.11. LR Parsing: the input buffer a1 ... ai ... an $; the stack S0 X1 S1 ... Xm-1 Sm-1 Xm Sm with Sm on top; the LR parsing algorithm; the Action table (rows: states; columns: terminals and $; four different actions) and the Goto table (rows: states; columns: non-terminals; each item is a state number).]
• The parser takes an action using Sm and ai
• shift s: shifts the next input symbol ai and the state s onto the stack
  − (S0 X1 S1 ... Xm Sm, ai ai+1 ... an $) ⇒ (S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $)
• reduce A → β (or rn, where n is a production number)
  − pop r grammar symbols together with their r states from the stack, where r is the length of β. This is done so that we can replace the right-hand side with the left-hand side of the production.
  − then push A and s, where s = goto[Sm-r, A]. Here, m-r indicates that r symbol-state pairs have been taken off the stack.
  − (S0 X1 S1 ... Xm Sm, ai ai+1 ... an $) ⇒ (S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $)
  − Output is the reducing production rule, reduce A → β
• Accept: Parsing is successfully completed.
• Error: The parser has detected an error. This happens when the corresponding entry in the action table is empty.
• GOTO takes a state and a grammar symbol as arguments and produces a state.
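The shift, reduce, accept and error actions can be sketched as a table-driven loop in Python. The ACTION and GOTO tables below are an assumption: they are the commonly published SLR tables for the grammar 1) E → E+T 2) E → T 3) T → T*F 4) T → F 5) F → (E) 6) F → id. As a simplification, only states are kept on the stack, since the grammar symbols are implied by them.

```python
# production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

# Assumed SLR tables: "sN" = shift to state N, "rN" = reduce by production N
ACTION = {
    0:  {"id": "s5", "(": "s4"},
    1:  {"+": "s6", "$": "acc"},
    2:  {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3:  {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4:  {"id": "s5", "(": "s4"},
    5:  {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6:  {"id": "s5", "(": "s4"},
    7:  {"id": "s5", "(": "s4"},
    8:  {"+": "s6", ")": "s11"},
    9:  {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}

def lr_parse(tokens):
    """Return the sequence of production numbers used in the reductions."""
    stack = [0]                       # initial stack holds just state 0
    tokens = tokens + ["$"]
    pos, output = 0, []
    while True:
        action = ACTION[stack[-1]].get(tokens[pos])
        if action is None:            # empty table entry: error
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        if action == "acc":
            return output
        if action[0] == "s":          # shift: push the new state, consume input
            stack.append(int(action[1:]))
            pos += 1
        else:                         # reduce A -> β: pop |β| states, then goto
            number = int(action[1:])
            head, length = PRODUCTIONS[number]
            del stack[len(stack) - length:]
            stack.append(GOTO[stack[-1]][head])
            output.append(number)

print(lr_parse(["id", "*", "id", "+", "id"]))
```

Read in reverse, the returned production numbers give a rightmost derivation of the input.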
LR Grammar
• Closure: If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by the two rules:
  1. Initially, every LR(0) item in I is added to closure(I).
  2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I). Here, B is a non-terminal; α and β can be anything or even empty
• The above-mentioned rule is applied until no more LR(0) items can be added to closure(I).
• Grammar: E' → E; E → E+T; E → T; T → T*F; T → F; F → (E); F → id
• Check for a non-terminal after the dot; if there is one, add its productions.
• closure({E' → .E}) = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
Phases of LR Grammar Processing
• GOTO: If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
  − If A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).
• Example: I = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
  − goto(I,E) = { E' → E., E → E.+T } Move the dot one step further over E.
  − goto(I,T) = { E → T., T → T.*F } Move the dot one step further over T.
  − goto(I,F) = { T → F. } Move the dot one step further over F.
  − goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id } After moving the dot past (, a non-terminal follows the dot, so the closure of that non-terminal is added.
  − goto(I,id) = { F → id. } Move the dot one step further over id.
Phases of LR Grammar Processing
• Canonical LR(0) collection: This collection of item sets is needed to create the SLR parsing table.
C is { closure({S' → .S}) }
repeat the following until no more sets of LR(0) items can be added to C:
    for each I in C and each grammar symbol X
        if goto(I,X) is not empty and not in C
            add goto(I,X) to C
• The goto function is the transition function of a DFA whose states are the sets in C.
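The loop above can be sketched in Python for the expression grammar. The item representation `(head, body, dot)`, the helper names and the fixed symbol order in `SYMBOLS` are assumptions of this sketch; the symbol order is chosen only so that the states come out numbered I0, I1, ... in the order they are discovered.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}
SYMBOLS = ["E", "T", "F", "(", "id", "+", "*", ")"]


def closure(items):
    """closure(I): keep adding B -> .gamma while a non-terminal B follows a dot."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in NONTERMINALS:
                for b_head, b_body in GRAMMAR:
                    if b_head == body[dot] and (b_head, b_body, 0) not in result:
                        result.add((b_head, b_body, 0))
                        changed = True
    return frozenset(result)


def goto(items, symbol):
    """goto(I, X): move the dot over X, then take the closure."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved) if moved else frozenset()


def canonical_collection():
    """Return the list of item sets C and the DFA transitions between them."""
    states = [closure({("E'", ("E",), 0)})]
    transitions = {}
    i = 0
    while i < len(states):              # states appended below are visited later
        for symbol in SYMBOLS:
            target = goto(states[i], symbol)
            if target:
                if target not in states:
                    states.append(target)
                transitions[(i, symbol)] = states.index(target)
        i += 1
    return states, transitions


states, transitions = canonical_collection()
print(len(states), "item sets")
print("I0 on id goes to I%d" % transitions[(0, "id")])
```

The `transitions` dictionary is the DFA mentioned above: its keys are (state, symbol) pairs and its values are target states.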
Phases of LR Grammar Processing
For I1, we look at I0 and use the transition on the symbol E. I2 and I3 are obtained from I0 using the transitions on the symbols T and F.
• For I4, we have moved the dot past the open bracket. As the dot is now followed by E (a non-terminal), we need to add the items for E (E → .E+T and E → .T) from I0. As these items have further non-terminals (T and F) after the dot, we add their items also.
• I5 is made using the transition on id from I0. Then, we make the transition on + from I1 to obtain I6.
Phases of LR Grammar Processing
Figure 2.12. SLR Transitions (transition diagram over the item sets I0 to I11; edges are labelled with the grammar symbols E, T, F, id, (, ), + and *)
1. Construct the canonical collection of sets of LR(0) items for G'. C ← {I0,...,In}
2. Create the parsing action table as follows:
• If a is a terminal, A → α.aβ is in Ii and goto(Ii,a)=Ij, then action[i,a] is shift j.
• If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'. A → α in the reduce entry is represented using the sequence number of A → α in the grammar.
• Note: There is no element after the dot; α can be anything or even empty.
• If S' → S. is in Ii, then action[i,$] is accept. Here, E being the starting symbol S, E' → E. will produce the accept entry.
• If any conflicting actions are generated by these rules, the grammar is not SLR(1).
3. Create the parsing goto table:
• For all non-terminals A, if goto(Ii,A)=Ij then goto[i,A]=j
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S
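Steps 1 to 4 can be sketched in Python for the expression grammar. This is an illustrative sketch under stated assumptions: the FOLLOW sets are written out by hand rather than computed, the tuple encoding of table entries is invented here, and small closure and goto helpers are included so that the block runs on its own.

```python
GRAMMAR = [
    ("E'", ("E",)),            # augmented start rule, not numbered
    ("E", ("E", "+", "T")),    # rule 1
    ("E", ("T",)),             # rule 2
    ("T", ("T", "*", "F")),    # rule 3
    ("T", ("F",)),             # rule 4
    ("F", ("(", "E", ")")),    # rule 5
    ("F", ("id",)),            # rule 6
]
NONTERMINALS = {head for head, _ in GRAMMAR}
SYMBOLS = ["E", "T", "F", "(", "id", "+", "*", ")"]
# FOLLOW sets worked out by hand for this grammar (an assumption of the sketch).
FOLLOW = {
    "E": {"+", ")", "$"},
    "T": {"+", "*", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in NONTERMINALS:
                for b_head, b_body in GRAMMAR:
                    if b_head == body[dot] and (b_head, b_body, 0) not in result:
                        result.add((b_head, b_body, 0))
                        changed = True
    return frozenset(result)


def goto(items, symbol):
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved) if moved else frozenset()


def canonical_collection():
    states, transitions, i = [closure({("E'", ("E",), 0)})], {}, 0
    while i < len(states):
        for symbol in SYMBOLS:
            target = goto(states[i], symbol)
            if target:
                if target not in states:
                    states.append(target)
                transitions[(i, symbol)] = states.index(target)
        i += 1
    return states, transitions


def slr_table():
    """Build the action and goto tables; raise if the grammar is not SLR(1)."""
    states, transitions = canonical_collection()
    action, goto_table = {}, {}

    def set_action(key, value):
        if action.get(key, value) != value:
            raise ValueError(f"conflict at {key}: grammar is not SLR(1)")
        action[key] = value

    for (i, symbol), j in transitions.items():
        if symbol in NONTERMINALS:
            goto_table[(i, symbol)] = j              # step 3
        else:
            set_action((i, symbol), ("shift", j))    # step 2, shift entries
    for i, state in enumerate(states):
        for head, body, dot in state:
            if dot < len(body):
                continue                             # dot not at the end
            if head == "E'":
                set_action((i, "$"), ("accept",))
            else:
                rule = GRAMMAR.index((head, body))   # index 0 is E' -> E
                for a in FOLLOW[head]:
                    set_action((i, a), ("reduce", rule))
    return action, goto_table


action, goto_table = slr_table()
print(action[(0, "id")], action[(1, "$")])
```

Any (state, symbol) pair missing from `action` is an error entry, as in step 4.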
LR Grammar – Create SLR Parsing Table
• 1) E → E+T 2) E → T 3) T → T*F
• 4) T → F 5) F → (E) 6) F → id
• The entry s5 at (row, column) = (0, id) arises because, in Figure 2.12, I0 transits to I5 on id. So, action[0, id] = shift 5.
• The entry s6 at (1, +) arises because, in Figure 2.12, I1 transits to I6 on +. And so on.
Phases of LR Grammar Processing
LR Grammar – Given an input id * id + id
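The parse of id * id + id can be traced with a small table-driven driver in Python. The ACTION and GOTO entries below are written out by hand from the item sets I0 to I11 for rules 1 to 6 and are an assumption of this sketch; the names `RULES`, `ACTION`, `GOTO` and `parse` are also invented here. For brevity the stack holds only states, not the grammar symbols.

```python
# Rule number -> (left-hand side, length of right-hand side).
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

# sN = shift and go to state N, rN = reduce by rule N, acc = accept.
ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def parse(tokens):
    """Return the rule numbers in the order the reductions are made."""
    tokens = list(tokens) + ["$"]
    stack, pos, output = [0], 0, []
    while True:
        entry = ACTION[stack[-1]].get(tokens[pos])
        if entry is None:                      # empty table entry: error
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        if entry == "acc":
            return output
        if entry[0] == "s":                    # shift: push state, advance input
            stack.append(int(entry[1:]))
            pos += 1
        else:                                  # reduce by rule A -> beta
            rule = int(entry[1:])
            head, length = RULES[rule]
            del stack[-length:]                # pop |beta| states
            stack.append(GOTO[stack[-1]][head])
            output.append(rule)


print(parse(["id", "*", "id", "+", "id"]))
```

The returned list is the sequence of productions output by the reduce steps, which is a rightmost derivation in reverse.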
• An SLR(1) grammar is called an SLR grammar for short.
• An SLR grammar is always unambiguous, but not every unambiguous grammar is an SLR grammar.
• An SLR grammar does not possess any of these conflicts:
− Shift/Reduce conflict: A state in which the parser cannot decide whether to shift or to reduce for a terminal.
− Reduce/Reduce conflict: A state in which the parser cannot decide whether to reduce using production rule i or production rule j for a terminal.
• Canonical SLR(1) parsing table:
− In the SLR method, state i makes a reduction by A → α when the current token is a:
• if A → α. is in Ii and a is in FOLLOW(A)
− In some situations, βA cannot be followed by the terminal a in a right-sentential form when βα and the state i are on top of the stack. This means that making the reduction in this case is not correct.
SLR(1) Grammar
• LR(1) item
− In order to avoid invalid reductions, we need to make the states carry more information. This information is added as a terminal symbol in the form of a second component in an item.
− An LR(1) item is defined as A → α.β,a where a is the look-ahead of the LR(1) item (a is a terminal or the end-marker). When β (in the LR(1) item A → α.β,a) is not empty, the look-ahead does not have any effect.
− When β is empty (A → α.,a), we do the reduction by A → α only if the next input symbol is a (not for any terminal in FOLLOW(A)).
− A state will contain A → α.,a1 ... A → α.,an where {a1,...,an} ⊆ FOLLOW(A)
SLR(1) Grammar
• Canonical Collection of LR(1) items: Similar to LR(0) but with slight changes in closure and goto.
• closure(I) is (where I is a set of LR(1) items):
− every LR(1) item in I is in closure(I)
− if A → α.Bβ,a is in closure(I) and B → γ is a production rule of G, then B → .γ,b will be in closure(I) for each terminal b in FIRST(βa).
• B is the symbol next to the dot. The rules of any non-terminal that follows the dot will be included into the closure.
• β is what follows B, so the look-ahead b comes from FIRST(β) or, when β is empty or can derive the empty string, from FIRST(a). α and β can be anything or even empty.
• If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
− If A → α.Xβ,a is in I, then every item in closure({A → αX.β,a}) will be in goto(I,X).
− Move the dot a step forward using goto
SLR(1) Grammar
• Numbering of the rules starts with 1, but the initial rule S' → S is excluded from the rule numbering.
SLR(1) Grammar
• In I0: In the item S' → .S,$, $ is the look-ahead that follows S'. As the dot is followed by a non-terminal (S), we need to add its rules (S → .L=R,$ and S → .R,$) also.
− S' → .S,$ matches A → α.Bβ,a and S → .L=R,$ matches B → .γ,b. $ is added as the look-ahead because β is empty, so FIRST(βa) = FIRST(a) = FIRST($) = {$}. Then, as the dot is followed by L and R in the added items, we add their rules also. The dot stays at the beginning of the right-hand side in the added rules.
• In I0: The items L → .*R,$/= and L → .id,$/= carry two look-aheads. The = comes from matching A → α.Bβ,a with S → .L=R,$, where β is =R and FIRST(βa) = {=}. The $ comes from matching it with R → .L,$, where β is empty and a is $.
• In I0: R → .L,$ does not have = as a look-ahead because A → α.Bβ,a is matched with S → .R,$, where β is empty and a is $.
• Transitions are handled based on the movement of the dot across a terminal or non-terminal. The transition from I0 to I1 is on S.
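The look-aheads in I0 can be checked with a small Python sketch of the LR(1) closure. It assumes the grammar S → L=R | R, L → *R | id, R → L implied by the items above, and it relies on the fact that no rule in this grammar derives the empty string, so FIRST(βa) is simply FIRST of the first symbol of βa. The item representation `(head, body, dot, lookahead)` and the helper names are assumptions made here.

```python
GRAMMAR = [
    ("S'", ("S",)),
    ("S", ("L", "=", "R")),
    ("S", ("R",)),
    ("L", ("*", "R")),
    ("L", ("id",)),
    ("R", ("L",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def first_of_symbol(symbol, seen=frozenset()):
    """FIRST of one symbol; sufficient here because nothing derives epsilon."""
    if symbol not in NONTERMINALS:
        return {symbol}                      # a terminal or the end-marker $
    result = set()
    for head, body in GRAMMAR:
        if head == symbol and body[0] != symbol and body[0] not in seen:
            result |= first_of_symbol(body[0], seen | {symbol})
    return result


def closure(items):
    """LR(1) closure over items (head, body, dot, lookahead)."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot, a = worklist.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            beta_a = body[dot + 1:] + (a,)           # the string beta a
            for b in first_of_symbol(beta_a[0]):     # each b in FIRST(beta a)
                for b_head, b_body in GRAMMAR:
                    item = (b_head, b_body, 0, b)
                    if b_head == body[dot] and item not in result:
                        result.add(item)
                        worklist.append(item)
    return result


def lookaheads(items, head, body, dot):
    """Collect the look-aheads attached to one core item."""
    return {a for h, b, d, a in items if (h, b, d) == (head, body, dot)}


I0 = closure({("S'", ("S",), 0, "$")})
print(sorted(lookaheads(I0, "L", ("*", "R"), 0)))
print(sorted(lookaheads(I0, "R", ("L",), 0)))
```

A grammar with nullable non-terminals would need a full FIRST computation over the whole string βa instead of `first_of_symbol(beta_a[0])`.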
SLR(1) Grammar
1. Construct the canonical collection of sets of LR(1) items for G'. C ← {I0,...,In}
2. Create the parsing action table as follows:
• If a is a terminal, A → α.aβ,b is in Ii and goto(Ii,a)=Ij, then action[i,a] is shift j.
• If A → α.,a is in Ii, then action[i,a] is reduce A → α, where A ≠ S'.
• If S' → S.,$ is in Ii, then action[i,$] is accept.
• If any conflicting actions are generated by these rules, the grammar is not LR(1).
3. Create the parsing goto table:
• For all non-terminals A, if goto(Ii,A)=Ij then goto[i,A]=j
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S,$
LR(1) Parsing Table Construction
• LALR stands for LookAhead LR
• LALR tables are smaller than canonical LR(1) parsing tables; an LALR parser has the same number of states as the SLR parser for the same grammar
• An LALR parser is obtained by shrinking the canonical LR(1) parser. This shrinking process should not produce a reduce/reduce conflict; if it does, the grammar is not LALR(1).
• The core of a set of LR(1) items is the set of first components of its items, which excludes the look-ahead.
− For example, in S → L.=R,$, the core part is S → L.=R
• If more than one set of LR(1) items has the same core, we merge those sets into a single state.
• Creating LALR parsing table
− Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
− Find all sets that have the same core. Replace those sets having the same core with a single set which is their union.
• C={I0,...,In} → C'={J1,...,Jm} where m ≤ n
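The merge step can be sketched in Python. The three item sets below are hand-written examples from the grammar S → L=R | R, L → *R | id, R → L (two of them share the core L → id.), and the tuple representation `(head, body, dot, lookahead)` and the function names are assumptions of this sketch.

```python
from collections import defaultdict


def core(state):
    """The core of an item set: its items with the look-aheads dropped."""
    return frozenset((head, body, dot) for head, body, dot, _ in state)


def merge_by_core(lr1_states):
    """Replace all item sets that share a core by their union."""
    groups = defaultdict(set)
    for state in lr1_states:
        groups[core(state)] |= set(state)
    return [frozenset(items) for items in groups.values()]


# Hand-written LR(1) item sets used only as sample input.
after_id_from_start = frozenset({
    ("L", ("id",), 1, "="),
    ("L", ("id",), 1, "$"),
})
after_id_following_equals = frozenset({
    ("L", ("id",), 1, "$"),
})
accept_state = frozenset({
    ("S'", ("S",), 1, "$"),
})

merged = merge_by_core([after_id_from_start, after_id_following_equals, accept_state])
print(len(merged), "states after merging")
```

After merging, the parsing tables are built from the merged sets in the same way as for LR(1); a reduce/reduce conflict at that point would mean the grammar is not LALR(1).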
LALR Grammar
• Creating LALR parsing table (continued)
− Create the parsing tables (action and goto tables) in the same way as for the LR(1) parser.
• Note that: If J = I1 ∪ ... ∪ Ik, then since I1,...,Ik have the same core, the cores of goto(I1,X),...,goto(Ik,X) must also be the same.
• So, goto(J,X)=K where K is the union of all sets of items having the same core as goto(I1,X).
− If no conflict is introduced, the grammar is an LALR(1) grammar. (Merging can introduce reduce/reduce conflicts but not shift/reduce conflicts.)
• Ambiguous grammars produce conflicts
− Consider this ambiguous grammar: E → E + E | E * E | (E) | id
− Produce the parsing table
LALR Grammar
• Errors can be detected by consulting the parsing action table
− The goto table is not used to detect errors
• A canonical LR, or LR(1), parser will not make any reduction before announcing an error, but SLR and LALR parsers might make several reductions before indicating an error
• Panic Mode Error Recovery in LR Parser
− When faced with an error, pop entries off the stack until a state s that has a goto on a particular non-terminal A is found
− Discard zero or more input symbols until a symbol a is found that is in FOLLOW(A)
− The parser can now push the non-terminal A and the state goto[s,A] and proceed with parsing
• Phrase Mode Error Recovery in LR Parser
− Each empty entry in the action table is associated with a specific error routine that reflects the most likely error in that case
− This error routine could insert symbols into, or delete symbols from, the stack or the input
− This could be useful in handling a missing operand, an unbalanced right parenthesis, etc.
Error Recovery in LR Grammar
• For scanner:
− lex (A Lexical Analyzer Generator): generates code in the C language
− Variants of lex: flex, AT&T lex, Abraxas Pclex, MKS Lex, POSIX Lex, jflex, …
• For parser:
− yacc ("Yet Another Compiler Compiler", with AT&T Yacc, Berkeley Yacc and GNU Bison as variants)
− Accent (used with Amber to check the grammar for ambiguity)
• Programming with lex/flex
− File name: filename.l
− Does not generate executable code, but generates the C routine called yylex()
− We will need to write a program that calls yylex() to run the lexer
• Lex programs are divided into three sections: definitions section, rules section and user subroutines section
− The start and end of the rules section are indicated using "%%"
− Only the user subroutines section is optional
Programming the Scanner and Parser
• In the definitions section, the part that is enclosed by %{ and %} is copied as it is into the generated C program
• C language comments can also be added outside the definitions section
• When using comments outside the %{ and %} block, the comments must be indented with whitespace.
• Rules section
− Maps each pattern to an action
− If an action consists of more than one statement, the statements are grouped with braces.
• User subroutines section
− Contains many subroutines
− The subroutine that calls yylex() is copied as it is into the C program
• Internal Variables of LEX/FLEX:
− yylval: This variable contains the value of the token.
− yyleng: This variable contains the length of the string the lexer has recognized.
Programming the Scanner and Parser
• Internal Variables of LEX/FLEX (continued):
− yyin: The file from which the lexer reads its input. By default, yyin is set to stdin.
− yylex(): Function that runs the lexer.
− yywrap(): Function that is called by yylex() when it reaches the end of the file.
− input(), output() and unput(): input() reads the next character and unput() puts a character back; redefining input() and unput() is needed to read input from the command line.
− Start State: Start states are defined using %s in the definitions section.
− ECHO: This macro is used to write the token to the current output file yyout. It is equivalent to: fprintf(yyout, "%s", yytext);
− REJECT: REJECT is used as an action to put back the text matched by the pattern and search for the next best match.
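The pattern-to-action idea behind the rules section can be imitated in a few lines of Python. This is only an analogy, not how lex is implemented: the names `RULES` and `scan` and the token shapes are invented here. It follows the lex convention that the longest match wins and that, for matches of equal length, the rule listed first wins; the matched text and its length play the roles of yytext and yyleng.

```python
import re

# Each rule pairs a pattern with an action, as in the rules section of a lex file.
# An action returns a (token, value) pair, or None to discard the text.
RULES = [
    (re.compile(r"[0-9]+"), lambda text: ("NUMBER", int(text))),
    (re.compile(r"[A-Za-z_][A-Za-z0-9_]*"), lambda text: ("IDENT", text)),
    (re.compile(r"[-+*/()]"), lambda text: (text, None)),
    (re.compile(r"[ \t\n]+"), lambda text: None),   # skip whitespace
]


def scan(source):
    """Return the list of tokens recognized in source."""
    pos, tokens = 0, []
    while pos < len(source):
        best = None
        for pattern, action in RULES:
            match = pattern.match(source, pos)
            # Strictly longer matches replace the current best, so on a tie
            # the earlier rule is kept.
            if match and match.end() > pos and (best is None or match.end() > best[0].end()):
                best = (match, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        match, action = best
        token = action(match.group())        # match.group() plays the role of yytext
        if token is not None:
            tokens.append(token)
        pos = match.end()
    return tokens


print(scan("sum + 42*(x)"))
```

In a real lex program the value would be stored in yylval and the token code returned to the parser from yylex().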
Programming the Scanner and Parser
• Programming with Yacc/Bison
− Generates an LALR(1) parser
− Being an LALR(1) parser, yacc can only look one token ahead, and thus ambiguities that need more than one token of look-ahead to resolve will be reported as conflicts
− The program structure in Yacc is similar to that of Lex
− Definitions section: definitions, C code and associativity rules are specified.
− Yacc calls the yylex routine repeatedly to get the tokens and then applies the rules specified
− As Lex returns tokens to Yacc, both the programs need to agree on what the tokens are
• Definitions section in yacc: %token NUMBER
− In the lex program:
• extern int yylval;
• %%
• [0-9]+ {yylval = atoi(yytext); return NUMBER; }
Programming the Scanner and Parser
• Programming with Yacc/Bison
− In the yacc program, do the following:
• Specify the variables.
− %union {int ival; double cost;}
• Connect the values to the return tokens.
− %token <ival> INDEX
− %token <cost> NUMBER
• Specify the type for the non-terminals. Here, expression is a non-terminal whose value has the type of cost.
− %type <cost> expression
• Associativity and precedence rules are specified in the definitions section of the yacc program.
− %left '-' '+'
− %left '*' '/'
− %nonassoc UMINUS
Programming the Scanner and Parser
• Programming with Yacc/Bison
− expression: expression '+' NUMBER {$$ = $1 + $3; }
− | expression '-' NUMBER {$$ = $1 - $3; }
− ;
• $1 represents the value of the first symbol on the right-hand side, $2 represents the operator and $3 represents the value of the third symbol on the right-hand side. The left-hand side is represented using $$.
− Using the union and yylval, only a single value can be passed between lexer and parser. So, use a symbol table to pass multiple values
− An error is reported using the yyerror() function.
− While compiling the C programs generated by Lex and Yacc, we use the -ly option of the C compiler. The yacc library provides default versions of main() and yyerror().
• Compilation and Execution on Linux platform.
− Compile the lex program: lex filename.l
− Compile the yacc program: yacc -d filename.y
− Compile the C program: gcc -o output y.tab.c lex.yy.c -ly -ll
− Running the program: ./output
Programming the Scanner and Parser
• Compilation and Execution on Windows platform.
− Make sure that flex (flex.exe), bison (bison.exe) and tcc (Tiny C Compiler, or any other C compiler) are installed.
− Compile the lex program: flex filename.l
− Compile the yacc program:
• bison -d filename.y
• bison -d filename.y -b y
− Compile the generated C program (using Tiny C Compiler, tcc):
• tcc -o output.exe y.tab.c lex.yy.c yyerror.c libyywrap.c yyinit.c main.c yyaccpt.c
− Running the program: output.exe
• Programming with Accent and Amber
− After writing the lex program, we need to write the accent program
− Rules have a left-hand side and a right-hand side separated by a colon
− The first symbol defined in the grammar is called the start symbol, and the grammar is a context-free grammar
Programming the Scanner and Parser
• Accent
− Parameters can be specified as in (inherited attributes) and out (synthesized attributes), with "<" and ">" enclosing them
− Statements written within %prelude { … } are literally copied into the generated C program
Programming the Scanner and Parser
%token NUMBER;
root: expression<n> { printf("Final = %d\n", n); };
expression<n>: expression<x> '+' term<y> { *n = x + y; } | term<n> ;
term<n> : term<x> '*' factor<y> { *n = x * y; } | factor<n> ;
factor<n> : '(' expression<n> ')' | NUMBER<n> ;
• Given the grammar (R stands for root, E stands for expression, T stands for term and F stands for factor; id is a terminal which is represented by the token NUMBER):
• R → E
• E → E + T | T
• T → T * F | F
• F → (E) | id
• Programming with Accent
− Compilation and Execution on Linux
• lex filename.l
• accent filename.acc
• gcc -o output yygrammar.c lex.yy.c entire.c
• Check for ambiguity using Amber:
− accent filename.acc
− gcc -o output -O3 yygrammar.c amber.c
− output examples 1000
− Compilation and Execution on Windows
• flex filename.l
• accent filename.acc
• tcc -o output.exe yygrammar.c lex.yy.c entire.c yyerror.c libyywrap.c main.c yyinit.c yyaccpt.c
• Check for ambiguity using Amber:
− accent filename.acc
− tcc -o output.exe yygrammar.c amber.c
− output examples 1000
Programming the Scanner and Parser