1
Theory of Compilation 236360
Erez Petrank
Lecture 2: Syntax Analysis, Top-Down Parsing
2
You are here
[Compiler pipeline diagram: Source text (txt) → Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Intermediate Representation (IR) → Code Generation → Executable code (exe)]
3
Last Week: from characters to tokens (Using Regular Expressions)
x = b*b – 4*a*c
txt
<ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS>
<INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>
Token Stream
4
The Lex Tool
• Lex automatically generates a lexical analyzer from a declaration file.
• Advantages: easy to produce a lexical analyzer from a short declaration; easily verified; easily modified and maintained.
• Intuitively: Lex builds a DFA; the analyzer simulates the DFA on a given input.

[Diagram: a Lex declaration file is fed to Lex, which produces a Lexical Analysis phase mapping characters to tokens]
5
Today: from tokens to AST
[Pipeline: Lexical Analysis → Syntax Analysis → Semantic Analysis → Intermediate Representation → Code Generation]

<ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">

[Syntax tree: the token stream parsed into expressions, terms, and factors, with MULT and MINUS as interior nodes and the identifiers and the constant 4 as leaves]
6
Syntax Analysis (Parsing)
• Goal: discover the program structure.
  – For example, a C program is built of functions, each function is built from declarations and instructions, each instruction is built from expressions, etc.
  – Is a sequence of tokens a valid program in the language?
  – Construct a structured representation of the input text.
  – Error detection and reporting.
An Example Structure of a Program
[Tree diagram: a program consists of a main function and more functions; each function is a { } block of declarations and statements; each declaration is a Type and an Id followed by ';'; each statement is, e.g., id = expr followed by ';'.]
8
Syntax Analysis (Parsing)
• Context-free grammars: a simple and accurate method for describing a program structure.
• We will look at families of grammars that can be efficiently parsed.
• The parser will read the token series, make sure the tokens are derivable in the grammar (or report an error), and construct the derivation tree.
9
Context free grammars
• V – non-terminals
• T – terminals (tokens, for us)
• P – production rules
  – Each rule is of the form V ➞ (T ∪ V)*
• S ∈ V – the initial symbol
G = (V,T,P,S)
10
Why do we need context-free grammars?

• Important program structures cannot be expressed by regular expressions, e.g., balanced parentheses: S ➞ SS; S ➞ (S); S ➞ ()
• Anything expressible as a regular expression is expressible by a CFG. Why use regular expressions at all?
  – Separation, modularity, simplification.
  – There is no point in using strong (and less efficient) tools on easily analyzable regular structures.
• Regular expressions describe lexical structures like identifiers, constants, keywords, etc.
• Grammars describe nested structures like balanced parentheses, matching begin-end, if-then-else, etc.
11
Example
S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

V = { S, E }
T = { id, ':=', ';', '+', '*', '(', ')' }
S is the initial variable.
Derivation Example
12
Input: x := z ; y := x + z

Grammar:
S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

S
⇒ S ; S
⇒ id := E ; S
⇒ id := id ; S
⇒ id := id ; id := E
⇒ id := id ; id := E + E
⇒ id := id ; id := E + id
⇒ id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
Derivation Example
13
Input token stream: <id,"x"> <ASS> <id,"z"> <SEMI> <id,"y"> <ASS> <id,"x"> <PLUS> <id,"z">

Grammar:
S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

S
⇒ S ; S
⇒ id := E ; S
⇒ id := id ; S
⇒ id := id ; id := E
⇒ id := id ; id := E + E
⇒ id := id ; id := E + id
⇒ id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
14
Terminology
• Derivation: a sequence of replacements of non-terminals using the production rules.
• Language: the set of strings of terminals derivable from the initial symbol.
• Sentential form – the result of a partial derivation, in which non-terminals may still appear.
15
Parse Tree

Input: x := z ; y := x + z

S
⇒ S ; S
⇒ id := E ; S
⇒ id := id ; S
⇒ id := id ; id := E
⇒ id := id ; id := E + E
⇒ id := id ; id := E + id
⇒ id := id ; id := id + id

[Parse tree: root S with children S, ';', S; the left S derives id := E with E ⇒ id; the right S derives id := E with E ⇒ E + E, where each E ⇒ id.]
16
Questions
• How did we know which rule to apply at every step?
• Does it matter?
• Would we always get the same result?
17
Ambiguity
Input: x := y + z * w

S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

Two parse trees for the same input:
[Tree 1: S ⇒ id := E, E ⇒ E + E, with the right E ⇒ E * E — '+' applied last]
[Tree 2: S ⇒ id := E, E ⇒ E * E, with the left E ⇒ E + E — '*' applied last]
18
Leftmost/rightmost Derivation
• Leftmost derivation – always expand the leftmost non-terminal.
• Rightmost derivation – always expand the rightmost non-terminal.
• Allows us to describe a derivation by listing the sequence of rules only – we always know which non-terminal each rule is applied to.
• Note that this does not necessarily resolve ambiguity (e.g., the previous slide).
• These are the orders of derivation applied in our parsers (coming soon).
Leftmost Derivation
19
Input: x := z ; y := x + z

Grammar:
S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

S
⇒ S ; S
⇒ id := E ; S
⇒ id := id ; S
⇒ id := id ; id := E
⇒ id := id ; id := E + E
⇒ id := id ; id := id + E
⇒ id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ id, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id
20
Rightmost Derivation
Input: x := z ; y := x + z

Grammar:
S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

S
⇒ S ; S
⇒ S ; id := E
⇒ S ; id := E + E
⇒ S ; id := E + id
⇒ S ; id := id + id
⇒ id := E ; id := id + id
⇒ id := id ; id := id + id

Rules applied: S ➞ S;S, S ➞ id := E, E ➞ E + E, E ➞ id, E ➞ id, S ➞ id := E, E ➞ id
21
Bottom-up Example
Input: x := z ; y := x + z

Grammar:
S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

id := id ; id := id + id
⇒ id := E ; id := id + id
⇒ S ; id := id + id
⇒ S ; id := E + id
⇒ S ; id := E + E
⇒ S ; id := E
⇒ S ; S
⇒ S

Rules applied: E ➞ id, S ➞ id := E, E ➞ id, E ➞ id, E ➞ E + E, S ➞ id := E, S ➞ S;S

Bottom-up, reducing the leftmost alternative at every step; read in reverse, this is exactly the rightmost derivation obtained when going top-down.
22
Parsing
• A context-free language can be recognized by a non-deterministic pushdown automaton
  – But not necessarily by a deterministic one…
• Parsing can be seen as a search problem
  – Can you find a derivation from the start symbol to the input word?
  – Easy (but very expensive) to solve with backtracking
• The Cocke-Younger-Kasami (CYK) parser can parse any context-free language, but has complexity O(n³)
  – Imagine a program with hundreds of thousands of lines of code.
• We want efficient parsers
  – Linear in input size
  – Deterministic pushdown automata
  – We will sacrifice generality for efficiency
23
“Brute-force” Parsing
Input: x := z ; y := x + z

S ➞ S ; S
S ➞ id := E
E ➞ id | E + E | E * E | ( E )

id := id ; id := id + id

Applying E ➞ id at every possible position yields many candidates:
  id := E ; id := id + id
  id := id ; id := E + id
  …

(This is not a parse tree… it is a search for the parse tree by exhaustively applying all rules.)
24
Efficient Parsers
• Top-down (predictive)
  – Construct the leftmost derivation
  – Apply rules “from left to right”
  – Predict which rule to apply based on the current nonterminal and the next token
• Bottom-up (shift-reduce)
  – Construct the rightmost derivation
  – Apply rules “from right to left”
  – Reduce a right-hand side of a production to its nonterminal
25
Efficient Parsers
• Top-down (predictive parsing)
• Bottom-up (shift-reduce)

[Diagram: the input is split into an already-read prefix and a to-be-read suffix.]
26
Top-down Parsing
• Given a grammar G = (V,T,P,S) and a word w
• Goal: derive w using G
• Idea
  – Apply a production to the leftmost nonterminal
  – Pick the production rule based on the next input token
• General grammar
  – More than one option for choosing the next production based on a token
• Restricted grammars (LL)
  – Know exactly which single rule to apply
  – May require some lookahead to decide
27
An Easily Parse-able Grammar

E ➞ LIT | ( E OP E ) | not E
LIT ➞ true | false
OP ➞ and | or | xor

Input: not ( not true or false )

E ⇒ not E
  ⇒ not ( E OP E )
  ⇒ not ( not E OP E )
  ⇒ not ( not LIT OP E )
  ⇒ not ( not true OP E )
  ⇒ not ( not true or E )
  ⇒ not ( not true or LIT )
  ⇒ not ( not true or false )

The production to apply is known from the next input token: at any stage, looking at the current variable and the next input token, the rule can be easily determined.
29
Recursive Descent Parsing
• Define a function for every nonterminal
• Every function simulates the derivation of the variable it represents:
  – Find the applicable production rule
  – A terminal is checked for a match with the next input token
  – A nonterminal is handled by (recursively) calling its function
• If there are several applicable productions for a nonterminal, use lookahead
30
Matching tokens
• The variable current holds the current input token

void match(token t) {
  if (current == t)
    current = next_token();
  else
    error();
}
31
Functions for nonterminals

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor

void E() {
  if (current ∈ {TRUE, FALSE}) {    // E → LIT
    LIT();
  } else if (current == LPAREN) {   // E → ( E OP E )
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {      // E → not E
    match(NOT); E();
  } else
    error();
}
32
functions for nonterminals
void LIT() {
  if (current == TRUE)
    match(TRUE);
  else if (current == FALSE)
    match(FALSE);
  else
    error();
}

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor
33
functions for nonterminals
void OP() {
  if (current == AND)
    match(AND);
  else if (current == OR)
    match(OR);
  else if (current == XOR)
    match(XOR);
  else
    error();
}

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor
34
Overall: Functions for Grammar
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

void E() {
  if (current ∈ {TRUE, FALSE}) {
    LIT();
  } else if (current == LPAREN) {
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {
    match(NOT); E();
  } else
    error();
}

void LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error();
}

void OP() {
  if (current == AND) match(AND);
  else if (current == OR) match(OR);
  else if (current == XOR) match(XOR);
  else error();
}
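The functions above can be transcribed into a small runnable recognizer. The following is a minimal Python sketch, not part of the slides: token handling (a whitespace-split token list with a "$" end marker) is an assumption of this sketch.

```python
# Recursive-descent recognizer for the Boolean-expression grammar:
#   E -> LIT | ( E OP E ) | not E ;  LIT -> true | false ;  OP -> and | or | xor
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, text):
        # Assumed tokenization: whitespace-separated tokens, "$" marks the end.
        self.tokens = text.split() + ["$"]
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, t):
        # The match() of the slides: consume the token or report an error.
        if self.current == t:
            self.pos += 1
        else:
            raise ParseError("expected %s, got %s" % (t, self.current))

    def E(self):
        if self.current in ("true", "false"):   # E -> LIT
            self.LIT()
        elif self.current == "(":               # E -> ( E OP E )
            self.match("("); self.E(); self.OP(); self.E(); self.match(")")
        elif self.current == "not":             # E -> not E
            self.match("not"); self.E()
        else:
            raise ParseError("unexpected " + self.current)

    def LIT(self):
        if self.current in ("true", "false"):
            self.match(self.current)
        else:
            raise ParseError("expected literal")

    def OP(self):
        if self.current in ("and", "or", "xor"):
            self.match(self.current)
        else:
            raise ParseError("expected operator")

def accepts(text):
    p = Parser(text)
    try:
        p.E()
        return p.current == "$"   # accept only if all input was consumed
    except ParseError:
        return False
```

For example, accepts("not ( not true or false )") succeeds, while accepts("not or") fails.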
35
Adding semantic actions
• An action to perform on each production rule can be added simply by executing it in the corresponding branch of the function.
• For example, we can build the parse tree:
  – Every function returns an object of type Node
  – Every Node maintains a list of children
  – Function calls add new children
36
Building the parse tree

Node E() {
  result = new Node();
  result.name = "E";
  if (current ∈ {TRUE, FALSE}) {            // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {           // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {              // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else
    error();
  return result;
}
37
Getting Back to the Example
• Input = “( not true and false )”;
  Node treeRoot = E();

[Resulting tree: root E with children '(', E, OP, E, ')'; the first inner E derives not LIT (true), OP derives and, and the second inner E derives LIT (false).]
38
Recursive Descent
• How do you pick the right A-production?
• Generally – try them all and use backtracking (costly).
• In our case – use lookahead.

In its basic form, each variable has a procedure that looks like:

void A() {
  choose an A-production, A -> X1 X2 … Xk;
  for (i = 1; i <= k; i++) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi == current)
      advance input;
    else
      report error;
  }
}
39
Recursive Descent: a problem
• With lookahead 1, the function for indexed_elem will never be tried…
  – What happens for input of the form ID [ expr ]?

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]
40
Recursive Descent: Another Problem
S ➞ A a b
A ➞ a | ε

Bool S() {
  return A() && match(token('a')) && match(token('b'));
}
Bool A() {
  if (current == 'a')
    return match(token('a'));
  else
    return true;
}

What happens for input “ab”? What happens if you flip the order of the alternatives and try “aab”?
41
Recursive descent: a third problem
E ➞ E – term | term

Bool E() {
  return E() && match(token('–')) && term()
      || ID();
}

What happens with this procedure? Recursive descent parsers cannot handle left-recursive grammars.
42
3 Bad Examples for Recursive Descent
Can we make it work?

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

S ➞ A a b
A ➞ a | ε

E ➞ E – term | term
43
The “FIRST” Sets
• To formalize the property (of a grammar) that we can determine a rule using a single lookahead, we define the FIRST sets.
• For every production rule A ➞ 𝞪
  – FIRST(𝞪) = all terminals that 𝞪 can start with
  – i.e., every token that can appear first under some derivation from 𝞪
• No intersection between the FIRST sets of a nonterminal’s alternatives => we can pick a single rule
• In our Boolean expressions example
  – FIRST(LIT) = { true, false }
  – FIRST( ( E OP E ) ) = { ‘(‘ }
  – FIRST( not E ) = { not }

E ➞ LIT | (E OP E) | not E
LIT ➞ true | false
OP ➞ and | or | xor
44
The “FIRST” Sets
• No intersection between FIRST sets => can pick a single rule
• If the FIRST sets intersect, a longer lookahead may be needed
  – LL(k) = the class of grammars in which the production rule can be determined using a lookahead of k tokens
  – LL(1) is an important and useful class
45
The FOLLOW Sets
• FIRST is not enough when variables can be nullified.
• Consider: S ➞ AB | c ; A ➞ a | ε ; B ➞ b
• We need to know what can come afterwards to select the right production.
• For any non-terminal A
  – FOLLOW(A) = the set of tokens that can immediately follow A in some sentential form
• We can select the rule N ➞ 𝞪 with lookahead “b” if
  – b ∈ FIRST(𝞪), or
  – 𝞪 may be nullified and b ∈ FOLLOW(N).
46
LL(k) Grammars
• A grammar is in the class LL(k) when it can be derived via:
  – Top-down derivation
  – Scanning the input from left to right (L)
  – Producing the leftmost derivation (L)
  – With a lookahead of k tokens (k)
• A language is said to be LL(k) when it has an LL(k) grammar
47
Back to our 1st example
• FIRST(ID) = { ID }
• FIRST(indexed_elem) = { ID }
• FIRST/FIRST conflict
• This grammar is not in LL(1). Can we “fix” it?

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]
48
Left factoring
• Rewrite into an equivalent grammar that is in LL(1)

term ➞ ID | indexed_elem
indexed_elem ➞ ID [ expr ]

becomes

term ➞ ID after_ID
after_ID ➞ [ expr ] | ε

Intuition: just like factoring x*y + x*z into x*(y+z)
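This transformation can be mechanized for alternatives that share the same leading symbol. A toy sketch, not from the slides; the grammar encoding and the fresh-name scheme (appending a prime) are assumptions:

```python
# One step of left factoring: if several alternatives of a nonterminal share
# the same first symbol, pull it out and move the differing tails into a
# fresh nonterminal. The empty tuple () stands for the empty alternative (ε).
def left_factor(grammar, nt):
    byhead = {}
    for rhs in grammar[nt]:
        byhead.setdefault(rhs[:1], []).append(rhs)
    new_alts, out = [], dict(grammar)
    for head, group in byhead.items():
        if len(group) == 1 or head == ():
            new_alts.extend(group)            # nothing to factor here
        else:
            fresh = nt + "'"                  # assumed fresh-name scheme
            out[fresh] = [rhs[1:] for rhs in group]
            new_alts.append(head + (fresh,))
    out[nt] = new_alts
    return out
```

Applied to the slide's example after substituting indexed_elem (term ➞ ID | ID [ expr ]), it produces term ➞ ID term', term' ➞ ε | [ expr ].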
49
Left factoring – another example
S ➞ if E then S else S | if E then S | T

becomes

S ➞ if E then S S’ | T
S’ ➞ else S | ε
50
Back to our 2nd example
S ➞ A a b
A ➞ a | ε

• Which rule should we select for A with ‘a’ in the lookahead: (1) A ➞ a or (2) A ➞ ε?
• (1) ‘a’ ∈ FIRST(a) = { ‘a’ } (and a cannot be nullified).
• (2) FIRST(ε) = ∅, but ε can (must) be nullified, and ‘a’ ∈ FOLLOW(A) = { ‘a’ }.
• FIRST/FOLLOW conflict.
• The grammar is not in LL(1).
51
An Equivalent Grammar via Substitution
S ➞ A a b
A ➞ a | ε

Substituting A in S gives:

S ➞ a a b | a b

Left factoring then gives:

S ➞ a after_a
after_a ➞ a b | b
52
So Far
• We have tools to determine whether a grammar is in LL(1):
  – The FIRST and FOLLOW sets.
  – In tutorials: algorithms for computing and using them.
• We have some techniques for modifying a grammar into an equivalent LL(1) grammar:
  – Left factoring,
  – Substitution.
• Now let’s look at the 3rd example and present one more such technique.
53
Back to our 3rd example
E ➞ E – term | term

• Left recursion cannot be handled with a bounded lookahead.
• What can we do?
• Any grammar with left recursion has an equivalent grammar with no left recursion.
54
Left Recursion Elimination
G1: N ➞ Nα | β
G2: N ➞ βN’ ; N’ ➞ αN’ | ε

• L(G1) = β, βα, βαα, βααα, …
• L(G2) = the same

For our 3rd example:
E ➞ E – term | term
becomes
E ➞ term TE
TE ➞ – term TE | ε
Left-Recursion Elimination

Eliminating immediate recursion: we replace the rules
• A → Aα1 | Aα2 | ··· | Aαn | β1 | β2 | ··· | βn
with the rules
• A → β1A’ | β2A’ | ··· | βnA’
• A’ → α1A’ | α2A’ | ··· | αnA’ | ε
Note that the method does not work if some αi is empty, and it may create indirect left recursion if some βi is empty:
• If αi is empty, a left recursion of A’ is created.
• If βi is empty, indirect left recursion is possible when some αj starts with A: we get A → A’… and also A’ → A….
Left-Recursion Elimination (continued)

Indirect recursion must also be handled. For example:
• S → Aa | b
• A → Ac | Sd | ε
The algorithm for this case is slightly more involved.
An Algorithm for Eliminating (Direct and Indirect) Left Recursion from a Grammar

• Input: a grammar G, possibly with left recursion, with no cycles and no ε-rules.
• Output: an equivalent grammar with no left recursion.
• An example of an ε-rule: A → ε.
• An example of a cycle: A → B; B → A.
• ε-rules and cycles can be eliminated from a grammar (automatically).
• The idea of the algorithm: order the variables in some order A1, A2, …, An. Go over the variables in order, and for each Ai ensure that every rule of the form Ai → Ajβ has j > i.
• Why is this enough?
An Algorithm for Left-Recursion Elimination
• Input: grammar G, possibly left-recursive, with no cycles and no ε-productions.
• Output: an equivalent grammar with no left recursion.
• Method: arrange the nonterminals in some order A1, A2, …, An

for i := 1 to n do begin
  for s := 1 to i-1 do begin
    replace each production of the form Ai → Asβ
      by the productions Ai → d1β | d2β | … | dkβ,
      where As → d1 | d2 | … | dk are all the current As-productions;
  end
  eliminate immediate left recursion among the Ai-productions
end
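The immediate-elimination step used inside this algorithm (N ➞ Nα | β becomes N ➞ βN’, N’ ➞ αN’ | ε) is easy to code. A sketch, not from the slides; the grammar encoding and the fresh-name scheme are assumptions:

```python
# Eliminate immediate left recursion for one nonterminal.
# Rules  A -> A a1 | ... | A an | b1 | ... | bm   become
#        A -> b1 A' | ... | bm A'
#        A' -> a1 A' | ... | an A' | ε   (ε encoded as the empty tuple)
def eliminate_immediate(grammar, a):
    rec  = [rhs[1:] for rhs in grammar[a] if rhs[:1] == (a,)]   # the alphas
    base = [rhs for rhs in grammar[a] if rhs[:1] != (a,)]       # the betas
    if not rec:
        return grammar                  # no immediate left recursion: no change
    out = dict(grammar)
    fresh = a + "'"                     # assumed fresh-name scheme
    out[a] = [rhs + (fresh,) for rhs in base]
    out[fresh] = [rhs + (fresh,) for rhs in rec] + [()]
    return out
```

On the running example E ➞ E – term | term this yields E ➞ term E’ and E’ ➞ – term E’ | ε, matching the transformation shown above.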
Analysis of the Algorithm

• We show that when the algorithm ends, every derivation rule of the form Ak → Atβ satisfies t > k.
• Invariant 1: when the inner loop finishes for some s (within iteration i of the outer loop), all derivation rules of Ai begin with terminals or with variables Aj for which j > s.
• Invariant 2: when we finish handling variable Ai, all of its derivation rules begin with variables Aj for which j > i, or with terminals.
• Both invariants are proved together by induction on i and s.
• Conclusion: when the algorithm ends there is no left recursion among the original variables (direct or indirect). This follows from invariant 2.
• As for the new variables, they always appear rightmost, and therefore will never be involved in left recursion.
60
LL(k) Parsers
• Recursive descent
  – Manual construction
  – Uses recursion
• Wanted
  – A parser that can be generated automatically
  – Does not use recursion
61
LL(k) parsing with pushdown automata
• The pushdown automaton uses
  – A stack
  – The input stream
  – A transition table: nonterminals × tokens → production rules
    • The entry indexed by nonterminal N and token t contains the rule of N that must be used when the current input starts with t
• The initial state:
  – The input stream holds the input ($ marks its end).
  – The stack starts with “S$” for the initial variable S.
62
LL(k) parsing with pushdown automata
• Two possible moves
  – Prediction: when the top of the stack is a nonterminal N and the next token is t, pop N and look up table[N,t]. If table[N,t] is not empty, push the right-hand side of the rule onto the prediction stack; otherwise – syntax error.
  – Match: when the top of the prediction stack is a terminal T and the next token is t: if t == T, pop T and consume t; if t ≠ T, syntax error.
• Parsing terminates when the prediction stack is empty. If the input is empty at that point, success; otherwise, syntax error.
Stack During the Run:

Stack (top on left): if ( E ) then Stmt else Stmt ; Stmts ; } $
Remaining input: if ( id < id ) then id = id + num else break; id = id * id; …
64
Example transition table
Rules:
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

        (    )    not   true  false  and   or    xor   $
E       2         3     1     1
LIT                     4     5
OP                                   6     7     8

Rows are nonterminals, columns are input tokens; each entry says which rule should be used.
65
Simple Example
Grammar: A ➞ aAb | c
Input: aacbb$

Table:    a          b    c
A         A ➞ aAb         A ➞ c

Input suffix | Stack content | Move
aacbb$       | A$            | predict(A,a) = A ➞ aAb
aacbb$       | aAb$          | match(a,a)
acbb$        | Ab$           | predict(A,a) = A ➞ aAb
acbb$        | aAbb$         | match(a,a)
cbb$         | Abb$          | predict(A,c) = A ➞ c
cbb$         | cbb$          | match(c,c)
bb$          | bb$           | match(b,b)
b$           | b$            | match(b,b)
$            | $             | match($,$) – success

(Stack top on left)
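The predict/match loop behind this trace fits in a few lines. A sketch, not from the slides, for the A ➞ aAb | c grammar; the table here is written out by hand rather than derived from FIRST/FOLLOW:

```python
# Table-driven LL(1) parsing: a stack, an input stream, and a transition table.
TABLE = {                        # (nonterminal, lookahead) -> right-hand side
    ("A", "a"): ("a", "A", "b"),
    ("A", "c"): ("c",),
}
NONTERMINALS = {"A"}

def ll1_parse(word, start="A"):
    tokens = list(word) + ["$"]
    stack = [start, "$"]                 # stack top on the left (index 0)
    pos = 0
    while stack:
        top = stack.pop(0)
        if top in NONTERMINALS:          # prediction move
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False             # empty table entry: syntax error
            stack[0:0] = list(rhs)       # push the right-hand side
        elif top == tokens[pos]:         # match move (including $ against $)
            pos += 1
        else:
            return False                 # terminal mismatch: syntax error
    return pos == len(tokens)            # success iff all input was consumed
```

ll1_parse("aacbb") reproduces the successful trace above; ll1_parse("abcbb") hits the empty table entry predict(A,b), as on the "bad word" slide.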
66
The Transition Table
• Constructing the transition table is not hard.
  – It builds on FIRST and FOLLOW.
• You will construct FIRST, FOLLOW, and the table in the tutorials.
67
Simple Example on a Bad Word
Grammar: A ➞ aAb | c
Input: abcbb$

Table:    a          b    c
A         A ➞ aAb         A ➞ c

Input suffix | Stack content | Move
abcbb$       | A$            | predict(A,a) = A ➞ aAb
abcbb$       | aAb$          | match(a,a)
bcbb$        | Ab$           | predict(A,b) = ERROR
68
Error Handling
• Types of errors:
  – Lexical errors (typos)
  – Syntax errors (e.g., imbalanced parentheses)
  – Semantic errors (e.g., type mismatch)
  – Logical errors (an infinite loop, but also use of ‘=’ instead of ‘==’)
• Requirements:
  – Report the error clearly.
  – Recover and continue, so that more errors can be discovered.
  – Be reasonably efficient.
69
Error Handling and Recovery

x = a * (p+q * ( -b * (r-s);

• Where should we report the error?
• The valid-prefix property
• Recovery is tricky
  – Heuristics for dropping tokens, skipping to a semicolon, etc.
70
Error Handling in LL Parsers
S ➞ a c | b S
Input: c$

Table:    a          b
S         S ➞ a c    S ➞ b S

Input suffix | Stack content | Move
c$           | S$            | predict(S,c) = ERROR

• Now what?
  – Predict bS anyway: “missing token b inserted in line XXX”
71
Error Handling in LL Parsers
S ➞ a c | b S

Table:    a          b
S         S ➞ a c    S ➞ b S

Input suffix | Stack content | Move
bc$          | S$            | predict(S,b) = S ➞ b S
bc$          | bS$           | match(b,b)
c$           | S$            | Looks familiar?

• Result: an infinite loop
72
Error Handling
• Requires more systematic treatment
• Some examples:
  – Panic mode (or the acceptable-set method): drop tokens until reaching a synchronizing token, like a semicolon, a right parenthesis, the end of file, etc.
  – Phrase-level recovery: attempt local changes: replace “,” with “;”, eliminate or add a “;”, etc.
  – Error productions: anticipate errors and handle them automatically by adding them to the grammar.
  – Global correction: find the minimum modification to the program that will make it derivable in the grammar.
    • Not a practical solution…
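Panic mode is the simplest of these to implement: on a syntax error, record it, discard tokens until a synchronizing token appears, and resume. A sketch, not from the slides; the token encoding and the choice of sync set are assumptions:

```python
# Panic-mode recovery: skip ahead to a synchronizing token (';' or the end
# marker '$') so that parsing can resume and later errors are still found.
SYNC = {";", "$"}

def recover(tokens, pos, errors, message):
    errors.append("%s at token %d" % (message, pos))
    while tokens[pos] not in SYNC:   # drop tokens up to the sync point
        pos += 1
    if tokens[pos] == ";":
        pos += 1                     # consume the synchronizing ';'
    return pos
```

A parser would call this from its error branch instead of aborting; parsing then restarts at the statement after the ';'.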
73
Summary
• Lexical analysis: tokens.
• Parsing: understand the program structure.
• Context-free grammars.
• Top-down or bottom-up.
• Recursive descent: recursion, a function for each variable.
• General grammars are hard to parse.
• LL(k) grammars (with small k’s): efficient.
• Use pushdown automata.
• Non-LL(k) grammars may sometimes be “fixed”:
  – left-recursion elimination, left factoring, and substitution.
74
Coming up next time
• Bottom-Up Parsing.