Parsing

1

CIS 461Compiler Design and Construction

Fall 2012 slides derived from Tevfik Bultan, Keith Cooper, and

Linda Torczon

Lecture-Module #9

Parsing

2

Announcements

• Professor Christian Trefftz will substitute on Friday, October 5.• Read through Chapter 5.• Writing Assignment #5 is due Monday, October 8.• Programming Assignment #2 will be posted on the class webpage; it will be

due in approximately two weeks, tentatively October 17.– For this assignment you should work in teams of two. Please find a

teammate.– You’ll use the JJTree utility associated with JavaCC; start reading its

documentation.– I’ll explain it more starting Monday, October 8.– Start the project early; please don’t leave it to the last weekend! (;-)

• Midterm will be tentatively next Wednesday, October 10– in class– closed books, closed notes

3

Parsing Techniques

Top-down parsers (LL(1), recursive descent)

• Start at the root of the parse tree from the start symbol and grow toward leaves (similar to a derivation)

• Pick a production and try to match the input

• Bad “pick” may need to backtrack

• Some grammars are backtrack-free (predictive parsing)

Bottom-up parsers (LR(1), operator precedence)

• Start at the leaves and grow toward root

• We can think of the process as reducing the input string to the start symbol

• At each reduction step a particular substring matching the right-side of a production is replaced by the symbol on the left-side of the production

• Bottom-up parsers handle a large class of grammars

4

Eliminating Immediate Left Recursion

To remove left recursion, we can transform the grammar

Consider a grammar fragment of the form

A A |

where or are strings of terminal and nonterminal symbols

and neither nor start with A

We can rewrite this as

A RR R |

where R is a new non-terminal

This accepts the same language, but uses only right recursion

A

A

A

A

R

R

R

5

Left-Recursive and Right-Recursive Expression Grammar

1 S Expr2 Expr Expr + Term3 | Expr – Term4 | Term5 Term Term * Factor6 | Term / Factor7 | Factor8 Factor num9 | id

1 S Expr2 Expr Term Expr3 Expr + Term Expr4 | – Term Expr5 | 6 Term Factor Term7 Term * Factor Term8 | / Factor Term9 | 10 Factor num11 | id

6

Predictive Parsing

Basic idea

Given A , the parser should be able to choose between &

FIRST sets

For a string of grammar symbols , define FIRST() as the set of tokens that appear as the first symbol in some string that derives from

That is, x FIRST() iff * x , for some

The LL(1) Property

If A and A both appear in the grammar, we would like

FIRST() FIRST() =

This would allow the parser to make a correct choice with a lookahead of exactly one symbol !

(Pursuing this idea leads to LL(1) parser generators...)

7

Recursive Descent Parsing

Recursive-descent parsing

• A top-down parsing method

• The term descent refers to the direction in which the parse tree is traversed (or built).

• Use a set of mutually recursive procedures (one procedure for each nonterminal symbol)

– Start the parsing process by calling the procedure that corresponds to the start symbol

– Each production becomes one clause in procedure

• We consider a special type of recursive-descent parsing called predictive parsing

– Use a lookahead symbol to decide which production to use

8

Recursive Descent Parsing: Expression Grammar

void main() { lookahead=getNextToken(); S(); match(EOF); }

void S() { Expr(); } void Expr() { Term(); ExprPrime(); } void ExprPrime() { switch(lookahead) { case PLUS : match(PLUS); Term(); ExprPrime(); break; case MINUS : match(MINUS); Term();

ExprPrime(); break; default: return; }}

void Term() { Factor(); TermPrime(); } void TermPrime() { switch(lookahead) { case TIMES: match(TIMES); Factor(); TermPrime(); break; case DIV: match(DIV); Factor(); TermPrime(); break; default: return; }}

void Factor() { switch(lookahead) { case ID : match(ID); break; case NUMBER: match(NUMBER); break; default: error();}

int PLUS=1, MINUS=2, ...int lookahead;void match(int token) { if (lookahead==token) lookahead=getNextToken(); else error(); }

9

Recursive Descent Parsing: Another Grammar

1 S if E then S else S2 | begin S L3 | print E4 L end5 | ; S L6 E num = num

void S() { switch(lookahead) { case IF: match(IF); E(); match(THEN); S(); match(ELSE); S(); break; case BEGIN: matvh(BEGIN); S(); L(); break; case PRINT: match(PRINT); E(); break; default: error(); }}

void E() { match(NUM); match(EQ); match(NUM); }

void L() { switch(lookahead) { case END: match(END); break; case SEMI: match(SEMI); S(); L(); break; default: error(); }}

void main() { lookahead=getNextToken(); S(); match(EOF); }

10

Example Execution For Input: if 2=2 then print 5=5 else print 1=1

main: call S();

S1: find the production for (S, IF) : Sif E then S else S

S1: match(IF);

S1: call E();

E1: find the production for (E, NUM): Enum = num

E1: match(NUM); match(EQ); match(NUM);

E1: return from E1 to S1

S1: match(THEN);

S1:call S();

S2: find the production for (S, PRINT): Sprint E

S2: match(PRINT);

S2: call E();




S2: return from S2 to S1

S1: match(ELSE);

S1: call S();


S3: match(PRINT);

S3: call E();





S1: return from S1 to mainmain: match(EOF); return success;

11

Another Approach: Stack-Based Table-Driven Parsing

The parsing table

• A two dimensional array

M[A, a] gives a production

– A: a nonterminal symbol

– a: a terminal symbol

• What does it mean?

– If top of the stack is A and the lookahead symbol is a then we apply the production M[A, a]

IF BEGIN PRINT END SEMI NUM

S Sif E then S else S Sbegin S L Sprint E

L L end L ; S L

E Enum = num

12

Table-driven Parsers

A table-driven parser looks like

Parsing tables can be built automatically!

ScannerTable-driven

Parser

ParsingTable

ParserGenerator

sourcecode

grammar

IR

Stack

13

Table-Driven Predictive Parsing Algorithm

• Push the end-of-file symbol ($) and the start symbol onto the stack

• Consider the symbol X on the top of the stack and lookahead symbol a

– If X = a = $ announce successful parse and halt

– If X = a $ pop X off the stack and advance the input pointer to the next input symbol

– If X is a nonterminal, look at the production M[X, a]

• If there is no such production (M[X, a] = error), then call an error routine

• If M[X, a] is a production X Y1 Y2 ... Yk , then pop X and push Yk , Yk-1 , ..., Y1 onto the stack with Y1 on top

– If none of the cases above apply, then call an error routine

14

Table-Driven Predictive Parsing Algorithm

Push($); // $ is the end-of-file symbolPush(S); // S is the start symbol of the grammarlookahead = get_next_token();repeat X = top_of_stack(); if (X is a terminal or X == $) then if (X == lookahead) then pop(X); lookahead = get_next_token(); else error(); else // X is a non-terminal if ( M[X, lookahead] == X Y1 Y2 ... Yk) then pop(X); push(Yk); push(Yk-1); ... push(Y1); else error();until (X = $)

15

Recursive Descent Parser On: if 2=2 then print 5=5 else print 1=1

main: call S();

S1: find the production for (S, IF) : Sif E then S else S

S1: match(IF);

S1: call E();




S1: match(THEN);

S1:call S();


S2: match(PRINT);

S2: call E();





S1: match(ELSE);

S1: call S();


S3: match(PRINT);

S3: call E();





S1: return from S1 to mainmain: match(EOF); return success;

16

Table Driven Parser On: if 2=2 then print 5=5 else print 1=1$

Stack lookahead Parse-table lookup$S IF M[S,IF]: Sif E then S else S $S,ELSE,S,THEN,E,IF IF$S,ELSE,S,THEN,E NUM M[E,NUM]: Enum = num $S,ELSE,S,THEN,NUM,EQ,NUM NUM$S,ELSE,S,THEN,NUM,EQ EQ$S,ELSE,S,THEN,NUM NUM$S,ELSE,S,THEN THEN$S,ELSE,S PRINT M[S,PRINT]: Sprint E $S,ELSE,E,PRINT PRINT$S,ELSE,E NUM M[E,NUM]: Enum = num $S,ELSE,NUM,EQ,NUM NUM$S,ELSE,NUM,EQ EQ$S,ELSE,NUM NUM$S,ELSE ELSE$S PRINT M[S,PRINT]: Sprint E $E,PRINT PRINT$E NUM M[E,NUM]: Enum = num $NUM,EQ,NUM NUM$NUM,EQ EQ$NUM NUM$ $ report success!

17

How to Build Parse Tables? FIRST Sets

For a string of grammar symbols define FIRST() as

• The set of tokens that appear as the first symbol in some string that derives from

• If * , then is in FIRST()

To construct FIRST(X) for a grammar symbol X, apply the following rules until no more symbols can be added to FIRST(X)

• If X is a terminal FIRST(X) is {X}

• If X is a production then is in FIRST(X)

• If X is a nonterminal and X Y1Y2 ... Yk is a production then put every symbol in

FIRST(Y1) other than to FIRST(X)

• If X is a nonterminal and X Y1Y2 ... Yk is a production, then put terminal a in

FIRST(X) if a is in FIRST(Yi) and is in FIRST(Yj) for all 1 j i

• If X is a nonterminal and X Y1Y2 ... Yk is a production, then put in FIRST(X) if is in FIRST(Yi) for all 1 i k

18

Computing FIRST Sets for Strings of Symbols

To construct the FIRST set for any string of grammar symbols X1X2 ... Xn (given the FIRST sets for symbols X1 , X2 , ... Xn ) apply the following rules.

FIRST(X1X2 ... Xn ) contains:

– Any symbol in FIRST(X1) other than

– Any symbol in FIRST(Xi) other than , if is in FIRST(Xj) for all 1 j i

, if is in FIRST(Xj) for all 1 i n

19

FIRST Sets

1 S Expr2 Expr Term Expr3 Expr + Term Expr4 | - Term Expr5 | 6 Term Factor Term7 Term * Factor Term8 | / Factor Term9 | 10 Factor num11 | id

Symbol FIRSTS {num, id}Expr {num, id}

Expr { , +, - }Term {num, id}

Term { , *, / }Factor {num, id}num {num}id {id}+ {+}- {-}* {*}/ {/}

20

How to build Parse Tables? FOLLOW Sets

For a non-terminal symbol A, define FOLLOW(A) as:

The set of terminal symbols that can appear immediately to the right of A in some sentential form

To construct FOLLOW(A) for a non-terminal symbol A apply the following rules until no more symbols can be added to FOLLOW(A)

• Place $ in FOLLOW(S) ($ is the end-of-file symbol, S is the start symbol)

• If there is a production A B , then everything in FIRST() except is placed in FOLLOW(B)

• If there is a production A B, then everything in FOLLOW(A) is placed in FOLLOW(B)

• If there is a production A B , and is in FIRST() then everything in FOLLOW(A) is placed in FOLLOW(B)

21

FOLLOW Sets


Symbol FOLLOWS { $ } Expr { $ }Expr { $ }Term { $, +, - }

Term { $, +, - }Factor { $, +, -, *, / }

22

LL(1) Parse Table Construction

• For all productions A , perform the following steps:

– For each terminal symbol a in FIRST(), add A to M[A, a]

– If is in FIRST(), then add A to M[A, b] for each terminal symbol b in FOLLOW(A) and add A to M[A, $] if $ is in FOLLOW(A)

• Set all the undefined entries in M to error

23


id num + - * / $

S SE SE E ET E ET E

E’ E + T E E - T E E

T TF T TF T

T’ T’ T’ T * F T T / F T T’

F F id F num

Grammar:

LL(1) Parse table:

24

LL(1) grammars

Left-to-right scan of the input, Leftmost derivation, 1-token lookahead

Two alternative definitions of LL(1) grammars:

• A grammar G is LL(1) if there are no multiple entries in its LL(1) parse table

• A grammar G is LL(1) if for each set of its productions A 1 | 2 | ... | n

1. FIRST(1), FIRST(2), ..., FIRST(n), are all pairwise disjoint

2. If i * , then FIRST (j) FOLLOW (A) = for all 1 i

n, i j

Documents

Parsing