
CSCI 3130: Automata theory and formal languages

Andrej Bogdanov

http://www.cse.cuhk.edu.hk/~andrejb/csc3130

The Chinese University of Hong Kong

Ambiguity
Parsing algorithm for CFGs

Fall 2010


Ambiguity

• A grammar is ambiguous if some strings have more than one parse tree

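E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2

Example: the string 1+2*2 has two parse trees

Tree 1: E ⇒ E + E, then the right E ⇒ E * E; this reads as 1 + (2 * 2) = 5
Tree 2: E ⇒ E * E, then the left E ⇒ E + E; this reads as (1 + 2) * 2 = 6

[Figure: the two parse trees for 1+2*2, side by side]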
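The ambiguity can be checked mechanically. The following is a minimal sketch, not part of the original slides: the hypothetical helper `parse_values` tries every way of splitting a string according to E → E + E | E * E | (E) | N and returns one value per parse tree.

```python
from functools import lru_cache


def parse_values(s):
    """Return one value per parse tree of s under
    E -> E + E | E * E | (E) | N, where N is a nonempty string of 1s and 2s."""

    @lru_cache(maxsize=None)
    def values(i, j):
        sub = s[i:j]
        out = []
        # E -> N: a digit string has exactly one parse tree.
        if sub and all(c in "12" for c in sub):
            out.append(int(sub))
        # E -> (E)
        if len(sub) >= 3 and sub[0] == "(" and sub[-1] == ")":
            out.extend(values(i + 1, j - 1))
        # E -> E + E and E -> E * E: try every operator as the root.
        for k in range(i + 1, j - 1):
            if s[k] in "+*":
                for a in values(i, k):
                    for b in values(k + 1, j):
                        out.append(a + b if s[k] == "+" else a * b)
        return tuple(out)

    return sorted(values(0, len(s)))


print(parse_values("1+2*2"))
```

For 1+2*2 the list has two entries, 5 and 6, one for each parse tree described above.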


Disambiguation

• Sometimes we can rewrite the grammar to remove the ambiguity

E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2

In this grammar + and * have the same precedence!

Divide expression into terms and factors

[Figure: the expression 2 * (1 + 2 * 2) with its terms (T) and factors (F) marked]


Disambiguation

E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2

is rewritten as

E → T | E + T       An expression is a sum of one or more terms
T → F | T * F       Each term is a product of one or more factors
F → (E) | 1 | 2     Each factor is a parenthesized expression or a number


Parsing example

2 * (1 + 1 + 2 * 2) + 1

E → T | E + T
T → F | T * F
F → (E) | 1 | 2

[Figure: the parse tree of 2 * (1 + 1 + 2 * 2) + 1 under this grammar]
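The term/factor grammar maps directly onto a small evaluator. This is a minimal sketch under stated assumptions, not the parsing method of the lecture: the left-recursive rules E → E + T and T → T * F are handled with loops, and the names `evaluate`, `expression`, `term`, and `factor` are made up for the illustration.

```python
def evaluate(expr):
    """Evaluate expr according to E -> T | E + T, T -> F | T * F, F -> (E) | 1 | 2."""
    tokens = [c for c in expr if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():  # F -> (E) | 1 | 2
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1
            value = expression()
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1
            return value
        if tok in ("1", "2"):
            pos += 1
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    def term():  # T -> F | T * F, written as a loop over factors
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value

    def expression():  # E -> T | E + T, written as a loop over terms
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value

    result = expression()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result


print(evaluate("1+2*2"))
print(evaluate("2 * (1 + 1 + 2 * 2) + 1"))
```

Because a term is completed before it is added to the sum, 1+2*2 now has only the reading 1 + (2 * 2).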


Disambiguation

• Disambiguation is not always possible
– There exist inherently ambiguous languages
– There is no general procedure for disambiguation

• In programming languages, ambiguity comes from precedence rules, and we can remove it as in the example above

• In English, ambiguity is sometimes a problem:

He ate the cookies on the floor


Parsing

• Do we have a method for building a parse tree?

• Can we tell if the parse tree is unique?

S → 0S1 | 1S0S1 | T
T → S | ε

input: 00111


First attempt

• Maybe we can try all possible derivations:

S → 0S1 | 1S0S1 | T
T → S | ε

x = 00111

S ⇒ 0S1, 1S0S1, T
0S1 ⇒ 00S11, 01S0S11, 0T1
T ⇒ S
1S0S1 ⇒ 10S10S1, ...

when do we stop?


Problems

• How do we know when to stop?

S → 0S1 | 1S0S1 | T
T → S | ε

x = 00111

S ⇒ 0S1, 1S0S1
0S1 ⇒ 00S11, 01S0S11, 0T1
1S0S1 ⇒ 10S10S1, ...

when do we stop?


Problems

• Idea: Stop derivation when length exceeds |x|

• Not right because of ε-productions

• We want to eliminate ε-productions

S → 0S1 | 1S0S1 | T
T → S | ε

x = 01011

derivation:  S ⇒ 0S1 ⇒ 01S0S11 ⇒* 01S011 ⇒* 01011
length:      1    3     7          6         5


Problems

• Loops among the variables (S → T → S) might make us go forever

• We want to eliminate such loops

S → 0S1 | 1S0S1 | T
T → S | ε

x = 00111


Removal of ε-productions

• A variable N is nullable if there is a derivation

N ⇒* ε

• How to remove ε-productions

Find all nullable variables N
For every production of the form A → αNβ, add another production A → αβ
If N → ε is a production, remove it
If S is nullable, add the special production S → ε


Example

• Find the nullable variables

grammar:

S → ACD
A → a
B → ε
C → ED | ε
D → BC | b
E → b

nullable variables: B, C, D


Finding nullable variables

• To find nullable variables, we work backwards
– First, mark all variables A s.t. A → ε as nullable
– Then, as long as there are productions of the form

A → A1… Ak

where all of A1,…, Ak are marked as nullable, mark A as nullable
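The marking procedure is a fixed-point computation. A minimal sketch, assuming productions are given as (head, body) pairs with single-character symbols and the empty string standing for ε (the names `nullable_variables` and `GRAMMAR` are illustrative, not from the slides):

```python
def nullable_variables(productions):
    """Return the set of nullable variables.

    productions: list of (head, body) pairs; body is a string of
    single-character symbols, with "" standing for epsilon.
    """
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            # all() over an empty body is True, so A -> epsilon marks A at once;
            # terminals are never in the set, so they block the marking.
            if head not in nullable and all(sym in nullable for sym in body):
                nullable.add(head)
                changed = True
    return nullable


# The example grammar: S -> ACD, A -> a, B -> eps, C -> ED | eps, D -> BC | b, E -> b
GRAMMAR = [
    ("S", "ACD"), ("A", "a"), ("B", ""),
    ("C", "ED"), ("C", ""), ("D", "BC"), ("D", "b"), ("E", "b"),
]

print(sorted(nullable_variables(GRAMMAR)))
```

On the example grammar this marks B and C in the first pass, then D through D → BC, and stops because S → ACD contains the non-nullable A.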


Eliminating ε-productions

S → ACD
A → a
B → ε
C → ED | ε
D → BC | b
E → b

nullable variables: B, C, D

For every production of the form A → αNβ, add another production A → αβ

added: D → C, S → AD, D → B, D → ε, S → AC, S → A, C → E

If N → ε is a production, remove it

removed: B → ε, C → ε, D → ε
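The two rules can be applied in one pass by dropping every subset of nullable occurrences from each production body. A minimal sketch, assuming single-character symbols and "" for ε; the helper name `eliminate_epsilon` is hypothetical, and the nullable set is passed in rather than computed:

```python
from itertools import product


def eliminate_epsilon(productions, nullable, start="S"):
    """Return the set of (head, body) productions without epsilon-productions.

    productions: iterable of (head, body) pairs, "" standing for epsilon.
    nullable: the set of nullable variables, assumed already computed.
    """
    result = set()
    for head, body in productions:
        # A nullable symbol may be kept or dropped; any other symbol is kept.
        options = [(sym, "") if sym in nullable else (sym,) for sym in body]
        for choice in product(*options):
            new_body = "".join(choice)
            if new_body:  # skip epsilon bodies, i.e. remove N -> epsilon
                result.add((head, new_body))
    if start in nullable:
        result.add((start, ""))  # the special production S -> epsilon
    return result


GRAMMAR = [
    ("S", "ACD"), ("A", "a"), ("B", ""),
    ("C", "ED"), ("C", ""), ("D", "BC"), ("D", "b"), ("E", "b"),
]

for head, body in sorted(eliminate_epsilon(GRAMMAR, {"B", "C", "D"})):
    print(head, "->", body)
```

On the example grammar this yields S → ACD | AD | AC | A, A → a, C → ED | E, D → BC | B | C | b, E → b. B is left without productions, since its only rule was B → ε.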


Dealing with loops

• A unit production is a production of the form

A1 → A2

where A1 and A2 are both variables

• Example

grammar:

S → 0S1 | 1S0S1 | T
T → S | R | ε
R → 0SR

unit productions: S → T, T → S, T → R


Removal of unit productions

• If there is a cycle of unit productions

A1 → A2 → ... → Ak → A1

delete it and replace everything with A1

• Example

S → 0S1 | 1S0S1 | T
T → S | R | ε
R → 0SR

becomes

S → 0S1 | 1S0S1
S → R | ε
R → 0SR

T is replaced by S in the {S, T} cycle


Removal of unit productions

• For other unit productions, replace every chain

A1 → A2 → ... → Ak → α

by productions A1 → α, ..., Ak → α

• Example

S → R → 0SR is replaced by S → 0SR, R → 0SR

S → 0S1 | 1S0S1 | R | ε
R → 0SR

becomes

S → 0S1 | 1S0S1 | 0SR | ε
R → 0SR
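Unit productions can also be removed in a single step. The sketch below does not merge cycles first, as the slides do; it uses the unit-closure variant instead: for each variable A, collect every variable reachable from A through unit productions and give A all of their non-unit bodies. The function name and the dict-of-sets representation are assumptions made for the illustration.

```python
def remove_unit_productions(grammar):
    """Return an equivalent grammar without unit productions.

    grammar: dict mapping each variable to a set of bodies (strings);
    a body equal to a variable name is a unit production, "" is epsilon.
    """
    variables = set(grammar)
    result = {}
    for a in variables:
        # Variables reachable from a using unit productions only.
        reachable, stack = {a}, [a]
        while stack:
            v = stack.pop()
            for body in grammar[v]:
                if body in variables and body not in reachable:
                    reachable.add(body)
                    stack.append(body)
        # a inherits every non-unit body of every reachable variable.
        result[a] = {
            body
            for v in reachable
            for body in grammar[v]
            if body not in variables
        }
    return result


# The example grammar: S -> 0S1 | 1S0S1 | T, T -> S | R | eps, R -> 0SR
EXAMPLE = {
    "S": {"0S1", "1S0S1", "T"},
    "T": {"S", "R", ""},
    "R": {"0SR"},
}

cleaned = remove_unit_productions(EXAMPLE)
print(sorted(cleaned["S"]))
```

For S this gives the same productions as the example, S → 0S1 | 1S0S1 | 0SR | ε. Unlike the cycle-merging version, T is kept with a copy of the productions of S, but it no longer occurs in any body, so it is unreachable from S.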


Recap

• After eliminating ε-productions and unit productions, we know that every derivation

S ⇒* a1…ak where a1, …, ak are terminals

doesn’t shrink in length and doesn’t go into cycles

• Exception: S → ε
– We will not use this rule at all, except to check if ε ∈ L

• Note ε-productions must be eliminated before unit productions


Example: testing membership

S → 0S1 | 1S0S1 | T
T → S | ε

x = 00111

eliminate unit, ε-prod:

S → ε | 01 | 101 | 0S1 | 10S1 | 1S01 | 1S0S1

Derivations from S:

S ⇒ 01, 101
S ⇒ 10S1 ⇒ 10011, strings of length ≥ 6
S ⇒ 1S01 ⇒ 10101, strings of length ≥ 6
S ⇒ 1S0S1 ⇒ only strings of length ≥ 6
S ⇒ 0S1 ⇒ 0011, 01011, 00S11, strings of length ≥ 6
00S11 ⇒ only strings of length ≥ 6


Algorithm 1 for testing membership

• How to check if a string x ≠ ε is in L(G)

Eliminate all ε-productions and unit productions
Let X := S
While some new rule R can be applied to X
    Apply R to X
    If X = x, you have found a derivation for x
    If |X| > |x|, backtrack
If no more rules can be applied to X, x is not in L
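Algorithm 1 can be sketched as a depth-first search over sentential forms. The version below is an illustration under stated assumptions: the grammar is already free of ε-productions and unit productions, symbols are single characters, only the leftmost variable is expanded, and a `seen` set replaces explicit backtracking bookkeeping. The name `derives` is hypothetical.

```python
def derives(productions, start, x):
    """Return True if the nonempty string x can be derived from start.

    productions: dict mapping each variable to a list of bodies (strings).
    Assumes there are no epsilon-productions and no unit productions,
    so sentential forms never shrink and the search is finite.
    """
    seen = {start}
    stack = [start]
    while stack:
        form = stack.pop()
        if form == x:
            return True  # found a derivation for x
        # Position of the leftmost variable, if any.
        idx = next((i for i, c in enumerate(form) if c in productions), None)
        if idx is None:
            continue  # a terminal string different from x
        for body in productions[form[idx]]:
            new_form = form[:idx] + body + form[idx + 1:]
            # Forms longer than x can never shrink back: backtrack.
            if len(new_form) <= len(x) and new_form not in seen:
                seen.add(new_form)
                stack.append(new_form)
    return False  # no more rules can be applied


# Grammar of the membership example after eliminating unit and
# epsilon-productions (S -> epsilon is left out because x is nonempty).
G = {"S": ["01", "101", "0S1", "10S1", "1S01", "1S0S1"]}

print(derives(G, "S", "00111"))
print(derives(G, "S", "01011"))
```

For x = 00111 the search exhausts all forms of length at most 5 without reaching x, so 00111 is not in L, while 01011 is found through S ⇒ 0S1 ⇒ 01011.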


Practical limitations of Algorithm 1

• This method can be very slow if x is long

• There is a faster algorithm, but it requires that we do some more transformations on the grammar

G = CFG of the Java programming language
x = code for a 200-line Java program

algorithm might take about 10^200 steps!


Chomsky Normal Form

• A CFG is in Chomsky Normal Form if every production (except S → ε) is

A → BC or A → a

• Convert to Chomsky Normal Form:

A → BcDE

replace terminals with new variables:

A → BCDE
C → c

break up sequences with new variables:

A → BX1
X1 → CX2
X2 → DE
C → c

[Photo: Noam Chomsky]
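The two conversion steps can be sketched for a single production. This is an illustration under assumptions, not a full CNF converter: terminals are lowercase characters, the body has at least two symbols, and the fresh names `Tc`, `X1`, `X2` are assumed not to clash with existing variables (the slide calls the new variable for c simply C).

```python
def to_cnf_rules(head, body):
    """Convert one production head -> body (at least two symbols) into CNF.

    Returns a list of (head, symbols) pairs where symbols is a tuple.
    """
    rules = []
    symbols = []
    # Step 1: replace each terminal c with a new variable Tc and add Tc -> c.
    for sym in body:
        if sym.islower():
            var = "T" + sym
            if (var, (sym,)) not in rules:
                rules.append((var, (sym,)))
            symbols.append(var)
        else:
            symbols.append(sym)
    # Step 2: break up long sequences with new variables X1, X2, ...
    current, counter = head, 1
    while len(symbols) > 2:
        fresh = f"X{counter}"
        rules.append((current, (symbols[0], fresh)))
        current, symbols, counter = fresh, symbols[1:], counter + 1
    rules.append((current, tuple(symbols)))
    return rules


for rule_head, rule_body in to_cnf_rules("A", "BcDE"):
    print(rule_head, "->", " ".join(rule_body))
```

For A → BcDE this produces Tc → c, A → B X1, X1 → Tc X2, X2 → D E, which is the result of the conversion above with Tc in place of C.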


Algorithm 2 for testing membership

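S → AB | BC
A → BA | a
B → CC | b
C → AB | a

x = baaba

Idea: We generate each substring of x bottom up

length 5:  SAC
length 4:  –    SAC
length 3:  –    B    B
length 2:  SA   B    SC   SA
length 1:  B    AC   AC   B    AC
           b    a    a    b    a

(The entry in column i of a row lists the variables that derive the substring of that length starting at position i; – means no variable does.)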


Parse tree reconstruction

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

x = baaba

[Figure: the table of variables for every substring of x = baaba, with the derivations traced back from the top cell]

Tracing back the derivations, we obtain the parse tree


Cocke-Younger-Kasami algorithm

Input: Grammar G in CNF, string x = x1…xk

For cells in last row
    If there is a production A → xi
        Put A in table cell ii
For cells st in other rows
    If there is a production A → BC where B is in cell sj and C is in cell (j+1)t, for some s ≤ j < t
        Put A in cell st

table cells:

1k
…
12  23  …
11  22  …  kk
x1  x2  …  xk

Cell ij remembers all possible derivations of substring xi…xj
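The table-filling procedure can be sketched as follows. This is a minimal illustration, assuming the CNF grammar is given as two lookup tables (`unit_rules` for A → a and `pair_rules` for A → BC, both names made up here) and that cell (s, t) stores only the set of variables deriving xs…xt, without back-pointers for the parse tree.

```python
def cyk(unit_rules, pair_rules, x):
    """Fill the CYK table for the string x.

    unit_rules: dict terminal -> set of variables A with A -> terminal
    pair_rules: dict (B, C) -> set of variables A with A -> BC
    Returns a dict mapping (s, t), 1-indexed and inclusive, to the set of
    variables that derive the substring x_s ... x_t.
    """
    k = len(x)
    table = {}
    # Last row: substrings of length 1.
    for i in range(1, k + 1):
        table[(i, i)] = set(unit_rules.get(x[i - 1], ()))
    # Other rows: longer substrings, shortest first.
    for length in range(2, k + 1):
        for s in range(1, k - length + 2):
            t = s + length - 1
            cell = set()
            for j in range(s, t):  # split into x_s..x_j and x_(j+1)..x_t
                for b in table[(s, j)]:
                    for c in table[(j + 1, t)]:
                        cell |= pair_rules.get((b, c), set())
            table[(s, t)] = cell
    return table


# Grammar: S -> AB | BC, A -> BA | a, B -> CC | b, C -> AB | a
UNIT_RULES = {"a": {"A", "C"}, "b": {"B"}}
PAIR_RULES = {
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}

table = cyk(UNIT_RULES, PAIR_RULES, "baaba")
print("S" in table[(1, 5)])
```

The string is in the language exactly when the start variable S appears in cell (1, k). The three nested loops over s, t, and j give a running time cubic in the length of x, times the cost of looking up productions.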