Context-Free Grammars and Languages

Seungjin Choi

Department of Computer Science and Engineering
Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

1 / 44

Outline

• Context-free grammars

• Parse trees

2 / 44

Palindrome Example

Consider the language of palindromes, L = {w ∈ {0, 1}∗ | w = w^R}, where a palindrome is a string that reads the same forward and backward (e.g., otto).

Question: Is there a recursive definition of L?

Answer: Yes. A palindrome of length at least 2 must begin and end with the same symbol, and what lies between those two symbols is again a palindrome. This leads to:

• Basis: ε, 0, and 1 are palindromes.

• Induction: If w is a palindrome, so are 0w0 and 1w1.

No string of 0's and 1's is a palindrome unless it follows from this basis and induction rule.
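The basis and induction rule can be read directly as a membership test. The following is a minimal sketch, not part of the original slides; the function name `is_palindrome` is chosen here for illustration.

```python
def is_palindrome(w: str) -> bool:
    """Decide membership in L using only the basis and the induction rule."""
    # Basis: the empty string, 0, and 1 are palindromes.
    if w in ("", "0", "1"):
        return True
    # Induction: w = 0x0 or w = 1x1, where x is itself a palindrome.
    if len(w) >= 2 and w[0] == w[-1] and w[0] in ("0", "1"):
        return is_palindrome(w[1:-1])
    # Nothing else is a palindrome over {0, 1}.
    return False


print(is_palindrome("0110"))  # True
print(is_palindrome("011"))   # False
```

Each recursive call strips one matching pair of outer symbols, which mirrors undoing one application of the induction rule.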

3 / 44

Grammar: Palindrome Example

Gpal = ({S}, {0, 1}, S, P), where P consists of the productions

S → ε,

S → 0,

S → 1,

S → 0S0,

S → 1S1.

4 / 44

Context-Free Grammars

Definition. A grammar G = (V, T, S, P) is said to be context-free if all productions in P are of the form

A → x,

where the head A ∈ V and the body x ∈ (V ∪ T)∗.

• No restrictions on the right-hand side of production rules.

• A restriction on the left-hand side of production rules: only a single variable is allowed.

5 / 44

Example: Consider the grammar G = (V ,T ,S ,P) with productions

S → aSa|bSb|ε.

A typical derivation in this grammar is

S ⇒ aSa⇒ aaSaa⇒ aabSbaa⇒ aabbaa.

This makes it clear that

L(G) = {ww^R | w ∈ {a, b}∗}.

This language is not regular, but it is context-free.
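As a sanity check on the claim L(G) = {ww^R}, one can enumerate every sentence of G up to a length bound by expanding sentential forms breadth-first. This is only an illustrative sketch: the bound `max_len` and the helper name `language_up_to` are assumptions made here, and the empty body stands for ε.

```python
from collections import deque

PRODUCTIONS = {"S": ["aSa", "bSb", ""]}  # S -> aSa | bSb | ε


def language_up_to(max_len: int) -> set[str]:
    """Collect every terminal string of length <= max_len derivable from S."""
    found, seen = set(), {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        i = form.find("S")
        if i < 0:  # no variable left, so the form is a sentence
            found.add(form)
            continue
        for body in PRODUCTIONS["S"]:
            nxt = form[:i] + body + form[i + 1:]
            # Terminals never disappear, so forms with too many can be pruned.
            if len(nxt.replace("S", "")) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return found


print(sorted(language_up_to(4)))
# ['', 'aa', 'aaaa', 'abba', 'baab', 'bb', 'bbbb']
```

Every string produced has even length and equals its own reversal, as expected of a string of the form ww^R.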

6 / 44

Derivations Using Grammars

Apply the productions of a CFG to infer that certain strings are in the language. There are two approaches to this inference:

• Recursive inference: Use productions from body to head.

• Derivation: Use productions from head to body.
  • Leftmost derivation
  • Rightmost derivation

See Fig. 5.2 and 5.3 for recursive inference and see Ex. 5.6 for derivation (pp. 178-179).

7 / 44

Consider the following CFG G = ({E, I}, {+, ∗, (, ), a, b, 0, 1}, E, P) with productions

1. E → I ,

2. E → E + E ,

3. E → E ∗ E,

4. E → (E),

5. I → a,

6. I → b,

7. I → Ia,

8. I → Ib,

9. I → I0,

10. I → I1.
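Recursive inference (from body to head) on this grammar can be sketched as a fixed-point computation: start with nothing known, and keep adding a string to the language of a head whenever a body can be assembled from terminals and strings already inferred. The length bound `max_len` is an assumption added here so that the sets stay finite, and `*` is written in ASCII.

```python
from itertools import product

# Bodies are tuples of symbols; symbols that are keys of GRAMMAR are variables.
GRAMMAR = {
    "E": [("I",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")],
    "I": [("a",), ("b",), ("I", "a"), ("I", "b"), ("I", "0"), ("I", "1")],
}


def infer(max_len: int) -> dict[str, set[str]]:
    """Repeat body-to-head inference until no new string of length <= max_len appears."""
    known = {var: set() for var in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                # A terminal contributes itself; a variable contributes a
                # snapshot of the strings already inferred for it.
                choices = [tuple(known[x]) if x in GRAMMAR else (x,) for x in body]
                for parts in product(*choices):
                    w = "".join(parts)
                    if len(w) <= max_len and w not in known[head]:
                        known[head].add(w)
                        changed = True
    return known


langs = infer(3)
print("a+b" in langs["E"], "b01" in langs["I"], "0a" in langs["I"])  # True True False
```

The bound is kept small on purpose: the candidate combinations grow quickly with `max_len`, so this is a demonstration of the idea rather than a practical recognizer.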

8 / 44

Context-Free Languages

Definition. A language L is said to be context-free iff there is a context-free grammar G such that L = L(G), where

L(G) = {w ∈ T∗ | S ⇒*_G w}.

9 / 44

The Language of Gpal

Theorem. L(Gpal) = {w ∈ {0, 1}∗ | w = w^R}. That is, for w ∈ {0, 1}∗, w ∈ L(Gpal) iff w = w^R.

Proof ("if" part). Suppose w = w^R. We prove by induction on |w| that w ∈ L(Gpal).

Basis: |w| = 0 or |w| = 1. Then w is ε, 0, or 1. Since S → ε | 0 | 1 are productions, we conclude that S ⇒*_G w in all base cases.

Induction: Suppose |w| ≥ 2. Since w = w^R, we have w = 0x0 or w = 1x1, where x = x^R. If w = 0x0, then |x| < |w|, so the IH gives S ⇒* x. Then S ⇒ 0S0 ⇒* 0x0 = w. The case w = 1x1 is similar.

10 / 44

Proof ("only if" part). We assume that w ∈ L(Gpal) and must show that w = w^R.

Since w ∈ L(Gpal), we have S ⇒* w. We prove the claim by induction on the number of steps in this derivation.

Basis: The derivation S ⇒* w takes one step. Then w must be ε, 0, or 1, all of which are palindromes.

Induction: The IH is that x = x^R whenever S ⇒* x in n steps. Suppose the derivation of w takes n + 1 steps, with n ≥ 1. Its first step must use S → 0S0 or S → 1S1, so we must have

S ⇒ 0S0 ⇒* 0x0 = w   or   S ⇒ 1S1 ⇒* 1x1 = w,

where S ⇒* x in n steps. By the IH, x = x^R, and hence w = w^R.

11 / 44

Example: Show that L = {a^n b^m | n ≠ m} is a CFL.

Solution. Note that the CFG G = ({S}, {a, b}, S, P) with productions

S → aSb | ε,

leads to

L(G) = {a^n b^n | n ≥ 0}.

To take care of the case n > m, we first generate a string with an equal number of a's and b's, then add extra a's on the left, leading to

S → AS1,

S1 → aS1b | ε,

A → aA | a.

We use similar reasoning for the case n < m. Thus, the CFG for L is given by

S → AS1 | S1B,

S1 → aS1b | ε,

A → aA | a,

B → bB | b.
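The final grammar can be checked against the set description of L for short strings. The sketch below expands the leftmost variable of each sentential form and compares the result with {a^n b^m | n ≠ m} up to a length bound. The bound and the function name `generate` are assumptions made for this illustration; the empty tuple stands for ε.

```python
GRAMMAR = {
    "S": [("A", "S1"), ("S1", "B")],
    "S1": [("a", "S1", "b"), ()],
    "A": [("a", "A"), ("a",)],
    "B": [("b", "B"), ("b",)],
}


def generate(max_len: int) -> set[str]:
    """All terminal strings of length <= max_len derivable from S."""
    found, seen, stack = set(), {("S",)}, [("S",)]
    while stack:
        form = stack.pop()
        # Position of the leftmost variable, if any.
        idx = next((i for i, x in enumerate(form) if x in GRAMMAR), None)
        if idx is None:
            found.add("".join(form))
            continue
        for body in GRAMMAR[form[idx]]:
            nxt = form[:idx] + body + form[idx + 1:]
            terminals = sum(1 for x in nxt if x not in GRAMMAR)
            # Terminals never disappear, so longer forms cannot lead to short strings.
            if terminals <= max_len and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found


expected = {"a" * n + "b" * m
            for n in range(7) for m in range(7)
            if n != m and n + m <= 6}
print(generate(6) == expected)  # True
```

Agreement up to a bound is evidence, not a proof; the argument on the slide is what establishes the claim for all lengths.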

12 / 44

Leftmost and Rightmost Derivations

• In CFGs that are not linear, a derivation may involve sentential forms with more than one variable. In such cases, we have a choice in the order in which variables are replaced.

• A derivation is said to be leftmost/rightmost if in each step the leftmost/rightmost variable in the sentential form is replaced.

13 / 44

Consider G = ({A,B,S}, {a, b},S ,P) with productions

1. S → AB,

2. A → aaA,

3. A → ε,

4. B → Bb,

5. B → ε.

The following two derivations use the same productions and produce the same sentence, although the order in which the productions are applied is different. The number on each arrow is the production applied in that step.

S ⇒(1) AB ⇒(2) aaAB ⇒(3) aaB ⇒(4) aaBb ⇒(5) aab,

S ⇒(1) AB ⇒(4) ABb ⇒(2) aaABb ⇒(5) aaAb ⇒(3) aab.

Note that L(G) = {a^(2n) b^m | n ≥ 0, m ≥ 0}.
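The two derivations can be replayed mechanically by applying the numbered productions in the stated order. The sketch below also reports whether every step replaced the leftmost variable. It relies on an assumption that holds for this grammar: each sentential form contains the head of the applied production exactly once. The function name `derive` is illustrative.

```python
PRODUCTIONS = {
    1: ("S", "AB"),
    2: ("A", "aaA"),
    3: ("A", ""),    # A -> ε
    4: ("B", "Bb"),
    5: ("B", ""),    # B -> ε
}


def derive(order):
    """Apply productions in the given order; return (forms, all_steps_leftmost)."""
    form, steps, leftmost = "S", ["S"], True
    for p in order:
        head, body = PRODUCTIONS[p]
        i = form.index(head)  # the head occurs exactly once in these forms
        # A step is leftmost if no variable (upper-case letter) precedes position i.
        leftmost = leftmost and not any(c.isupper() for c in form[:i])
        form = form[:i] + body + form[i + 1:]
        steps.append(form)
    return steps, leftmost


print(derive([1, 2, 3, 4, 5]))
# (['S', 'AB', 'aaAB', 'aaB', 'aaBb', 'aab'], True)
print(derive([1, 4, 2, 5, 3]))
# (['S', 'AB', 'ABb', 'aaABb', 'aaAb', 'aab'], False)
```

The first order is a leftmost derivation. The second is not, since its second step rewrites B while A still stands to its left.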

14 / 44

Parse Trees

Definition. An ordered tree for a CFG G is a parse tree for G if and only if

1. The root is labeled S.

2. Every leaf has a label from T ∪ {ε}.

3. Every interior vertex (a vertex which is not a leaf) is labeled by a variable in V.

4. If a vertex is labeled A and its children are labeled a1, a2, . . . , an, then P must contain

A → a1a2 · · · an.

5. If a leaf is labeled ε, then it must be the only child of its parent.

15 / 44

More to Say about Parse Trees...

• A parse tree tells us the syntactic structure of w.

• Parse trees are an alternative representation to derivations and recursive inference.

• There can be several parse trees for the same string (ambiguity).

• Ideally there should be only one parse tree (the true structure) for each string, i.e., the language should be unambiguous.

• Unfortunately, we cannot always remove the ambiguity.

16 / 44

Example: In the grammar

E → I ,

E → E + E ,

E → E ∗ E,

E → (E), · · ·

The parse tree for the derivation E ⇒* I + E has root E with children E, +, and E, where the first child E has the single child I.

17 / 44

Example: In the grammar

P → ε | 0 | 1 | 0P0 | 1P1.

The parse tree for the derivation P ⇒* 0110 has root P with children 0, P, 0; that inner P has children 1, P, 1; and the innermost P has the single child ε.

18 / 44

The Yield of a Parse Tree

The yield of a parse tree is the string of leaves from left to right.

Important are those parse trees where:

1. The yield is a terminal string.

2. The root is labeled by the start symbol.

We shall see that the set of yields of these important parse trees is the language of the grammar.
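The parse-tree conditions and the notion of yield can be made concrete with a small tree type. The sketch below uses the palindrome grammar P → ε | 0 | 1 | 0P0 | 1P1 and the tree for 0110. The class `Node` and the helper names are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass, field

EPS = "ε"
START = "P"
VARIABLES = {"P"}
TERMINALS = {"0", "1"}
# Productions as (head, body); the empty tuple is the body ε.
PRODUCTIONS = {("P", ()), ("P", ("0",)), ("P", ("1",)),
               ("P", ("0", "P", "0")), ("P", ("1", "P", "1"))}


@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def is_valid_subtree(node: Node) -> bool:
    """Check conditions 2 to 5 of the definition for the subtree at node."""
    if not node.children:  # leaf: terminal or ε
        return node.label in TERMINALS or node.label == EPS
    if node.label not in VARIABLES:  # interior vertex must be a variable
        return False
    labels = tuple(c.label for c in node.children)
    if EPS in labels and labels != (EPS,):  # ε must be an only child
        return False
    body = () if labels == (EPS,) else labels
    return (node.label, body) in PRODUCTIONS and all(
        is_valid_subtree(c) for c in node.children)


def is_parse_tree(root: Node) -> bool:
    """Condition 1 (root labeled with the start symbol) plus conditions 2 to 5."""
    return root.label == START and is_valid_subtree(root)


def tree_yield(node: Node) -> str:
    """Concatenate the leaf labels from left to right; ε contributes nothing."""
    if not node.children:
        return "" if node.label == EPS else node.label
    return "".join(tree_yield(c) for c in node.children)


TREE = Node("P", [Node("0"),
                  Node("P", [Node("1"), Node("P", [Node(EPS)]), Node("1")]),
                  Node("0")])
print(is_parse_tree(TREE), tree_yield(TREE))  # True 0110
```

This checker follows the strict definition above, in which every leaf is a terminal or ε; trees whose leaves include variables would need condition 2 relaxed.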

19 / 44

The yield is a ∗ (a + b00).

20 / 44

Let G = (V, T, S, P) be a CFG and A ∈ V. We will show that the following are equivalent:

1. We can determine by recursive inference that w is in the language of variable A.

2. A ⇒* w.

3. A ⇒*_lm w.

4. A ⇒*_rm w.

5. There is a parse tree of G with root A and yield w.

21 / 44

[Figure: diagram relating inferences, trees, and derivations.]

22 / 44

From Inferences to Trees

Theorem. Let G = (V, T, S, P) be a CFG. If the recursive inference procedure tells us that terminal string w is in the language of variable A, then there is a parse tree with root A and yield w.

Proof. We do an induction on the length of the inference.

Basis: One step. Then we must have used a production A → w. The desired parse tree is then the tree with root A whose children are the symbols of w, in order (or a single leaf ε if w = ε).

23 / 44

Induction: w is inferred in n + 1 steps. Suppose that the last step was based on a production

A→ X1X2 · · ·Xk ,

where Xi ∈ V ∪ T . We break w up as

w1w2 · · ·wk ,

where wi = Xi when Xi ∈ T, and when Xi ∈ V, wi was previously inferred to be in the language of Xi, in at most n steps.

By the IH, for each Xi ∈ V there is a parse tree with root Xi and yield wi. Then the following is a parse tree for G with root A and yield w: the root A has children X1, X2, . . . , Xk, and each Xi ∈ V is the root of its subtree with yield wi.

24 / 44

From Trees to Derivations

We will show how to construct a leftmost derivation for a parse tree.

Example: In the grammar of slide 8, there clearly is a derivation

E ⇒ I ⇒ Ib ⇒ ab.

Then, for any α and β there is a derivation

αEβ ⇒ αIβ ⇒ αIbβ ⇒ αabβ.

For example, suppose we have a derivation

E ⇒ E + E ⇒ E + (E).

We can choose α = E + ( and β = ) and continue the derivation as

E + (E) ⇒ E + (I) ⇒ E + (Ib) ⇒ E + (ab).

This is why CFG’s are called context-free.

25 / 44

Theorem. Let G = (V, T, S, P) be a CFG and suppose there is a parse tree with root labeled A and yield w. Then A ⇒*_lm w in G.

Proof. We do an induction on the height of the parse tree.

Basis: Height is 1. The tree must consist of the root A with the symbols of w as its children, all of them leaves.

Consequently A → w ∈ P and A ⇒_lm w.

26 / 44

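Induction: Height is n + 1. The tree must consist of the root A with children X1, X2, . . . , Xk, where each Xi is the root of a subtree of height at most n with yield wi.

Then w = w1w2 · · · wk, where

1. If Xi ∈ T, then wi = Xi.

2. If Xi ∈ V, then Xi ⇒*_lm wi in G by the IH.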

27 / 44

Now we construct A ⇒*_lm w by an inner induction, showing that

∀i : A ⇒*_lm w1w2 · · · wi Xi+1 Xi+2 · · · Xk.

Basis (inner): Let i = 0. We already know that

A ⇒*_lm X1X2 · · · Xk.

Induction (inner): Make the IH that

A ⇒*_lm w1w2 · · · wi−1 Xi Xi+1 · · · Xk.

28 / 44

Case 1: Xi ∈ T. Do nothing, since Xi = wi gives us

A ⇒*_lm w1w2 · · · wi Xi+1 Xi+2 · · · Xk.

29 / 44

Case 2: Xi ∈ V. By the IH there is a derivation Xi ⇒_lm α1 ⇒_lm α2 ⇒_lm · · · ⇒_lm wi. By the context-free property of derivations we can proceed with

A ⇒*_lm w1w2 · · · wi−1 Xi Xi+1 · · · Xk
  ⇒_lm w1w2 · · · wi−1 α1 Xi+1 · · · Xk
  ⇒_lm w1w2 · · · wi−1 α2 Xi+1 · · · Xk
  ⇒_lm · · ·
  ⇒_lm w1w2 · · · wi−1 wi Xi+1 · · · Xk.

30 / 44

Example: Let's construct the leftmost derivation for the parse tree with yield a ∗ (a + b00).

31 / 44

Suppose we have inductively constructed the leftmost derivation

E ⇒_lm I ⇒_lm a

corresponding to the leftmost subtree, and the leftmost derivation

E ⇒_lm (E) ⇒_lm (E + E) ⇒_lm (I + E) ⇒_lm (a + E) ⇒_lm (a + I) ⇒_lm (a + I0) ⇒_lm (a + I00) ⇒_lm (a + b00)

corresponding to the rightmost subtree.

32 / 44

For the derivation corresponding to the whole tree, we start with E ⇒_lm E ∗ E and expand the first E with the first derivation and the second E with the second derivation:

E ⇒_lm E ∗ E
  ⇒_lm I ∗ E
  ⇒_lm a ∗ E
  ⇒_lm a ∗ (E)
  ⇒_lm a ∗ (E + E)
  ⇒_lm a ∗ (I + E)
  ⇒_lm a ∗ (a + E)
  ⇒_lm a ∗ (a + I)
  ⇒_lm a ∗ (a + I0)
  ⇒_lm a ∗ (a + I00)
  ⇒_lm a ∗ (a + b00).
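The construction from a parse tree to a leftmost derivation can be sketched in a few lines: keep the current sentential form as a list of subtrees and repeatedly replace the leftmost interior node by its children. The tree encoding (a leaf is a string, an interior node is a pair of label and child list) is an assumption made for this illustration, and `*` is written in ASCII.

```python
def label(t):
    """Label of a subtree: a leaf is a plain string, an interior node is (label, children)."""
    return t if isinstance(t, str) else t[0]


def leftmost_derivation(tree):
    """List the sentential forms of the leftmost derivation encoded by a parse tree."""
    form = [tree]  # current sentential form, as a list of subtrees
    steps = ["".join(label(t) for t in form)]
    while True:
        # The leftmost variable is the first interior node in the form.
        i = next((k for k, t in enumerate(form) if not isinstance(t, str)), None)
        if i is None:
            return steps
        form[i:i + 1] = form[i][1]  # replace the node by its children
        steps.append("".join(label(t) for t in form))


IDENT_A = ("E", [("I", ["a"])])                             # E -> I -> a
IDENT_B00 = ("E", [("I", [("I", [("I", ["b"]), "0"]), "0"])])  # E -> I -> I0 -> I00 -> b00
TREE = ("E", [IDENT_A, "*",
              ("E", ["(", ("E", [IDENT_A, "+", IDENT_B00]), ")"])])

for step in leftmost_derivation(TREE):
    print(step)
```

Run on this tree, the loop prints the same twelve sentential forms as the derivation above, from E down to a*(a+b00).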

33 / 44

From Derivations to Recursive Inferences

Observation: Suppose that A ⇒ X1X2 · · · Xk ⇒* w. Then w = w1w2 · · · wk, where Xi ⇒* wi.

The factor wi can be extracted from A ⇒* w by looking at the expansion of Xi only.

Example: E ⇒* a ∗ b + a and E ⇒* E ∗ E + E, with X1 = E, X2 = ∗, X3 = E, X4 = +, X5 = E.

We have

E ⇒ E ∗ E ⇒ E ∗ E + E ⇒ I ∗ E + E ⇒ I ∗ I + E ⇒ I ∗ I + I ⇒ a ∗ I + I ⇒ a ∗ b + I ⇒ a ∗ b + a.

By looking at the expansion of X3 = E only, we can extract

E ⇒ I ⇒ b.

34 / 44

Theorem. Let G = (V, T, S, P) be a CFG. Suppose A ⇒*_G w and that w is a string of terminals. Then we can infer that w is in the language of variable A.

Proof. We do an induction on the length of the derivation A ⇒*_G w.

Basis: One step. If A ⇒_G w, there must be a production A → w in P. Then we can infer that w is in the language of A.

Induction: Suppose A ⇒*_G w in n + 1 steps. Write the derivation as

A ⇒_G X1X2 · · · Xk ⇒*_G w.

As noted on the previous slide, we can break w as w1w2 · · · wk, where Xi ⇒*_G wi. Furthermore, Xi ⇒*_G wi can use at most n steps.

Now we have a production A → X1X2 · · · Xk, and we know by the IH that we can infer wi to be in the language of Xi.

Therefore we can infer w1w2 · · · wk to be in the language of A.

35 / 44

Ambiguity in Grammars and Languages: Example

In the grammar

E → I ,

E → E + E ,

E → E ∗ E,

E → (E),

· · · ,

the sentential form E + E ∗ E has two derivations:

E ⇒ E + E ⇒ E + E ∗ E ,

E ⇒ E ∗ E ⇒ E + E ∗ E .

36 / 44

This gives us two parse trees:

• Left-hand side: The second and the third expressions are multiplied and the result is added to the first expression (e.g., 1 + (2 ∗ 3) = 7).

• Right-hand side: Adds the first two expressions and multiplies the result by the third (e.g., (1 + 2) ∗ 3 = 9).

37 / 44

Ambiguity in Grammars and Languages

Definition. A CFG G is said to be ambiguous if there exists some w ∈ L(G) that has at least two distinct parse trees.

Definition. A CFL L is said to be inherently ambiguous if all its grammars are ambiguous.

Definition. If L is a CFL for which there exists an unambiguous grammar, then L is said to be unambiguous.

If even one grammar for L is unambiguous, then L is an unambiguous language.
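Ambiguity of the expression grammar E → I | E + E | E ∗ E | (E), I → a | b | Ia | Ib | I0 | I1 can be observed by counting parse trees for a given string. The sketch below counts trees over substrings with memoization. It is hand-specialized to this one grammar and is not a general algorithm; the function name `count_trees` is illustrative, and `*` is written in ASCII.

```python
from functools import lru_cache


def count_trees(w: str) -> int:
    """Number of distinct parse trees with root E and yield w."""

    @lru_cache(maxsize=None)
    def ident(i: int, j: int) -> int:
        # Trees rooted at I for w[i:j]: I -> a | b | Ia | Ib | I0 | I1.
        if j - i == 1:
            return int(w[i] in ("a", "b"))
        if j - i > 1 and w[j - 1] in ("a", "b", "0", "1"):
            return ident(i, j - 1)
        return 0

    @lru_cache(maxsize=None)
    def expr(i: int, j: int) -> int:
        # Trees rooted at E for w[i:j].
        if i >= j:
            return 0
        total = ident(i, j)                      # E -> I
        if w[i] == "(" and w[j - 1] == ")":
            total += expr(i + 1, j - 1)          # E -> (E)
        for k in range(i + 1, j - 1):            # E -> E + E | E * E
            if w[k] in ("+", "*"):
                total += expr(i, k) * expr(k + 1, j)
        return total

    return expr(0, len(w))


print(count_trees("a+a*a"))  # 2
print(count_trees("a"))      # 1
```

A count of 2 for a+a*a exhibits a string with two distinct parse trees, which is exactly what the first definition requires for the grammar to be ambiguous.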

38 / 44

Removing Ambiguity from Grammars

• Good news: Sometimes we can remove ambiguity by hand.

• Bad news: There is no algorithm to do it.

• More bad news: Some CFL's have only ambiguous CFG's.

39 / 44

Let us consider the grammar:

E → I |E + E |E ∗ E | (E ),

I → a | b | Ia | Ib | I0 | I1.

There are two problems:

1. There is no precedence between ∗ and +.

2. There is no grouping of sequences of operators, e.g., E + E + E may be meant as (E + E) + E or as E + (E + E).

40 / 44

Solution: We introduce more variables, each representing expressions that share a level of "binding strength".

1. A factor is an expression that cannot be broken apart by an adjacent ∗ or +. Our factors are

   1.1 Identifiers
   1.2 A parenthesized expression

2. A term is an expression that cannot be broken by +. A term is a product of one or more factors. For instance, a ∗ b can be broken by a1∗ or ∗a1. It cannot be broken by +, since, e.g., a1 + a ∗ b is (by precedence rules) the same as a1 + (a ∗ b), and a ∗ b + a1 is the same as (a ∗ b) + a1.

3. The rest are expressions, i.e., they can be broken apart by ∗ or +.

41 / 44

We will let F stand for factors, T for terms, and E for expressions. Consider the following grammar:

I → a | b | Ia | Ib | I0 | I1,

F → I | (E),

T → F | T ∗ F,

E → T | E + T.

Now the only parse tree for a + a ∗ a will be

42 / 44

Why is the grammar shown on the previous slide unambiguous?

• A factor is either an identifier or (E), for some expression E.

• The only parse tree for a sequence

f1 ∗ f2 ∗ · · · ∗ fn−1 ∗ fn

of factors is the one that gives f1 ∗ f2 ∗ · · · ∗ fn−1 as a term and fn as a factor, as in the parse tree on the next slide.

• An expression is a sequence

t1 + t2 + · · · + tn−1 + tn

of terms ti. It can only be parsed with t1 + t2 + · · · + tn−1 as an expression and tn as a term.
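The unique grouping forced by the grammar E → T | E + T, T → F | T ∗ F, F → I | (E) can be made visible with a small hand-written parser that returns the input fully parenthesized. This is a sketch under one stated substitution: the left-recursive productions are implemented as left-to-right loops, which produce the same left-associative grouping as the grammar but are not a literal transcription of it. The function name `parse` is illustrative, and `*` is written in ASCII.

```python
def parse(s: str) -> str:
    """Return s fully parenthesized according to the E/T/F/I grammar."""
    pos = 0

    def peek() -> str:
        return s[pos] if pos < len(s) else ""

    def factor() -> str:  # F -> I | (E)
        nonlocal pos
        if peek() == "(":
            pos += 1
            inner = expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1
            return inner
        if peek() not in ("a", "b"):  # identifiers start with a or b
            raise ValueError(f"unexpected {peek()!r} at position {pos}")
        start = pos
        pos += 1
        while peek() in ("a", "b", "0", "1"):
            pos += 1
        return s[start:pos]

    def term() -> str:  # T -> F | T * F, iterated left to right
        nonlocal pos
        left = factor()
        while peek() == "*":
            pos += 1
            left = f"({left}*{factor()})"
        return left

    def expr() -> str:  # E -> T | E + T, iterated left to right
        nonlocal pos
        left = term()
        while peek() == "+":
            pos += 1
            left = f"({left}+{term()})"
        return left

    result = expr()
    if pos != len(s):
        raise ValueError(f"unexpected {peek()!r} at position {pos}")
    return result


print(parse("a+a*a"))  # (a+(a*a))
print(parse("a+a+a"))  # ((a+a)+a)
```

The two outputs correspond to the two problems listed earlier: ∗ binds tighter than +, and a run of equal operators groups from the left.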

43 / 44

44 / 44
