Context-Free Grammars and Languages
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
1 / 44
Palindrome Example
Consider the language of palindromes, $L = \{w \in \{0, 1\}^* \mid w = w^R\}$, where a palindrome is a string that reads the same forward and backward (e.g., otto).
Question: Is there a recursive definition of this L?
Answer: Yes. A palindrome of length at least 2 must begin and end with the same symbol, and the string between those two symbols is again a palindrome. This leads to:
• Basis: ε, 0, and 1 are palindromes.
• Induction: If w is a palindrome, so are 0w0 and 1w1.
No string of 0's and 1's is a palindrome unless it follows from this basis and induction rule.
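The basis and induction rules translate directly into a recursive recognizer. The following is a minimal sketch; the function name `is_palindrome` is an illustrative choice, not part of the slides.

```python
def is_palindrome(w: str) -> bool:
    """Recognize palindromes over {0, 1} using the basis and induction rules."""
    # Strings containing symbols other than 0 and 1 are outside the alphabet.
    if any(c not in "01" for c in w):
        return False
    # Basis: the empty string, 0, and 1 are palindromes.
    if len(w) <= 1:
        return True
    # Induction: w must have the form 0x0 or 1x1, where x is a palindrome.
    return w[0] == w[-1] and is_palindrome(w[1:-1])
```

For example, `is_palindrome("0110")` peels off the outer 0's, then the outer 1's, and reaches the basis case ε.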
3 / 44
Grammar: Palindrome Example
$G_{pal} = (\{S\}, \{0, 1\}, S, P)$, where P consists of the productions
S → ε,
S → 0,
S → 1,
S → 0S0,
S → 1S1.
4 / 44
Context-Free Grammars
Definition. A grammar $G = (V, T, S, P)$ is said to be context-free if all productions in $P$ are of the form
$$\underbrace{A}_{\text{head}} \to \underbrace{x}_{\text{body}},$$
where $A \in V$ and $x \in (V \cup T)^*$.
• There is no restriction on the right-hand side of a production rule: it may be any string of variables and terminals.
• The left-hand side of a production rule is restricted to a single variable.
5 / 44
Example: Consider the grammar $G = (V, T, S, P)$ with productions
$S \to aSa \mid bSb \mid \varepsilon.$
A typical derivation in this grammar is
$S \Rightarrow aSa \Rightarrow aaSaa \Rightarrow aabSbaa \Rightarrow aabbaa.$
This makes it clear that
$L(G) = \{ww^R \mid w \in \{a, b\}^*\}.$
We know this is not regular but is context-free.
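The derivation pattern for this grammar can be sketched in code: each symbol of w triggers one use of S → aSa or S → bSb, and a final S → ε removes the variable. The function name `derive` is a hypothetical helper introduced for illustration.

```python
def derive(w: str) -> list[str]:
    """Return the sentential forms of a derivation of w followed by its reverse."""
    forms = ["S"]
    left = ""
    for c in w:
        if c not in "ab":
            raise ValueError("w must be a string over {a, b}")
        left += c
        # Apply S -> cSc: the single S sits between left and its mirror image.
        forms.append(left + "S" + left[::-1])
    # Apply S -> epsilon to remove the remaining variable.
    forms.append(left + left[::-1])
    return forms


print(" => ".join(derive("aab")))
```

With w = aab the printed chain reproduces the derivation shown on the slide.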
6 / 44
Derivations Using Grammars
Apply the productions of a CFG to infer that certain strings are in the language. There are two approaches to this inference:
• Recursive inference: use productions from body to head.
• Derivation: use productions from head to body.
  • Leftmost derivation
  • Rightmost derivation
See Fig. 5.2 and 5.3 for recursive inference and Ex. 5.6 for derivation (pp. 178-179).
7 / 44
Consider the following CFG G = ({E, I}, {+, ∗, (, ), a, b, 0, 1}, E, P) with productions
1. E → I,
2. E → E + E,
3. E → E ∗ E,
4. E → (E),
5. I → a,
6. I → b,
7. I → Ia,
8. I → Ib,
9. I → I0,
10. I → I1.
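Recursive inference on this grammar can be sketched as a fixed-point computation: starting from nothing, repeatedly use each production from body to head until no new strings appear. The length bound `max_len` is an assumption added here so that the computation terminates; the names `PRODUCTIONS` and `infer` are illustrative.

```python
from itertools import product

# The ten productions of the expression grammar, grouped by head.
PRODUCTIONS = {
    "E": [["I"], ["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"]],
    "I": [["a"], ["b"], ["I", "a"], ["I", "b"], ["I", "0"], ["I", "1"]],
}


def infer(max_len: int) -> dict[str, set[str]]:
    """Infer all terminal strings of length <= max_len for each variable."""
    known = {v: set() for v in PRODUCTIONS}
    changed = True
    while changed:
        changed = False
        for head, bodies in PRODUCTIONS.items():
            for body in bodies:
                # A variable contributes strings already inferred for it;
                # a terminal contributes itself.
                choices = [sorted(known[x]) if x in PRODUCTIONS else [x] for x in body]
                for parts in product(*choices):
                    w = "".join(parts)
                    if len(w) <= max_len and w not in known[head]:
                        known[head].add(w)
                        changed = True
    return known


known = infer(3)
print(sorted(known["E"] - known["I"]))
```

The printed list contains exactly the strings of length at most 3 that need production 2, 3, or 4, such as a+b and (a).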
8 / 44
Context-Free Languages
Definition. A language $L$ is said to be context-free iff there is a context-free grammar $G$ such that $L = L(G)$, where
$$L(G) = \{w \in T^* \mid S \overset{*}{\underset{G}{\Rightarrow}} w\}.$$
9 / 44
The Language of Gpal
Theorem. $L(G_{pal}) = \{w \in \{0, 1\}^* \mid w = w^R\}$. That is, for $w \in \{0, 1\}^*$, $w \in L(G_{pal})$ iff $w = w^R$.
Proof ("if" part). Suppose $w = w^R$. We prove by induction on $|w|$ that $w \in L(G_{pal})$.
Basis: $|w| = 0$ or $|w| = 1$. Then $w$ is ε, 0, or 1. Since $S \to \varepsilon \mid 0 \mid 1$ are productions, we conclude that $S \overset{*}{\underset{G}{\Rightarrow}} w$ in all base cases.
Induction: Suppose $|w| \ge 2$. Since $w = w^R$, we have $w = 0x0$ or $w = 1x1$, and $x = x^R$. If $w = 0x0$, we know from the IH that $S \overset{*}{\Rightarrow} x$. Then $S \Rightarrow 0S0 \overset{*}{\Rightarrow} 0x0 = w$. The case $w = 1x1$ is similar.
10 / 44
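Proof ("only if" part). We assume that $w \in L(G_{pal})$ and must show that $w = w^R$.
Since $w \in L(G_{pal})$, we have $S \overset{*}{\Rightarrow} w$. We prove the claim by induction on the number of steps in this derivation.
Basis: The derivation $S \overset{*}{\Rightarrow} w$ takes one step. Then $w$ must be ε, 0, or 1, all of which are palindromes.
Induction: The IH states that whenever $S \overset{*}{\Rightarrow} x$ in $n$ steps, $x = x^R$. Suppose the derivation of $w$ takes $n + 1$ steps. Then we must have
$S \Rightarrow 0S0 \overset{*}{\Rightarrow} 0x0 = w,$
or
$S \Rightarrow 1S1 \overset{*}{\Rightarrow} 1x1 = w,$
where $S \overset{*}{\Rightarrow} x$ in $n$ steps. By the IH, $x = x^R$, and therefore $w = w^R$.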
11 / 44
Example: Show that $L = \{a^n b^m \mid n \ne m\}$ is a CFL.
Solution. Note that the CFG $G = (\{S\}, \{a, b\}, S, P)$ with productions
$S \to aSb \mid \varepsilon,$
leads to
$L(G) = \{a^n b^n \mid n \ge 0\}.$
In order to take care of the case n > m, we first generate a string with an equal number of a's and b's, then add extra a's on the left, leading to
$S \to A S_1,$
$S_1 \to a S_1 b \mid \varepsilon,$
$A \to aA \mid a.$
We use similar reasoning for the case n < m. Thus, the CFG for L is given by
$S \to A S_1 \mid S_1 B,$
$S_1 \to a S_1 b \mid \varepsilon,$
$A \to aA \mid a,$
$B \to bB \mid b.$
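As a sanity check, the final grammar can be enumerated up to a length bound and compared with the set description of L. The sketch below expands the leftmost variable of each sentential form breadth-first; the bound `max_len` and the names `GRAMMAR` and `language` are assumptions made for illustration. The search terminates here because sentential forms of this grammar never hold more than two variables.

```python
from collections import deque

# Productions of the final grammar; S1 is written as the symbol "S1".
GRAMMAR = {
    "S": [("A", "S1"), ("S1", "B")],
    "S1": [("a", "S1", "b"), ()],
    "A": [("a", "A"), ("a",)],
    "B": [("b", "B"), ("b",)],
}


def language(max_len: int) -> set[str]:
    """All terminal strings of length <= max_len derivable from S."""
    seen, out = set(), set()
    queue = deque([("S",)])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None for a terminal string.
        i = next((k for k, x in enumerate(form) if x in GRAMMAR), None)
        if i is None:
            out.add("".join(form))
            continue
        for body in GRAMMAR[form[i]]:
            new = form[:i] + body + form[i + 1:]
            # Terminals are never erased, so forms with too many can be pruned.
            terminals = sum(1 for x in new if x not in GRAMMAR)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return out


expected = {"a" * n + "b" * m for n in range(7) for m in range(7) if n != m and n + m <= 6}
print(language(6) == expected)
```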
12 / 44
Leftmost and Rightmost Derivations
• In CFGs that are not linear, a derivation may involve sentential forms with more than one variable. In such cases, we have a choice in the order in which variables are replaced.
• A derivation is said to be leftmost/rightmost if in each step the leftmost/rightmost variable in the sentential form is replaced.
13 / 44
Consider G = ({A,B,S}, {a, b},S ,P) with productions
1. S → AB,
2. A → aaA,
3. A → ε,
4. B → Bb,
5. B → ε.
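The following two derivations use the same productions and produce the same sentence, although the order in which the productions are applied is different (the number above each arrow is the production used):
$S \overset{1}{\Rightarrow} AB \overset{2}{\Rightarrow} aaAB \overset{3}{\Rightarrow} aaB \overset{4}{\Rightarrow} aaBb \overset{5}{\Rightarrow} aab,$
$S \overset{1}{\Rightarrow} AB \overset{4}{\Rightarrow} ABb \overset{2}{\Rightarrow} aaABb \overset{5}{\Rightarrow} aaAb \overset{3}{\Rightarrow} aab.$
Note that $L(G) = \{a^{2n} b^m \mid n \ge 0, m \ge 0\}$.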
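The effect of the replacement order can be sketched in code. Of the two derivations above, the first is leftmost, while the second is neither leftmost nor rightmost; the sketch below derives aab with a strictly leftmost order (1, 2, 3, 4, 5) and a strictly rightmost order (1, 4, 5, 2, 3). The helper names `apply_production` and `derive` are illustrative.

```python
# Productions numbered as on the slide: number -> (head, body).
PRODUCTIONS = {1: ("S", "AB"), 2: ("A", "aaA"), 3: ("A", ""), 4: ("B", "Bb"), 5: ("B", "")}


def apply_production(form: str, number: int, leftmost: bool) -> str:
    """Replace the leftmost (or rightmost) variable of form using one production."""
    head, body = PRODUCTIONS[number]
    # Variables are the upper-case symbols of the sentential form.
    variables = [i for i, c in enumerate(form) if c.isupper()]
    if not variables:
        raise ValueError("no variable left to replace")
    i = variables[0] if leftmost else variables[-1]
    if form[i] != head:
        raise ValueError(f"production {number} does not apply to {form[i]}")
    return form[:i] + body + form[i + 1:]


def derive(numbers: list[int], leftmost: bool) -> list[str]:
    """Apply the productions in the given order, starting from S."""
    forms = ["S"]
    for n in numbers:
        forms.append(apply_production(forms[-1], n, leftmost))
    return forms


print(" => ".join(derive([1, 2, 3, 4, 5], leftmost=True)))
print(" => ".join(derive([1, 4, 5, 2, 3], leftmost=False)))
```

Both calls end in aab, which illustrates that the order of replacement does not change the derived sentence.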
14 / 44
Parse Trees
Definition. An ordered tree is a parse tree for a CFG $G = (V, T, S, P)$ if and only if
1. The root is labeled $S$.
2. Every leaf has a label from $T \cup \{\varepsilon\}$.
3. Every interior vertex (a vertex which is not a leaf) is labeled by a variable in $V$.
4. If a vertex is labeled $A$ and its children are labeled $a_1, a_2, \ldots, a_n$, then $P$ must contain $A \to a_1 a_2 \cdots a_n$.
5. If a leaf is labeled ε, then it must be the only child of its parent.
15 / 44
More to Say about Parse Trees...
• A parse tree tells us the syntactic structure of w.
• It is an alternative representation to derivations and recursive inference.
• There can be several parse trees for the same string (ambiguity).
• Ideally there should be only one parse tree (the true structure) for each string, i.e., the grammar should be unambiguous.
• Unfortunately, we cannot always remove the ambiguity.
16 / 44
Example: In the grammar
E → I,
E → E + E,
E → E ∗ E,
E → (E), ⋯
the following parse tree shows the derivation $E \overset{*}{\Rightarrow} I + E$.

        E
      / | \
     E  +  E
     |
     I
17 / 44
Example: In the grammar
P → ε | 0 | 1 | 0P0 | 1P1,
the following parse tree shows the derivation $P \overset{*}{\Rightarrow} 0110$.

          P
        / | \
       0  P  0
        / | \
       1  P  1
          |
          ε
18 / 44
The Yield of a Parse Tree
The yield of a parse tree is the string of leaves from left to right.
Important are those parse trees where:
1. The yield is a terminal string.
2. The root is labeled by the start symbol.
We shall see that the set of yields of these important parse trees is the language of the grammar.
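A parse tree and its yield can be sketched with a small data structure. The class `Node`, the functions `tree_yield` and `respects_productions`, and the use of the string "ε" as the label of an ε-leaf are assumptions made for illustration. The check covers only the condition that each interior vertex and its children match a production.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def tree_yield(node: Node) -> str:
    """Concatenate the leaf labels from left to right; an ε-leaf contributes nothing."""
    if not node.children:
        return "" if node.label == "ε" else node.label
    return "".join(tree_yield(c) for c in node.children)


def respects_productions(node: Node, productions: dict[str, set[tuple[str, ...]]]) -> bool:
    """Check that every interior vertex and its children form a production."""
    if not node.children:
        return True
    body = tuple(c.label for c in node.children)
    return body in productions.get(node.label, set()) and all(
        respects_productions(c, productions) for c in node.children
    )


# Palindrome grammar P -> ε | 0 | 1 | 0P0 | 1P1 and the parse tree for 0110.
PAL = {"P": {("ε",), ("0",), ("1",), ("0", "P", "0"), ("1", "P", "1")}}
inner = Node("P", [Node("1"), Node("P", [Node("ε")]), Node("1")])
tree = Node("P", [Node("0"), inner, Node("0")])
print(tree_yield(tree), respects_productions(tree, PAL))
```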
19 / 44
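Let $G = (V, T, S, P)$ be a CFG and $A \in V$. We will show that the following are equivalent:
1. We can determine by recursive inference that w is in the language of variable A.
2. $A \overset{*}{\Rightarrow} w$.
3. $A \overset{*}{\underset{lm}{\Rightarrow}} w$.
4. $A \overset{*}{\underset{rm}{\Rightarrow}} w$.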
5. There is a parse tree of G with root A and yield w .
21 / 44
From Inferences to Trees
Theorem. Let $G = (V, T, S, P)$ be a CFG. If the recursive inference procedure tells us that terminal string w is in the language of variable A, then there is a parse tree with root A and yield w.
Proof. We do an induction on the length of the inference.
Basis: One step. Then we must have used a production A → w. The desired parse tree is then

     A
     |
     w
23 / 44
Induction: w is inferred in n + 1 steps. Suppose that the last step was based on a production
$A \to X_1 X_2 \cdots X_k,$
where $X_i \in V \cup T$. We break w up as
$w_1 w_2 \cdots w_k,$
where $w_i = X_i$ when $X_i \in T$, and when $X_i \in V$, $w_i$ was previously inferred to be in the language of $X_i$ in at most n steps.
By the IH, for each $X_i \in V$ there is a parse tree with root $X_i$ and yield $w_i$. Then the following is a parse tree for G with root A and yield w:

             A
         /   |     \
       X1    X2  ⋯  Xk
       |     |      |
       w1    w2  ⋯  wk
24 / 44
From Trees to Derivations
We will show how to construct a leftmost derivation for a parse tree.
Example: In the expression grammar of slide 8, there clearly is a derivation
E ⇒ I ⇒ Ib ⇒ ab.
Then, for any α and β there is a derivation
αEβ ⇒ αIβ ⇒ αIbβ ⇒ αabβ.
For example, suppose we have a derivation
E ⇒ E + E ⇒ E + (E).
We can choose α = E + ( and β = ) and continue the derivation as
E + (E) ⇒ E + (I) ⇒ E + (Ib) ⇒ E + (ab).
This is why CFGs are called context-free.
25 / 44
Theorem. Let $G = (V, T, S, P)$ be a CFG and suppose there is a parse tree with root labeled A and yield w. Then $A \overset{*}{\underset{lm}{\Rightarrow}} w$ in G.
Proof. We do an induction on the height of the parse tree.
Basis: Height is 1. The tree must look like

     A
     |
     w

Consequently $A \to w \in P$ and $A \underset{lm}{\Rightarrow} w$.
26 / 44
Induction: Height is n + 1. The tree must look like

             A
         /   |     \
       X1    X2  ⋯  Xk
       |     |      |
       w1    w2  ⋯  wk

Then $w = w_1 w_2 \cdots w_k$, where
1. If $X_i \in T$, then $w_i = X_i$.
2. If $X_i \in V$, then $X_i \overset{*}{\underset{lm}{\Rightarrow}} w_i$ in G by the IH.
27 / 44
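Now we construct $A \overset{*}{\underset{lm}{\Rightarrow}} w$ by an inner induction, showing that
$$\forall i: A \overset{*}{\underset{lm}{\Rightarrow}} w_1 w_2 \cdots w_i X_{i+1} X_{i+2} \cdots X_k.$$
Basis (inner): Let i = 0. We already know that
$$A \overset{*}{\underset{lm}{\Rightarrow}} X_1 X_2 \cdots X_k.$$
Induction (inner): Make the IH that
$$A \overset{*}{\underset{lm}{\Rightarrow}} w_1 w_2 \cdots w_{i-1} X_i X_{i+1} \cdots X_k.$$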
28 / 44
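Case 2: $X_i \in V$. By the IH there is a derivation $X_i \underset{lm}{\Rightarrow} \alpha_1 \underset{lm}{\Rightarrow} \alpha_2 \underset{lm}{\Rightarrow} \cdots \underset{lm}{\Rightarrow} w_i$. By the context-free property of derivations we can proceed with
$$\begin{aligned}
A &\overset{*}{\underset{lm}{\Rightarrow}} w_1 w_2 \cdots w_{i-1} X_i X_{i+1} \cdots X_k \\
&\underset{lm}{\Rightarrow} w_1 w_2 \cdots w_{i-1} \alpha_1 X_{i+1} \cdots X_k \\
&\underset{lm}{\Rightarrow} w_1 w_2 \cdots w_{i-1} \alpha_2 X_{i+1} \cdots X_k \\
&\underset{lm}{\Rightarrow} \cdots \\
&\underset{lm}{\Rightarrow} w_1 w_2 \cdots w_{i-1} w_i X_{i+1} \cdots X_k.
\end{aligned}$$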
30 / 44
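Suppose we have inductively constructed the leftmost derivation
$$E \underset{lm}{\Rightarrow} I \underset{lm}{\Rightarrow} a$$
corresponding to the leftmost subtree, and the leftmost derivation
$$\begin{aligned}
E &\underset{lm}{\Rightarrow} (E) \underset{lm}{\Rightarrow} (E + E) \underset{lm}{\Rightarrow} (I + E) \underset{lm}{\Rightarrow} (a + E) \underset{lm}{\Rightarrow} (a + I) \\
&\underset{lm}{\Rightarrow} (a + I0) \underset{lm}{\Rightarrow} (a + I00) \underset{lm}{\Rightarrow} (a + b00)
\end{aligned}$$
corresponding to the rightmost subtree.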
32 / 44
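For the derivation corresponding to the whole tree, we start with $E \underset{lm}{\Rightarrow} E * E$ and expand the first E with the first derivation and the second E with the second derivation:
$$\begin{aligned}
E &\underset{lm}{\Rightarrow} E * E \underset{lm}{\Rightarrow} I * E \underset{lm}{\Rightarrow} a * E \underset{lm}{\Rightarrow} a * (E) \underset{lm}{\Rightarrow} a * (E + E) \\
&\underset{lm}{\Rightarrow} a * (I + E) \underset{lm}{\Rightarrow} a * (a + E) \underset{lm}{\Rightarrow} a * (a + I) \\
&\underset{lm}{\Rightarrow} a * (a + I0) \underset{lm}{\Rightarrow} a * (a + I00) \underset{lm}{\Rightarrow} a * (a + b00).
\end{aligned}$$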
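The construction of a leftmost derivation from a parse tree can be sketched as repeatedly expanding the leftmost interior vertex on the current frontier. The tuple representation `(label, children)` and the names `leaf` and `leftmost_derivation` are assumptions made for this sketch; ε-leaves are not handled because the example does not need them.

```python
def leaf(label):
    """A leaf is a vertex with an empty list of children."""
    return (label, [])


def leftmost_derivation(tree):
    """Return the sentential forms of the leftmost derivation encoded by tree."""
    frontier = [tree]
    forms = ["".join(label for label, _ in frontier)]
    while True:
        # Index of the leftmost vertex on the frontier that still has children.
        i = next((k for k, (_, kids) in enumerate(frontier) if kids), None)
        if i is None:
            return forms
        # Replace that vertex by its children: one leftmost derivation step.
        frontier[i:i + 1] = frontier[i][1]
        forms.append("".join(label for label, _ in frontier))


# Parse tree of a*(a+b00) in the expression grammar.
ident_a = ("I", [leaf("a")])
ident_b00 = ("I", [("I", [("I", [leaf("b")]), leaf("0")]), leaf("0")])
inner = ("E", [("E", [ident_a]), leaf("+"), ("E", [ident_b00])])
tree = ("E", [("E", [ident_a]), leaf("*"), ("E", [leaf("("), inner, leaf(")")])])

print(" => ".join(leftmost_derivation(tree)))
```

The printed chain has the same twelve sentential forms as the derivation for the whole tree given above.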
33 / 44
From Derivations to Recursive Inferences
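Observation: Suppose that $A \Rightarrow X_1 X_2 \cdots X_k \overset{*}{\Rightarrow} w$. Then $w = w_1 w_2 \cdots w_k$, where $X_i \overset{*}{\Rightarrow} w_i$.
The factor $w_i$ can be extracted from $A \overset{*}{\Rightarrow} w$ by looking at the expansion of $X_i$ only.
Example: $E \overset{*}{\Rightarrow} a * b + a$ and
$$E \overset{*}{\Rightarrow} \underbrace{E}_{X_1} \underbrace{*}_{X_2} \underbrace{E}_{X_3} \underbrace{+}_{X_4} \underbrace{E}_{X_5}.$$
We have
E ⇒ E ∗ E ⇒ E ∗ E + E ⇒ I ∗ E + E ⇒ I ∗ I + E ⇒ I ∗ I + I ⇒ a ∗ I + I ⇒ a ∗ b + I ⇒ a ∗ b + a.
By looking at the expansion of $X_3 = E$ only, we can extract
E ⇒ I ⇒ b.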
34 / 44
Theorem. Let $G = (V, T, S, P)$ be a CFG. Suppose $A \overset{*}{\underset{G}{\Rightarrow}} w$ and that w is a string of terminals. Then we can infer that w is in the language of variable A.
Proof. We do an induction on the length of the derivation $A \overset{*}{\underset{G}{\Rightarrow}} w$.
Basis: One step. If $A \underset{G}{\Rightarrow} w$, there must be a production A → w in P. Then we can infer that w is in the language of A.
Induction: Suppose $A \overset{*}{\underset{G}{\Rightarrow}} w$ in n + 1 steps. Write the derivation as
$$A \underset{G}{\Rightarrow} X_1 X_2 \cdots X_k \overset{*}{\underset{G}{\Rightarrow}} w.$$
As noted on the previous slide, we can break w as $w_1 w_2 \cdots w_k$ where $X_i \overset{*}{\underset{G}{\Rightarrow}} w_i$. Furthermore, $X_i \overset{*}{\underset{G}{\Rightarrow}} w_i$ can use at most n steps.
Now we have a production $A \to X_1 X_2 \cdots X_k$, and we know by the IH that, for each variable $X_i$, we can infer $w_i$ to be in the language of $X_i$.
Therefore we can infer $w_1 w_2 \cdots w_k$ to be in the language of A.
35 / 44
Ambiguity in Grammars and Languages: Example
In the grammar
E → I ,
E → E + E ,
E → E ∗ E,
E → (E),
· · · ,
the sentential form E + E ∗ E has two derivations:
E ⇒ E + E ⇒ E + E ∗ E ,
and
E ⇒ E ∗ E ⇒ E + E ∗ E .
36 / 44
This gives us two parse trees:

        E                      E
      / | \                  / | \
     E  +  E                E  ∗  E
         / | \            / | \
        E  ∗  E          E  +  E

• Left-hand tree: The second and third expressions are multiplied and the result is added to the first expression (e.g., 1 + (2 ∗ 3) = 7).
• Right-hand tree: Adds the first two expressions and multiplies the result by the third (e.g., (1 + 2) ∗ 3 = 9).
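The ambiguity can be made concrete by counting parse trees of a terminal string in the grammar E → I | E + E | E ∗ E | (E). The sketch below is a hand-written counter for this one grammar, not a general parser; the names `count_I` and `count_E` are illustrative, and the operator ∗ is written as the ASCII character `*`.

```python
from functools import lru_cache


def count_I(s: str) -> int:
    """An identifier starts with a or b and continues with a, b, 0, or 1; it has one tree."""
    return int(bool(s) and s[0] in "ab" and all(c in "ab01" for c in s[1:]))


@lru_cache(maxsize=None)
def count_E(s: str) -> int:
    """Number of distinct parse trees with root E and yield s."""
    total = count_I(s)                              # E -> I
    for k, c in enumerate(s):
        if c in "+*":                               # E -> E + E and E -> E * E
            total += count_E(s[:k]) * count_E(s[k + 1:])
    if s.startswith("(") and s.endswith(")") and len(s) >= 2:
        total += count_E(s[1:-1])                   # E -> (E)
    return total


print(count_E("a+a*a"), count_E("(a+a)*a"))
```

The string a+a∗a has two parse trees, corresponding to the two groupings above, while the parenthesized string has only one.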
37 / 44
Ambiguity in Grammars and Languages
Definition. A CFG G is said to be ambiguous if there exists some w ∈ L(G) that has at least two distinct parse trees.
Definition. A CFL L is said to be inherently ambiguous if all its grammars are ambiguous.
Definition. If L is a CFL for which there exists an unambiguous grammar, then L is said to be unambiguous.
If even one grammar for L is unambiguous, then L is an unambiguous language.
38 / 44
Removing Ambiguity from Grammars
• Good news: Sometimes we can remove ambiguity by hand.
• Bad news: There is no general algorithm that removes ambiguity from an arbitrary CFG.
• More bad news: Some CFLs have only ambiguous CFGs.
39 / 44
Let us consider the grammar:
E → I |E + E |E ∗ E | (E ),
I → a | b | Ia | Ib | I0 | I1.
There are two problems:
1. There is no precedence between ∗ and +.
2. There is no grouping of sequences of operators, e.g., E + E + E can be read as (E + E) + E or as E + (E + E).
40 / 44
Solution: We introduce more variables, each representing expressions that share a level of "binding strength".
1. A factor is an expression that cannot be broken apart by an adjacent ∗ or +. Our factors are
   1.1 Identifiers
   1.2 A parenthesized expression
2. A term is an expression that cannot be broken by +. A term is a product of one or more factors. For instance, a ∗ b can be broken by placing a1 ∗ to its left or ∗ a1 to its right. It cannot be broken by +, since, e.g., a1 + a ∗ b is (by precedence rules) the same as a1 + (a ∗ b), and a ∗ b + a1 is the same as (a ∗ b) + a1.
3. The rest are expressions, i.e., they can be broken apart by ∗ or +.
41 / 44
We will let F stand for factors, T for terms, and E for expressions. Consider the following grammar:
I → a | b | Ia | Ib | I0 | I1,
F → I | (E),
T → F | T ∗ F,
E → T | E + T.
Now the only parse tree for a + a ∗ a will be

            E
          / | \
         E  +  T
         |   / | \
         T  T  ∗  F
         |  |     |
         F  F     I
         |  |     |
         I  I     a
         |  |
         a  a
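That the layered grammar admits a single parse tree per string can be spot-checked by counting parse trees, one counting function per variable. This is a sketch for this grammar only, tested on a few strings rather than proving unambiguity; the function names are illustrative and ∗ is written as the ASCII character `*`.

```python
from functools import lru_cache


def count_I(s: str) -> int:
    """An identifier starts with a or b and continues with a, b, 0, or 1; it has one tree."""
    return int(bool(s) and s[0] in "ab" and all(c in "ab01" for c in s[1:]))


@lru_cache(maxsize=None)
def count_F(s: str) -> int:
    """F -> I | (E)"""
    total = count_I(s)
    if s.startswith("(") and s.endswith(")") and len(s) >= 2:
        total += count_E(s[1:-1])
    return total


@lru_cache(maxsize=None)
def count_T(s: str) -> int:
    """T -> F | T * F"""
    return count_F(s) + sum(
        count_T(s[:k]) * count_F(s[k + 1:]) for k, c in enumerate(s) if c == "*"
    )


@lru_cache(maxsize=None)
def count_E(s: str) -> int:
    """E -> T | E + T"""
    return count_T(s) + sum(
        count_E(s[:k]) * count_T(s[k + 1:]) for k, c in enumerate(s) if c == "+"
    )


for w in ["a+a*a", "a+a+a", "a*(a+b00)"]:
    print(w, count_E(w))
```

Each of the three test strings has exactly one parse tree, whereas a + a ∗ a has two in the original grammar E → I | E + E | E ∗ E | (E).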
42 / 44
Why is the grammar shown on the previous slide unambiguous?
• A factor is either an identifier or (E), for some expression E.
• The only parse tree for a sequence
  $f_1 * f_2 * \cdots * f_{n-1} * f_n$
  of factors is the one that gives $f_1 * f_2 * \cdots * f_{n-1}$ as a term and $f_n$ as a factor, as in the parse tree on the next slide.
• An expression is a sequence
  $t_1 + t_2 + \cdots + t_{n-1} + t_n$
  of terms $t_i$. It can only be parsed with $t_1 + t_2 + \cdots + t_{n-1}$ as an expression and $t_n$ as a term.
43 / 44