
Doctor Syntax was the fictitious hero of several 19th century satirical poems by William Combe (1742–1823). The publications were illustrated by Thomas Rowlandson (1756–1827).

Rowlandson’s many comic illustrations offered humorous commentary on the political and social conventions of his day. Rowlandson is perhaps best known for his animated drawings of the misadventures of the infamous Dr. Syntax. The various escapades of the fictional English clergyman and schoolmaster, Dr. Syntax, are chronicled in three volumes or “Tours.” The Tours of Dr. Syntax were satires on a then contemporary series of picturesque adventures to various parts of England with an emphasis on landscape illustration and poetic verse.

- The tour of Doctor Syntax in search of the picturesque
- The second tour of Doctor Syntax in search of consolation
- The third tour of Doctor Syntax in search of a wife

Overview of Syntax

- Study of Language
- Structure
- Formal Languages
- Regular Expressions
- BNF, ∗Context Free Grammar, parse trees
- ∗Attribute Grammars, ∗Post Systems

Semiotics

Semiotics. Date: 1880. A general philosophical theory of signs and symbols that deals especially with their function in both artificially constructed and natural languages. The application of linguistic methods to other than natural languages. A philosophy that emphasizes the significance of language as universal.

In his 1946 book, Signs, Language, and Behavior, American philosopher Charles William Morris (1901–1979) gave these definitions for the branches of the science of signs:

- pragmatics: origin, uses and effects of signs with the behavior in which they occur;
- semantics: the signification of signs in all modes of signifying;
- syntax: combination of signs without regard for their specific significations or their relations to the behavior in which they occur.

Study of Language

- syntax is the form, structure of language
- semantics is the meaning of language
- pragmatics is origin, use, context of language

Distinction between semantics and syntax:

1. number versus numeral

2. character versus glyph

3. conjunction versus ∧

Syntax in NL

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

Lewis Carroll, “Jabberwocky”

We know certain things: for instance, that all this happened in the past and that there was more than one tove but only one wabe. We know these things because of the grammatical structure of the English language.

Syntax

Can you translate the poem “Jabberwocky” into other languages?

’Twas BRKPT and the I/O queue
Was SYMMING FASTRAND like the wind.
All idle was the CAU
As the last run had just FINNED.

“Beware the UNIVAC, my son,
Its FASTRAND and its high-speed drum,
And FIELDATA, and listen for
The CTMC’s hum.”

Harry Gilbert, mid 1970s


Arpawocky

Also, Arpawocky, RFC 527, 22 June 1973, by D. L. Covill

Twas brillig, and the Protocols
Did USER-SERVER in the wabe.
All mimsey was the FTP,
And the RJE outgrabe,

Beware the ARPANET, my son;
The bits that byte, the heads that scratch;
Beware the NCP, and shun
the frumious system patch,

Structure

The arrangement of elements of an entity and their relationships to each other.

- Linear structure: like links of a chain
- Hierarchical structure: like nested mixing bowls, or boxes within boxes
- Trees: like a family tree, or a railroad yard

Grammars are a way of describing these and more complex structures.

Structure


Even though it is invisible, structure has great importance, even meaning. Sentence structure is crucial to disambiguate the following sentences.

1. The chicken is ready to eat.
   Who is eating?

2. He saw that gasoline can explode.
   What is exploding?

3. Flying planes can be dangerous.
   What is dangerous?

Structure





[Figure: diagram relating the following areas and their concerns]

- programming languages: language design, description, semantics
- compilers: implementation, recognition
- formal languages: expressivity
- formal languages, automata: models
- computability: limits
- complexity: efficiency

Bad/Confusing Syntax In PL

The science of syntax is about unambiguous communication and as such is neutral. Let us at least be clear.

It is hard to define “good” and “bad” syntax. Only experience helps, and that takes time. The lessons of programming language design are ad hoc.

Bad/Confusing Syntax In PL

      DO 10 I = 1.5
         A(I) = X + B(I)
   10 CONTINUE

IF IF=THEN THEN THEN=ELSE; ELSE ELSE=IF
IF IF THEN THEN=ELSE; ELSE ELSE=0;

while ( ... ) {
   ...
   break;
   ...
}

Bad Syntax In Compilers

Not LL but is LR:

stm ::= "if" expr "then" stm "else" stm
      | "if" expr "then" stm

cf. Jacc FAQ 8.1

Lexical Structure Versus Phrase Structure

In lexical structure the alphabet used is a set of characters (e.g., Latin-1) and the goal is to categorize substrings into different tokens, each described by its own language.

In phrase structure, the alphabet used is a set of tokens, and one goal is to determine if the program is syntactically correct (i.e., is the phrase in the language or not). Actually the goal is broader than that: it is to identify the phrase structure or derivation of syntactically correct programs.

Three Views of a Program

- Program as Stream of Characters
- Program as Stream of Tokens
- Program as Tree Structure

Whitespace

Free-format as opposed to one statement per line. Earlier versions of FORTRAN had a fixed lexical structure of one statement per punch card, starting in column 7. COBOL and Basic also used to be fixed format. For the most part languages today permit the programmer to choose the white space (and hence the layout) of the program. Some layouts are clearly more legible than others, hence the need for style guidelines. Python, Haskell, and Occam require proper indenting.

Should a language specify the layout (use of white space)?

Formal Language Versus Natural Language

Natural language is full of unresolvable differences and “jagged edges.” (This makes it rich and interesting.)

Formal Language

First some definitions:

- object language, the language being studied, and meta-language, the language used to communicate
- an alphabet is a set of atomic symbols
- a string is a sequence of symbols
- a formal language is a set of strings

Unlike natural languages, for a formal language L it is unambiguous whether s ∈ L or s ∉ L.
The key challenge is to find a good (finite) representation of (infinite) sets.

Object and Meta Language

Object language versus meta-language. The object language is the language being studied, and the meta-language is the language being used to communicate.

For example, in the class French 101 one may study the French language; it is the object language. To communicate with the students it may be necessary for the teacher to speak in English. English is the meta-language in this case.

Alphabet

Alphabet. The alphabet is the finite set of symbols comprising the formal language. These are the basic, atomic, distinct elements.

Σ = {a,b,c,d}
Σ = {0,1}
Σ = {a,b, . . . ,z,0,1, . . . ,9}
Σ = Latin-1
Σ = Unicode
Σ = {if, then, (, ), ’,’, . . .}

Theoreticians like to use Σ = {0,1} because it is so simple. Sometimes the symbols of the alphabet may be chosen in a way to suggest their meaning or use. The symbols might even be composed of other sub-atomic symbols. Important character sets for programming languages to use and be written in are Latin-1, Latin-0, and Unicode, not just US-ASCII.

ISO 8859-15 (Latin-0)

NUL  SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
DLE  DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US

SP   !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
0    1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
@    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
P    Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
`    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
p    q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

XXX  XXX BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI  SS2 SS3
DCS  PU1 PU2 SST CCH MS  SPA EPA SOS XXX SCI CSI ST  OSC PM  APC

NBSP ¡   ¢   £   €   ¥   Š   §   š   ©   ª   «   ¬   SHY ®   ¯
°    ±   ²   ³   Ž   µ   ¶   ·   ž   ¹   º   »   Œ   œ   Ÿ   ¿
À    Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï
Ð    Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß
à    á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï
ð    ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ

Unicode Supports Many Scripts

Strings

String. Given an alphabet Σ = {a,b,c,d}. A string is an (ordered) sequence of symbols from the alphabet. E.g., aaa, bac, d.

The sequence may contain no symbols; this empty string is given a special notation "". (Some authors use ε, the Greek letter epsilon.)

The most important operation on strings is string concatenation, which is sometimes denoted by the · operator. E.g., ab · db = abdb, da · ba = daba.

More Strings

Given an alphabet Σ = {0,1}, the following are strings: 0, ε, 111, 010, 1.

Given an alphabet Σ = {a,b, . . . ,z}, the following are strings: ε, x, aeiou, hello, biopsy.

Using Latin-1 as the alphabet, the following are strings: ¡Hola!, Hola!, ¿Qué pasa?, "Hi. Bye", größter gemeinsamer Teiler.

Using the alphabet

Σ = {if, then, else, begin, end, (, ), ‘,’}

the following are strings:

1. else )
2. if then
3. begin begin , , end end

Formal Language

Formal language. A formal language is a set of strings over some alphabet.

Given an alphabet Σ = {0,1}, the following are formal languages:

L0 = ∅
L1 = {1}
L2 = {1, 0101}
L3 = {0, 01, 00}
L4 = {1, 01, 11, 001, 011, 101, 111, . . .}
L5 = {ε, 1, 11, 111, 1111, . . .}
L6 = {ε, 0, 1, 00, 01, 10, 11, 000, . . .} = Σ∗

The collection of legal programs in a programming language is also an example of a formal language.
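The idea that a formal language is just a set of strings, and that an infinite set needs a finite description, can be sketched in a few lines of Python. This is only an illustration: the names `strings_up_to`, `SIGMA`, and `in_L5` are invented here, a finite language such as L2 is written out as a set, and the infinite language L5 is represented by a membership test.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length <= max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=n):
            yield "".join(symbols)

SIGMA = {"0", "1"}

# A finite language can simply be listed.
L2 = {"1", "0101"}

def in_L5(s):
    """Finite description of the infinite language L5 = {ε, 1, 11, 111, ...}."""
    return all(symbol == "1" for symbol in s)

# Finite prefixes of two infinite languages, shortest strings first.
sigma_star_prefix = list(strings_up_to(SIGMA, 2))
L5_prefix = [s for s in strings_up_to(SIGMA, 4) if in_L5(s)]
```

Membership is always a yes-or-no question: `"0101" in L2` is true, and `in_L5("101")` is false.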

Chomsky Hierarchy

     name               machine
0    unrestricted       Turing machine, Post systems
1    context-sensitive  linear-bounded automata
2    context-free       pushdown automata, BNF
3    regular            finite automata, regular expressions

- Regular expressions are used to describe tokens in programming languages
- BNF, CFG used to describe syntax of programming languages
- [Context-sensitive languages play no role in programming languages]
- attribute grammars, Post systems used in semantics

Description Mechanisms

1. Regular expressions (Appel, page 19, but not in Webber or Mitchell; briefly mentioned in Sebesta)
2. BNF (Chapters 2 and 3 of Webber)

“The programming language Ada expresses computation.”
“The programming language Ada is notation for computation.”
“A program in Java denotes a computation or algorithm.”

“Regular expressions express formal languages (sets of strings).”
“Regular expressions are notation for formal languages.”
“A regular expression denotes a formal language.”

“BNF expresses formal languages.”
“BNF is notation for formal languages.”
“A BNF definition denotes a formal language.”

Now consider: a programming language (like Ada) is a formal language.
We can try to describe Ada using regular expressions or BNF.

Jeffrey E. F. Friedl. Mastering Regular Expressions, second edition. O’Reilly: Sebastopol, California, 2002. ISBN: 0-596-00289-0.

Regular Expressions

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

From a post by Jamie Zawinski to alt.religion.emacs in 1997.

Regular Expressions (Syntax)

1. Empty. ∅ is a regular expression.
2. Atom. Any single symbol a ∈ Σ is a regular expression.
3. Alternation. If r1 is a regular expression and r2 is a regular expression, then (r1 + r2) is a regular expression. [Maybe (r1|r2) is better.]
4. Concatenation. If r1 and r2 are regular expressions, then (r1 · r2) is a regular expression.
5. Closure. If r is a regular expression, then r∗ is a regular expression.

Regular Expressions (Examples)

a · b
a + b
a · (b + c)
a + (b · c)
a∗
a + (b · c)∗
(a + (b · c))∗

(using the alphabet Σ = {a,b,c})

(Omitting the “centered dot” and using | for alternation.)

ab
a | b
a(b | c)
a | (bc)
a∗
a | (bc)∗
(a | (bc))∗



011011 + 01(0 + 1)0 + (10)0∗

1 + (01)∗

(0 + (10)∗

((0 + (10)∗)∗

(using the alphabet Σ = {0,1} and omitting the · symbol)

Regular Expressions (Semantics)

That is what regular expressions look like (syntax). What do regular expressions mean (semantics)?

Each regular expression denotes a formal language (a set of strings). There are an infinite number of regular expressions. (Each one is finitely constructed.) Just like programs in programming languages. We must find a way to define the meaning or denotation of each regular expression.

So, we define a (recursive) function from regular expressions (as pieces of syntax) to sets of strings (languages).

Regular Expressions (Semantics)

1. Empty. The language with no strings.
2. Atom. The language with one string of length one: a itself. Note that the notation a is ambiguous. It stands for both the symbol and the language.
3. Alternation. (r1 + r2) is the union of the two languages.
4. Concatenation. (r1 · r2) is the set of all strings formed by a string from the first language followed by a string from the second.
5. Closure. r∗ is the set of all strings formed by concatenating zero, one, or more strings from the language of r.


a · b = {ab}
a + b = {a, b}
a · (b + c) = {ab, ac}
a + (b · c) = {a, bc}
a∗ = {ε, a, aa, aaa, . . .}
a + (b · c)∗ = {ε, a, bc, bcbc, . . .}
(a + (b · c))∗ = {ε, a, bc, aa, abc, bcbc, . . .}


Other Definitions

Some authors replace one of the five cases of the definition

1. Empty. ∅ denoting the language with no strings

with another case

1. Epsilon. ε denoting the set consisting of the single string “”

Do you see the difference?

Notice that ε is superfluous, as {ε} is represented by ∅∗. Can you prove this?

∗Regular Expressions (Semantics)

1. Empty.
   D⟦∅⟧ = {}

2. Atom.
   D⟦a_1⟧ = {a_1}
   ...
   D⟦a_n⟧ = {a_n}

3. Alternation.
   D⟦(r1 + r2)⟧ = D⟦r1⟧ ∪ D⟦r2⟧

4. Concatenation.
   D⟦(r1 · r2)⟧ = {x · y | x ∈ D⟦r1⟧, y ∈ D⟦r2⟧}
   where x · y is string concatenation.

5. Closure.
   D⟦r∗⟧ = ⋃_{i ≥ 0} (D⟦r⟧)^i
   where S^i is defined recursively as follows:
   S^0 = {ε}
   S^(i+1) = {x · y | x ∈ S, y ∈ S^i}
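The recursive definition of D can be transcribed almost clause by clause into a program. The following Python sketch is one possible rendering, not a standard implementation: regular expressions are assumed to be nested tuples with the invented tags "empty", "atom", "alt", "cat", and "star", and because r∗ usually denotes an infinite set, the function `denote` returns only the strings of length at most `max_len`.

```python
def denote(r, max_len):
    """Return the strings in the denotation of r that have length <= max_len."""
    kind = r[0]
    if kind == "empty":                       # 1. Empty: no strings at all
        return set()
    if kind == "atom":                        # 2. Atom: the one-symbol string
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                         # 3. Alternation: union
        return denote(r[1], max_len) | denote(r[2], max_len)
    if kind == "cat":                         # 4. Concatenation: pairwise x . y
        left, right = denote(r[1], max_len), denote(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                        # 5. Closure: union of all powers
        base = denote(r[1], max_len)
        result = {""}                         # S^0 = {ε}
        frontier = {""}
        while frontier:                       # build S^(i+1) from the new part of S^i
            frontier = {x + y for x in base for y in frontier
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression node: {kind!r}")

a, b, c = ("atom", "a"), ("atom", "b"), ("atom", "c")
example = ("star", ("alt", a, ("cat", b, c)))   # (a + (b . c))*
```

With these definitions, `denote(example, 3)` yields the seven strings ε, a, aa, aaa, bc, abc, and bca, and `denote(("star", ("empty",)), 5)` yields only the empty string.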

∗Using the Definitions

D⟦(a · b)⟧ = {x · y | x ∈ D⟦a⟧, y ∈ D⟦b⟧}
           = {x · y | x ∈ {a}, y ∈ {b}}
           = {a · b}
           = {ab}

D⟦(a + (a · b))⟧ = D⟦a⟧ ∪ D⟦(a · b)⟧
                 = {a} ∪ {ab}
                 = {a, ab}

D⟦(a + (b · c))∗⟧ = ⋃_{i ≥ 0} (D⟦(a + (b · c))⟧)^i
                  = {ε} ∪ {a, bc} ∪ (D⟦(a + (b · c))⟧)^2 ∪ . . .
                  = {ε, a, bc} ∪ {aa, abc, bca, bcbc} ∪ . . .
                  = {ε, a, aa, bc, abc, bca, bcbc} ∪ . . .
                  = {ε, a, aa, bc, aaa, abc, bca, aabc, abca, bcaa, bcbc, . . .}

∗ Digression: Induction

Is the recursively defined function D well-defined? Yes.

When are such recursively defined functions well-defined? Over freely generated sets.

Consider the following inductive definition of a subset M of Nat, the natural numbers, using integer multiplication ·:

- 1 ∈ M,
- if n ∈ M, then 9 · n ∈ M,
- if n ∈ M, then 23 · n ∈ M.

Suppose we define the function g : M → Nat inductively by:

- g(1) = 1,
- g(9 · n) = 9,
- g(23 · n) = 23.

Prove that 0=1!
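A small experiment can show where the trouble with g comes from. The Python sketch below uses the invented helper names `in_M` and `g_values`; it collects every value that the three clauses for g would assign to a number. A number that can be built in two different ways gets two values, which is exactly what cannot happen over a freely generated set.

```python
def in_M(n):
    """Membership in M: n is 1, or 9 times or 23 times a member of M."""
    if n < 1:
        return False
    if n == 1:
        return True
    return (n % 9 == 0 and in_M(n // 9)) or (n % 23 == 0 and in_M(n // 23))

def g_values(n):
    """Collect every value that some clause of the definition of g gives to n."""
    values = set()
    if n == 1:
        values.add(1)
    if n % 9 == 0 and in_M(n // 9):
        values.add(9)       # clause g(9 * n) = 9
    if n % 23 == 0 and in_M(n // 23):
        values.add(23)      # clause g(23 * n) = 23
    return values
```

For 81 = 9 · 9 there is a single value, but 207 = 9 · 23 = 23 · 9 is reached by both clauses, so `g_values(207)` contains both 9 and 23. Treating g as a function would force 9 = 23, and from there 0 = 1 follows by arithmetic.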

More Regular Expressions

To make regular expressions more convenient, regular expressions are almost always extended with new notation. Here are some additional meta-symbols commonly seen (perhaps with different syntax). Regular expressions with these new meta-symbols can be defined in terms of the original definitions. Let r be a regular expression over the alphabet Σ.

1. Optional. r? = (r + ∅∗)
2. One or more. (This is a second and conflicting use of the meta-character +.) r+ = (r · r∗)
3. Any. . = (a + b + . . . + y + z) where Σ = {a, b, . . . , y, z}.
4. Range. [a−z] = (a + b + . . . + y + z). (Assumes that Σ is ordered.)
5. Range complement. [¬c−x] = (a + b + y + z). (Assumes that Σ is ordered.)
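These equivalences can be spot-checked with a practical regular expression library. The sketch below uses Python's `re` module, whose syntax differs from the notation above: alternation is written `|`, there is no symbol for ∅, and so ∅∗ (the language {ε}) is written as an empty alternative. The helper `same_language` is an invented name, and it only compares the two patterns on a handful of candidate strings, so it is a test rather than a proof.

```python
import re

def same_language(pattern1, pattern2, candidates):
    """Check that two patterns accept exactly the same candidate strings."""
    return all(
        bool(re.fullmatch(pattern1, s)) == bool(re.fullmatch(pattern2, s))
        for s in candidates
    )

candidates = ["", "a", "aa", "aaa", "b", "ab", "m", "z"]

optional_ok = same_language(r"a?", r"(a|)", candidates)       # r? = (r + ∅*)
plus_ok = same_language(r"a+", r"aa*", candidates)            # r+ = (r . r*)
range_ok = same_language(r"[a-c]", r"(a|b|c)", candidates)    # range as alternation

# Range complement, taking the alphabet to be the lowercase letters a to z.
letters = ["a", "b", "c", "m", "x", "y", "z"]
complement_ok = same_language(r"[^c-x]", r"(a|b|y|z)", letters)
```

All four checks come out true on these candidates, which is consistent with the claim that the extended notation adds convenience but no expressive power.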

Where are regular expressions used?

[Figure: comic from xkcd, a webcomic of romance, sarcasm, math, and language by Randall Munroe]

Grep

The program grep is a command-line utility for searching text, originally written for Unix. The grep command searches text files for lines matching a given regular expression and prints the matching lines to the program's standard output.

Grep

Suppose the file preamble.txt has the following lines:

We the People of the United States, in Order to form a more perfect
Union, establish Justice, insure domestic Tranquility, provide for the
common defence, promote the general Welfare, and secure the Blessings
of Liberty to ourselves and our Posterity, do ordain and establish
this Constitution for the United States of America.

> grep and preamble.txt
common defence, promote the general Welfare, and secure the Blessings
of Liberty to ourselves and our Posterity, do ordain and establish

> grep -i people preamble.txt
We the People of the United States, in Order to form a more perfect

> grep -w do preamble.txt
of Liberty to ourselves and our Posterity, do ordain and establish

(But does not match “domestic.”)

> grep -E -w "(defense)|(defence)" preamble.txt
common defence, promote the general Welfare, and secure the Blessings
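What grep does with each line can be imitated in a few lines of Python. This is a rough sketch rather than a reimplementation: the function `grep` and its keyword arguments are invented for the illustration, Python's `re` syntax is not identical to grep's, and the `-w` option is only approximated with the word-boundary assertion `\b`.

```python
import re

PREAMBLE = """\
We the People of the United States, in Order to form a more perfect
Union, establish Justice, insure domestic Tranquility, provide for the
common defence, promote the general Welfare, and secure the Blessings
of Liberty to ourselves and our Posterity, do ordain and establish
this Constitution for the United States of America.
""".splitlines()

def grep(pattern, lines, ignore_case=False, word=False):
    """Return the lines that contain a match for pattern."""
    if word:                                   # rough analogue of grep -w
        pattern = r"\b(?:" + pattern + r")\b"
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in lines if regex.search(line)]
```

For example, `grep("do", PREAMBLE, word=True)` returns only the fourth line; the second line contains “domestic” but not the word “do”.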

Practical Regular Expressions

grep options regexp filename

Options:

-E   extended regular expression
-i   ignore case
-x   match whole line
-w   match word in line

Syntax of regular expressions for grep

|      or                $       end of line
.      any               ?       optional
[]     set               {n,m}   n through m times
[^ ]   set complement    ()      grouping

grep

grep -E '.{5,}' /usr/dict/words

Aarhus
Aaron
Ababa
aback
abacus
...
zombie
zoology
Zoroastrian
zucchini
Zurich
zygote

grep

grep -i -E '[aeiou]{4,}' /usr/dict/words

aqueous
Hawaiian
IEEE
obsequious
onomatopoeia
pharmacopoeia
prosopopoeia
queue
Sequoia

grep

grep -i -E '(q[^u]|q$)' /usr/dict/words

CEQ
Colloq
IQ
Iraq
q
Qatar
QED
q's
seq

grep

Is there a regular expression that matches those words whose letters appear in alphabetical order?

grep -x -E 'a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?' /usr/dict/words

almost
begin
below
biopsy
dirty
empty
first
forty
ghost
glory

(Some of the longer words.)











grep

Can we modify the previous example to allow double letters?

grep -x -E 'a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*' /usr/dict/words

accent
access
almost
biopsy
chilly
choosy
effort
floppy
glossy
knotty
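Both alphabetical-order patterns are long but completely regular in shape, so they can be generated rather than typed. The Python sketch below builds them from the lowercase alphabet and uses `fullmatch` in the role of grep's `-x` option; the names `ALPHABETICAL` and `WITH_DOUBLES` and the short word list are made up for the illustration.

```python
import re
import string

# a?b?c?...z? : each letter at most once, in alphabetical order
ALPHABETICAL = re.compile("".join(letter + "?" for letter in string.ascii_lowercase))

# a*b*c*...z* : the same, but repeated letters are allowed
WITH_DOUBLES = re.compile("".join(letter + "*" for letter in string.ascii_lowercase))

words = ["almost", "biopsy", "chilly", "effort", "syntax", "zebra"]

strict = [w for w in words if ALPHABETICAL.fullmatch(w)]
relaxed = [w for w in words if WITH_DOUBLES.fullmatch(w)]
```

On this list the strict pattern accepts “almost” and “biopsy”, and the relaxed pattern additionally accepts “chilly” and “effort”.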


grep

grep -E -e '(y.*){3,}' /usr/dict/wordsA

The words with at least three y’s.

polytypy            chromosomal variation between populations
psychophysiology
synonymy
syzygy              straight-line alignment of three celestial bodies

Back references

grep -E -e '(.)\1\1' /usr/dict/words

AAA
AAAS
IEEE
iii
viii
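The same back reference works in Python's `re` module, where `\1` refers to whatever the first parenthesized group matched. In this small sketch the name `TRIPLE` and the word list are invented; the pattern looks for any character repeated three times in a row.

```python
import re

TRIPLE = re.compile(r"(.)\1\1")   # some character, then the same character twice more

words = ["AAA", "AAAS", "IEEE", "viii", "bookkeeper", "aardvark"]
hits = [w for w in words if TRIPLE.search(w)]
```

“bookkeeper” has three doubled letters but no tripled one, so it is not among the hits. Back references are an extension: with them a pattern can describe sets of strings that are not regular languages in the formal sense used in these notes.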

PERL

Perl supports the usual regular expressions:

\       Quote the next metacharacter
^       Match the beginning of the line
.       Match any character (except newline)
$       Match the end of the line (or before newline at the end)
|       Alternation
(..)    Grouping
[..]    Character class

*       Match 0 or more times
+       Match 1 or more times
?       Match 1 or 0 times
{n}     Match exactly n times
{n,}    Match at least n times
{n,m}   Match at least n but not more than m times

PERL

PERL supports “reluctant” and “possessive” quantifiers in addition to the “greedy” quantifiers. PERL has additional boundary matchers, Unicode character class operators, and look-ahead and look-behind matchers. Some of these require exponential back-tracking.

\p{property}   Match a character with Unicode property
\P{property}   Match a character without Unicode property
(?=pattern)    A zero-width positive look-ahead assertion
(?!pattern)    A zero-width negative look-ahead assertion
(?<=pattern)   A zero-width positive look-behind assertion
(?<!pattern)   A zero-width negative look-behind assertion

For more information see the documentation at www.perldoc.com.

http://www.perldoc.com/perl5.6.1/pod/perlre.html

Java

Java supports regular expressions in the class java.util.regex.Pattern. It has character class operations for union and intersection.

[...]           Character class
[^...]          Complement
[...[...]]      Union
[...&&[...]]    Intersection
[...&&[^...]]   Subtraction

http://java.sun.com/j2se/1.4.1/docs/api/java/util/regex/Pattern.html

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

// Greedy quantifiers
String match = find("A.*c", "AbcAbc");   // AbcAbc
match = find("A.+", "AbcAbc");           // AbcAbc

// Nongreedy quantifiers
match = find("A.*?c", "AbcAbc");         // Abc
match = find("A.+?", "AbcAbc");          // Ab

// Find first substring in input that matches pattern.
static String find (String pat, CharSequence input) {
   final Pattern pattern = Pattern.compile(pat);
   final Matcher matcher = pattern.matcher(input);
   if (matcher.find()) {
      return matcher.group();
   }
   return null;   // no match found
}

Using regular expressions to match comments often runs into two problems:

- The “every character” pattern often does not match the characters indicating a newline.

  /* The first line,
  but not the second line. */

- The greedy “star” will match more than what was intended.

  x = 1; /* Both comments */ x = 2; /* at once! */ x = 3;

final static String commentMultilineString = "/\\*.*?\\*/";
final static Pattern multiLineCommentPattern =
   Pattern.compile (commentMultilineString, Pattern.DOTALL);
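Both problems, and the fix used in the Java pattern above, can be reproduced with Python's `re` module, where the flag corresponding to Pattern.DOTALL is `re.DOTALL`. The variable names in this sketch are illustrative only.

```python
import re

source = "x = 1; /* Both comments */ x = 2; /* at once! */ x = 3;"

# Greedy star: runs from the first "/*" to the last "*/".
greedy = re.findall(r"/\*.*\*/", source)

# Reluctant star: stops at the first "*/" after each "/*".
lazy = re.findall(r"/\*.*?\*/", source)

two_lines = "/* The first line,\nbut not the second line. */"

# Without DOTALL the dot does not match the newline, so nothing is found.
without_dotall = re.findall(r"/\*.*?\*/", two_lines)

# With DOTALL the dot matches every character, including the newline.
with_dotall = re.findall(r"/\*.*?\*/", two_lines, flags=re.DOTALL)
```

Here `greedy` holds one over-long match that swallows `x = 2;`, while `lazy` holds the two comments separately.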

Tokens

Regular expressions are used in the definition of programming languages to describe the basic lexical units: tokens.

1. NUMBER: [0-9]+
2. IDENTIFIER: [a-zA-Z][a-zA-Z0-9]∗
3. a keyword: begin
4. a hex number: 0x[0-9A-F]+
5. REAL: [0-9]+'.'[0-9]+(E[0-9]+)?

Metacharacters: [ ] + ' - ( ) ? |

Tools

Token Stream

Given a program

float match0 (char *s) /* find a zero */
{ if (!strncmp (s, "0.0", 3))
     return 0.;
}

the lexical analyzer will return the stream

FLOAT ID(MATCH0) LPAREN CHAR STAR ID(S) RPAREN LBRACE
IF LPAREN BANG ID(STRNCMP) LPAREN ID(S) COMMA STRING(0.0)
COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI
RBRACE EOF

Lexical Tokens

An identifier is a sequence of letters and digits; the first character must be a letter. The underscore counts as a letter. Upper- and lowercase letters are different. If the input stream has been parsed into tokens up to a given character, the next token is taken to include the longest string of characters that could possibly constitute a token. Blanks, tabs, newlines, and comments are ignored except as they serve to separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants.
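The “longest string of characters” rule can be sketched as a small tokenizer. The Python code below is a minimal illustration, not how a production lexer is built: the token names and patterns follow the five token definitions given earlier in these notes, the table `TOKEN_SPEC` and the function `tokenize` are invented names, and ties in length are broken in favor of the pattern listed first, so that the keyword wins over IDENTIFIER.

```python
import re

# Order matters only for ties: earlier entries win when lengths are equal.
TOKEN_SPEC = [
    ("BEGIN", r"begin"),
    ("REAL", r"[0-9]+\.[0-9]+(E[0-9]+)?"),
    ("HEX", r"0x[0-9A-F]+"),
    ("NUMBER", r"[0-9]+"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]

def tokenize(text):
    """Split text into (token name, lexeme) pairs using the longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # white space only separates tokens
            pos += 1
            continue
        best_name, best_lexeme = None, ""
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            if match and len(match.group()) > len(best_lexeme):
                best_name, best_lexeme = name, match.group()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}: {text[pos]!r}")
        tokens.append((best_name, best_lexeme))
        pos += len(best_lexeme)
    return tokens
```

Because of the longest-match rule, `beginning` is a single IDENTIFIER rather than the keyword `begin` followed by `ning`, and `3.14E2` is one REAL rather than a NUMBER followed by leftovers.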

Regular expressions

1. How do you recognize if a string matches a regular expression?

2. How do you tokenize a character string?

These questions are left to a compiler construction class.

Expressiveness of RE

We like RE because they describe strings. RE are great: easily understandable, concise, recognizers are easy to make, . . .

But, the major shortcoming of regular expressions, as far as the description of programming languages is concerned, is that bracketing is not expressible. Bracketing is indispensable. For example, arithmetic expressions require opening and closing parentheses. Sequences of statements are often bracketed by begin/end pairs.

The set of strings

{a^n · b^n | n ≥ 0} = {ε, ab, aabb, aaabbb, . . .}

is not regular!


BNF (Backus-Naur Form)

Conventions (e.g., tokens in bold font) and meta-symbols:

::=   “is defined as”
|     “or”
[]    “optional” (not necessary, EBNF)
{}    “zero or more” (not necessary, EBNF)

BNF is a collection of rules or productions. Each rule has a LHS and a RHS. The LHS is a syntactic category (a part of the language), called a nonterminal, and the RHS is a sequence of nonterminals and tokens (called terminals).

LHS1 ::= RHS1
LHS2 ::= RHS2
...
LHSn ::= RHSn

Examples and Derivations

Grammar 1 and derivation: “revolutionary ideas”
Example 3.1, page 129. Program with statement lists
Example 3.2, page 131. Simple assignment statements
Sebesta, seventh edition

Example Grammar

sentence ::= subject predicate .

subject ::= article {adjective} noun

predicate ::= verb [adverb ]

article ::= A | The

adjective ::= green | colorless | big | new | revolutionary

noun ::= caterpillar | idea

verb ::= sleeps | appears | crawls

adverb ::= furiously | infrequently

Derivation

Sentential form. A sentential form is a string of terminal and nonterminal symbols.

Derivation. A derivation of a sentential form is created from a nonterminal by repeatedly replacing the LHS nonterminal of some rule with its RHS.

Every BNF definition is a description of a formal language (a set of strings), the language of all sentences derivable from a syntactic category.

We derive two sentences used as examples by Noam Chomsky: one sentence is meaningful, the other makes no sense.

Example Derivation

sentence
   ⇒1 subject predicate .
   ⇒2 article adjective adjective noun predicate .
   ⇒3 A adjective adjective noun predicate .
   ⇒4 A new adjective noun predicate .
   ⇒5 A new revolutionary noun predicate .
   ⇒6 A new revolutionary idea predicate .
   ⇒7 A new revolutionary idea verb adverb .
   ⇒8 A new revolutionary idea appears adverb .
   ⇒9 A new revolutionary idea appears infrequently .

Example Derivation

sentence
   ⇒1 subject predicate .
   ⇒2 article adjective adjective noun predicate .
   ⇒3 A adjective adjective noun predicate .
   ⇒4 A colorless adjective noun predicate .
   ⇒5 A colorless green noun predicate .
   ⇒6 A colorless green idea predicate .
   ⇒7 A colorless green idea verb adverb .
   ⇒8 A colorless green idea sleeps adverb .
   ⇒9 A colorless green idea sleeps furiously .
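A derivation is mechanical enough to replay in code. In the Python sketch below a sentential form is a list of symbols and each step replaces the leftmost occurrence of a nonterminal with a chosen right-hand side. The function `derive` and the list `steps` are invented for this illustration, and the sketch does not check that each step is actually a rule of the grammar; the steps are simply the nine choices made in the second derivation.

```python
def derive(start, steps):
    """Replay a derivation and return every sentential form along the way."""
    form = [start]
    history = [" ".join(form)]
    for lhs, rhs in steps:
        i = form.index(lhs)            # leftmost occurrence of the nonterminal
        form[i:i + 1] = rhs            # replace it with the right-hand side
        history.append(" ".join(form))
    return history

steps = [
    ("sentence", ["subject", "predicate", "."]),
    ("subject", ["article", "adjective", "adjective", "noun"]),
    ("article", ["A"]),
    ("adjective", ["colorless"]),
    ("adjective", ["green"]),
    ("noun", ["idea"]),
    ("predicate", ["verb", "adverb"]),
    ("verb", ["sleeps"]),
    ("adverb", ["furiously"]),
]

history = derive("sentence", steps)
```

The list `history` has ten entries: the start symbol followed by the nine sentential forms, the last of which contains only terminals.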

Syntax versus Semantics

Two sentences (with the same structure, syntax) derived from the grammar.

A new revolutionary idea appears infrequently .
A colorless green idea sleeps furiously .

One sentence is meaningful, the other is meaningless. Clearly, the meaning of the words (semantics) is important.

Examples and Derivations

∗ Grammar 1 and derivation: “revolutionary ideas”
Example 3.1, page 112. Program with statement lists
Example 3.2, page 113. Simple assignment statements
Sebesta, fifth edition

Alternate Styles of BNF

“traditional” style:
〈traditional style〉 ::= key
Nonterminals in angle brackets, terminals in bold font.

CFG or “mathematical” style:
A → Ba
Nonterminals in uppercase letters, terminals in lowercase letters.

“verbatim” style:

verbatim_style ::= "key"

ISO 14977 is much like verbatim style.

Wirth, “What can we do about the unnecessary diversity of notation for syntactic definitions?” CACM, volume 20, number 11, November 1977, page 822.

Syntax Graphs

In addition to linear forms of BNF there are also syntax graphs, also known as syntax diagrams.

See Chapter 2, page 21 of Webber.
Figure 3.6, page 122 of Sebesta, fifth edition (not in sixth edition).
See Hyper GOS syntax diagram generator.

http://cui.unige.ch/db-research/Enseignement/analyseinfo/BNFweb.html

Syntax Graphs

if_statement ::=
   "if" condition "then"
      sequence_of_statements
   { "elsif" condition "then"
      sequence_of_statements }
   [ "else"
      sequence_of_statements ]
   "end" "if" ";"

∗Grammar

A grammar is a 4-tuple 〈T, N, P, S〉

- T is the set of terminal symbols,
- N is the set of nonterminal symbols, T ∩ N = ∅,
- S, a nonterminal, is the start symbol,
- P are the productions of the grammar.

A production has the form α → β where α and β are strings of terminals and nonterminals (α can’t be the empty string). We are primarily interested in context-free grammars in which α is a single nonterminal.

∗Example 1

Let G be the grammar defined by the following productions.

S → AB

A → ε

A → Aa

B → ε

B → B b

(We infer easily that T, the set of terminals, is {a,b}; N, the set of nonterminals, is {S,A,B}; and S is the start symbol.)

L(G) = {a^i b^j | i, j ≥ 0}

∗Example 2

Let G be the grammar defined by the following productions.

S → ε

S → aS b

(We infer easily that T, the set of terminals, is {a,b}; N, the set of nonterminals, is {S}; and S is the start symbol.)

L(G) = {a^i b^i | i ≥ 0}
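The claims about the languages of the two example grammars can be checked on short strings by exploring derivations mechanically. The Python sketch below assumes that nonterminals are single uppercase letters and terminals are single lowercase letters; `language`, `G1`, and `G2` are names invented here. It always expands the leftmost nonterminal and stops extending a sentential form once it holds more than `max_len` terminals. The search terminates for these two grammars because every recursive production adds a terminal; it would not terminate for an arbitrary grammar.

```python
from collections import deque

def language(productions, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start."""
    seen, found = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None for a sentence.
        i = next((k for k, ch in enumerate(form) if ch.isupper()), None)
        if i is None:
            found.add(form)
            continue
        for rhs in productions[form[i]]:
            new_form = form[:i] + rhs + form[i + 1:]
            terminals = sum(1 for ch in new_form if ch.islower())
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found

# Example 1: S -> AB, A -> ε | Aa, B -> ε | Bb
G1 = {"S": ["AB"], "A": ["", "Aa"], "B": ["", "Bb"]}

# Example 2: S -> ε | aSb
G2 = {"S": ["", "aSb"]}
```

Up to length 2 the first grammar generates ε, a, b, aa, ab, and bb (but not ba); up to length 4 the second generates only ε, ab, and aabb.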

∗What’s a Grammar?

A grammar is a description of a language. Every grammar denotes a formal language (a set of strings). A string of terminals ω is in the language if it is derivable from the start symbol.

We say γαδ ⇒ γβδ (derives in one step) whenever α → β is a production. We write α ⇒∗ β for the reflexive and transitive closure of the ⇒ relation. A string ω of terminals is in the language defined by grammar G if S ⇒∗ ω. So, L(G) = {ω ∈ T∗ | S ⇒∗ ω}. In this case the string ω is called a sentence of G. If S ⇒∗ α where α may contain nonterminals, then we say α is a sentential form of the grammar G.

∗Chomsky Hierarchy

     name               grammar
0    unrestricted
1    context-sensitive  |β| ≥ |α|
2    context-free       α ∈ N
3    regular            A → ωB, A → ω

Same as BNF

Context-free grammar is the same as BNF.

S ::= AB
A ::=
A ::= Aa
B ::=
B ::= B b

OR

S ::= AB
A ::= Aa | ε
B ::= B b | ε

Parse Tree

A formal definition of a parse tree for a context-free grammar is possible, but it is not illuminating. But the trees in the figures are suggestive. Each production used in the derivation of a string appears as a subtree in the diagram. The left-hand-side nonterminal appears as a node, and all the grammar symbols in the right-hand side of the production appear as children of this node.

An example from Sebesta, 7th, Example 3.2, page 131.
A grammar for simple assignment statements

assign ::= id := expr

expr   ::= expr + expr
         | expr ∗ expr
         | ( expr )
         | id

id     ::= A | B | C

Example Derivation

assign ⇒ id := expr
       ⇒ A := expr
       ⇒ A := id ∗ expr
       ⇒ A := B ∗ expr
       ⇒ A := B ∗ ( expr )
       ⇒ A := B ∗ ( id + expr )
       ⇒ A := B ∗ ( A + expr )
       ⇒ A := B ∗ ( A + id )
       ⇒ A := B ∗ ( A + C )


Abstract Syntax Tree

Concrete parse trees may have a lot of extra stuff and are inconvenient to use directly.
(Webber, Section 3.8 Abstract Syntax Trees, page 38.)

Concrete Parse Tree

Abstract Syntax Tree

A := B * (A + C). expr can be replaced by the kind of expression: +, *. id contains no information. Parens are no longer needed.

Parse Tree

If A := (B * A) + C was the statement, then the abstract syntax would be different and so no important information would be lost.
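One simple way to see that the abstract syntax tree keeps the grouping even though the parentheses are gone is to write the two trees down as nested tuples and evaluate them. This Python sketch uses an invented representation, (operator, left, right), and invented helper names `evaluate` and `assign`; it is not taken from any of the cited textbooks.

```python
# A := B * (A + C)
ast_1 = (":=", "A", ("*", "B", ("+", "A", "C")))

# A := (B * A) + C
ast_2 = (":=", "A", ("+", ("*", "B", "A"), "C"))

def evaluate(expr, env):
    """Evaluate an expression tree; leaves are variable names looked up in env."""
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    l, r = evaluate(left, env), evaluate(right, env)
    return l + r if op == "+" else l * r

def assign(stmt, env):
    """Return a new environment after carrying out an assignment tree."""
    _, target, expr = stmt
    return {**env, target: evaluate(expr, env)}

env = {"A": 2, "B": 3, "C": 4}
```

With these values the first tree assigns 3 · (2 + 4) = 18 to A and the second assigns (3 · 2) + 4 = 10, so the shape of the tree alone carries the information that the parentheses carried in the text.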

Dangling else

An ambiguous grammar is one in which some sentence has more than one parse tree. [Sebesta, 3.3.1.7 Ambiguity, page 132ff.]

The grammar of Example 3.3 [Sebesta, 7th, page 132], simple expressions, is ambiguous.

Also, this grammar of if/then/else statements is ambiguous.
Sebesta, 7th, 3.3.1.10 An Unambiguous Grammar for if-then-else, page 137; Figure 3.5, page 138.

Two Parse Trees

[Convert from PS tricks.]

Equivalent Grammar

Just because a grammar is ambiguous does not mean that all grammars for the same language are ambiguous.

S → if C then S | if C then S else S | S′

This grammar for the same language is not.

S → MS | UMS

MS → if C then MS else MS | S′

UMS → if C then S | if C then MS else UMS

Alternative Language

If the “natural” grammar is ambiguous, could it mean that the construct is confusing to programmers? Is another approach possible?

Many languages have redesigned the syntax of the if statement:

S → if C then S endif
  | if C then S else S endif
  | S′

The two statements with different structure now have different syntax.

if C1 then if C2 then S1 endif else S2 endif
if C1 then if C2 then S1 else S2 endif endif

Tools

Recursive Descent Parsers

Grammars in the right form permit a parser made of simple recursive functions. Each nonterminal turns into a function, and together they form a set of mutually recursive functions. Each function does a case analysis on the input to determine which production/RHS to follow. Terminals in the RHS are matched (consumed) and nonterminals are turned into recursive calls.

Java Applet demo:
http://cswebsrv.cs.binghamton.edu/~zdu/parsdemo/recframe.html
http://www.uni-paderborn.de/fachbereich/AG/agkastens/compiler/parsdemo/recframe.html

See example in Figure 2.10 in Scott.
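As a concrete illustration, here is a minimal recursive descent parser in Python for the endif form of the if statement, S → if C then S endif | if C then S else S endif | S′. Several things are assumptions made for the sketch: tokens are separated by white space, any word that is not a keyword counts as a condition or a simple statement, and the names `parse`, `parse_statement`, and `expect` are invented. The grammar has a single nonterminal, so there is a single recursive function.

```python
KEYWORDS = {"if", "then", "else", "endif"}

def expect(tokens, pos, word):
    """Match (consume) the terminal `word` and return the next position."""
    if pos >= len(tokens) or tokens[pos] != word:
        found = tokens[pos] if pos < len(tokens) else "end of input"
        raise SyntaxError(f"expected {word!r} at token {pos}, found {found!r}")
    return pos + 1

def parse_statement(tokens, pos=0):
    """Parse one S starting at tokens[pos]; return (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":                    # case S': a simple statement
        if tokens[pos] in KEYWORDS:
            raise SyntaxError(f"unexpected keyword {tokens[pos]!r} at token {pos}")
        return tokens[pos], pos + 1
    pos = expect(tokens, pos, "if")
    if pos >= len(tokens):
        raise SyntaxError("missing condition after 'if'")
    condition, pos = tokens[pos], pos + 1
    pos = expect(tokens, pos, "then")
    then_part, pos = parse_statement(tokens, pos)      # recursive call for S
    else_part = None
    if pos < len(tokens) and tokens[pos] == "else":    # choose the production
        else_part, pos = parse_statement(tokens, pos + 1)
    pos = expect(tokens, pos, "endif")
    return ("if", condition, then_part, else_part), pos

def parse(text):
    """Parse a whole statement and insist that all input is consumed."""
    tokens = text.split()
    tree, pos = parse_statement(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at token {pos}")
    return tree
```

Parsing `if C1 then if C2 then S1 endif else S2 endif` attaches the else part to the outer if, while `if C1 then if C2 then S1 else S2 endif endif` attaches it to the inner one: the two structures now have different trees because they have different syntax.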


Limitations of BNF/CFG

〈block〉 ::= 〈block identifier〉 :
            begin 〈statement〉 {〈statement〉}
            end 〈block identifier〉 ;

〈call〉 ::= 〈identifier〉 ( {〈expression〉} )

Beyond BNF/CFG

Formalisms with greater expressive power

- unrestricted grammars
- Post systems
- attribute grammars

Such formalisms are used in describing semantics, a later topic.
Notice the pragmatic definition of syntax (at or below context-free) and semantics (beyond context-free).

The set of strings

{a^n · b^n · c^n | n ≥ 0} = {ε, abc, aabbcc, aaabbbccc, . . .}

is not context free!
