lillian-hines
Syntax
The Structure of a Language
Lexical Structure
• The structure of the tokens of a programming language
• The scanner takes a sequence of characters and collects them into tokens
Tokens
• Reserved words (keywords)– if while
• Literals or constants– 3.14 “Fred”
• Special symbols– + =
• Identifiers
Principle of Longest Substring
• At each point, the longest possible string is collected into a single token
• Natural token separators
– Token separators: ; + =
– White space: spaces and tabs, newlines, comments
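The principle of longest substring can be sketched as a small scanner. This is only an illustration: the token classes, the `TOKEN_SPEC` table, the `KEYWORDS` set and the `scan` function are all hypothetical names chosen here, not part of any particular compiler. At each position the scanner tries every token pattern and keeps the longest match, so `ifx` becomes one identifier rather than the keyword `if` followed by `x`.

```python
import re

# Hypothetical token classes for illustration only.
# Inside one pattern, Python's "|" takes the first alternative that matches,
# so two-character symbols are listed before their one-character prefixes.
TOKEN_SPEC = [
    ("NUMBER", re.compile(r"[0-9]+(?:\.[0-9]+)?")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SYMBOL", re.compile(r"==|<=|[+=;<]")),
    ("SPACE", re.compile(r"\s+")),
]
KEYWORDS = {"if", "while"}


def scan(text):
    """Collect characters into tokens, keeping the longest match at each point."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_text = None, ""
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match and len(match.group()) > len(best_text):
                best_kind, best_text = kind, match.group()
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos += len(best_text)
        if best_kind == "SPACE":
            continue  # white space only separates tokens
        if best_kind == "IDENT" and best_text in KEYWORDS:
            best_kind = "KEYWORD"  # reserved words look like identifiers
        tokens.append((best_kind, best_text))
    return tokens
```

For example, `scan("if x <= 10")` yields a keyword, an identifier, the two-character symbol `<=`, and a number.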
FORTRAN violates these rules
• DO 99 I = 1.10– Assigns 1.10 to the variable DO99I
• DO 99 I = 1,10– Sets up a loop with loop counter I going from 1 to 10
• FORTRAN has no reserved words at all
C token conventions
• Six classes of tokens– Identifiers– Keywords– Constants– String literals– Operators– Other separators
• White space characters are ignored except as they separate tokens
• Adheres to the principle of longest substring
Regular Expressions
• Regular expressions were invented by Stephen Kleene and appeared in a Rand Corporation report in about 1950
• Regular expressions represent a form of language definition
• Each regular expression E denotes a language L(E) defined over the alphabet of the language
Rules defining REs
• Empty is a RE
• Atom– Any symbol from the alphabet is a RE
• Alternation– If a and b are REs then so is a|b– All strings identified by a and all those identified by b
• Concatenation– If a and b are REs then so is ab– All strings formed by concatenating a string identified by b to
the end of one identified by a
More rules for REs
• Kleene Closure– If a is an RE then so is a*– All strings formed by concatenating zero or
more strings identified by a
• Positive Closure– If a is an RE then so is a+– All strings formed by concatenating one or
more strings identified by a
Examples of REs
• (a|b)c– Recognizes ac and bc but no others
• (a|b)*c– Recognizes c ac bc aac abc abac
• (a|b)+c– Does not recognize c but all the others above
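These three expressions can be checked with Python's `re` module, whose syntax for alternation and the two closures happens to coincide with the notation used here. The `candidates` list and the `recognized` helper are names made up for this sketch; `ca` is added as a string that none of the expressions should accept.

```python
import re

# Strings from the slide, plus "ca" as a negative case.
candidates = ["c", "ac", "bc", "aac", "abc", "abac", "ca"]


def recognized(pattern, strings):
    """Return the strings that the whole pattern matches (not just a prefix)."""
    return [s for s in strings if re.fullmatch(pattern, s)]


single = recognized(r"(a|b)c", candidates)   # exactly one a or b, then c
star = recognized(r"(a|b)*c", candidates)    # zero or more a/b, then c
plus = recognized(r"(a|b)+c", candidates)    # one or more a/b, then c
```

`re.fullmatch` is used rather than `re.match` so that a string is accepted only when the expression describes all of it.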
Extensions
• [] – any one of a set of characters– [A-Z] – any capital letter– [0123456789] – any digit
• ? – an optional item (0 or 1 of these)– [A-Z][0-9]? – a single capital letter or a single capital letter followed by a single digit
• . (period) – any character
More Examples
• [0-9]+– Simple integer constants
• [0-9]+(\.[0-9]+)?– Simple floating-point constants
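A quick way to see what these two patterns accept is to try them on a few strings. The sketch below assumes the fractional part is written `[0-9]+`, so that more than one digit may follow the decimal point; the helper names `is_integer` and `is_float` are hypothetical.

```python
import re

INTEGER = re.compile(r"[0-9]+")
# Assumption: one or more digits after the point, and the fraction is optional.
FLOAT = re.compile(r"[0-9]+(\.[0-9]+)?")


def is_integer(text):
    """True if the whole string is a simple integer constant."""
    return INTEGER.fullmatch(text) is not None


def is_float(text):
    """True if the whole string is a simple floating-point constant."""
    return FLOAT.fullmatch(text) is not None
```

Under these patterns `3.14` and `7` are both accepted as floating-point constants, while `3.` and `.5` are rejected because digits are required on both sides of the point.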
Context-Free Grammars (CFGs)
• Context-free grammars were developed by Noam Chomsky as a way to specify language
• Rules are generally specified in Backus-Naur Form (BNF) or in Extended BNF (EBNF)
What makes up a CFG?
• A set N of non-terminal symbols
• A set T of terminal symbols
• A set P of production rules
• A special non-terminal symbol S called the start symbol (or goal symbol)
Sample CFG
• sentence → noun-phrase verb-phrase .
• noun-phrase → article noun
• article → a | the
• noun → girl | dog
• verb-phrase → verb noun-phrase
• verb → sees | pets
Parts of the grammar
• Non-terminal symbols: {sentence, noun-phrase, article, noun, verb-phrase, verb}
• Terminal symbols: {., a, the, girl, dog, sees, pets}
• Production rules: the previous slide provides these
• Start symbol: sentence
Notes on CFG
• Non-terminal symbols are those that appear on the left-hand side (lhs) of the production rules
• Terminal symbols are those that appear only on the right-hand side (rhs) of the production rules
• → and | are meta-symbols
(Left-Most) Derivation
sentence ⇒ noun-phrase verb-phrase .
⇒ article noun verb-phrase .
⇒ the noun verb-phrase .
⇒ the girl verb-phrase .
⇒ the girl verb noun-phrase .
⇒ the girl sees noun-phrase .
⇒ the girl sees article noun .
⇒ the girl sees a noun .
⇒ the girl sees a dog .
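The derivation can be replayed mechanically. In the sketch below the sample grammar is stored as a dictionary, and at each step the leftmost non-terminal is replaced by one of its alternatives. The `GRAMMAR` table, the `leftmost_derivation` function and the `choices` list (which alternative to take whenever a non-terminal has more than one) are all hypothetical names introduced for illustration.

```python
# The sample grammar: each non-terminal maps to its list of alternatives.
GRAMMAR = {
    "sentence": [["noun-phrase", "verb-phrase", "."]],
    "noun-phrase": [["article", "noun"]],
    "article": [["a"], ["the"]],
    "noun": [["girl"], ["dog"]],
    "verb-phrase": [["verb", "noun-phrase"]],
    "verb": [["sees"], ["pets"]],
}


def leftmost_derivation(start, choices):
    """Repeatedly expand the leftmost non-terminal; return every sentential form."""
    form = [start]
    steps = [" ".join(form)]
    picks = iter(choices)
    while True:
        # Index of the leftmost symbol that is still a non-terminal, if any.
        index = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if index is None:
            return steps  # only terminals remain
        alternatives = GRAMMAR[form[index]]
        # Consume a choice only when there is more than one alternative.
        pick = next(picks) if len(alternatives) > 1 else 0
        form[index:index + 1] = alternatives[pick]
        steps.append(" ".join(form))


# Choices, in order: the, girl, sees, a, dog.
steps = leftmost_derivation("sentence", [1, 0, 0, 0, 1])
```

Printing `steps` one per line gives the same sequence of sentential forms as the derivation above.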
Corresponding Parse Tree
sentence
    noun-phrase
        article
            the
        noun
            girl
    verb-phrase
        verb
            sees
        noun-phrase
            article
                a
            noun
                dog
    .
Ambiguous Grammars
• A grammar is ambiguous if some sentence has
– two distinct leftmost derivations or
– two distinct parse trees
Grammar for expressions
expr → expr + expr
    | expr * expr
    | ( expr )
    | number
number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
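Parse trees for 3 + 5 * 7
Tree with + at the root, grouping as 3 + (5 * 7):
expr
    expr
        number
            digit
                3
    +
    expr
        expr
            number
                digit
                    5
        *
        expr
            number
                digit
                    7
Tree with * at the root, grouping as (3 + 5) * 7:
expr
    expr
        expr
            number
                digit
                    3
        +
        expr
            number
                digit
                    5
    *
    expr
        number
            digit
                7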
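The ambiguity has practical consequences, because the two trees evaluate to different values. The sketch below enumerates every parse tree that the rules expr → expr + expr | expr * expr | number allow for a flat token list; it leaves out parentheses and treats each number as a single token, both simplifications made here. The names `all_trees` and `evaluate`, and the nested-tuple tree representation, are hypothetical.

```python
def all_trees(tokens):
    """Return every parse tree for tokens of the form: number (op number)*."""
    if len(tokens) == 1:
        return [int(tokens[0])]  # expr -> number
    trees = []
    # Operators sit at the odd positions; each one may be the root.
    for i in range(1, len(tokens), 2):
        for left in all_trees(tokens[:i]):
            for right in all_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees


def evaluate(tree):
    """Compute the value a tree denotes."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) * evaluate(right)


trees = all_trees(["3", "+", "5", "*", "7"])
values = [evaluate(tree) for tree in trees]
```

Here `trees` contains two entries, one with + at the root and one with * at the root, and their values are 38 and 56 respectively.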
Handling Ambiguity
• The grammar rules for expressions can be modified so that operator precedence is built into the grammar, which removes this source of ambiguity
• Introduce a new non-terminal that forces the higher-precedence operator lower in the parse tree
Precedence handled
expr → expr + expr | term
term → term * term | ( expr ) | number
number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Associativity
• This grammar is still ambiguous
• There are two parse trees for 5 + 7 + 9
• This may be OK for addition & multiplication, but not for subtraction & division, which are left-associative
Revised Grammar (not ambiguous)
expr → expr + term | term
term → term * factor | factor
factor → ( expr ) | number
number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
EBNFs
• Extended BNF adds more metasymbols
– { } – a repeated item (0 or more times)
– [ ] – an optional item (0 or 1 time)
Expression Grammar in EBNF
expr → term { + term }
term → factor { * factor }
factor → ( expr ) | number
number → digit { digit }
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
EBNF for if-statement
if-statement → if ( expression ) statement [ else statement ]
Syntax Diagrams
• Syntax diagrams are an alternative to EBNF
• Study the diagrams on pp 99-101 and observe the direct relationship of each to the EBNF grammar rules for expressions
Parsers
• The simplest parser is a recognizer
– Accepts or rejects strings based on whether they are legal strings in the language
• More general parsers
– Build parse trees (or abstract syntax trees)
– May calculate values of expressions, etc.
Bottom-up Parsers
• Attempts to match the input with the RHSs of the grammar rules
• When a match occurs, the RHS is replaced by the non-terminal on the LHS of the rule (called a reduce)
• Sometimes called shift-reduce parsing
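The shift and reduce steps can be illustrated on the sample sentence grammar. The sketch below is deliberately naive: it reduces whenever the top of the stack matches any right-hand side, which happens to work for this small grammar but is not how practical bottom-up parsers decide between shifting and reducing (they consult parsing tables and lookahead). The `RULES` list and the `shift_reduce` function are hypothetical names.

```python
# (LHS, RHS) pairs for the sample sentence grammar.
RULES = [
    ("sentence", ["noun-phrase", "verb-phrase", "."]),
    ("noun-phrase", ["article", "noun"]),
    ("verb-phrase", ["verb", "noun-phrase"]),
    ("article", ["a"]),
    ("article", ["the"]),
    ("noun", ["girl"]),
    ("noun", ["dog"]),
    ("verb", ["sees"]),
    ("verb", ["pets"]),
]


def shift_reduce(tokens):
    """Greedy shift-reduce recognizer; returns (accepted, trace of actions)."""
    stack, trace = [], []
    for token in tokens:
        stack.append(token)  # shift the next input token
        trace.append(("shift", token))
        reduced = True
        while reduced:  # keep reducing while some RHS is on top of the stack
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)  # replace the RHS by its LHS
                    trace.append(("reduce", lhs))
                    reduced = True
                    break
    return stack == ["sentence"], trace


accepted, trace = shift_reduce("the girl sees a dog .".split())
```

The input is accepted when the stack ends up holding only the start symbol. The reductions in `trace` occur in the reverse order of a rightmost derivation, which is why the last action is the reduce to `sentence`.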
Top-down Parsers
• Non-terminals are expanded to match incoming tokens and the parser directly constructs a derivation
Recursive-Descent Parsing
• A program made up of a collection of mutually recursive procedures, one for each non-terminal.
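As a minimal sketch of the idea, the parser below has one method for each of the non-terminals expr, term and factor of the EBNF expression grammar, with each { } repetition turned into a while loop. Several things are assumptions made for this illustration: the class and function names, the choice to compute a value instead of building a tree, and the handling of number in the scanner via the pattern `[0-9]+` rather than by a procedure of its own. The tokenizer also silently skips characters it does not recognize, which a real scanner would report.

```python
import re


class Parser:
    """Recursive-descent evaluator for: expr, term, factor (EBNF form)."""

    def __init__(self, text):
        # Simplified scanner: numbers and the symbols + * ( ) only.
        self.tokens = re.findall(r"[0-9]+|[+*()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):
        # expr -> term { + term }
        value = self.term()
        while self.peek() == "+":
            self.pos += 1
            value += self.term()
        return value

    def term(self):
        # term -> factor { * factor }
        value = self.factor()
        while self.peek() == "*":
            self.pos += 1
            value *= self.factor()
        return value

    def factor(self):
        # factor -> ( expr ) | number
        token = self.peek()
        if token == "(":
            self.pos += 1
            value = self.expr()
            if self.peek() != ")":
                raise SyntaxError(f"expected ')', found {self.peek()!r}")
            self.pos += 1
            return value
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number or '(', found {token!r}")
        self.pos += 1
        return int(token)


def parse(text):
    """Parse and evaluate a complete expression; reject trailing tokens."""
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

Because `term` is called from inside `expr`, multiplication binds tighter than addition, so `parse("3 + 5 * 7")` gives 38 and `parse("(3 + 5) * 7")` gives 56. The loops accumulate from left to right, which corresponds to left associativity.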