Syntax The Structure of a Language

Syntax The Structure of a Language. Lexical Structure The structure of the tokens of a programming language The scanner takes a sequence of characters

Syntax

The Structure of a Language

Lexical Structure

• The structure of the tokens of a programming language

• The scanner takes a sequence of characters and collects them into tokens

Tokens

• Reserved words (keywords)
    – if, while

• Literals or constants
    – 3.14, "Fred"

• Special symbols
    – +, =

• Identifiers
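
The token classes above can be illustrated with a small scanner. This is a minimal sketch, not a scanner for any real language: the keyword set and the symbol set are taken from the slide's examples, and the names `KEYWORDS`, `TOKEN_RE` and `scan` are made up for this illustration.

```python
import re

KEYWORDS = {"if", "while"}  # assumed keyword set, from the slide's examples

# One named group per token class (hypothetical, simplified patterns).
TOKEN_RE = re.compile(r"""
    (?P<LITERAL>\d+\.\d+|\d+|"[^"\n]*")   # numeric or string literal
  | (?P<IDENT>[A-Za-z_]\w*)               # identifier (or keyword, see below)
  | (?P<SYMBOL>[+=])                      # special symbol
  | (?P<SKIP>\s+)                         # white space, discarded
""", re.VERBOSE)

def scan(text):
    """Collect the characters of text into (class, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        # Keywords look like identifiers; reclassify them after matching.
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens

print(scan('if x = 3.14 + "Fred"'))
```

Matching keywords as identifiers first and reclassifying them afterwards is one common design; it keeps `ifx` a single identifier rather than the keyword `if` followed by `x`.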

Principle of Longest Substring

• At each point, the longest possible string is collected into a single token

• Natural token separators
    – Token delimiters: ; + =
    – White space
        • Spaces and tabs
        • Newlines
        • Comments
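
The principle of longest substring can be sketched for special symbols alone. The symbol set below is an assumption chosen so that some symbols are prefixes of others; `SYMBOLS`, `next_symbol` and `split_symbols` are illustrative names, not part of any library.

```python
SYMBOLS = ["<=", "<", "==", "=", "+", ";"]  # hypothetical symbol set

def next_symbol(text, pos):
    """Return the longest symbol that starts at pos, or None."""
    candidates = [s for s in SYMBOLS if text.startswith(s, pos)]
    return max(candidates, key=len) if candidates else None

def split_symbols(text):
    """Split text into symbols, always taking the longest match."""
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():      # white space only separates tokens
            pos += 1
            continue
        sym = next_symbol(text, pos)
        if sym is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        out.append(sym)
        pos += len(sym)
    return out

print(split_symbols("<=="))    # longest match first: '<=' then '='
print(split_symbols("< = ="))  # white space forces three separate tokens
```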

FORTRAN violates these rules

• DO 99 I = 1.10
    – Assigns 1.10 to the variable DO99I (blanks are not significant)

• DO 99 I = 1,10
    – Sets up a loop with loop counter I going from 1 to 10

• FORTRAN has no reserved words at all

C token conventions

• Six classes of tokens
    – Identifiers
    – Keywords
    – Constants
    – String literals
    – Operators
    – Other separators

• White space characters are ignored except as they separate tokens

• Adheres to the principle of longest substring

Regular Expressions

• Regular expressions were invented by Stephen Kleene and appeared in a RAND Corporation report in 1951

• Regular expressions represent a form of language definition

• Each regular expression E denotes a language L(E) defined over the alphabet of the language

Rules defining REs

• Empty
    – The empty string is an RE

• Atom
    – Any symbol from the alphabet is an RE

• Alternation
    – If a and b are REs then so is a|b
    – All strings identified by a and all those identified by b

• Concatenation
    – If a and b are REs then so is ab
    – All strings formed by concatenating a string identified by b to the end of one identified by a

More rules for REs

• Kleene Closure
    – If a is an RE then so is a*
    – All strings formed by concatenating zero or more strings identified by a

• Positive Closure
    – If a is an RE then so is a+
    – All strings formed by concatenating one or more strings identified by a

Examples of REs

• (a|b)c
    – Recognizes ac and bc but no others

• (a|b)*c
    – Recognizes c, ac, bc, aac, abc, abac, and so on

• (a|b)+c
    – Recognizes all of the strings above except c
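
These three expressions can be checked with Python's `re` module, whose syntax for alternation, `*` and `+` agrees with the rules above. A minimal sketch; the sample strings and the helper name `accepted` are chosen for illustration.

```python
import re

SAMPLES = ["c", "ac", "bc", "aac", "abc", "abac", "ca", ""]

def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

print(accepted(r"(a|b)c", SAMPLES))   # only ac and bc
print(accepted(r"(a|b)*c", SAMPLES))  # zero or more a/b, then c
print(accepted(r"(a|b)+c", SAMPLES))  # one or more a/b, then c
```

`re.fullmatch` is used rather than `re.search` because a string belongs to L(E) only if the whole string matches, not just a substring.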

Extensions

• [ ] – any one of a set of characters
    – [A-Z] – any capital letter
    – [0123456789] – any digit

• ? – an optional item (0 or 1 of these)
    – [A-Z][0-9]? – a single capital letter, or a single capital letter followed by a single digit

• . (period) – any character

More Examples

• [0-9]+
    – Simple integer constants

• [0-9]+(\.[0-9]+)?
    – Simple floating-point constants (the fractional part is optional, so plain integers also match)
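
The character-class and optional-item extensions behave the same way in Python's `re` module, so the patterns can be tried directly. A minimal sketch; the pattern names and the helper `matches` are made up for this illustration, and the fractional part of the floating-point pattern uses `[0-9]+` so that more than one digit may follow the point.

```python
import re

LETTER_DIGIT = r"[A-Z][0-9]?"      # capital letter, optionally one digit
INTEGER = r"[0-9]+"                # one or more digits
FLOAT = r"[0-9]+(\.[0-9]+)?"       # digits, optionally '.' and more digits

def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

for s in ["A", "B7", "B77", "42", "3.14", "3.", ".5"]:
    print(s, matches(LETTER_DIGIT, s), matches(INTEGER, s), matches(FLOAT, s))
```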

Context-Free Grammars (CFGs)

• Context-free grammars were developed by Noam Chomsky as a way to specify language

• Rules are generally specified in Backus-Naur Form (BNF) or in Extended BNF (EBNF)

What makes up a CFG?

• A set N of non-terminal symbols

• A set T of terminal symbols

• A set P of production rules

• A special non-terminal symbol S called the start symbol (or goal symbol)

Sample CFG

• sentence → noun-phrase verb-phrase .

• noun-phrase → article noun

• article → a | the

• noun → girl | dog

• verb-phrase → verb noun-phrase

• verb → sees | pets

Parts of the grammar

• Non-terminal symbols: {sentence, noun-phrase, article, noun, verb-phrase, verb}

• Terminal symbols: { . , a, the, girl, dog, sees, pets }

• Production rules: the previous slide provides these

• Start symbol: sentence

Notes on CFG

• Non-terminal symbols are those that appear on the left-hand side (lhs) of the production rules

• Terminal symbols are those that appear only on the right-hand side (rhs) of the production rules

• → and | are meta-symbols
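
The four parts of the sample grammar can be written down as plain data, and the two notes above then become set computations. This is a sketch under an assumed representation (a dict from each left-hand side to its list of alternatives); the names `GRAMMAR`, `START`, `non_terminals` and `terminals` are illustrative.

```python
# Production rules P: left-hand side -> list of alternative right-hand sides.
GRAMMAR = {
    "sentence":    [["noun-phrase", "verb-phrase", "."]],
    "noun-phrase": [["article", "noun"]],
    "article":     [["a"], ["the"]],
    "noun":        [["girl"], ["dog"]],
    "verb-phrase": [["verb", "noun-phrase"]],
    "verb":        [["sees"], ["pets"]],
}
START = "sentence"  # start symbol S

# N: symbols that appear on a left-hand side.
non_terminals = set(GRAMMAR)

# T: symbols that appear only on right-hand sides.
rhs_symbols = {sym for alts in GRAMMAR.values() for rhs in alts for sym in rhs}
terminals = rhs_symbols - non_terminals

print(sorted(non_terminals))
print(sorted(terminals))
```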

(Left-Most) Derivation

sentence ⇒ noun-phrase verb-phrase .
    ⇒ article noun verb-phrase .
    ⇒ the noun verb-phrase .
    ⇒ the girl verb-phrase .
    ⇒ the girl verb noun-phrase .
    ⇒ the girl sees noun-phrase .
    ⇒ the girl sees article noun .
    ⇒ the girl sees a noun .
    ⇒ the girl sees a dog .
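
A left-most derivation always replaces the first non-terminal in the current sentential form. The sketch below reproduces the derivation for the sample grammar; it assumes a dict representation of the rules and relies on the fact that, in this particular grammar, every choice between alternatives is a choice between single terminals, so the target sentence decides it. The names `GRAMMAR` and `leftmost_derivation` are made up for this illustration.

```python
GRAMMAR = {
    "sentence":    [["noun-phrase", "verb-phrase", "."]],
    "noun-phrase": [["article", "noun"]],
    "article":     [["a"], ["the"]],
    "noun":        [["girl"], ["dog"]],
    "verb-phrase": [["verb", "noun-phrase"]],
    "verb":        [["sees"], ["pets"]],
}

def leftmost_derivation(target, start="sentence"):
    """Return the sentential forms of a left-most derivation of target."""
    form = [start]
    steps = [form[:]]
    while any(sym in GRAMMAR for sym in form):
        # Index of the left-most non-terminal.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        alternatives = GRAMMAR[form[i]]
        if len(alternatives) == 1:
            rhs = alternatives[0]
        else:
            # Pick the single-terminal alternative that matches the target.
            wanted = target[i] if i < len(target) else None
            rhs = next((alt for alt in alternatives if alt == [wanted]), None)
            if rhs is None:
                raise ValueError(f"cannot derive {target!r}")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form[:])
    if form != target:
        raise ValueError(f"cannot derive {target!r}")
    return steps

for step in leftmost_derivation("the girl sees a dog .".split()):
    print(" ".join(step))
```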

Corresponding Parse Tree

sentence
    noun-phrase
        article
            the
        noun
            girl
    verb-phrase
        verb
            sees
        noun-phrase
            article
                a
            noun
                dog
    .

Ambiguous Grammars

• A grammar is ambiguous if some sentence has
    – two distinct left-most derivations, or
    – two distinct parse trees

Grammar for expressions

expr → expr + expr
    | expr * expr
    | ( expr )
    | number

number → number digit | digit

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Parse trees for 3 + 5 * 7

Tree 1: + at the root, grouping as 3 + (5 * 7)

expr
    expr
        number
            digit
                3
    +
    expr
        expr
            number
                digit
                    5
        *
        expr
            number
                digit
                    7

Tree 2: * at the root, grouping as (3 + 5) * 7

expr
    expr
        expr
            number
                digit
                    3
        +
        expr
            number
                digit
                    5
    *
    expr
        number
            digit
                7
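
The ambiguity can also be confirmed by counting parse trees. The sketch below counts the distinct trees that the grammar expr → expr + expr | expr * expr | ( expr ) | number assigns to a token list. It treats each number as a single token, which is an assumption made for brevity (the number rule itself contributes exactly one tree per number); the name `count_parse_trees` is illustrative.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under the ambiguous expr grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways to derive tokens[i:j] from expr.
        if j <= i:
            return 0
        if j - i == 1:
            return 1 if tokens[i].isdigit() else 0   # expr -> number
        total = 0
        if tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)             # expr -> ( expr )
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):              # expr -> expr op expr
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees("3 + 5 * 7".split()))
print(count_parse_trees("( 3 + 5 ) * 7".split()))
```

Any count greater than one for some input shows that the grammar is ambiguous; parenthesizing the input leaves only one tree.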

Handling Ambiguity

• The grammar rules for expressions can be rewritten so that operator precedence is built into the grammar, which removes the precedence ambiguity

• Introduce a new non-terminal that forces the higher-precedence operator lower in the parse tree

Precedence handled

expr → expr + expr | term

term → term * term | ( expr ) | number

number → number digit | digit

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Associativity

• This grammar is still ambiguous

• There are two parse trees for 5 + 7 + 9

• This may be OK for addition and multiplication, but not for subtraction and division, which are left-associative
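
Why the grouping matters for subtraction but not for addition can be seen by evaluating the same operands under both groupings. A small sketch; `left_assoc` and `right_assoc` are hypothetical helper names.

```python
from functools import reduce
import operator

def left_assoc(op, values):
    """Evaluate ((a op b) op c) ..., the left-associative grouping."""
    return reduce(op, values)

def right_assoc(op, values):
    """Evaluate a op (b op (c ...)), the right-associative grouping."""
    return reduce(lambda acc, v: op(v, acc), reversed(values))

print(left_assoc(operator.add, [5, 7, 9]), right_assoc(operator.add, [5, 7, 9]))
print(left_assoc(operator.sub, [9, 5, 2]), right_assoc(operator.sub, [9, 5, 2]))
```

For addition both groupings give the same value, so either parse tree is harmless; for subtraction (9 - 5) - 2 and 9 - (5 - 2) differ, so the grammar has to force the left-associative tree.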

Revised Grammar (not ambiguous)

expr → expr + term | term

term → term * factor | factor

factor → ( expr ) | number

number → number digit | digit

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

EBNFs

• Extended BNF adds more metasymbols
    – { } – a repeated item (0 or more times)
    – [ ] – an optional item (0 or 1 time)

Expression Grammar in EBNF

expr → term { + term }

term → factor { * factor }

factor → ( expr ) | number

number → digit { digit }

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

EBNF for if-statement

if-statement → if ( expression ) statement [ else statement ]

Syntax Diagrams

• Syntax diagrams are an alternative to EBNF

• Study the diagrams on pp 99-101 and observe the direct relationship of each to the EBNF grammar rules for expressions

Parsers

• The simplest parser is a recognizer
    – Accepts or rejects strings based on whether they are legal strings in the language

• More general parsers
    – Build parse trees (or abstract syntax trees)
    – May calculate values of expressions, etc.

Bottom-up Parsers

• Attempt to match the input with the RHSs of the grammar rules

• When a match occurs, the RHS is replaced by the non-terminal on the LHS of the rule (called a reduce)

• Sometimes called shift-reduce parsing
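
The shift and reduce steps can be sketched for the sample sentence grammar. This is a deliberately naive recognizer: it reduces whenever the top of the stack matches some RHS, which happens to be sufficient for this small grammar. Practical bottom-up parsers instead use parsing tables and lookahead to decide between shifting and reducing. The names `RULES` and `shift_reduce` are made up for this illustration.

```python
# (LHS, RHS) pairs for the sample sentence grammar.
RULES = [
    ("sentence",    ["noun-phrase", "verb-phrase", "."]),
    ("noun-phrase", ["article", "noun"]),
    ("article",     ["a"]),
    ("article",     ["the"]),
    ("noun",        ["girl"]),
    ("noun",        ["dog"]),
    ("verb-phrase", ["verb", "noun-phrase"]),
    ("verb",        ["sees"]),
    ("verb",        ["pets"]),
]

def shift_reduce(tokens, start="sentence"):
    """Return (accepted, trace) for a naive shift-reduce recognizer."""
    stack, trace = [], []
    for tok in tokens:
        stack.append(tok)                      # shift the next input token
        trace.append(("shift", tok))
        reduced = True
        while reduced:                         # reduce as long as possible
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:   # top of stack matches an RHS
                    del stack[-len(rhs):]
                    stack.append(lhs)          # replace it by the LHS
                    trace.append(("reduce", lhs))
                    reduced = True
                    break
    return stack == [start], trace

ok, trace = shift_reduce("the girl sees a dog .".split())
print(ok)
for action in trace:
    print(action)
```

Read backwards, the sequence of reductions is a derivation of the sentence, which is why this style of parsing is called bottom-up.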

Top-down Parsers

• Non-terminals are expanded to match incoming tokens and the parser directly constructs a derivation

Recursive-Descent Parsing

• A program made up of a collection of mutually recursive procedures, one for each non-terminal.
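
A recursive-descent parser for the EBNF expression grammar (expr → term { + term }, term → factor { * factor }, factor → ( expr ) | number) might look like the sketch below. Each non-terminal becomes one method, and each EBNF repetition { } becomes a loop. The class name `Parser`, the helper `evaluate`, and the decision to compute a value rather than build a tree are assumptions made for this illustration.

```python
import re

class Parser:
    def __init__(self, text):
        # Numbers become one token; any other non-blank character is a token.
        self.tokens = re.findall(r"[0-9]+|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):                  # expr -> term { + term }
        value = self.term()
        while self.peek() == "+":
            self.pos += 1
            value += self.term()
        return value

    def term(self):                  # term -> factor { * factor }
        value = self.factor()
        while self.peek() == "*":
            self.pos += 1
            value *= self.factor()
        return value

    def factor(self):                # factor -> ( expr ) | number
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            value = self.expr()      # mutual recursion back into expr
            self.expect(")")
            return value
        if tok is not None and re.fullmatch(r"[0-9]+", tok):
            self.pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    """Parse text as an expr and return its value."""
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value

print(evaluate("3 + 5 * 7"))
print(evaluate("(3 + 5) * 7"))
```

Because `term` is called from inside `expr`, multiplication binds tighter than addition without any explicit precedence table; the loops accumulate from left to right, which gives left associativity.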