Upload
connor-snell
View
237
Download
6
Embed Size (px)
Citation preview
Chapter 2Syntax
A language that is simple to parse for the compiler is also simple to parse for the human programmer.
N. Wirth
Programming Languages2nd edition
Tucker and Noonan
2.1 Grammars2.1.1 Backus-Naur Form2.1.2 Derivations2.1.3 Parse Trees
2.1.4 Associativity and Precedence2.1.5 Ambiguous Grammars
2.2 Extended BNF2.3 Syntax of a Small Language: Clite
2.3.1 Lexical Syntax2.3.2 Concrete Syntax
2.4 Compilers and Interpreters2.5 Linking Syntax and Semantics
2.5.1 Abstract Syntax2.5.2 Abstract Syntax Trees2.5.3 Abstract Syntax of Clite
Contents
Expr Expr + Term | Expr – Term | TermTerm Term * Factor | Term / Factor |
Term % Factor | FactorFactor Primary ** Factor | PrimaryPrimary 0 | ... | 9 | ( Expr )
Red indicates a terminal of G1, blue indicates meta-symbols (symbols that aren’t part of the language but are used to describe the language)
Example: 10 * ( 5 – 3)
Grammar G1: Review
Motivation for using a subset of C:
GrammarLanguage (pages) ReferencePascal 5 Jensen & WirthC 6 Kernighan &
RichieC++ 22 StroustrupJava 14 Gosling, et. al.
The Clite grammar fits on one page (next 3 slides), so it’s a far better tool for studying language design.
2.3 Syntax of a Small Language: Clite
Program int main ( ) { Declarations Statements }
Declarations { Declaration }
Declaration Type Identifier [ [Integer ]] { , Identifier
[[Integer ] ] };
Type int | bool | float | char
Statements { Statement }
Statement ; | Block | Assignment | IfStatement |
WhileStatement
Block { Statements }
Assignment Identifier [ [ Expression ] ] = Expression ;
IfStatement if ( Expression ) Statement [ else Statement ]
WhileStatement while ( Expression ) Statement
Clite EBNF Grammar: StatementsFig. 2.7
Expression Conjunction { || Conjunction }Conjunction Equality { && Equality } Equality Relation [ EquOp Relation ] EquOp == | != Relation Addition [ RelOp Addition ] RelOp < | <= | > | >= Addition Term { AddOp Term } AddOp + | - Term Factor { MulOp Factor } MulOp * | / | % Factor [ UnaryOp ] Primary UnaryOp - | ! Primary Identifier [ [ Expression ] ] | Literal | ( Expression ) |
Type ( Expression )
Fig. 2.7 Clite Grammar: Expressions
Identifier Letter { Letter | Digit } Letter a | b | ... | z | A | B | ... | Z Digit 0 | 1 | ... | 9 Literal Integer | Boolean | Float | Char Integer Digit { Digit } Boolean true | false Float Integer . Integer Char ‘ ASCII Char ‘
(ASCII Char is the set of ASCII characters)
Clite grammar: lexical level
• 13 grammar rules – compare to 4 pages, for C++ • Metabraces { }(0 or more) is interpreted to mean
left associativity; e.g., ◦ Addition → Term { AddOp Term }
AddOp → + | - • Metabrackets [ ] (optional) means an addition can
only be followed by one or no relational operators plus another addition.
• Relation Addition [ RelOp Addition ]◦ RelOp < | <= | > | >=
◦ (no a > b > c for example)
Clite Grammar - Discussion
• Comments• The significance of whitespace• Distinguishing one token <= from two
tokens < =• Distinguishing identifiers from keywords
like if
Issues Not Addressed by this Grammar
The Clite grammar has two levels◦ lexical level (described by lexical syntax)◦ syntactic level (described by concrete
syntax) They correspond to two separate parts of
a compiler. The issues on the previous slide are
lexical issues.
Grammar Levels
Examples of lexical entities (tokens):
◦ Identifiers e.g., numbr1, X◦ Literals e.g., 123, 'x', 3.25, true◦ Keywords bool char else false float
if int main true while◦ Operators = || && == != < <= > >= + -
* / ! %◦ Punctuation ; , { } ( )◦ Char e.g.,‘?’
2.3.1 Lexical Syntax & Analysis
Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment
No token may contain embedded whitespace (unless it is a character or string literal) Example:
>= one token> = two tokens
Whitespace
while ( a <= b) legal - spacing between tokens
while(a<=b) also legal - spacing not needed
while (a < = b) no lexical errors but illegal syntactically – lexer would identify tokens while, (, a, <, =, b, )
The Role of Whitespace
Clite uses // comment style of C++ Not defined in Clite grammar (but could
be) Instead, it’s defined outside the grammar The use of whitespace to differentiate
between one and two character operators is also defined outside the grammar.
Comments
• Sequence of letters and digits, starting with a letter
“if” is an identifier which also is a keyword
• Keywords versus reserved words:◦ Keyword: predefined by the language
◦ Reserved word: can only be used as defined.
◦ In most languages all keywords are also reserved, but in a few; e.g., Pascal, a subset of the keyword identifiers are predefined but not reserved (and can be redefined by the programmer). Implications? Flexibility, confusion …
Clite Identifiers
Concrete syntax of a language is the set of rules for writing correct programs
The structure of a specific program can be represented by a parse tree, based on the concrete syntax of the language, using the stream of Tokens identified during lexical analysis
The root of the parse tree is the Start Symbol of the language (Program, in Clite).
2.3.2 Concrete Syntax
• Clite’s expression rules are non-ambiguous with respect to precedence and associativity◦ Rule ordering defines precedence; rule
format defines associativity.• C/C++ expression grammar definition
is ambiguous – precedence and associativity are specified separately.
Clite Grammar - Discussion
Clite Operator AssociativityUnary - ! none* / left+ - left< <= > >= none (i.e., no a < b <= c)== != none&& left|| left
Associativity and Precedence
… are non-associative.(an idea borrowed from Ada)
Why is this important?In C & C++, the expression:
if (10 < x < 20)is not equivalent to if (10 < x && x < 20)
But it is error-free! So, what does it mean?
Clite Relational Operators
Grammar rules don’t specify the operand types to be used with various operators; e.g., is
true + 13 a legal expression? What is the type of the expression 123.78 + 37 ?
These are type and semantic issues, not lexical or syntax.
Syntax versus Semantics
2.4 Compilers and Interpreters
Figure 2.8: Major Stages in the Compilation Process
Lexical Analyzer(lexer)
SyntacticAnalyzer(parser)
Semantic Analyzer
CodeOptimizer
CodeGenerator
Toke
ns
Abs
trac
t Syn
tax
Machine Code
Inte
rmed
iate
Cod
e (I
C)
SourceProgram
Inte
rmed
iate
Cod
e (I
C)
Input: characters (the program) Output: tokens & token type Lexical grammars are simpler than syntax
grammars Often generated automatically by lexical
analyzer generating programs
Lexer
• Often based on BNF/EBNF grammar• Input: tokens• Output: abstract syntax tree or some
other representation of the program• Abstract syntax: similar to a concrete
parse tree but with punctuation, many nonterminals discarded
Parser (Syntax Analyzer)
• Typical tasks:◦ Check that all identifiers are declared◦ Perform type checking for expressions,
assignments, …◦ Insert implied conversion operators (i.e., make
them explicit)• Context free grammars can’t express the
semantic rules that are needed for this phase of translation.
• Output: Intermediate code (IC) tree, modified abstract syntax tree representation.
Semantic Analysis/Type Checker
Purpose: Improve the run-time performance of the object code◦ Usually, to make it run faster◦ Other possibilities: reduce amount of storage
required Drawback: optimization is time-
consuming; slows down debugging Output: Intermediate code, similar to
abstract syntax notation; closer to machine code
Code Optimization
• Evaluate constant expressions at compile-time
• In-line expansion (of function calls)• Loop unrolling• Reorder code to improve cache
performance• Eliminate common sub-expressions• Eliminate unnecessary code• Store local variables/intermediate results
in registers rather than on the stack or elsewhere in memory
Code Optimization - Techniques
• Output: machine code• Instruction selection, register
management• “Peephole” optimization: look at a small
segment of machine code, make it more efficient; e.g.,◦ x = y; →→ load y, R0
store R0, xz = x * 2; load x, R0 // redundant code
mul R0, #2
Code Generation
Replaces last 2 phases of a compiler with direct execution
Input: ◦ Mixed: generates & uses intermediate
code/abstract syntax ◦ Pure: start from stream of ASCII characters
each time a statement is executed Mixed interpreters
◦ Java, Perl, Python, Haskell, Scheme Pure interpreters:
◦ most Basics, shell commands
Interpreter
Source code: a = x + y; Compiler-generated object code:
load r0, x;add r0, y;store r0, a;
Will be executed later, with remainder of program
Interpreter: Call an interpretive routine to actually perform the operation.
e.g., add(x, y, a);
Interpreter versus Compiler
It’s not the case that the lexical analyzer identifies all the tokens, and then the parser analyzes all the tokens, and then the type/semantic analysis is performed.
Instead, parser repeatedly contacts lexer to get another token
As tokens are received they either do or don’t match the expected syntax◦ If there’s a match, perform any type or semantic
testing, possibly generate int. code, call for another token.
Stages of Translation Are Interleaved
Output of parser: the concrete parse tree is large – probably more than needed for next phase◦ The compiler usually produces some more
compact representation of the program
Example: Fig. 2.9 (page 46)
2.5 Linking Syntax and Semantics
Parse Tree for z = x + 2*y;Fig. 2.9
The shape of the parse tree reveals the meaning of the program.
So as output of syntax analysis we want a tree that removes its inefficiency and keeps its meaning. ◦ Remove separator/punctuation terminal
symbols◦ Remove all trivial nonterminals◦ Replace remaining nonterminals with leaf
terminals Example: Fig. 2.10
Finding a More Efficient Tree
Compare: concrete & abstract parse trees for z = x + 2*y
Abstract Syntax
Removes unnecessary details but keeps the essential language elements; e.g., consider the following two equivalent loops:
Pascal C++while j < n do begin while (j < n) { j := j + 1; j = j + 1;end; } Essential information: 1) it is a loop, 2) its terminating condition is j >= n, and 3) its body increments the current value of j.
• Purpose: an intermediate form of the source code
• Generated by the parser during syntax analysis
• Used during type checking/semantic analysis• Abstract syntax rules are defined by the
compiler (or interpreter).• One concrete syntax can have several
abstract syntaxes associated with it, depending on the design of the translator.
Abstract Syntax Tree
LHS = RHS LHS names an abstract syntax class RHS either (1) gives a list of one or more
alternatives or (2) lists the essential elements of
the syntax class Compare to production rules in concrete
grammar
General Form of Abstract Syntax Rules
Simplified Abstract Assignment Syntax Figure 2.11
Assignment = Variable target; Expression sourceExpression = Variable| Value | Binary | UnaryVariable = String idValue = Integer valueBinary = Operator op; Expression term1, term2Unary = UnaryOp op; Expression termOperator = +| - | * | / | !
Priority? Associativity? ….
• Concrete:Assignment Identifier [ [ Expression ] ] = Expression ;
• Abstract:Assignment = Variable target; Expression source
Compare Concrete to Abstract
Abstract Syntax Tree for z = x + 2 * y
target source
Binary
BinaryOperator
Operator
Variable
VariableValue
+
2 y*
x
z
op term1 term2Binary node
op termUnary node
Binaries and unaries represent information that can be used for later processing
Binaries and Unaries
Abstract Syntax of Clite Assignments – Fig. 2.14
Assignment = Variable target; Expression sourceExpression = VariableRef | Value | Binary | UnaryVariableRef = Variable | ArrayRefVariable = String idArrayRef = String id; Expression indexValue = IntValue | BoolValue | FloatValue | CharValueBinary = Operator op; Expression term1, term2Unary = UnaryOp op; Expression termOperator = ArithmeticOp | RelationalOp | BooleanOpIntValue = Integer intValue…
Abstract Syntax as Java Classes - Examples
abstract class Expression { } abstract class VariableRef extends Expression { }class Variable extends VariableRef { String id; }class ArrayRef extends VariableRef { String id; Expression index}class Value extends Expression { … }class Binary extends Expression {
Operator op;Expression term1, term2;
}class Unary extends Expression {
UnaryOp op;Expression term;
}
Remaining Abstract Syntax of Clite (Declarations and Statements)Fig 2.14
Lexical syntax: small, simple, defines language tokens
Concrete syntax: detailed, specific, defines correct programs, used to direct parsing algorithms(language specific, not implementation specific)
Abstract syntax: simpler than concrete, used to describe the structure of the intermediate code (implementation specific, not language specific) NOT INTENDED TO BE USED FOR PARSING
Semantics: program “meaning”, or runtime behavior
Summary
A syntax is ambiguous if a portion of a program has two or more possible interpretations (parse trees)
Non-ambiguous grammars can be written but some ambiguity may be tolerated to reduce grammar size.
Operator associativity and precedence can be defined by the concrete syntax
Compilers generate code for later execution while interpreters execute program statements as they are analyzed.
Summary