Chapter 2 Syntax

A language that is simple to parse for the compiler is also simple to parse for the human programmer.

N. Wirth

Programming Languages, 2nd edition

Tucker and Noonan

Contents

2.1 Grammars
    2.1.1 Backus-Naur Form
    2.1.2 Derivations
    2.1.3 Parse Trees
    2.1.4 Associativity and Precedence
    2.1.5 Ambiguous Grammars
2.2 Extended BNF
2.3 Syntax of a Small Language: Clite
    2.3.1 Lexical Syntax
    2.3.2 Concrete Syntax
2.4 Compilers and Interpreters
2.5 Linking Syntax and Semantics
    2.5.1 Abstract Syntax
    2.5.2 Abstract Syntax Trees
    2.5.3 Abstract Syntax of Clite

Grammar G1: Review

Expr → Expr + Term | Expr – Term | Term
Term → Term * Factor | Term / Factor | Term % Factor | Factor
Factor → Primary ** Factor | Primary
Primary → 0 | ... | 9 | ( Expr )

Red indicates a terminal of G1; blue indicates metasymbols (symbols that aren’t part of the language but are used to describe it).

Example: 10 * ( 5 – 3 )

2.3 Syntax of a Small Language: Clite

Motivation for using a subset of C:

Language    Grammar (pages)    Reference
Pascal      5                  Jensen & Wirth
C           6                  Kernighan & Ritchie
C++         22                 Stroustrup
Java        14                 Gosling et al.

The Clite grammar fits on one page (next 3 slides), so it’s a far better tool for studying language design.

Clite EBNF Grammar: Statements (Fig. 2.7)

Program → int main ( ) { Declarations Statements }
Declarations → { Declaration }
Declaration → Type Identifier [ [ Integer ] ] { , Identifier [ [ Integer ] ] } ;
Type → int | bool | float | char
Statements → { Statement }
Statement → ; | Block | Assignment | IfStatement | WhileStatement
Block → { Statements }
Assignment → Identifier [ [ Expression ] ] = Expression ;
IfStatement → if ( Expression ) Statement [ else Statement ]
WhileStatement → while ( Expression ) Statement

Clite Grammar: Expressions (Fig. 2.7)

Expression → Conjunction { || Conjunction }
Conjunction → Equality { && Equality }
Equality → Relation [ EquOp Relation ]
EquOp → == | !=
Relation → Addition [ RelOp Addition ]
RelOp → < | <= | > | >=
Addition → Term { AddOp Term }
AddOp → + | -
Term → Factor { MulOp Factor }
MulOp → * | / | %
Factor → [ UnaryOp ] Primary
UnaryOp → - | !
Primary → Identifier [ [ Expression ] ] | Literal | ( Expression ) | Type ( Expression )

Clite Grammar: Lexical Level

Identifier → Letter { Letter | Digit }
Letter → a | b | ... | z | A | B | ... | Z
Digit → 0 | 1 | ... | 9
Literal → Integer | Boolean | Float | Char
Integer → Digit { Digit }
Boolean → true | false
Float → Integer . Integer
Char → ' ASCII Char '

(ASCII Char is the set of ASCII characters)

Clite Grammar - Discussion

• 13 grammar rules – compare to 4 pages for C++
• Metabraces { } (0 or more) are interpreted to mean left associativity; e.g.,
  ◦ Addition → Term { AddOp Term }
  ◦ AddOp → + | -
• Metabrackets [ ] (optional) mean that an addition can be followed by at most one relational operator plus another addition:
  ◦ Relation → Addition [ RelOp Addition ]
  ◦ RelOp → < | <= | > | >=
  ◦ (no a > b > c, for example)
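The link between metabraces and left associativity can be made concrete with a small sketch. The class below is hypothetical (AdditionSketch and its methods are not part of Clite or the textbook), and it simplifies a Term to a single pre-split token. It turns the rule Addition → Term { AddOp Term } into a loop and returns a parenthesized string that shows how the operands group.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: parses  Addition -> Term { AddOp Term }  over
// pre-split tokens, where a Term is simplified to a single token.
class AdditionSketch {
    private final List<String> tokens;
    private int pos = 0;

    AdditionSketch(List<String> tokens) { this.tokens = tokens; }

    // Returns a fully parenthesized string showing how the operands group.
    String parseAddition() {
        String left = parseTerm();
        // The metabraces { AddOp Term } become a loop. Each pass folds the
        // result so far into the left operand, which gives left associativity.
        while (pos < tokens.size() && isAddOp(tokens.get(pos))) {
            String op = tokens.get(pos++);
            String right = parseTerm();
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    private String parseTerm() {
        if (pos >= tokens.size()) throw new IllegalStateException("term expected");
        return tokens.get(pos++);
    }

    private static boolean isAddOp(String t) { return t.equals("+") || t.equals("-"); }

    // Convenience wrapper for the demo.
    static String group(String... tokens) {
        return new AdditionSketch(Arrays.asList(tokens)).parseAddition();
    }

    public static void main(String[] args) {
        System.out.println(group("a", "-", "b", "+", "c")); // ((a - b) + c)
    }
}
```

Because the loop always extends the left operand, a - b + c groups as (a - b) + c. A right-associative operator would instead recurse on the right operand.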

Issues Not Addressed by this Grammar

• Comments
• The significance of whitespace
• Distinguishing one token <= from two tokens < =
• Distinguishing identifiers from keywords like if

Grammar Levels

The Clite grammar has two levels:
◦ lexical level (described by lexical syntax)
◦ syntactic level (described by concrete syntax)

They correspond to two separate parts of a compiler.

The issues on the previous slide are lexical issues.

2.3.1 Lexical Syntax & Analysis

Examples of lexical entities (tokens):
◦ Identifiers  e.g., numbr1, X
◦ Literals  e.g., 123, 'x', 3.25, true
◦ Keywords  bool char else false float if int main true while
◦ Operators  = || && == != < <= > >= + - * / ! %
◦ Punctuation  ; , { } ( )
◦ Char  e.g., '?'

Whitespace

Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment.

No token may contain embedded whitespace (unless it is a character or string literal). Example:

>=     one token
> =    two tokens

The Role of Whitespace

while ( a <= b)    legal; spacing between tokens
while(a<=b)        also legal; spacing not needed
while (a < = b)    no lexical errors, but syntactically illegal: the lexer would identify the tokens while, (, a, <, =, b, )
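The three while examples can be reproduced with a small hand-written lexer. This is a minimal sketch, not the Clite lexer: the class name WhitespaceLexerSketch is made up, and it only handles identifiers, integers, the two-character operators ending in = (<=, >=, ==, !=), and single-character tokens. It does not handle comments, && or ||.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a lexer for a small subset of Clite tokens.
class WhitespaceLexerSketch {
    static boolean isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            // Whitespace only separates tokens; it never becomes a token.
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (isLetter(c)) {
                // Identifier or keyword: a letter followed by letters or digits.
                while (i < src.length() && (isLetter(src.charAt(i)) || isDigit(src.charAt(i)))) i++;
            } else if (isDigit(c)) {
                // Integer literal: one or more digits.
                while (i < src.length() && isDigit(src.charAt(i))) i++;
            } else if ("<>=!".indexOf(c) >= 0 && i + 1 < src.length() && src.charAt(i + 1) == '=') {
                // Two-character operator, but only when '=' follows immediately.
                i += 2;
            } else {
                // Any other character is treated as a one-character token.
                i++;
            }
            tokens.add(src.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("while ( a <= b)"));  // [while, (, a, <=, b, )]
        System.out.println(tokenize("while(a<=b)"));      // [while, (, a, <=, b, )]
        System.out.println(tokenize("while (a < = b)"));  // [while, (, a, <, =, b, )]
    }
}
```

The third call produces a legal token stream; rejecting it is left to the parser, which finds no rule that allows < followed by =.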

Comments

Clite uses the // comment style of C++.

Comments are not defined in the Clite grammar (but could be); instead, they are defined outside the grammar.

The use of whitespace to differentiate between one- and two-character operators is also defined outside the grammar.

Clite Identifiers

• Sequence of letters and digits, starting with a letter
• “if” matches the identifier pattern but is also a keyword
• Keywords versus reserved words:
  ◦ Keyword: predefined by the language
  ◦ Reserved word: can only be used as defined
  ◦ In most languages all keywords are also reserved, but in a few (e.g., Pascal) a subset of the keyword identifiers is predefined but not reserved (and can be redefined by the programmer). Implications? Flexibility, confusion …
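One common way for a lexer to separate keywords from identifiers is to scan the lexeme with the identifier rule first and then look it up in a keyword table. The sketch below assumes that approach; the class KeywordSketch and the method classify are illustrative names, and the table is the keyword list from the lexical syntax slide.

```java
import java.util.Set;

// Hypothetical sketch: classify a lexeme that already matches
// Identifier -> Letter { Letter | Digit }.
class KeywordSketch {
    // Keyword list taken from the Clite lexical syntax slide.
    static final Set<String> KEYWORDS = Set.of(
        "bool", "char", "else", "false", "float",
        "if", "int", "main", "true", "while");

    static String classify(String lexeme) {
        return KEYWORDS.contains(lexeme) ? "Keyword" : "Identifier";
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));      // Keyword
        System.out.println(classify("numbr1"));  // Identifier
    }
}
```

In a language whose keywords are all reserved, this lookup settles the question at the lexical level. A language with predefined but unreserved identifiers would have to defer the decision to a later phase.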

2.3.2 Concrete Syntax

The concrete syntax of a language is the set of rules for writing correct programs.

The structure of a specific program can be represented by a parse tree, based on the concrete syntax of the language, using the stream of tokens identified during lexical analysis.

The root of the parse tree is the start symbol of the language (Program, in Clite).

Clite Grammar - Discussion

• Clite’s expression rules are unambiguous with respect to precedence and associativity
  ◦ Rule ordering defines precedence; rule format defines associativity.
• The C/C++ expression grammar definition is ambiguous – precedence and associativity are specified separately.

Associativity and Precedence

Clite Operator    Associativity
Unary - !         none
* /               left
+ -               left
< <= > >=         none (i.e., no a < b <= c)
== !=             none
&&                left
||                left

Clite Relational Operators

… are non-associative (an idea borrowed from Ada).

Why is this important? In C & C++, the expression

if (10 < x < 20)

is not equivalent to

if (10 < x && x < 20)

But it is error-free! So, what does it mean?
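One way to answer the question: in C, relational operators associate left to right and yield the int 0 or 1, so 10 < x < 20 is evaluated as (10 < x) < 20. Java itself rejects the chained form at compile time, so the sketch below mimics the C evaluation by hand. The class and method names are made up for illustration.

```java
// Hypothetical sketch contrasting C's reading of  10 < x < 20  with the
// range test the programmer probably intended.
class ChainedComparisonSketch {
    // Mimics C: (10 < x) yields 0 or 1, and that value is compared with 20.
    static boolean cStyle(int x) {
        int first = (10 < x) ? 1 : 0;
        return first < 20;   // 0 < 20 and 1 < 20 are both true
    }

    // The intended range test.
    static boolean intended(int x) {
        return 10 < x && x < 20;
    }

    public static void main(String[] args) {
        for (int x : new int[] {5, 15, 25}) {
            System.out.println(x + ": C-style " + cStyle(x) + ", intended " + intended(x));
        }
    }
}
```

The C-style version is true for every x, which is why making relational operators non-associative turns a silent logic error into a syntax error.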

Syntax versus Semantics

Grammar rules don’t specify the operand types to be used with various operators; e.g., is true + 13 a legal expression? What is the type of the expression 123.78 + 37 ?

These are type and semantic issues, not lexical or syntactic ones.

2.4 Compilers and Interpreters

Figure 2.8: Major Stages in the Compilation Process

Source Program → Lexical Analyzer (lexer) → Tokens → Syntactic Analyzer (parser) → Abstract Syntax → Semantic Analyzer → Intermediate Code (IC) → Code Optimizer → Intermediate Code (IC) → Code Generator → Machine Code

Lexer

Input: characters (the program)
Output: tokens & token type
Lexical grammars are simpler than syntax grammars.
Lexers are often generated automatically by lexical-analyzer generators.

Parser (Syntax Analyzer)

• Often based on BNF/EBNF grammar
• Input: tokens
• Output: abstract syntax tree or some other representation of the program
• Abstract syntax: similar to a concrete parse tree but with punctuation and many nonterminals discarded

Semantic Analysis/Type Checker

• Typical tasks:
  ◦ Check that all identifiers are declared
  ◦ Perform type checking for expressions, assignments, …
  ◦ Insert implied conversion operators (i.e., make them explicit)
• Context-free grammars can’t express the semantic rules that are needed for this phase of translation.
• Output: intermediate code (IC) tree, a modified abstract syntax tree representation.

Code Optimization

Purpose: improve the run-time performance of the object code
◦ Usually, to make it run faster
◦ Other possibilities: reduce the amount of storage required
Drawback: optimization is time-consuming; slows down debugging
Output: intermediate code, similar to abstract syntax notation; closer to machine code

Code Optimization - Techniques

• Evaluate constant expressions at compile-time
• In-line expansion (of function calls)
• Loop unrolling
• Reorder code to improve cache performance
• Eliminate common sub-expressions
• Eliminate unnecessary code
• Store local variables/intermediate results in registers rather than on the stack or elsewhere in memory
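The first technique, evaluating constant expressions at compile time, can be sketched on a tiny expression tree. Everything here is an assumption made for illustration: the classes FoldExpr, FoldNum, FoldVar and FoldBin are not the Clite abstract syntax classes, and only +, - and * on integers are folded.

```java
// Hypothetical mini AST; the names are illustrative, not the Clite classes.
abstract class FoldExpr { }

class FoldNum extends FoldExpr {
    final int value;
    FoldNum(int value) { this.value = value; }
    public String toString() { return Integer.toString(value); }
}

class FoldVar extends FoldExpr {
    final String id;
    FoldVar(String id) { this.id = id; }
    public String toString() { return id; }
}

class FoldBin extends FoldExpr {
    final char op;
    final FoldExpr left, right;
    FoldBin(char op, FoldExpr left, FoldExpr right) {
        this.op = op; this.left = left; this.right = right;
    }
    public String toString() { return "(" + left + " " + op + " " + right + ")"; }
}

class ConstantFoldingSketch {
    // Returns a tree in which every +, - or * whose operands are both
    // constants has been replaced by its computed value.
    static FoldExpr fold(FoldExpr e) {
        if (!(e instanceof FoldBin)) return e;      // leaves are left unchanged
        FoldBin b = (FoldBin) e;
        FoldExpr l = fold(b.left);                  // fold the operands first
        FoldExpr r = fold(b.right);
        if (l instanceof FoldNum && r instanceof FoldNum) {
            int x = ((FoldNum) l).value;
            int y = ((FoldNum) r).value;
            switch (b.op) {
                case '+': return new FoldNum(x + y);
                case '-': return new FoldNum(x - y);
                case '*': return new FoldNum(x * y);
                default:  break;                    // other operators are not folded here
            }
        }
        return new FoldBin(b.op, l, r);
    }

    public static void main(String[] args) {
        FoldExpr e = new FoldBin('*', new FoldVar("x"),
                                 new FoldBin('+', new FoldNum(2), new FoldNum(3)));
        System.out.println(e + "  becomes  " + fold(e)); // (x * (2 + 3))  becomes  (x * 5)
    }
}
```

The subtree 2 + 3 is computed once by the translator, so the generated code only has to multiply x by 5.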

Code Generation

• Output: machine code
• Instruction selection, register management
• “Peephole” optimization: look at a small segment of machine code and make it more efficient; e.g.,
  ◦ x = y;      →  load y, R0
                   store R0, x
    z = x * 2;  →  load x, R0    // redundant code
                   mul R0, #2
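The redundant load in the example can be removed by a very small peephole pass. The sketch below is an assumption-laden illustration: instructions are plain strings in the slide's notation (load source, register and store register, destination), and the class PeepholeSketch handles only this one pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one peephole rule over textual instructions.
class PeepholeSketch {
    // Drops "load x, R" when it directly follows "store R, x":
    // the register already holds the value of x.
    static List<String> removeRedundantLoads(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty()) {
                String prev = out.get(out.size() - 1);
                if (prev.startsWith("store ") && instr.startsWith("load ")) {
                    String[] s = prev.substring(6).split(",\\s*");   // [register, variable]
                    String[] l = instr.substring(5).split(",\\s*");  // [variable, register]
                    if (s.length == 2 && l.length == 2
                            && s[0].equals(l[1]) && s[1].equals(l[0])) {
                        continue;   // skip the redundant load
                    }
                }
            }
            out.add(instr);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of("load y, R0", "store R0, x", "load x, R0", "mul R0, #2");
        System.out.println(removeRedundantLoads(code)); // [load y, R0, store R0, x, mul R0, #2]
    }
}
```

The pass only looks at two adjacent instructions at a time, which is what makes it a peephole optimization rather than a whole-program analysis.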

Interpreter

Replaces last 2 phases of a compiler with direct execution
Input:
◦ Mixed: generates & uses intermediate code/abstract syntax
◦ Pure: starts from the stream of ASCII characters each time a statement is executed
Mixed interpreters:
◦ Java, Perl, Python, Haskell, Scheme
Pure interpreters:
◦ most Basics, shell commands

Interpreter versus Compiler

Source code: a = x + y;

Compiler-generated object code:

load r0, x;
add r0, y;
store r0, a;

Will be executed later, with the remainder of the program.

Interpreter: calls an interpretive routine to actually perform the operation, e.g., add(x, y, a);
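The contrast can be sketched side by side for the statement a = x + y. The names below are hypothetical: compile returns instruction text in the slide's notation for later execution, while add stands in for the interpretive routine and updates a variable store (modelled here as a Map) immediately.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: two ways a translator might handle  target = left + right;
class AddStatementSketch {
    // Compiler: emit instructions now, run them later.
    static List<String> compile(String target, String left, String right) {
        return Arrays.asList("load r0, " + left,
                             "add r0, " + right,
                             "store r0, " + target);
    }

    // Interpreter: perform the operation right away on the current variable values.
    static void add(Map<String, Integer> memory, String left, String right, String target) {
        memory.put(target, memory.get(left) + memory.get(right));
    }

    public static void main(String[] args) {
        System.out.println(compile("a", "x", "y")); // [load r0, x, add r0, y, store r0, a]

        Map<String, Integer> memory = new HashMap<>();
        memory.put("x", 2);
        memory.put("y", 3);
        add(memory, "x", "y", "a");
        System.out.println(memory.get("a"));        // 5
    }
}
```

The compiler's output says nothing about the values of x and y; the interpreter's routine needs them at the moment the statement is analyzed.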

Stages of Translation Are Interleaved

It’s not the case that the lexical analyzer identifies all the tokens, then the parser analyzes all the tokens, and then the type/semantic analysis is performed.

Instead, the parser repeatedly calls the lexer to get another token.

As tokens are received, they either do or don’t match the expected syntax.
◦ If there’s a match, perform any type or semantic testing, possibly generate intermediate code, and call for another token.
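The pull-style interaction between parser and lexer can be sketched as follows. Both classes are hypothetical, and the "grammar" is reduced to the single shape name = name ; so that the control flow stays visible. The counter tokensProduced exists only to show how much of the input the lexer was actually asked to scan.

```java
// Hypothetical lexer that produces one token per call instead of a full list.
class OnDemandLexer {
    private final String src;
    private int pos = 0;
    int tokensProduced = 0;   // demo-only counter

    OnDemandLexer(String src) { this.src = src; }

    // Returns the next token, or null at end of input.
    String next() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return null;
        int start = pos;
        if (Character.isLetterOrDigit(src.charAt(pos))) {
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        } else {
            pos++;            // any other character is a one-character token
        }
        tokensProduced++;
        return src.substring(start, pos);
    }
}

class InterleavedParserSketch {
    // Checks the shape  name = name ;  asking the lexer for one token at a time.
    // Evaluation stops at the first token that does not match.
    static boolean parseSimpleAssignment(OnDemandLexer lexer) {
        return isName(lexer.next())
            && "=".equals(lexer.next())
            && isName(lexer.next())
            && ";".equals(lexer.next());
    }

    static boolean isName(String t) {
        return t != null && Character.isLetter(t.charAt(0));
    }

    public static void main(String[] args) {
        OnDemandLexer good = new OnDemandLexer("x = y;");
        System.out.println(parseSimpleAssignment(good) + " after " + good.tokensProduced + " tokens");
        OnDemandLexer bad = new OnDemandLexer("x y = z;");
        System.out.println(parseSimpleAssignment(bad) + " after " + bad.tokensProduced + " tokens");
    }
}
```

For the malformed input the parser gives up at the second token, so the rest of the line is never tokenized; that is the sense in which the stages are interleaved rather than run one after another.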

2.5 Linking Syntax and Semantics

Output of parser: the concrete parse tree is large – probably more than needed for the next phase
◦ The compiler usually produces some more compact representation of the program

Example: Fig. 2.9 (page 46)

Parse Tree for z = x + 2*y; (Fig. 2.9)

Finding a More Efficient Tree

The shape of the parse tree reveals the meaning of the program.

So as output of syntax analysis we want a tree that removes the parse tree’s inefficiency and keeps its meaning.
◦ Remove separator/punctuation terminal symbols
◦ Remove all trivial nonterminals
◦ Replace remaining nonterminals with leaf terminals

Example: Fig. 2.10

Compare: concrete & abstract parse trees for z = x + 2*y

Abstract Syntax

Removes unnecessary details but keeps the essential language elements; e.g., consider the following two equivalent loops:

Pascal                    C++
while j < n do begin      while (j < n) {
    j := j + 1;               j = j + 1;
end;                      }

Essential information: 1) it is a loop, 2) its terminating condition is j >= n, and 3) its body increments the current value of j.

Abstract Syntax Tree

• Purpose: an intermediate form of the source code
• Generated by the parser during syntax analysis
• Used during type checking/semantic analysis
• Abstract syntax rules are defined by the compiler (or interpreter).
• One concrete syntax can have several abstract syntaxes associated with it, depending on the design of the translator.

General Form of Abstract Syntax Rules

LHS = RHS
LHS names an abstract syntax class.
RHS either (1) gives a list of one or more alternatives or (2) lists the essential elements of the syntax class.
Compare to production rules in the concrete grammar.

Simplified Abstract Assignment Syntax (Figure 2.11)

Assignment = Variable target; Expression source
Expression = Variable | Value | Binary | Unary
Variable = String id
Value = Integer value
Binary = Operator op; Expression term1, term2
Unary = UnaryOp op; Expression term
Operator = + | - | * | / | !

Priority? Associativity? …

Compare Concrete to Abstract

• Concrete: Assignment → Identifier [ [ Expression ] ] = Expression ;
• Abstract: Assignment = Variable target; Expression source

Abstract Syntax Tree for z = x + 2 * y

target: Variable z
source: Binary
    Operator +
    Variable x
    Binary
        Operator *
        Value 2
        Variable y
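The tree for z = x + 2 * y can also be built by hand from classes shaped like the simplified rules of Figure 2.11. This is a sketch under stated assumptions: the constructors and toString methods are additions for the demo, the operator is stored as a plain String rather than an Operator class, and Unary is omitted.

```java
// Hypothetical classes modelled on the simplified abstract syntax rules.
abstract class Expression { }

class Variable extends Expression {
    final String id;
    Variable(String id) { this.id = id; }
    public String toString() { return id; }
}

class Value extends Expression {
    final int value;
    Value(int value) { this.value = value; }
    public String toString() { return Integer.toString(value); }
}

class Binary extends Expression {
    final String op;                    // simplified: a String instead of an Operator class
    final Expression term1, term2;
    Binary(String op, Expression term1, Expression term2) {
        this.op = op; this.term1 = term1; this.term2 = term2;
    }
    public String toString() { return "Binary(" + op + ", " + term1 + ", " + term2 + ")"; }
}

class Assignment {
    final Variable target;
    final Expression source;
    Assignment(Variable target, Expression source) {
        this.target = target; this.source = source;
    }
    public String toString() { return "Assignment(" + target + ", " + source + ")"; }
}

class AstSketch {
    // Builds the tree for  z = x + 2 * y ; the multiplication is nested inside
    // the addition, so precedence is encoded by the tree shape alone.
    static Assignment build() {
        return new Assignment(
            new Variable("z"),
            new Binary("+",
                new Variable("x"),
                new Binary("*", new Value(2), new Variable("y"))));
    }

    public static void main(String[] args) {
        System.out.println(build()); // Assignment(z, Binary(+, x, Binary(*, 2, y)))
    }
}
```

No parentheses, semicolons or intermediate nonterminals survive in the tree, yet nothing needed for type checking or code generation has been lost.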

Binaries and Unaries

Binary node: op, term1, term2
Unary node: op, term

Binaries and unaries represent information that can be used for later processing.

Abstract Syntax of Clite Assignments – Fig. 2.14

Assignment = Variable target; Expression source
Expression = VariableRef | Value | Binary | Unary
VariableRef = Variable | ArrayRef
Variable = String id
ArrayRef = String id; Expression index
Value = IntValue | BoolValue | FloatValue | CharValue
Binary = Operator op; Expression term1, term2
Unary = UnaryOp op; Expression term
Operator = ArithmeticOp | RelationalOp | BooleanOp
IntValue = Integer intValue
…

Abstract Syntax as Java Classes - Examples

// Expression = VariableRef | Value | Binary | Unary
abstract class Expression { }
// VariableRef = Variable | ArrayRef
abstract class VariableRef extends Expression { }
class Variable extends VariableRef { String id; }
class ArrayRef extends VariableRef { String id; Expression index; }
class Value extends Expression { … }
class Binary extends Expression {
    Operator op;
    Expression term1, term2;
}
class Unary extends Expression {
    UnaryOp op;
    Expression term;
}

Remaining Abstract Syntax of Clite (Declarations and Statements) – Fig. 2.14

Summary

Lexical syntax: small, simple; defines language tokens

Concrete syntax: detailed, specific; defines correct programs; used to direct parsing algorithms (language specific, not implementation specific)

Abstract syntax: simpler than concrete; used to describe the structure of the intermediate code (implementation specific, not language specific). Not intended to be used for parsing.

Semantics: program “meaning”, or runtime behavior

Summary

A syntax is ambiguous if a portion of a program has two or more possible interpretations (parse trees).

Unambiguous grammars can be written, but some ambiguity may be tolerated to reduce grammar size.

Operator associativity and precedence can be defined by the concrete syntax.

Compilers generate code for later execution, while interpreters execute program statements as they are analyzed.
