Course Overview

Syntax Analysis (Chapter 4)

Course Overview

PART I: overview material

1 Introduction

2 Language processors (tombstone diagrams, bootstrapping)

3 Architecture of a compiler

PART II: inside a compiler

4 Syntax analysis

5 Contextual analysis

6 Runtime organization

7 Code generation

PART III: conclusion

8 Interpretation

9 Review


The “Phases” of a Compiler

[Figure: Source Program → Syntax Analysis → Abstract Syntax Tree → Contextual Analysis → Decorated Abstract Syntax Tree → Code Generation → Object Code. Syntax Analysis and Contextual Analysis also produce Error Reports. Syntax Analysis is marked "This chapter".]


In Chapter 4

• Syntax Analysis

– Scanning: recognize “words” or “tokens” in the input

– Parsing: recognize structure of program

• Different parsing strategies

• How to construct a recursive descent parser

– AST Construction

• Use of theoretical “Tools”:

– Regular Expressions and Finite-State Machines

– Grammars

– Extended BNF notation

– First sets and Follow sets


Syntax Analysis

• The “job” of syntax analysis is to read the source program (text file) and determine its structure.

• Subphases

– Scanning

– Parsing

– Construct an internal representation of the source text that shows the structure (usually an AST)

Note: A single-pass compiler usually does not explicitly construct an AST.


Multi Pass Compiler

Dependency diagram of a typical multi-pass compiler:

[Figure: the Compiler Driver calls the Syntactic Analyzer, the Contextual Analyzer and the Code Generator. The Syntactic Analyzer takes the Source Text as input and outputs an AST; the Contextual Analyzer takes the AST and outputs a Decorated AST; the Code Generator takes the Decorated AST and outputs Object Code. The Syntactic Analyzer is marked "This chapter".]

A multi-pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases.


Syntax Analysis

[Figure: dataflow chart. Source Program (stream of characters) → Scanner → stream of “tokens” → Parser → Abstract Syntax Tree. Both the Scanner and the Parser produce Error Reports.]


(1) Scan: Divide Input into Tokens

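An example Mini-Triangle source program:

    let var y: Integer
    in !new year
       y := y+1

The scanner turns this text into the following token stream (token kind, then spelling). The comment !new year produces no token.

    let       let
    var       var
    ident.    y
    colon     :
    ident.    Integer
    in        in
    ident.    y
    becomes   :=
    ident.    y
    op.       +
    intlit    1
    eot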

Tokens are “words” in the input, for example keywords, operators, identifiers, literals, etc.
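The scanning step can be sketched in a few lines of Python. This is a minimal illustration, not the scanner used in the course: the token kind names follow the labels in the example above, while the function name `scan`, the operator character set, and the use of the `re` module are assumptions made here. Only the three keywords that appear in the example are recognized.

```python
import re

# Assumption: only the keywords used in the example program are listed.
KEYWORDS = {"let", "var", "in"}

# Each entry is (token kind, regular expression). Order matters:
# ":=" must be tried before the single ":".
TOKEN_SPEC = [
    ("comment", r"![^\n]*"),              # a comment runs from ! to end of line
    ("space",   r"\s+"),
    ("intlit",  r"[0-9]+"),
    ("ident",   r"[A-Za-z][A-Za-z0-9]*"),
    ("becomes", r":="),
    ("colon",   r":"),
    ("op",      r"[+\-*/<>=\\]"),         # assumed operator characters
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def scan(source):
    """Return a list of (kind, spelling) pairs, ending with an eot token."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, spelling = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("comment", "space"):
            continue                      # separators are not passed to the parser
        if kind == "ident" and spelling in KEYWORDS:
            kind = spelling               # a keyword is its own token kind
        tokens.append((kind, spelling))
    tokens.append(("eot", ""))
    return tokens


program = "let var y: Integer\nin !new year\n   y := y+1"
for token in scan(program):
    print(token)
```

Each token kind is described by a regular expression, which is why regular expressions appear among the theoretical tools of this chapter.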


(2) Parse: Determine structure of program

Parser analyzes the structure of the token stream with respect to the grammar of the language.

[Figure: parse tree for the token stream let var y : Integer in y := y + 1 eot. The leaf tokens are grouped into Ident, Op. and Int.Lit nodes; above them sit Type Denoter, V-Name, primary-Exp, single-Declaration, Declaration, Expression and single-Command nodes, with Program at the root.]


(3) AST Construction

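[Figure: AST for the example program. Program has a LetCommand child. The LetCommand has a VarDecl (Ident y, SimpleType with Ident Integer) and an AssignCommand. The AssignCommand has a SimpleVar (Ident y) and a BinaryExpr whose children are a VNameExp (SimpleVar, Ident y), the Op +, and an Int.Expr (Int.Lit 1).]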
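As a rough illustration of what "constructing an AST" means in code, the tree in the figure can be written down as nested objects. The class names follow the node labels in the figure; the field names (`declaration`, `body`, `target`, `value`, and so on) are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Ident:
    spelling: str


@dataclass
class Op:
    spelling: str


@dataclass
class IntLit:
    spelling: str


@dataclass
class SimpleType:
    name: Ident


@dataclass
class VarDecl:
    name: Ident
    type_denoter: SimpleType


@dataclass
class SimpleVar:
    name: Ident


@dataclass
class VNameExp:
    vname: SimpleVar


@dataclass
class IntExpr:
    literal: IntLit


@dataclass
class BinaryExpr:
    left: object
    op: Op
    right: object


@dataclass
class AssignCommand:
    target: SimpleVar
    value: object


@dataclass
class LetCommand:
    declaration: VarDecl
    body: object


@dataclass
class Program:
    command: object


# AST for:  let var y: Integer in y := y+1
tree = Program(
    LetCommand(
        VarDecl(Ident("y"), SimpleType(Ident("Integer"))),
        AssignCommand(
            SimpleVar(Ident("y")),
            BinaryExpr(VNameExp(SimpleVar(Ident("y"))), Op("+"), IntExpr(IntLit("1"))),
        ),
    )
)
print(tree)
```

Note that punctuation tokens such as let, in, : and := do not appear in the tree: the node classes already record the structure those tokens expressed.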


Grammars

RECAP:

– The syntax of a language can be specified by means of a CFG (Context-Free Grammar).

– A CFG can be expressed in BNF (Backus-Naur Form).

Example: Mini-Triangle grammar in BNF

    Program ::= single-Command
    Command ::= single-Command
              | Command ; single-Command
    single-Command ::= V-name := Expression
                     | begin Command end
                     | ...


Grammars (continued)

For our convenience, we will use EBNF or “Extended BNF” rather than simple BNF.

EBNF = BNF + regular expressions

Example: Mini-Triangle in EBNF

    Program ::= single-Command
    Command ::= (single-Command ;)* single-Command
    single-Command ::= V-name := Expression
                     | begin Command end
                     | ...

Here * means 0 or more occurrences of the parenthesized group that precedes it.


Regular Expressions

• REs are a notation for expressing a set of strings of terminal symbols.

Different kinds of RE:

    ε       The empty string
    t       Generates only the string t
    X Y     Generates any string xy such that x is generated by X and y is generated by Y
    X | Y   Generates any string which is generated either by X or by Y
    X*      The concatenation of zero or more strings generated by X
    (X)     Used for grouping


RE: Examples

What sets of strings do each of the following RE generate?

1.

2. M(r|s)“.”

3. (foo|bar)*

4. (foo|bar)(foo|bar)*

5. (0|1|2|3|4|5|6|7|8|9)*

6. 0|(1|..|9)(0|1|..|9)*
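One way to explore these questions is to try candidate strings against each expression. The sketch below uses Python's `re` module; the translation from the slide notation to Python syntax is an assumption made here (the quoted “.” becomes an escaped dot, and 1|..|9 is written out in full). Example 1 is not included, and the helper name `generates` is hypothetical.

```python
import re

# Python spellings of examples 2 to 6.
EXAMPLES = {
    2: r"M(r|s)\.",
    3: r"(foo|bar)*",
    4: r"(foo|bar)(foo|bar)*",
    5: r"(0|1|2|3|4|5|6|7|8|9)*",
    6: r"0|(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*",
}


def generates(number, text):
    """True if the whole of text is generated by example `number`."""
    return re.fullmatch(EXAMPLES[number], text) is not None


# A few probes; try your own strings as well.
for number, text in [(2, "Ms."), (3, ""), (4, ""), (5, "007"), (6, "007")]:
    print(number, repr(text), generates(number, text))
```

`re.fullmatch` is used rather than `re.match` because an RE generates a set of whole strings; a match on a prefix would not answer the question.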



Regular Expressions

• The “languages” that can be defined by REs and CFGs have been extensively studied by theoretical computer scientists. These are some important conclusions and terminology:

– RE is a “weaker” formalism than CFG: any language expressible by an RE can be expressed by a CFG, but not the other way around!

– The languages expressible as RE are called regular languages

– Generally: a language that exhibits “self–embedding” cannot be expressed by RE.

– Programming languages exhibit self–embedding. (Examples: an expression can contain another expression, and a command can contain another command).


Extended BNF

• Extended BNF combines BNF with RE

• A production in EBNF looks like

LHS ::= RHS

where LHS is a non-terminal symbol and RHS is an extended regular expression

• An extended RE is just like a regular expression except it is composed of terminals and non-terminals of the grammar.

• Simply put, EBNF adds to BNF these notations

– (...) for the purpose of grouping and

– * for denoting “0 or more repetitions of …”


Extended BNF: an Example

Example: a simple expression language

    Expression ::= PrimaryExp (Operator PrimaryExp)*
    PrimaryExp ::= Literal | Identifier | ( Expression )
    Identifier ::= Letter (Letter | Digit)*
    Literal    ::= Digit Digit*
    Letter     ::= a | b | c | ... | z
    Digit      ::= 0 | 1 | 2 | 3 | 4 | ... | 9
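An EBNF grammar of this shape maps quite directly onto code: each non-terminal becomes a procedure, | becomes a choice, and * becomes a loop. The recognizer below is a minimal sketch of that idea for the expression language. It makes several assumptions: Operator is not defined in the grammar, so the set + - * / is assumed; input is a plain string with no whitespace; and the names `Recognizer` and `recognizes` are hypothetical.

```python
OPERATORS = set("+-*/")  # assumption: the grammar leaves Operator undefined


def is_letter(ch):
    return "a" <= ch <= "z"


def is_digit(ch):
    return "0" <= ch <= "9"


class Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        """Current character, or the empty string at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def take(self):
        self.pos += 1

    def expression(self):
        # Expression ::= PrimaryExp (Operator PrimaryExp)*
        self.primary_exp()
        while self.peek() in OPERATORS:
            self.take()
            self.primary_exp()

    def primary_exp(self):
        # PrimaryExp ::= Literal | Identifier | ( Expression )
        ch = self.peek()
        if is_digit(ch):                      # Literal ::= Digit Digit*
            while is_digit(self.peek()):
                self.take()
        elif is_letter(ch):                   # Identifier ::= Letter (Letter | Digit)*
            self.take()
            while is_letter(self.peek()) or is_digit(self.peek()):
                self.take()
        elif ch == "(":
            self.take()
            self.expression()
            if self.peek() != ")":
                raise SyntaxError(f"expected ) at offset {self.pos}")
            self.take()
        else:
            raise SyntaxError(f"unexpected {ch!r} at offset {self.pos}")


def recognizes(text):
    """True if the whole of text is an Expression."""
    parser = Recognizer(text)
    try:
        parser.expression()
    except SyntaxError:
        return False
    return parser.pos == len(text)


print(recognizes("a+(b2*3)"))
print(recognizes("a+"))
```

The choice in `primary_exp` is made by looking at the next character only, which works because the three alternatives start with different characters.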


A little bit of useful theory

• We will now look at a few useful bits of theory. These will be necessary later when we implement parsers.

– Grammar transformations

• A grammar can be transformed in a number of ways without changing its meaning (i.e. its language, or the set of strings that it generates)

– The definition and computation of starter sets (first sets), follow sets, and nullable symbols


Grammar Transformations

Left factorization

    X Y | X Z    is transformed to    X ( Y | Z )

Example:

    single-Command ::= V-name := Expression
                     | if Expression then single-Command
                     | if Expression then single-Command else single-Command

is transformed to

    single-Command ::= V-name := Expression
                     | if Expression then single-Command ( ε | else single-Command )

Here X is "if Expression then single-Command", Y is ε and Z is "else single-Command".
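The transformation X Y | X Z to X ( Y | Z ) is mechanical enough to sketch in code. The helper below handles just two alternatives, each given as a list of symbols, and takes X to be their longest common prefix. The function names and the list-of-symbols representation are assumptions made for this illustration.

```python
def common_prefix(first, second):
    """Longest sequence of symbols that starts both alternatives."""
    n = 0
    while n < len(first) and n < len(second) and first[n] == second[n]:
        n += 1
    return first[:n]


def show(symbols):
    """Render a symbol sequence; the empty sequence is written ε."""
    return " ".join(symbols) if symbols else "ε"


def left_factor(first, second):
    """Rewrite X Y | X Z as X ( Y | Z ), where X is the common prefix."""
    x = common_prefix(first, second)
    if not x:
        return f"{show(first)} | {show(second)}"   # nothing to factor out
    y, z = first[len(x):], second[len(x):]
    return f"{show(x)} ( {show(y)} | {show(z)} )"


short_if = "if Expression then single-Command".split()
long_if = "if Expression then single-Command else single-Command".split()
print(left_factor(short_if, long_if))
```

After factoring, a parser that has read X no longer has to decide between two alternatives that look identical so far; it only has to choose between Y and Z.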


Grammar Transformations (continued)

Elimination of Left Recursion

    N ::= X | N Y    is transformed to    N ::= X Y*

Example:

    Identifier ::= Letter
                 | Identifier Letter
                 | Identifier Digit

is first left-factorized to

    Identifier ::= Letter
                 | Identifier (Letter | Digit)

and then, with the left recursion eliminated, becomes

    Identifier ::= Letter (Letter | Digit)*
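The rule N ::= X | N Y to N ::= X Y* can also be applied mechanically. In the sketch below a production is given as a list of alternatives, each a list of symbols: the alternatives that do not start with N play the role of X, and what follows N in the left-recursive alternatives plays the role of Y. The function names and the output format are assumptions of this illustration, and the sketch only handles direct left recursion.

```python
def group(alternatives):
    """Render alternatives, parenthesized unless they form a single symbol."""
    text = " | ".join(" ".join(alt) for alt in alternatives)
    needs_parens = len(alternatives) > 1 or len(alternatives[0]) > 1
    return f"({text})" if needs_parens else text


def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite N ::= X | N Y as N ::= X Y* and return it as EBNF text."""
    bases = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    if not tails:                      # not left-recursive: nothing to do
        return f"{nonterminal} ::= " + " | ".join(" ".join(alt) for alt in alternatives)
    if not bases:
        raise ValueError("no non-recursive alternative: N generates no strings")
    return f"{nonterminal} ::= {group(bases)} {group(tails)}*"


print(eliminate_left_recursion(
    "Identifier",
    [["Letter"], ["Identifier", "Letter"], ["Identifier", "Digit"]],
))
```

The transformation matters for recursive descent parsing: a procedure for a left-recursive rule would call itself before consuming any input, whereas X Y* becomes a simple loop.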


Grammar Transformations (continued)

Substitution of non-terminal symbols

    N ::= X          is transformed to    N ::= X
    M ::= N                               M ::= X

The same substitution applies when N occurs inside a longer right-hand side, as in the example.

Example:

    single-Command ::= for controlVar := Expression direction Expression do single-Command
    direction ::= to | downto

is transformed to

    single-Command ::= for controlVar := Expression (to | downto) Expression do single-Command


Starter Sets (a.k.a. First Sets)

Informal Definition:

The starter set of a RE X is the set of terminal symbols that can occur as the start of any string generated by X

Example:

    starters[ (“+” | - | ε) (0 | 1 | … | 9)+ ] = {+, -, 0, 1, …, 9}

Formal Definition:

    starters[ ε ] = { }
    starters[ t ] = {t}    (where t is any terminal symbol)
    starters[ X Y ] = starters[ X ]    (if X doesn’t generate ε)
    starters[ X Y ] = starters[ X ] ∪ starters[ Y ]    (if X generates ε)
    starters[ X | Y ] = starters[ X ] ∪ starters[ Y ]
    starters[ X* ] = starters[ X ]
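The formal definition reads almost like a recursive function, with one case per kind of RE. The sketch below follows it case by case. The representation of an RE as nested tuples and the helper `generates_empty` (needed to decide which of the two X Y cases applies) are assumptions of this illustration. The postfix + in the example is not one of the RE forms defined earlier, so it is modelled here as X X*.

```python
from functools import reduce

# An RE is a nested tuple: ("eps",), ("term", t), ("seq", X, Y),
# ("alt", X, Y) or ("star", X).
EPS = ("eps",)


def term(t):
    return ("term", t)


def seq(x, y):
    return ("seq", x, y)


def alt(x, y):
    return ("alt", x, y)


def star(x):
    return ("star", x)


def generates_empty(e):
    """True if the RE e can generate the empty string."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind == "term":
        return False
    if kind == "seq":
        return generates_empty(e[1]) and generates_empty(e[2])
    if kind == "alt":
        return generates_empty(e[1]) or generates_empty(e[2])
    raise ValueError(f"unknown RE kind {kind!r}")


def starters(e):
    """Set of terminals that can start a string generated by e."""
    kind = e[0]
    if kind == "eps":
        return set()
    if kind == "term":
        return {e[1]}
    if kind == "seq":
        first = starters(e[1])
        return first | starters(e[2]) if generates_empty(e[1]) else first
    if kind == "alt":
        return starters(e[1]) | starters(e[2])
    if kind == "star":
        return starters(e[1])
    raise ValueError(f"unknown RE kind {kind!r}")


# The example above: (+ | - | ε) followed by one or more digits.
digit = reduce(alt, [term(d) for d in "0123456789"])
sign = alt(term("+"), alt(term("-"), EPS))
number = seq(sign, seq(digit, star(digit)))
print(sorted(starters(number)))
```

Because the sign part can generate ε, the digits end up in the starter set as well, which is the second X Y case of the definition at work.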


Derivations

• Each derivation step replaces a non-terminal by the right-hand side of one of its productions

    S ::= E
    E ::= T | E + T
    T ::= i | ( E )

    S => E => E + T => T + T => i + T => i + i

• This is a left-most derivation (it replaces the left-most non-terminal at each step).

• Can you find the corresponding right-most derivation?

• Can you find a derivation that is neither left-most nor right-most?
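A derivation can be replayed mechanically, which is a convenient way to check an answer to the questions above. In this sketch a derivation is described by the list of productions used, in order; the function always rewrites the left-most (or, on request, the right-most) non-terminal and complains if the production does not fit. The grammar encoding and the function name `derivation` are assumptions of this illustration.

```python
# Alternatives of each non-terminal, indexed from 0 in the order written above.
GRAMMAR = {
    "S": [["E"]],
    "E": [["T"], ["E", "+", "T"]],
    "T": [["i"], ["(", "E", ")"]],
}


def derivation(choices, rightmost=False):
    """Replay a derivation from S.

    choices is a list of (non-terminal, alternative index) pairs, one per step.
    """
    form = ["S"]
    steps = [form]
    for nonterminal, alternative in choices:
        positions = [k for k, symbol in enumerate(form) if symbol in GRAMMAR]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        k = positions[-1] if rightmost else positions[0]
        if form[k] != nonterminal:
            raise ValueError(f"expected to replace {form[k]}, not {nonterminal}")
        form = form[:k] + GRAMMAR[nonterminal][alternative] + form[k + 1:]
        steps.append(form)
    return " => ".join(" ".join(step) for step in steps)


leftmost = [("S", 0), ("E", 1), ("E", 0), ("T", 0), ("T", 0)]
print(derivation(leftmost))
```

Passing `rightmost=True` with a suitably reordered list of choices replays a right-most derivation in the same way.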


Sentential forms

• A sentential form is a sequence of grammar symbols that can be derived from the start symbol

• A sentence is a sentential form that contains only terminal symbols, that is, a string that can be generated using the grammar.

    S => E => E + T => T + T => i + T => i + i

Every form in this derivation is a sentential form; only the last one, i + i, is a sentence.


Ambiguous grammars

A grammar is ambiguous if some sentence has more than one distinct parse tree.

Equivalently, a grammar is ambiguous if some sentence has more than one left-most derivation, or more than one right-most derivation.

    S ::= E
    E ::= i | ( E ) | E + E

Does i + i demonstrate the ambiguity?

    E => E + E => i + E => i + i

This is its only left-most derivation, so i + i does not demonstrate the ambiguity.

Does i + i + i demonstrate an ambiguity?

    E => E + E => i + E => i + E + E => i + i + E => i + i + i

    E => E + E => E + E + E => i + E + E => i + i + E => i + i + i

These are two distinct left-most derivations of the same sentence, so i + i + i does demonstrate that the grammar is ambiguous.
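For a grammar this small, ambiguity can be checked by brute force: count how many distinct parse trees a sentence has. The sketch below does this for E ::= i | ( E ) | E + E by trying every production on every substring and memoizing the results. It is hard-wired to this one grammar, and the function name `count_parse_trees` is hypothetical.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Number of distinct parse trees deriving tokens from E."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        """Number of parse trees deriving tokens[lo:hi] from E."""
        total = 0
        if hi - lo == 1 and tokens[lo] == "i":
            total += 1                                  # E ::= i
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            total += count(lo + 1, hi - 1)              # E ::= ( E )
        for k in range(lo + 1, hi - 1):                 # E ::= E + E, split at k
            if tokens[k] == "+":
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


for sentence in ["i + i", "i + i + i"]:
    print(sentence, "has", count_parse_trees(sentence.split()), "parse tree(s)")
```

A count greater than one for any sentence shows that the grammar is ambiguous; a count of one for a particular sentence says nothing about the grammar as a whole.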


Augmented grammars

We augment grammars to ensure that we can recognize and handle the end of the input string

    S ::= E
    E ::= i | ( E ) | E + E

is augmented to

    S' ::= S $
    S ::= E
    E ::= i | ( E ) | E + E

Here $ denotes the end-of-file token


Nullable, First sets (starter sets), and Follow sets

• A non-terminal is nullable if it derives the empty string

• First(N) or starters(N) is the set of all terminals that can begin a sentence derived from N

• Follow(N) is the set of terminals that can follow N in some sentential form

Next we will see algorithms to compute each of these.
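As a preview, the three definitions can be turned into one fixed-point computation: keep applying the rules to every production until nothing changes. The sketch below is one common textbook formulation, not necessarily the algorithm presented in the course. It works on plain BNF productions, and the grammar encoding and the function name `analyse` are assumptions made here. The example grammar is the augmented form of the S, E, T grammar used for derivations earlier.

```python
# Each production is (left-hand side, list of right-hand-side symbols).
GRAMMAR = [
    ("S'", ["S", "$"]),
    ("S", ["E"]),
    ("E", ["T"]),
    ("E", ["E", "+", "T"]),
    ("T", ["i"]),
    ("T", ["(", "E", ")"]),
]


def analyse(grammar):
    """Return (nullable, first, follow) for the non-terminals of grammar."""
    nonterminals = {lhs for lhs, _ in grammar}
    nullable = set()
    first = {n: set() for n in nonterminals}
    follow = {n: set() for n in nonterminals}

    def first_of(symbol):
        return first[symbol] if symbol in nonterminals else {symbol}

    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for lhs, rhs in grammar:
            # Nullable: every symbol on the right-hand side is nullable.
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
            # First: starters of the right-hand side, skipping nullable prefixes.
            for symbol in rhs:
                new = first_of(symbol) - first[lhs]
                if new:
                    first[lhs] |= new
                    changed = True
                if symbol not in nullable:
                    break
            # Follow: what can come after each non-terminal on the right-hand side.
            for k, symbol in enumerate(rhs):
                if symbol not in nonterminals:
                    continue
                trailer = set()
                rest_nullable = True
                for later in rhs[k + 1:]:
                    trailer |= first_of(later)
                    if later not in nullable:
                        rest_nullable = False
                        break
                if rest_nullable:
                    trailer |= follow[lhs]
                new = trailer - follow[symbol]
                if new:
                    follow[symbol] |= new
                    changed = True
    return nullable, first, follow


nullable, first, follow = analyse(GRAMMAR)
for n in ["S", "E", "T"]:
    print(n, "First:", sorted(first[n]), "Follow:", sorted(follow[n]))
```

The end-of-file token $ of the augmented grammar is what puts $ into Follow(S), and through it into Follow(E) and Follow(T).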
