Syntax Analysis (Chapter 4)

Course Overview

PART I: overview material
1 Introduction
2 Language processors (tombstone diagrams, bootstrapping)
3 Architecture of a compiler

PART II: inside a compiler
4 Syntax analysis
5 Contextual analysis
6 Runtime organization
7 Code generation

PART III: conclusion
8 Interpretation
9 Review

The “Phases” of a Compiler

[Figure: Source Program → Syntax Analysis → Abstract Syntax Tree → Contextual Analysis → Decorated Abstract Syntax Tree → Code Generation → Object Code. Syntax Analysis and Contextual Analysis also produce Error Reports. Syntax Analysis is marked “This chapter”.]

In Chapter 4

• Syntax Analysis
  – Scanning: recognize “words” or “tokens” in the input
  – Parsing: recognize structure of program
    • Different parsing strategies
    • How to construct a recursive descent parser
  – AST Construction
• Use of theoretical “Tools”:
  – Regular Expressions and Finite-State Machines
  – Grammars
  – Extended BNF notation
  – First sets and Follow sets

Syntax Analysis

• The “job” of syntax analysis is to read the source program (text file) and determine its structure.
• Subphases
  – Scanning
  – Parsing
  – Construct an internal representation of the source text that shows the structure (usually an AST)

Note: A single-pass compiler usually does not explicitly construct an AST.

Multi Pass Compiler

A multi pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases.

Dependency diagram of a typical Multi Pass Compiler:

[Figure: the Compiler Driver calls the Syntactic Analyzer, the Contextual Analyzer, and the Code Generator.]

Phase                 Input           Output
Syntactic Analyzer    Source Text     AST
Contextual Analyzer   AST             Decorated AST
Code Generator        Decorated AST   Object Code

This chapter covers the Syntactic Analyzer.

Syntax Analysis

Dataflow chart:

[Figure: Source Program (stream of characters) → Scanner → stream of “tokens” → Parser → Abstract Syntax Tree. The Scanner and the Parser each produce Error Reports.]

(1) Scan: Divide Input into Tokens

An example Mini-Triangle source program:

let var y: Integer
in !new year
   y := y+1

The scanner divides it into the following tokens:

kind:      let  var  ident.  colon  ident.   in  ident.  becomes  ident.  op.  intlit  eot
spelling:  let  var  y       :      Integer  in  y       :=       y       +    1

Tokens are “words” in the input, for example keywords, operators, identifiers, literals, etc.
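The scanning step above can be sketched in a few lines of Python. This is a minimal sketch, not the course's scanner: the names `Token`, `KEYWORDS`, `TOKEN_RE`, and `scan` are made up for illustration, only the keywords used in the example are listed, and the operator set is an assumption.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    spelling: str


# Assumption: only the keywords that the example program needs.
KEYWORDS = {"let", "var", "in"}

# Alternatives are tried in order, so ":=" is listed before ":".
TOKEN_RE = re.compile(r"""
    (?P<skip>\s+|![^\n]*)             # white space, or a comment up to end of line
  | (?P<word>[A-Za-z][A-Za-z0-9]*)    # identifier or keyword
  | (?P<intlit>[0-9]+)                # integer literal
  | (?P<becomes>:=)
  | (?P<colon>:)
  | (?P<op>[+\-*/<>=])                # assumed operator characters
""", re.VERBOSE)


def scan(source: str) -> Iterator[Token]:
    """Yield the tokens of source, followed by an end-of-text token."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "skip":
            continue
        if kind == "word":
            kind = text if text in KEYWORDS else "ident"
        yield Token(kind, text)
    yield Token("eot", "")


if __name__ == "__main__":
    program = "let var y: Integer\nin !new year\n   y := y+1"
    for token in scan(program):
        print(token.kind, repr(token.spelling))
```

The comment `!new year` is discarded along with white space, which is why it never shows up in the token stream.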

(2) Parse: Determine structure of program

Parser analyzes the structure of the token stream with respect to the grammar of the language.

[Figure: parse tree over the token stream let var y : Integer in y := y + 1 eot. Leaves: Ident (y), Ident (Integer), Ident (y), Ident (y), Op. (+), Int.Lit (1). Interior nodes: Type Denoter, V-Name (twice), single-Declaration, Declaration, primary-Exp (twice), Expression, single-Command (twice). Root: Program.]

(3) AST Construction

[Figure: AST for the example program.]

Program
  LetCommand
    VarDecl
      Ident (y)
      SimpleType
        Ident (Integer)
    AssignCommand
      SimpleVar
        Ident (y)
      BinaryExpr
        VNameExp
          SimpleVar
            Ident (y)
        Op (+)
        Int.Expr
          Int.Lit (1)

Grammars

RECAP:
– The syntax of a language can be specified by means of a CFG (Context-Free Grammar).
– A CFG can be expressed in BNF (Backus-Naur Form).

Example: Mini-Triangle grammar in BNF

Program        ::= single-Command
Command        ::= single-Command
                 | Command ; single-Command
single-Command ::= V-name := Expression
                 | begin Command end
                 | ...

Grammars (continued)

For our convenience, we will use EBNF or “Extended BNF” rather than simple BNF.

EBNF = BNF + regular expressions

Example: Mini-Triangle in EBNF

Program        ::= single-Command
Command        ::= (single-Command ;)* single-Command
single-Command ::= V-name := Expression
                 | begin Command end
                 | ...

Here * means 0 or more occurrences of the parenthesized group (single-Command ;).

Regular Expressions

• REs are a notation for expressing a set of strings of terminal symbols.

Different kinds of RE:

ε       The empty string
t       Generates only the string t
X Y     Generates any string xy such that x is generated by X and y is generated by Y
X | Y   Generates any string which is generated either by X or by Y
X*      The concatenation of zero or more strings generated by X
(X)     Used for grouping

RE: Examples

What set of strings does each of the following REs generate?

1.

2. M(r|s)“.”

3. (foo|bar)*

4. (foo|bar)(foo|bar)*

5. (0|1|2|3|4|5|6|7|8|9)*

6. 0|(1|..|9)(0|1|..|9)*

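One way to experiment with these exercises is Python's `re` module, whose `|`, `*`, and parentheses behave like the RE notation used here. The sketch below covers items 2 to 6 as they appear in the list. The `..` shorthand is spelled out as the full digit range, and the quoted “.” is written `\.` because an unescaped dot matches any character in Python. The names `EXAMPLES` and `generates` are illustrative.

```python
import re

# Items 2 to 6 from the list, transcribed into Python regex syntax.
EXAMPLES = {
    2: r"M(r|s)\.",
    3: r"(foo|bar)*",
    4: r"(foo|bar)(foo|bar)*",
    5: r"(0|1|2|3|4|5|6|7|8|9)*",
    6: r"0|(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*",
}


def generates(number: int, s: str) -> bool:
    """True if RE number generates the whole string s."""
    return re.fullmatch(EXAMPLES[number], s) is not None


if __name__ == "__main__":
    for number, candidates in [
        (2, ["Mr.", "Ms.", "Mrs."]),
        (3, ["", "foobar"]),
        (4, ["", "foobar"]),
        (5, ["", "007"]),
        (6, ["0", "10", "007"]),
    ]:
        for s in candidates:
            print(number, repr(s), generates(number, s))
```

Comparing items 3 and 4 on the empty string, and items 5 and 6 on `007`, shows what the extra leading factor contributes in each case.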

Regular Expressions

• The “languages” that can be defined by RE and CFG have been extensively studied by theoretical computer scientists. These are some important conclusions and terminology:
  – RE is a “weaker” formalism than CFG: any language expressible by an RE can be expressed by a CFG, but not the other way around!
  – The languages expressible as RE are called regular languages.
  – Generally, a language that exhibits “self-embedding” cannot be expressed by RE.
  – Programming languages exhibit self-embedding. (Examples: an expression can contain another expression, and a command can contain another command.)

Extended BNF

• Extended BNF combines BNF with RE
• A production in EBNF looks like

    LHS ::= RHS

  where LHS is a non-terminal symbol and RHS is an extended regular expression
• An extended RE is just like a regular expression except it is composed of terminals and non-terminals of the grammar.
• Simply put, EBNF adds to BNF these notations
  – (...) for the purpose of grouping and
  – * for denoting “0 or more repetitions of … ”

Extended BNF: an Example

Example: a simple expression language

Expression ::= PrimaryExp (Operator PrimaryExp)*
PrimaryExp ::= Literal | Identifier | ( Expression )
Identifier ::= Letter (Letter|Digit)*
Literal    ::= Digit Digit*
Letter     ::= a | b | c | ... | z
Digit      ::= 0 | 1 | 2 | 3 | 4 | ... | 9
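To see how EBNF maps onto code, here is a minimal recognizer for the expression language above, written as a sketch under stated assumptions. Each production becomes one method, each `*` becomes a `while` loop, and each `|` becomes an `if`/`elif` chain. The grammar leaves Operator undefined, so the set `OPERATORS` is an assumption; the sketch works directly on characters and does not skip white space. The names `Recognizer` and `recognizes` are illustrative.

```python
OPERATORS = set("+-*/")  # assumption: the grammar does not define Operator


def is_letter(ch: str) -> bool:
    return "a" <= ch <= "z"


def is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


class Recognizer:
    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def peek(self) -> str:
        """Current character, or "" at the end of the input."""
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def accept(self, expected: str) -> None:
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r} at position {self.pos}")
        self.pos += 1

    def expression(self) -> None:
        # Expression ::= PrimaryExp (Operator PrimaryExp)*
        self.primary_exp()
        while self.peek() in OPERATORS:
            self.pos += 1
            self.primary_exp()

    def primary_exp(self) -> None:
        # PrimaryExp ::= Literal | Identifier | ( Expression )
        ch = self.peek()
        if is_digit(ch):
            # Literal ::= Digit Digit*
            while is_digit(self.peek()):
                self.pos += 1
        elif is_letter(ch):
            # Identifier ::= Letter (Letter|Digit)*
            self.pos += 1
            while is_letter(self.peek()) or is_digit(self.peek()):
                self.pos += 1
        elif ch == "(":
            self.pos += 1
            self.expression()
            self.accept(")")
        else:
            raise SyntaxError(f"unexpected input at position {self.pos}")


def recognizes(text: str) -> bool:
    """True if the whole of text is an Expression."""
    r = Recognizer(text)
    try:
        r.expression()
    except SyntaxError:
        return False
    return r.pos == len(text)


if __name__ == "__main__":
    for text in ["a+1", "(a+b2)*3", "a+", "(a", "1a"]:
        print(repr(text), recognizes(text))
```

The recursive call from `primary_exp` back to `expression` is where the self-embedding of the language shows up in the code.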

A little bit of useful theory

• We will now look at a few useful bits of theory. These will be necessary later when we implement parsers.
  – Grammar transformations
    • A grammar can be transformed in a number of ways without changing its meaning (i.e. its language, or the set of strings that it generates)
  – The definition and computation of starter sets (first sets), follow sets, and nullable symbols

Grammar Transformations

Left factorization

    X Y | X Z    ⇒    X ( Y | Z )

Example:

single-Command ::= V-name := Expression
                 | if Expression then single-Command
                 | if Expression then single-Command else single-Command

becomes

single-Command ::= V-name := Expression
                 | if Expression then single-Command ( ε | else single-Command )

Here X = if Expression then single-Command, Y = ε, and Z = else single-Command.

Grammar Transformations (continued)

Elimination of Left Recursion

    N ::= X | N Y    ⇒    N ::= X Y*

Example:

Identifier ::= Letter
             | Identifier Letter
             | Identifier Digit

Left factorization first gives

Identifier ::= Letter
             | Identifier (Letter|Digit)

and eliminating the left recursion then gives

Identifier ::= Letter (Letter|Digit)*
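A quick way to gain confidence that this transformation preserves the language is to enumerate both versions up to a small length and compare the results. The sketch below does that for the Identifier example; the two-letter, two-digit alphabet is an assumption made only to keep the enumeration small, and `left_recursive` and `iterative` are illustrative names.

```python
from itertools import product

# Assumption: a tiny alphabet so that the enumeration stays small.
LETTERS = "ab"
DIGITS = "01"


def left_recursive(max_len: int) -> set:
    """Strings of length <= max_len generated by
    Identifier ::= Letter | Identifier Letter | Identifier Digit."""
    language = set(LETTERS)          # Identifier ::= Letter
    frontier = set(LETTERS)
    while frontier:
        # Identifier ::= Identifier Letter | Identifier Digit
        frontier = {w + c
                    for w in frontier if len(w) < max_len
                    for c in LETTERS + DIGITS}
        language |= frontier
    return language


def iterative(max_len: int) -> set:
    """Strings of length <= max_len generated by
    Identifier ::= Letter (Letter|Digit)*."""
    return {first + "".join(rest)
            for n in range(max_len)
            for first in LETTERS
            for rest in product(LETTERS + DIGITS, repeat=n)}


if __name__ == "__main__":
    print(left_recursive(3) == iterative(3))
```

Agreement up to a bounded length is only evidence, not a proof, but it catches most slips made while transforming a grammar by hand.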

Grammar Transformations (continued)

Substitution of non-terminal symbols

    N ::= X            N ::= X
    M ::= α N β   ⇒    M ::= α X β

Example:

single-Command ::= for controlVar := Expression direction Expression do single-Command
direction      ::= to | downto

becomes

single-Command ::= for controlVar := Expression (to|downto) Expression do single-Command

Starter Sets (a.k.a. First Sets)

Informal Definition:
The starter set of a RE X is the set of terminal symbols that can occur as the start of any string generated by X.

Example:
starters[ (“+” | - | ε) (0 | 1 | … | 9)+ ] = {+, -, 0, 1, …, 9}

Formal Definition:
starters[ε]     = { }
starters[t]     = {t}                         (where t is any terminal symbol)
starters[X Y]   = starters[X]                 (if X doesn't generate ε)
starters[X Y]   = starters[X] ∪ starters[Y]   (if X generates ε)
starters[X | Y] = starters[X] ∪ starters[Y]
starters[X*]    = starters[X]
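The formal definition translates almost line for line into a recursive function over the structure of an RE. The sketch below is one possible encoding, not the course's code: the classes `Eps`, `Term`, `Seq`, `Alt`, and `Star` and the helper `generates_empty` are names introduced here, and “one or more digits” in the example is written as digit followed by digit*.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Union


@dataclass(frozen=True)
class Eps:            # the empty string
    pass


@dataclass(frozen=True)
class Term:           # a terminal symbol t
    symbol: str


@dataclass(frozen=True)
class Seq:            # X Y
    left: "RE"
    right: "RE"


@dataclass(frozen=True)
class Alt:            # X | Y
    left: "RE"
    right: "RE"


@dataclass(frozen=True)
class Star:           # X*
    body: "RE"


RE = Union[Eps, Term, Seq, Alt, Star]


def generates_empty(x: RE) -> bool:
    """True if x can generate the empty string."""
    if isinstance(x, (Eps, Star)):
        return True
    if isinstance(x, Term):
        return False
    if isinstance(x, Seq):
        return generates_empty(x.left) and generates_empty(x.right)
    return generates_empty(x.left) or generates_empty(x.right)  # Alt


def starters(x: RE) -> set:
    """Starter set of x, following the formal definition case by case."""
    if isinstance(x, Eps):
        return set()
    if isinstance(x, Term):
        return {x.symbol}
    if isinstance(x, Seq):
        if generates_empty(x.left):
            return starters(x.left) | starters(x.right)
        return starters(x.left)
    if isinstance(x, Alt):
        return starters(x.left) | starters(x.right)
    return starters(x.body)  # Star


if __name__ == "__main__":
    sign = Alt(Term("+"), Alt(Term("-"), Eps()))
    digit = reduce(Alt, [Term(d) for d in "0123456789"])
    number = Seq(sign, Seq(digit, Star(digit)))
    print(sorted(starters(number)))
```

Because the sign part can generate the empty string, the second Seq rule applies and the digits end up in the starter set alongside + and -.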

Derivations

• Each derivation step replaces one non-terminal with the right-hand side of one of its productions.

S ::= E
E ::= T | E + T
T ::= i | ( E )

S => E => E + T => T + T => i + T => i + i

• This is a left-most derivation (it replaces the left-most non-terminal at each step).
• Can you find the corresponding right-most derivation?
• Can you find a derivation that is neither left-most nor right-most?
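Derivations like the one above can be replayed mechanically, which makes it easy to try out answers to these questions. In the sketch below a sentential form is a list of symbols and a derivation is a list of chosen right-hand sides; `GRAMMAR`, `step`, and `derive` are illustrative names, and the flag `leftmost` selects which non-terminal gets replaced.

```python
GRAMMAR = {
    "S": [["E"]],
    "E": [["T"], ["E", "+", "T"]],
    "T": [["i"], ["(", "E", ")"]],
}


def step(form, rhs, leftmost=True):
    """Replace the left-most (or right-most) non-terminal of form by rhs."""
    positions = [k for k, sym in enumerate(form) if sym in GRAMMAR]
    if not positions:
        raise ValueError("no non-terminal left to replace")
    k = positions[0] if leftmost else positions[-1]
    if rhs not in GRAMMAR[form[k]]:
        raise ValueError(f"{rhs} is not an alternative of {form[k]}")
    return form[:k] + rhs + form[k + 1:]


def derive(choices, leftmost=True):
    """Apply the chosen right-hand sides in order, starting from S."""
    form = ["S"]
    forms = [form]
    for rhs in choices:
        form = step(form, rhs, leftmost)
        forms.append(form)
    return " => ".join(" ".join(f) for f in forms)


if __name__ == "__main__":
    # The left-most derivation of i + i shown above.
    print(derive([["E"], ["E", "+", "T"], ["T"], ["i"], ["i"]]))
    # The same productions, reordered for a right-most derivation.
    print(derive([["E"], ["E", "+", "T"], ["i"], ["T"], ["i"]], leftmost=False))
```

Both runs use the same five productions; only the order in which they are applied differs.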

Sentential forms

• A sentential form is a sequence of grammar symbols that can be derived from the start symbol.
• A sentence is a sentential form that contains only terminal symbols, that is, a string that can be generated using the grammar.

S => E => E + T => T + T => i + T => i + i

Every form in this derivation is a sentential form; only the last one, i + i, is a sentence.

Ambiguous grammars

A grammar is ambiguous if some sentence has more than one distinct parse tree.

Equivalently, a grammar is ambiguous if some sentence has more than one left-most derivation, or more than one right-most derivation.

S ::= E
E ::= i | ( E ) | E + E

Does i + i demonstrate the ambiguity? It has a single left-most derivation:

E => E + E => i + E => i + i

Does i + i + i demonstrate an ambiguity? It has two distinct left-most derivations:

E => E + E => i + E => i + E + E => i + i + E => i + i + i

E => E + E => E + E + E => i + E + E => i + i + E => i + i + i
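The parse-tree definition of ambiguity can be checked mechanically for short sentences. The sketch below counts the distinct parse trees that the grammar E ::= i | ( E ) | E + E assigns to a token sequence. `count_parse_trees` is an illustrative helper; since S ::= E adds no choice, counting from E is enough.

```python
from functools import lru_cache


def count_parse_trees(tokens) -> int:
    """Number of distinct parse trees for tokens under
    E ::= i | ( E ) | E + E."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo: int, hi: int) -> int:
        # Number of ways E can derive toks[lo:hi].
        n = 0
        if hi - lo == 1 and toks[lo] == "i":               # E ::= i
            n += 1
        if hi - lo >= 3 and toks[lo] == "(" and toks[hi - 1] == ")":
            n += count(lo + 1, hi - 1)                     # E ::= ( E )
        for k in range(lo + 1, hi - 1):                    # E ::= E + E
            if toks[k] == "+":
                n += count(lo, k) * count(k + 1, hi)
        return n

    return count(0, len(toks))


if __name__ == "__main__":
    for sentence in ["i + i", "i + i + i", "( i + i ) + i"]:
        print(sentence, "->", count_parse_trees(sentence.split()))
```

A count above 1 for any sentence is enough to show that the grammar is ambiguous; a count of 1 for one particular sentence shows nothing either way.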

Augmented grammars

We augment grammars to ensure that we can recognize and handle the end of the input string.

S ::= E
E ::= i | ( E ) | E + E

becomes

S’ ::= S $
S  ::= E
E  ::= i | ( E ) | E + E

Here $ denotes the end-of-file token.


Nullable, First sets (starter sets), and Follow sets

• A non-terminal is nullable if it derives the empty string

• First(N) or starters(N) is the set of all terminals that can begin a sentence derived from N

• Follow(N) is the set of terminals that can follow N in some sentential form

Next we will see algorithms to compute each of these.
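As a preview, one common way to compute all three is a fixed-point iteration: start with empty sets and keep applying the defining rules until nothing changes. The sketch below follows that idea and is not necessarily the formulation the course presents. The example grammar is a hypothetical variant of the earlier expression grammar, chosen so that one non-terminal (Q) is nullable; an empty list stands for an ε alternative, and `analyse` and `first_of` are illustrative names.

```python
def analyse(grammar):
    """Return (nullable, first, follow) for a grammar given as
    {non-terminal: [alternative, ...]}, each alternative a list of symbols."""
    nonterminals = set(grammar)
    nullable = set()
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}

    def first_of(symbols):
        """First set of a symbol sequence, and whether it can derive empty."""
        result = set()
        for sym in symbols:
            if sym not in nonterminals:      # a terminal starts the string
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True

    changed = True
    while changed:                           # iterate until a fixed point
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                begin, empty = first_of(rhs)
                if empty and lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
                if not begin <= first[lhs]:
                    first[lhs] |= begin
                    changed = True
                for k, sym in enumerate(rhs):
                    if sym in nonterminals:
                        after, tail_empty = first_of(rhs[k + 1:])
                        new = after | (follow[lhs] if tail_empty else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    return nullable, first, follow


# Hypothetical example grammar (augmented with $, as on the previous slide).
EXAMPLE = {
    "S'": [["S", "$"]],
    "S": [["E"]],
    "E": [["T", "Q"]],
    "Q": [["+", "T", "Q"], []],              # [] is the empty alternative
    "T": [["i"], ["(", "E", ")"]],
}

if __name__ == "__main__":
    nullable, first, follow = analyse(EXAMPLE)
    print("nullable:", sorted(nullable))
    for n in EXAMPLE:
        print(n, "first:", sorted(first[n]), "follow:", sorted(follow[n]))
```

The loop terminates because every set only grows and there are finitely many terminals to add.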