284
Lexical Analysis

Lexical Analysis - Philadelphia

  • Upload
    others

  • View
    30

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lexical Analysis - Philadelphia

Lexical Analysis

Page 2: Lexical Analysis - Philadelphia

Where We Are

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

Source Code

Machine

Code

Page 3: Lexical Analysis - Philadelphia

++ip;

while (ip < z)

Page 4: Lexical Analysis - Philadelphia

do[for] = new 0;

Page 5: Lexical Analysis - Philadelphia

T_Do [ T_For T_New T_IntConst

0

d o [ f o r ] = n e w 0 ;

] =

do[for] = new 0;

Page 6: Lexical Analysis - Philadelphia

Scanning a Source File

( i < z ) \ \ + i p ;p + + +w h i l e ( 1 3 7 < i ) \n \t + + i ;

( i < z ) \ \ + i p ;p + + +w h i l e ( 1 3 7 < i ) \n \t + + i ;

( i < z ) \ \ + i p ;p + + +w h i l e ( 1 3 7 < i ) \n \t + + i ;

( i < z ) \ \ + i p ;p + + +w h i l e ( 1 3 7 < i ) \n \t + + i ;

( i < z ) \ \ + i p ;p + + +w h i l e ( 1 3 7 < i ) \n \t + + i ;

( i < z ) \ \ + i p ;p + + +w h i l e ( 1 3 7 < i ) \n \t + + i ;

Page 7: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Page 8: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

This is called a token . You can

think o f i t as an enumerated type

r ep resent ing what logical ent i t y we

read ou t o f the source code .

T he piece of the or ig inal p rog r am

f r o m which we made the token is

called a lexeme .

Page 9: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Page 10: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Page 11: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Sometimes we will discard a lexeme

r at her t han sto r ing it f o r lat er use.

Her e, we ig nor e whit espace, since it

has no bearing on the meaning o f

the program.

Page 12: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Page 13: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Page 14: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While

Page 15: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While (

Page 16: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While (

Page 17: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While (

Page 18: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While (

Page 19: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While (

Page 20: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While (

Page 21: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While ( T_IntConst

137

Page 22: Lexical Analysis - Philadelphia

Scanning a Source File

+w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While ( T_IntConst

137

Some tokens can have

at t r ibutes that s tore

ext r a in for mat ion about

the token . Here we

s tore which in teger is

represented.

Page 23: Lexical Analysis - Philadelphia

Goals of Lexical Analysis

● Convert from physical description of a program into sequence of of tokens .●

E a c h token represents one logical piece of the source file – a keyword, the name of a variable, etc .

E a c h token is associated with a lexeme .

● The actual text of the token: “137 ,” “ int ,” etc.

● E a c h token may have optional attributes .

Extra information derived from the text – perhaps a numeric value.

The token sequence will be used in the parser to recover the program structure.

Page 24: Lexical Analysis - Philadelphia

Choosing Tokens

Page 25: Lexical Analysis - Philadelphia

What Tokens are Useful Her e?

for (int k = 0; k < myArray[5]; ++k) {

cout << k << endl;

}

Page 26: Lexical Analysis - Philadelphia

What Tokens are Useful Her e?

for (int k = 0; k < myArray[5]; ++k) {

cout << k << endl;

}

for

int

<<

=

(

)

{

}

;

<

[

]+

+

Page 27: Lexical Analysis - Philadelphia

What Tokens are Useful Her e?

for (int k = 0; k < myArray[5]; ++k) {

cout << k << endl;

}

for

int

<<

=

(

)

{

}

;

<

[

]++

Identifier

IntegerConstant

Page 28: Lexical Analysis - Philadelphia

Choosing Good Tokens

Very much dependent on the language.

Typically:

Give keywords their own tokens.

Give different punctuation symbols their own tokens.

Group lexemes representing identifiers, numeric constants, strings, etc . into their own groups.

Discard irrelevant information (whitespace, comments)

Page 29: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● FO RT R A N : Whitespace is irrelevant

DO 5 I = 1,25

DO 5 I = 1.25

Page 30: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● FO RT R A N : Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

Page 31: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● FO RT R A N : Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

● C a n be difficult to tell when to partition input.

Page 32: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● C + + : Nested template declarations

vector<vector<int>> myVector

Page 33: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● C + + : Nested template declarations

vector < vector < int >> myVector

Page 34: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● C + + : Nested template declarations

(vector < (vector < (int >> myVector)))

Page 35: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● C + + : Nested template declarations

(vector < (vector < (int >> myVector)))

● Again, can be difficult to determine where to split .

Page 36: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

Page 37: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

Page 38: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

Page 39: Lexical Analysis - Philadelphia

Thanks to Prof. AlexAiken

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

● C a n be difficult to determine how to label lexemes.

Page 40: Lexical Analysis - Philadelphia

Chal lenges in Scanning

Ho w do we determine which lexemes are associated with each token?

When there are multiple ways we could scan the input, how do we know which one to pick?

Ho w do we address these concerns efficiently?

Page 41: Lexical Analysis - Philadelphia

Associating Lexemes with Tokens

Page 42: Lexical Analysis - Philadelphia

Lexemes and Tokens

Tokens give a way to categorize lexemes by what information they provide.

S o m e tokens might be associated with only a single lexeme:

Tokens for keywords like if and whileprobably only match those lexemes exactly.

S o m e tokens might be associated with lots of different lexemes:

● All variable names, all possible numbers, all possible strings, etc .

Page 43: Lexical Analysis - Philadelphia

Sets of Lexemes

Idea: Associate a set of lexemes with each token.

We might associate the “number” token with the set { 0, 1, 2, … , 10, 11, 12, … }

We might associate the “string” token with the set { "", "a", "b", "c", … }

We might associate the token for the keyword while with the set { while } .

Page 44: Lexical Analysis - Philadelphia

Ho w do we describe which (potentially infinite) set of lexemes is associated with

each token type?

Page 45: Lexical Analysis - Philadelphia

Formal Languages

A formal language is a set of strings.

M a ny infinite languages have finite descriptions:

Define the language using an automaton.

Define the language using a grammar.

Define the language using a regular expression.

We can use these compact descriptions of the language to define sets of strings.

Over the course of this class, we will use all of these approaches.

Page 46: Lexical Analysis - Philadelphia

Regular Expressions

Regu lar expressions are a family of descriptions that can be used to capture certain languages (the regular languages).

Often provide a compact and human-readable description of the language.

U s e d as the basis for numerous software systems, including the flex tool we will

use in this course.

Page 47: Lexical Analysis - Philadelphia

Atomic Regular Expressions

The regular expressions we will use in this course begin with two simple building blocks.

The symbol ε is a regular expression matches the empty string.

For any symbol a, the symbol a is a re g ular expression that just m atc hes a.

Page 48: Lexical Analysis - Philadelphia

Compound Regular Expressions

If R 1 and R 2 are regular expressions, R 1 R 2 is a regular

expression represents the concatenation of the

languages of R 1 and R 2 .

If R 1 and R 2 are regular expressions, R 1 |R 2 is a regular

expression representing the union of R 1 and R 2 .

If R is a regular expression, R * is a regular expression for the Kleene closure of R .

If R is a regular expression, (R ) is a regular expression with the same meaning as R .

Page 49: Lexical Analysis - Philadelphia

Operator Precedence

Regular expression operator precedence is

(R)

R*

R 1R 2

R1 | R2

S o ab*c|d is parsed as ((a(b*))c)|d

Page 50: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings containing00 as a substring:

(0 | 1)*00(0 | 1)*

Page 51: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings containing00 as a substring:

(0 | 1)*00(0 | 1)*

Page 52: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings containing00 as a substring:

(0 | 1)*00(0 | 1)*

110111001010000

11111011110011111

Page 53: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings containing00 as a substring:

(0 | 1)*00(0 | 1)*

110111001010000

11111011110011111

Page 54: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

Page 55: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

Page 56: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

Page 57: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

0000101011111000

Page 58: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

0000101011111000

Page 59: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

(0|1){4}

0000101011111000

Page 60: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings of length exactly four:

(0|1){4}

0000101011111000

Page 61: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings that contain at most one zero:

Page 62: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

Page 63: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

Page 64: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

111101111111110111

0

Page 65: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

111101111111110111

0

Page 66: Lexical Analysis - Philadelphia

Simple Regular Expressions

Suppose the only characters are 0 and 1.

H e re is a regular expression for strings that contain at most one zero:

1*0?1*

111101111111110111

0

Page 67: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

aa* (.aa*)* @ aa*.aa* (.aa*)*

Page 68: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

aa* (.aa*)* @ aa*.aa* (.aa*)*

[email protected] [email protected]

[email protected]

Page 69: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

aa* (.aa*)* @ aa*.aa* (.aa*)*

[email protected] [email protected]

[email protected]

Page 70: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

aa* (.aa*)* @ aa*.aa* (.aa*)*

[email protected] [email protected]

[email protected]

Page 71: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

aa* (.aa*)* @ aa*.aa* (.aa*)*

[email protected]

[email protected]@whitehouse.gov

Page 72: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

a+ (.aa*)* @ aa*.aa* (.aa*)*

[email protected] [email protected]

[email protected]

Page 73: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

a+ (.a+)* @ a+.a+ (.a+)*

[email protected] [email protected]

[email protected]

Page 74: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

a+ (.a+)* @ a+.a+ (.a+)*

[email protected] [email protected]

[email protected]

Page 75: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

a+ (.a+)* @ a+ (.a+)+

[email protected] [email protected]

[email protected]

Page 76: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose our alphabet is a, @ , and ., where a

represents “some letter.”

A regular expression for email addresses is

a+(.a+)*@a+(.a+)+

[email protected] [email protected]

[email protected]

Page 77: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose that our alphabet is all AS C I I characters.

A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

Page 78: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose that our alphabet is all AS C I I characters.

A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

42+1370-3248

-9999912

Page 79: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose that our alphabet is all AS C I I characters.

A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

42+1370-3248

-9999912

Page 80: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose that our alphabet is all AS C I I characters.

A regular expression for even numbers is

(+|-)?[0123456789]*[02468]

42+1370-3248

-9999912

Page 81: Lexical Analysis - Philadelphia

Applied Regular Expressions

Suppose that our alphabet is all AS C I I characters.

A regular expression for even numbers is

(+|-)?[0-9]*[02468]

42+1370-3248

-9999912

Page 82: Lexical Analysis - Philadelphia

Matching Regular Expressions

Page 83: Lexical Analysis - Philadelphia

Implementing Regular Expressions

Regular expressions can be implemented using finite automata .

There are two main kinds of finite automata:

NFA s (nondeterministic finite automata), which we'll see in a second, and

DFAs (deterministic finite automata), which we'll see later.

● Automata are best explained by example. . .

Page 84: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

Page 85: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

Each circle is a s ta te o f the

auto mato n. T he auto mato n' s

conf igurat ion is determined

by what s t a t e ( s ) i t is in .

A Simple Automaton

Page 86: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

These arrows are called

t ransit ions . The automaton

chang es which sta t e ( s) it is in

by fol lowing transi t ions.

A Simple Automaton

Page 87: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 88: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

T he aut omat on ta kes a st r ing

as input and decides whether

t o accept o r r e j e c t the s t r i ng .

Page 89: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 90: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 91: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 92: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 93: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 94: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 95: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 96: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 97: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 98: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 99: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

T he d oub le cir cle ind icat e s that th is

s tate is an accepting s t a t e . The

auto mato n accept s the st r ing if it

ends in an accepting s ta te .

Page 100: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 101: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 102: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 103: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 104: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 105: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 106: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

T her e is no transit ion on "

here, so the automaton

d ies and r ej ects.

Page 107: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

T her e is no transit ion on "

here, so the automaton

d ies and r ej ects.

Page 108: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 109: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 110: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 111: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 112: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 113: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 114: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 115: Lexical Analysis - Philadelphia

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

T his is not an accept ing

sta t e , so the auto mato n

r e j ec t s .

Page 116: Lexical Analysis - Philadelphia

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 117: Lexical Analysis - Philadelphia

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Not ice t hat t her e ar e mul t iple t r ansit ions

def i ned her e on 0 and 1. I f we r ead a

0 or 1 her e, we f o llow b o t h t r ansit ions

and enter multiple s ta tes.

Page 118: Lexical Analysis - Philadelphia

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 119: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 120: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 121: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 122: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 123: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 124: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 125: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 126: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 127: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 128: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 129: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 130: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 131: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 132: Lexical Analysis - Philadelphia

0 1 1 1 0 1

A Mo r e Complex Automaton

01 0

10 1

start0, 1

Page 133: Lexical Analysis - Philadelphia

A Mo r e Complex Automaton

01 0

10 1

start0, 1

0 1 1 1 0 1Since we are in a t least

one accep t ing sta t e , the

automaton accepts.

Page 134: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 135: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

These are called ε - t rans i t ions . These

t r ansit ions ar e f ollo wed aut omat i cally and

without consuming any inpu t .

Page 136: Lexical Analysis - Philadelphia

b c b a

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 137: Lexical Analysis - Philadelphia

b c b a

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 138: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b c b a

b, c

start

ε

ε

ε

a, b

c

b

a

Page 139: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b c b a

b, c

start

ε

ε

ε

a, b

c

b

a

Page 140: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b c b a

b, c

start

ε

ε

ε

a, b

c

b

a

Page 141: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b c b a

b, c

start

ε

ε

ε

a, b

c

b

a

Page 142: Lexical Analysis - Philadelphia

An Even More Complex Automaton

a, c

b c b a

b, c

start

ε

ε

ε

a, b

c

b

a

Page 143: Lexical Analysis - Philadelphia

b c b a

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 144: Lexical Analysis - Philadelphia

b c b a

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 145: Lexical Analysis - Philadelphia

b c b a

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 146: Lexical Analysis - Philadelphia

b c b a

An Even More Complex Automaton

a, c

b, c

start

ε

ε

ε

a, b

c

b

a

Page 147: Lexical Analysis - Philadelphia

Simulating an N F A

Keep track of a set of states, initially the start state and everything reachable by ε-moves.

For each character in the input:

● Maintain a set of next states, initially empty.

For each current state:

– Follow all transitions labeled with the current letter.

– Add these states to the set of new states.

Add every state reachable by an ε-move to the set of next states.

● Complexity: O(mn2) for strings of length m and automata with n states.

Page 148: Lexical Analysis - Philadelphia

From Regular Expressions to NFAs

There is a (beautiful!) procedure from converting a regular expression to an N FA .

Associate each regular expression with an N FA with the following properties:

There is exactly one accepting state.

There are no transitions out of the accepting state.

There are no transitions into the starting state.

● These restrictions are stronger than necessary, but make the construction easier.

start

Page 149: Lexical Analysis - Philadelphia

Base C a s e s

εstart

Automaton for ε

astart

Automaton for single character a

Page 150: Lexical Analysis - Philadelphia

Construction for R 1 R 2

Page 151: Lexical Analysis - Philadelphia

R1

Construction for R 1 R 2

R2

start start

Page 152: Lexical Analysis - Philadelphia

Construction for R 1 R 2

R1

R2

start

Page 153: Lexical Analysis - Philadelphia

R1

Construction for R 1 R 2

R2

start

ε

Page 154: Lexical Analysis - Philadelphia

R1

Construction for R 1 R 2

R2

start

ε

Page 155: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

Page 156: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

start

R1

start

R2

Page 157: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

start

R1

start

start

R2

Page 158: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

R1start

ε

ε

R2

Page 159: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

R1start

ε

ε

R2

Page 160: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

R1start

ε

ε

ε

ε

R2

Page 161: Lexical Analysis - Philadelphia

Construction for R 1 | R 2

R1start

ε

ε

ε

ε

R2

Page 162: Lexical Analysis - Philadelphia

Construction for R*

Page 163: Lexical Analysis - Philadelphia

Construction for R*

R

start

Page 164: Lexical Analysis - Philadelphia

Construction for R*

R

startstart

Page 165: Lexical Analysis - Philadelphia

Construction for R*

R

startstart

ε

Page 166: Lexical Analysis - Philadelphia

Construction for R*

R

start

ε

ε ε

Page 167: Lexical Analysis - Philadelphia

Construction for R*

R

start

ε

ε ε

ε

Page 168: Lexical Analysis - Philadelphia

Construction for R*

R

start

ε

ε ε

ε

Page 169: Lexical Analysis - Philadelphia

Overall Result

Any regular expression of length n can be converted into an N FA with O(n) states.

C a n determine whether a string of lengthm matches a regular expression of lengthn in time O(mn2).

We'll see how to make this O(m) later (this is independent of the complexity of the regular expression!)

Page 170: Lexical Analysis - Philadelphia

Chal lenges in Scanning

Ho w do we determine which lexemes are associated with each token?

When there are multiple ways we could scan the input, how do we know which one to pick?

Ho w do we address these concerns efficiently?

Page 171: Lexical Analysis - Philadelphia

Chal lenges in Scanning

Ho w do we determine which lexemes are associated with each token?

When there are multiple ways we could scan the input, how do we know which one to pick?

Ho w do we address these concerns efficiently?

Page 172: Lexical Analysis - Philadelphia

Lexing Ambiguities

T_For

T_Identifier

for

[A-Za-z_][A-Za-z0-9_]*

Page 173: Lexical Analysis - Philadelphia

Lexing Ambiguities

T_For

T_Identifier

for

[A-Za-z_][A-Za-z0-9_]*

f o r t

Page 174: Lexical Analysis - Philadelphia

Lexing Ambiguities

T_For

T_Identifier

for

[A-Za-z_][A-Za-z0-9_]*

f o r t

f o r t

f o r t

f o r t

f o r t

f o r t

f

f

f

f

o r t

o r t

o r t

o r t

Page 175: Lexical Analysis - Philadelphia

Conflict Resolution

Assume all tokens are specified as regular expressions.

Algorithm: Left-to-right scan .

Tiebreaking rule one: Max ima l munch .

● Always match the longest possible prefix of the remaining text .

Page 176: Lexical Analysis - Philadelphia

Lexing Ambiguities

T_For

T_Identifier

for

[A-Za-z_][A-Za-z0-9_]*

f o r t

f o r t

f o r t

f o r t

f o r t

f o r t

f

f

f

f

o r t

o r t

o r t

o r t

Page 177: Lexical Analysis - Philadelphia

Lexing Ambiguities

T_For

T_Identifier

for

[A-Za-z_][A-Za-z0-9_]*

f o r t

f o r t

Page 178: Lexical Analysis - Philadelphia

Implementing Maximal M u n c h

Given a set of regular expressions, how can we use them to implement maximum munch?

Idea:

Convert expressions to N FAs .

Run all NFAs in parallel, keeping track of the last match.

When all automata get stuck, report the last match and restart the search at that point.

Page 179: Lexical Analysis - Philadelphia

T_Do

T_Double

T_Mystery

do

double

[A-Za-z]

Implementing Maximal M u n c h

Page 180: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 181: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 182: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 183: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 184: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 185: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 186: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 187: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 188: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 189: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 190: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 191: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 192: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 193: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 194: Lexical Analysis - Philadelphia

D O U B D O U B L E

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

Page 195: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 196: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 197: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 198: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 199: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 200: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 201: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

U B D O U B L ED O

Page 202: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

B D O U B L ED O U

Page 203: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

B D O U B L ED O U

Page 204: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

B D O U B L ED O U

Page 205: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

B D O U B L ED O U

Page 206: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

B D O U B L ED O U

Page 207: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

B D O U B L ED O U

Page 208: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 209: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 210: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 211: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 212: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 213: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 214: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 215: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 216: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 217: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 218: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 219: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 220: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 221: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 222: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 223: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 224: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 225: Lexical Analysis - Philadelphia

do

double

[A-Za-z]

Implementing Maximal M u n c h

o

o u b l e

T_Do

T_Double

T_Mystery

start d

start d

start Σ

D O U B L ED O U B

Page 226: Lexical Analysis - Philadelphia

A Minor Simplification

Page 227: Lexical Analysis - Philadelphia

A Minor Simplification

start d o

start d o u b l e

start Σ

Page 228: Lexical Analysis - Philadelphia

A Minor Simplification

o

o u b l e

start

start

start

d

d

Σ

Page 229: Lexical Analysis - Philadelphia

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Page 230: Lexical Analysis - Philadelphia

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Build a single automaton

t hat r uns all t he mat ching

automata in paral lel .

Page 231: Lexical Analysis - Philadelphia

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Page 232: Lexical Analysis - Philadelphia

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Annotate each accepting

st at e wi t h which aut omat on

i t came f r o m .

Page 233: Lexical Analysis - Philadelphia

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

Page 234: Lexical Analysis - Philadelphia

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

Page 235: Lexical Analysis - Philadelphia

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

d o u b l e

d o u b l e

Page 236: Lexical Analysis - Philadelphia

Mo r e Tiebreaking

When two regular expressions apply, choose the one with the greater “priority.”

Simple priority system: pick the rule

that was defined first.

Page 237: Lexical Analysis - Philadelphia

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

d o u b l e

d o u b l e

Page 238: Lexical Analysis - Philadelphia

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

d o u b l e

Page 239: Lexical Analysis - Philadelphia

Other Conflicts

T_Do

T_Double

do

double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

d o u b l eWhy isn ' t

this a

problem?

Page 240: Lexical Analysis - Philadelphia

O n e Last Detail . . .

We know what to do if multiple rules match.

What if nothing matches?

Trick: Add a “catch-all” rule that matches any character and reports an error.

Page 241: Lexical Analysis - Philadelphia

Summary of Conflict Resolution

Construct an automaton for each regular expression.

M e r g e them into one automaton by adding a new start state.

S c a n the input, keeping track of the last known match.

Break ties by choosing higher-precedence matches.

Have a catch-all rule to handle errors.

Page 242: Lexical Analysis - Philadelphia

Chal lenges in Scanning

Ho w do we determine which lexemes are associated with each token?

When there are multiple ways we could scan the input, how do we know which one to pick?

Ho w do we address these concerns efficiently?

Page 243: Lexical Analysis - Philadelphia

Chal lenges in Scanning

Ho w do we determine which lexemes are associated with each token?

When there are multiple ways we could scan the input, how do we know which one to pick?

Ho w do we address these concerns efficiently?

Page 244: Lexical Analysis - Philadelphia

D FAs

The automata we've seen so far have all been N FAs .

A DFA is like an N F A , but with tighter restrictions:

Every state must have exactly one

transition defined for every letter.

ε-moves are not allowed.

Page 245: Lexical Analysis - Philadelphia

A Sample DFA

Page 246: Lexical Analysis - Philadelphia

A Sample DFA

start

0 0

1

1

0 0

1

1

Page 247: Lexical Analysis - Philadelphia

A Sample DFA

D

start

0 0

1

1

0 0

1

A

C

B1

Page 248: Lexical Analysis - Philadelphia

A Sample DFA

D

start

0 0

1

1

0 0

1

A

C

B1

A. C B

B. D A

C. A D

D. B C

0 1

Page 249: Lexical Analysis - Philadelphia

C o d e for DFAs

int kTransitionTable[kNumStates][kNumSymbols] = {{0, 0, 1, 3, 7, 1, …},…

};bool kAcceptTable[kNumStates] = {

false, true, true,…

};bool simulateDFA(string input) {

int state = 0;for (char ch: input)

state = kTransitionTable[state][ch];return kAcceptTable[state];

}

Page 250: Lexical Analysis - Philadelphia

C o d e for DFAs

int kTransitionTable[kNumStates][kNumSymbols] = {{0, 0, 1, 3, 7, 1, …},…

};bool kAcceptTable[kNumStates] = {

false, true, true,…

};bool simulateDFA(string input) {

int state = 0;for (char ch: input)

state = kTransitionTable[state][ch];return kAcceptTable[state];

}

Runs in t ime O(m )

on a str ing o f

length m .

Page 251: Lexical Analysis - Philadelphia

Speeding up Matching

In the worst-case, an N FA with n states takes time O(mn2) to match a string of length m .

DFAs, on the other hand, take only O(m).

There is another (beautiful!) algorithm to convert N FAs to DFAs.

Lexical Specification

Regular Expressions

NFA DFATable-Driven

DFA

Page 252: Lexical Analysis - Philadelphia

Subset Construction

N FA s can be in many states at once, while DFAs can only be in a single state at a time.

Key idea: M a k e the DFA s imulate the NFA .

H ave the states of the DFA correspond to the sets of states of the N FA .

Transitions between states of DFA correspond to transitions between sets of states in the N FA .

Page 253: Lexical Analysis - Philadelphia

From N F A to DFA

Page 254: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

Page 255: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

Page 256: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start

Page 257: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start

Page 258: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Page 259: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Page 260: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Page 261: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Page 262: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

12

Page 263: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

12

Page 264: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

12

Page 265: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

12

Page 266: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

o3, 6

12

Page 267: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

o3, 6

12

Page 268: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

o3, 6 7 8 9 10

u b l e

12

Page 269: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

o3, 6 7 8 9 10

u b l e

Σ – oΣ – u Σ – b Σ – l Σ – e

Σ

Σ12

Σ

Page 270: Lexical Analysis - Philadelphia

From N F A to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

1211 Σ

0

ε

ε

εstart

0, 1, 4, 11start 2, 5, 12d

Σ – d

o3, 6 7 8 9 10

u b l e

Σ – oΣ – u Σ – b Σ – l Σ – e

Σ

Σ12

Σ

Page 271: Lexical Analysis - Philadelphia

Modified Subset Construction

Instead of marking whether a state is accepting, remember which token type it matches.

Break ties with priorities.

When using DFA as a scanner, consider the DFA “stuck” if it enters the state corresponding to the empty set .

Page 272: Lexical Analysis - Philadelphia

Performance Concerns

The NFA-to-DFA construction can introduce exponentially many states.

Time/memory tradeoff:

Low-memory N FA has higher scan time.

High-memory DFA has lower scan time.

● Could use a hybrid approach by simplifying N FA before generating code.

Page 273: Lexical Analysis - Philadelphia

Real-World Scanning: Python

Page 274: Lexical Analysis - Philadelphia

+

while (ip < z)

++ip;

w h i l e ( i p < z ) \n \t + + i p ;

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

While

++<

Ident

ip

Ident

z

Ident

ip

Page 275: Lexical Analysis - Philadelphia

Python Blocks

● Scoping handled by whitespace:

if w == z:

a = b

c = d

else:

e = f

g = h

● What does that mean for the scanner?

Page 276: Lexical Analysis - Philadelphia

Whitespace Tokens

Special tokens inserted to indicate changes in levels of indentation.

N E W L I N E marks the end of a line.

I N D E N T indicates an increase in indentation.

D E D E N T indicates a decrease in indentation.

N ote that I N D E N T and D E D E N T encodechange in indentation, not the total amount ofindentation.

Page 277: Lexical Analysis - Philadelphia

if w == z:

a = b

c = d

else:

e = f

g = h

Scanning Python

Page 278: Lexical Analysis - Philadelphia

if w == z:

a = b

c = d

else:

e = f

g = h

if ident

w

== ident

z

: NEWLINE

INDENT

a b

ident = ident NEWLINE

c d

ident = ident NEWLINE

DEDENT else :

INDENT

e f

ident = ident NEWLINE

Scanning Python

NEWLINE

DEDENT ident

g

= ident

h

NEWLINE

Page 279: Lexical Analysis - Philadelphia

if w == z: {

a = b;

c = d;

} else {

e = f;

}

g = h;

if ident

w

== ident

z

: NEWLINE

INDENT ident

a

ident

c

= ident NEWLINE

b

= ident

d

NEWLINE

DEDENT else :

INDENT

e f

ident = ident NEWLINE

Scanning Python

NEWLINE

DEDENT ident

g

= ident

h

NEWLINE

Page 280: Lexical Analysis - Philadelphia

if w == z: {

a = b;

c = d;

} else {

e = f;

}

g = h;

if ident

w

== ident

z

:

{

a

;

ident

c

ident = ident

b

= ident

d

;

} else :

e

ident = ident

f

;

Scanning Python

{

} ident

g

= ident

h

;

Page 281: Lexical Analysis - Philadelphia

Where to I N D E N T / D E D E N T ?

Scanner maintains a stack of line indentations keeping track of all indented contexts so far.

Initially, this stack contains 0, since initially the contents of the file aren't indented.

O n a newline:

S e e how m u c h whitespace is at the start of the line.

If this value exceeds the top of the stack:

Push the value onto the stack.

Emit an I N D E N T token.

● Otherwise, while the value is less than the top of the stack:

Pop the stack.

Emit a D E D E N T token.

Source: http://docs.python.org/reference/lexical_analysis.html

Page 282: Lexical Analysis - Philadelphia

Interesting Observation

Normally, more text on a line translates into more tokens.

With D E D E N T , less text on a line often means more tokens:

if cond1:

if cond2:

if cond3:

if cond4:

if cond5:

statement1

statement2

Page 283: Lexical Analysis - Philadelphia

S umm ary

Lexical analysis splits input text into tokens

holding a lexeme and an attribute .

Lexemes are sets of strings often defined with regular expressions .

Regular expressions can be converted toNFAs and from there to DFAs .

Max ima l -munch using an automaton allows for fast scanning.

N ot all tokens come directly from the source code.

Page 284: Lexical Analysis - Philadelphia

Next Time

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

Source Code

Machine

Code

( Plus a li t t le b i t her e )