Lexical Analysis — Part II From Regular Expression to Scanner Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. COMP 412 FALL 2010


Comp 412, Fall 2010 2

Quick Review

Last class:
- The scanner is the first stage in the front end
- Specifications can be expressed using regular expressions
- Build tables and code from a DFA

[Figure: specifications, written as “regular expressions”, feed a Scanner Generator at design time; the generator produces tables or code for the Scanner. At compile time the Scanner reads source code and produces parts of speech & words. Words are represented as indices into a global table.]

Quick Review of Regular Expressions

• All strings of 1s and 0s ending in a 1:

    ( 0 | 1 )* 1

• All strings over lowercase letters where the vowels (a, e, i, o, & u) occur exactly once, in ascending order:

    Let Cons be (b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z)
    Cons* a Cons* e Cons* i Cons* o Cons* u Cons*

• All strings of 1s and 0s that do not contain three 0s in a row:

    ( 1* ( ε | 01 | 001 ) 1* )* ( ε | 0 | 00 )
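The three review expressions can be checked mechanically. The sketch below transcribes them into Python's `re` syntax; the names `ENDS_IN_ONE`, `VOWELS_IN_ORDER`, `NO_THREE_ZEROS`, and `matches` are illustrative and not part of the course material, and an empty alternative stands in for ε.

```python
import re
from itertools import product

# The three REs from the slide, transcribed into Python's re syntax.
# An empty alternative, as in (?:|01|001), plays the role of ε.
ENDS_IN_ONE = re.compile(r"(?:0|1)*1")
CONS = "[b-df-hj-np-tv-z]"  # every lowercase letter except a, e, i, o, u
VOWELS_IN_ORDER = re.compile(f"{CONS}*a{CONS}*e{CONS}*i{CONS}*o{CONS}*u{CONS}*")
NO_THREE_ZEROS = re.compile(r"(?:1*(?:|01|001)1*)*(?:|0|00)")


def matches(pattern, text):
    """True when the whole string is in the language of the pattern."""
    return pattern.fullmatch(text) is not None


def no_three_zeros_agrees(max_len=8):
    """Compare the RE with the plain-English definition on all short bit strings."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if matches(NO_THREE_ZEROS, s) != ("000" not in s):
                return False
    return True
```

The brute-force comparison in `no_three_zeros_agrees` is only a sanity check on short strings, not a proof that the RE is correct.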

Example (from Lab 1)

Consider the problem of recognizing ILOC register names

    Register → r (0|1|2| … |9) (0|1|2| … |9)*

• Allows registers of arbitrary number
• Requires at least one digit

The RE corresponds to a recognizer (or DFA)

[Figure: Recognizer for Register. S0 goes to S1 on r; S1 goes to S2 on (0|1|2| … |9); S2 loops to itself on (0|1|2| … |9). S2 is the accepting state.]

Transitions on other inputs go to an error state, se

Example (continued)

DFA operation
• Start in state S0 & make transitions on each input character
• The DFA accepts a word x iff x leaves it in a final state (S2)

So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se

Example (continued)

To be useful, the recognizer must be converted into code

Skeleton recognizer:

    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        State ← δ(State, Char)
        Char ← next character
    if (State is a final state)
        then report success
        else report failure

Table encoding the RE:

    δ     r     0,1,2,3,4,5,6,7,8,9   All others
    s0    s1    se                    se
    s1    se    s2                    se
    s2    se    s2                    se
    se    se    se                    se

O(1) cost per character (or per transition)
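A minimal Python sketch of the table-driven skeleton recognizer for Register → r Digit Digit*. The names `DELTA`, `FINAL`, `char_class`, and `recognize` are illustrative; the three character classes mirror the three columns of the table, and the end of the string stands in for EOF.

```python
# Transition table δ for Register → r Digit Digit*, one row per state.
# Columns are character classes: "r", "digit", and "other".
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}


def char_class(ch):
    """Map a single input character to its column in the table."""
    if ch == "r":
        return "r"
    if ch in "0123456789":
        return "digit"
    return "other"


def recognize(word):
    """Run the skeleton recognizer: one table lookup per input character."""
    state = "s0"
    for ch in word:
        state = DELTA[state][char_class(ch)]
    return state in FINAL
```

Each character costs one classification and one dictionary lookup, which is the O(1) cost per transition that the slide refers to.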

Example (continued)

We can add “actions” to each transition

Skeleton recognizer:

    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        Next ← δ(State, Char)
        Act ← α(State, Char)
        perform action Act
        State ← Next
        Char ← next character
    if (State is a final state)
        then report success
        else report failure

Table encoding the RE (each entry is next state / action):

    δ, α   r            0,1,2,3,4,5,6,7,8,9   All others
    s0     s1 / start   se / error            se / error
    s1     se / error   s2 / add              se / error
    s2     se / error   s2 / add              se / error
    se     se / error   se / error            se / error

Typical action is to capture the lexeme
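The recognizer with actions can be sketched the same way. In the Python below, each table entry pairs the next state with an action name; the behaviour chosen for "start" and "add" (begin and extend the captured lexeme) is one assumed interpretation of those labels, and `TABLE`, `char_class`, and `scan_register` are illustrative names.

```python
# Each entry pairs the next state with the action performed on that transition.
TABLE = {
    "s0": {"r": ("s1", "start"), "digit": ("se", "error"), "other": ("se", "error")},
    "s1": {"r": ("se", "error"), "digit": ("s2", "add"), "other": ("se", "error")},
    "s2": {"r": ("se", "error"), "digit": ("s2", "add"), "other": ("se", "error")},
    "se": {"r": ("se", "error"), "digit": ("se", "error"), "other": ("se", "error")},
}
FINAL = {"s2"}


def char_class(ch):
    """Map a single input character to its column in the table."""
    if ch == "r":
        return "r"
    if ch in "0123456789":
        return "digit"
    return "other"


def scan_register(word):
    """Return the captured lexeme if word is a register name, else None."""
    state, lexeme = "s0", ""
    for ch in word:
        nxt, act = TABLE[state][char_class(ch)]
        if act == "start":      # assumed meaning: begin a new lexeme
            lexeme = ch
        elif act == "add":      # assumed meaning: append to the lexeme
            lexeme += ch
        # "error" captures nothing; the error state se absorbs the rest
        state = nxt
    return lexeme if state in FINAL else None
```

A real scanner would more likely return a token kind together with the lexeme; this sketch returns only the string to keep the control flow visible.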

What if we need a tighter specification?

r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31?

Write a tighter regular expression
- Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
- Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09

Produces a more complex DFA
• DFA has more states
• DFA has same cost per transition (or per character)
• DFA has same basic implementation
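As a quick check of the tighter specification, the first RE can be transcribed into Python's `re` syntax. This is a sketch; `TIGHT_REGISTER` and `is_register` are illustrative names, `[0-9]?` stands in for (Digit | ε), and `fullmatch` is used so the whole word must match.

```python
import re

# r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
TIGHT_REGISTER = re.compile(r"r(?:[0-2][0-9]?|[4-9]|3|30|31)")


def is_register(word):
    """True when word is in the language of the tighter Register RE."""
    return TIGHT_REGISTER.fullmatch(word) is not None
```

Note that the RE also admits the two-digit forms r00 through r09, which matches the second, enumerated form of the specification.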

Tighter register specification (continued)

The DFA for

    Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )

• Accepts a more constrained set of register names
• Same set of actions, more states

[Figure: DFA for the tighter Register specification. S0 goes to S1 on r; S1 goes to S2 on 0,1,2, to S5 on 3, and to S4 on 4,5,6,7,8,9; S2 goes to S3 on (0|1|2| … |9); S5 goes to S6 on 0,1.]

Tighter register specification (continued)

Table encoding RE for the tighter register specification:

    δ     r     0,1   2     3     4-9   All others
    s0    s1    se    se    se    se    se
    s1    se    s2    s2    s5    s4    se
    s2    se    s3    s3    s3    s3    se
    s3    se    se    se    se    se    se
    s4    se    se    se    se    se    se
    s5    se    s6    se    se    se    se
    s6    se    se    se    se    se    se
    se    se    se    se    se    se    se

This table runs in the same skeleton recognizer

This table uses the same O(1) time per character

The extra precision costs us table space, not time
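To illustrate that only the table changes, here is a Python sketch that runs the tighter table through the same loop as the earlier skeleton. The set of final states is not listed in the table itself; the set used here (s2 through s6) is inferred from the RE and should be read as an assumption, as should the names `ROWS`, `column`, and `recognize_tight`.

```python
# Rows of the tighter table; columns are r, {0,1}, 2, 3, 4-9, all others.
ERR = ["se"] * 6
ROWS = {
    "s0": ["s1", "se", "se", "se", "se", "se"],
    "s1": ["se", "s2", "s2", "s5", "s4", "se"],
    "s2": ["se", "s3", "s3", "s3", "s3", "se"],
    "s3": ERR,
    "s4": ERR,
    "s5": ["se", "s6", "se", "se", "se", "se"],
    "s6": ERR,
    "se": ERR,
}
FINAL = {"s2", "s3", "s4", "s5", "s6"}  # assumption: inferred from the RE


def column(ch):
    """Map a single input character to its column index."""
    if ch == "r":
        return 0
    if ch in "01":
        return 1
    if ch == "2":
        return 2
    if ch == "3":
        return 3
    if ch in "456789":
        return 4
    return 5


def recognize_tight(word):
    """Same skeleton as before: one table lookup per character."""
    state = "s0"
    for ch in word:
        state = ROWS[state][column(ch)]
    return state in FINAL
```

The loop body is unchanged from the simpler recognizer; the extra precision shows up only as more rows and columns.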

Tighter register specification (continued)

State / Action table (each entry is next state / action; e is the error state):

    State   r           0,1       2         3         4,5,6,7,8,9   other
    0       1 / start   e         e         e         e             e
    1       e           2 / add   2 / add   5 / add   4 / add       e
    2       e           3 / add   3 / add   3 / add   3 / add       e / exit
    3,4     e           e         e         e         e             e / exit
    5       e           6 / add   e         e         e             e / exit
    6       e           e         e         e         e             x / exit
    e       e           e         e         e         e             e

We care about path lengths (time) and the finite size of the set of states (representability), but we don’t worry (much) about the number of states.

Where are we going?

• We will show how to construct a finite state automaton to recognize any RE
• Overview:
  - Direct construction of a nondeterministic finite automaton (NFA) to recognize a given RE (introduce NFAs)
      Easy to build in an algorithmic way
      Requires ε-transitions to combine regular subexpressions
  - Construct a deterministic finite automaton (DFA) to simulate the NFA
      Use a set-of-states construction
  - Minimize the number of states in the DFA (optional, but worthwhile)
      Hopcroft state minimization algorithm
  - Generate the scanner code
      Additional specifications needed for the actions

Non-deterministic Finite Automata

What about an RE such as ( a | b )* abb ?

Each RE corresponds to a deterministic finite automaton (DFA)
• We know a DFA exists for each RE
• The DFA may be hard to build directly
• Automatic techniques will build it for us …

[Figure: a four-state DFA, S0 through S3, for ( a | b )* abb; four transitions are labeled a and four are labeled b.]

Nothing here that would change the O(1) cost per transition

Non-deterministic Finite Automata

Here is a simpler recognizer for ( a | b )* abb

[Figure: S0 goes to S1 on ε; S1 loops to itself on a | b; S1 goes to S2 on a; S2 goes to S3 on b; S3 goes to S4 on b.]

This recognizer is more intuitive
• Structure seems to follow the RE’s structure

This recognizer is not a DFA
• S0 has a transition on ε
• S1 has two transitions on a

This is a non-deterministic finite automaton (NFA)

This NFA needs one more transition, at O(1) cost per transition

Non-deterministic Finite Automata

An NFA accepts a string x iff ∃ a path through the transition graph from s0 to a final state such that the edge labels spell x, ignoring ε’s
• Transitions on ε consume no input
• To “run” the NFA, start in s0 and guess the right transition at each step
  - Always guess correctly
  - If some sequence of correct guesses accepts x then accept

Why study NFAs?
• They are the key to automating the RE→DFA construction
• We can paste together NFAs with ε-transitions

    [NFA] --ε--> [NFA]   becomes an NFA
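A program cannot "always guess correctly", so a common substitute is to track every state the NFA could be in at once. The Python sketch below does that for the five-state NFA for ( a | b )* abb; the edge set is a reconstruction (S0 to S1 on ε, a self-loop on S1 for a and b, then a, b, b to S4), and `NFA`, `eps_closure`, and `accepts` are illustrative names.

```python
EPS = ""  # label used for ε-transitions in this sketch

# Reconstructed NFA for ( a | b )* abb; S4 is the final state.
NFA = {
    ("S0", EPS): {"S1"},
    ("S1", "a"): {"S1", "S2"},   # two transitions on a: the nondeterminism
    ("S1", "b"): {"S1"},
    ("S2", "b"): {"S3"},
    ("S3", "b"): {"S4"},
}


def eps_closure(states):
    """All states reachable from the given states using only ε-transitions."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return closure


def accepts(word):
    """Simulate the NFA by following every possible guess simultaneously."""
    current = eps_closure({"S0"})
    for ch in word:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return "S4" in current
```

The word is accepted exactly when some path spells it and ends in S4, which is the acceptance condition stated above.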

Relationship between NFAs and DFAs

DFA is a special case of an NFA
• DFA has no ε-transitions
• DFA’s transition function is single-valued
• Same rules will work

DFA can be simulated with an NFA
- Obviously

NFA can be simulated with a DFA (less obvious) [Rabin & Scott, 1959]
• Simulate sets of possible states
• Possible exponential blowup in the state space
• Still, one state transition per character in the input stream

Automating Scanner Construction

To convert a specification into code:
1 Write down the RE for the input language
2 Build a big NFA
3 Build the DFA that simulates the NFA
4 Systematically shrink the DFA
5 Turn it into code

Scanner generators
• Lex and Flex work along these lines
• Algorithms are well-known and well-understood
• Key issue is interface to parser (define all parts of speech)
• You could build one in a weekend!

Where are we? Why are we doing this?

RE → NFA (Thompson’s construction)
• Build an NFA for each term
• Combine them with ε-moves

NFA → DFA (subset construction)
• Build the simulation

DFA → Minimal DFA
• Hopcroft’s algorithm

DFA → RE
• All pairs, all paths problem
• Union together paths from s0 to a final state

[Figure: The Cycle of Constructions. RE → NFA → DFA → minimal DFA, with DFA → RE closing the cycle.]

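RE → NFA using Thompson’s Construction

Key idea
• NFA pattern for each symbol & each operator
• Join them with ε moves in precedence order

[Figure: NFA for a. S0 goes to S1 on a.]

[Figure: NFA for ab. S0 goes to S1 on a; S1 goes to S3 on ε; S3 goes to S4 on b.]

[Figure: NFA for a | b. S0 goes on ε to S1 and to S3; S1 goes to S2 on a; S3 goes to S4 on b; S2 and S4 each go to S5 on ε.]

[Figure: NFA for a*. S0 goes to S1 on ε; S1 goes to S3 on a; S3 goes to S4 on ε; S3 goes back to S1 on ε; S0 goes to S4 on ε.]

Ken Thompson, CACM, 1968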
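The four patterns translate directly into small combinators. The following Python sketch is one possible encoding, not the course's implementation: an NFA fragment is a tuple (start, final, edges), `EPS` marks an ε-move, and `symbol`, `concat`, `alt`, `star`, and `accepts` are illustrative names.

```python
from itertools import count

EPS = None          # label for ε-moves in this sketch
_ids = count()      # source of fresh state numbers


def symbol(ch):
    """Pattern for a single symbol: start --ch--> final."""
    s, f = next(_ids), next(_ids)
    return (s, f, [(s, ch, f)])


def concat(x, y):
    """Pattern for xy: join x's final state to y's start with an ε-move."""
    return (x[0], y[1], x[2] + y[2] + [(x[1], EPS, y[0])])


def alt(x, y):
    """Pattern for x | y: new start and final states joined by ε-moves."""
    s, f = next(_ids), next(_ids)
    joins = [(s, EPS, x[0]), (s, EPS, y[0]), (x[1], EPS, f), (y[1], EPS, f)]
    return (s, f, x[2] + y[2] + joins)


def star(x):
    """Pattern for x*: ε-moves to enter, repeat, leave, or skip x."""
    s, f = next(_ids), next(_ids)
    joins = [(s, EPS, x[0]), (x[1], EPS, f), (x[1], EPS, x[0]), (s, EPS, f)]
    return (s, f, x[2] + joins)


def accepts(nfa, word):
    """Check a word by tracking the set of states the NFA could be in."""
    start, final, edges = nfa

    def closure(states):
        states, work = set(states), list(states)
        while work:
            p = work.pop()
            for (src, lab, dst) in edges:
                if src == p and lab is EPS and dst not in states:
                    states.add(dst)
                    work.append(dst)
        return states

    current = closure({start})
    for ch in word:
        current = closure({d for (s, lab, d) in edges if s in current and lab == ch})
    return final in current


# a ( b | c )* built in precedence order: symbols, alternation, closure, concatenation
a_bc_star = concat(symbol("a"), star(alt(symbol("b"), symbol("c"))))
```

Building a ( b | c )* this way yields ten states, the same count as the S0 through S9 automaton in the worked example; the state numbering differs because it follows construction order.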

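Example of Thompson’s Construction

Let’s try a ( b | c )*

1. a, b, & c
   [Figure: three two-state NFAs, each from S0 to S1, on a, b, and c respectively.]

2. b | c
   [Figure: S1 to S2 on b and S3 to S4 on c, joined by ε-moves from a new start state S0 and to a new final state S5.]

3. ( b | c )*
   [Figure: S2 to S3 on b and S4 to S5 on c, joined by ε-moves through S1 and S6, and wrapped by ε-moves from a new start state S0 and to a new final state S7.]

4. a ( b | c )*
   [Figure: S0 to S1 on a, followed by the NFA for ( b | c )* renumbered S2 through S9, with S4 to S5 on b and S6 to S7 on c.]

Of course, a human would design something simpler ...

   [Figure: S0 goes to S1 on a; S1 loops to itself on b | c.]

But, we can automate production of the more complex NFA version ...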


NFA → DFA with Subset Construction

Need to build a simulation of the NFA

Two key functions
• Move(si, a) is the set of states reachable from si by a
• ε-closure(si) is the set of states reachable from si by ε

The algorithm:
• Start state derived from s0 of the NFA
• Take its ε-closure S0 = ε-closure({s0})
• Take the image of S0, Move(S0, α) for each α ∈ Σ, and take its ε-closure
• Iterate until no more states are added

Sounds more complex than it is…

Rabin & Scott, 1959
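The two key functions are short to write down. The Python sketch below defines them over the ten-state NFA for a ( b | c )*; the ε-edges listed in `EDGES` are a reconstruction of the Thompson-style automaton (the extracted figure does not show them), and all names are illustrative.

```python
EPS = "eps"  # label for ε-moves in this sketch

# Reconstructed NFA for a ( b | c )*, states q0..q9, final state q9.
EDGES = {
    ("q0", "a"): {"q1"},
    ("q1", EPS): {"q2"},
    ("q2", EPS): {"q3", "q9"},
    ("q3", EPS): {"q4", "q6"},
    ("q4", "b"): {"q5"},
    ("q6", "c"): {"q7"},
    ("q5", EPS): {"q8"},
    ("q7", EPS): {"q8"},
    ("q8", EPS): {"q3", "q9"},
}


def move(states, ch):
    """Move(S, ch): states reachable from some state in S by one edge labeled ch."""
    result = set()
    for s in states:
        result |= EDGES.get((s, ch), set())
    return result


def eps_closure(states):
    """ε-closure(S): S plus everything reachable from S using only ε-moves."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in EDGES.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return closure
```

For example, `eps_closure(move({"q0"}, "a"))` is the set of NFA states the automaton can occupy after reading a single a.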

NFA → DFA with Subset Construction

The algorithm:

    s0 ← ε-closure({n0})
    S ← { s0 }
    W ← { s0 }
    while ( W ≠ Ø )
        select and remove s from W
        for each α ∈ Σ
            t ← ε-closure(Move(s, α))
            T[s, α] ← t
            if ( t ∉ S ) then
                add t to S
                add t to W

Notes on the algorithm:
• s0 is a set of states
• S & W are sets of sets of states
• The test ( t ∉ S ) is a little tricky

Let’s think about why this works

The algorithm halts:
1. S contains no duplicates (test before adding)
2. 2^{NFA states} is finite
3. while loop adds to S, but does not remove from S (monotone)
⇒ the loop halts

S contains all the reachable NFA states
• It tries each character in each si.
• It builds every possible NFA configuration.
⇒ S and T form the DFA

Any DFA state containing a final state of the NFA becomes a final state of the DFA.
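The worklist algorithm can be sketched in Python as follows. This is an illustration under stated assumptions: the NFA is the reconstructed ten-state automaton for a ( b | c )*, empty target sets are skipped rather than recorded as a state, and names such as `subset_construction` and `NFA_EDGES` are hypothetical.

```python
from collections import deque

EPS = "eps"
NFA_EDGES = {   # reconstructed NFA for a ( b | c )*
    ("q0", "a"): {"q1"}, ("q1", EPS): {"q2"}, ("q2", EPS): {"q3", "q9"},
    ("q3", EPS): {"q4", "q6"}, ("q4", "b"): {"q5"}, ("q6", "c"): {"q7"},
    ("q5", EPS): {"q8"}, ("q7", EPS): {"q8"}, ("q8", EPS): {"q3", "q9"},
}
NFA_START, NFA_FINAL, ALPHABET = "q0", "q9", "abc"


def move(states, ch):
    return {t for s in states for t in NFA_EDGES.get((s, ch), set())}


def eps_closure(states):
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in NFA_EDGES.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return closure


def subset_construction():
    """Return (names, table, finals) for the DFA that simulates the NFA."""
    s0 = frozenset(eps_closure({NFA_START}))
    names = {s0: "s0"}          # plays the role of S: each set is named once
    work = deque([s0])          # plays the role of W
    table = {}                  # plays the role of T
    while work:
        s = work.popleft()
        for ch in ALPHABET:
            t = frozenset(eps_closure(move(s, ch)))
            if not t:
                continue        # shown as "none" in the slides' table
            if t not in names:  # the set-membership test on sets of states
                names[t] = f"s{len(names)}"
                work.append(t)
            table[(names[s], ch)] = names[t]
    finals = {name for subset, name in names.items() if NFA_FINAL in subset}
    return names, table, finals
```

Using frozensets as dictionary keys handles the "tricky" membership test, since two DFA states are the same exactly when they contain the same NFA states.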

NFA → DFA with Subset Construction

Example of a fixed-point computation
• Monotone construction of some finite set
• Halts when it stops adding to the set
• Proofs of halting & correctness are similar
• These computations arise in many contexts

Other fixed-point computations
• Canonical construction of sets of LR(1) items
  - Quite similar to the subset construction
• Classic data-flow analysis (& Gaussian Elimination)
  - Solving sets of simultaneous set equations

We will see many more fixed-point computations

NFA → DFA with Subset Construction

a ( b | c )* :

[Figure: the NFA for a ( b | c )* with states q0 through q9; q0 goes to q1 on a, q4 goes to q5 on b, and q6 goes to q7 on c. The remaining edges are ε-moves.]

    States                           ε-closure(Move(s,*))
    DFA   NFA                        a                        b                        c
    s0    q0                         q1, q2, q3, q4, q6, q9   none                     none
    s1    q1, q2, q3, q4, q6, q9     none                     q5, q8, q9, q3, q4, q6   q7, q8, q9, q3, q4, q6
    s2    q5, q8, q9, q3, q4, q6     none                     s2                       s3
    s3    q7, q8, q9, q3, q4, q6     none                     s2                       s3


q7 is the core state of s3


q5 is the core state of s2


s1, s2, and s3 are final states because each contains q9, the final state of the NFA

Transition table for the DFA:

    DFA   NFA states                 a      b      c
    s0    q0                         s1     none   none
    s1    q1, q2, q3, q4, q6, q9     none   s2     s3
    s2    q5, q8, q9, q3, q4, q6     none   s2     s3
    s3    q7, q8, q9, q3, q4, q6     none   s2     s3

NFA → DFA with Subset Construction

The DFA for a ( b | c )*

[Figure: s0 goes to s1 on a; s1 goes to s2 on b and to s3 on c; s2 loops on b and goes to s3 on c; s3 loops on c and goes to s2 on b.]

          a      b      c
    s0    s1     none   none
    s1    none   s2     s3
    s2    none   s2     s3
    s3    none   s2     s3

• Much smaller than the NFA (no ε-transitions)
• All transitions are deterministic
• Use same code skeleton as before

But, remember our goal:

[Figure: S0 goes to S1 on a; S1 loops to itself on b | c.]
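To see why the four-state DFA can shrink to the two-state goal, one can group states that behave identically. The Python sketch below uses simple iterated partition refinement (in the style of Moore's algorithm), which is not Hopcroft's algorithm; it is offered only as an illustration, and `DFA`, `STATES`, `FINALS`, and `equivalence_classes` are hypothetical names.

```python
# The DFA for a ( b | c )* from the table; missing entries mean "none".
DFA = {
    ("s0", "a"): "s1",
    ("s1", "b"): "s2", ("s1", "c"): "s3",
    ("s2", "b"): "s2", ("s2", "c"): "s3",
    ("s3", "b"): "s2", ("s3", "c"): "s3",
}
STATES = ["s0", "s1", "s2", "s3"]
FINALS = {"s1", "s2", "s3"}
ALPHABET = "abc"


def equivalence_classes():
    """Map each state to a block number; states in one block are equivalent."""
    # Initial partition: final states versus non-final states.
    block = {s: (s in FINALS) for s in STATES}
    while True:
        # A state's signature: its own block plus the block reached on each symbol
        # (None when the transition is missing).
        signature = {
            s: (block[s],) + tuple(block.get(DFA.get((s, ch))) for ch in ALPHABET)
            for s in STATES
        }
        ids = {}
        new_block = {s: ids.setdefault(signature[s], len(ids)) for s in STATES}
        if len(ids) == len(set(block.values())):
            return new_block    # no block was split: fixed point reached
        block = new_block
```

On this DFA the refinement stops immediately: s1, s2, and s3 land in one block and s0 in another, which corresponds to the two-state automaton a human would draw.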


Not enough time to teach Hopcroft’s algorithm today


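Alternative Approach to DFA Minimization

The Intuition
• The subset construction merges prefixes in the NFA

abc | bc | ad

[Figure: NFA for abc | bc | ad with states s0 through s10. From s0 there is one branch per alternative: s1 through s4 spell abc, s5 through s7 spell bc, and s8 through s10 spell ad.]

Thompson’s construction would leave ε-transitions between each single-character automaton

[Figure: the DFA after the subset construction, states s0 through s6, with edges labeled a, b, b, c, d, c. The two paths that begin with a now share a single edge out of s0.]

Subset construction eliminates ε-transitions and merges the paths for a. It leaves duplicate tails, such as bc.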

Alternative Approach to DFA Minimization

Idea: use the subset construction twice
• For an NFA N
  - Let reverse(N) be the NFA constructed by making initial states final (& vice-versa) and reversing the edges
  - Let subset(N) be the DFA that results from applying the subset construction to N
  - Let reachable(N) be N after removing all states that are not reachable from the initial state
• Then,

    reachable( subset( reverse( reachable( subset( reverse(N) ) ) ) ) )

  is the minimal DFA that implements N [Brzozowski, 1962]

This result is not intuitive, but it is true.
Neither algorithm dominates the other.
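The double-reversal idea fits in a few lines of Python. The sketch below makes several simplifying assumptions: the NFA has no ε-transitions and may have several start states, an automaton is a tuple (starts, finals, edges), and `subset` only ever creates states reachable from the start, so the reachable step is implicit. All names are illustrative.

```python
def reverse(nfa):
    """Swap start and final states and flip every edge."""
    starts, finals, edges = nfa
    return (finals, starts, {(dst, sym, src) for (src, sym, dst) in edges})


def subset(nfa):
    """Subset construction for an ε-free NFA; builds only reachable states."""
    starts, finals, edges = nfa
    alphabet = sorted({sym for (_, sym, _) in edges})
    start = frozenset(starts)
    seen, work, dfa_edges = {start}, [start], set()
    while work:
        s = work.pop()
        for sym in alphabet:
            t = frozenset(dst for (src, x, dst) in edges if src in s and x == sym)
            if not t:
                continue            # no transition; empty set is not a state here
            dfa_edges.add((s, sym, t))
            if t not in seen:
                seen.add(t)
                work.append(t)
    dfa_finals = frozenset(s for s in seen if s & frozenset(finals))
    return (frozenset([start]), dfa_finals, dfa_edges)


def brzozowski(nfa):
    """subset(reverse(subset(reverse(N)))), with reachable folded into subset."""
    return subset(reverse(subset(reverse(nfa))))


def states_of(nfa):
    starts, finals, edges = nfa
    return set(starts) | set(finals) | {s for (a, _, b) in edges for s in (a, b)}


def accepts(nfa, word):
    starts, finals, edges = nfa
    current = set(starts)
    for ch in word:
        current = {dst for (src, x, dst) in edges if src in current and x == ch}
    return bool(current & set(finals))


# ε-free NFA for abc | bc | ad, one branch per alternative out of state 0.
N = ({0}, {3, 5, 7},
     {(0, "a", 1), (1, "b", 2), (2, "c", 3),
      (0, "b", 4), (4, "c", 5),
      (0, "a", 6), (6, "d", 7)})
MINIMAL = brzozowski(N)
```

For abc | bc | ad the result has four states, which agrees with the four-state minimal DFA the example arrives at; the error (dead) state is left implicit in this sketch.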

Alternative Approach to DFA Minimization

Step 1
• The subset construction on reverse(NFA) merges suffixes in original NFA

[Figure: Reversed NFA. The NFA for abc | bc | ad (s0 through s10) with its edges reversed, plus an added state s11.]

[Figure: subset(reverse(NFA)). States s11, s9, s8, s3, s2, s1 with edges labeled a, a, b, d, c.]

Alternative Approach to DFA Minimization

Step 2
• Reverse it again & use subset to merge prefixes …

[Figure: Reverse it, again. States s11, s9, s8, s3, s2, s1, s0 with edges labeled a, a, b, d, c.]

[Figure: And subset it, again. The minimal DFA, with states s0, s2, s3, s11 and edges labeled a, b, d, c, b.]