Upload
christian-mills
View
230
Download
2
Tags:
Embed Size (px)
Citation preview
Lexical Analysis — Part IIFrom Regular Expression
to Scanner
Comp 412
Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved.Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
COMP 412FALL 2010
Comp 412, Fall 2010 2
Quick Review
Last class:— The scanner is the first stage in the front end— Specifications can be expressed using regular expressions— Build tables and code from a DFA
ScannerGenerator
specifications
Scannersource code parts of speech &
words
Specifications written as “regular expressions”
Represent words as
indices into a global
table
tables or code
design time
compile time
Comp 412, Fall 2010 3
Quick Review of Regular Expressions
• All strings of 1s and 0s ending in a 1
• All strings over lowercase letters where the vowels (a,e,i,o, & u) occur exactly once, in ascending order
• All strings of 1s and 0s that do not contain three 0s in a row:
( 1* ( |01 | 001 ) 1* )* ( | 0 | 00 )
( 0 | 1 )* 1
Let Cons be (b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z)
Cons* a Cons* e Cons* i Cons* o Cons* u Cons*
Comp 412, Fall 2010 4
Consider the problem of recognizing ILOC register names
Register r (0|1|2| … | 9) (0|1|2| … | 9)*
• Allows registers of arbitrary number• Requires at least one digit
RE corresponds to a recognizer (or DFA)
Transitions on other inputs go to an error state, se
Example (from Lab 1)
S0 S2 S1
r
(0|1|2| … 9)
accepting state
(0|1|2| … 9)
Recognizer for Register
Comp 412, Fall 2010 5
DFA operation
• Start in state S0 & make transitions on each input character
• DFA accepts a word x iff x leaves it in a final state (S2 )
So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se
Example (continued)
S0 S2 S1
r
(0|1|2| … 9)
accepting state
(0|1|2| … 9)
Recognizer for Register
Comp 412, Fall 2010 6
Example (continued)
To be useful, the recognizer must be converted into code
r0,1,2,3,4,5,6,7,8,
9
All others
s0 s1 se se
s1 se s2 se
s2 se s2 se
se se se se
Char next characterState s0
while (Char EOF) State (State,Char) Char next character
if (State is a final state ) then report success else report failure
Skeleton recognizer
Table encoding the RE
O(1) cost per character
(or per transition)S0 S2 S1
r
(0|1|2| … 9)
(0|1|2| … 9)
Comp 412, Fall 2010 7
Example (continued)
We can add “actions” to each transition
r0,1,2,3,4,5,6,7,8,
9
All other
s
s0 s1
startse
errorse
error
s1 se
errors2
addse
error
s2 se
errors2
addse
error
se se
errorse
errorse
error
Char next characterState s0
while (Char EOF) Next (State,Char) Act (State,Char) perform action Act State Next Char next character
if (State is a final state ) then report success else report failure
Skeleton recognizer
Table encoding RE
Typical action is to capture the lexeme
Comp 412, Fall 2010 8
r Digit Digit* allows arbitrary numbers• Accepts r00000 • Accepts r99999• What if we want to limit it to r0 through r31 ?
Write a tighter regular expression— Register r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
— Register r0|r1|r2| … |r31|r00|r01|r02| … |r09
Produces a more complex DFA
• DFA has more states• DFA has same cost per transition (or per
character)• DFA has same basic implementation
What if we need a tighter specification?
Comp 412, Fall 2010 9
Tighter register specification (continued)
The DFA forRegister r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
• Accepts a more constrained set of register names• Same set of actions, more states
S0 S5 S1
r
S4
S3
S6
S2
0,1,2
3 0,1
4,5,6,7,8,9
(0|1|2| … 9)
Comp 412, Fall 2010 10
Tighter register specification (continued)
r 0,1 2 3 4-9All
others
s0 s1 se se se se se
s1 se s2 s2 s5 s4 se
s2 se s3 s3 s3 s3 se
s3 se se se se se se
s4 se se se se se se
s5 se s6 se se se se
s6 se se se se se se
se se se se se se se
Table encoding RE for the tighter register specification
This table runs in the same skeleton recognizer
This table uses the same O(1) time per character
The extra precision costs us table space, not time
S0 S5 S1
r
S4
S3
S6
S2
0,1,2
3 0,1
4,5,6,7,8,9
(0|1|2| … 9)
Comp 412, Fall 2010 11
Tighter register specification (continued)
State Action
r 0,1 2 34,5,67,8,9
other
01
starte e e e e
1 e2
add2
add5
add4
adde
2 e3
add3
add3
add3
adde
exit
3,4 e e e e ee
exit
5 e6
adde e e
eexit
6 e e e e ex
exit
e e e e e e e
S0 S5 S1
r
S4
S3
S6
S2
0,1,2
3 0,1
4,5,6,7,8,9
(0|1|2| … 9)
We care about path lengths (time) and finite size of set of states (representability), but we don’t worry (much) about number of states.
Comp 412, Fall 2010 12
Where are we going?
• We will show how to construct a finite state automaton to recognize any RE
• Overview:— Direct construction of a nondeterministic finite automaton
(NFA) to recognize a given RE Easy to build in an algorithmic way Requires -transitions to combine regular subexpressions
— Construct a deterministic finite automaton (DFA) to simulate the NFA
Use a set-of-states construction
— Minimize the number of states in the DFA Hopcroft state minimization algorithm
— Generate the scanner code Additional specifications needed for the actions
Introduce NFAs
Optional, but worthwhile
Comp 412, Fall 2010 13
Non-deterministic Finite Automata
What about an RE such as ( a | b )* abb ?
Each RE corresponds to a deterministic finite automaton (DFA)
• We know a DFA exists for each RE
• The DFA may be hard to build directly
• Automatic techniques will build it for us …
S0 S1 S3 S2
b
b
b
b
a
a
a
a
Nothing here that would change the O(1) cost per
transition
Comp 412, Fall 2010 14
Non-deterministic Finite Automata
Here is a simpler RE for ( a | b )* abb
This recognizer is more intuitive• Structure seems to follow the RE’s structure
This recognizer is not a DFA
• S0 has a transition on
• S1 has two transitions on a
This is a non-deterministic finite automaton (NFA)
a | b
S0 S1 S4 S2 S3
a bb
This NFA needs one more transition, at O(1) cost per
transition
Comp 412, Fall 2010 15
Non-deterministic Finite Automata
An NFA accepts a string x iff a path though the transition graph from s0 to a final state such that
the edge labels spell x, ignoring ’s
• Transitions on consume no input
• To “run” the NFA, start in s0 and guess the right transition at each step— Always guess correctly— If some sequence of correct guesses accepts x then accept
Why study NFAs?• They are the key to automating the REDFA construction
• We can paste together NFAs with -transitions
NFA NFA becomes an NFA
Comp 412, Fall 2010 16
Relationship between NFAs and DFAs
DFA is a special case of an NFA
• DFA has no transitions• DFA’s transition function is single-valued• Same rules will work
DFA can be simulated with an NFA— Obviously
NFA can be simulated with a DFA (less obvious)
• Simulate sets of possible states• Possible exponential blowup in the state space• Still, one state per character in the input stream
Rabin & Scott, 1959
Comp 412, Fall 2010 17
Automating Scanner Construction
To convert a specification into code:
1 Write down the RE for the input language
2 Build a big NFA
3 Build the DFA that simulates the NFA
4 Systematically shrink the DFA
5 Turn it into code
Scanner generators• Lex and Flex work along these lines• Algorithms are well-known and well-understood• Key issue is interface to parser (define all parts of
speech)• You could build one in a weekend!
Comp 412, Fall 2010 18
Where are we? Why are we doing this?
RE NFA (Thompson’s construction) • Build an NFA for each term• Combine them with -moves
NFA DFA (Subset construction) • Build the simulation
DFA Minimal DFA • Hopcroft’s algorithm
DFA RE
• All pairs, all paths problem• Union together paths from s0 to a final state
minimal DFA
RE NFA DFA
The Cycle of Constructions
Comp 412, Fall 2010 19
RE NFA using Thompson’s Construction
Key idea
• NFA pattern for each symbol & each operator
• Join them with moves in precedence order
S0 S1
a
NFA for a
S0 S1
aS3 S4
b
NFA for ab
NFA for a | b
S0
S1 S2
a
S3 S4
b
S5
S0 S1
S3 S4
NFA for a*
a
Ken Thompson, CACM, 1968
Comp 412, Fall 2010 20
Example of Thompson’s Construction
Let’s try a ( b | c )*
1. a, b, & c
2. b | c
3. ( b | c )*
S0 S1 a
S0 S1 b
S0 S1 c
S2 S3 b
S4 S5 c
S1 S6 S0 S7
S1 S2 b
S3 S4 c
S0 S5
Comp 412, Fall 2010 21
Example of Thompson’s Construction (con’t)
4. a ( b | c )*
Of course, a human would design something simpler ...
S0 S1 a
b | cBut, we can automate production of the more complex NFA version ...
S0 S1 a
S4 S5 b
S6 S7 c
S3 S8 S2 S9
Comp 412, Fall 2010 22
Where are we? Why are we doing this?
RE NFA (Thompson’s construction) • Build an NFA for each term• Combine them with -moves
NFA DFA (subset construction) • Build the simulation
DFA Minimal DFA • Hopcroft’s algorithm
DFA RE
• All pairs, all paths problem• Union together paths from s0 to a final state
minimal DFA
RE NFA DFA
The Cycle of Constructions
Comp 412, Fall 2010 23
NFA DFA with Subset Construction
Need to build a simulation of the NFA
Two key functions
• Move(si , a) is the set of states reachable from si by a
• -closure(si) is the set of states reachable from si by
The algorithm:
• Start state derived from s0 of the NFA
• Take its -closure S0 = -closure({s0})
• Take the image of S0, Move(S0, ) for each , and
take its -closure
• Iterate until no more states are added
Sounds more complex than it is…
Rabin & Scott, 1959
Comp 412, Fall 2010 24
NFA DFA with Subset Construction
The algorithm:
s0 -closure({n0})
S { s0 }
W { s0 }
while ( W ≠ Ø )select and remove s from
Wfor each
t -closure(Move(s,))
T[s,] t if ( t S ) then
add t to S add t to W
Let’s think about why this works
The algorithm halts:
1. S contains no duplicates (test before adding)
2. 2{NFA states} is finite
3. while loop adds to S, but does not remove from S
(monotone)
the loop halts
S contains all the reachable NFA states
It tries each character in each si.
It builds every possible NFA configuration.
S and T form the DFAThis test is a little trickys0 is a set of states
S & W are sets of sets of states
Comp 412, Fall 2010 25
NFA DFA with Subset Construction
The algorithm:
s0 -closure({n0})
S { s0 }
W { s0 }
while ( W ≠ Ø )select and remove s from
Wfor each
t -closure(Move(s,))
T[s,] t if ( t S ) then
add t to S add t to W
Let’s think about why this works
The algorithm halts:
1. S contains no duplicates (test before adding)
2. 2{NFA states} is finite
3. while loop adds to S, but does not remove from S
(monotone)
the loop halts
S contains all the reachable NFA states
It tries each character in each si.
It builds every possible NFA configuration.
S and T form the DFAAny DFA state containing a final state of the NFA final state becomes a final state of the DFA.
Comp 412, Fall 2010 26
NFA DFA with Subset Construction
Example of a fixed-point computation• Monotone construction of some finite set• Halts when it stops adding to the set• Proofs of halting & correctness are similar• These computations arise in many contexts
Other fixed-point computations• Canonical construction of sets of LR(1) items
— Quite similar to the subset construction
• Classic data-flow analysis (& Gaussian Elimination)— Solving sets of simultaneous set equations
We will see many more fixed-point computations
Comp 412, Fall 2010 27
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 28
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 29
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 30
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 31
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 32
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 33
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 34
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 35
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Comp 412, Fall 2010 36
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none
s3q7, q8, q9,q3, q4, q6
none
Comp 412, Fall 2010 37
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
q7 is the core state of s3
Comp 412, Fall 2010 38
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
q5 is the core state of s2
Comp 412, Fall 2010 39
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0q1, q2, q3,q4, q6, q9
none none
s1
q1, q2, q3,q4, q6, q9
noneq5, q8, q9,q3, q4, q6
q7, q8, q9,q3, q4, q6
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Final states because of q9
Comp 412, Fall 2010 40
NFA DFA with Subset Construction
a ( b | c )* :
q0 q1 a
q4 q5 b
q6 q7 c
q3 q8 q2 q9
States -closure(Move(s,*))
DFA NFA a b c
s0 q0 s1none none
s1
q1, q2, q3,q4, q6, q9
none s2 s3
s2
q5, q8, q9,q3, q4, q6
none s2 s3
s3q7, q8, q9,q3, q4, q6
none s2 s3
Transition table for the DFA
Comp 412, Fall 2010 41
NFA DFA with Subset Construction
The DFA for a ( b | c )*
• Much smaller than the NFA (no -transitions)
• All transitions are deterministic
• Use same code skeleton as before
s3
s2
s0 s1
c
ba
b
c
c
b
S0 S1 a
b | cBut, remember our goal:
a b c
s0 s1none none
s1none s2 s3
s2none s2 s3
s3none s2 s3
Comp 412, Fall 2010 42
Where are we? Why are we doing this?
RE NFA (Thompson’s construction) • Build an NFA for each term• Combine them with -moves
NFA DFA (subset construction) • Build the simulation
DFA Minimal DFA • Hopcroft’s algorithm
DFA RE
• All pairs, all paths problem• Union together paths from s0 to a final state
Not enough time to teach Hopcroft’s algorithm today
minimal DFA
RE NFA DFA
The Cycle of Constructions
Comp 412, Fall 2010 43
Alternative Approach to DFA Minimization
The Intuition• The subset construction merges prefixes in the NFA
s0
s10s9s8
s5 s7s6
s3s2s1 s4
a
b
a
b
c
d
c
abc | bc | ad
Thompson’s construction would leave -transitions between each single-character automaton
s0
s6
s4 s5
s2s1 s3a
b
b
c
d
c Subset construction eliminates -transitions and merges the paths for a. It leaves duplicate tails, such as bc.
Comp 412, Fall 2010 44
Alternative Approach to DFA Minimization
Idea: use the subset construction twice• For an NFA N
— Let reverse(N) be the NFA constructed by making initial states final (& vice-versa) and reversing the edges
— Let subset(N) be the DFA that results from applying the subset construction to N
— Let reachable(N) be N after removing all states that are not reachable from the initial state
• Then,
reachable(subset(reverse[reachable(subset(reverse(N))]))
is the minimal DFA that implements N [Brzozowski, 1962]
This result is not intuitive, but it is true.Neither algorithm dominates the other.
Comp 412, Fall 2010 45
Alternative Approach to DFA Minimization
Step 1• The subset construction on reverse(NFA) merges
suffixes in original NFA
s11
Reversed NFAs0
s10s9s8
s5 s7s6
s3s2s1 s4
a
b
a
b
c
d
c
s11s9s8
s3s2s1a
a
b
d
c
subset(reverse(NFA))