quentin-george
– 2 –
Lexical Analysis
Objectives
• To understand:
1. The role of a lexical analyzer
2. Lexical analysis using formal language definitions with finite automata
3. Specification and recognition of tokens
4. A language for specifying lexical analyzers
www.Bookspar.com | Website for Students | VTU - Notes - Question Papers
Programming Language Structure
Recall that a programming language is defined by:
1. SYNTAX: decides whether a sentence in the language is well-formed
2. SEMANTICS: determines the meaning, if any, of a syntactically well-formed sentence
3. GRAMMAR: a formal system that provides a finite, generative description of the language
Syntax of a Programming Language
• Describes the structure of programs without any consideration of their meaning.
• The syntactic elements of a programming language are determined by the computation model and pragmatic concerns.
• Well-developed tools (regular, context-free and attribute grammars) are available for describing the syntax of programming languages.
• The lexical analyzer and the parser of a compiler handle the syntax of the programming language.
Some Basic Definitions
• lex-i-cal: Of or relating to words or the vocabulary of a language, as distinguished from its grammar and construction.
• lexical analysis: The task concerned with breaking an input into its smallest meaningful units, called tokens.
• syntax analysis: The task concerned with fitting a sequence of tokens into a specified syntax.
• parsing: To break a sentence down into its component parts of speech, with an explanation of the form, function, and syntactical relationship of each part.
Lexical Analyzer (a.k.a. Scanner)
• The only part of a compiler that looks at each character of the source text; it performs a linear analysis.
• Reads the source text and produces TOKENS.
• Also keeps track of the source coordinates of each token: file name, line number and position. (This is useful for debugging and error reporting.)
• Advantages of a separate lexical analyzer:
– Keeps compiler design simple
– Improves efficiency
– Increases portability
The Role of a Lexical Analyzer
[Figure: the lexical analyzer reads the source program one character at a time ("get next char" / "next char") and supplies tokens to the syntax analyzer on request ("get next token" / "next token"). The diagram also shows the symbol table, which contains a record for each identifier.]
Tokens, Patterns and Lexemes
• What are tokens?
– The basic lexical units of the language
– A sequence of abstract characters that can be treated as a unit in the grammar of the language
– A programming language classifies its tokens into a finite set of token types
• A note on terminology: some texts refer to
– token types as tokens, and
– tokens as lexemes
We will stick to the terms "tokens" and "token types".
• Some tokens may have attributes:
– An integer constant token will have the actual integer (17, 42) as an attribute
– An identifier will have a string with the actual id
Tokens Example
• Let us consider the program segment:
void main() { printf("Hello World\n"); }
• The tokens of this program segment are:
1. void
2. main
3. (
4. )
5. {
6. printf
7. (
8. "Hello World\n"
9. )
10. ;
11. }
Specifications of Tokens
Strings, Words and Sentences: terms for parts of a string s
1. Prefix of s: a string obtained by deleting zero or more trailing symbols of s
2. Suffix of s: a string obtained by deleting zero or more leading symbols of s
3. Substring of s: a string obtained by deleting a prefix and a suffix from s
4. Proper prefix, suffix or substring of s: any nonempty string x that is a prefix, suffix or substring of s such that s ≠ x
5. Subsequence of s: a string obtained by deleting zero or more symbols of s, not necessarily contiguous
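The string terms above can be made concrete with a few lines of Python. This is only an illustrative sketch; the function names (`prefixes`, `suffixes`, `substrings`, `proper`, `is_subsequence`) are invented for this example and do not come from the slides.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All substrings of s: delete some prefix and some suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(parts, s):
    """Keep only the 'proper' parts: nonempty and different from s."""
    return [x for x in parts if x and x != s]

def is_subsequence(x, s):
    """True if x can be obtained from s by deleting symbols (not necessarily contiguous)."""
    remaining = iter(s)
    return all(ch in remaining for ch in x)

print(prefixes("ban"))
print(proper(suffixes("ban"), "ban"))
print(is_subsequence("bn", "ban"), "bn" in substrings("ban"))
```

The last line shows the difference between a subsequence and a substring: "bn" is a subsequence of "ban" but not a substring of it.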
The Principle of Longest Match
• In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice.
• Example:
return foobar != hohum;
should be recognized as 5 tokens:
RETURN ID(foobar) NEQ ID(hohum) SCOLON
and not more (i.e., not parts of words or identifiers, and not ! and = as separate tokens).
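The longest-match rule can be sketched as a small scanner in Python. This is a minimal illustration, not the way a generated scanner works internally: the token table `TOKEN_SPEC` and the token names NOT and ASSIGN are assumptions made for this example, and ties between equally long matches are resolved in favor of the rule listed first (so the keyword wins over ID).

```python
import re

# Hypothetical token table for this sketch: (token type, pattern).
TOKEN_SPEC = [
    ("RETURN", r"return"),
    ("ID",     r"[a-z][a-z0-9]*"),
    ("NEQ",    r"!="),
    ("NOT",    r"!"),
    ("ASSIGN", r"="),
    ("SCOLON", r";"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]

def scan(text):
    """Split text into (token type, lexeme) pairs using the longest-match rule."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # skip white space between tokens
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

for token in scan("return foobar != hohum;"):
    print(token)
```

Because the longer match always wins, an input such as `returns` comes out as a single ID rather than the keyword RETURN followed by an ID.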
Typical Tokens in Programming Languages
• Operators & punctuation
– + - * / ( ) { } [ ] ; : :: < <= == = != ! …
– Each of these is a distinct lexical class (or token type)
• Keywords
– if while for goto return switch void …
– Each of these is also a distinct lexical class (not a string)
• Identifiers
– A single ID lexical class, but parameterized by the actual id
• Integer constants
– A single INT lexical class, but parameterized by the int value
• Other constants, etc.
Tokens of a Typical Language
TYPE        EXAMPLE
ID          foo, n14, a, temp, …
NUM         73, 0, 00, 515, +2, …
REAL        66.1  .5  10.  1e67  5.5e-10  …
KEYWORDS    IF  DO  WHILE  INT  …
SYMBOLS     , (Comma)   != (Noteq)   ( (Lparen)  …
Question: How are tokens formally defined and recognized?
Answer: By using regular expressions to define a token as a formal regular language.
Formal Theory of Languages
• A language in real life is made up of
1. Words made up of letters of an alphabet, and
2. Sentences made up of words arranged according to the grammar of that language
• Natural languages display an amazing variety of expressions, with explicit and implicit meanings, and variations in meaning as well as in grammar.
• Computer languages, on the contrary, focus on
– A limited set of tasks to be performed
– Hence mathematical precision is essential in defining their structure and grammar
Formal Definition of Languages
• Alphabet: a finite (non-empty) set of symbols, denoted by Σ
• String: a finite sequence of symbols from an alphabet; this includes the empty sequence (denoted by λ)
• Language: a set (often infinite) of finite strings
The set of all possible finite strings of elements of the alphabet Σ (including λ) is denoted by Σ*.
Finite specification of (possibly infinite) languages is possible with:
1. Automaton: a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
2. Grammar: a generator; a system for producing all strings in the language (and no other strings)
A language may be specified by many different grammars and automata, BUT a grammar or automaton specifies only one language.
Formal Language Definition (Contd.)
• As already defined, a language L over an alphabet Σ is a collection of strings of elements of Σ:
– The PASCAL language is the set of all strings that constitute legal PASCAL programs (infinite set)
– The language of primes is the set of all decimal digit strings that represent prime numbers (infinite set)
– The language of C reserved words is the set of all alphabetic strings that cannot be used as identifiers in the C programming language (finite set)
• To specify some of these (possibly infinite) languages with a finite description, we use the notation of Regular Expressions.
Regular Expressions
• A regular expression is always defined over some alphabet Σ (for programming languages, commonly ASCII or Unicode).
• If E is a regular expression, L(E) is the "language" (set of strings) generated by E.
• For example, for each symbol a in the alphabet, the regular expression a denotes the language {a}, containing just the string a (known as a symbol).
• The regular expression ε denotes the language {λ}, containing just the empty sequence λ.
Operations with Regular Expressions
• Given two regular expressions M and N:
• Alternation (denoted by |) makes a new regular expression M | N denoting the union of the languages L(M) and L(N), i.e. L(M) ∪ L(N).
• Concatenation (denoted by . or by simply writing one expression after the other) makes a new regular expression MN denoting the language of strings formed by a string from L(M) followed by a string from L(N).
• Repetition (denoted by *) makes a new regular expression M* denoting the language of 0 or more concatenated strings from L(M) (the Kleene closure of L(M)).
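Since a language is just a set of strings, the three operations can be tried out directly on Python sets. This is a sketch for small finite languages only; the function names are invented here, and because the Kleene closure is infinite in general, `kleene` takes an assumed bound `max_reps` on the number of repetitions.

```python
def union(lm, ln):
    """L(M | N): every string that is in L(M) or in L(N)."""
    return lm | ln

def concat(lm, ln):
    """L(MN): a string from L(M) followed by a string from L(N)."""
    return {x + y for x in lm for y in ln}

def kleene(lm, max_reps):
    """The part of L(M*) built from at most max_reps strings of L(M)."""
    result = {""}            # zero repetitions give the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, lm)   # one more repetition
        result |= layer
    return result

print(union({"a"}, {"b"}))
print(concat({"a", "b"}, {"c"}))
print(sorted(kleene({"ab"}, 3), key=len))
```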
Regular Expression Example
Expression          Language                Example Words
a | b               {a, b}                  a, b
ab*a                {a}{b}*{a}              aa, aba, abba, abbba, …
(ab)*               {ab}*                   ε, ab, abab, ababab, …
abba                {abba}                  abba
(0 | 1)*0           ({0} ∪ {1})*{0}         0, 00, 10, 010, 110, … (all even binary numbers)
b*(abb*)*(a | ε)    Strings of a and b with NO consecutive a
Similarly, using symbols, |, ., * and ε, we can specify the regular expressions corresponding to the lexical tokens of a programming language using rules (a.k.a. productions).
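The example expressions can be checked with Python's `re` module, whose syntax for these operators is close to the notation used here. One assumption in this sketch: the alternative `(a | ε)` is written as `a?`, since ε has no literal spelling in Python patterns. The helper name `in_language` is invented for the example.

```python
import re

def in_language(pattern, word):
    """True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None

checks = [
    ("ab*a", "abbba"),
    ("ab*a", "ab"),
    ("(ab)*", ""),
    ("(0|1)*0", "110"),
    ("(0|1)*0", "101"),
    ("b*(abb*)*a?", "babba"),
    ("b*(abb*)*a?", "baab"),     # contains two consecutive a's
]
for pattern, word in checks:
    print(f"{pattern!r:16} {word!r:10} {in_language(pattern, word)}")
```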
Table of Operators & Abbreviations
Notation    Description
a           An ordinary character that stands for itself
ε           The empty string
M | N       Alternation: choosing from M or N
MN          Concatenation: an M followed by an N
M*          Repetition (zero or more times)
M+          Repetition (one or more times)
M?          Optional (zero or one occurrence of M)
[a-zA-Z]    Character set alternation (ranges of characters)
[abxyz]     One of the given characters (a|b|x|y|z)
.           Stands for any single character (except newline)
'a.+*'      Quotation: a string in quotes stands for itself literally
Regular Expression Construction
• Problem: Specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Observations on numbers:
1. A number can be made up of one or more digits from the set (0–9): (0–9)+
2. It can optionally have a decimal point after these digits, followed by 0 or more digits: "."(0–9)*
3. A number can also start with a point followed by one or more digits: "."(0–9)+
Putting these together:
[ (0–9)+ [ "."(0–9)* ]? ] | [ "."(0–9)+ ]
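The three observations translate almost directly into a Python pattern. This is a sketch under the assumption that the square brackets in the expression above are used for grouping; in Python they become non-capturing groups `(?:...)`, and the decimal point is escaped as `\.`. The names `UNSIGNED` and `is_unsigned_number` are invented for the example.

```python
import re

# digits, optionally followed by "." and more digits   OR   "." followed by digits
UNSIGNED = re.compile(r"(?:[0-9]+(?:\.[0-9]*)?)|(?:\.[0-9]+)")

def is_unsigned_number(text):
    """True if the whole text is an unsigned number in the sense described above."""
    return UNSIGNED.fullmatch(text) is not None

for sample in ["1997", "19.97", "10.", ".5", ".", "1.2.3", "-5"]:
    print(f"{sample!r:8} {is_unsigned_number(sample)}")
```

A lone "." is rejected because each alternative requires at least one digit.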
Regular Expressions for Some Tokens of a Programming Language
Regular Expression                              Token Type
if                                              [ return IF; ]
[a-z][a-z0-9]*                                  [ return ID ]
[0-9]+                                          [ return NUM ]
([0-9]+ '.' [0-9]*) | ('.' [0-9]+)              [ return REAL ]
('\*' [a-z]* '\n') | (' ' | '\n' | '*/')+       [ return Comment ]
.                                               [ return ERROR ]
A Regular Expression Recognizer
• Given an input string, the function of a "regular expression analyzer" is to say:
– "YES, the input is part of the language generated from the regular expression", or
– "NO, the input isn't part of the language generated from the regular expression"
• Using results from finite automata theory and the theory of algorithms, we can automate the construction of such recognizers from regular expressions.
Finite Automata
• A finite automaton is a transition graph that has:
– A finite set of states S (represented by nodes), with edges leading from one state to another
– Each edge labeled with the symbol (from the set Σ) that causes the transition (the label could also be ε!)
– One state designated as the start state s0, and certain states distinguished as final states (normally drawn as two concentric circles)
• Mathematically, it can be represented as:
A = {S, Σ, s0, F, move}
Recognizing Expressions as Tokens with a Finite State Automaton
• Operate by reading input symbols (usually characters)
– A transition can be taken if it is labeled with the current symbol
– An ε-transition can be taken at any time
• Accept when a final state is reached and there is no more input
– A scanner is slightly different: it accepts the longest match even if there is more input
• Reject if no transition is possible, or if there is no more input and the automaton is not in a final state (DFA)
Finite Automata Examples
[Figures: three transition graphs, each with start state 1.
– if: 1 --i--> 2 --f--> 3; return IF
– [a-z][a-z0-9]*: 1 --(a-z)--> 2, with loops on 2 labeled a-z and 0-9; return ID
– [0-9]+: 1 --(0-9)--> 2, with a loop on 2 labeled 0-9; return NUM]
Finite Automata Examples (Contd.)
[Figure: transition graph with states 1 to 5 for ([0-9]+ '.' [0-9]*) | ('.' [0-9]+), with edges labeled 0-9 and '.'; return REAL.]
Deterministic Finite Automata (DFA)
• A finite automaton is deterministic if
1. It has no edges/transitions labeled with epsilon.
2. For each state and for each symbol in the alphabet, there is exactly one edge labeled with that symbol.
• Such a transition graph is called a state graph.
A Deterministic Finite Automaton (DFA) for b*abb:
[Figure: start state 0 with a loop labeled b; 0 --a--> 1 --b--> 2 --b--> 3.]
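Simulating a DFA takes one table lookup per input character. The sketch below encodes the b*abb automaton as a transition table; the table is a reconstruction from the figure labels, state 3 is assumed to be the only final state, and a missing table entry is treated as "no transition possible", i.e. reject. The names `MOVE`, `START`, `FINAL` and `dfa_accepts` are invented for the example.

```python
# (state, symbol) -> next state, reconstructed from the b*abb example.
MOVE = {
    (0, "b"): 0,
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "b"): 3,
}
START = 0
FINAL = {3}

def dfa_accepts(text):
    """Run the DFA over text; accept only if it ends in a final state."""
    state = START
    for ch in text:
        if (state, ch) not in MOVE:   # no transition possible: reject
            return False
        state = MOVE[(state, ch)]
    return state in FINAL

for sample in ["abb", "bbabb", "ab", "abba", ""]:
    print(f"{sample!r:8} {dfa_accepts(sample)}")
```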
Non-deterministic Finite Automata (NFA)
• In a non-deterministic finite automaton:
1. From a state (node), there may be more than one edge labeled with the same symbol, and there may be no edge from a node labeled with a given input symbol
2. An edge can also be labeled with the empty symbol (ε)
A Non-deterministic Finite Automaton (NFA) for (a|b)*abb:
[Figure: start state 0 with loops labeled a and b; 0 --a--> 1 --b--> 2 --b--> 3.]
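An NFA can be simulated by keeping track of the set of all states it could be in. The sketch below does this for the (a|b)*abb automaton; the transition table is a reconstruction from the figure labels, state 3 is assumed to be the final state, and this particular NFA has no ε-edges, so no closure step is needed. The names `DELTA` and `nfa_accepts` are invented for the example.

```python
# (state, symbol) -> set of possible next states; state 0 has two a-edges.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def nfa_accepts(text):
    """Track every state the NFA could be in after each character."""
    current = {0}
    for ch in text:
        # Union of the targets of all edges labeled ch that leave a current state.
        current = set().union(*(DELTA.get((s, ch), set()) for s in current))
    return 3 in current

for sample in ["abb", "aabb", "babb", "abba", ""]:
    print(f"{sample!r:8} {nfa_accepts(sample)}")
```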
Another NFA
[Figure: an NFA whose start state has ε-transitions into two branches, one with edges labeled a and one with edges labeled b.]
An ε-transition is taken without consuming any character from the input.
What does the above NFA accept? aa* | bb*
NFA and DFA – A Comparison
• DFA
– No edges/transitions labeled with epsilon
– For each state and for each symbol in the alphabet, there is exactly one edge labeled with that symbol
– Slower to build but quicker to simulate
• NFA
– May have edges/transitions labeled with epsilon
– From a state (node), there may be more than one edge labeled with the same symbol, and there may be no edge from a node labeled with a given input symbol
– Quicker to build but slower to simulate
Relationship between DFA & NFA
• It is obvious that a DFA can be simulated with an NFA
• But what is not so obvious is that an NFA can be simulated with a DFA!
• How?
– Simulate sets of possible states
– Possible exponential blowup in the state space
– Still, maintain one state per character in the input stream
Automating a RE Recognizer Construction
• To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Note: The DFA construction is done automatically by a tool such as lex (more on this later).
Building NFA From Regular Expression
• Remember that a regular expression is formed by the use of:
– Basic symbols, and their
– Alternation,
– Concatenation, and
– Repetition.
• Hence, all we need to know is:
– How to build the NFA for each of the above (symbols & operations), and
– How to assemble the NFAs corresponding to these symbols into a composite NFA for the expression
Building NFA for Symbols & Operations
1. Building the NFA for a basic symbol a:
1. Start with an initial state i,
2. Draw an edge/transition labeled with the symbol a (this could be the epsilon symbol too!)
3. to the final state f
[Figure: start --> i --a--> f, and the same graph with an ε label.]
Building NFA for Symbols & Operations
2. Building the NFA for alternation N(s | t):
– Given two NFAs N(s) and N(t),
1. Construct a new start state i and a new final state f.
2. Add transitions from the start state i to the start states of N(s) and N(t), and label them with the epsilon symbol.
3. Add transitions from the final states of N(s) and N(t) to the final state f, and label them with the epsilon symbol.
[Figure: start --> i, with N(s) and N(t) in parallel between i and f.]
Building NFA for Symbols & Operations
3. Building the NFA for concatenation N(s.t) or N(st):
– Given two NFAs N(s) and N(t),
1. Construct a new start state i and a new final state f.
2. Overlap the start state of the latter [N(t)] with the final state of the former [N(s)].
3. From the start state, add an edge labeled with epsilon to the start state of N(s).
4. From the final state of N(s), add an epsilon transition to the start state of N(t) (this links the two NFAs when their states are not overlapped as in step 2).
[Figure: start --> i, with N(s) followed by N(t) between i and f.]
Building NFA for Symbols & Operations
4. Building the NFA for repetition N(s*):
1. Construct a new start state and a new final state.
2. Add an epsilon transition from the new start state to the new final state.
3. Add an epsilon transition from the new final state to the start state of N(s).
4. Add another epsilon transition from the final state of N(s) to the constructed final state.
[Figure: start --> i, with N(s) attached by ε-edges between i and f.]
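The four building rules can be sketched in Python as functions that return NFA fragments. Everything named here (`NFA`, `symbol`, `alternation`, `concatenation`, `repetition`, `closure`, `accepts`, and the use of the empty string as the ε label) is an assumption of this sketch. Alternation and repetition follow the steps listed above; for concatenation the sketch simply links the final state of N(s) to the start state of N(t) with an ε-edge and reuses the start of N(s) and the final state of N(t), rather than creating extra states.

```python
import itertools

EPS = ""                      # label used for epsilon edges in this sketch
_ids = itertools.count()      # source of fresh state numbers

class NFA:
    """A fragment with one start state, one final state and a list of edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges

def symbol(a):
    i, f = next(_ids), next(_ids)
    return NFA(i, f, [(i, a, f)])

def alternation(s, t):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, t.start),
             (s.final, EPS, f), (t.final, EPS, f)]
    return NFA(i, f, s.edges + t.edges + extra)

def concatenation(s, t):
    # Epsilon edge from the final state of N(s) to the start state of N(t).
    return NFA(s.start, t.final, s.edges + t.edges + [(s.final, EPS, t.start)])

def repetition(s):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, f), (f, EPS, s.start), (s.final, EPS, f)]
    return NFA(i, f, s.edges + extra)

def closure(nfa, states):
    """All states reachable from `states` using epsilon edges only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in nfa.edges:
            if src == q and label == EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    current = closure(nfa, {nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = closure(nfa, moved)
    return nfa.final in current

# (a|b).(a|b)
pair = concatenation(alternation(symbol("a"), symbol("b")),
                     alternation(symbol("a"), symbol("b")))
print(accepts(pair, "ab"), accepts(pair, "a"))
```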
Construction of NFA – Examples
(a|b).(a|b)
[Figures: the NFAs for (a) and (b) are combined into the NFA for (a|b); two copies of the NFA for (a|b) are then concatenated to give the NFA for (a|b).(a|b).]
Construction of NFA – Examples (Contd.)
[a-z][a-z0-9]*: a symbol followed by a repetition
[Figure: NFA built from the symbol edge a-z followed by the repetition of a-z and 0-9; return ID.]
[0-9]+ = [0-9][0-9]*: a symbol followed by a repetition
[Figure: NFA built from the symbol edge 0-9 followed by the repetition of 0-9; return NUM.]
Combining Several NFA's
[Figure: the NFAs for IF (states 2, 3), ID (states 4 to 8), NUM (states 9 to 13) and ERROR (states 14, 15, on any character) are combined through a common start state 1.]
Conversion of NFA to DFA
• A DFA can be constructed from the NFA, where each DFA state represents a set of states of the NFA
• Key idea: the state of the DFA after reading some input is the set of all states the NFA could have reached after reading the same input
• If the NFA has n states, the DFA will have at most 2^n states
• The resulting DFA may have more states than needed
• Let us study the conversion with an example
Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, with states 1 to 15.]
Q: What states can be reached from state 1 without consuming a character?
A: {1,4,9,14} form the ε-closure of state 1
Defn: Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state s ∈ T.
Converting NFA to DFA
What are ALL the state closures in this NFA?
ε-closure(1) = {1,4,9,14}
ε-closure(5) = {5,6,8}
ε-closure(8) = {6,8}
ε-closure(7) = {6,7,8}
ε-closure(10) = {10,11,13}
ε-closure(13) = {11,13}
ε-closure(12) = {11,12,13}
Converting NFA to DFA
• We already know that, given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state s ∈ T.
• We now define: given a set of NFA states T, move(T, a) is the set of states that are reachable on input a from any state s ∈ T.
• Now the problem definition: given an NFA, find the DFA with the minimum number of states that has the same behavior as the NFA for all inputs.
Converting NFA to DFA
1. Start with the initial state of the NFA (s0), and work out the set of states of the DFA, Dstates, initialized with a state representing ε-closure(s0).
Dstates = {1-4-9-14}
Converting NFA to DFA
Dstates = {1-4-9-14}
Now we need to compute:
move(1-4-9-14, a-h) = {5,15}
Then, ε-closure({5,15}) = {5,6,8,15}
New DFA transition: 1-4-9-14 --(a-h)--> 5-6-8-15
Converting NFA to DFA
Next we need to compute:
move(1-4-9-14, i) = {2,5,15}
Then, ε-closure({2,5,15}) = {2,5,6,8,15}
New DFA transition: 1-4-9-14 --i--> 2-5-6-8-15
Converting NFA to DFA
Next we need to compute:
move(1-4-9-14, j-z) = {5,15}
Then, ε-closure({5,15}) = {5,6,8,15}
New DFA transition: 1-4-9-14 --(j-z)--> 5-6-8-15
Converting NFA to DFA
Next we need to compute:
move(1-4-9-14, 0-9) = {10,15}
Then, ε-closure({10,15}) = {10,11,13,15}
New DFA transition: 1-4-9-14 --(0-9)--> 10-11-13-15
Converting NFA to DFA
Next we need to compute:
move(1-4-9-14, other) = {15}
Then, ε-closure({15}) = {15}
New DFA transition: 1-4-9-14 --other--> 15
Converting NFA to DFA
DFA transitions found so far from 1-4-9-14: a-h and j-z lead to 5-6-8-15, i leads to 2-5-6-8-15, 0-9 leads to 10-11-13-15, and any other character leads to 15.
The analysis for 1-4-9-14 is complete. We mark it and pick another state in the DFA to analyze.
Converted DFA
[Figure: the resulting DFA with states 1-4-9-14 (start), 5-6-8-15 (ID), 2-5-6-8-15 (ID), 3-6-7-8 (IF), 6-7-8 (ID), 10-11-13-15 (NUM), 11-12-13 (NUM) and 15 (error). Edge labels shown: a-h, j-z, i, f, 0-9, other, "a-e, g-z, 0-9" and "a-z, 0-9".]
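The conversion walked through above can be sketched in Python. Two things here are assumptions rather than facts from the slides: the NFA edges are reconstructed from the closures and moves listed in the walkthrough, and the character "?" is used as a stand-in for "any other character". DFA states are represented as frozensets of NFA state numbers, and empty sets (no transition possible) are not kept as DFA states.

```python
import string

LETTERS, DIGITS = string.ascii_lowercase, string.digits
ALPHABET = LETTERS + DIGITS + "?"      # "?" stands in for any other character

# Epsilon edges of the combined NFA (reconstructed from the listed closures).
EPSILON = {1: {4, 9, 14}, 5: {8}, 8: {6}, 7: {8}, 10: {13}, 13: {11}, 12: {13}}
# Labeled edges: (source state, accepted characters, target state).
LABELED = [
    (1, "i", 2), (2, "f", 3),                    # IF
    (4, LETTERS, 5), (6, LETTERS + DIGITS, 7),   # ID
    (9, DIGITS, 10), (11, DIGITS, 12),           # NUM
    (14, ALPHABET, 15),                          # ERROR on any character
]

def epsilon_closure(states):
    """States reachable from `states` through epsilon transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in EPSILON.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)

def move(states, ch):
    """States reachable on input ch from any state in `states`."""
    return {dst for src, chars, dst in LABELED if src in states and ch in chars}

def subset_construction(start=1):
    first = epsilon_closure({start})
    dstates, unmarked, dtran = {first}, [first], {}
    while unmarked:
        T = unmarked.pop()                 # mark T by removing it from the work list
        for ch in ALPHABET:
            U = epsilon_closure(move(T, ch))
            if not U:
                continue                   # no transition on ch from T
            dtran[(T, ch)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    return dstates, dtran

dstates, dtran = subset_construction()
for state in sorted(dstates, key=sorted):
    print("-".join(str(n) for n in sorted(state)))
```

Under these assumptions the sketch produces eight DFA states, the same sets that label the states of the converted DFA.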
Another Example of Conversion
[Figure: an NFA with states S0 to S11 and edges labeled a and b.]
The above NFA would result in the DFA below:
[Figure: a DFA with states {s0,s1,s2}, {s3,s5,s6,s7,s8,s9,s11} and {s4,s5,s6,s7,s8,s10,s11}, with edges labeled a and b.]
Systematically shrink the DFA
• The Big Picture
– Discover sets of equivalent states
– Represent each such set with just one state
• Two states are equivalent if and only if:
– The sets of paths leading to them are equivalent
– ∀ α ∈ Σ, transitions on α lead to equivalent states (DFA)
– States with α-transitions to distinct sets must be in distinct sets
• A partition P of S
– A collection of sets P s.t. each s ∈ S is in exactly one p_i ∈ P
– The algorithm iteratively partitions the DFA's states
Minimization
Group all the states together: {p0, p1, p2, p3, p4}.
Separate states according to available exit transitions.
Separate a set into two if from some of its states one can reach another set and from the others one cannot.
Repeat until no set can be separated any further.
[Figure: a DFA with states p0 to p4 and edges labeled a and b.]
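The "separate and repeat" idea can be sketched as partition refinement in Python. Note the assumptions: this sketch starts from the usual split into final and non-final states rather than from a single group, it requires a complete transition table (every state has an edge for every symbol), and the five-state example DFA for (a|b)*abb is a standard textbook one, not the p0 to p4 automaton in the figure. The names `minimize` and `DELTA` are invented for the example.

```python
def minimize(states, alphabet, delta, finals):
    """Return the partition of `states` into groups of equivalent states."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        def group_of(state):
            # Index of the current group that contains `state`.
            return next(i for i, g in enumerate(partition) if state in g)

        refined = []
        for group in partition:
            buckets = {}
            for s in group:
                # States stay together only if, for every symbol,
                # they lead into the same group.
                signature = tuple(group_of(delta[(s, a)]) for a in alphabet)
                buckets.setdefault(signature, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):   # nothing was separated: done
            return refined
        partition = refined

# Hypothetical example: a DFA for (a|b)*abb with states A..E, final state E.
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
groups = minimize("ABCDE", "ab", DELTA, {"E"})
print(sorted(sorted(g) for g in groups))
```

In this example states A and C end up in the same group, so the minimized DFA needs four states instead of five.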
Minimization
[Figure: the DFA with states p0 to p4 and edges labeled a and b.]
The above DFA can now be minimized as:
[Figure: the minimized DFA, with edges labeled a and b.]
Pseudo Code For Lexical Analyzer
function lexan: integer;
var lexbuf: array [0..100] of char;
    c: char;
begin
  loop begin
    read a character into c;
    if c is a blank or a tab then
      do nothing
    else if c is a newline then
      increment lineno
    else if c is a digit then begin
      set tokenval to the value of this and following digits;
      return NUM
    end
    else if c is a letter then begin
      place c and successive letters and digits into lexbuf;
      p := lookup(lexbuf);
      tokenval := p;
      return the token field of table entry p
    end
    else begin /* token is a single character */
      set tokenval to NONE; /* no attribute */
      return integer encoding of character c
    end
  end
end
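The pseudo code can be turned into a small runnable Python sketch. Several details are assumptions that the pseudo code leaves open: the token codes (`NUM`, `ID`, `DONE`, `NONE`), the `DONE` value returned at end of input, a `lookup` that inserts unknown names into the symbol table as identifiers, and plain ASCII input.

```python
NUM, ID, DONE, NONE = 256, 257, 260, -1    # hypothetical token codes

class Lexer:
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.lineno = 1
        self.tokenval = NONE
        self.symtable = []                 # list of (lexeme, token) entries

    def lookup(self, lexeme):
        """Return the table index of lexeme, inserting it as an ID if absent."""
        for index, (name, _) in enumerate(self.symtable):
            if name == lexeme:
                return index
        self.symtable.append((lexeme, ID))
        return len(self.symtable) - 1

    def lexan(self):
        while self.pos < len(self.text):
            c = self.text[self.pos]
            if c in " \t":                         # blank or tab: do nothing
                self.pos += 1
            elif c == "\n":                        # newline: count lines
                self.lineno += 1
                self.pos += 1
            elif c.isdigit():                      # this and following digits
                start = self.pos
                while self.pos < len(self.text) and self.text[self.pos].isdigit():
                    self.pos += 1
                self.tokenval = int(self.text[start:self.pos])
                return NUM
            elif c.isalpha():                      # letter, then letters and digits
                start = self.pos
                while self.pos < len(self.text) and self.text[self.pos].isalnum():
                    self.pos += 1
                p = self.lookup(self.text[start:self.pos])
                self.tokenval = p
                return self.symtable[p][1]         # token field of table entry p
            else:                                  # token is a single character
                self.tokenval = NONE               # no attribute
                self.pos += 1
                return ord(c)                      # integer encoding of c
        return DONE

lexer = Lexer("count = 42\n+ count")
token = lexer.lexan()
while token != DONE:
    print(token, lexer.tokenval, lexer.lineno)
    token = lexer.lexan()
```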
Building Lexical Analyzers Automatically
• The point to note is: the process studied so far is well suited for automation
1. The implementer writes down the regular expressions
2. A scanner generator builds the NFA, DFA and minimal DFA, and then writes out the (table-driven or direct-coded) code
3. This process reliably produces fast, robust lexical analyzers
• One such tool is Lex
Lex – A Tool for Generating Scanners
• A widely used tool for specifying lexical analyzers for a wide variety of languages.
How does it work?
1. The specification of a lexical analyzer is prepared by creating a program lex.l (containing REs) in the Lex language
2. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c (which contains a tabular representation of the state transition diagram)
3. lex.yy.c is run through the C compiler to produce the object code of the lexical analyzer, a.out
[Figure: Lex source program lex.l --> Lex compiler --> lex.yy.c --> C compiler --> a.out; input stream --> a.out --> sequence of tokens.]
Lex Functions
1. Translates the definitions into an automaton.
2. The automaton looks for the longest matching string.
3. It either returns some value to the reading program (parser), or looks for the next token.
4. Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
Lex Program Structure
• A Lex program (nothing but the specifications in lex.l) consists of THREE parts:
1. Declarations: this section includes declarations of variables and manifest constants.
2. Translation rules: this section includes the patterns (REs) and the corresponding actions to be taken.
3. Auxiliary procedures: this section includes whatever auxiliary procedures are needed.
The three sections are separated by lines beginning with %%
A Sample Lex Program
1)
%{
/* Remove uppercase letters. Commands to execute are
   lex test.l and gcc lex.yy.c -ll -o test */
%}
%%
[A-Z]+ ;
2)
%{
/* Line numbering (with flex, yylineno is only maintained if %option yylineno is set) */
%}
%%
^.*\n printf("%d\t%s", yylineno-1, yytext);
Any Questions?
Thank you
Regular Expression Construction
• Problem: Specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Solution: Start with symbols and keep defining regular sub-expressions until the final expression is reached.
RULE 1. digit → 0 | 1 | 2 | 3 | … | 9
RULE 2. digits → digit digit* (or digit+; together with the Kleene star this means one or more digits)
RULE 3. optional_fraction → '.' digits | epsilon
RULE 4. Num → digits optional_fraction
Note that we have used ALL the definitions of a regular expression.
Unsigned Number Validation Using Rules
• Let us derive the number from these rules:
RULE 1. digit → 0 | 1 | 2 | 3 | … | 9
RULE 2. digits → digit digit* (or digit+; together with the Kleene star this means one or more digits)
RULE 3. optional_fraction → '.' digits | epsilon
RULE 4. Num → digits optional_fraction
1 9 9 7 2 5 9 7. 3 6 . 1 4.
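The four rules can be composed step by step as Python pattern strings, each rule reusing the ones defined before it. This is only a sketch: the variable names mirror the rule names, the decimal point is escaped as `\.`, and "| epsilon" is expressed with the `?` operator.

```python
import re

digit = r"[0-9]"                          # RULE 1
digits = rf"{digit}{digit}*"              # RULE 2: one or more digits
optional_fraction = rf"(?:\.{digits})?"   # RULE 3: '.' digits | epsilon
num = rf"{digits}{optional_fraction}"     # RULE 4

def is_num(text):
    """True if the whole text can be derived from Num."""
    return re.fullmatch(num, text) is not None

for sample in ["1997", "19.97", "19.", ".97"]:
    print(f"{sample!r:8} {is_num(sample)}")
```

Under these rules a number must start with digits, and a decimal point must be followed by at least one digit, so "19." and ".97" are not derivable from Num.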
Regular Expression Construction
• Qn: How to write a regular expression for identifiers? (An identifier is a letter followed by zero or more letters or digits.)
• Answer:
1. Letter → a | A | b | B | … | z | Z
2. Digit → 0 | 1 | 2 | 3 | … | 9
3. Letter_or_Digit → Letter | Digit
4. Identifier → Letter Letter_or_Digit*
• One can define similar regular expressions for comments, strings, operators and delimiters (the different tokens of a language).
Grammar for a Tiny Language
• program ::= statement | program statement
• statement ::= assignStmt | ifStmt
• assignStmt ::= id = expr ;
• ifStmt ::= if ( expr ) statement
• expr ::= id | int | expr + expr
• id ::= a | b | c | i | j | k | n | x | y | z
• int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The rules of a grammar are also known as productions.