Upload
bonnie-mathews
View
221
Download
0
Embed Size (px)
Citation preview
어휘분석 (Lexical Analysis)
Overview
• Main task: to read input characters and group them into “tokens.”• Secondary tasks:
– Skip comments and whitespace;– Correlate error messages with source program (e.g., line number of
error).
Overview (cont’d)
Lexical Analysis: Terminology
• token: a name for a set of input strings with related structure.
Example: “identifier,” “integer constant”
• pattern: a rule describing the set of strings associated with a token.
Example: “a letter followed by zero or more letters, digits, or underscores.”
• lexeme: the actual input string that matches a pattern.
Example: count
Examples
Input: count = 123
Tokens:identifier : Rule: “letter followed by …”
Lexeme: count
assg_op : Rule: =
Lexeme: =
integer_const : Rule: “digit followed by …”
Lexeme: 123
Why token?• Language syntax is specified with token types, not
words (lexeme)
1. goal expr
2. expr expr op term3. | term
4. term number5. | id
6. op +7. | –
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7}
Specifying Tokens: regular expressions• Terminology:
– alphabet : a finite set of symbols
– string : a finite sequence of alphabet symbols
– language : a (finite or infinite) set of strings.
• Regular Operations on languages:Union: R S = { x | x R or x S}
Concatenation: RS = { xy | x R and y S}
Kleene closure: R* = R concatenated with itself 0 or more times
= {} R RR RRR
= strings obtained by concatenating a finite
number of strings from the set R.
Regular Expressions
A pattern notation for describing certain kinds of sets over strings:
Given an alphabet : is a regular exp. (denotes the language {})– for each a , a is a regular exp. (denotes the language
{a})– if r and s are regular exps. denoting L(r) and L(s) respect
ively, then so are:• (r) | (s) ( denotes the language L(r) L(s) )• (r)(s) ( denotes the language L(r)L(s) )• (r)* ( denotes the language L(r)* )
Common Extensions to r.e. Notation
• One or more repetitions of r : r+• A range of characters : [a-zA-Z], [0-9]• An optional expression: r?• Any single character: .• Giving names to regular expressions, e.g.:
– letter = [a-zA-Z_]– digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9– ident = letter ( letter | digit )*– Integer_const = digit+
Examples of Regular Expressions
Identifiers:Letter (a|b|c| … |z|A|B|C| … |Z)
Digit (0|1|2| … |9)
Identifier Letter ( Letter | Digit )*
Numbers:[0-9] [0-9]+ [0-9]*[1-9][0-9]* ([1-9][0-9]*)|0-?[0-9]+[0-9]*\.[0-9]+ ([0-9]+)|([0-9]*\.[0-9]+)[eE][-+]?[0-9]+ ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?-?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? )
Examples of Regular Expressions
Numbers:Integer (+|-|) (0| (1|2|3| … |9)(Digit *) )
Decimal Integer . Digit *
Real ( Integer | Decimal ) E (+|-|) Digit *
Complex ( Real , Real )
Exercise of Regular Expressions
• 문자열[a-z] [a-zA-Z] [a-zA-Z0-9]
[a-zA-Z][a-zA-Z0-9]*
• 스트링– "this is a string"
\".*\" <- wrong!!! why?
\"[^"]*\"
• 몇가지 연습0 과 1 로 이루어진 문자열 중에서 ...
– 0 으로 시작하는 문자열 0[01]*
– 0 으로 시작해서 0 으로 끝나는 문자열 0[01]*0
– 0 과 1 이 번갈아 나오는 문자열 ______
– 0 이 두번 계속 나오지 않는 문자열 ______
Recognizing Tokens: Finite Automata
A finite automaton is a 5-tuple (Q, , T, q0, F), where: is a finite alphabet;– Q is a finite set of states;– T: Q Q is the
transition function;– q0 Q is the initial state;
and– F Q is a set of final
states.
Finite Automata: An Example
A (deterministic) finite automaton (DFA) to match C-style comments:
Consider the problem of recognizing register names
Register r (0|1|2| … | 9) (0|1|2| … | 9)*
• Allows registers of arbitrary number• Requires at least one digit
RE corresponds to a recognizer (or DFA)
Transitions on other inputs go to an error state, se
Example 2
S0 S2 S1
r
(0|1|2| … 9)
accepting state
(0|1|2| … 9)
Recognizer for Register
DFA operation
• Start in state S0 & take transitions on each input character
• DFA accepts a word x iff x leaves it in a final state (S2 )
So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se
Example 2 (continued)
S0 S2 S1
r
(0|1|2| … 9)
accepting state
(0|1|2| … 9)
Recognizer for Register
Example 2 (continued)
To be useful, recognizer must turn into code
sesesese
ses2ses2
ses2ses1
seses1s0
All others
0,1,2,3,4,5,6,7,8,9rChar next character
State s0
while (Char EOF) State (State,Char) Char next character
if (State is a final state ) then report success else report failure
Skeleton recognizer
Table encoding RE
r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31 ?
Write a tighter regular expression– Register r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
– Register r0|r1|r2| … |r31|r00|r01|r02| … |r09
Produces a more complex DFA
• Has more states
• Same cost per transition
• Same basic implementation
What if we need a tighter specification?
Tighter register specification (continued)
The DFA forRegister r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
• Accepts a more constrained set of registers
• Same set of actions, more states
S0 S5 S1
r
S4
S3
S6
S2
0,1,2
3 0,1
4,5,6,7,8,9
(0|1|2| … 9)
seseseseseS1s0
sesesesesesese
seseseseseses6
seseseses6ses5
seseseseseses4
seseseseseses3
ses3s3s3s3ses2
ses4s5s2s2ses1
Allothers4-9320,1r
Table encoding RE for the tighter register specification
Tighter register specification (continued)
Automating Scanner Construction
• RE→ NFA (Thompson’s construction)– Build an NFA for each term– Combine them with ε-moves
• NFA → DFA (subset construction)– Build the simulation
• DFA → Minimal DFA– Hopcroft’s algorithm
• DFA →RE (Not part of the scanner construction)– All pairs, all paths problem– Take the union of all paths from s0 to an accepting state