
Lexical Analysis

Overview

• Main task: to read input characters and group them into “tokens.”
• Secondary tasks:
– Skip comments and whitespace;
– Correlate error messages with the source program (e.g., line number of error).

Overview (cont’d)

Lexical Analysis: Terminology

• token: a name for a set of input strings with related structure.

Example: “identifier,” “integer constant”

• pattern: a rule describing the set of strings associated with a token.

Example: “a letter followed by zero or more letters, digits, or underscores.”

• lexeme: the actual input string that matches a pattern.

Example: count

Examples

Input: count = 123

Tokens:

identifier : Rule: “letter followed by …”

Lexeme: count

assg_op : Rule: =

Lexeme: =

integer_const : Rule: “digit followed by …”

Lexeme: 123
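The token/lexeme split in this example can be sketched in a few lines of Python. This is only an illustration: the table name `TOKEN_SPEC` and the function `tokenize` are hypothetical, the identifier pattern follows the "letter followed by zero or more letters, digits, or underscores" rule from the terminology slide, and the integer pattern assumes "one or more digits," since the slide elides the rest of that rule.

```python
import re

# Hypothetical token table: (token name, pattern). Token names follow the slide.
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z][A-Za-z0-9_]*"),  # letter, then letters/digits/underscores
    ("integer_const", r"[0-9]+"),              # assumed: one or more digits
    ("assg_op", r"="),
    ("skip", r"\s+"),                          # whitespace is skipped, not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("count = 123"))
# [('identifier', 'count'), ('assg_op', '='), ('integer_const', '123')]
```

Each pair holds the token (the name of the class) and the lexeme (the matched input text).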

Why tokens?

• Language syntax is specified with token types, not words (lexemes)

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op → +
7.    | -

S = goal

T = { number, id, +, - }

N = { goal, expr, term, op }

P = { 1, 2, 3, 4, 5, 6, 7}
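To illustrate why the grammar is written over token types, the sketch below checks a sequence of token types against productions 1 to 7. It relies on the observation that the left-recursive rule expr → expr op term | term generates exactly the sequences term (op term)*. The helper name `is_expr` is hypothetical, and this is a shape check for illustration, not a parser.

```python
TERM = {"number", "id"}   # right-hand sides of productions 4 and 5
OP = {"+", "-"}           # right-hand sides of productions 6 and 7

def is_expr(token_types):
    """True if the token-type sequence has the shape term (op term)*."""
    if len(token_types) % 2 == 0:      # empty, or ends with a dangling operator
        return False
    for i, t in enumerate(token_types):
        expected = TERM if i % 2 == 0 else OP
        if t not in expected:
            return False
    return True

# "x + 2 - y" and "a + 7 - b" have different lexemes but the same token types,
# so the grammar treats them identically.
print(is_expr(["id", "+", "number", "-", "id"]))   # True
print(is_expr(["id", "+"]))                        # False
```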

Specifying Tokens: regular expressions

• Terminology:

– alphabet : a finite set of symbols

– string : a finite sequence of alphabet symbols

– language : a (finite or infinite) set of strings.

• Regular Operations on languages:

Union: R ∪ S = { x | x ∈ R or x ∈ S }

Concatenation: RS = { xy | x ∈ R and y ∈ S }

Kleene closure: R* = R concatenated with itself 0 or more times
= {ε} ∪ R ∪ RR ∪ RRR ∪ …
= strings obtained by concatenating a finite number of strings from the set R.
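For finite languages the three operations can be tried out directly with Python sets. The function names below are hypothetical, and because R* is infinite whenever R contains a non-empty string, `kleene_upto` only computes the finite slice that uses at most n concatenations.

```python
def union(R, S):
    """R ∪ S: strings in R or in S."""
    return R | S

def concat(R, S):
    """RS: every string of R followed by every string of S."""
    return {x + y for x in R for y in S}

def kleene_upto(R, n):
    """Strings formed by concatenating at most n strings from R."""
    result = {""}          # zero concatenations gives the empty string ε
    layer = {""}
    for _ in range(n):
        layer = concat(layer, R)
        result |= layer
    return result

R = {"a", "b"}
S = {"0"}
print(sorted(union(R, S)))         # ['0', 'a', 'b']
print(sorted(concat(R, S)))        # ['a0', 'b0']
print(sorted(kleene_upto(S, 3)))   # ['', '0', '00', '000']
```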

Regular Expressions

A pattern notation for describing certain kinds of sets over strings:

Given an alphabet Σ:
– ε is a regular exp. (denotes the language {ε})
– for each a ∈ Σ, a is a regular exp. (denotes the language {a})
– if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
• (r) | (s) ( denotes the language L(r) ∪ L(s) )
• (r)(s) ( denotes the language L(r)L(s) )
• (r)* ( denotes the language L(r)* )

Common Extensions to r.e. Notation

• One or more repetitions of r : r+
• A range of characters : [a-zA-Z], [0-9]
• An optional expression: r?
• Any single character: .
• Giving names to regular expressions, e.g.:
– letter = [a-zA-Z_]
– digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
– ident = letter ( letter | digit )*
– integer_const = digit+
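The named definitions translate almost directly into Python's `re` syntax by composing pattern strings. This is a sketch: `matches` is a hypothetical helper, and `[0-9]` is used as shorthand for the ten-way alternation in the definition of digit.

```python
import re

# Named sub-patterns, composed the same way as on the slide.
letter = r"[a-zA-Z_]"
digit = r"[0-9]"
ident = rf"{letter}({letter}|{digit})*"
integer_const = rf"{digit}+"

def matches(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches(ident, "count_1"))       # True
print(matches(ident, "1abc"))          # False: must start with a letter
print(matches(integer_const, "123"))   # True
```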

Examples of Regular Expressions

Identifiers:

Letter → (a|b|c| … |z|A|B|C| … |Z)

Digit → (0|1|2| … |9)

Identifier → Letter ( Letter | Digit )*

Numbers:

[0-9]
[0-9]+
[0-9]*
[1-9][0-9]*
([1-9][0-9]*)|0
-?[0-9]+
[0-9]*\.[0-9]+
([0-9]+)|([0-9]*\.[0-9]+)
[eE][-+]?[0-9]+
([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?
-?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? )
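The last and most complete pattern on this slide can be tried out with Python's `re` module. The sketch below removes the spaces used for readability; `NUMBER` and `is_number` are hypothetical names. Note that, as written on the slide, the exponent part attaches only to the decimal alternative, so an input such as 1e5 is not matched.

```python
import re

# -?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? ) without the spaces
NUMBER = re.compile(r"-?(([0-9]+)|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?)")

def is_number(s):
    """True if the whole string s matches the number pattern."""
    return NUMBER.fullmatch(s) is not None

for s in ["42", "-3.14", ".5e-3", "1e5", "3."]:
    print(s, is_number(s))
# 42 True
# -3.14 True
# .5e-3 True
# 1e5 False
# 3. False
```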

Examples of Regular Expressions

Numbers:

Integer → (+|-|ε) (0| (1|2|3| … |9)(Digit*) )

Decimal → Integer . Digit*

Real → ( Integer | Decimal ) E (+|-|ε) Digit*

Complex → ( Real , Real )

Exercise of Regular Expressions

• Character strings

[a-z] [a-zA-Z] [a-zA-Z0-9]

[a-zA-Z][a-zA-Z0-9]*

• String literals

– "this is a string"

\".*\" <- wrong!!! why?

\"[^"]*\"

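• A few exercises: among the strings made up of 0s and 1s ...

– strings that start with 0: 0[01]*

– strings that start with 0 and end with 0: 0[01]*0

– strings in which 0 and 1 alternate: ______

– strings in which 0 never appears twice in a row: ______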
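Returning to the string-literal exercise above, the difference between the two patterns shows up as soon as a line contains two strings. The `.` also matches the double-quote character and `*` is greedy, so the first pattern runs on to the last quote on the line. The sample input below is made up for illustration.

```python
import re

text = 'x = "one" + "two"'

# Greedy ".*" swallows everything up to the LAST quote on the line.
print(re.findall(r'".*"', text))      # ['"one" + "two"']

# [^"]* cannot cross a quote, so each string literal is matched separately.
print(re.findall(r'"[^"]*"', text))   # ['"one"', '"two"']
```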

Recognizing Tokens: Finite Automata

A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
– Σ is a finite alphabet;
– Q is a finite set of states;
– T: Q × Σ → Q is the transition function;
– q0 ∈ Q is the initial state; and
– F ⊆ Q is a set of final states.

Finite Automata: An Example

A (deterministic) finite automaton (DFA) to match C-style comments:
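The slide's diagram is not reproduced here, so the following is one possible DFA for `/* ... */` comments, written as explicit state transitions. The state names and the function `is_c_comment` are assumptions made for this sketch, not taken from the slide.

```python
def is_c_comment(text):
    """Run a five-state DFA (plus an error state) over text; accept one complete comment."""
    state = "start"
    for ch in text:
        if state == "start":          # expect the opening '/'
            state = "slash" if ch == "/" else "error"
        elif state == "slash":        # expect '*' right after '/'
            state = "body" if ch == "*" else "error"
        elif state == "body":         # inside the comment
            state = "star" if ch == "*" else "body"
        elif state == "star":         # just saw '*' inside the comment
            if ch == "/":
                state = "done"
            elif ch == "*":
                state = "star"
            else:
                state = "body"
        else:                         # "done" or "error": any further input is rejected
            state = "error"
    return state == "done"

print(is_c_comment("/* a ** b */"))   # True
print(is_c_comment("/*/"))            # False: the comment is never closed
```

The "star" state is what lets the automaton handle runs of asterisks before the closing slash.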

Consider the problem of recognizing register names

Register → r (0|1|2| … |9) (0|1|2| … |9)*

• Allows registers of arbitrary number
• Requires at least one digit

RE corresponds to a recognizer (or DFA)

Transitions on other inputs go to an error state, se

Example 2

[Figure: Recognizer for Register. S0 goes to S1 on r; S1 goes to S2 on a digit (0|1|2| … |9); S2 loops to itself on a digit; S2 is the accepting state.]

DFA operation

• Start in state S0 & take transitions on each input character

• DFA accepts a word x iff x leaves it in a final state (S2)

So,

• r17 takes it through s0, s1, s2 and accepts

• r takes it through s0, s1 and fails

• a takes it straight to se


Example 2 (continued)

To be useful, the recognizer must be turned into code

Skeleton recognizer:

Char ← next character
State ← s0
while (Char ≠ EOF)
    State ← δ(State, Char)
    Char ← next character
if (State is a final state)
    then report success
    else report failure

Table encoding RE:

δ     r     0,1,2,3,4,5,6,7,8,9   All others
s0    s1    se                    se
s1    se    s2                    se
s2    se    s2                    se
se    se    se                    se
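The skeleton recognizer and its transition table can be written out as runnable Python. This is a sketch under a few assumptions: the names `DELTA`, `FINAL`, `char_class`, and `recognize` are made up here, end of input plays the role of EOF, and characters are first mapped to one of the three table columns.

```python
DIGITS = "0123456789"

def char_class(ch):
    """Map a character to its table column: 'r', 'digit', or 'other'."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"

# Transition table: DELTA[state][column] -> next state ("se" is the error state).
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}

def recognize(word):
    """Table-driven skeleton: one table lookup per input character."""
    state = "s0"
    for ch in word:                 # the loop ends when the input is exhausted
        state = DELTA[state][char_class(ch)]
    return state in FINAL

for w in ["r17", "r", "a"]:
    print(w, recognize(w))
# r17 True
# r False
# a False
```

Only the table is specific to the register language; the loop itself would stay the same for any DFA.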

What if we need a tighter specification?

r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31?

Write a tighter regular expression
– Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
– Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09

Produces a more complex DFA
• Has more states
• Same cost per transition
• Same basic implementation
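As a quick check of the tighter expression, it can be transcribed into Python's `re` syntax and run against r0 through r99. In this sketch `[0-9]?` stands in for the optional digit and `TIGHT` is a hypothetical name.

```python
import re

# r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
TIGHT = re.compile(r"r(([0-2][0-9]?)|[4-9]|(3|30|31))")

accepted = [f"r{n}" for n in range(100) if TIGHT.fullmatch(f"r{n}")]
print(accepted[0], accepted[-1], len(accepted))   # r0 r31 32

print(TIGHT.fullmatch("r07") is not None)   # True: r00 through r09 are also allowed
print(TIGHT.fullmatch("r32") is not None)   # False
```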

Tighter register specification (continued)

The DFA for Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )

• Accepts a more constrained set of registers

• Same set of actions, more states

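[Figure: DFA for the tighter Register specification. S0 goes to S1 on r; S1 goes to S2 on 0,1,2, to S5 on 3, and to S4 on 4,5,6,7,8,9; S2 goes to S3 on any digit (0|1|2| … |9); S5 goes to S6 on 0,1.]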

δ     r     0,1   2     3     4-9   All others
s0    s1    se    se    se    se    se
s1    se    s2    s2    s5    s4    se
s2    se    s3    s3    s3    s3    se
s3    se    se    se    se    se    se
s4    se    se    se    se    se    se
s5    se    s6    se    se    se    se
s6    se    se    se    se    se    se
se    se    se    se    se    se    se

Table encoding RE for the tighter register specification
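The same table-driven loop works for the tighter specification; only the table changes. The sketch below stores just the non-error entries and treats every missing entry as a move to se, which is a design choice made here rather than something the slides prescribe. The set of final states is an assumption derived from the regular expression (every state reached after a legal register number), and all names are hypothetical.

```python
def char_class(ch):
    """Map a character to one of the table's column labels."""
    if ch == "r":
        return "r"
    if ch in ("0", "1"):
        return "0,1"
    if ch in ("2", "3"):
        return ch
    if ch in ("4", "5", "6", "7", "8", "9"):
        return "4-9"
    return "other"

# Sparse transition table: any (state, column) pair not listed goes to "se".
DELTA = {
    ("s0", "r"): "s1",
    ("s1", "0,1"): "s2", ("s1", "2"): "s2", ("s1", "3"): "s5", ("s1", "4-9"): "s4",
    ("s2", "0,1"): "s3", ("s2", "2"): "s3", ("s2", "3"): "s3", ("s2", "4-9"): "s3",
    ("s5", "0,1"): "s6",
}
FINAL = {"s2", "s3", "s4", "s5", "s6"}   # assumed from the RE

def recognize(word):
    state = "s0"
    for ch in word:
        state = DELTA.get((state, char_class(ch)), "se")
    return state in FINAL

print([n for n in range(100) if recognize(f"r{n}")] == list(range(32)))   # True
print(recognize("r07"), recognize("r32"), recognize("r310"))              # True False False
```

A dense two-dimensional array gives the constant-time lookup the slides describe; the dictionary form trades a little speed for a table that is easier to read.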


Automating Scanner Construction

• RE → NFA (Thompson’s construction)
– Build an NFA for each term
– Combine them with ε-moves

• NFA → DFA (subset construction)
– Build the simulation

• DFA → Minimal DFA
– Hopcroft’s algorithm

• DFA → RE (Not part of the scanner construction)
– All pairs, all paths problem
– Take the union of all paths from s0 to an accepting state
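The NFA → DFA step can be sketched compactly. The code below is a minimal subset construction over a hand-written NFA for r d d*, where the symbol d stands in for "any digit." The NFA's state numbering, the use of the empty string as the ε label, and all function names are assumptions made for this illustration; Thompson's construction and Hopcroft's minimization are not shown.

```python
from collections import deque

EPS = ""  # label used here for ε-moves

# Hand-built NFA for r d d*: nfa[state][symbol] -> set of next states.
NFA = {
    0: {"r": {1}},
    1: {EPS: {2}},
    2: {"d": {3}},
    3: {EPS: {4}},
    4: {EPS: {5, 7}},
    5: {"d": {6}},
    6: {EPS: {5, 7}},
}
NFA_START, NFA_ACCEPTING = 0, {7}

def eps_closure(nfa, states):
    """All NFA states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(nfa, start, accepting, alphabet):
    """Each DFA state is a frozenset of NFA states."""
    d_start = eps_closure(nfa, {start})
    dfa, work = {}, deque([d_start])
    while work:
        S = work.popleft()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            moved = {t for s in S for t in nfa.get(s, {}).get(a, ())}
            if moved:
                T = eps_closure(nfa, moved)
                dfa[S][a] = T
                work.append(T)
    d_accepting = {S for S in dfa if S & accepting}
    return d_start, dfa, d_accepting

D_START, DFA, D_ACCEPTING = subset_construction(NFA, NFA_START, NFA_ACCEPTING, ["r", "d"])

def accepts(word):
    """Simulate the constructed DFA; a missing transition means rejection."""
    state = D_START
    for ch in word:
        state = DFA[state].get(ch)
        if state is None:
            return False
    return state in D_ACCEPTING

print(len(DFA))                                      # 4
print(accepts("rd"), accepts("rddd"), accepts("r"))  # True True False
```

The result has four states, two of them accepting and equivalent to each other; merging them is the job of the minimization step, which would leave a three-state DFA.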
