Lexical Analysis Lecture 2 Mon, Jan 19, 2004




Tokens

A token has a type and a value.

Types include ID, NUM, ASSGN, LPAREN, etc.

Values are used primarily with identifiers and numbers.

If we read “count”, the type is ID and the value is “count”.

If we read “123.45”, the type is NUM and the value is “123.45”.
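
A token of this kind can be sketched as a small record. This is only an illustration: the class name `Token` and its field names are assumptions, not something the slides define.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    type: str                    # e.g. "ID", "NUM", "ASSGN", "LPAREN"
    value: Optional[str] = None  # the text read; mainly useful for ID and NUM


count = Token("ID", "count")     # reading "count" gives type ID, value "count"
number = Token("NUM", "123.45")  # reading "123.45" gives type NUM, value "123.45"
lparen = Token("LPAREN")         # punctuation carries no separate value here
```

Keeping the value optional reflects the point that values matter primarily for identifiers and numbers.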


Analyzing Tokens

Each type of token can be described by a regular expression. (Why?)

Therefore, the set of all tokens can be described by a regular expression. (Why?)

The language described by any regular expression is accepted by some DFA.

Therefore, the tokens can be processed and accepted by a DFA.


Regular Expressions

The set of all regular expressions may be defined in two parts.

The basic part.

ε represents the language {ε}.

a represents the language {a} for every a ∈ Σ.

Call these languages L(ε) and L(a), respectively.


Regular Expressions

The recursive part. Let r and s denote regular expressions.

r | s represents the language L(r) ∪ L(s).

rs represents the language L(r)L(s).

r* represents the language L(r)*.

In other words,

L(r | s) = L(r) ∪ L(s).

L(rs) = L(r)L(s).

L(r*) = L(r)*.
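
The three operations can be tried out on small finite languages represented as Python sets of strings. This is a sketch only: the function names are made up for illustration, and because L(r)* is infinite whenever L(r) contains a nonempty string, `star` takes an assumed `max_len` bound and returns just the strings up to that length.

```python
def union(A, B):
    """L(r | s): every string that is in A or in B."""
    return A | B


def concat(A, B):
    """L(rs): every string xy with x in A and y in B."""
    return {x + y for x in A for y in B}


def star(A, max_len):
    """Strings of A* whose length is at most max_len."""
    result = {""}        # the empty string is always in A*
    frontier = {""}      # strings found in the previous round
    while frontier:
        longer = {w for w in concat(frontier, A) if len(w) <= max_len}
        frontier = longer - result   # keep only strings not seen before
        result |= frontier
    return result
```

For example, `star({"a"}, 3)` gives the four strings of a's with length 0 through 3.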


Example: Identifiers

Identifiers in C++ can be represented by a regular expression (ignoring the underscore, which C++ also permits, for simplicity).

r = A | B | … | Z | a | b | … | z

s = 0 | 1 | … | 9

t = r(r | s)*


Regular Expressions

A regular definition of a regular expression is a “grammar” of the form

d1 → r1,

d2 → r2,

:

dn → rn,

where each ri is a regular expression over Σ ∪ {d1, …, d(i−1)}.


Regular Expressions

Note that this definition does not allow recursively defined tokens.

In other words, di cannot be defined in terms of di, even indirectly.


Example: Identifiers

We may now describe C++ identifiers as follows.

letter → A | B | … | Z | a | b | … | z

digit → 0 | 1 | … | 9

id → letter(letter | digit)*
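
The same regular definition can be written with Python's `re` module, building `id` out of the earlier names just as the definition does. The variable names and the helper `is_identifier` are illustrative assumptions; character classes such as `[A-Za-z]` stand in for the long alternations.

```python
import re

letter = r"[A-Za-z]"                        # letter -> A | B | ... | z
digit = r"[0-9]"                            # digit  -> 0 | 1 | ... | 9
ident = rf"{letter}(?:{letter}|{digit})*"   # id -> letter(letter | digit)*

ID_RE = re.compile(ident)


def is_identifier(s):
    """True when the whole string s matches the id pattern."""
    return ID_RE.fullmatch(s) is not None
```

Each name is defined only in terms of names that come before it, which is exactly the restriction that rules out recursive definitions.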


Lexical Analysis

After writing a regular expression for each kind of token, we may combine them into one big regular expression describing all tokens.

id → letter(letter | digit)*

num → digit(digit)*

relop → < | > | == | != | >= | <=

token → id | num | relop | …
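
As a rough sketch, the combined expression can be written as one Python pattern with a named group per kind of token. Several things here are assumptions rather than part of the lecture: the function name `tokenize`, the choice to skip whitespace between tokens, and the ordering of the relop alternatives. Python's `re` tries alternatives left to right instead of picking the longest match, so the two-character operators are listed before `<` and `>`.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<ID>    [A-Za-z][A-Za-z0-9]* )
  | (?P<NUM>   [0-9]+ )
  | (?P<RELOP> == | != | >= | <= | < | > )
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (type, value) pairs for the tokens in text."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():      # assumption: whitespace separates tokens
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append((m.lastgroup, m.group()))  # lastgroup names the alternative that matched
        pos = m.end()
    return tokens
```

Calling `tokenize("count <= 10")` yields an ID, a RELOP, and a NUM in that order.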


Transition Diagrams

A regular expression may be represented by a transition diagram.

The transition diagram provides a good guide to writing a lexical analyzer program.


Example: Transition Diagram

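[Figure: three transition diagrams.]

id: an edge labeled letter leads to an accepting state that loops on letter | digit.

num: an edge labeled digit leads to an accepting state that loops on digit.

token: the id and num diagrams combined, sharing one start state.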
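
One way to turn the combined diagram into a program is a transition table keyed by state and character class. This is a minimal sketch: the state names, the `char_class` helper, and the function `classify` are assumptions made for illustration.

```python
def char_class(c):
    """Map a character to the edge label used in the diagram."""
    if c.isascii() and c.isalpha():
        return "letter"
    if c.isascii() and c.isdigit():
        return "digit"
    return "other"


START, IN_ID, IN_NUM = 0, 1, 2

TRANSITIONS = {
    (START, "letter"): IN_ID,    # first letter of an identifier
    (START, "digit"): IN_NUM,    # first digit of a number
    (IN_ID, "letter"): IN_ID,    # loop on letter | digit
    (IN_ID, "digit"): IN_ID,
    (IN_NUM, "digit"): IN_NUM,   # loop on digit
}

ACCEPTING = {IN_ID: "ID", IN_NUM: "NUM"}


def classify(s):
    """Return "ID" or "NUM" if the whole string is accepted, else None."""
    state = START
    for c in s:
        state = TRANSITIONS.get((state, char_class(c)))
        if state is None:        # no edge for this character: reject
            return None
    return ACCEPTING.get(state)
```

Because each kind of token ends in a different accepting state, the state itself tells us which token was read.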


Transition Diagrams

Unfortunately, it is not that simple. At what point may we stop in an accepting state?

Do not read “count” as 5 identifiers: “c”, “o”, “u”, “n”, “t”.

When we stop in an accepting state, we must be able to determine the type of token processed.

Did we read the ID token “count” or did we read the IF token “if”?
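
Both questions can be handled in a few lines. The sketch below keeps reading while the input still fits letter(letter | digit)*, so “count” is not split into five identifiers, and then consults a keyword table to separate IF from ID. The table lookup is a common technique but an assumption here, since the slides do not say how keywords are recognized; the names `KEYWORDS` and `scan_word` are also hypothetical.

```python
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE"}  # illustrative subset


def is_letter(c):
    return c.isascii() and c.isalpha()


def is_digit(c):
    return c.isascii() and c.isdigit()


def scan_word(text, pos):
    """Scan the longest identifier-shaped lexeme starting at pos.

    Returns (type, lexeme, end) or None if no word starts at pos.
    """
    end = pos
    if end < len(text) and is_letter(text[end]):
        end += 1
        # Keep going as long as the next character extends the word.
        while end < len(text) and (is_letter(text[end]) or is_digit(text[end])):
            end += 1
    lexeme = text[pos:end]
    if not lexeme:
        return None
    # A reserved word gets its own type; anything else is an ID.
    return (KEYWORDS.get(lexeme, "ID"), lexeme, end)
```

Note that `scan_word("iffy", 0)` is reported as an ID, because the keyword check is applied to the whole lexeme and not to a prefix.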


Transition Diagrams

Consider transition diagrams to accept relational operators ==, !=, <, >, <=, and >=.

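[Figure: one two-edge diagram per operator.]

==: an edge labeled = followed by an edge labeled =.

!=: an edge labeled ! followed by an edge labeled =.

<=: an edge labeled < followed by an edge labeled =.

and so on.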


Transition Diagrams

Combine them into a single transition diagram.

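[Figure: relop diagram with states 1 through 4.]

State 1 goes to state 2 on = | !, and state 2 goes to state 4 on =.

State 1 goes to state 3 on < | >, and state 3 goes to state 4 on =.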


Transition Diagrams

When we reach an accepting state, how can we tell which operator was processed?

In general, we design the diagram so that each kind of token has its own accepting state.


Transition Diagrams

If we reach accepting state #3, how do we decide whether to continue to accepting state #4?

We read characters until the current character does not match any pattern.

Then we “back up” to the previous accepting state (if there is one) and accept the token.
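
The read-ahead-then-back-up rule can be sketched for the relational operators as follows. The state names and the token names LT, GT, EQ, NE, LE, and GE are assumptions chosen for illustration; each operator gets its own accepting state so that the state identifies the token.

```python
TRANSITIONS = {
    ("start", "="): "eq1",    # a single "=" is not a relop in this sketch
    ("start", "!"): "bang",
    ("start", "<"): "lt",
    ("start", ">"): "gt",
    ("eq1", "="): "eq",
    ("bang", "="): "ne",
    ("lt", "="): "le",
    ("gt", "="): "ge",
}

ACCEPTING = {"lt": "LT", "gt": "GT", "eq": "EQ", "ne": "NE", "le": "LE", "ge": "GE"}


def scan_relop(text, pos):
    """Return (type, lexeme, end) for the longest relop at pos, or None."""
    state = "start"
    last_accept = None           # most recent accepting state and where it ended
    i = pos
    while i < len(text):
        nxt = TRANSITIONS.get((state, text[i]))
        if nxt is None:          # current character matches no edge: stop reading
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:      # never passed through an accepting state
        return None
    kind, end = last_accept      # back up to the previous accepting state
    return (kind, text[pos:end], end)
```

On the input `<=3` the scanner passes through the accepting state for `<`, continues to the one for `<=`, fails on `3`, and accepts `<=`.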


Transition Diagrams

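relop:

[Figure: the final relop transition diagram. Edge labels recoverable from the figure: =, !, <, and > out of the start state; four edges labeled =; six edges labeled other.]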