31
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

Embed Size (px)

DESCRIPTION

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 3 Administration Web page has moved:

Citation preview

Page 1: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS412/413

Introduction toCompilers and Translators

Spring ’99

Lecture 2: Lexical Analysis

Page 2: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

2

Outline• Administration• Compilation in a nutshell (or two)• What is lexical analysis?• Specifying tokens: regular expressions• Writing a lexer• Writing a lexer generator

– Converting regular expressions to Non-deterministic finite automata (NFAs)

– NFA to DFA transformation

Page 3: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

3

Administration• Web page has moved:

www.cs.cornell.edu/cs412-sp99/

Page 4: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

4

Compilation in a Nutshell 1Source code(character stream)

Lexical analysis

Parsing

Token stream

Abstract syntax tree(AST)

Semantic Analysis

if (b == 0) a = b;

if ( b ) a = b ;0==

if==

b 0=

a b

if==

int b int 0=

int alvalue

int b

booleanDecorated AST

int ;

;

Page 5: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

5

Compilation in a Nutshell 2

Intermediate Code Generation

Optimization

Code generation

if==

int b int 0=

int alvalue

int b

boolean int ;

CJUMP ==MEM

fp 8+

CONST MOVE0 MEM MEM

fp 4 fp 8

NOP

+ +

CJUMP ==CONST

MOVE0 D

XCX

NOP

CX CMP CX, 0

CMOVZ DX,CX

Page 6: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

6

Lexical AnalysisSource code(character stream)

Lexical analysis

Parsing

Token stream

Semantic Analysis

if (b == 0) a = b;

if ( b ) a = b ;0==

Page 7: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

7

What is Lexical Analysis?• Converts character stream to token streamif (x1 * x2<1.0) {

y = x1;}

i f ( x 1 * x 2 < 1 . 0 ) { \n

Keyword: if ( Id: x1 * Id: x2 < Num: 1.0 ) { Id: y

Page 8: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

8

Tokens• Identifiers: x y11 elsex _i00• Keywords: if else while break• Integers: 2 1000 -500 5L• Floating point: 2.0 0.00020 .02 1. 1e5• Symbols: + * { } ++ < << [ ] >=• Strings: “x” “He said, \“Are you?\””• Comments: /** comment **/

Page 9: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

9

Problems• How to describe tokens

2.e0 20.e-01 2.0000• How to break text up into tokens

if (x == 0) a = x<<1; iff (x == 0) a = x<1;

• How to tokenize efficiently– tokens may have similar prefixes– want to look at each character ~1 time

Page 10: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

10

How to Describe Tokens• Programming language tokens can be

described as regular expressions• A regular expression R describes some set

of strings L(R)– L(abc) = { “abc” }

• e.g., tokens for floating point numbers, identifiers, etc.

Page 11: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

11

Regular Expression Notation

a ordinary character stands for itself the empty stringR|S any string from either L(R) or L(S)RS string from L(R) followed by one from L(S)R* zero or more strings from L(R),

concatenated |R|RR|RRR|RRRR …

Page 12: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

12

RE ShorthandR+ one or more strings from L(R): R(R*)R? optional R: (R| )[a-z] one character from this range: (a|b|c|d|e|...)[^a-z] one character not from this range

Page 13: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

13

ExamplesRegular Expression Strings in L(R)

a “a”ab “ab”a | b “a” “b”ε “”(ab)* “” “ab” “abab” …(a | ε) b “ab” “b”

Page 14: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

14

More Practical ExamplesRegular Expression Strings in L(R)

digit = [0-9] “0” “1” “2” “3” …posint = digit+ “8” “412” …int = -? posint “-42” “1024” …real = int (ε | (. posint)) “-1.56” “12” “1.0”

[a-zA-Z_][a-zA-Z0-9_]* C, Java identifiers

Page 15: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

15

How to break up textelsex = 0;

• Most languages: longest matching token wins

• Exception: FORTRAN (totally whitespace-insensitive)

• Implication of longest matching token rule:– automatically convert RE’s into lexer

else x =elsex =

00

12

Page 16: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

16

• Scan text one character at a time• Use look-ahead character to determine

when current token ends

while (identifierChar(next)) {token[len++] = next;next = input.read ();

}

Implementing a lexer

e l s e x

Page 17: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

17

Problems• What if it’s a keyword?

– Can look up identifier in hash table containing all keywords

• Verbose, spaghetti code; not easily maintainable or extensible– suppose add new floating point representation

(as in C) -- may need to rip up existing tokenizer code

Page 18: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

18

Solution: Lexer Generator• Input to lexer generator:

– set of regular expressions– associated action for each RE (generates

appropriate kind of token, other bookkeeping).• Output:

– program that reads an input stream and breaks it up into tokens according to the REs. (Or reports lexical error -- “Unexpected character” )

• Examples: lex, flex, JLex

Page 19: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

19

How does it work?• Translate REs into one non-deterministic finite

automaton (NFA)• Convert NFA into an equivalent DFA• Minimize DFA (to reduce # states)• Emit code driven by DFA tables• Advantage: DFAs efficient to implement

– inner loop: look up next state using current state & look-ahead character

Page 20: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

20

RE NFA-?[0-9]+ (-|) [0-9][0-9]*

— 0,1,2..

0,1,2..

NFA: multiple arcs may have same label, transitions don’teat input

Page 21: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

21

NFA minimized DFA

0-9

0-9

0-9

• Arcs may not conflict, no transitions

Page 22: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

22

DFA table

0-9

0-9

0-9

- 0-9 other A B C ERRB ERR C ERRC ERR C A

A

B

C

Page 23: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

23

Lexer Generator OutputToken nextToken() {

state = START; // reset the DFAwhile (nextChar != EOF) {

next = table [state][nextChar];if (next == START) return new Token(state);state = next;nextChar = input.read( );

}}

+table+user actions in Token()

Page 24: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

24

Converting an RE to an NFA

a

εε

rr

ss

If r and s are regular expressions with the NFA’s

a

Page 25: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

25

Inductive Constructionr s

r s

r | s r

s

ε

ε ε

ε

r*r

ε

ε

εε

ε

Page 26: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

26

Converting NFA to DFA• NFA inefficient to implement directly, so

convert to a DFA that recognizes the same strings

• Idea:– NFA can be in multiple states simultaneously– Each DFA state corresponds to a distinct set of

NFA states– n-state NFA may be 2n state DFA in theory, not

in practice (construct states lazily & minimize)

Page 27: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

27

NFA states to DFA states

W X

Y0,1,2..

Z0,1,2..

• Possible sets of states:W&X, X, Y&Z

• DFA states: A B C

0-9

0-9

0-9A

B

C

Page 28: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

28

Handling multiple token REs

whitespace

identifier

number

operators

keywords

NFA DFA

Page 29: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

29

Running DFA• On character not in transition table

– If in final state, return token & reset DFA– Otherwise report lexical error

Page 30: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

30

Summary• Lexical analyzer converts a text stream to

tokens• Tokens defined using regular expressions• Regular expressions can be converted to a

simple table-driven tokenizer by converting them to NFAs and then to DFAs.

• Have covered chapters 1-2 from Appel

Page 31: CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers

31

Groups• If you haven’t got a full group, hang around

after lecture and talk to prospective group members

• If you find a group, fill in group handout and submit it

• Send mail to cs412 if you still cannot make a full group

• Submit questionnaire today in any case