Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker

Lexical Analysis is Pattern Matching

• From a sequence of characters to a sequence of lexemes, each labeled with its token type, e.g.
  – “public static void main(char[] args)” ->
  – <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen>

• Patterns are simpler (easy grammars), e.g.
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z
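
The mapping above can be sketched with java.util.regex. This is only an illustration, not the scanner developed later in these slides: the class name TokenSketch and the method tokenize are hypothetical, and <id> is assumed to be a run of lowercase letters, as in the grammar above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenSketch {
    // One named group per token kind; whitespace is matched but not reported.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<id>[a-z]+)|(?<lparen>\\()|(?<rparen>\\))"
        + "|(?<lsquare>\\[)|(?<rsquare>\\])|(?<ws>\\s+)");

    private static final String[] KINDS =
        {"id", "lparen", "rparen", "lsquare", "rsquare"};

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // lookingAt() must match starting exactly at the region start
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    out.add("<" + kind + ">");
                }
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen>
        System.out.println(String.join(" ",
            tokenize("public static void main(char[] args)")));
    }
}
```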

Regular Grammars

• Subset of Context Free Grammars
• Every rule contains at most one non-terminal symbol on its right-hand side (or can be rewritten so it does…)

Rewritten Grammar for ID

• Original:
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z

• Rewrite:
  <id> -> (a | b | c | … | z) <id> | (a | b | c | … | z)

• Fully expanded (52 rules):
  <id> -> a <id> | b <id> | c <id> | … | a | b | c | … | z
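
As a quick check of the count, the expansion can be generated mechanically. This is a small sketch; the class name ExpandIdRules and the method expand are hypothetical. It emits 26 rules of the form <id> -> x <id> and 26 of the form <id> -> x.

```java
import java.util.ArrayList;
import java.util.List;

class ExpandIdRules {
    // Builds the fully expanded rule list for <id> over the letters a..z.
    static List<String> expand() {
        List<String> rules = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.add("<id> -> " + c + " <id>");   // letter followed by more of the id
        }
        for (char c = 'a'; c <= 'z'; c++) {
            rules.add("<id> -> " + c);             // letter that ends the id
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(expand().size());       // prints 52
    }
}
```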

Parsing using a Regular Grammar

1. Transform the grammar into a state machine
2. Implement the state machine in a computer program
   – By hand
   – Automatically, using table-lookup
3. Run this program on input strings

What is a State Machine?

• State machine abstraction
  – At any time, the process is in a “state”
  – Each time an “event” happens, the process takes an “action” and goes to the next state
  – We can describe the entire algorithm as a diagram where each state has an arrow for each event/action pair to the next appropriate state

State Machine for a Kitten


[Diagram: states Hungry and Sleeping; arrows labeled event / action: Food available / Eat, Toys available / Play, X hrs passed / Awaken]

State Machine for a Language

• Each “event” processes an input symbol
• Two important special states
  – Initial state: state the machine is in before the first symbol
  – Final state: state the machine is in whenever the sequence of symbols up to now is in the language

Transforming a Regular Grammar to a State Machine

• Put the grammar into a form so every rule is
  <nonterm1> -> symbol <nonterm2>
  <nonterm1> -> symbol
• Make a state for each nonterminal
• Make a transition (arrow) for each rule. The transition goes from <nonterm1> to <nonterm2> based on the symbol.
• The start symbol of the grammar is the initial state.
• There is one final state; every rule that doesn’t have a nonterminal on the right goes to it.

State Machine Example

• <id> -> a <id> | b <id> | a | b

• Two states: id (initial) and f (final)
• Example: aabba
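
The construction can be sketched in Java as follows. This is a sketch under assumptions: the class IdNfa and its methods addRule, accepts, and example are hypothetical names, and rules without a nonterminal on the right are entered with the final state f as their target. Because both a and b lead from id to either id or f, the machine is nondeterministic, so the simulation tracks the set of states it could be in.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IdNfa {
    static final String FINAL = "f";   // the single final state

    // transitions.get(state).get(symbol) = set of possible next states
    private final Map<String, Map<Character, Set<String>>> transitions = new HashMap<>();

    // Rule <from> -> symbol <to>; pass FINAL as 'to' for a rule <from> -> symbol
    void addRule(String from, char symbol, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(symbol, k -> new HashSet<>())
                   .add(to);
    }

    // Follows every possible transition; accepts if f is reachable at the end.
    boolean accepts(String start, String input) {
        Set<String> current = new HashSet<>();
        current.add(start);
        for (char ch : input.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String state : current) {
                Map<Character, Set<String>> row = transitions.get(state);
                if (row != null && row.containsKey(ch)) {
                    next.addAll(row.get(ch));
                }
            }
            current = next;
        }
        return current.contains(FINAL);
    }

    // The machine for <id> -> a <id> | b <id> | a | b
    static IdNfa example() {
        IdNfa nfa = new IdNfa();
        nfa.addRule("id", 'a', "id");
        nfa.addRule("id", 'b', "id");
        nfa.addRule("id", 'a', FINAL);
        nfa.addRule("id", 'b', FINAL);
        return nfa;
    }

    public static void main(String[] args) {
        System.out.println(example().accepts("id", "aabba")); // true
        System.out.println(example().accepts("id", ""));      // false
    }
}
```

On aabba the set of possible states is {id, f} after every symbol, so the string is accepted.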

Simpler State Machine

• This is a cleaner version of the other machine. Each (state, character) combination has only one next state.

• It is called a DFA (deterministic finite automaton)
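
Since the diagram is not reproduced here, the following is an assumed equivalent DFA for <id> -> a <id> | b <id> | a | b, written as a lookup table. The names IdDfa, NEXT, ACCEPT, and DEAD are hypothetical. Each state/character pair has exactly one entry, which is what makes the machine deterministic.

```java
class IdDfa {
    static final int START = 0, ACCEPT = 1, DEAD = 2;
    static final String ALPHABET = "ab";

    // NEXT[state][column], where column = ALPHABET.indexOf(symbol)
    static final int[][] NEXT = {
        {ACCEPT, ACCEPT},  // START:  any letter begins an id
        {ACCEPT, ACCEPT},  // ACCEPT: further letters extend the id
        {DEAD,   DEAD}     // DEAD:   no way back once an illegal symbol was seen
    };

    static boolean accepts(String input) {
        int state = START;
        for (char ch : input.toCharArray()) {
            int col = ALPHABET.indexOf(ch);
            state = (col < 0) ? DEAD : NEXT[state][col];
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba")); // true
        System.out.println(accepts(""));      // false
    }
}
```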

Lexical Analysis for Integer Expressions

From DFA to Program

Method doScan() reads tokens from an input stream (assume System.in for now) and creates a list of them in order.

Method lex(s) scans and returns a single Token from a stream.

A Token consists of a type (e.g. INT) and a string (e.g. “1234”)
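
A minimal sketch of such a Token class is shown below. Only the type/string pairing and the getType accessor come from these slides; the accessor getLexeme and the toString format are assumptions made for illustration.

```java
class Token {
    private final int type;       // e.g. INT
    private final String lexeme;  // e.g. "1234"

    Token(int type, String lexeme) {
        this.type = type;
        this.lexeme = lexeme;
    }

    int getType() {
        return type;
    }

    String getLexeme() {
        return lexeme;
    }

    @Override
    public String toString() {
        return "Token(" + type + ", \"" + lexeme + "\")";
    }
}
```

For example, with INT numbered 1, new Token(1, "1234") pairs the INT type with the lexeme 1234.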


Defining Constants

// Number all the states
public static final int NUMSTATES = 5;  // five states, numbered 0 through 4
public static final int START = 0;
public static final int INT = 1;
public static final int ID = 2;
public static final int UNK = 3;
public static final int ERR = 4;


Constructing Transition Table (in constructor)

String chars = "01234abcdef+-()";
// tt[state][column], where column = chars.indexOf(character)
int[][] tt = new int[NUMSTATES][chars.length()];
tt[ID][5] = ID;      // 'a'
tt[ID][6] = ID;      // 'b'
tt[START][5] = ID;   // 'a'
tt[START][1] = INT;  // '1'
// … etc …
tt[ID][0] = ERR;     // '0'
// … etc …

Recognizing Final States

// For this grammar, all states but ERR are final.
// Usually, this method is a bit more complex.
// (Named isFinal because final is a reserved word in Java.)
boolean isFinal(int state) {
    return (state != ERR);
}


Lex Method

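// Read one token from the input (any Scanner).
// Note: nextChar() is assumed to come from a character-level Scanner class;
// java.util.Scanner has no such method.
public static Token lex(Scanner s) {
    // initialize variables
    StringBuilder lexeme = new StringBuilder();
    int state = START;
    char ch = s.nextChar();
    …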


Lex Method (cont’d)

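    // loop through characters, updating state
    // (as written, the character that leads to ERR is also appended to the lexeme)
    int oldstate = state;
    while (state != ERR) {
        oldstate = state;
        lexeme.append(ch);
        int col = chars.indexOf(ch);                   // -1 if ch is not in chars
        state = (col < 0) ? ERR : tt[oldstate][col];
        ch = s.nextChar();
    }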


Lex Method (cont’d)

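    // return the token
    // (isFinal is the final-state test from "Recognizing Final States";
    //  the word final itself is reserved in Java)
    if (isFinal(oldstate)) // valid token
        return new Token(oldstate, lexeme.toString());
    else // not a valid token; return the chars
        return new Token(ERR, lexeme.toString());
} // end of lex()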


From DFA to Program (cont’d)

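public static boolean doScan() {
    // Scanner is assumed to be a character-level scanner with peek();
    // java.util.Scanner does not provide that method.
    Scanner s = new Scanner(System.in);
    while (s.peek()) { // not EOF
        // removes whitespace
        eatWhitespace(s);
        Token token = lex(s);
        tokens.add(token);
        if (token.getType() == ERR)
            return false;
    }
    return true;
}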
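
The pieces above can be put together into one runnable sketch. It departs from the slides in several labeled ways, so treat it as an illustration rather than the course's scanner: it reads from a String instead of System.in, it omits UNK and the operator characters, it treats START as non-final, and it stops before the character that would lead to ERR instead of appending it. The class name MiniLexer and its helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MiniLexer {
    static final int START = 0, INT = 1, ID = 2, ERR = 3, NUMSTATES = 4;
    static final String DIGITS = "0123456789";
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";
    static final String CHARS = DIGITS + LETTERS;

    // tt[state][column], where column = CHARS.indexOf(character)
    static final int[][] tt = new int[NUMSTATES][CHARS.length()];
    static {
        for (int[] row : tt) Arrays.fill(row, ERR);   // default: no transition
        for (char d : DIGITS.toCharArray()) {
            tt[START][CHARS.indexOf(d)] = INT;
            tt[INT][CHARS.indexOf(d)] = INT;
        }
        for (char l : LETTERS.toCharArray()) {
            tt[START][CHARS.indexOf(l)] = ID;
            tt[ID][CHARS.indexOf(l)] = ID;            // a digit inside an ID stays ERR
        }
    }

    static final class Token {
        final int type;
        final String lexeme;
        Token(int type, String lexeme) { this.type = type; this.lexeme = lexeme; }
    }

    private final String input;
    private int pos = 0;

    MiniLexer(String input) { this.input = input; }

    void eatWhitespace() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
    }

    // Scans one token, stopping before the first character that leads to ERR.
    Token lex() {
        int state = START;
        int begin = pos;
        while (pos < input.length()) {
            int col = CHARS.indexOf(input.charAt(pos));
            int next = (col < 0) ? ERR : tt[state][col];
            if (next == ERR) break;
            state = next;
            pos++;
        }
        if (state == START) {                 // nothing matched: report one bad character
            if (pos < input.length()) pos++;
            return new Token(ERR, input.substring(begin, pos));
        }
        return new Token(state, input.substring(begin, pos));
    }

    // Collects tokens in order; scanning stops after the first ERR token.
    static List<Token> doScan(String text) {
        MiniLexer lx = new MiniLexer(text);
        List<Token> tokens = new ArrayList<>();
        while (true) {
            lx.eatWhitespace();
            if (lx.pos >= text.length()) break;
            Token t = lx.lex();
            tokens.add(t);
            if (t.type == ERR) break;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : doScan("12 abc 7x")) {
            System.out.println(t.type + " " + t.lexeme);
        }
    }
}
```

With the input 12 abc 7x, the sketch produces INT 12, ID abc, INT 7, ID x, because a letter has no transition out of the INT state and therefore ends the integer.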

Another Program (pp. 176-181)

• Programmed in C (no classes)
• Global variables instead of class variables (used in many functions, e.g. charClass)
• Token (int) and lexeme (string) unconnected
• States and transitions are implicit
• Lex() is a big case statement
• Many special purpose functions, e.g. getChar(), addChar(), lookup() executing portions of DFA
