Lexical Analysis - An Introduction

The Front End

The purpose of the front end is to deal with the input language

Perform a membership test: code ∈ source language?

Is the program well-formed (semantically)?

Build an IR version of the code for the rest of the compiler

[Figure: Source code → Front End → IR → Back End → Machine code; Errors]

The Front End

Scanner

Maps a stream of characters into words, the basic unit of syntax

x = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,;>

The characters that form a word are its lexeme

A word's part of speech (or syntactic category) is called its token type

The scanner discards white space and (often) comments

[Figure: Source code → Scanner → tokens → Parser → IR; Errors]

Speed is an issue in scanning, so use a specialized recognizer
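The mapping from characters to classified words can be sketched in a few lines of Python. This is only an illustration of the x = x + y ; example, not a full scanner: the token-type names (id, eq, pl, sc) come from the slide, while the function name `scan` and the `SYMBOLS` table are assumptions made for the sketch.

```python
# Token-type names taken from the slide's example; the table itself is an
# assumption made for this sketch and covers only the three symbols shown.
SYMBOLS = {"=": "eq", "+": "pl", ";": "sc"}

def scan(text):
    """Split text into (token_type, lexeme) pairs, discarding white space."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():              # white space is discarded
            i += 1
        elif ch.isalpha():            # identifier: a letter, then letters or digits
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append(("id", text[start:i]))
        elif ch in SYMBOLS:           # single-character symbol
            tokens.append((SYMBOLS[ch], ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens

print(scan("x = x + y ;"))
```

Each pair holds the part of speech first and the lexeme second, mirroring the <id,x> notation.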

The Front End

Parser

Checks the stream of classified words (parts of speech) for grammatical correctness

Determines whether the code is syntactically well-formed

Guides checking at deeper levels than syntax

Builds an IR representation of the code

The Big Picture

Language syntax is specified with parts of speech, not words

Syntax checking matches parts of speech against a grammar

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -

S = goal

T = { number, id, +, - }

N = { goal, expr, term, op }

P = { 1, 2, 3, 4, 5, 6, 7 }

No words here: the grammar refers only to parts of speech
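To make the point concrete, the grammar can be written down as plain data and checked against a list of token types. The sketch below is an assumption-laden illustration: the function `is_sentence` is hypothetical, and it relies on the observation that rules 2 and 3 generate exactly the sequences term (op term)*, so a simple loop stands in for a real parser.

```python
# The grammar as a 4-tuple (S, N, T, P), using the rule numbers from the slide.
S = "goal"
N = {"goal", "expr", "term", "op"}
T = {"number", "id", "+", "-"}
P = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}

def is_term(part):
    return part in ("number", "id")      # rules 4 and 5

def is_op(part):
    return part in ("+", "-")            # rules 6 and 7

def is_sentence(parts):
    """Check a list of parts of speech against the shape term (op term)*."""
    if not parts or not is_term(parts[0]):
        return False
    rest = parts[1:]
    if len(rest) % 2 != 0:               # each extra term needs its operator
        return False
    return all(is_op(rest[i]) and is_term(rest[i + 1])
               for i in range(0, len(rest), 2))

# The checker sees only token types, never the lexemes x, 2, or y.
print(is_sentence(["id", "+", "number", "-", "id"]))
print(is_sentence(["id", "+"]))
```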

The Big Picture

[Figure: source code → Scanner → parts of speech & words; specifications → Scanner Generator → tables or code → Scanner]

Specifications written as “regular expressions”

Represent words as indices into a global table

The Big Picture

Why study lexical analysis?

Goals:

To simplify specification & implementation of scanners

To understand the underlying techniques and technologies

How to implement a scanner: Regular Expressions → NFA → DFA

Regular Expressions

Lexical patterns form a regular language

*** any finite language is regular ***

Regular expressions (REs) describe regular languages

Ever type “rm *.o a.out” ?

Regular Expressions

Regular Expression (over alphabet Σ)

ε is an RE denoting the set {ε}

If a is in Σ, then a is an RE denoting {a}

If x and y are REs denoting L(x) and L(y), then

x | y is an RE denoting L(x) ∪ L(y)

xy is an RE denoting L(x)L(y)

x* is an RE denoting L(x)*

Set Operations (review)

Operation                   Written   Definition
Union of L and M            L ∪ M     L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M    LM        LM = { st | s ∈ L and t ∈ M }
Kleene closure of L         L*        L* = ⋃_{i ≥ 0} L^i
Positive closure of L       L+        L+ = ⋃_{i ≥ 1} L^i
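These operations can be tried out on small finite languages represented as Python sets of strings. The sketch below is illustrative only: the function names are made up for the example, and because L* and L+ are infinite whenever L contains a non-empty string, the closures are cut off at a caller-supplied bound `max_i`.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i: L concatenated with itself i times, with L^0 = {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene(L, max_i):
    """Finite approximation of L*: the union of L^i for 0 <= i <= max_i."""
    result = set()
    for i in range(max_i + 1):
        result |= power(L, i)
    return result

def positive(L, max_i):
    """Finite approximation of L+: the union of L^i for 1 <= i <= max_i."""
    result = set()
    for i in range(1, max_i + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene({"a"}, 3)))
```

The only difference between the two closures is whether the empty string (from L^0) is included.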

Examples of Regular Expressions

Identifiers:

Letter → (a|b|c| … |z|A|B|C| … |Z)

Digit → (0|1|2| … |9)

Identifier → Letter ( Letter | Digit )*

Numbers:

Integer → (+|-|ε) (0 | (1|2|3| … |9) Digit*)

Decimal → Integer . Digit*

Real → ( Integer | Decimal ) E (+|-|ε) Digit*

Complex → ( Real , Real )
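As a rough check of these definitions, they can be transcribed into the syntax of Python's `re` module. The transcription is a sketch: character classes such as [0-9] stand in for the long alternations, an optional [+-] stands in for (+|-|ε), and the helper `matches` is a hypothetical name. Complex is left out because the slide does not make clear whether its parentheses are literal.

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL = rf"(?:{INTEGER}|{DECIMAL})E[+-]?{DIGIT}*"

def matches(pattern, text):
    """True when the whole of text is a word in the pattern's language."""
    return re.fullmatch(pattern, text) is not None

print(matches(IDENTIFIER, "x1"))     # a letter followed by a digit
print(matches(INTEGER, "007"))       # leading zeros are not allowed
print(matches(DECIMAL, "3.14"))
print(matches(REAL, "3.14E-2"))
```

Because the definitions use Digit* rather than Digit+, words such as "3." and "1E" are also accepted; the sketch keeps that behavior rather than tightening it.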

Regular Expressions (the point)

Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzer

Using results from automata theory and theory of algorithms, we can automatically build recognizers from regular expressions

We study REs and the associated theory to automate scanner construction!

Consider the problem of recognizing ILOC register names

Register → r (0|1|2| … |9) (0|1|2| … |9)*

Allows registers of arbitrary number

Requires at least one digit

The RE corresponds to a recognizer (or DFA)

Example

[Figure: Recognizer for Register. S0 goes to S1 on r; S1 goes to S2 on (0|1|2| … |9); S2 loops to itself on (0|1|2| … |9); S2 is the accepting state]

DFA operation

Start in state S0 and take a transition on each input character

The DFA accepts a word x iff x leaves it in a final state (S2)

So, r17 takes it through s0, s1, s2 and accepts

r takes it through s0, s1 and fails

Example (continued)

To be useful, the recognizer must turn into code

Skeleton recognizer:

Char ← next character
State ← s0
while (Char ≠ EOF)
    State ← δ(State, Char)
    Char ← next character
if (State is a final state)
    then report success
    else report failure

Table encoding the RE:

δ     r     0,1,2,3,4,5,6,7,8,9   All others
s0    s1    se                    se
s1    se    s2                    se
s2    se    s2                    se
se    se    se                    se
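The skeleton and the table translate almost line for line into Python. This is a minimal sketch under a few assumptions: the names `DELTA`, `FINAL`, `char_class`, and `recognize` are invented for the example, and iterating over the input string stands in for reading characters until EOF.

```python
DIGITS = "0123456789"

def char_class(ch):
    """Map an input character to its column in the transition table."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"

# Transition table: one row per state, one entry per character class.
# "se" is the error state; once entered, it is never left.
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}

def recognize(word):
    """Return True iff the DFA ends in a final state after reading word."""
    state = "s0"
    for ch in word:                      # one transition per input character
        state = DELTA[state][char_class(ch)]
    return state in FINAL

print(recognize("r17"))   # s0 -> s1 -> s2 -> s2
print(recognize("r"))     # stops in s1, which is not final
```

Changing the language recognized means changing only the table and the classifier, not the loop.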

Example (continued)

Table encoding the RE, with an action for each transition:

δ     r            0,1,2,3,4,5,6,7,8,9   All others
s0    s1, start    se, error             se, error
s1    se, error    s2, add               se, error
s2    se, error    s2, add               se, error
se    se, error    se, error             se, error

Skeleton recognizer:

Char ← next character
State ← s0
while (Char ≠ EOF)
    State ← δ(State, Char)
    perform specified action
    Char ← next character
if (State is a final state)
    then report success
    else report failure
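One plausible reading of the actions, sketched below, is that start initializes the register number and add appends the current digit to it. The slides do not define the actions, so that interpretation, along with the names `TABLE` and `scan_register`, is an assumption. The sketch also returns as soon as an error action fires, which gives the same answer as running to EOF because se is never left.

```python
DIGITS = "0123456789"

def char_class(ch):
    """Map an input character to its column in the table."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"

# Each entry pairs the next state with the action to perform.
TABLE = {
    "s0": {"r": ("s1", "start"), "digit": ("se", "error"), "other": ("se", "error")},
    "s1": {"r": ("se", "error"), "digit": ("s2", "add"),   "other": ("se", "error")},
    "s2": {"r": ("se", "error"), "digit": ("s2", "add"),   "other": ("se", "error")},
    "se": {"r": ("se", "error"), "digit": ("se", "error"), "other": ("se", "error")},
}
FINAL = {"s2"}

def scan_register(word):
    """Return the register number, or None if word is not a register name."""
    state, value = "s0", None
    for ch in word:
        state, action = TABLE[state][char_class(ch)]
        if action == "start":            # assumed meaning: begin a new number
            value = 0
        elif action == "add":            # assumed meaning: append this digit
            value = value * 10 + int(ch)
        else:                            # error: the word cannot be accepted
            return None
    return value if state in FINAL else None

print(scan_register("r17"))
print(scan_register("r"))
```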

Algorithm Project 1

Open a file to read from

Open a file to write to

Create a scanner object

Call a method from the Scanner class to scan, classify each token, and write to the output file

Algorithm scanner

[Flowchart, summarized as steps]

Read a line from the input file; ci = first character of this line

If ci == '#' || (ci == '/' && ci+1 == '/'): recognize the meta character; Current Token = meta character; print out the token with a new line

Else if ci == '"': recognize the string (read until you reach another "); Current Token = string; print out the token; i += len of token

Else if ci is a digit: recognize the number; Current Token = number; print out the token; i += len of token

Else if ci is a letter: recognize the id; if the token is not a keyword, the token is an ID, print it with a tag; otherwise the token is a keyword, print the token; i += len of token

Else if ci is a symbol: recognize the symbol; print it; i += len of token

Else: print ci; i++

At the end of the line: if not the end of file, read another line; at the end of file, stop
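The flowchart can be sketched as a short Python scanner. Several details here are assumptions made for illustration rather than the project's required design: the keyword and symbol sets are small samples (printf is listed because the project's token list classifies it as a keyword), the tag is the CS322 prefix seen in the sample result, ids are limited to letters and digits, and the names `scan_line` and `scan` are hypothetical. Strings and lists stand in for the input and output files.

```python
# Assumed sample sets; a real project would use the full C keyword list.
KEYWORDS = {"int", "void", "printf", "return", "if", "else", "while"}
SYMBOLS = set("(){}[];,=+-*/<>!&|")
TAG = "CS322"                  # prefix added to ids, as in the sample result

def scan_line(line, tokens, out):
    """Scan one line, appending (lexeme, kind) pairs and output text."""
    i = 0
    while i < len(line):
        c = line[i]
        if c == "#" or line.startswith("//", i):
            lexeme, kind = line[i:].rstrip("\n"), "meta"   # rest of the line
        elif c == '"':
            end = line.find('"', i + 1)                    # closing quote
            end = len(line) if end == -1 else end + 1      # unterminated: rest
            lexeme, kind = line[i:end], "string"
        elif c.isdigit():
            j = i
            while j < len(line) and line[j].isdigit():
                j += 1
            lexeme, kind = line[i:j], "number"
        elif c.isalpha():
            j = i
            while j < len(line) and line[j].isalnum():
                j += 1
            lexeme = line[i:j]
            kind = "keyword" if lexeme in KEYWORDS else "id"
        elif c in SYMBOLS:
            lexeme, kind = c, "symbol"
        else:
            out.append(c)      # white space or unclassified: print ci, i++
            i += 1
            continue
        tokens.append((lexeme, kind))
        out.append(TAG + lexeme if kind == "id" else lexeme)
        i += len(lexeme)       # i += len of token

def scan(source):
    """Scan every line of source; return the token list and output text."""
    tokens, out = [], []
    for line in source.splitlines(keepends=True):   # read another line
        scan_line(line, tokens, out)
    return tokens, "".join(out)

tokens, result = scan("int b=4;\n")
print(tokens)
print(result, end="")
```

Keeping the token list and the output text separate makes it easy to compare a run against both the token list and the result shown for test1.c.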

Sample: test1.c

#include <stdio.h>

void sample() {

int b=4;

printf("Helloworld %d",b);

}

int main() {

sample();

}

Token list

#include <stdio.h> ---- meta
void ---- keyword
sample ---- id
( ---- symbol
) ---- symbol
{ ---- symbol
int ---- keyword
b ---- id
= ---- symbol
4 ---- number
; ---- symbol
printf ---- keyword
( ---- symbol
"Helloworld %d" ---- string
, ---- symbol
b ---- id
) ---- symbol
; ---- symbol
} ---- symbol
int ---- keyword
main ---- id
( ---- symbol
) ---- symbol
{ ---- symbol
sample ---- id
( ---- symbol
) ---- symbol
; ---- symbol
} ---- symbol

Result

#include <stdio.h>

void CS322sample() {

int CS322b=4;

printf("Helloworld %d",CS322b);

}

int CS322main() {

CS322sample();

}
