Lexical Analysis - An Introduction
The Front End
The purpose of the front end is to deal with the input language
Perform a membership test: code ∈ source language?
Is the program well-formed (semantically)?
Build an IR version of the code for the rest of the compiler
[Figure: Source code → Front End → IR → Back End → Machine code, with Errors as a side output]
The Front End
Scanner
Maps stream of characters into words
Basic unit of syntax
x = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,;>
Characters that form a word are its lexeme
Its part of speech (or syntactic category) is called its token type
Scanner discards white space & (often) comments
[Figure: Source code → Scanner → tokens → Parser → IR, with Errors as a side output]
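The mapping from characters to classified words can be made concrete with a small sketch. It is a minimal hand-written scanner for the x = x + y ; example only; the token-type names (id, eq, pl, sc) come from the slide, while the function name `scan` and the rule that an identifier is a letter followed by letters or digits are assumptions made for illustration.

```python
def scan(text):
    """Map a stream of characters to a list of (token_type, lexeme) pairs."""
    # Token types for the three symbols used in the slide's example.
    symbols = {"=": "eq", "+": "pl", ";": "sc"}
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # The scanner discards white space.
            i += 1
        elif ch.isalpha():
            # Assumed rule: an identifier is a letter followed by letters or digits.
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append(("id", text[start:i]))
        elif ch in symbols:
            tokens.append((symbols[ch], ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


if __name__ == "__main__":
    for token_type, lexeme in scan("x = x + y ;"):
        print(f"<{token_type},{lexeme}>")
```

Each pair holds the token type first and the lexeme second, mirroring the <id,x> notation.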
Speed is an issue in scanning, so use a specialized recognizer
The Front End
Parser
Checks stream of classified words (parts of speech) for grammatical correctness
Determines if code is syntactically well-formed
Guides checking at deeper levels than syntax
Builds an IR representation of the code
The Big Picture
Language syntax is specified with parts of speech, not words
Syntax checking matches parts of speech against a grammar

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -

S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }
No words here!
Parts of speech, not words!
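To illustrate that syntax checking looks only at parts of speech, here is a minimal sketch of a recognizer for the seven-production grammar. The function name `is_expr` is hypothetical, and the left-recursive rule expr → expr op term | term is rewritten as the equivalent iteration term (op term)*, which is a choice made for this sketch rather than anything the slides prescribe.

```python
TERM = {"number", "id"}   # productions 4 and 5
OP = {"+", "-"}           # productions 6 and 7


def is_expr(parts_of_speech):
    """Return True if the sequence of parts of speech derives from goal.

    The grammar generates exactly: term (op term)*.
    """
    if not parts_of_speech or parts_of_speech[0] not in TERM:
        return False
    i = 1
    while i < len(parts_of_speech):
        # Every further step must be an op followed by a term.
        if parts_of_speech[i] not in OP:
            return False
        if i + 1 >= len(parts_of_speech) or parts_of_speech[i + 1] not in TERM:
            return False
        i += 2
    return True


if __name__ == "__main__":
    # "x + 2 - y" arrives from the scanner as parts of speech only.
    print(is_expr(["id", "+", "number", "-", "id"]))
    print(is_expr(["id", "+"]))
```

The checker never sees the lexemes x, 2, or y, only their token types.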
The Big Picture
[Figure: source code → Scanner → parts of speech & words; specifications → Scanner Generator → tables or code → Scanner]
Specifications written as “regular expressions”
Represent words as indices into a global table
The Big Picture
Why study lexical analysis? Goals:
To simplify specification & implementation of scanners
To understand the underlying techniques and technologies
How to implement a scanner: Regular Expressions → NFA → DFA
Regular Expressions
Lexical patterns form a regular language
*** any finite language is regular ***
Regular expressions (REs) describe regular languages
Ever type “rm *.o a.out” ?
Regular Expressions
Regular Expression (over alphabet Σ)
ε is a RE denoting the set {ε}
If a is in Σ, then a is a RE denoting {a}
If x and y are REs denoting L(x) and L(y) then
  x | y is an RE denoting L(x) ∪ L(y)
  xy is an RE denoting L(x)L(y)
  x* is an RE denoting L(x)*
Set Operations (review)
Operation                                 Definition
Union of L and M, written L ∪ M           L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, written LM      LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, written L*           L* = ∪ (0 ≤ i ≤ ∞) L^i
Positive closure of L, written L+         L+ = ∪ (1 ≤ i ≤ ∞) L^i
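The four operations can be tried out on small finite languages. The sketch below represents a language as a Python set of strings; because L* and L+ are infinite whenever L contains a non-empty string, the closures are truncated at a caller-chosen bound `max_i`, which is an assumption of this sketch and not part of the definitions. All function names are hypothetical.

```python
def union(L, M):
    """L ∪ M: strings in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene(L, max_i):
    """Approximate L* by the union of L^0 through L^max_i."""
    out = set()
    for i in range(max_i + 1):
        out |= power(L, i)
    return out


def positive(L, max_i):
    """Approximate L+ by the union of L^1 through L^max_i."""
    out = set()
    for i in range(1, max_i + 1):
        out |= power(L, i)
    return out


if __name__ == "__main__":
    L, M = {"a"}, {"b", "c"}
    print(sorted(union(L, M)))
    print(sorted(concat(L, M)))
    print(sorted(kleene(L, 3)))
    print(sorted(positive(L, 3)))
```

The only difference between the two closures is whether the empty string from L^0 is included.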
Examples of Regular Expressions
Identifiers:
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*
Numbers:
Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*))
Decimal → Integer . Digit*
Real → ( Integer | Decimal ) E (+|-|ε) Digit*
Complex → ( Real , Real )
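These definitions translate almost symbol for symbol into the pattern syntax of Python's standard `re` module. The sketch below is one such translation, assuming ASCII letters and digits; the constant names and the helper `matches` are hypothetical. `re.fullmatch` is used so that the whole word must match, as a recognizer would require.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
# (+|-|ε) becomes the optional character class [+-]?
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL = rf"(?:{INTEGER}|{DECIMAL})E[+-]?{DIGIT}*"


def matches(pattern, word):
    """Return True if the entire word is in the language of the pattern."""
    return re.fullmatch(pattern, word) is not None


if __name__ == "__main__":
    for pattern, word in [
        (IDENTIFIER, "x1"),
        (IDENTIFIER, "1x"),
        (INTEGER, "-42"),
        (INTEGER, "007"),
        (DECIMAL, "3.14"),
        (REAL, "3.14E-2"),
    ]:
        print(word, matches(pattern, word))
```

Because the definitions use Digit* rather than Digit+, words such as "3." and "1E" are also accepted; that follows from the definitions as written, not from the translation.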
Regular Expressions (the point)
Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzer
Using results from automata theory and theory of algorithms, we can automatically build recognizers from regular expressions
We study REs and associated theory to automate scanner construction!
Consider the problem of recognizing ILOC register names
Register → r (0|1|2| … |9) (0|1|2| … |9)*
Allows registers of arbitrary number
Requires at least one digit
RE corresponds to a recognizer (or DFA)
Example
[Figure: Recognizer for Register. S0 goes to S1 on r; S1 goes to S2 on (0|1|2| … |9); S2 loops to itself on (0|1|2| … |9); S2 is the accepting state]
DFA operation
Start in state S0 & take transitions on each input character
DFA accepts a word x iff x leaves it in a final state (S2)
So, r17 takes it through s0, s1, s2 and accepts
r takes it through s0, s1 and fails
Example (continued)
To be useful, recognizer must turn into code

Skeleton recognizer:

    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        State ← δ(State, Char)
        Char ← next character
    if (State is a final state)
        then report success
        else report failure

Table encoding RE:

    δ     r     0,1,2,3,4,5,6,7,8,9    All others
    s0    s1    se                     se
    s1    se    s2                     se
    s2    se    s2                     se
    se    se    se                     se
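The skeleton recognizer and its table can be sketched as runnable code. This is a minimal illustration, not a generated scanner: the state names s0, s1, s2, se and the three character classes come from the table, while the names `DELTA`, `char_class`, and `recognize` are hypothetical, and reaching the end of the Python string stands in for EOF.

```python
DIGITS = "0123456789"

# Transition table δ: one row per state, one column per character class.
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL_STATES = {"s2"}


def char_class(ch):
    """Map a character to its table column: r, digit, or other."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"


def recognize(word):
    """Run the DFA over word and report whether it ends in a final state."""
    state = "s0"
    for ch in word:  # the end of the string plays the role of EOF
        state = DELTA[state][char_class(ch)]
    return state in FINAL_STATES


if __name__ == "__main__":
    for word in ["r17", "r", "r1x", "17"]:
        print(word, "success" if recognize(word) else "failure")
```

The loop body is a single table lookup per character, which is what makes table-driven recognizers fast.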
Example (continued)

Table encoding RE, with an action for each transition:

    δ     r            0,1,2,3,4,5,6,7,8,9    All others
    s0    s1, start    se, error              se, error
    s1    se, error    s2, add                se, error
    s2    se, error    s2, add                se, error
    se    se, error    se, error              se, error

Skeleton recognizer:

    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        State ← δ(State, Char)
        perform specified action
        Char ← next character
    if (State is a final state)
        then report success
        else report failure
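Attaching actions to transitions lets the recognizer compute something while it scans. The slides name the actions start, add, and error but do not define them, so the sketch below assumes one plausible reading: start resets an accumulated register number to zero, add appends the current digit to it, and error discards it. The names `TABLE`, `char_class`, and `register_number` are hypothetical.

```python
DIGITS = "0123456789"


def char_class(ch):
    """Map a character to its table column: r, digit, or other."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"


ERR = ("se", "error")

# Each entry is (next state, action), following the table with actions.
TABLE = {
    "s0": {"r": ("s1", "start"), "digit": ERR, "other": ERR},
    "s1": {"r": ERR, "digit": ("s2", "add"), "other": ERR},
    "s2": {"r": ERR, "digit": ("s2", "add"), "other": ERR},
    "se": {"r": ERR, "digit": ERR, "other": ERR},
}


def register_number(word):
    """Return the register number of a valid name such as r17, else None."""
    state, value = "s0", None
    for ch in word:
        state, action = TABLE[state][char_class(ch)]
        if action == "start":
            value = 0                        # assumed meaning of "start"
        elif action == "add":
            value = value * 10 + int(ch)     # assumed meaning of "add"
        else:
            value = None                     # assumed meaning of "error"
    return value if state == "s2" else None


if __name__ == "__main__":
    for word in ["r17", "r", "r1a"]:
        print(word, register_number(word))
```

The add action is only reachable after start has run, so `value` is always an integer when a digit is appended.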
Algorithm Project 1
1. Open a file to read from
2. Open a file to write to
3. Create a scanner object
4. Call a method from the Scanner class to scan, classify each token, and write to the output file
Algorithm scanner
1. Read a line from the input file; ci = first character of this line
2. If ci == '#' || (ci == '/' && ci+1 == '/'): recognize the meta character, Current Token = meta character, print out the token with a new line
3. Else if ci == '"': recognize the string (read until you reach another "), Current Token = string, print out the token; i += len of token
4. Else if ci is a digit: recognize the number, Current Token = number, print out the token; i += len of token
5. Else if ci is a letter: recognize the id; if the token is not a keyword, the token is an ID, print it with the tag; otherwise the token is a keyword, print the token; i += len of token
6. Else if ci is a symbol: recognize the symbol, print it; i += len of token
7. Else: print ci; i++
8. If not at the end of the line, repeat from step 2 with the new ci
9. At the end of the line: if not at the end of the file, read another line and repeat from step 1; at the end of the file, stop
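The per-line part of the scanner algorithm can be sketched as follows. This is an illustration under stated assumptions, not the project's reference solution: the `KEYWORDS` and `SYMBOLS` sets are hypothetical (printf is listed as a keyword only because the sample token list classifies it that way), the function name `scan_line` is invented, tokens are collected in a list instead of being printed, and characters that match no case (such as white space) are skipped rather than echoed.

```python
# Hypothetical sets; a real project would take them from its specification.
KEYWORDS = {"int", "void", "char", "if", "else", "while", "return", "printf"}
SYMBOLS = set("(){}[];,=+-*/<>!&|")


def scan_line(line):
    """Classify the tokens of one line as (lexeme, token type) pairs."""
    tokens = []
    i = 0
    while i < len(line):
        c = line[i]
        if c == "#" or line.startswith("//", i):
            # Meta statement: takes the rest of the line.
            tokens.append((line[i:].rstrip(), "meta"))
            break
        elif c == '"':
            # String: read until the next double quote (or end of line).
            end = line.find('"', i + 1)
            if end == -1:
                end = len(line) - 1
            tokens.append((line[i:end + 1], "string"))
            i = end + 1
        elif c.isdigit():
            j = i
            while j < len(line) and line[j].isdigit():
                j += 1
            tokens.append((line[i:j], "number"))
            i = j
        elif c.isalpha() or c == "_":
            j = i
            while j < len(line) and (line[j].isalnum() or line[j] == "_"):
                j += 1
            word = line[i:j]
            tokens.append((word, "keyword" if word in KEYWORDS else "id"))
            i = j
        elif c in SYMBOLS:
            tokens.append((c, "symbol"))
            i += 1
        else:
            i += 1  # white space and anything unrecognized: advance by one
    return tokens


if __name__ == "__main__":
    for lexeme, kind in scan_line('printf("Helloworld %d",b);'):
        print(f"{lexeme} ---- {kind}")
```

Checking for the meta statement before the symbol case matters, because / is also a symbol on its own.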
Sample: test1.c
#include <stdio.h>
void sample() {
int b=4;
printf("Helloworld %d",b);
}
int main() {
sample();
}
Token list
#include <stdio.h> ---- meta
void ---- keyword
sample ---- id
( ---- symbol
) ---- symbol
{ ---- symbol
int ---- keyword
b ---- id
= ---- symbol
4 ---- number
; ---- symbol
printf ---- keyword
( ---- symbol
"Helloworld %d" ---- string
, ---- symbol
b ---- id
) ---- symbol
; ---- symbol
} ---- symbol
int ---- keyword
main ---- id
( ---- symbol
) ---- symbol
{ ---- symbol
sample ---- id
( ---- symbol
) ---- symbol
; ---- symbol
} ---- symbol
Result
#include <stdio.h>
void CS322sample() {
int CS322b=4;
printf("Helloworld %d",CS322b);
}
int CS322main() {
CS322sample();
}