Lexical Analysis
Mooly [email protected]
Schrierber 317
03-640-7606
Wed 10:00-12:00
html://www.math.tau.ac.il/~msagiv/courses/wcc.html
Textbook: Modern Compiler Implementation in C, Chapter 2
A motivating example
• Create a program that counts the number of lines in a given input file
A motivating example: solution
    %{
    #include <stdio.h>
    int num_lines = 0;
    %}
    %%
    \n      ++num_lines;
    .       ;
    %%
    int main(void) { yylex(); printf("# of lines = %d\n", num_lines); return 0; }
Subjects
• Roles of lexical analysis
• The straightforward solution: a manual scanner for C
• Regular Expressions
• Finite automata
• From regular languages into finite automata
• Flex
Basic Compiler Phases
    Source program (string)
      → lexical analysis → Tokens
      → syntax analysis → Abstract syntax tree
      → semantic analysis
      → Translate → Intermediate representation
      → Instruction selection → Assembly
      → Register Allocation → Fin. Assembly
    Underlying techniques named on the slide: finite automata, pushdown automata, memory organization, graph algorithms, dynamic programming
Example
• Input string
    a := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)
• Tokens
    id("a") assign num(5) + num(3) ;
    id("b") assign (print(id("a") , id("a") - num(1)), num(10) * id("a")) ;
    print(id("b"))
Lexical Analysis (Scanning)
• Functionality
  – input
    • program text (file)
  – output
    • sequence of tokens
  – Read input file
  – Identify language keywords and standard identifiers
  – Handle include files and macros
  – Count line numbers
  – Remove whitespace
  – Report illegal symbols
  – Produce symbol table
A simplified scanner for C
    Token nextToken()
    {
        int c ;                     /* int, so that EOF can be represented */
    loop:
        c = getchar();
        switch (c) {
        case ' ': goto loop ;
        case ';': return SemiColumn;
        case '+': c = getchar() ;
            switch (c) {
            case '+': return PlusPlus ;
            case '=': return PlusEqual;
            default:  ungetc(c, stdin);   /* push the lookahead character back */
                      return Plus;
            }
        case '<': ...
        case 'w': ...
        }
    }
Automatic Generation of Lexical Analysis
• The matching of input strings can be performed by a finite automaton
• Examples:
  – An automaton for while
  – An automaton for C identifier
  – An automaton for C comment
• The program for the automaton is automatically generated from regular expressions
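As a minimal sketch of the second example above, an automaton for C identifiers can be hand-coded with one start state, one accepting state, and a dead state. The state names and the function `is_c_identifier` are illustrative assumptions, not taken from the slides; a generated scanner would encode the same transitions in a table or switch.

```c
#include <ctype.h>

/* States of a hypothetical DFA for [a-zA-Z_][a-zA-Z_0-9]* */
enum id_state { START, IN_ID, DEAD };

/* One transition of the automaton on character c. */
static enum id_state id_step(enum id_state s, unsigned char c)
{
    switch (s) {
    case START:
        return (isalpha(c) || c == '_') ? IN_ID : DEAD;
    case IN_ID:
        return (isalnum(c) || c == '_') ? IN_ID : DEAD;
    default:
        return DEAD;
    }
}

/* Returns 1 when the whole string is accepted (IN_ID is the only
 * accepting state), 0 otherwise. Assumes the default "C" locale. */
int is_c_identifier(const char *s)
{
    enum id_state cur = START;
    for (; *s != '\0' && cur != DEAD; ++s)
        cur = id_step(cur, (unsigned char)*s);
    return cur == IN_ID;
}
```

Calling `is_c_identifier("_x9")` accepts, while `is_c_identifier("9x")` falls into the dead state on the first character.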
Flex
• Input
  – regular expressions and actions (C code)
• Output
  – A scanner program that reads the input and applies actions when an input regular expression is matched
[Diagram: regular expressions → flex → scanner; input program → scanner → tokens]
Regular Expression Notations
    a          An ordinary character stands for itself
    M|N        M or N
    MN         M followed by N
    M*         Zero or more occurrences of M
    M+         One or more occurrences of M
    M?         Zero or one occurrence of M
    [a-zA-Z]   Character set alternation (single character)
    .          Any (single) character but newline
    "a.+"      Quotation
    \          Convert an operator into text
Ambiguity Resolving
• Find the longest matching token
• Between two tokens with the same length use the one declared first
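The two rules above can be sketched in C as follows. This is an illustration under stated assumptions, not how flex is implemented: each rule is tried separately by a hypothetical matcher function (`match_if`, `match_id`, `match_num`), whereas a generated scanner runs one combined automaton. The token names and `pick_token` are also invented for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { T_NONE, T_IF, T_ID, T_NUM };

/* Each hypothetical matcher returns the length of the longest prefix
 * of s that its rule accepts, or 0 when the rule does not match. */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)   /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0]))
        return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        ++n;
    return n;
}

static size_t match_num(const char *s)  /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        ++n;
    return n;
}

/* Rules in declaration order: earlier entries win ties. */
static const struct {
    enum token tok;
    size_t (*match)(const char *);
} rules[] = {
    { T_IF, match_if }, { T_ID, match_id }, { T_NUM, match_num },
};

/* Picks the longest match; the strict '>' keeps the earliest rule
 * when two rules match the same length. */
enum token pick_token(const char *s, size_t *len)
{
    enum token best = T_NONE;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
        size_t n = rules[i].match(s);
        if (n > *len) {
            *len = n;
            best = rules[i].tok;
        }
    }
    return best;
}
```

On the input `if8 x` the identifier rule wins because it matches three characters; on `if (` both the keyword and identifier rules match two characters, so the keyword, declared first, is chosen.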
A Flex specification of C Scanner
    Letter  [a-zA-Z_]
    Digit   [0-9]
    %%
    [ \t]     {;}
    [\n]      {line_count++;}
    ";"       { return SemiColumn;}
    "++"      { return PlusPlus ;}
    "+="      { return PlusEqual ;}
    "+"       { return Plus;}
    "while"   { return While ; }
    {Letter}({Letter}|{Digit})*   { return Id ;}
    "<="      { return LessOrEqual;}
    "<"       { return LessThen ;}
Running Example
    if                                  { return IF; }
    [a-z][a-z0-9]*                      { return ID; }
    [0-9]+                              { return NUM; }
    [0-9]+"."[0-9]*|[0-9]*"."[0-9]+     { return REAL; }
    (\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
    .                                   { error(); }
    int edges[][256] = {
      /*                ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
      /* state 0 */  {0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
      /* state 1 */  {13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13},
      /* state 2 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0},
      /* state 3 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
      /* state 4 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
      /* state 5 */  {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
      /* state 6 */  {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
      /* state 7 */
      ...
      /* state 13 */ {0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0}
    };
Pseudo Code for Scanner
    Token nextToken()
    {
        lastFinal = 0;
        currentState = 1 ;
        inputPositionAtLastFinal = input;
        currentPosition = input;
        while (not(isDead(currentState))) {
            nextState = edges[currentState][character at currentPosition];
            advance currentPosition;
            if (isFinal(nextState)) {
                lastFinal = nextState ;
                inputPositionAtLastFinal = currentPosition;   /* just past the token */
            }
            currentState = nextState;
        }
        input = inputPositionAtLastFinal ;
        return action[lastFinal];
    }
Example
Input: “if --not-a-com”
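The example input can be traced with a small runnable version of the scanner loop. This is a sketch under assumptions: the REAL rule is left out for brevity, the transition table is replaced by a hypothetical `dfa_step` function, and the state numbers (1 = start, 0 = dead, 2/3/4 for `i`/`if`/identifier, 7 for numbers, 9 and 10 for `-` and `--`, 12 for skipped text, 13 for any other character) are chosen to resemble the partially shown table rather than copied from it.

```c
#include <stddef.h>

enum token { T_NONE, T_IF, T_ID, T_NUM, T_SKIP, T_ERROR };

/* Hypothetical DFA; state 0 is dead, state 1 is the start state. */
static int dfa_step(int state, unsigned char c)
{
    int lower = (c >= 'a' && c <= 'z');
    int digit = (c >= '0' && c <= '9');
    switch (state) {
    case 1:
        if (c == 'i') return 2;
        if (lower)    return 4;
        if (digit)    return 7;
        if (c == '-') return 9;
        if (c == ' ' || c == '\n' || c == '\t') return 12;
        return 13;
    case 2:                         /* seen "i" */
        if (c == 'f') return 3;
        return (lower || digit) ? 4 : 0;
    case 3:                         /* seen "if" */
    case 4:                         /* inside an identifier */
        return (lower || digit) ? 4 : 0;
    case 7:                         /* inside a number */
        return digit ? 7 : 0;
    case 9:                         /* seen "-" */
        return c == '-' ? 10 : 0;
    case 10:                        /* inside a "--" comment */
        if (lower) return 10;
        return c == '\n' ? 12 : 0;
    default:                        /* states 12 and 13 have no edges */
        return 0;
    }
}

/* Maps a final state to its token; T_NONE marks non-final states. */
static enum token dfa_action(int state)
{
    switch (state) {
    case 2: case 4:  return T_ID;
    case 3:          return T_IF;
    case 7:          return T_NUM;
    case 12:         return T_SKIP;
    case 9: case 13: return T_ERROR;
    default:         return T_NONE;
    }
}

/* Scans one token at *input, advances *input past it, and stores the
 * token length in *len. Returns T_NONE at end of input. */
enum token scan_token(const char **input, size_t *len)
{
    const char *pos = *input;
    const char *last_final_pos = *input;
    int last_final = 0;
    int state = 1;

    while (state != 0 && *pos != '\0') {
        state = dfa_step(state, (unsigned char)*pos);
        ++pos;
        if (dfa_action(state) != T_NONE) {  /* remember the longest match */
            last_final = state;
            last_final_pos = pos;
        }
    }
    *len = (size_t)(last_final_pos - *input);
    *input = last_final_pos;
    return dfa_action(last_final);
}
```

On `if --not-a-com` the scanner reads past `--not` looking for a newline, fails at the third `-`, and rolls back to the last final state, so the first `-` is reported alone as an error. Repeated calls yield IF, a skipped blank, two errors for the dashes, ID `not`, error, ID `a`, error, ID `com`.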
Efficient Scanners
• Efficient state representation
• Input buffering
• Using switch and goto instead of tables
Constructing Automaton from Specification
• Create a non-deterministic automaton (NDFA) from every regular expression
• Merge all the automata using epsilon moves (like the | construction)
• Construct a deterministic finite automaton (DFA)
• Minimize the automaton starting with separate accepting states
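The third step, turning the merged NDFA into a DFA, is commonly done with the subset construction. The sketch below assumes a tiny hand-written NDFA for two hypothetical rules, `a` and `ab`, over the alphabet {a, b}; the arrays `eps` and `delta`, the bitmask representation, and all function names are illustrative choices, and minimization is not shown.

```c
#include <stdint.h>

#define NFA_STATES 6
#define ALPHABET   2            /* symbol 0 is 'a', symbol 1 is 'b' */
#define MAX_DFA    64           /* 2^NFA_STATES is an upper bound */

typedef uint32_t set_t;         /* bit i set <=> NDFA state i is in the set */

/* Hypothetical NDFA for rules "a" and "ab" merged with epsilon moves:
 *   0 -eps-> 1, 0 -eps-> 3
 *   1 -a-> 2              (state 2 accepts rule "a")
 *   3 -a-> 4, 4 -b-> 5    (state 5 accepts rule "ab")
 */
static const set_t eps[NFA_STATES] = { (1u << 1) | (1u << 3), 0, 0, 0, 0, 0 };
static const set_t delta[NFA_STATES][ALPHABET] = {
    {0, 0}, {1u << 2, 0}, {0, 0}, {1u << 4, 0}, {0, 1u << 5}, {0, 0},
};

/* Adds every state reachable through epsilon moves alone. */
static set_t closure(set_t s)
{
    set_t prev;
    do {
        prev = s;
        for (int i = 0; i < NFA_STATES; ++i)
            if (s & (1u << i))
                s |= eps[i];
    } while (s != prev);
    return s;
}

/* States reachable from set s on one symbol, then epsilon-closed. */
static set_t move_on(set_t s, int sym)
{
    set_t out = 0;
    for (int i = 0; i < NFA_STATES; ++i)
        if (s & (1u << i))
            out |= delta[i][sym];
    return closure(out);
}

set_t dfa_sets[MAX_DFA];              /* NDFA-state set behind each DFA state */
int   dfa_edges[MAX_DFA][ALPHABET];   /* -1 means no edge (dead) */

/* Subset construction; returns the number of DFA states created. */
int build_dfa(void)
{
    int count = 0;
    dfa_sets[count++] = closure(1u << 0);
    for (int d = 0; d < count; ++d) {          /* unprocessed states form the tail */
        for (int sym = 0; sym < ALPHABET; ++sym) {
            set_t target = move_on(dfa_sets[d], sym);
            int t = -1;
            if (target != 0) {
                for (t = 0; t < count && dfa_sets[t] != target; ++t)
                    ;                          /* look for an existing DFA state */
                if (t == count)
                    dfa_sets[count++] = target;
            }
            dfa_edges[d][sym] = t;
        }
    }
    return count;
}
```

For this NDFA the construction yields three DFA states: {0,1,3}, {2,4}, and {5}. A DFA state that contains accepting NDFA states of several rules would take the action of the rule declared first, which is where the ambiguity rule enters the generated scanner.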
NDFA Construction
    if                                  { return IF; }
    [a-z][a-z0-9]*                      { return ID; }
    [0-9]+                              { return NUM; }
    [0-9]+"."[0-9]*|[0-9]*"."[0-9]+     { return REAL; }
    (\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
    .                                   { error(); }
DFA Construction
Minimization
    %{
    /* C declarations */
    #include "tokens.h"    /* Mapping of tokens into integers */
    #include "errormsg.h"  /* Shared by all the phases */
    union {int ival; string sval; double fval;} yylval;
    int charPos=1 ;
    #define ADJ (EM_tokPos=charPos, charPos+=yyleng)
    %}
    /* Lex Definitions */
    digits [0-9]+
    %%
    if                  { ADJ; return IF;}
    [a-z][a-z0-9]*      { ADJ; yylval.sval=String(yytext); return ID; }
    {digits}            { ADJ; yylval.ival=atoi(yytext); return NUM; }
    ({digits}\.{digits}?)|({digits}?\.{digits}) { ADJ; yylval.fval=atof(yytext); return REAL; }
    (\-\-[a-z]*\n)|([\n\t]|" ")* { ADJ; }
    .                   { ADJ; EM_error("illegal character"); }
Start States
• Regular expressions may be more complicated than automata
  – C comments
• Solutions
  – Conversion of automata into regular expressions
  – Start States
    %start s1 s2
    %%
    <INITIAL>r1   { action0 ; BEGIN s1; }
    <s1>r1        { action1 ; BEGIN s2; }
    <s2>r2        { action2 ; BEGIN INITIAL; }
Realistic Example
    %start Comment
    %%
    <INITIAL>"/*"   { BEGIN Comment; }
    <INITIAL>r1     { Usual actions; }
    <INITIAL>r2     { Usual actions; }
    ...
    <INITIAL>rk     { Usual actions; }
    <Comment>"*/"   { BEGIN INITIAL; }
    <Comment>.|\n   ;
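The effect of the two start states can be imitated by hand with a single state variable. The following C sketch is an assumption-laden illustration, not flex output: `strip_comments` is a hypothetical function, and copying a character to the output stands in for the "usual actions" of the INITIAL rules.

```c
#include <stddef.h>

enum start_state { INITIAL, COMMENT };

/* Copies src to dst while dropping C-style comments. dst must have
 * room for strlen(src) + 1 bytes. Returns the number of bytes written,
 * not counting the terminating NUL. */
size_t strip_comments(const char *src, char *dst)
{
    enum start_state state = INITIAL;
    size_t n = 0;

    while (*src != '\0') {
        if (state == INITIAL && src[0] == '/' && src[1] == '*') {
            state = COMMENT;      /* comment opener: switch start state */
            src += 2;
        } else if (state == COMMENT && src[0] == '*' && src[1] == '/') {
            state = INITIAL;      /* comment closer: back to normal rules */
            src += 2;
        } else if (state == COMMENT) {
            ++src;                /* any other character in a comment is skipped */
        } else {
            dst[n++] = *src++;    /* stands in for the usual actions */
        }
    }
    dst[n] = '\0';
    return n;
}
```

As with the flex rules, the comment closer is only recognized while the Comment state is active, so the two characters `*/` outside a comment are treated as ordinary input.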
Summary
• For most programming languages lexical analyzers can be easily constructed
• Exceptions:
  – Fortran
  – PL/1
• Flex is a useful tool beyond compilers