Advanced Compilers: Lexical Analysis
Fall 2017
Chungnam National Univ.
Eun-Sun Cho
Compiler Front-End Structure
Source code
-> Preprocessor: trivial error processing and directives (#include, #define, #ifdef ...)
-> preprocessed source code
-> Lexical Analysis
-> Syntax Analysis
-> Semantic Analysis
-> abstract syntax tree
Errors are reported by the analysis phases.
Lexical Analysis
Input: if (by == 0) ax = by;
-> Lexical analyzer (= scanner): reads the input character by character
Output: if ( by == 0 ) ax = by ;
• A given source program is treated as one "long" string.
• Reading each character in sequence, the lexical analyzer transforms what it has read into a stream of the smallest meaningful units (tokens).
• Whitespace is eliminated
– The result is usually shorter than the source code.
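The character-by-character reading described above can be sketched by hand in C. This is only a minimal illustration, not the scanner that lex generates: the function name `scan`, the limit `MAX_TOKEN_LEN`, and the small set of two-character operators are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 31   /* hypothetical limit chosen for this sketch */

/* Splits src into tokens while skipping whitespace.
 * Stores at most max_tokens tokens in out and returns how many were stored. */
size_t scan(const char *src, char out[][MAX_TOKEN_LEN + 1], size_t max_tokens)
{
    size_t count = 0;
    const char *p = src;

    assert(src != NULL && out != NULL);
    while (*p != '\0' && count < max_tokens) {
        const char *start;
        size_t len;

        if (isspace((unsigned char)*p)) {      /* whitespace is eliminated */
            p++;
            continue;
        }
        start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* identifier or keyword: a letter followed by letters or digits */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
        } else if (isdigit((unsigned char)*p)) {
            /* integer constant: one or more digits */
            while (isdigit((unsigned char)*p))
                p++;
        } else if ((*p == '=' || *p == '<' || *p == '>' || *p == '!')
                   && p[1] == '=') {
            p += 2;                             /* two-character operator such as == */
        } else {
            p++;                                /* any other character is a token by itself */
        }
        len = (size_t)(p - start);
        if (len > MAX_TOKEN_LEN)
            len = MAX_TOKEN_LEN;                /* truncate overly long tokens */
        memcpy(out[count], start, len);
        out[count][len] = '\0';
        count++;
    }
    return count;
}
```

Calling `scan` on the input `if (by == 0) ax = by;` with room for 16 tokens stores the ten units shown on the slide, from `if` through `;`.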
Syntax Analysis
• Check the syntax of the input program
• Check the role of each word (token)
eg) a Korean sentence: "똑똑한 우리들이 모였습니다." ("We, the smart ones, have gathered.")
sentence
– subject phrase: 똑똑한 우리들이
   – modifier: 똑똑한 ("smart")
   – head subject: 우리들이 ("we")
– verb phrase
   – verb: 모였습니다 ("have gathered")
Lexical Analysis
Lex
• A lexical analyzer generator, published in 1975
– Input: user-defined regular expressions and supporting code
– Output: a C program
• Generating the analyzer: lex input (regular expressions + α) -> lex -> lexical analyzer
• Running the analyzer: input stream (source program) -> lexical analyzer -> a series of tokens
• Build steps: lex input (test.l) -> lex -> lex.yy.c -> cc (linked with the lex library) -> executable (of the lexical analyzer)
Input for lex
<Definitions>
%{
definitions of data structures, variables and constants
for the resulting analyzer code
%}
definitions of names
: each name is assigned to a specific regular expression
%%
<Rules>
a rule = a regular expression (representing a token) +
an action (C code to be executed when the token is recognized)
%%
<User-defined functions>
: invoked in actions of <Rules>
%{
/* file name : example1.l
 * input: file name
 * output: the numbers of lines, words and characters */
#include <stdio.h>
#include <stdlib.h>
unsigned long charCount = 0, wordCount = 0, lineCount = 0;
%}
word [^ \t\n]+
eol  \n
%%
{word} { wordCount++; charCount += yyleng; }
{eol}  { charCount++; lineCount++; }
.      { charCount++; }
%%
int main(void) {
    FILE *file;
    char fn[20];
    printf("Type an input file name:");
    scanf("%19s", fn);   /* limit the input to the size of fn */
    file = fopen(fn, "r");
    if (!file) {
        fprintf(stderr, "file '%s' could not be opened. \n", fn);
        exit(1);
    }
    yyin = file;
    yylex();
    /* the counters are unsigned long, so they are printed with %lu */
    printf("%lu %lu %lu %s \n", lineCount, wordCount, charCount, fn);
    return 0;
}
int yywrap(void) { return 1; /* end of processing */ }
$ vi example1.l
$ ls
example1.l
$ lex example1.l
$ ls
example1.l lex.yy.c
$ cc lex.yy.c -o example1 -ll
$ ls
example1* example1.l lex.yy.c
$ vi XXX.c
$ example1
Type an input file name: XXX.c
20 60 300
Regular expressions for lex (the usual regular expressions plus extensions)
" : all the characters between " and " are treated as text characters
eg. a"*"b and a*b are different
\ : used to escape a single character
eg. XYZ"++", "XYZ++" and XYZ\+\+ are all the same
[] : defines a class of characters
eg. [abc] : one character among a, b and c
- : an operator representing a range (inside [])
eg. [a-z] : a lower-case character from a to z
^ : the complement of a character class (as the first character inside [])
eg. [^*] : any character except *
\ : C escape sequences are also recognized
eg. [ \t\n] : one of a blank, a tab or a newline character
* : repeating 0 or more times
eg. [a-zA-Z][a-zA-Z0-9]* : a regular expression for a variable name
+ : repeating 1 or more times
eg. [a-z]+ : a regular expression for all non-empty strings of lower-case letters
? : an optional element
eg. ab?c : either abc or ac
| : choice operator
eg. (ab|cd) : either ab or cd
(ab|cd+)?(ef)* : abefef, efefef, cdef, cddd, …
^ : at the beginning of a line
eg. ^abc : recognizes abc only if it appears at the beginning of the line
$ : at the end of a line
. : any character except the newline character
eg. "--".* : from -- to the end of the line
{} : uses a name in place of the corresponding regular expression
Lex Expression    Matches
abc               abc
abc*              ab, abc, abcc, abccc, ...
abc+              abc, abcc, abccc, abcccc, ...
a(bc)+            abc, abcbc, abcbcbc, ...
a(bc)?            a, abc
[abc]             one of: a, b, c
[a-z]             any letter, a through z
[a\-z]            one of: a, -, z
[-az]             one of: -, a, z
[A-Za-z0-9]+      one or more alphanumeric characters
[ \t\n]+          whitespace
[^ab]             anything except: a, b
[a^b]             one of: a, ^, b
[a|b]             one of: a, |, b
a|b               one of: a, b
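Several rows of the table can be checked mechanically. The sketch below uses the POSIX `<regex.h>` interface rather than lex itself, so it is only an approximation: POSIX extended regular expressions agree with lex on the rows tested here, but they do not treat a backslash inside `[]` as an escape, so a row such as `[a\-z]` cannot be checked this way. The helper name `matches_whole` is an assumption made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the whole of text matches the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern is too
 * long for the local buffer or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* "^(" + pattern + ")$" + terminating NUL needs strlen(pattern) + 5 bytes */
    if (strlen(pattern) + 5 > sizeof anchored)
        return -1;
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `matches_whole("abc*", "ab")` returns 1 while `matches_whole("abc+", "ab")` returns 0, which mirrors the difference between the `abc*` and `abc+` rows.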
Recognizing Regular Expressions
"<=" { … }
"<>" { … }
"<"  { … }
"="  { … }
">=" { … }
">"  { … }
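When several of these patterns match at the same position, lex prefers the longest match, so `<=` is recognized as one token rather than `<` followed by `=`. A hand-written sketch of that decision in C follows; the enum `relop` and the function `match_relop` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes for the six relational operators. */
enum relop { REL_NONE, REL_LT, REL_LE, REL_NE, REL_EQ, REL_GT, REL_GE };

/* Recognizes the relational operator at the start of s, preferring the
 * longest match. Stores the number of characters consumed in *len. */
enum relop match_relop(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    *len = 0;
    switch (s[0]) {
    case '<':
        /* look one character ahead before settling for the short token */
        if (s[1] == '=') { *len = 2; return REL_LE; }
        if (s[1] == '>') { *len = 2; return REL_NE; }
        *len = 1;
        return REL_LT;
    case '=':
        *len = 1;
        return REL_EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return REL_GE; }
        *len = 1;
        return REL_GT;
    default:
        return REL_NONE;   /* not a relational operator */
    }
}
```

The one-character lookahead in the `<` and `>` cases plays the role of the extra states a finite automaton would need to tell `<`, `<=` and `<>` apart.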
Data Structure for Tokens
"Lexeme": the representation of each token in the lexical analyzer
• token number
– internal (unique) number for a token, for efficient processing
• token value
– valid only if the token has a "value" that the programmer created
– token value for an identifier: the matched string
– token value for a constant: the constant value
eg. if X < Y then X := 10;
(29,0) (1,X) (18,0) (1,Y) (35,0) (1,X) (9,0) (2,10) (7,0)
• The identifier itself may instead be kept in the symbol table, with the token value referring to its entry
eg. (1,10) : X   (1,12) : Y
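The (token number, token value) pairs above could be represented in C roughly as follows. This is a sketch under assumptions: the token numbers are read off the example by position, while the enum names, `MAX_LEXEME`, and the three constructor functions are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEXEME 31   /* hypothetical limit on identifier length */

/* Token numbers as they appear in the example; the names are made up here. */
enum token_number {
    TOK_IDENT  = 1,
    TOK_CONST  = 2,
    TOK_SEMI   = 7,
    TOK_ASSIGN = 9,
    TOK_LT     = 18,
    TOK_IF     = 29,
    TOK_THEN   = 35
};

struct token {
    int number;                       /* internal token number */
    union {
        char name[MAX_LEXEME + 1];    /* identifier: the matched string */
        long constant;                /* constant: its value */
    } value;                          /* all zero for tokens without a value */
};

/* Keyword, operator or punctuation: the value is unused and left as 0. */
struct token make_simple(int number)
{
    struct token t;
    memset(&t, 0, sizeof t);
    t.number = number;
    return t;
}

/* Identifier: keeps a (possibly truncated) copy of the matched string. */
struct token make_ident(const char *name)
{
    struct token t;
    assert(name != NULL);
    memset(&t, 0, sizeof t);
    t.number = TOK_IDENT;
    strncpy(t.value.name, name, MAX_LEXEME);   /* memset keeps it terminated */
    return t;
}

/* Constant: keeps the numeric value. */
struct token make_const(long v)
{
    struct token t;
    memset(&t, 0, sizeof t);
    t.number = TOK_CONST;
    t.value.constant = v;
    return t;
}
```

With these helpers, `make_ident("X")` corresponds to (1,X), `make_const(10)` to (2,10), and `make_simple(TOK_IF)` to (29,0). If identifiers are kept in a symbol table instead, the `name` member would be replaced by an index into that table.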
%{
/* calc.lex */
#include "global.h"
#include "calc.h"
#include <stdlib.h>
%}
white [ \t]+
digit [0-9]
integer {digit}+
exponent [eE]([+-])?{integer}
real {integer}("."{integer})?({exponent})?
%%
{white} {}
{real} { yylval=atof(yytext);
return(NUMBER); }
"+" { return(PLUS); }
"-" { return(MINUS);}
"*" { return(TIMES); }
"/" { return(DIVIDE); }
"^" { return(POWER); }
"(" { return(LEFT_PARENTHESIS); }
")" { return(RIGHT_PARENTHESIS);
}
"\n" { return(END); }
%%
int yywrap(void) {
return 1;
}
• What is the main difference from the previous word-count example?
• yylval = the token value
• yytext = the matched string
References
• Text: Lex & Yacc, 2nd Edition, John R. Levine, Tony Mason, Doug Brown, O'Reilly, 1992
• Flex & Bison, John R. Levine, O'Reilly, 2009
More on Regular Expressions
Regular Expressions
Note: Other places where regular expressions are used (1)
• grep, one of the Unix commands
grep smug files          {search files for lines with 'smug'}
grep '^smug' files       {'smug' at the start of a line}
grep 'smug$' files       {'smug' at the end of a line}
grep '^smug$' files      {lines containing only 'smug'}
grep '\^s' files         {lines containing '^s'; "\" escapes the ^}
grep '[Ss]mug' files     {search for 'Smug' or 'smug'}
grep 'B[oO][bB]' files   {search for BOB, Bob, BOb or BoB}
grep '^$' files          {search for blank lines}
grep '[0-9][0-9]' files  {search for pairs of numeric digits}
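The per-line test that grep applies can be approximated in C with POSIX `<regex.h>`. Compiling without `REG_EXTENDED` selects basic regular expressions, the default syntax of plain grep. The function name `line_matches` is an assumption for this sketch, and reading lines from files is left out.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if line contains a match for the POSIX basic regular
 * expression pattern, 0 if it does not, and -1 if the pattern does not
 * compile. line is expected to hold one line without its newline. */
int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && line != NULL);
    /* no REG_EXTENDED: basic regular expressions, as in plain grep */
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, `line_matches("^smug$", "smug")` returns 1 and `line_matches("^smug$", "smugly")` returns 0, matching the "lines containing only 'smug'" entry.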
http://www.robelle.com/smugbook/regexpr.html
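Note: Other places where regular expressions are used (2)
• In JavaScript
– Used by methods such as exec, test, match, and search
– Methods that use regular expressions (partial list)
The g flag (/g) means: do not stop after the first match; find every match.
(http://www.w3schools.com/jsref/jsref_regexp_g.asp)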
Note: Other places where regular expressions are used (3)
• In Java
Package java.util.regex
Java Example
import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            // No interactive console is available: report the error and exit.
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            // Each call to find() advances to the next match in the input.
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at "
                    + "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}
Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
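The repeated-search loop behind this output can also be sketched in C with POSIX `<regex.h>`. This is an analogue of the Java harness, not a port of it: POSIX extended syntax differs from `java.util.regex` in places, and the `span` struct and `find_all` function are names assumed for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Start (inclusive) and end (exclusive) offsets of one match. */
struct span { size_t start; size_t end; };

/* Finds successive non-overlapping matches of the POSIX extended regular
 * expression pattern in input. Stores at most max spans in out and
 * returns the number stored, or -1 if the pattern does not compile. */
int find_all(const char *pattern, const char *input, struct span *out, int max)
{
    regex_t re;
    regmatch_t m;
    size_t pos = 0;
    int count = 0;

    assert(pattern != NULL && input != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* After the first search, pos is no longer the beginning of the line,
     * so REG_NOTBOL keeps ^ from matching there. */
    while (count < max &&
           regexec(&re, input + pos, 1, &m, pos > 0 ? REG_NOTBOL : 0) == 0) {
        out[count].start = pos + (size_t)m.rm_so;
        out[count].end = pos + (size_t)m.rm_eo;
        count++;
        if (m.rm_eo == m.rm_so) {
            /* empty match: step over one character to guarantee progress */
            if (input[pos + (size_t)m.rm_eo] == '\0')
                break;
            pos += (size_t)m.rm_eo + 1;
        } else {
            pos += (size_t)m.rm_eo;   /* resume right after this match */
        }
    }
    regfree(&re);
    return count;
}
```

Searching `ababaaaab` for `a+` with this function yields the spans (0,1), (2,3) and (4,8), the same indices the harness reports.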
Lexical Analyzer in Java
https://inst.eecs.berkeley.edu/~cs164/sp11/lectures/lecture2/Lexer.java