Advanced Compilers Lexical Analysis - CNUplas.cnu.ac.kr/courses/2017f/a_compilers/ac 2 lexical... · 2017-08-31 · 렉스(Lex) • A lexical analyzer generator : published in 1975

Advanced Compilers: Lexical Analysis

Fall 2017

Chungnam National Univ.

Eun-Sun Cho

Compiler Front-End Structure

Source code
  → Preprocessor: trivial processing (#include, #define, #ifdef ...)
  → preprocessed source code
  → Lexical Analysis
  → Syntax Analysis
  → Semantic Analysis
  → abstract syntax tree

The analysis stages report errors.

Lexical Analysis

if (by == 0) ax = by;
  → Lexical analyzer (= scanner), reading char by char
  → if ( by == 0 ) ax = by ;

• A given source program is considered as one “long” string.
• Reading each character in sequence, a lexical analyzer transforms what it reads into a stream of “meaningful, smallest units” (tokens).
• Spaces are eliminated
  – The result is usually shorter than the source code.
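The character-by-character reading described above can be sketched by hand. This is a minimal illustration, not the scanner used in the course: the class name SimpleScanner and the small set of recognized lexemes (names, numbers, == and single characters) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleScanner {
    // Split the source into lexemes by reading one character at a time.
    public static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                            // spaces are eliminated
            } else if (Character.isLetterOrDigit(c) || c == '_') {
                int start = i;                  // identifier, keyword or number
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i))
                            || src.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(src.substring(start, i));
            } else if (c == '=' && i + 1 < src.length()
                    && src.charAt(i + 1) == '=') {
                tokens.add("==");               // two-character operator
                i += 2;
            } else {
                tokens.add(String.valueOf(c));  // any other single character
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [if, (, by, ==, 0, ), ax, =, by, ;]
        System.out.println(scan("if (by == 0) ax = by;"));
    }
}
```

The ten resulting lexemes correspond to the spaced-out form of the statement shown on the slide.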

Syntax Analysis

• Check the syntax of the input program
• Check the role of each word (token)

e.g., a Korean sentence: “똑똑한 우리들이 모였습니다.” (“We smart ones have gathered.”)

sentence
  subject phrase: 똑똑한 우리들이
    modifier: 똑똑한 (“smart”)
    head subject: 우리들이 (“we”)
  verb phrase: 모였습니다
    verb: 모였습니다 (“have gathered”)

Lexical Analysis

Lex

• A lexical analyzer generator, published in 1975
  – Input: user-defined regular expressions and supporting code
  – Output: a C program

Input stream (source program) → Lexical analyzer → A series of tokens
lex input (regular expressions + α) → lex → Lexical analyzer

lex input (test.l) → lex → lex.yy.c → cc (with the lex library) → Executable (of the lexical analyzer)

Input for lex

<Definitions>
%{
  definitions of data structures, variables and constants
  for the resulting analyzer code
%}
definition of names
  : each name is assigned to a specific regular expression
%%
<Rules>
  a rule = a regular expression (representing a token) +
           an action (C code to be executed when the token is recognized)
%%
<User-defined functions>
  : invoked in actions of <Rules>

%{
/* file name: example1.l
 * input: file name
 * output: the numbers of lines, words and characters */
#include <stdio.h>
#include <stdlib.h>
unsigned long charCount = 0, wordCount = 0, lineCount = 0;
%}
word [^ \t\n]+
eol  \n

%%

{word}  { wordCount++; charCount += yyleng; }
{eol}   { charCount++; lineCount++; }
.       { charCount++; }

%%

int main(void) {
    FILE *file;
    char fn[20];
    printf("Type an input file name: ");
    if (scanf("%19s", fn) != 1)   /* bounded read: fn holds at most 19 characters */
        exit(1);
    file = fopen(fn, "r");
    if (!file) {
        fprintf(stderr, "file '%s' could not be opened.\n", fn);
        exit(1);
    }
    yyin = file;
    yylex();
    printf("%lu %lu %lu %s\n", lineCount, wordCount, charCount, fn);
    return 0;
}

int yywrap(void) { return 1; /* end of processing */ }

$ vi example1.l
$ ls
example1.l
$ lex example1.l
$ ls
example1.l lex.yy.c
$ cc lex.yy.c -o example1 -ll
$ ls
example1* example1.l lex.yy.c
$ vi XXX.c
$ example1
Type an input file name: XXX.c
20 60 300 XXX.c

Regular expressions for lex (usual regular expressions + α)

" : all the characters between a pair of " are treated as text characters
    e.g. a"*"b and a*b are different
\ : escapes a single character
    e.g. XYZ"++", "XYZ++" and XYZ\+\+ are all the same
[] : defines a class of characters
    e.g. [abc] : one character among a, b and c
- : an operator representing a range
    e.g. [a-z] : a lowercase character from a to z
^ : a complementary set
    e.g. [^*] : any character except *
\ : C escape sequences
    e.g. [ \t\n] : one of a blank, a tab or a newline character

10

* : repeating 0 or more timeseg. [a-zA-Z][a-zA-Z0-9]* : a regular expression for a variable name

+ : repeating 1 or more timeseg. [a-z]+ : a regular expression for all the string of lower-case letters

? : an optional elementeg. ab?c : either abc or ac

| : choice operatoreg. (ab | cd) : either ab or cd

(ab | cd+)?(ef)* : abefef, efefef, cdef, cddd, …

^ : at the begining of a lineeg. ^abc : recognizes abc only if it appears at the beginning of the line

$ : at the end of a line

. : any character except the newline charactereg. “- -”.* : from - - to the end of the line

{} : using a name instead of the corresponding regular expression

11

Lex Expression    Matches
abc               abc
abc*              ab, abc, abcc, abccc, ...
abc+              abc, abcc, abccc, abcccc, ...
a(bc)+            abc, abcbc, abcbcbc, ...
a(bc)?            a, abc
[abc]             one of: a, b, c
[a-z]             any lowercase letter, a through z
[a\-z]            one of: a, -, z
[-az]             one of: -, a, z
[A-Za-z0-9]+      one or more alphanumeric characters
[ \t\n]+          whitespace
[^ab]             any character except: a, b
[a^b]             one of: a, ^, b
[a|b]             one of: a, |, b
a|b               one of: a, b
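Several rows of the table can be checked mechanically. The sketch below uses java.util.regex rather than lex itself; for the operators in these rows the two notations agree, but lex-only features such as "..." quoting and {name} references do not carry over. The class name LexPatternDemo and its helper are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class LexPatternDemo {
    // True if the whole string s is matched by the regular expression.
    public static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(matches("abc*", "ab"));       // true
        System.out.println(matches("abc+", "ab"));       // false
        System.out.println(matches("a(bc)+", "abcbc"));  // true
        System.out.println(matches("a(bc)?", "a"));      // true
        System.out.println(matches("[a\\-z]", "-"));     // true
        System.out.println(matches("[a\\-z]", "m"));     // false
        System.out.println(matches("[-az]", "-"));       // true
        System.out.println(matches("[^ab]", "c"));       // true
        System.out.println(matches("[a^b]", "^"));       // true
        System.out.println(matches("[a|b]", "|"));       // true
        System.out.println(matches("a|b", "|"));         // false
    }
}
```

The last two lines show why position matters: inside brackets, | is an ordinary character, while outside it is the choice operator.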

Recognizing Regular Expressions

"<="  { … }
"<>"  { … }
"<"   { … }
"="   { … }
">="  { … }
">"   { … }
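When several rules can match at the same position, lex picks the one that matches the longest text, and among equally long matches the rule listed first. The sketch below imitates that choice for the six relational operators; the rule names (LE, NE, ...) and the class LongestMatch are hypothetical, and the loop is a plain illustration rather than the automaton lex actually generates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatch {
    // Rules in priority order, like the rule list of a lex specification.
    private static final String[] NAMES = {"LE", "NE", "LT", "EQ", "GE", "GT"};
    private static final Pattern[] RULES = {
        Pattern.compile("<="), Pattern.compile("<>"), Pattern.compile("<"),
        Pattern.compile("="),  Pattern.compile(">="), Pattern.compile(">")
    };

    // Return the name of the rule matching the longest prefix of input
    // starting at pos; ties go to the earlier rule. Null if no rule matches.
    public static String match(String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (int r = 0; r < RULES.length; r++) {
            Matcher m = RULES[r].matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                best = NAMES[r];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(match("<=", 0));   // LE
        System.out.println(match("<>", 0));   // NE
        System.out.println(match("< ", 0));   // LT
        System.out.println(match("a>=b", 1)); // GE
    }
}
```

Because the longest match wins, the result for "<=" would still be LE even if the rule for "<" were listed first.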

Data Structure for Tokens

“Lexeme”: the representation of each token in the lexical analyzer

• token number
  – an internal (unique) number for a token, for efficient processing
• token value
  – valid only if a token has a “value” that the programmer created
    token value for an identifier: the matched string
    token value for a constant: the constant value

e.g. if X < Y then X := 10;

(29,0) (1,X) (18,0) (1,Y) (35,0) (1,X) (9,0) (2,10) (7,0)

The identifier itself may instead be kept in the symbol table, with the token value referring to its entry:
(1,10) : X    (1,12) : Y    (X and Y are the lexemes)
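The (token number, token value) pairs of the example can be reproduced with a short sketch. The token numbers are copied from the slide; the class TokenPairs, its methods and the regular expression are assumptions made for this illustration, and the value of an identifier is taken to be the matched string, not a symbol-table index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPairs {
    // Token numbers as used in the slide's example (1 = identifier, 2 = constant).
    public static int tokenNumber(String lexeme) {
        switch (lexeme) {
            case "if":   return 29;
            case "then": return 35;
            case "<":    return 18;
            case ":=":   return 9;
            case ";":    return 7;
            default:
                return Character.isDigit(lexeme.charAt(0)) ? 2 : 1;
        }
    }

    // Produce "(number,value)" pairs; the value is 0 for tokens without one.
    // Characters that fit no pattern (such as spaces) are silently skipped.
    public static List<String> tokenize(String src) {
        Pattern p = Pattern.compile("[A-Za-z][A-Za-z0-9]*|[0-9]+|:=|<|;");
        Matcher m = p.matcher(src);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            String lexeme = m.group();
            int number = tokenNumber(lexeme);
            String value = (number == 1 || number == 2) ? lexeme : "0";
            out.add("(" + number + "," + value + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints (29,0) (1,X) (18,0) (1,Y) (35,0) (1,X) (9,0) (2,10) (7,0)
        System.out.println(String.join(" ", tokenize("if X < Y then X :=10;")));
    }
}
```

Keywords are matched by the identifier pattern first and then separated out by the lookup in tokenNumber, a common alternative to writing one pattern per keyword.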

%{
/* calc.lex */
#include "global.h"
#include "calc.h"
#include <stdlib.h>
%}

white     [ \t]+
digit     [0-9]
integer   {digit}+
exponent  [eE]([+-])?{integer}
real      {integer}("."{integer})?({exponent})?

%%

{white}  {}
{real}   { yylval = atof(yytext);
           return(NUMBER); }
"+"      { return(PLUS); }
"-"      { return(MINUS); }
"*"      { return(TIMES); }
"/"      { return(DIVIDE); }
"^"      { return(POWER); }
"("      { return(LEFT_PARENTHESIS); }
")"      { return(RIGHT_PARENTHESIS); }
"\n"     { return(END); }

%%

int yywrap(void) {
    return 1;
}

What is the main difference from the previous wordcount example?

yylval = the token value
yytext = the matched string
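The named definitions of calc.lex compose into one pattern for {real}. The sketch below rebuilds that pattern with java.util.regex and converts a matched string to a number, playing the role that atof(yytext) plays in the action. The class RealPattern and its method are assumptions made for this example, not part of the calculator.

```java
import java.util.regex.Pattern;

public class RealPattern {
    // Same structure as the lex definitions: integer, exponent, real.
    static final String INTEGER  = "[0-9]+";
    static final String EXPONENT = "[eE][+-]?" + INTEGER;
    static final Pattern REAL =
        Pattern.compile(INTEGER + "(\\." + INTEGER + ")?(" + EXPONENT + ")?");

    // Return the numeric value if text is a complete {real} lexeme, else null.
    public static Double value(String text) {
        return REAL.matcher(text).matches() ? Double.valueOf(text) : null;
    }

    public static void main(String[] args) {
        System.out.println(value("42"));    // 42.0
        System.out.println(value("3.14"));  // 3.14
        System.out.println(value("1e-3"));  // 0.001
        System.out.println(value(".5"));    // null
        System.out.println(value("2."));    // null
    }
}
```

The last two inputs are rejected because {real} requires digits on both sides of the decimal point.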

References

• Text:
  Lex & Yacc, 2nd Edition, John R. Levine, Tony Mason, Doug Brown, O'Reilly, 1992
  Flex & Bison, John R. Levine, O'Reilly, 2009

More on Regular Expressions

Regular Expressions

Note: other places where regular expressions are used (1)

• The Unix command grep

grep smug files {search files for lines with 'smug'}

grep '^smug' files {'smug' at the start of a line}

grep 'smug$' files {'smug' at the end of a line}

grep '^smug$' files {lines containing only 'smug'}

grep '\^s' files {lines starting with '^s', "\" escapes the ^}

grep '[Ss]mug' files {search for 'Smug' or 'smug'}

grep 'B[oO][bB]' files {search for BOB, Bob, BOb or BoB }

grep '^$' files {search for blank lines}

grep '[0-9][0-9]' file {search for pairs of numeric digits}


http://www.robelle.com/smugbook/regexpr.html
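The same searches can be tried without a shell. The sketch below keeps the lines in which a pattern occurs, which is the core of what grep does; it uses java.util.regex, whose syntax agrees with grep for the anchors and character classes in the examples above but differs elsewhere. The class MiniGrep and the sample lines are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MiniGrep {
    // Keep the lines in which the pattern occurs somewhere, as grep does.
    public static List<String> grep(String regex, List<String> lines) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines =
            Arrays.asList("smug", "a smug cat", "Smug grin", "", "room 42");
        System.out.println(grep("smug", lines));       // [smug, a smug cat]
        System.out.println(grep("^smug$", lines));     // [smug]
        System.out.println(grep("[Ss]mug", lines));    // [smug, a smug cat, Smug grin]
        System.out.println(grep("^$", lines).size());  // 1 (the blank line)
        System.out.println(grep("[0-9][0-9]", lines)); // [room 42]
    }
}
```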


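• In JavaScript
  – Regular expressions are used by methods such as exec, test, match and search
  – Methods that use regular expressions (partial list)

The /g flag means: do not stop after the first match; find every match.
(http://www.w3schools.com/jsref/jsref_regexp_g.asp)

• In Java

Package java.util.regex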

Java Example

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            // error: no console available, so exit
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            // find() advances to the next match on each call
            while (matcher.find()) {
                console.format("I found the text" + " \"%s\" starting at "
                    + "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}


Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.

Lexical Analyzer in Java

https://inst.eecs.berkeley.edu/~cs164/sp11/lectures/lecture2/Lexer.java
