Advanced Compilers Lexical Analysis - CNUplas.cnu.ac.kr/courses/2016f/a_compilers/ac 2 lexical analysis 2016.pdf


Advanced Compilers Lexical Analysis

Fall. 2016

Chungnam National Univ.

Eun-Sun Cho

Compiler Front-End Structure

[Figure: Source code → preprocessor (trivial processing: #include, #define, #ifdef ...) → preprocessed source code → Lexical Analysis → Syntax Analysis → Semantic Analysis → abstract syntax tree. Errors are reported along the way.]

Lexical Analysis

Lexical analyzer (= scanner)

Input, read character by character: if (by == 0) ax = by;

Output, a stream of tokens: if ( by == 0 ) ax = by ;

• A given source program is treated as one long string.

• Reading the characters in sequence, the lexical analyzer transforms what it reads into a stream of the smallest meaningful units (tokens).

• Whitespace is eliminated, so the result is usually shorter than the source code.
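The character-by-character reading described above can be sketched by hand in Java. This is only an illustration, not how lex does it: the class name `TinyScanner` and the method `scan` are hypothetical, and the sketch knows just enough (words, numbers, `==`, single-character symbols) to handle the slide's sample statement.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyScanner {
    /** Splits the input into tokens, looking at one character at a time. */
    public static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                               // whitespace separates tokens and is dropped
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;                     // keyword, identifier, or number
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));
            } else if (c == '=' && i + 1 < src.length() && src.charAt(i + 1) == '=') {
                tokens.add("==");                  // two-character operator
                i += 2;
            } else {
                tokens.add(String.valueOf(c));     // any other single character
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [if, (, by, ==, 0, ), ax, =, by, ;]
        System.out.println(scan("if (by == 0) ax = by;"));
    }
}
```

The ten resulting tokens carry no whitespace, which is why the token stream is shorter than the source text.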

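Syntax Analysis

• Check the syntax of the input program

• Check the role of each word (token)

e.g. a Korean sentence: “똑똑한 우리들이 모였습니다.” (“We, the smart ones, have gathered.”)

[Figure: parse tree. Sentence → subject phrase + verb phrase; subject phrase → modifier (똑똑한, “smart”) + head subject (우리들이, “we”); verb phrase → verb (모였습니다, “have gathered”).]

Lexical Analysis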

Lex

• A lexical analyzer generator, published in 1975

– Input: user-defined regular expressions and supporting code

– Output: a C program

[Figure: lex input (regular expressions + α), e.g. test.l → lex → lex.yy.c → cc, linked with the lex library → executable lexical analyzer. At run time: input stream (source program) → lexical analyzer → a series of tokens.]

Input for lex

<Definitions>

%{
definitions of data structures, variables, and constants for the generated analyzer code
%}

definition of names
: each name is assigned a specific regular expression

%%

<Rules>

a rule = a regular expression (representing a token) + an action (C code executed when the token is recognized)

%%

<User-defined functions>

: invoked from the actions in <Rules>

%{
/* file name: example1.l
 * input: file name
 * output: the numbers of lines, words, and characters */
#include <stdio.h>
#include <stdlib.h>
unsigned long charCount = 0, wordCount = 0, lineCount = 0;
%}
word [^ \t\n]+
eol  \n

%%

{word} { wordCount++; charCount += yyleng; }
{eol}  { charCount++; lineCount++; }
.      { charCount++; }

%%

int main(void) {
    FILE *file;
    char fn[20];
    printf("Type an input file name:");
    scanf("%19s", fn);   /* bounded read: fn holds at most 19 characters */
    file = fopen(fn, "r");
    if (!file) {
        fprintf(stderr, "file '%s' could not be opened. \n", fn);
        exit(1);
    }
    yyin = file;
    yylex();
    printf("%lu %lu %lu %s \n", lineCount, wordCount, charCount, fn);
    return 0;
}

int yywrap(void) { return 1; /* end of processing */ }

$ vi example1.l
$ ls
example1.l
$ lex example1.l
$ ls
example1.l lex.yy.c
$ cc lex.yy.c -o example1 -ll
$ ls
example1* example1.l lex.yy.c
$ vi XXX.c
$ example1
Type an input file name: XXX.c
20 60 300 XXX.c

Regular expressions for lex (the usual regular expressions, plus extensions)

" : all characters between a pair of double quotes are treated as literal text
    e.g. a"*"b and a*b are different

\ : escapes a single character
    e.g. XYZ"++", "XYZ++", and XYZ\+\+ all match the same text

[] : defines a class of characters
    e.g. [abc] : one character among a, b, and c

- : a range operator (inside a character class)
    e.g. [a-z] : a lower-case letter from a to z

^ : a complement set (as the first character inside a character class)
    e.g. [^*] : any character except *

\ : also introduces C escape sequences
    e.g. [ \t\n] : a blank, a tab, or a newline character

* : repeats 0 or more times
    e.g. [a-zA-Z][a-zA-Z0-9]* : a regular expression for a variable name

+ : repeats 1 or more times
    e.g. [a-z]+ : a regular expression for all strings of lower-case letters

? : an optional element
    e.g. ab?c : either abc or ac

| : choice operator
    e.g. (ab|cd) : either ab or cd
         (ab|cd+)?(ef)* : abefef, efefef, cdef, cddd, ...

^ : at the beginning of a line
    e.g. ^abc : recognizes abc only if it appears at the beginning of the line

$ : at the end of a line

. : any character except the newline character
    e.g. "--".* : from -- to the end of the line

{} : uses a name in place of the corresponding regular expression
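A few of the operators above can be tried out with `java.util.regex`, which shares `*`, `+`, `?`, `|`, and `[]` with lex. This is a sketch under that assumption only: Java has no lex-style `"..."` quoting or `{name}` substitution, so the quoted `a"*"b` is written here with a backslash escape instead. The class name `RegexOperators` and the helper `matches` are hypothetical.

```java
import java.util.regex.Pattern;

public class RegexOperators {
    /** True if the whole string s matches the regular expression. */
    public static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String ident = "[a-zA-Z][a-zA-Z0-9]*";       // variable name
        System.out.println(matches(ident, "x1"));    // true
        System.out.println(matches(ident, "1x"));    // false: must start with a letter

        System.out.println(matches("ab?c", "ac"));   // true: b is optional
        System.out.println(matches("ab?c", "abbc")); // false: at most one b

        String mixed = "(ab|cd+)?(ef)*";
        System.out.println(matches(mixed, "abefef")); // true
        System.out.println(matches(mixed, "cddd"));   // true: d repeats, (ef)* matches nothing
        System.out.println(matches(mixed, "abcd"));   // false: only one of ab, cd+ is allowed

        // A literal star versus the repetition operator.
        System.out.println(matches("a\\*b", "a*b"));  // true: escaped * is a plain character
        System.out.println(matches("a*b", "aab"));    // true: zero or more a, then b
    }
}
```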

Lex Expression    Matches

abc               abc
abc*              ab, abc, abcc, abccc, ...
abc+              abc, abcc, abccc, abcccc, ...
a(bc)+            abc, abcbc, abcbcbc, ...
a(bc)?            a, abc
[abc]             one of: a, b, c
[a-z]             any letter, a through z
[a\-z]            one of: a, -, z
[-az]             one of: -, a, z
[A-Za-z0-9]+      one or more alphanumeric characters
[ \t\n]+          whitespace
[^ab]             anything except: a, b
[a^b]             one of: a, ^, b
[a|b]             one of: a, |, b
a|b               one of: a, b

Recognizing Regular Expressions


<= { … }

<> { … }

< { … }

= { … }

>= { … }

> { … }
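When several of these operators could start at the same position, lex picks the longest match, so `<=` is recognized as one token rather than `<` followed by `=`. A minimal hand-written sketch of that decision, with one character of lookahead, might look as follows; the class name `RelopScanner` and the method `relopAt` are hypothetical and are not part of lex.

```java
public class RelopScanner {
    /** Returns the longest relational operator starting at pos, or null if there is none. */
    public static String relopAt(String s, int pos) {
        if (pos >= s.length()) {
            return null;
        }
        char c = s.charAt(pos);
        // One character of lookahead; '\0' stands for "no next character".
        char next = pos + 1 < s.length() ? s.charAt(pos + 1) : '\0';
        switch (c) {
            case '<':
                if (next == '=') return "<=";
                if (next == '>') return "<>";
                return "<";
            case '>':
                return next == '=' ? ">=" : ">";
            case '=':
                return "=";
            default:
                return null;   // not a relational operator
        }
    }

    public static void main(String[] args) {
        System.out.println(relopAt("a<=b", 1)); // <=
        System.out.println(relopAt("a<b", 1));  // <
        System.out.println(relopAt("a<>b", 1)); // <>
    }
}
```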

Data Structure for Tokens

“Lexeme”: the representation of each token inside the lexical analyzer

• token number

– internal (unique) number for a token, for efficient processing

• token value

– valid only if the token carries a “value” that the programmer created

– token value for an identifier: the matched string

– token value for a constant: the constant value

e.g. if X < Y then X := 10;

(29,0) (1,X) (18,0) (1,Y) (35,0) (1,X) (9,0) (2,10) (7,0)

The identifier's lexeme may instead be kept in the symbol table, with the token value referring to its entry: (1,10) : X, (1,12) : Y
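The (token number, token value) pairs above can be reproduced with a short Java sketch. The token numbers (29 for if, 35 for then, 18 for <, 9 for :=, 7 for ;, 1 for identifiers, 2 for constants) are copied from the slide's example; the class name `TokenPairs`, the method `tokenize`, and the regular expression are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPairs {
    // Token numbers taken from the slide's example; a real compiler defines its own table.
    static final int IDENT = 1;
    static final int CONSTANT = 2;
    static final Map<String, Integer> FIXED = new HashMap<>();
    static {
        FIXED.put("if", 29);
        FIXED.put("then", 35);
        FIXED.put("<", 18);
        FIXED.put(":=", 9);
        FIXED.put(";", 7);
    }

    // group 1: word (keyword or identifier), group 2: number, group 3: operator or punctuation
    static final Pattern TOKEN =
        Pattern.compile("\\s*(?:([A-Za-z][A-Za-z0-9]*)|(\\d+)|(:=|<|;))");

    /** Returns each token as a "(number,value)" string; unknown characters are skipped. */
    public static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            if (m.group(1) != null) {
                Integer keyword = FIXED.get(m.group(1));
                out.add(keyword != null
                    ? "(" + keyword + ",0)"                  // keyword: no value
                    : "(" + IDENT + "," + m.group(1) + ")"); // identifier: matched string
            } else if (m.group(2) != null) {
                out.add("(" + CONSTANT + "," + m.group(2) + ")"); // constant: its value
            } else {
                out.add("(" + FIXED.get(m.group(3)) + ",0)");     // operator: no value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints (29,0) (1,X) (18,0) (1,Y) (35,0) (1,X) (9,0) (2,10) (7,0)
        System.out.println(String.join(" ", tokenize("if X < Y then X :=10;")));
    }
}
```

Keywords and operators get the value 0 because their token number already identifies them completely.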

%{
/* calc.lex */
#include "global.h"
#include "calc.h"
#include <stdlib.h>
%}

white    [ \t]+
digit    [0-9]
integer  {digit}+
exponent [eE]([+-])?{integer}
real     {integer}("."{integer})?({exponent})?

%%

{white} {}
{real}  { yylval = atof(yytext); return(NUMBER); }
"+"     { return(PLUS); }
"-"     { return(MINUS); }
"*"     { return(TIMES); }
"/"     { return(DIVIDE); }
"^"     { return(POWER); }
"("     { return(LEFT_PARENTHESIS); }
")"     { return(RIGHT_PARENTHESIS); }
"\n"    { return(END); }

%%

int yywrap(void) {
    return 1;
}

What is the main difference from the previous word-count example?

Hint: check where the return statements appear.

yylval = the token value

yytext = the matched string

References

• Text: John R. Levine, Tony Mason, Doug Brown, Lex & Yacc, 2nd Edition, O'Reilly, 1992

• Examples: http://myweb.stedwards.edu/laurab/cosc4342/lex-examples.html

• lex built-in functions, etc.: http://www.tldp.org/HOWTO/Lex-YACC-HOWTO-3.html


More on Regular Expressions

Regular Expressions

Note: other places where regular expressions are used (1)

• grep, among the Unix commands

grep smug files          {search files for lines with 'smug'}
grep '^smug' files       {'smug' at the start of a line}
grep 'smug$' files       {'smug' at the end of a line}
grep '^smug$' files      {lines containing only 'smug'}
grep '\^s' files         {lines containing the literal text '^s'; "\" escapes the ^}
grep '[Ss]mug' files     {search for 'Smug' or 'smug'}
grep 'B[oO][bB]' files   {search for BOB, Bob, BOb or BoB}
grep '^$' files          {search for blank lines}
grep '[0-9][0-9]' file   {search for pairs of numeric digits}

(http://www.robelle.com/smugbook/regexpr.html)
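The same line-filtering idea can be sketched in Java. This is not how grep is implemented; it only assumes that these particular patterns mean the same thing in `java.util.regex` as in grep, which holds for anchors, character classes, and backslash escapes. The class name `MiniGrep` and the method `grep` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MiniGrep {
    /** Returns the lines that contain a match for the regular expression. */
    public static List<String> grep(String regex, List<String> lines) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {   // find(): a match anywhere in the line
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("smug", "a smug cat", "Smug", "", "x^s", "room 42");
        System.out.println(grep("^smug$", lines));     // [smug]
        System.out.println(grep("[Ss]mug", lines));    // [smug, a smug cat, Smug]
        System.out.println(grep("\\^s", lines));       // [x^s]
        System.out.println(grep("[0-9][0-9]", lines)); // [room 42]
    }
}
```

The line `x^s` is selected by `\^s` even though it does not start with `^s`, because an escaped `^` is an ordinary character rather than an anchor.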


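• In JavaScript

– Regular expressions are used by methods such as exec, test, match, and search

– Methods that use regular expressions (partial list)

The /g flag means: do not stop after the first match; find every match.

(http://www.w3schools.com/jsref/jsref_regexp_g.asp)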


• In Java

Package java.util.regex

Java Example

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            // error: no console is available, so exit
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            // read a regex and an input string, then report every match
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search:"));
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" + " \"%s\" starting at "
                    + "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}


Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.

Lexical Analyzer in Java

import java.io.Reader;
import java.util.Scanner;
import java.util.HashMap;
import java.util.regex.Pattern;

public class Lexer {
    public static enum TokenType {
        GTEQ(">="), LTEQ("<="), GT(">"), LT("<"), ARROW("-->"),
        PLUS("+"), MINUS("-"), STAR("*"), SLASH("/"), ASSIGN("="),
        LPAR("("), RPAR(")"), SEMI(";"), COMMA(","),
        IF("if"), ELSE("else"), WHILE("while"),
        IDENT(null), NUMERAL(null), EOF(null), ERROR(null);

        final private String lexeme;
        TokenType(String s) { lexeme = s; }
    }

    public String lastLexeme;

    private static HashMap<String, TokenType> tokenMap =
        new HashMap<String, TokenType>();
    static {
        for (TokenType c : TokenType.values())
            tokenMap.put(c.lexeme, c);
    }

    private Scanner inp;
    private static final Pattern tokenPat = Pattern.compile(
        "(\\s+|#.*)"
        + "|>=|<=|-->|if|def|else|fi|while"
        + "|([a-zA-Z][a-zA-Z0-9]*)|(\\d+)"
        + "|.");

    public Lexer(Reader reader) { inp = new Scanner(reader); }

    public TokenType nextToken() {
        if (inp.findWithinHorizon(tokenPat, 0) == null)
            return TokenType.EOF;
        else {
            lastLexeme = inp.match().group(0);
            if (inp.match().start(1) != -1)
                return nextToken();
            else if (inp.match().start(2) != -1)
                return TokenType.IDENT;
            else if (inp.match().start(3) != -1)
                return TokenType.NUMERAL;

            TokenType result = tokenMap.get(lastLexeme);
            if (result == null)
                return TokenType.ERROR;
            else
                return result;
        }
    }
}

https://inst.eecs.berkeley.edu/~cs164/sp11/lectures/lecture2/Lexer.java
