Transcript
Page 1: Operating Systems & Compilers - it.uu.se · Optimization: Minimize the FA (NFA ⇒ DFA first) ... breaks down the source code in tokens accepts them. In C, 6 types of symbols: Keywords

Operating Systems & Compilers
Compiler introduction & Lexical Analysis

Frédéric Haziza

Department of Computer Systems

Uppsala University

Spring 2008


A. Silberschatz, P. B. Galvin, G. Gagne. Operating System Concepts, 7th Edition. Wiley, 2002 (ISBN: 0-471-69466-5)

R. Hunter. The Essence of Compilers. Prentice-Hall, 1999 (ISBN: 0-13-727835-7)


Schedule

Date        Time         Room   Comment
Mon 25 Feb  10.00-12.00  P2446  Compiling
Wed 27 Feb  10.00-12.00  P1211
Thu 28 Feb  10.00-12.00  P1211
Mon 3 Mar   08.00-10.00  P1211  “Recap”
Thu 2 Mar   15.00-17.00  P1211  Björn Victor
Thu 2 Mar   10.00-12.00  P1111  Anna Otto
Wed 2 Apr   08.00-17.00  ?      Exam


Operating Systems

Process Management

Memory Management

Storage Management

Compilers

Compiling process & Lexical analysis

Parsing

Semantic analysis & Code generation


Compiling Process

A compiler is expected to produce correct object code for every input in the source language for which it was designed, and one or more error messages for any other (invalid) input.

6 OSKomp’08 | Compilers (Compiler introduction & Lexical Analysis)


Compiler vs Translator

Typical: gcc, javac

Non-typical compilers:

latex: document compiler; transforms a document into DVI printing commands. Input = document (not program).

C-to-Silicon compiler: generates hardware circuits from C programs; its output is lower-level than that of typical compilers.

Translators:

f2c: Fortran to C (both high-level)

latex2html (both document)

dvi2ps: DVI-to-PostScript (both low-level)


Efficiency

1 efficient compilation

2 minimal compiler size

3 minimal size of object code

4 production of efficient object code

5 ease of portability

6 ease of maintenance

7 optimal usability, including good error diagnostics and error recovery


General structure

High-level source code → Compiler → Low-level machine code


Detailed structure

Analysis: Lexical analysis → Syntax analysis → Semantic analysis

Synthesis: Machine-independent code generation → Optimization of machine-independent code → Storage allocation → Machine code generation → Optimization of machine code


Lexical analysis

The definition of the language should be:

Precise

Concise

Machine readable

All strings that belong to the language: its syntax.
Meanings of those strings: its semantics.

If finite, list them all and that’s it.
If not, how to represent the strings? How to describe tokens?


Regular Expressions

Example (Language)

{x^n | n > 0}

{x^n y^n | n > 0} : xy, xxxyyy, xxxxxxxxyyyyyyyy

{x^m y^n | m, n > 0}

{x^m y^n | m, n ≥ 0} : xxx, yyyyyyy, ε

Can be defined as Regular Expressions

x∗y∗, ε

xx∗yy∗ or x+y+

x∗|y∗

(x|y)∗

x|y∗

(aab|ab)∗ : ε, aababaab, ababab, aabaabaabab


RE examples

Example (identifier): l(l|d)∗

l is a letter

d is a digit

Example (Fixed-point number): (d∗d.d∗)|(d∗.dd∗)

d is a digit

Note: requires at least one digit, either before or after the point.
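
A minimal sketch of a hand-written recognizer for the fixed-point expression (d∗d.d∗)|(d∗.dd∗), read literally: digits, one point, digits, with at least one digit on either side. The function name is_fixed_point is an assumption made for this illustration; it is not part of the slides.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if s matches (d*d.d*)|(d*.dd*):
   digits, one point, digits, with at least one digit overall. */
bool is_fixed_point(const char *s)
{
    int before = 0, after = 0;

    while (isdigit((unsigned char)*s)) { s++; before++; }  /* d* before the point */
    if (*s != '.') return false;                            /* the point is mandatory */
    s++;
    while (isdigit((unsigned char)*s)) { s++; after++; }   /* d* after the point */

    /* nothing may follow, and at least one side must hold a digit */
    return *s == '\0' && (before > 0 || after > 0);
}
```

With this reading, "3.14", "1." and ".5" are accepted, while "." and "12" are rejected.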


RE Problem

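Problem: is there a Regular Expression for {x^n y^n | n > 0} (same number of xs and ys, and at least one)?

Not with an RE alone: REs cannot count, so they cannot force the number of ys to equal the number of xs. Need some extra rules...

S → xSy
S → xy

Then S ⇒ xSy ⇒ xxSyy ⇒ xxxyyy (derivation), written S →∗ xxxyyy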
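
The two rules S → xSy and S → xy can be turned almost directly into a recursive function. The following is a sketch only; the names parse_S and in_xnyn are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Tries to derive a prefix of s from S -> xSy | xy.
   Returns the number of characters consumed, or 0 on failure. */
static size_t parse_S(const char *s)
{
    if (s[0] != 'x') return 0;
    size_t inner = parse_S(s + 1);              /* try S -> x S y first */
    if (inner > 0 && s[1 + inner] == 'y') return inner + 2;
    if (s[1] == 'y') return 2;                  /* fall back to S -> x y */
    return 0;
}

/* Returns true if the whole string is x^n y^n with n > 0. */
bool in_xnyn(const char *s)
{
    size_t used = parse_S(s);
    return used > 0 && s[used] == '\0';
}
```

The recursion depth plays the role of the counter that a regular expression lacks: in_xnyn("xxxyyy") holds, in_xnyn("xxy") does not.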


RE Definition

Defined inductively by:

a: ordinary character stands for itself

ε: the empty string

R|S: either R or S (alternation), where R, S are REs

RS: R followed by S (concatenation), where R, S are REs

R∗: concatenation of RE R zero or more times (R∗ = ε|R|RR|RRR|RRRR|...)


RE Shorthands

R+: RR∗ (one or more)

R?: R|ε (optional)

[abce]: (a|b|c|e)

[a-z]: (a|b|c|d|e|...|y|z)

[^ab]: any one character except those listed

[^a-z]: one character not from this range
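
These operators and shorthands can be tried out with the POSIX regex library, whose extended syntax supports |, ∗ (written *), +, ? and bracket expressions. This is a sketch assuming a POSIX system providing <regex.h>; the helper name full_match and the 256-byte pattern buffer are choices made for this example.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true if the whole of text matches the POSIX extended regex re. */
bool full_match(const char *re, const char *text)
{
    char anchored[256];
    regex_t compiled;
    bool ok;

    /* ^( ... )$ forces the match to cover the entire string */
    if (snprintf(anchored, sizeof anchored, "^(%s)$", re) >= (int)sizeof anchored)
        return false;                       /* pattern too long for this sketch */
    if (regcomp(&compiled, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* not a valid expression */
    ok = regexec(&compiled, text, 0, NULL, 0) == 0;
    regfree(&compiled);
    return ok;
}
```

For instance, full_match("(aab|ab)*", "aababaab") holds, and full_match("x+y+", s) agrees with full_match("xx*yy*", s) for any string s.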


How to break up text

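Input:  if (b == 0) a = b;
Tokens: if ( b == 0 ) a = b ;

Input:  elsex = 0;
Tokens: else x = 0 ;   or   elsex = 0 ;   ?

Rule: the longest matching token wins, so elsex is a single identifier.

If two tokens tie in length, the priority among tokens decides.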

Lexer - Lexical Analyzer

Tool to convert a text stream to tokens, defined by REs + priorities + longest-matching token rule
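
A small sketch of the longest-match rule with priorities, for a toy token set (two keywords, identifiers, numbers, two operators, three punctuation characters). The token kinds, the function next_token and the candidate order are all assumptions made for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { T_ERROR, T_KEYWORD, T_IDENT, T_NUMBER, T_OP, T_PUNCT } Kind;

/* Length of the longest prefix of s matching letter(letter|digit)*, else 0. */
static size_t match_ident(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s made of digits. */
static size_t match_number(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the literal lit if s starts with it, else 0. */
static size_t match_lit(const char *s, const char *lit)
{
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

/* Scans one token at s (no leading whitespace) and stores its length in *len. */
Kind next_token(const char *s, size_t *len)
{
    /* Candidates in priority order: earlier entries win ties in length. */
    struct { Kind kind; size_t n; } cand[] = {
        { T_KEYWORD, match_lit(s, "if") },
        { T_KEYWORD, match_lit(s, "else") },
        { T_IDENT,   match_ident(s) },
        { T_NUMBER,  match_number(s) },
        { T_OP,      match_lit(s, "==") },
        { T_OP,      match_lit(s, "=") },
        { T_PUNCT,   (s[0] != '\0' && strchr("();", s[0])) ? 1 : 0 },
    };
    Kind best = T_ERROR;
    *len = 0;
    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++) {
        if (cand[i].n > *len) {     /* strictly longer only: ties keep the earlier one */
            best = cand[i].kind;
            *len = cand[i].n;
        }
    }
    return best;
}
```

On "elsex = 0;" the identifier candidate (5 characters) beats the keyword else (4 characters). On "else x" both match 4 characters, and the keyword wins because it is listed first.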


Concepts - Summary

Tokens: strings of characters representing the lexical units of the program (such as identifiers, numbers, keywords, operators). Unique or not.

Regular expressions: concise description of tokens

Language denoted by a regular expression R: L(R) = set of strings that R represents


How to accept tokens

We use a Finite Automaton

M = (K, Σ, δ, S, F)

K: set of states

Σ: alphabet

δ: transition function

S ∈ K: start state

F ⊆ K: set of final states

Note: Turing Machine

Two regular expressions as examples

identifier: letter(letter|digit)∗

real number: (+|−|ε)digit∗.digit digit∗


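C code for RE - letter(letter|digit)∗

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Report a lexical error and stop. */
static void error(void)
{
    fprintf(stderr, "not an identifier\n");
    exit(EXIT_FAILURE);
}

int main(void)
{
    int in = getchar();                  /* int, so that EOF can be represented */
    if (isalpha(in)) {                   /* first character must be a letter */
        in = getchar();
    } else {
        error();
    }
    while (isalpha(in) || isdigit(in)) { /* then any number of letters or digits */
        in = getchar();
    }
    return 0;
}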


Finite Automaton - letter(letter|digit)∗

[Figure: state 1 (start) goes to state 2 (final) on letter; state 2 loops to itself on letter, digit]


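C code for RE - letter(letter|digit)∗

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Report a lexical error and stop. */
static void error(void)
{
    fprintf(stderr, "not an identifier\n");
    exit(EXIT_FAILURE);
}

int main(void)
{
    int in;                              /* int, so that EOF can be represented */
    int state = 1;                       /* start state */
    in = getchar();
    while (isalpha(in) || isdigit(in)) {
        switch (state) {
        case 1:                          /* only a letter may leave state 1 */
            if (isalpha(in)) { state = 2; } else { error(); }
            break;
        case 2:                          /* letters and digits stay in state 2 */
            state = 2;
            break;
        }
        in = getchar();
    }
    return (state == 2);                 /* 1 if the final state was reached, else 0 */
}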


Finite Automaton - (+|−|ε)digit∗.digit digit∗

[Figure: four states, 1 (start) to 4 (final). 1 → 2 on +, -, digit; 2 → 2 on digit; 1 → 3 and 2 → 3 on point; 3 → 4 on digit; 4 → 4 on digit]
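
An automaton M = (K, Σ, δ, S, F) maps naturally onto a transition table. The sketch below encodes one possible DFA for the real-number expression, with states 1 to 4, start state 1 and final state 4. The dead state 0, the character classes and the name is_real are assumptions added for this example.

```c
#include <ctype.h>
#include <stdbool.h>

/* Input alphabet, grouped into classes. */
enum { C_SIGN, C_DIGIT, C_POINT, C_OTHER, N_CLASSES };

static int char_class(int c)
{
    if (c == '+' || c == '-') return C_SIGN;
    if (isdigit(c)) return C_DIGIT;
    if (c == '.') return C_POINT;
    return C_OTHER;
}

/* delta[state][class]: state 0 is a dead state, 1 the start state,
   4 the only final state. */
static const int delta[5][N_CLASSES] = {
    /*            sign digit point other */
    /* 0 dead */ { 0,   0,    0,    0 },
    /* 1      */ { 2,   2,    3,    0 },
    /* 2      */ { 0,   2,    3,    0 },
    /* 3      */ { 0,   4,    0,    0 },
    /* 4      */ { 0,   4,    0,    0 },
};

/* Runs the automaton over s and reports whether it stops in the final state. */
bool is_real(const char *s)
{
    int state = 1;
    for (; *s != '\0'; s++)
        state = delta[state][char_class((unsigned char)*s)];
    return state == 4;
}
```

With this table "+3.14" and ".5" are accepted, while "12." is rejected because the expression requires a digit after the point. Changing the recognized language means editing the table only; the driver loop stays the same.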


Best Automaton?

Deterministic (DFA) vs Non-deterministic (NFA)

Optimization: Minimize the FA (NFA ⇒ DFA first)

Out of scope for this class.


Lex

Automata suitable for automation ⇒ Lex

letter      [a-z]
digit       [0-9]
identifier  {letter}({letter}|{digit})*
%%
{identifier}  { printf("identifier recognized\n"); }
%%

$ lex first.lex                        (produces lex.yy.c)
$ cc -o firstlex lex.yy.c -ll
$ firstlex < cprog
$ firstlex < cprog > identifiers.txt


Conclusion

During lexical analysis, the lexer

breaks down the source code into tokens

accepts them.

In C, 6 types of symbols:

Keywords: const, char, if, else, typedef

Identifiers: sum, main, printf

Constants: 28, 3.1415927, 017 (octal), 0xFF (hexadecimal)

String literals: "Tobias", "Markus", "eagle"

Operators: +, -, ++, >>, /=, &&

Punctuators: {, ], ..., ;

Additionally, the lexer

deletes comments

inserts line numbers


Lexer

Only concerned with symbol recognition

64 const char typedef >> +

Perfectly correct for the lexer.
Up to the syntax analyser (or parser) now
