CD Course Manual


  • 8/18/2019 CD Course Manual

    1/27

Compiler Design

    By: HaftuHagos


    Chapter One

    Compilers

Introduction

Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical device, and its functions are controlled by compatible software. Hardware understands instructions in the form of electronic charge, which is the counterpart of binary language in software programming. Binary language has only two symbols, 0 and 1. To instruct the hardware, code must be written in binary format, which is simply a series of 1s and 0s. Writing such code by hand would be a difficult and cumbersome task for programmers, which is why we have compilers to generate it.

1.1 Language Processing System

We have learnt that any computer system is made of hardware and software. The hardware understands a language that humans cannot understand. So we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to get the desired code that can be used by the machine. This is known as a Language Processing System.

    Saint Mary University


The high-level language is converted into binary language in several phases. A compiler is a program that converts high-level language to assembly language. Similarly, an assembler is a program that converts assembly language to machine-level language. Let us first understand how a program, using a C compiler, is executed on a host machine.

• The user writes a program in the C language (a high-level language).
• The C compiler compiles the program and translates it to an assembly program (a low-level language).
• An assembler then translates the assembly program into machine code (object code).
• A linker tool is used to link all the parts of the program together for execution (executable machine code).
• A loader loads all of them into memory, and then the program is executed.

Before diving straight into the concepts of compilers, we should understand a few other tools that work closely with compilers.

    1.1.1 Preprocessor

A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It may perform the following functions.

1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.
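Macro processing can be illustrated with a minimal Python sketch. The function name expand_macros and the sample macro names are hypothetical, chosen only for illustration; a real C preprocessor also handles parameters, nesting and conditional directives, none of which is attempted here.

```python
import re

def expand_macros(source, macros):
    """Replace each whole-word macro name with its replacement text."""
    if not macros:
        return source
    # \b keeps the macro MAX from matching inside the identifier MAXIMUM.
    names = "|".join(re.escape(name) for name in macros)
    pattern = re.compile(r"\b(" + names + r")\b")
    return pattern.sub(lambda m: macros[m.group(1)], source)

# Hypothetical macro table: name -> replacement text.
macros = {"MAX": "100", "SQUARE_OF_MAX": "(100 * 100)"}
expanded = expand_macros("int a[MAX]; int b = SQUARE_OF_MAX; int MAXIMUM;", macros)
```

The compiler proper would then receive the expanded text, in which the shorthand names no longer appear.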

    1.1.2 Interpreter

An interpreter, like a compiler, translates a high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program, and may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, and then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.


Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. Java also uses an interpreter. The process of interpretation can be carried out in the following phases.

1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution

Advantages:

• Modifications to the user program can be easily made and implemented as execution proceeds.
• The type of object that a variable denotes may change dynamically.
• Debugging a program and finding errors is a simpler task for a program that is interpreted.
• The interpreter for the language makes it machine independent.

Disadvantages:

• The execution of the program is slower.
• Memory consumption is higher.

1.1.3 Assembler

Programmers found it difficult to write or read programs in machine language. They began to use a mnemonic (symbol) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler is called the source program; the output is a machine language translation (the object program).

The output of an assembler is called an object file, which contains a combination of machine instructions as well as the data required to place these instructions in memory.

    1.1.4 Linker

A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate the referenced modules/routines in a program and to determine the memory location where this code will be loaded, so that the program instructions have absolute references.

    1.1.5 Loader

A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.


1.3.2 Synthesis Phase

Known as the back-end of the compiler, the synthesis phase generates the target program with the help of the intermediate source code representation and the symbol table.

A compiler can have many phases and passes.

• Pass: A pass refers to the traversal of a compiler through the entire program.
• Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous stage, processes it, and yields output that can be used as input for the next stage. A pass can have more than one phase.

1.4 Phases of Compilers

The compilation process is a sequence of various phases. Each phase takes input from its previous stage, has its own representation of the source program, and feeds its output to the next phase of the compiler. Let us understand the phases of a compiler.


Lexical analysis: This is the initial part of reading and analyzing the program text. The text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword or number.

Syntax analysis: This phase takes the list of tokens produced by the lexical analysis and arranges these in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.

Semantic analysis: Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that assignment of values is between compatible data types, and it reports errors such as adding a string to an integer. Also, the semantic analyzer keeps track of identifiers, their types and


expressions, and whether identifiers are declared before use or not, etc. The semantic analyzer produces an annotated syntax tree as output.

For example, it reports an error if a variable is used but not declared, or if it is used in a context that does not make sense given the type of the variable, such as trying to use a boolean value as a function pointer.

Intermediate Code Generation: After semantic analysis, the compiler generates an intermediate code of the source code for the target machine. It represents a program for some abstract machine and lies between the high-level language and the machine language. This intermediate code should be generated in such a way that it is easy to translate into the target machine code.
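One common form of intermediate code is three-address code, in which every instruction has at most one operator on its right-hand side. The Python sketch below is only an illustration under stated assumptions: the nested-tuple expression format, the temporary names t1, t2, ... and the function name to_three_address are all hypothetical and do not come from this manual.

```python
from itertools import count

def to_three_address(expr):
    """Return (result_name, instructions) for a nested-tuple expression.

    An expression is either a string (a variable or constant) or a tuple
    (operator, left, right).
    """
    temps = count(1)
    code = []

    def walk(node):
        if isinstance(node, str):
            return node                      # leaves need no instruction
        op, left, right = node
        left_name = walk(left)
        right_name = walk(right)
        temp = f"t{next(temps)}"             # fresh temporary for this result
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    return walk(expr), code

# Expression: initial + rate * 60
result, code = to_three_address(("+", "initial", ("*", "rate", "60")))
```

Each generated line is simple enough to map onto one or two machine instructions, which is the property the paragraph above asks of an intermediate representation.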

Code Optimization: The next phase performs code optimization of the intermediate code. Optimization can be thought of as removing unnecessary code lines and arranging the sequence of statements in order to speed up program execution without wasting resources [...]

[...] "well-cultured" in computer science.
b) A good craftsman should know his tools, and compilers are important tools for programmers and computer scientists.
c) The techniques used for constructing a compiler are useful for other purposes as well.
d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language.

Review Questions

1) What is the difference between high-level languages and machine languages?
2) Define the terms computer hardware and software.
3) Write the phases of the language processing system neatly.
4) What is the difference between a cross-compiler and a source-to-source compiler?
5) Write in detail all the core parts being done in the front-end of the compiler.
6) Write in detail all the core parts being done in the back-end of the compiler.
7) Define in detail linkers and loaders.


==================== THE END ====================

Chapter Two

Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace or comments in the source code.

If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands.

A lexical analyzer, or lexer for short, takes as its input a string of individual characters and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments.
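The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not a full lexer: the name split_tokens is hypothetical, only C-style /* ... */ comments are assumed, and anything that is not an identifier or an integer is treated as a one-character token.

```python
import re

TOKEN_OR_SKIP = re.compile(r"""
    (?P<skip>\s+|/\*.*?\*/)            # white-space or a /* ... */ comment
  | (?P<token>[A-Za-z_]\w*|\d+|.)      # identifier, integer, or one character
""", re.VERBOSE | re.DOTALL)

def split_tokens(text):
    """Return the lexemes of text with white-space and comments dropped."""
    return [m.group("token")
            for m in TOKEN_OR_SKIP.finditer(text)
            if m.lastgroup == "token"]
```

For instance, split_tokens("x = y /* add */ + 42 ;") keeps the six lexemes x, =, y, +, 42 and ; and discards everything between them.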


The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis phase. In theory, the work that is done during lexical analysis can be made an integral part of syntax analysis, and in simple systems this is indeed often done. However, there are reasons for keeping the phases separate:

• Efficiency: A lexer may do the simple parts of the work faster than the more general parser can. Furthermore, the size of a system that is split in two may be smaller than a combined system. This may seem paradoxical but, as we shall see, there is a non-linear factor involved which may make a separated system smaller than a combined system.
• Modularity: The syntactical description of the language need not be cluttered with small lexical details such as white-space and comments.
• Tradition: Languages are often designed with separate lexical and syntactical phases in mind, and the standard documents of such languages typically separate lexical and syntactical elements of the languages.

For lexical analysis, specifications are traditionally written using regular expressions: an algebraic notation for describing sets of strings. The generated lexers are in a class of extremely simple programs called finite automata.

2.1 Tokens, Lexemes and Patterns

When discussing lexical analysis, we use three related but distinct terms:

• A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.
• A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example 2.1: Figure 2.1 gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, in the C statement


printf("Total = %d\n", score);

both printf and score are lexemes matching the pattern for the token id, and "Total = %d\n" is a lexeme matching literal.

2.1.1 Tokens

A lexeme is a sequence of (alphanumeric) characters that forms a token. There are predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.

In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered as tokens.
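The token categories above can be sketched with one regular expression per category. This is a minimal Python illustration; the keyword set, the category names and the function name classify are assumptions made for the example and cover only a tiny part of C.

```python
import re

KEYWORDS = {"int", "float", "char", "if", "else", "while", "return"}

TOKEN_SPEC = [
    ("constant",   r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[=+\-*/%]"),
    ("symbol",     r"[;,(){}]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def classify(line):
    """Return a list of (lexeme, category) pairs for one line of source."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = MASTER.match(line, pos)
        if m is None:
            raise ValueError(f"invalid character {line[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue                       # white-space is not a token
        if kind == "identifier" and lexeme in KEYWORDS:
            kind = "keyword"               # keywords match the identifier pattern
        tokens.append((lexeme, kind))
    return tokens
```

Calling classify("int value = 100;") yields one keyword, one identifier, one operator, one constant and one symbol, in that order.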

For example, in the C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

Let us understand how language theory defines the following terms:

Alphabets
Any finite set of symbols is an alphabet: {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.

Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of occurrences of symbols in it, e.g., the length of the string St. Mary is 8 and is denoted by |St. Mary| = 8. A string having no symbols, i.e. a string of zero length, is known as an empty string and is denoted by ε (epsilon).

Special Symbols
A typical high-level language contains the following symbols:


Arithmetic Symbols    Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment            =
Special Assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #
Location Specifier    &
Logical               &, &&, |, ||, !
Shift Operator        >>, >>>, <<, <<<

Language
A language is considered as a finite set of strings over some finite alphabet. Computer languages are considered as finite sets, and mathematical set operations can be performed on them. Finite languages can be described by means of regular expressions.

2.2 Regular Expressions

The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the pattern defined by the language rules.

Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar. The language defined by regular grammar is known as regular language.

Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.

There are a number of algebraic laws obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.

Operations

The various operations on languages are:


• Concatenation of two languages L and M is written as
  LM = {st | s is in L and t is in M}
• The Kleene closure of a language L is written as
  L* = zero or more occurrences of language L.
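For small finite languages these operations can be tried out directly with Python sets. The sketch below is illustrative only; the function names are hypothetical, and because L* is infinite whenever L contains a non-empty string, kleene_closure computes just the finite slice built from at most max_repeats strings of L.

```python
def concatenate(L, M):
    """LM = {st | s is in L and t is in M}."""
    return {s + t for s in L for t in M}

def kleene_closure(L, max_repeats):
    """Strings made by concatenating at most max_repeats strings from L."""
    result = {""}              # zero occurrences gives the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concatenate(layer, L)   # one more occurrence of L
        result |= layer
    return result
```

With L = {"a", "b"} and M = {"0", "1"}, concatenate(L, M) contains the four strings a0, a1, b0 and b1.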

Notations

If r and s are regular expressions denoting the languages L(r) and L(s), then

• Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)

Precedence and Associativity

• *, concatenation (.), and | (pipe sign) are left associative
• * has the highest precedence
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expressions

If x is a regular expression, then:

• x* means zero or more occurrences of x,
  i.e., it can generate {ε, x, xx, xxx, xxxx, …}
• x+ means one or more occurrences of x,
  i.e., it can generate {x, xx, xxx, xxxx, …} or x.x*
• x? means at most one occurrence of x,
  i.e., it can generate either {x} or {ε}.
• [a-z] is all lower-case letters of the English alphabet.
• [A-Z] is all upper-case letters of the English alphabet.
• [0-9] is all decimal digits used in mathematics.
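Python's re module happens to use the same operators, so the notation above can be checked directly. This is a small sketch; the helper name matches and the sample strings are chosen for illustration.

```python
import re

def matches(pattern, text):
    """True when the whole text matches the regular expression."""
    return re.fullmatch(pattern, text) is not None

# Each pattern is listed with sample strings it is expected to generate.
checks = {
    "x*": ["", "x", "xxx"],    # zero or more occurrences
    "x+": ["x", "xx"],         # one or more occurrences
    "x?": ["", "x"],           # at most one occurrence
    "[a-z]": ["q"],
    "[A-Z]": ["Q"],
    "[0-9]": ["7"],
}
all_accepted = all(matches(p, s) for p, strings in checks.items() for s in strings)
```

The empty string is accepted by x* and x? but not by x+, which is exactly the difference between "zero or more" and "one or more".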

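Details:

\[
\Sigma^{*} = \Sigma^{0} \cup \Sigma^{1} \cup \Sigma^{2} \cup \dots
\]
\[
\Sigma^{+} = \Sigma^{1} \cup \Sigma^{2} \cup \Sigma^{3} \cup \dots
\]
\[
\Sigma^{*} = \Sigma^{+} \cup \{\varepsilon\}
\]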

Concatenation:

x = 01101, y = 110, xy = 01101110.

For any string x, xε = εx = x.

Representing occurrence of symbols using regular expressions

letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]


sign = [ + | - ]

Representing language tokens using regular expressions

Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*

The only problem left with the lexical analyzer is how to verify the validity of a regular expression used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite automata for verification.

Example 2.2: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways. One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. The second way is that L and D are languages, all of whose strings happen to be of length one. Here are some other languages that can be constructed from languages L and D.

1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L^4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
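The counts in items 1 and 2 can be confirmed by building the finite languages as Python sets. The infinite languages in items 5 and 6 cannot be enumerated, so the sketch uses membership tests instead; the function names are hypothetical.

```python
import string

L = set(string.ascii_letters)    # the 52 upper- and lower-case letters
D = set(string.digits)           # the 10 digits
L_union_D = L | D
LD = {letter + digit for letter in L for digit in D}

def in_L_LD_star(s):
    """Membership test for L(L ∪ D)*: a letter, then letters or digits."""
    return len(s) >= 1 and s[0] in L and all(ch in L_union_D for ch in s[1:])

def in_D_plus(s):
    """Membership test for D+: one or more digits."""
    return len(s) >= 1 and all(ch in D for ch in s)
```

The sizes come out as 52 + 10 = 62 for the union and 52 × 10 = 520 for the concatenation.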

2.3 Regular Definitions

For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

d1 → r1
d2 → r2
...
dn → rn

where:

1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.

Example 2.3: C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers. We shall conventionally use italics for the symbols defined in regular definitions.

letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*

Example 2.4:


digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
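A regular definition translates almost line by line into code when each name is kept as a string and substituted into later definitions. The Python sketch below mirrors the definition of number above; the variable names and the helper is_number are illustrative choices, and "| ε" is written with the ? operator.

```python
import re

# Each name is defined only in terms of earlier names, as a regular definition requires.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"
optional_fraction = rf"(\.{digits})?"          # . digits | ε
optional_exponent = rf"(E[+-]?{digits})?"      # (E (+ | - | ε) digits) | ε
number = re.compile(rf"{digits}{optional_fraction}{optional_exponent}")

def is_number(text):
    """True when the whole text is an unsigned number."""
    return number.fullmatch(text) is not None
```

Under this definition 5280, 0.01234, 6.336E4 and 1.89E-4 are numbers, while strings such as ".5" or "1." are not, because at least one digit is required on both sides of the dot.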

2.4 Token Recognition

In the previous section we learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns. Our discussion will make use of the following running example.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions.

digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>

For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number. To simplify matters, we make the common assumption that keywords are also reserved words: that is, they are not identifiers, even though their lexemes match the pattern for identifiers.


In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:

ws → ( blank | tab | newline )+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.

Our goal for the lexical analyzer is summarized in Fig. 2.1. That table shows, for each lexeme or family of lexemes, which token name is returned to the parser and what attribute value, as discussed in Section 3.1.3, is returned. Note that for the six relational operators, symbolic constants LT, LE, and so on are used as the attribute value, in order to indicate which instance of the token relop we have found. The particular operator found will influence the code that is output from the compiler.
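The behaviour described in this section can be sketched as a small Python lexer for the running example. Several details are assumptions made for the illustration: the function name tokenize, the attribute names EQ, NE, GT and GE (the text only names LT and LE), and the use of the lexeme itself as the attribute of id and number where a real lexer would return a pointer to a table entry.

```python
import re

RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}
RESERVED = {"if", "then", "else"}

PATTERNS = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<number>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<relop><=|<>|>=|<|=|>)
""", re.VERBOSE)

def tokenize(text):
    """Return the (token name, attribute) pairs handed to the parser."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = PATTERNS.match(text, pos)
        if m is None:
            raise ValueError(f"no token starts at position {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                             # ws is never returned to the parser
        if kind == "id" and lexeme in RESERVED:
            tokens.append((lexeme, None))        # reserved word, not an identifier
        elif kind == "relop":
            tokens.append(("relop", RELOPS[lexeme]))
        else:
            tokens.append((kind, lexeme))
    return tokens
```

Two-character operators are listed before their one-character prefixes so that <= is not split into < followed by =.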

2.5 Finite Automata

Finite Automata = Abstract Computing Devices.

A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. A finite automaton is a recognizer for regular expressions. When a regular expression string is fed into a finite automaton, it changes its state for each literal. If the input string is successfully processed and the automaton reaches a final state, the string is accepted, i.e., the string just fed is said to be a valid token of the language in hand.


The mathematical model of finite automata consists of:

• Finite set of states (Q)
• Finite set of input symbols (Σ)
• One start state (q0)
• Set of final states (qf)
• Transition function (δ)

The transition function (δ) maps each pair of a state from Q and an input symbol from Σ to a next state:

δ : Q × Σ → Q
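The five components of the model map directly onto a few Python values. The sketch below is an illustration for the identifier pattern letter (letter | digit)*; the state names q0 and q1 and the helper names are assumptions, and missing entries in the transition table are treated as rejection instead of being routed to an explicit dead state.

```python
def char_class(ch):
    """Map a character to an input symbol of the automaton."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

START = "q0"
FINAL_STATES = {"q1"}
DELTA = {                       # transition function: (state, symbol) -> state
    ("q0", "letter"): "q1",
    ("q1", "letter"): "q1",
    ("q1", "digit"): "q1",
}

def accepts(text):
    """Run the automaton over text and report whether it ends in a final state."""
    state = START
    for ch in text:
        state = DELTA.get((state, char_class(ch)))
        if state is None:       # no transition defined: reject
            return False
    return state in FINAL_STATES
```

The string x1 is accepted because it drives the automaton from q0 to q1 and leaves it there, while 1x is rejected on its first character.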

2.5.1 Finite Automata Construction

Let L(r) be a regular language recognized by some finite automaton (FA).

• States: States of an FA are represented by circles. State names are written inside the circles.
• Start state: The state from which the automaton starts is known as the start state. The start state has an arrow pointing towards it.
• Intermediate states: All intermediate states have at least two arrows: one pointing to them and another pointing out from them.
• Final state: If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles. It may have any odd number of arrows pointing to it and an even number of arrows pointing out from it. The number of odd arrows is one greater than the number of even arrows, i.e. odd = even + 1.


1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle, and if there is an action to be taken - typically returning a token and an attribute value to the parser - we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. In our example, it is never necessary to retract forward by more than one position, but if it were, we could attach any number of *'s to the accepting state.

Example 2.3:

Figure 2.2 is a transition diagram that recognizes the lexemes matching the token relop. We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at


2.5.2.1 Recognizing Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem.


Example 2.5: The final transition diagram, shown in Fig. 2.5, is for whitespace. In that diagram, we look for one or more "whitespace" characters, represented by delim in that diagram - typically these characters would be blank, tab, newline, and perhaps other characters that are not considered by the language design to be part of any token.

Figure 2.5: Transition diagram for white space.

Note that in state 2, we have found a block of consecutive whitespace characters, followed by a non-whitespace character. We retract the input to begin at the non-whitespace character, but we do not return to the parser. Rather, we must restart the process of lexical analysis after the whitespace.

2.6 Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer [...]. The token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.

Practically, a token has one attribute: a pointer to the symbol table entry in which the information about the token is kept. The symbol table entry contains various information about the token, such as the lexeme, its type, the line number in which it was first seen, etc.

For example, in the assignment statement (in FORTRAN): E = M * C ** 2, the tokens and their attributes are written as:

• <id, pointer to symbol-table entry for E>
• <assign_op>
• <id, pointer to symbol-table entry for M>
• <mult_op>
• <id, pointer to symbol-table entry for C>
• <exp_op>
• <number, integer value 2>
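The token list above can be reproduced with a short Python sketch. It is an illustration under simplifying assumptions: the input is already split into lexemes, the symbol table is a plain list whose index plays the role of the pointer, and the function name tokens_with_attributes is hypothetical.

```python
def tokens_with_attributes(lexemes):
    """Pair each lexeme with a token name and, where needed, an attribute."""
    symbol_table = []          # one entry per distinct identifier
    index_of = {}              # lexeme -> position in symbol_table
    operators = {"=": "assign_op", "*": "mult_op", "**": "exp_op"}
    tokens = []
    for lexeme in lexemes:
        if lexeme in operators:
            tokens.append((operators[lexeme], None))     # no attribute needed
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))       # attribute: the value
        else:
            if lexeme not in index_of:
                index_of[lexeme] = len(symbol_table)
                symbol_table.append({"lexeme": lexeme})
            tokens.append(("id", index_of[lexeme]))      # attribute: table index
    return tokens, symbol_table

tokens, table = tokens_with_attributes(["E", "=", "M", "*", "C", "**", "2"])
```

All three identifiers share the token name id; only the attribute tells the later phases whether E, M or C was meant.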

2.7 Lexical Errors

It is hard for a lexical analyzer [...]


[...] a lexical analyzer [...]


Once the lexeme is determined, forward is set to the character at its right end (this involves retracting). Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

Sentinels

If we use the previous scheme, we must check each time we advance forward that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we must make two tests: one for the end of the buffer, and one to determine which character is read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.

E = M * eof C * * 2 eof ... eof

Figure 2.7: Sentinels at the end of each buffer

The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof. Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means the input is at an end. Figure 2.3 shows the same arrangement as Figure 2.2, but with the sentinels added.
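The sentinel scheme can be sketched in Python as a generator that walks two buffer halves. This is a minimal illustration under stated assumptions: the NUL character stands in for eof and is assumed never to occur in the source text, the half size n is at least 1, and the names read_chars and load are hypothetical.

```python
import io

EOF = "\0"    # sentinel; assumed never to occur in the source text

def read_chars(source, n=4):
    """Yield the characters of source through two n-character buffer halves,
    each followed by one sentinel slot."""
    stream = io.StringIO(source)
    buf = [EOF] * (2 * (n + 1))

    def load(start):
        chunk = stream.read(n)
        buf[start:start + len(chunk)] = chunk
        # The sentinel ends the half, or marks end of input if the chunk is short.
        buf[start + len(chunk)] = EOF

    load(0)
    forward = 0
    while True:
        ch = buf[forward]
        if ch != EOF:                    # the only test made for ordinary characters
            yield ch
            forward += 1
        elif forward == n:               # sentinel at the end of the first half
            load(n + 1)
            forward = n + 1
        elif forward == 2 * n + 1:       # sentinel at the end of the second half
            load(0)
            forward = 0
        else:                            # eof inside a half: input is exhausted
            return
```

Ordinary characters cost a single comparison; the position checks run only when a sentinel is actually seen, which is the saving the text describes.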

2.8 Architecture of a Transition-Diagram-Based Lexical Analyzer

There are several ways that a collection of transition diagrams can be used to build a lexical analyzer. Regardless of the overall strategy, each state is represented by a piece of code. We may imagine a variable state holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.

Example 2.5:

In Fig. 2.9 we see a sketch of getRelop(), a C++ function whose job is to simulate the transition diagram of Fig. 2.5 and return an object of type TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an attribute value (the code for one of the six comparison operators in this case). getRelop() first creates a new object retToken and initializes its first component to RELOP, the symbolic code for token relop. We see the typical behavior of a state in case 0, the case where the current


    state is 0. ) function nextchar56 obtains the next character from the input and assigns it to local

    variable c. #e then chec" c for the three characters we expect to find, ma"ing the state transition

    dictated by the transition diagram of ;ig. 2.5 in each case. ;or example, if the next inputcharacter is D, we go to state A.

    !f the next input character is not one that can begin a comparison operator, then a function fail 56

    is called. #hat fail 56 does depends on the global error recovery strategy of the lexical analy8er.!t should reset the forward pointer to lexemeBegin, in order to allow another transition diagram to beapplied to the true beginning of the unprocessed input.

     #e also show the action for state I in ;ig. +.1I. Because state I bears a O, we must retract theinput pointer one position 5i.e., put c bac" on the input stream6. hat tas" is accomplished by the

    function r e t r a c t 56 .$ince state I represents the recognition of lexeme YD, we set the second

    component of the returned object, which we suppose is named a t t r i b u t e, to _, the code for this operator.
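A runnable version of the getRelop() idea might look like the sketch below. It keeps the names used in the text (getRelop, nextChar, retract, fail, retToken, attribute) but makes several assumptions of its own: the input is an in-memory string, the wrapper class RelopScanner and its position() accessor are hypothetical, states other than 0, 5 and 8 follow the usual relop diagram (<, <=, <>, =, >, >=), and fail() simply resets forward and throws instead of trying another transition diagram.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

enum class TokenName { RELOP };
enum class RelopCode { LT, LE, EQ, NE, GT, GE };

// A token is a pair: the token name and an attribute value.
struct Token {
    TokenName name;
    RelopCode attribute;
};

// Hypothetical scanner over an in-memory string (illustrative only).
class RelopScanner {
public:
    explicit RelopScanner(std::string input) : input_(std::move(input)) {}

    std::size_t position() const { return forward_; }

    Token getRelop() {
        Token retToken{TokenName::RELOP, RelopCode::EQ};
        lexemeBegin_ = forward_;
        int state = 0;
        for (;;) {  // process characters until a return or a failure
            switch (state) {
            case 0: {
                char c = nextChar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();  // the lexeme is not a comparison operator
                break;
            }
            case 1: {
                char c = nextChar();
                if (c == '=') state = 2;
                else if (c == '>') state = 3;
                else state = 4;
                break;
            }
            case 2: retToken.attribute = RelopCode::LE; return retToken;
            case 3: retToken.attribute = RelopCode::NE; return retToken;
            case 4: retract(); retToken.attribute = RelopCode::LT; return retToken;
            case 5: retToken.attribute = RelopCode::EQ; return retToken;
            case 6: {
                char c = nextChar();
                state = (c == '=') ? 7 : 8;
                break;
            }
            case 7: retToken.attribute = RelopCode::GE; return retToken;
            case 8: retract(); retToken.attribute = RelopCode::GT; return retToken;
            default: fail();
            }
        }
    }

private:
    // Returns the next input character, or '\0' past the end of the input.
    char nextChar() {
        char c = forward_ < input_.size() ? input_[forward_] : '\0';
        ++forward_;  // advance even at end of input so retract() stays balanced
        return c;
    }

    void retract() { --forward_; }  // put the last character back

    // Assumed recovery strategy: reset forward to lexemeBegin, then report.
    [[noreturn]] void fail() {
        forward_ = lexemeBegin_;
        throw std::runtime_error("not a relational operator");
    }

    std::string input_;
    std::size_t lexemeBegin_ = 0;
    std::size_t forward_ = 0;
};
```

For example, scanning the input <=> yields LE and then GT. In the second call, state 8 retracts the lookahead character, so the scanner stops exactly at the end of the lexeme >.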

    Review Questions

    Ques: 1

    In a compiler, the module that checks every character of the source text is called

    A) The code generator
    B) The code optimizer

    How many tokens are there in the following C statement?

    printf("j=%d, &j=%x", j, &j);

    A)
    B) 5
    C) J
    D) 10

    Ques: 5

    In a compiler, the data structure responsible for the management of information about variables and their attributes is

    A) Semantic stack
    B) Parser table
    C) Symbol table
    D) Abstract syntax-tree

    Chapter Summary

    • Tokens. The lexical analyzer scans the source program and produces as output a sequence of tokens, which are normally passed, one at a time, to the parser. Some tokens may consist only of a token name, while others may also have an associated lexical value that gives information about the particular instance of the token that has been found on the input.

    • Lexemes. Each time the lexical analyzer returns a token to the parser, it has an associated lexeme: the sequence of input characters that the token represents.

    • Buffering. Because it is often necessary to scan ahead on the input in order to see where the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input.

    • Extended Regular-Expression Notation. A number of additional operators may appear as shorthands in regular expressions, to make it easier to express patterns. Examples include the + operator (one-or-more-of), ? (zero-or-one-of), and character classes (the union of the strings each consisting of one of the characters).

    • Transition Diagrams. The behavior of a lexical analyzer can often be described by a transition diagram. These diagrams have states, each of which represents something about the history of the characters seen during the current search for a lexeme that matches one of the possible patterns. There are arrows, or transitions, from one state to another, each of which indicates the possible next input characters that cause the lexical analyzer to make that change of state.
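The extended regular-expression shorthands listed in the summary can be tried out with the C++ standard library. The sketch below is only an illustration: the helper name isUnsignedNumber and the choice of an unsigned-number pattern are assumptions, and std::regex is used here with its default ECMAScript syntax, which writes character classes as [0-9].

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: matches digits, an optional fraction, and an
// optional exponent, using + (one or more), ? (zero or one), and
// character classes such as [0-9] and [+-].
inline bool isUnsignedNumber(const std::string& s) {
    static const std::regex pattern("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");
    return std::regex_match(s, pattern);  // the whole string must match
}
```

Without the shorthands, [0-9]+ would have to be written as a ten-way alternation followed by its own Kleene closure, which is why these operators are described as conveniences rather than as extra expressive power.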

    ======================== THE END ====================

    Chapter 3

    Syntax Analysis

    • By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs.

    • The syntax of programming language constructs can be specified by context-free grammars (CFGs) or BNF notation (both are discussed in the previous course).

    • The use of CFGs has several advantages over BNF:

        • a grammar helps in identifying ambiguities

        • a grammar gives a precise yet easy-to-understand syntactic specification of a programming language

        • it is possible to have a tool that automatically produces a parser from the grammar

        • a properly designed grammar helps in modifying the parser easily when the language changes

    3.1 The Role of the Parser

    In our compiler model, the parser obtains a string of tokens from the lexical analyzer.

    In either case, the input to the parser is scanned from left to right, one symbol at a time.

    Figure 3.1: Position of parser in compiler model

    There are three general types of parsers for grammars: universal, top-down, and bottom-up.