
  • 1

    Compilers

    Topic 2: Lexical Analysis

    Mick ODonnell : michael.odonnell@uam.es

    2.1. Introduction

    2

    Introduction

    The Role of the Lexical Analyser

    [Diagram: Source Code -> Lexical Analyser -> Syntactic Analyser -> Semantic Analyser; the three analysers form the FRONT END]

  • 2

    3

    Also known as a tokeniser or scanner. In Spanish, it is called the analizador morfológico.

    Purpose: translation of the source code into a sequence of symbols.

    The symbols identified by the lexical (morphological) analyser are treated as terminal symbols in the grammar used by the syntactic analyser.

    Lexical Analyser

    Introduction

    Source code:

        begin
          int A;
          A := 100;
          A := A+A;
          output A
        end

    Output of the lexical analyser:

        (reserved-word,begin)
        (type,int) (,A) (,;)
        (,A) (,:=) (,100) (,;)
        (,A) (,:=) (,A) (,+) (,A) (,;)
        (reserved-word,output) (,A)
        (reserved-word,end)

    4

    Other tasks:

    Identification of lexical errors, e.g., starting an identifier with a digit where the language does not allow this: 2abc

    Deletion of white-space: usually, the only function of white-space is to separate tokens. Exceptions: languages where white-space delimits code blocks, e.g., Python:

        if 1 == 2:
            print 1
        print 2

    Deletion of comments: they are not relevant to the execution of the program.

  • 3

    5

    What are Symbols?

    How do we determine what are the symbols of a given language?

    Case: assume we have a language with the assignment operator := and the assignment statement has the syntax:

        STATEMENT -> ID ASSIGNOP EXPR ;

    The rule for ASSIGNOP could be:

        ASSIGNOP -> :=

    meaning := is a single symbol, and thus a unit of lexical analysis.

    However, the rule might have been:

        ASSIGNOP -> : =

    meaning : and = are two separate symbols for lexical analysis.

    Lexical Analyser

    Drawing the border between symbols

    A := 1 + 2

    6

    What are Symbols?

    General rules:

    A symbol is a sequence of characters that cannot be separated from each other by white space.

    Symbols can be separated from other symbols by white space.

    With A:=1 + 2, := can be separated from A and 1, BUT : cannot be separated from =. Thus := should be treated as a single symbol.


  • 4

    7

    What token labels to use?

    To determine which token labels we assign to symbols, we first need to derive the syntactic grammar of the language.

    THEN, we extract out the terminal symbols of this grammar, which become the token labels in lexical analysis.

    This ensures that the labels assigned in lexical analysis are what we need in syntactic analysis.

    For example, we might assign the label reserved_word to both begin and end. But we clearly cannot use such a label in parsing, because the rule

        Program -> reserved_word Statement* reserved_word

    would allow end A=1 begin as a program. Each token label has to reflect the different roles that the token class can serve in a program.
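
    The problem with a shared reserved_word label can be illustrated with a minimal sketch. The helper names (accepts_program, label) and the 'other' label are hypothetical, and statement parsing is not modelled: the rule is reduced to checking the first and last token labels.

```python
# Hypothetical sketch: a grammar rule seen as a pattern over token labels.
# Only the opening and closing labels are checked; statements are ignored.

def accepts_program(labels, opener, closer):
    """Check Program -> opener Statement* closer on a list of token labels."""
    return len(labels) >= 2 and labels[0] == opener and labels[-1] == closer

# Coarse labelling: 'begin' and 'end' both become 'reserved_word'.
coarse = {"begin": "reserved_word", "end": "reserved_word"}
# Fine labelling: each keyword gets its own label.
fine = {"begin": "begin", "end": "end"}

def label(symbols, table):
    # Anything not in the table is labelled 'other' in this sketch.
    return [table.get(s, "other") for s in symbols]

bad_program = ["end", "A", "=", "1", "begin"]

coarse_ok = accepts_program(label(bad_program, coarse),
                            "reserved_word", "reserved_word")
fine_ok = accepts_program(label(bad_program, fine), "begin", "end")
print(coarse_ok, fine_ok)
```

    With the coarse labels the ill-formed program is accepted; with one label per keyword it is rejected.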

    Lexical Analyser

    Determining the Token set

    8

    1 : ::= begin ; end

    2 : ::=

    3 : | ;

    4 : ::=

    5 : | ;

    6 : ::=

    7 : ::= bool

    8 : | int

    9 : | ref

    10 : ::=

    11 : | ,

    12 : ::=

    13 : |

    14 : |

    15 : |

    15 : |

    16 : | call

    17 : ::= :=

    18 : ::= if then fi

    19 : | if then else

    Lexical Analyser

    Identifying the scope of the lexical analysis in the grammar of the language

  • 5

    9

    20 : ::= while do end

    21 : | repeat until

    22 : ::= input

    23 : | output

    24 : ::=

    25 : | +

    26 : | -

    27 : | -

    28 : ::=

    29 : | *

    30 : ::=

    31 : |

    32 : | ( )

    33 : | ( )

    34 : ::= =

    35 : |


    Topic 2

    One and Two Pass Lexical Analysis

  • 6

    11

    One Pass Lexical Analyser

    Identifies symbols and immediately assigns a token label to each symbol:

        begin
          int A;
          A := 100;
          A := A+A;
          print A
        end

    Output:

        (begin,begin) (type,int) (id,A) (semic,;)
        (id,A) (eqsgn,:=) (int,100) (semic,;)
        (id,A) (eqsgn,:=) (id,A) (symb,+) (id,A) (semic,;)
        (reserved-word,print) (id,A)
        (end,end)

    12

    Two Pass Lexical Analysis

    In a two-pass lexical analyser, the first pass groups characters into symbols and the second pass assigns token labels to the symbols.

    Source code:

        begin
          int A;
          A := 100;
          A := A+A;
          print A
        end

    First pass (symbols):

        begin  int  A  ;  A  :=  100  ;  A  :=  A  +  A  ;  print  A  end

    Second pass (tokens):

        (begin,begin) (type,int) (id,A) (semic,;)
        (id,A) (eqsgn,:=) (int,100) (semic,;)
        (id,A) (eqsgn,:=) (id,A) (symb,+) (id,A) (semic,;)
        (reserved-word,print) (id,A)
        (end,end)
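
    The second pass can be sketched as a lookup over the symbol list. This is only an illustration: the function name label_symbols and the two tables are hypothetical, and the label names simply follow the token pairs used on the slide.

```python
# Hypothetical second pass: map each symbol to a (label, symbol) pair.
# Label names follow the slide's example output.
RESERVED = {"begin": "begin", "end": "end",
            "int": "type", "print": "reserved-word"}
PUNCT = {";": "semic", ":=": "eqsgn", "+": "symb"}

def label_symbols(symbols):
    tokens = []
    for sym in symbols:
        if sym in RESERVED:
            lab = RESERVED[sym]
        elif sym in PUNCT:
            lab = PUNCT[sym]
        elif sym.isdigit():
            lab = "int"      # integer literal, e.g. 100
        else:
            lab = "id"       # anything else is treated as an identifier
        tokens.append((lab, sym))
    return tokens

# Output of the first pass for the example program.
symbols = ["begin", "int", "A", ";", "A", ":=", "100", ";",
           "A", ":=", "A", "+", "A", ";", "print", "A", "end"]
tokens = label_symbols(symbols)
print(tokens)
```

    Keeping the labelling in tables means the first pass does not need to know anything about the meaning of the symbols it produces.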

  • 7

    13

    Most programming languages are designed so that the code can be segmented into tokens without any knowledge of the meaning of the tokens.

    Simple rules are adhered to:

        White-space ends a symbol.
        Multiple white-space is ignored.
        Identifiers contain only alphanumeric chars or _
        Identifiers never start with a number.
        A symbol starting with a number IS a number: 1, 34, 10.0
        Some chars are always symbols by themselves: } { ; ( ) ,
        Mathematical chars can be solo or followed by =, e.g., > or >=, etc.

    No restriction on starting char

    If char sequence can be interpreted as a number, it is

    Else it is an identifier

    E.g., 1+ is an identifier

    +1 is a number
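
    The rule "if the char sequence can be interpreted as a number, it is a number, else it is an identifier" could be sketched as follows. The number syntax here (optional sign, digits, optional fraction) is an assumption made for the illustration, and the name classify is hypothetical.

```python
import re

# Assumed number syntax for this sketch: optional sign, then either
# digits with an optional fraction, or a fraction starting with '.'.
NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")

def classify(symbol):
    """Return 'number' if the whole symbol reads as a number, else 'identifier'."""
    return "number" if NUMBER.fullmatch(symbol) else "identifier"

for s in ["1+", "+1", "10.0", "abc"]:
    print(s, classify(s))
```

    Under this assumption 1+ falls through to identifier because the trailing + prevents the whole sequence from being read as a number, while +1 is read as a signed number.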

  • 8

    Topic 3

    Methods of Lexical Analysis

    16

    Three main approaches:

    1) Ad-hoc coding: code is written to recognise each type of token.

    2) Regular expressions, e.g.,

        float: [0-9]*\.[0-9]+
        Id: [a-zA-Z_][a-zA-Z_0-9]*

    3) Context-free grammar, e.g.,

        Token :- Id | Int | Literal |
        Id :- Alfa | Alfa Id2
        Id2 :- Alfa | Digit | Alfa Id2 | Digit Id2

    Lexical Analyser: Using grammars

    Approaches to Lexical Analysis
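
    As a brief illustration of the expression-based approach, the two patterns for float and Id can be tried with Python's re module. The helper names is_float and is_id are hypothetical. The dot in the float pattern is written escaped so that it matches only a literal decimal point; an unescaped dot would match any character.

```python
import re

# The two example patterns, with the decimal point escaped.
FLOAT = re.compile(r"[0-9]*\.[0-9]+")
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

def is_float(text):
    # fullmatch requires the whole string to match the pattern
    return FLOAT.fullmatch(text) is not None

def is_id(text):
    return IDENT.fullmatch(text) is not None

print(is_float("34.001"), is_float(".0"), is_float("34"), is_float("3x4"))
print(is_id("_tmp1"), is_id("2abc"))
```

    Note that 3x4 is rejected only because the dot is escaped, and that 2abc is rejected as an identifier because the first character class excludes digits.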

  • 9

    Topic 2.1

    Ad Hoc Coding of Lexical Analysis: Recognising Symbols

    18

    Two Pass Lexical Analysis with ad-hoc code

    Common approach (1): a human writes code to recognise the tokens of the source language:

        def tokenise():
            symbolList = []
            while not eof():
                // process next chars until end of symbol
                // add symbol to symbolList
                . . .
            return symbolList

  • 10

    19

    The loop dispatches on the type of the next character:

        def tokenise():
            symbolList = []
            while not eof():
                case type(nextc):
                    'whitespace': ...
                    'alpha': ...
                    'digit': ...
                    etc.
            return symbolList

    20

    The type of a character is determined by a helper function:

        def type(char):
            if char in 'a-zA-Z_': return 'alpha'
            if char in '0-9': return 'digit'
            if char in ' \t\n': return 'whitespace'
            if char in '{};,': return 'sepchar'
            if char in >
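
    A runnable version of the character classifier might look like the sketch below. It is renamed char_type to avoid shadowing Python's built-in type. The mathchar set is an assumption taken from the comment on the mathchar case later in the slides (= > < + - * /), and the 'unknown' result and the single-character check are additions for this illustration.

```python
import string

def char_type(char):
    """Classify a single character into one of the lexer's character classes."""
    if len(char) != 1:
        raise ValueError("char_type expects exactly one character")
    if char in string.ascii_letters + "_":
        return "alpha"          # letters and underscore
    if char in string.digits:
        return "digit"
    if char in " \t\n":
        return "whitespace"
    if char in "{};,":
        return "sepchar"        # always a symbol by itself
    if char in "=><+-*/":
        return "mathchar"       # assumed set, may be followed by '='
    return "unknown"

print([char_type(c) for c in "a_9 ;>?"])
```

    The length check matters because the empty string is contained in every Python string, so an empty argument would otherwise be classified as alpha.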

  • 11

    21

    Identifiers: read an alpha char, then keep reading while the next char is an alpha or a digit. White-space is consumed and discarded.

        def tokenise():
            symbolList = []
            while not eof():
                case type(nextc):
                    'alpha':        // alpha includes here '_'
                        symbol = "" + getc()
                        while type(nextc) in ['alpha', 'digit']:
                            symbol += getc()
                        symbolList.append(symbol)
                    'whitespace': getc()
                    'digit': ...
                    ...


  • 12

    23

        . . .
                    'mathchar':     // = > < + - * /
                        symbol = "" + getc()
                        if nextc == '=':
                            symbol += getc()
                        symbolList.append(symbol)
                    'sepchar':      // { } ; ,
                        symbol = "" + getc()
                        symbolList.append(symbol)
                    default: print "ERROR: Unknown Char: " + getc()
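
    Putting the cases together, a first pass could be sketched in runnable Python as below. This is an illustration under stated assumptions, not the course's implementation: it indexes into a string instead of using getc()/nextc, raises an exception on an unknown character instead of printing, reads digits as plain integers only, and adds ':' to the mathchar set so that := comes out as one symbol.

```python
def char_type(char):
    # Character classes as in the slides; ':' is an added assumption.
    if char.isalpha() or char == "_":
        return "alpha"
    if char in "0123456789":
        return "digit"
    if char in " \t\n":
        return "whitespace"
    if char in "{};,":
        return "sepchar"
    if char in "=><+-*/:":
        return "mathchar"
    return "unknown"

def tokenise(text):
    """First pass: group the characters of text into a list of symbols."""
    symbols = []
    pos = 0
    while pos < len(text):
        kind = char_type(text[pos])
        start = pos
        if kind == "whitespace":
            pos += 1                     # separators are discarded
            continue
        if kind == "alpha":
            pos += 1
            while pos < len(text) and char_type(text[pos]) in ("alpha", "digit"):
                pos += 1
        elif kind == "digit":
            while pos < len(text) and char_type(text[pos]) == "digit":
                pos += 1
        elif kind == "mathchar":
            pos += 1
            if pos < len(text) and text[pos] == "=":
                pos += 1                 # two-char symbol such as := or >=
        elif kind == "sepchar":
            pos += 1
        else:
            raise ValueError(f"Unknown char: {text[pos]!r}")
        symbols.append(text[start:pos])
    return symbols

print(tokenise("A := A+A;"))
```

    The output is a flat list of symbols, which a second pass can then label.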


  • 13

    25

    Numbers:

    Formats: 1, 34, 34.001, .0

    Procedure:
    1) Read digits until we reach a non-digit.
    2) If the next char is ".", then read digits until we reach a non-digit.

        'digit':
            symbol = "" + getc()
            while nextc in "0123456789":
                symbol += getc()
            if nextc == ".":
                symbol += getc()
                while nextc in "0123456789":
                    symbol += getc()
            symbolList.append(symbol)
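
    The number procedure can also be written as a standalone function. The sketch below is hypothetical (the name read_number and the string-plus-position interface are not from the slides) and differs from the procedure in two labelled ways: it only consumes the "." when a digit follows it, and it can be called directly on a symbol such as .0, which the 'digit' case alone would never reach because that symbol starts with a dot.

```python
DIGITS = "0123456789"

def read_number(text, pos=0):
    """Read a number starting at text[pos].

    Returns (symbol, new_pos); symbol is "" if no number starts at pos.
    """
    start = pos
    # Step 1: read digits until we reach a non-digit.
    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    # Step 2: a '.' followed by a digit starts the fractional part
    # (requiring the following digit is an assumption of this sketch).
    if pos + 1 < len(text) and text[pos] == "." and text[pos + 1] in DIGITS:
        pos += 1
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1
    return text[start:pos], pos

for sample in ["1", "34", "34.001;", ".0", "7.x"]:
    print(sample, read_number(sample))
```

    Returning the new position lets the caller continue scanning from the first character after the number.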

    Lexical Analy