Lexical analysis (faculty.cse.tamu.edu/.../LexicalAndSyntaxAnalysis.docx)

Lexical analysis – The Big Picture

• is the scanner phase of the overall compilation procedure
• happens at the very beginning
• must be given a file with code to read

Steps in Compiling a text file into a program


Lexical Analysis – The Smaller Picture

• a pattern matcher for character strings
• front end of the parser
• transforms a character stream into a token stream
• also called a scanner, lexer, or linear analysis
• Two parts
  o Lexical Analyzer
  o Parser

Parts of Lexical Analysis

• functions the lexical analyzer uses
  o lookup
    ▪ determines whether the string in lexeme is a reserved word
  o getchar
    ▪ reads the next char from the input and places it into a variable “nextChar”
  o addChar
    ▪ appends “nextChar” to the lexeme being built


What we have covered so far

• We have talked about
  o Lexemes and Tokens
    ▪ Physical breakup and categorization of text
  o Parsing
    ▪ Determining the syntax (or fit) of the text to the Grammar

Parsing, behind the scenes


Basic lexical analysis terms

• Token
  o A classification for a common set of strings
  o Examples: <identifier>, <number>, <operator>, <open paren>, etc.
• Pattern
  o The rules which characterize the set of strings for a token
  o Recall file and OS wildcards (*.java)
• Lexeme
  o Actual sequence of characters that matches a pattern and is classified by a token
  o Identifiers: x, count, name, etc.
  o Integers: -12, 101, 0, …

Examples of token, lexeme and pattern


But how do we analyze all the code?

• There is no easy way of doing this
• Theory (in pictures) using FSM

Tokenizing the code (Lexical FSM)

Supposed to work with this code:
x = a + 5
a = -x / y
But what’s missing??

• Code
  o Store variables (symbol table)
  o Tokenize the code string (from a .cpp or .java file)
  o addChar(), lookup() and getchar() functions
• Regular Expression
  o Shorter way of expressing all of the above


Regular expressions (REs)

• Scanners are based on regular expressions that define simple patterns
  o Simpler and less expressive than BNF
  o Uses some of the same notation as EBNF
• Basic operations are set union, concatenation, Kleene closure
  o Plus: parentheses, naming patterns
• No recursion!
• Why use??
  o being able to name patterns is just syntactic sugar
  o using parentheses to group things is just syntactic sugar, provided we specify the precedence and associativity of the operators (i.e., |, * and “concat”)
    ▪ “syntactic sugar” refers to syntax within a programming language that is designed to make things easier to read or to express
• A regular language is a language that can be defined by a regular expression
• http://youtu.be/394NxYBDaiA (about 11 minutes)
  o does a great job of explaining much of what follows

Regular expression (REs) Example


RE operators “+”, “ε” and “?”

• The + operator is commonly used to mean “one or more repetitions” of a pattern
  o We can always do without this
    ▪ letter+ = letter letter*
  o So the + operator is just syntactic sugar
• Epsilon ε
  o Sometimes we’d like a token that represents nothing
  o This makes regular expression matching more complex, but can be useful
• The operator “?” means ZERO or ONE!! (Optional)
  o This is different than *, which is ZERO or MANY

Regular Expression Notation/Operators

Operator   Example                              Meaning
|          0|1|2|3|4|5|6|7|8|9                  means "or"
           +|-|ε
.                                               match any character. WILDCARD (for one char.)
*          L* = L+ | ε                          means zero or more instances of
           L+ = L L*
?          L? = L|ε                             means optional
+          sign digit+                          means one or more instances of
( )                                             are used for grouping
[..]       [abc] = a|b|c                        Character classes
           [a-z] = a|b|c...|z

COMPLETE list in FYI Section

Reading Regular Expression Notation

Example   Breakdown1   Breakdown2 in English   Answer(s)
[01]+     [0 or 1]+    1 or more 0s or 1s      10010010001
                                               001001


Precedence of operators (highest to lowest)

1. ( )s
2. * +
3. Concatenation
4. |
All the operators are left associative

Example
(A) | ((B)* (C)) is equivalent to A | B* C

Kleene Example 1
A={grand, ε}, B={father, mother}

What is A*B ?????

A*B={father, mother, grandfather, grandmother, grandgrandfather, …}

Kleene Example 2

(a | b | c)* = {ε, "a", "b", "c", "aa", "ab", ..., "bccabb", ...}
[a-c]* = …

Concatenation Example 1
A={grand, ε}, B={father, mother}

(an A is then followed by a B)

What is AB?

AB={father, mother, grandfather, grandmother}


The dot “.” in Regular Expressions

• matches a single character, without caring what that character is
  o dot CANNOT be epsilon
• The only exception is the newline character, which the dot does not match by default in most regex engines.

Complete this exercise. $ is the delimiter character showing where the regular expression begins and ends. Strings to be matched start and end with non-blank characters: there are no leading or trailing blanks.

.  match any character. WILDCARD (for one char.)
*  means zero or more instances of
?  means optional
+  means one or more instances of

There can be more than ONE correct answer per question

1 Which of the following matches regexp $a(ab)*a$
  a. abababa
  b. aaba
  c. aabbaa
  d. aba
  e. aabababa

Stop here. I will go over. Answerb:

2 Which of the following matches regexp $ab+c?$
  a. abc
  b. ac
  c. abbb
  d. bbc

3 Which of the following matches regexp $a.[bc]+$
  a. abc
  b. abbbbbbbb
  c. azc
  d. abcbcbcbc
  e. ac
  f. asccbbbbcbcccc

Answers


Regular grammar and regular expression

• They are equivalent
  o Every regular expression can be expressed by a regular grammar
  o Every regular grammar can be expressed by a regular expression
• One Rule
  o An identifier must begin with a letter and can be followed by an arbitrary number of letters and digits.
• Grammar vs. Expression
  o expression is one long string
  o grammar uses LHS → RHS
    ▪ broken down from one big string

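Regular Grammars Example

Regular Grammar (notation example)
  S --> A
  S --> B
  B --> bB
  A --> aA
  A --> Ɛ
  B --> Ɛ
  Ɛ = empty character

Regular Expression (notation example; it does not describe the same language as the grammar above)
  \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

The identifier rule, written both ways

Regular Grammar
  ID --> LETTER ID_REST
  ID_REST --> LETTER ID_REST | DIGIT ID_REST | EMPTY

Regular Expression
  ID: LETTER (LETTER | DIGIT)*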


FSM == FA

• Finite state machine and finite automaton are different names for the same concept
• The basic concept is important and useful in almost every aspect of computer science
• The concept provides an abstract way (PICTURE FORM!!) to describe a process that
  o Has a finite set of states it can be in
  o Gets a sequence of inputs
  o Each input causes the process to go from its current state to a new state (which might be the same!)
  o If, after the input ends, we are in one of a set of accepting states, the input is accepted by the FA

Example FA

Notice there are 2 options of 0 from q2
What state do these end with?

1. 000 =
2. 00 =
3. 0111010 =
4. 0110 =


Why 1+ routes for the same option?

• In some scenarios you have two options, but the one that is chosen can only be determined if we “look ahead”
  o look at the last chart: from q0, there are two z0 options
    ▪ we have to look in the scanner string for the next input
  o if we knew what we needed ahead of time, we could pick between the z0 paths

DFA != FA

• In a DFA there is only one choice for a given input in every state
• There are no states with two arcs that match the same input that transition to different states
• If there is an input symbol that matches no arc for the current state, the input is not accepted
• a DFA can have MULTIPLE accept states

Deterministic finite automaton Exercise

This FA accepts only binary numbers that are multiples of 3 (base 10)

1. What does an accept state mean?
2. What is both the start state and an accept state?
3. What state do these end on?
   0001
   101
   11000
   100010
   100100101


REs can be represented as DFAs

• A regular expression for a simple identifier might be easier to picture as a DFA, matching it against a string

DFSM for a.[bc]+

Equivalence of DFA & RE


REs to DFSMs

• Ground Rules
  o every state must have a link for every symbol of the language's alphabet
    ▪ if the alphabet is {a,b}, then every state has a and b links HEADING OUT
    ▪ There is a “convention” that if the transition is not shown for some given input, then the transition is to an implicit error state.
      you are not doing this!!
  o the “dead” state
    ▪ a state from which there is no return!!
    ▪ a and b possibly link back to the same state
  o multiple accept states are not always avoidable

Steps for success
1. Add “->” to denote MINIMAL edges (optional)
2. Make overall notes about the grammar
3. Create some examples that fit the grammar
4. Create some examples that DO NOT fit the grammar
5. Start drawing
6. Test

Remember what all of the symbols mean!!! Keep a list somewhere.

abb as a DFA

1. Add “->” to denote unions
   a -> b -> b

2. Make overall notes about the grammar
   super simple, no other notations

3. Create some examples that fit the grammar
   abb (simplest and hardest)

4. Create some examples that DO NOT fit the grammar


aaabbaabab

5. Start with State 0, remember, each state MUST have both options a and b, even if it points them in the wrong direction (or not accepted). I used JFlap.


a (a | b)* b as a DFA

1. Add “->” to denote unions
   a -> (a | b)* -> b

2. Make overall notes about the grammar
   must start with an a
   must end with a b
   middle is a POSSIBLE combination of “a”s or “b”s

3. Create some examples that fit the grammar
   ab (simplest)
   aababbaaabbbaaabbabb (super ugly one)

4. Create some examples that DO NOT fit the grammar
   abaababa

5. Start with State 1, remember, each state MUST have both options “a” and “b”, even if it points them in the wrong direction (or not accepted). I used JFlap.

6. Test


Now try:
a. abb* (AS A GROUP)
b. ab*a*b (individual, if time permits)
c. a(ab)*a

Answersb:

Steps for success
1. Add “->” to denote MINIMAL edges (optional)
2. Make overall notes about the grammar
3. Create some examples that fit the grammar
4. Create some examples that DO NOT fit the grammar
5. Start drawing, remember to handle input that does NOT fit
6. Test (in JFlap!!)

RE as a DFA Example 2
Regular expression for a simple identifier
Identifier: letter (letter | digit)*
letter: a|b|c|...|z|A|B|C...|Z
digit: 0|1|2|3|4|5|6|7|8|9


DFA represented as a Table

• If the string contains a double “aa”, then print “string accepted”, else print “string rejected”.

Converting DFA to a Table Example

DFA

DFA Table

Draw the FA (not a DFA, why?). Answer b

InputStates

a b c d Epsilon

s0 s1, s3s1 {s2}{s2}s3 s4 ,s5s4 s6s5 s7{s6}{s7}

{ } means accept state


Answers

DFA Exercise

0001 (= 1)
101 (= 5)
11000 (= 24)
100010 (= 34)
100100101 (= 293)

DFA Creation for a[ab]*a

abb* as DFA
https://www.youtube.com/watch?v=b38W-wrOyQo

ab*a*b as DFA
https://www.youtube.com/watch?v=LIkARVKKgq4


a(ab)*a


RE as a DFA Example 1

DFA Table to Graph

(Done in GraphViz)


FYI Section

Formal language operations

Examples use L={a, b} and M={0,1}

Operation: union of L and M
  Notation: L ∪ M
  Definition: L ∪ M = {s | s is in L or s is in M}
  Example: {a, b, 0, 1}

Operation: concatenation of L and M
  Notation: LM
  Definition: LM = {st | s is in L and t is in M}
  Example: {a0, a1, b0, b1}

Operation: Kleene closure of L
  Notation: L*
  Definition: L* denotes zero or more concatenations of L
  Example: All the strings consisting of “a” and “b”, plus the empty string. {ε, a, b, aa, bb, ab, ba, aaa, …}

Operation: positive closure of L
  Notation: L+
  Definition: L+ denotes one or more concatenations of L
  Example: All the strings consisting of “a” and “b”. {a, b, aa, bb, ab, ba, aaa, …}

Formal Definition of Reg. Expression

• Let Σ be an alphabet and r a regular expression; then L(r) is the language that is characterized by the rules of r
  o Definition of regular expression
    ▪ ε is a regular expression that denotes the language {ε}
    ▪ If a is in Σ, a is a regular expression that denotes {a}
    ▪ Let r & s be regular expressions with languages L(r) & L(s)
      (r) | (s) is a regular expression denoting L(r) ∪ L(s)
      (r)(s) is a regular expression denoting L(r) L(s)
      (r)* is a regular expression denoting (L(r))*
• It is an inductive definition!


Formal definition of tokens

• A set of tokens is a set of strings over an alphabet
  o {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …}
• A set of tokens is a regular set that can be defined by using a regular expression
• For every regular set, there is a finite automaton (FA) that can recognize it
  o Aka deterministic Finite State Machine (FSM)
  o i.e. determine whether a string belongs to the set or not
  o Scanners extract tokens from source code in the same way DFAs determine membership


Entire List of Reg. Expression Symbols

Symbol   Description
^        Put a circumflex at the start of an expression to match the beginning of a line.
$        Put a dollar sign at the end of an expression to match the end of a line.
.        Put a period anywhere in an expression to match any character.
*        Put an asterisk after an expression to match zero or more occurrences of that expression.
+        Put a plus sign after an expression to match one or more occurrences of that expression.
_        Put an underscore to match a comma (,), left brace ({), right brace (}), the beginning of the input string, the end of the input string, or a space.
?        Put a question mark after an expression to match zero occurrences or one.
[ ]      Put characters inside square brackets to match any one of the bracketed characters but no others.
[^]      Put a leading circumflex inside square brackets with one or more characters to match any character except those inside the brackets.
[ - ]    Put a hyphen inside square brackets between characters to designate a range of characters.
<        Put a left angle bracket at the start of an expression to match the beginning of a word.
>        Put a right angle bracket at the end of an expression to match the end of a word.
\b       Use backslash b to match the backspace character (# 8).
\t       Use backslash t to match the tab character (# 9).
\n       Use backslash n to match the new-line character (# 10).
\f       Use backslash f to match the form-feed character (# 12).
\r       Use backslash r to match the carriage-return character (# 13).
\x00     Use backslash x with a hexadecimal code of \x00 to \xFF to match the corresponding character.
\        Use a backslash to make a regular-expression symbol a literal character.
|        Use a vertical bar between expressions to match either expression. Use up to nine vertical bars, separating up to ten expressions, any of which are to be found in a line. NOTE: Spaces before and after the vertical bar are significant. For example, “near | far” represents a regular-expression search for “near ” or “ far”, not “near” or “far”.
&        Use an ampersand between expressions to match both expressions. Use up to nine ampersands, joining up to ten expressions, all of which are to be found in a line. NOTE: Spaces before and after the ampersand are significant. Thus, “near & far” is not the same as “near&far”, which is probably what you want.
{ }      Use a left curly bracket paired with a right curly bracket to denote a sub-expression within the complete regular expression. You may make and denote multiple sub-expressions within the complete regular expression. You may refer to such sub-expressions by number if you create Replacement Expressions for Replace operations. This denotation of a sub-expression has no effect on Find operations.


Properties of regular expressions

• We can easily determine some basic properties of the operators involved in building regular expressions

Property                Description
r|s = s|r               | is commutative
r|(s|t) = (r|s)|t       | is associative
(rs)t = r(st)           Concatenation is associative
r(s|t) = rs | rt        Concatenation distributes over |
(s|t)r = sr | tr
...                     ...

DFA represented as a C Program (no Table)

    #include <stdio.h>

    /* Accepts input that contains a double "aa"; S3 is the accept state. */
    int main(void)
    {
        enum State {S1, S2, S3};
        enum State currentState = S1;
        int c = getchar();
        while (c != EOF) {
            switch (currentState) {
            case S1:                        /* previous char was not 'a' */
                if (c == 'a') currentState = S2;
                else if (c == 'b') currentState = S1;
                break;
            case S2:                        /* previous char was 'a' */
                if (c == 'a') currentState = S3;
                else if (c == 'b') currentState = S1;
                break;
            case S3:                        /* "aa" already seen: stay here */
                break;
            }
            c = getchar();
        }
        if (currentState == S3)
            printf("string accepted\n");
        else
            printf("string rejected\n");
        return 0;
    }

Same Code w/ Table


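    #include <stdio.h>

    /* Same DFA, driven by a transition table: table[state][label]. */
    int main(void)
    {
        enum State {S1, S2, S3};
        enum Label {A, B};
        enum State currentState = S1;
        enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}};
        int c = getchar();
        while (c != EOF) {
            /* Characters other than 'a' and 'b' (e.g. the newline) are
               ignored, so the table is never indexed with an unset label. */
            if (c == 'a' || c == 'b') {
                enum Label label = (c == 'a') ? A : B;
                currentState = table[currentState][label];
            }
            c = getchar();
        }
        if (currentState == S3)
            printf("string accepted\n");
        else
            printf("string rejected\n");
        return 0;
    }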

Sources

FSM:
http://www.codeproject.com/Articles/32966/An-Object-oriented-Approach-to-Finite-State-Automa

Lexical DFSM:
http://www.codeproject.com/Articles/17783/Predictive-Parser-to-generate-syntax-tree-and-an-I

Regular Expression:
http://users.lmi.net/canepa/subdir/regex_symbols.html
http://www.codeproject.com/Articles/5412/Writing-own-regular-expression-parser
http://www.cs.washington.edu/education/courses/cse341/03sp/slides/Perl3/sld007.htm
http://regex.sketchengine.co.uk/
http://www.regular-expressions.info/dot.html
