Lexical analysis

Lexical analysis – The Big Picture
Lexical analysis is the scanner part of the overall compilation procedure.
It happens at the very beginning.
It must be given a file with code to read.
Steps in Compiling a text file into a program
Lexical Analysis – The smaller Picture
A pattern matcher for character strings
Front end of the parser
Transforms a character stream into a token stream
Also called a scanner, lexer, or linear analysis
Two parts
  o Lexical Analyzer
  o Parser
Parts of Lexical Analysis
Functions the lexical analyzer uses
  o lookup
      determines whether the string in lexeme is a reserved word
  o getchar
      reads the next char from the input and places it into a variable “nextChar”
  o addChar
      adds “nextChar” to lexeme
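The three helper functions can be sketched in C roughly as follows. This is only an illustration under assumptions: the input is an in-memory string rather than a file, setInput() and the reserved-word list are hypothetical, and the reader is spelled getChar() so it does not clash with the standard library's getchar().

```c
#include <string.h>

#define MAX_LEXEME 100

char lexeme[MAX_LEXEME];      /* the lexeme being built */
int lexLen = 0;               /* number of characters in lexeme */
char nextChar;                /* most recently read input character */
const char *input = "";       /* assumption: input is a string, not a file */

/* Hypothetical reserved-word list, for illustration only. */
const char *reserved[] = {"if", "else", "while", "return"};

/* setInput (hypothetical helper): start scanning a new string. */
void setInput(const char *text)
{
    input = text;
    lexLen = 0;
    lexeme[0] = '\0';
}

/* getChar: read the next input character into nextChar
   ('\0' once the input is exhausted). */
void getChar(void)
{
    nextChar = *input;
    if (*input != '\0')
        input++;
}

/* addChar: append nextChar to lexeme if there is room left. */
void addChar(void)
{
    if (lexLen < MAX_LEXEME - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup: return 1 if lexeme is a reserved word, 0 otherwise. */
int lookup(void)
{
    size_t i;
    for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i]) == 0)
            return 1;
    return 0;
}
```

A scanner would call getChar() and addChar() in a loop while the characters still belong to the current lexeme, then call lookup() to decide between "reserved word" and "identifier".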
What we have covered so far
We have talked about
  o Lexemes and Tokens
      Physical breakup and categorization of text
  o Parsing
      Determining the syntax (or fit) of the text to the Grammar
Parsing, behind the scenes
Basic lexical analysis terms
Token
  o A classification for a common set of strings
  o Examples: <identifier>, <number>, <operator>, <open paren>, etc.
Pattern
  o The rules which characterize the set of strings for a token
  o Recall file and OS wildcards (*.java)
Lexeme
  o Actual sequence of characters that matches a pattern and is classified by a token
  o Identifiers: x, count, name, etc…
  o Integers: -12, 101, 0, …
Examples of token, lexeme and pattern
But how do we analyze all the code?
There is no easy way of doing this
Theory (in pictures) using FSM

Tokenizing the code (Lexical FSM)
Supposed to work with this code:
x = a + 5
a = -x / y
But what’s missing??
Code
  o Store variables (symbol table)
  o Tokenize the code (from .cpp, .java file) string
  o addChar(), lookup() and getchar() functions
Regular Expression
  o Shorter way of expressing all of the above
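To make the tokenizing step concrete, here is a minimal sketch of a hand-written scanner for lines such as x = a + 5. The token names and the nextToken() signature are assumptions made for this illustration. It handles only identifiers, unsigned integers and the five operators that appear in the sample code, and it does not build a symbol table.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token categories for this sketch. */
enum TokenType { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_OPERATOR, TOK_UNKNOWN };

/* Scan one token starting at *src, copy its characters into lexeme
   (at most cap-1 of them; cap must be at least 1) and advance *src
   past the token. */
enum TokenType nextToken(const char **src, char *lexeme, size_t cap)
{
    const char *p = *src;
    size_t n = 0;
    enum TokenType type;

    while (isspace((unsigned char)*p))          /* skip blanks */
        p++;

    if (*p == '\0') {
        type = TOK_END;
    } else if (isalpha((unsigned char)*p)) {    /* letter (letter|digit)* */
        while (isalnum((unsigned char)*p)) {
            if (n + 1 < cap) lexeme[n++] = *p;
            p++;
        }
        type = TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {    /* digit+ */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < cap) lexeme[n++] = *p;
            p++;
        }
        type = TOK_NUMBER;
    } else {                                    /* single-character token */
        type = (strchr("=+-*/", *p) != NULL) ? TOK_OPERATOR : TOK_UNKNOWN;
        if (n + 1 < cap) lexeme[n++] = *p;
        p++;
    }
    lexeme[n] = '\0';
    *src = p;
    return type;
}
```

Calling nextToken() repeatedly on "x = a + 5" yields identifier x, operator =, identifier a, operator +, number 5, and then TOK_END.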
Regular expression (REs)
Scanners are based on regular expressions that define simple patterns
  o Simpler and less expressive than BNF
  o Uses some of the same notation as EBNF
Basic operations are set union, concatenation, Kleene closure
  o Plus: parentheses, naming patterns
No recursion!
Why use??
  o Being able to name patterns is just syntactic sugar
  o Using parentheses to group things is just syntactic sugar, provided we specify the precedence and associativity of the operators (i.e., |, * and “concat”)
      Syntactic sugar refers to syntax within a programming language that is designed to make things easier to read or to express.
A regular language is a language that can be defined by a regular expression
http://youtu.be/394NxYBDaiA (about 11 minutes)
  o does a great job of explaining much of below
Regular expression (REs) Example
RE operator “+”, “ε” and “?”
The + operator is commonly used to mean “one or more repetitions” of a pattern
  o We can always do without this
      letter+ == letter letter*
  o So the + operator is just syntactic sugar
Epsilon ε
  o Sometimes we’d like a token that represents nothing
  o This makes regular expression matching more complex, but can be useful
The operator “?” means ZERO or ONE!! (Optional)
  o This is different from *, which is 0 or MANY
Regular Expression Notation/Operators
Symbol   Example                               Meaning
|        0|1|2|3|4|5|6|7|8|9, +|-|ε            means "or"
.                                              match any character. WILDCARD (for one char.)
*        L* = L+ | ε, L+ = L L*                means zero or more instances of
?        L? = L|ε                              means optional
+        sign digit+                           means one or more instances of
( )                                            are used for grouping
[..]     [abc] = a|b|c, [a-z] = a|b|c...|z     Character classes
COMPLETE list in FYI Section

Reading Regular Expression Notation
Example   Breakdown1   Breakdown2 in English   Answer(s)
[01]+     [0 or 1]+    1 or more 0s or 1s      10010010001
                                               001001
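One way to experiment with these operators is the POSIX regex library (regcomp, regexec and regfree from <regex.h>). The sketch below assumes a POSIX system. The helper name matches() and the test patterns are made up for this illustration. Because regexec looks for a match anywhere in the string, the patterns are anchored with ^ and $ so that the whole string must fit.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX extended regular expression,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);   /* 0 means "match found" */
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, matches("^[01]+$", "10010010001") returns 1, while matches("^[01]+$", "") returns 0 because + needs at least one character. With the optional sign written as [+-]?, matches("^[+-]?[0-9]+$", "-12") returns 1.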
Precedence of operators
From highest to lowest:
  ( )
  * +
  Concatenation
  |
All the operators are left associative

Example
(A) | ((B)* (C)) is equivalent to A | B* C
Kleene Example 1
A={grand, ε}, B={father, mother}
What is A*B ?????
A*B={father, mother, grandfather, grandmother, grandgrandfather, …}

Kleene Example 2
(a | b | c)* = {"ε", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
[a-c]* = …

Concatenation Example 1
A={grand, ε}, B={father, mother}
(A is then followed by a B)
What is AB?
AB={father, mother, grandfather, grandmother}
The dot “.” in Reg Expression
Matches a single character, without caring what that character is
  o dot CANNOT be epsilon
The only exceptions are newline characters.

Complete this exercise. $ is the delimiter character showing where the regular expression begins and ends. Strings to be matched start and end with non-blank characters: there are no leading or trailing blanks.
.  match any character. WILDCARD (for one char.)
*  means zero or more instances of
?  means optional
+  means one or more instances of
There can be more than ONE correct answer per question

1 Which of the following matches regexp $a(ab)*a$
a. abababa
b. aaba
c. aabbaa
d. aba
e. aabababa
Stop here. I will go over. Answer b:

2 Which of the following matches regexp $ab+c?$
a. abc
b. ac
c. abbb
d. bbc

3 Which of the following matches regexp $a.[bc]+$
a. abc
b. abbbbbbbb
c. azc
d. abcbcbcbc
e. ac
f. asccbbbbcbcccc
Answers
Regular grammar and regular expression
They are equivalent
  o Every regular expression can be expressed by a regular grammar
  o Every regular grammar can be expressed by a regular expression
One Rule
  o An identifier must begin with a letter and can be followed by an arbitrary number of letters and digits.
Grammar vs. Expression
  o expression is one long string
  o grammar uses LHS --> RHS
      broken down from one big string

Regular Grammars Example
Regular Grammar:
S --> A
S --> B
B --> bB
A --> aA
A --> Ɛ
B --> Ɛ
Ɛ = empty character
Regular Expression:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Regular Grammar:
ID --> LETTER ID_REST
ID_REST --> LETTER ID_REST | DIGIT ID_REST | EMPTY
Regular Expression:
ID: LETTER (LETTER | DIGIT)*
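The equivalence of the two identifier descriptions can be seen by coding each one directly. In this sketch (the function names are hypothetical, and LETTER and DIGIT are taken to be what isalpha and isdigit accept), the grammar version uses one recursive function per nonterminal, while the expression version is a single loop for the starred part.

```c
#include <ctype.h>

/* Grammar version.
   ID_REST --> LETTER ID_REST | DIGIT ID_REST | EMPTY */
static int id_rest(const char *s)
{
    if (*s == '\0')
        return 1;                               /* EMPTY */
    if (isalnum((unsigned char)*s))
        return id_rest(s + 1);                  /* LETTER or DIGIT, then ID_REST */
    return 0;
}

/* ID --> LETTER ID_REST */
int matches_grammar(const char *s)
{
    return isalpha((unsigned char)*s) && id_rest(s + 1);
}

/* Expression version.  ID: LETTER (LETTER | DIGIT)* */
int matches_expression(const char *s)
{
    if (!isalpha((unsigned char)*s))            /* LETTER */
        return 0;
    for (s++; *s != '\0'; s++)                  /* (LETTER | DIGIT)* */
        if (!isalnum((unsigned char)*s))
            return 0;
    return 1;
}
```

Both functions accept count and x1 and reject 1x and the empty string. The recursion in the grammar version only ever happens at the very end of a rule, which is why it can be replaced by a plain loop.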
FSM == FA
Finite state machine and finite automaton are different names for the same concept
The basic concept is important and useful in almost every aspect of computer science
The concept provides an abstract way (PICTURE FORM!!) to describe a process that
  o Has a finite set of states it can be in
  o Gets a sequence of inputs
  o Each input causes the process to go from its current state to a new state (which might be the same!)
  o If, after the input ends, we are in one of a set of accepting states, the input is accepted by the FA

Example FA
Notice there are 2 options of 0 from q2
What state do these end with?
1. 000 =
2. 00 =
3. 0111010 =
4. 0110 =
Why 1+ routes for the same option?
In some scenarios you have two options, but the one that is chosen is determined if we “look ahead”
  o look at last chart, from q0, has two z0 options
      we have to look in the scanner string for next input
  o if we knew what we needed ahead of time, we could pick between the z0 paths

DFA != FA
In a DFA there is only one choice for a given input in every state
There are no states with two arcs that match the same input that transition to different states
If there is an input symbol that matches no arc for the current state, the input is not accepted
A DFA can have MULTIPLE accept states
Deterministic finite automaton Exercise
This FA accepts only binary numbers that are multiples of 3 (base 10)
1. What does an accept state mean?
2. What is both the start state and an accept state?
3. What state do these end on?
0001
101
11000
100010
100100101
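A DFA for "binary multiples of 3" can be sketched as a three-state table in C. Since the figure is not reproduced here, the state numbering is an assumption of this sketch: state k means "the bits read so far have remainder k when divided by 3". Reading a bit b moves from remainder r to (2*r + b) mod 3, which is where the table entries come from. The function names are hypothetical.

```c
/* Return the state (0, 1 or 2) the DFA ends in after reading bits,
   or -1 if the string contains a character other than '0' or '1'.
   State 0 is both the start state and the only accepting state. */
int div3_end_state(const char *bits)
{
    /* next[state][bit] = (2 * state + bit) % 3 */
    static const int next[3][2] = {
        {0, 1},   /* remainder 0 */
        {2, 0},   /* remainder 1 */
        {1, 2}    /* remainder 2 */
    };
    int state = 0;

    for (; *bits != '\0'; bits++) {
        if (*bits != '0' && *bits != '1')
            return -1;
        state = next[state][*bits - '0'];
    }
    return state;
}

/* Return 1 if bits is the binary form of a multiple of 3. */
int accepts_multiple_of_3(const char *bits)
{
    return div3_end_state(bits) == 0;
}
```

For instance, 110 (= 6) ends in state 0 and is accepted, while 111 (= 7) ends in state 1 and is rejected.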
REs can be represented as DFAs
A regular expression for a simple identifier might be easier to picture as a DFA, matching to a string
DFSM for a.[bc]+
Equivalence of DFA & RE
REs to DFSM
Ground Rules
  o must have all alphabet in language as links from every state
      {a,b}, then every state has a and b links HEADING OUT
      There is a “convention” that if the transition is not shown for some given input, then the transition is to an implicit error state.
      you are not doing this!!
  o the “dead” state
      state where there is no return!!
      a and b possibly link back to the same state
  o multiple accept states
      not always avoidable

Steps for success
1. Add “->” to denote MINIMAL edges (optional)
2. Make overall notes about the grammar
3. Create some examples that fit the grammar
4. Create some examples that DO NOT fit the grammar
5. Start drawing
6. Test
Remember what all of the symbols mean!!! Keep a list somewhere.

abb as a DFA
1. Add “->” to denote unions
a -> b -> b
2. Make overall notes about the grammar
super simple, no other notations
3. Create some examples that fit the grammar
abb (simplest and hardest)
4. Create some examples that DO NOT fit the grammar
aaabbaabab
5. Start with State 0, remember, each state MUST have both options a and b, even if it points them in the wrong direction (or not accepted). I used JFlap.
a (a | b)* b as a DFA
1. Add “->” to denote unions
a -> (a | b )* -> b
2. Make overall notes about the grammar
must start with an a
must end with a b
middle is a POSSIBLE combination of “a”s or “b”s
3. Create some examples that fit the grammar
ab (simplest)
aababbaaabbbaaabbabb (super ugly one)
4. Create some examples that DO NOT fit the grammar
abaababa
5. Start with State 1, remember, each state MUST have both options “a” and “b”, even if it points them in the wrong direction (or not accepted). I used JFlap.
6. Test
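Step 6 can also be done in code. The sketch below encodes one possible DFA for a (a | b)* b as a transition table and runs test strings through it. The state names are assumptions of this illustration and need not match the JFlap drawing. Every state has both an a and a b transition, and Q_DEAD is the "dead" state from the ground rules.

```c
/* Hypothetical state names for this sketch. */
enum { Q_START, Q_LAST_A, Q_LAST_B, Q_DEAD };

/* Return 1 if s is accepted by the DFA for a (a | b)* b, else 0. */
int accepts_a_any_b(const char *s)
{
    /* next[state][0] is the 'a' transition, next[state][1] the 'b' one */
    static const int next[4][2] = {
        {Q_LAST_A, Q_DEAD},     /* Q_START:  must begin with a        */
        {Q_LAST_A, Q_LAST_B},   /* Q_LAST_A: last character was a     */
        {Q_LAST_A, Q_LAST_B},   /* Q_LAST_B: last character was b     */
        {Q_DEAD,   Q_DEAD}      /* Q_DEAD:   no way back              */
    };
    int state = Q_START;

    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;           /* symbol outside the alphabet {a,b} */
        state = next[state][*s == 'b'];
    }
    return state == Q_LAST_B;   /* the only accepting state */
}
```

Strings such as ab and aabab are accepted. Strings such as ba, aba and the empty string are rejected.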
Now try:
a. abb* (AS A GROUP)
b. ab*a*b (individual, if time permits)
c. a(ab)*a
Answers b:

Steps for success
1. Add “->” to denote MINIMAL edges (optional)
2. Make overall notes about the grammar
3. Create some examples that fit the grammar
4. Create some examples that DO NOT fit the grammar
5. Start drawing, remember to handle input that does NOT fit
6. Test (in JFlap!!)

RE as a DFA Example 2
Regular expression for a simple identifier
Identifier: letter (letter | digit)*
letter: a|b|c|...|z|A|B|C...|Z
digit: 0|1|2|3|4|5|6|7|8|9
DFA represented as a Table
If the string contains a double “aa”, then print “string accepted”, else print “string rejected”.
Converting DFA to a Table Example
DFA
DFA Table
Draw the FA (not a DFA, why?). Answer b

States   Input: a   b   c   d   Epsilon
s0       s1, s3
s1       {s2}
{s2}
s3       s4, s5
s4       s6
s5       s7
{s6}
{s7}
{ } means accept state
Answers
DFA Exercise
0001 (= 1)
101 (= 5)
11000 (= 24)
100010 (= 34)
100100101 (= 293)

DFA Creation for a[ab]*a

abb* as DFA
https://www.youtube.com/watch?v=b38W-wrOyQo

ab*a*b as DFA
https://www.youtube.com/watch?v=LIkARVKKgq4
a(ab)*a
RE as a DFA Example 1
DFA Table to Graph
(Done in GraphViz)
FYI Section
Formal language operations
Example languages: L={a, b}, M={0,1}

Operation                  Notation   Definition and example
union of L and M           L ∪ M      L ∪ M = {s | s is in L or s is in M}
                                      {a, b, 0, 1}
concatenation of L and M   LM         LM = {st | s is in L and t is in M}
                                      {a0, a1, b0, b1}
Kleene closure of L        L*         L* denotes zero or more concatenations of L
                                      All the strings consisting of “a” and “b”, plus the empty string. {ε, a, b, aa, bb, ab, ba, aaa, …}
positive closure           L+         L+ denotes “one or more concatenations of” L
                                      All the strings consisting of “a” and “b”. {a, b, aa, bb, ab, ba, aaa, …}
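The concatenation row can be checked with a few lines of C. This is only a sketch: the function name, the fixed string size MAX_STR and the array-based representation of a language are assumptions made for the illustration, and it works only for finite languages.

```c
#include <stdio.h>
#include <stddef.h>

#define MAX_STR 16   /* assumed maximum length of a result string, including '\0' */

/* Build LM = {st | s is in L and t is in M}.
   Writes at most cap strings into out and returns how many were written. */
size_t concat_languages(const char *L[], size_t nl,
                        const char *M[], size_t nm,
                        char out[][MAX_STR], size_t cap)
{
    size_t i, j, n = 0;

    for (i = 0; i < nl; i++) {
        for (j = 0; j < nm; j++) {
            if (n == cap)
                return n;
            /* snprintf truncates rather than overflowing out[n] */
            snprintf(out[n], MAX_STR, "%s%s", L[i], M[j]);
            n++;
        }
    }
    return n;
}
```

With L = {"a", "b"} and M = {"0", "1"} the function produces a0, a1, b0, b1, in that order, matching the example in the table.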
Formal Definition of Reg. Expression
Let Σ be an alphabet and r a regular expression; then L(r) is the language that is characterized by the rules of r
  o Definition of regular expression
      ε is a regular expression that denotes the language {ε}
      If a is in Σ, a is a regular expression that denotes {a}
      Let r & s be regular expressions with languages L(r) & L(s)
          (r) | (s) is a regular expression denoting L(r) ∪ L(s)
          (r)(s) is a regular expression denoting L(r) L(s)
          (r)* is a regular expression denoting (L(r))*
It is an inductive definition!
Formal definition of tokens
A set of tokens is a set of strings over an alphabet
  {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …}
A set of tokens is a regular set that can be defined by using a regular expression
For every regular set, there is a finite automaton (FA) that can recognize it
  o Aka deterministic Finite State Machine (FSM)
  o i.e. determine whether a string belongs to the set or not
  o Scanners extract tokens from source code in the same way DFAs determine membership
Entire List of Reg. Expression Symbols
Symbol   Description
^      Put a circumflex at the start of an expression to match the beginning of a line.
$      Put a dollar sign at the end of an expression to match the end of a line.
.      Put a period anywhere in an expression to match any character.
*      Put an asterisk after an expression to match zero or more occurrences of that expression.
+      Put a plus sign after an expression to match one or more occurrences of that expression.
_      Put an underscore to match a comma (,), left brace ({), right brace (}), the beginning of the input string, the end of the input string, or a space.
?      Put a question mark after an expression to match zero occurrences or one.
[ ]    Put characters inside square brackets to match any one of the bracketed characters but no others.
[^]    Put a leading circumflex inside square brackets with one or more characters to match any character except those inside the brackets.
[ - ]  Put a hyphen inside square brackets between characters to designate a range of characters.
<      Put a left angle bracket at the start of an expression to match the beginning of a word.
>      Put a right angle bracket at the end of an expression to match the end of a word.
\b     Use backslash b to match the backspace character (# 8).
\t     Use backslash t to match the tab character (# 9).
\n     Use backslash n to match the new-line character (# 10).
\f     Use backslash f to match the form-feed character (# 12).
\r     Use backslash r to match the carriage-return character (# 13).
\x00   Use backslash x with a hexadecimal code of \x00 to \xFF to match the corresponding character.
\      Use a backslash to make a regular-expression symbol a literal character.
|      Use a vertical bar between expressions to match either expression. Use up to nine vertical bars, separating up to ten expressions, any of which are to be found in a line. NOTE: Spaces before and after the vertical bar are significant. For example, “near | far” represents a regular-expression search for “near “ or “ far”, not “near” or “far”.
&      Use an ampersand between expressions to match both expressions. Use up to nine ampersands, joining up to ten expressions, all of which are to be found in a line. NOTE: Spaces before and after the ampersand are significant. Thus, “near & far” is not the same as “near&far”, which is probably what you want.
{ }    Use a left curly bracket paired with a right curly bracket to denote a sub-expression within the complete regular expression. You may make and denote multiple sub-expressions within the complete regular expression. You may refer to such sub-expressions by number if you create Replacement Expressions for Replace operations. This denotation of a sub-expression has no effect on Find operations.
Properties of regular expressions
We can easily determine some basic properties of the operators involved in building regular expressions

Property                Description
r|s = s|r               | is commutative
r|(s|t) = (r|s)|t       | is associative
(rs)t = r(st)           Concatenation is associative
r(s|t) = rs | rt        Concatenation distributes over |
(s|t)r = sr | tr
...                     ...
DFA represented as a C Program (no Table)

#include <stdio.h>

int main(void)
{
    enum State {S1, S2, S3};
    enum State currentState = S1;       /* S1 is the start state */
    int c = getchar();

    while (c != EOF) {
        switch (currentState) {
        case S1:                        /* previous character was not 'a' */
            if (c == 'a') currentState = S2;
            else if (c == 'b') currentState = S1;
            break;
        case S2:                        /* previous character was 'a' */
            if (c == 'a') currentState = S3;
            else if (c == 'b') currentState = S1;
            break;
        case S3:                        /* "aa" already seen: stay here */
            break;
        }
        c = getchar();
    }
    if (currentState == S3)
        printf("string accepted\n");
    else
        printf("string rejected\n");
    return 0;
}
Same Code w/ Table
#include <stdio.h>

int main(void)
{
    enum State {S1, S2, S3};
    enum Label {A, B};
    enum State currentState = S1;
    /* table[state][label] is the next state; S3 is the accepting state */
    enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}};
    int c;

    while ((c = getchar()) != EOF) {
        enum Label label;
        if (c == 'a') label = A;
        else if (c == 'b') label = B;
        else continue;                  /* ignore other characters, e.g. the newline */
        currentState = table[currentState][label];
    }
    if (currentState == S3)
        printf("string accepted\n");
    else
        printf("string rejected\n");
    return 0;
}
Sources
FSM:
http://www.codeproject.com/Articles/32966/An-Object-oriented-Approach-to-Finite-State-Automa
Lexical DFSM:
http://www.codeproject.com/Articles/17783/Predictive-Parser-to-generate-syntax-tree-and-an-I
Regular Expression:
http://users.lmi.net/canepa/subdir/regex_symbols.html
http://www.codeproject.com/Articles/5412/Writing-own-regular-expression-parser
http://www.cs.washington.edu/education/courses/cse341/03sp/slides/Perl3/sld007.htm
http://regex.sketchengine.co.uk/
http://www.regular-expressions.info/dot.html