Theory of Computation
Formalism & Computation, Finite Automata, CFGs, & Compilation: Tokenization, Syntactic Analysis,
Recursive-Descent Parsing
Vladimir Kulyukin
Outline
Formalism & Computation
Finite Automata, CFGs, and Compilation
Tokenization
Syntactic Analysis
Recursive-Descent Parsing
Formalism & Computation
To a software developer, the question “why do we need
programming languages?” seems silly: we need
programming languages to develop software, of course!
To a CS theorist, there is a different answer: in order to
study computation, one must have a formalism in which
that computation can be expressed
Is there a best formalism to work with? Unlikely,
because the formalism we use is inseparable from the
computation we study
Church’s Thesis
At first glance, the previous answer seems circular, and it
is: before deciding on a formalism we must have a pretty
good idea of the computation we want to study and, vice
versa, we cannot begin to study computation until we
have a formalism that makes our ideas precise
Chicken-and-egg conundrum: which comes first,
formalism or computation?
This is the heart of what is known as Church's thesis
Alonzo Church (1903–1995)
Alonzo Church developed λ-calculus, a formal system for defining functions, applying functions, and recursion
Church’s Thesis
The commonsense formulation of Church's Thesis:
Everything computable can be computed by a
formalism X
X can be replaced by λ-calculus, Turing machines, or
some other formalism
Another subtle and often unstated assumption is that
there is a device that can mechanically execute
computational instructions expressed in that formalism
Choice of Formalism
Choice of formalism is both objective and subjective
It is objective in that many formalisms (at least, on natural
numbers) have been shown to be equivalent: any computation
that can be expressed in one can be expressed in another, and
vice versa
Similarly, programming languages are equivalent in the sense that
an algorithm implemented in one language can be implemented
in a different one without any loss of generality
It is subjective in that people always have their own personal
preferences
Choice of Formalism
We will use the programming language L developed in Chapter 2
of Computability, Complexity, and Languages by Davis,
Weyuker, and Sigal
While L is a theoretical construct, it can be thought of as
a higher-level assembly language
Since L is a programming language, it is, in my humble
opinion, more appealing to the programmatically inclined
than more formal constructs such as λ-calculus or Turing
machines
L’s Tokens
Input variables: X1, X2, X3, ...
Local variables: Z1, Z2, Z3, ...
Output variable: Y
Labels: A1, B1, C1, D1, E1, A2, ...
If the subscript is omitted, it is assumed to be 1. For example,
X is the same as X1
Z is the same as Z1
A is the same as A1
L’s Basic Instructions (Primitives)
1. V ← V + 1 (increment)
2. V ← V - 1 (decrement)
3. V ← V (no-op)
4. IF V != 0 GOTO L (cond. branch)
NOTE: In instructions 1, 2, and 3, the variables on the left-hand side and the right-hand side are the same
Instruction V ← V + 1
● These instructions are primitives:
X1 ← X1 + 1
Z10 ← Z10 + 1
Y ← Y + 1
X102 ← X102 + 1
● These instructions are NOT primitives:
X1 ← X10 + 1
Z10 ← X1 + 1
Y ← X102 + 1
Instruction V ← V - 1
● These instructions are primitives:
X1 ← X1 - 1
Z10 ← Z10 - 1
Y ← Y - 1
X102 ← X102 - 1
● These instructions are NOT primitives:
X1 ← X10 - 1
Z10 ← X1 - 1
Y ← X102 - 1
Instruction V ← V
● These instructions are primitives:
X1 ← X1
Z10 ← Z10
X120 ← X120
Y ← Y
● These instructions are NOT primitives:
X1 ← Y
X120 ← Z10
Z10 ← X1
L’s Labeled Primitives
1. [L] V ← V + 1 (increment)
2. [L] V ← V - 1 (decrement)
3. [L] V ← V (no-op)
4. [L] IF V != 0 GOTO L (cond. branch)
NOTE: At the beginning of the line the label is in square brackets. However, in conditional dispatches the square brackets are dropped after GOTO. The label after GOTO need not be the line's own label.
Labeled Primitives: Examples
● [A1] X1 ← X1 + 1
● [B1] X23 ← X23 - 1
● [C10] Z12 ← Z12 + 1
● [E1] Y ← Y
● [D101] IF X1 != 0 GOTO E1
Increments and Decrements
● Since there is no upper limit on variable values, the increment instruction always succeeds (there are no arithmetic overflows):
– V ← V + 1
– In the above instruction V’s value is always incremented by 1
● Since variable values are natural numbers, the decrement instruction has no effect if the value of the variable is 0:
– V ← V - 1
– If V is 0 before the instruction, V remains 0 after the instruction
– If V > 0 before the instruction, V’s value is decremented by 1
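The two rules above can be sketched in Python. The function names increment and decrement are illustrative and do not come from the slides. Python integers are unbounded, which mirrors the absence of an upper limit on variable values.

```python
def increment(v: int) -> int:
    """V <- V + 1: always succeeds, since there is no upper limit."""
    return v + 1


def decrement(v: int) -> int:
    """V <- V - 1: values are natural numbers, so 0 stays 0."""
    return max(v - 1, 0)
```

For instance, decrement(0) leaves the value at 0, while decrement(5) gives 4.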
The Output Value of L’s Program
● The output value of an L program is the value of the Y variable
● If an L program goes into an infinite loop, the value is undefined
● Thus, an L program implements a function that maps the values of the input variables into the value of Y
Exit Label E
● We will assume that each L program has a unique exit label E (or E1)
● If a conditional dispatch with GOTO E or GOTO E1 is executed, control exits the program and its execution terminates
● If we want to be explicit about this, we can assume that the implicit last statement of every L program is [E1] return Y
Example
f(x) = 1, if x = 0
f(x) = x, otherwise
Implementing f(x) in L
[A1] X1 ← X1 - 1
Y ← Y + 1
IF X1 != 0 GOTO A1
Or, if we do not want to use subscripts:
[A] X ← X - 1
Y ← Y + 1
IF X != 0 GOTO A
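To check that this program computes f, here is a minimal interpreter sketch in Python. The tuple encoding (label, op, variable, target) and the names run and max_steps are assumptions made for this illustration. A jump to a label that marks no instruction (such as E1) is treated as program exit, and max_steps is an added guard against infinite loops, not a part of L.

```python
from collections import defaultdict


def run(program, inputs, max_steps=10_000):
    """Execute an L program and return the value of Y."""
    env = defaultdict(int)  # every variable starts at 0
    for i, value in enumerate(inputs, start=1):
        env[f"X{i}"] = value
    # Map each label to the first instruction that carries it.
    first_line = {}
    for index, (label, _op, _var, _target) in enumerate(program):
        if label is not None:
            first_line.setdefault(label, index)
    pc, steps = 0, 0
    while pc < len(program):
        if steps >= max_steps:
            raise RuntimeError("step limit reached; the program may not terminate")
        steps += 1
        _label, op, var, target = program[pc]
        if op == "inc":
            env[var] += 1
        elif op == "dec":
            env[var] = max(env[var] - 1, 0)  # 0 stays 0
        elif op == "jnz" and env[var] != 0:
            if target not in first_line:
                break  # exit label: leave the program
            pc = first_line[target]
            continue
        elif op not in ("nop", "jnz"):
            raise ValueError(f"unknown instruction: {op}")
        pc += 1
    return env["Y"]


# The program for f(x) from the slide above.
F_PROGRAM = [
    ("A1", "dec", "X1", None),   # [A1] X1 <- X1 - 1
    (None, "inc", "Y", None),    #      Y  <- Y + 1
    (None, "jnz", "X1", "A1"),   #      IF X1 != 0 GOTO A1
]
```

Under these assumptions, run(F_PROGRAM, [0]) gives 1 and run(F_PROGRAM, [3]) gives 3, matching the definition of f.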
Compiling L Programs
Three Stages of Compilation
● Syntactic Analysis: The source program is processed to determine its conformity to the language grammar and its structure
● Contextual Analysis: The output of the syntactic analysis (a parse tree) is checked for its conformity to the language’s contextual constraints
● Code Generation: The checked parse tree is used to generate the target code, e.g. Java byte code or assembly or some other target language
Components of Syntactic Analysis
● Syntactic Analysis consists of Tokenization and Parsing
● Tokenization: We have to define a set of FA’s (regular expressions) to tokenize input statements (primitive instructions)
● Parsing: We have to define a CFG to map tokenized input statements (primitive instructions) into parse trees
Tokenization: Two Basic Design Principles
● Zero Token Ambiguity: Each sequence of non-whitespace characters must be mapped to at most one token
● Zero Statement (Instruction) Ambiguity: Each sequence of tokens recognized in between the beginning of a line and a newline character must have at most one parse tree
Tokenization of L Programs
Sample L Program
Here is a sample program in L:
[A1] X1 <= X1 - 1
Y <= Y + 1
IF X1 != 0 GOTO A1
Tokenization: Input Variables (InputVarToken)
Input variables are tokens of the form X1, X2, X3, etc. In general, an input variable is Xk, where k is a natural number greater than 0. An NFA is as follows:
NFA transitions: X, then a digit in [1-9], then any number of digits in [0-9] (a loop on the accepting state)
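A minimal sketch of how such an automaton can be simulated by hand in Python. The state numbering and the function name is_input_var are assumptions for this illustration. The automaton is written as a deterministic one, which is enough here because no state has two transitions on the same character.

```python
def is_input_var(s: str) -> bool:
    """Accept strings of the form Xk, where k is a natural number > 0."""
    state = 0  # 0: start, 1: read X, 2: read at least one digit (accepting)
    for ch in s:
        if state == 0 and ch == "X":
            state = 1
        elif state == 1 and ch in "123456789":
            state = 2  # the first digit may not be 0
        elif state == 2 and ch in "0123456789":
            state = 2  # loop on the accepting state
        else:
            return False  # no transition: reject
    return state == 2
```

With this sketch, "X1" and "X102" are accepted, while "X", "X0", and "X01" are rejected.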
Tokenization: Output Variables (OutputVarToken)
L has only one output variable: Y. Here is an NFA:
Y
Tokenization: Local Variables (LocalVarToken)
Local variables are tokens of the form Z1, Z2, Z3, etc. In general, a local variable is Zk, where k is a natural number greater than 0. An NFA is as follows:
NFA transitions: Z, then a digit in [1-9], then any number of digits in [0-9] (a loop on the accepting state)
Tokenization of Labels
● There are two places where a label can occur in a primitive instruction: at the beginning of a line and at the end of a line
● At the beginning of a line a label is bracketed; at the end of a line it is not
● Furthermore, labels that start with A, B, C, D are non-exit labels; labels that start with E are exit labels
Tokenization: Non-Exit Non-Bracketed Labels (NExLblToken)
Non-exit labels that occur at the end of a line are tokens of the form Λ1, Λ2, Λ3, etc. In general, a label is Λk, where k is a natural number greater than 0 and Λ is in {A, B, C, D}. An NFA is as follows:
NFA transitions: one of A, B, C, D, then a digit in [1-9], then any number of digits in [0-9] (a loop on the accepting state)
Tokenization: Non-Exit Bracketed Labels (NExBrLblToken)
Non-exit labels that occur at the beginning of a line are tokens of the form [Λ1], [Λ2], [Λ3], etc. In general, a label is [Λk], where k is a natural number greater than 0 and Λ is in {A, B, C, D}. An NFA is as follows:
NFA transitions: [, then one of A, B, C, D, then a digit in [1-9], then any number of digits in [0-9], then ]
Tokenization: Exit Non-Bracketed Label (ExLblToken)
Every L program has a unique exit label (E1). If the exit label occurs at the end of a line, it is not bracketed. An NFA is as follows (this assumes that we always use the numeral 1):
NFA transitions: E, then 1
Tokenization: Exit Bracketed Label (ExBrLblToken)
Every L program has a unique exit label (E1). If the exit label occurs at the beginning of a line, it is bracketed. An NFA is as follows:
NFA transitions: [, then E, then 1, then ]
Tokenization of Operators
There are four operator tokens in L: <=, +, -, and !=. Here are possible NFAs for the operators:
● AssignOperToken: <, then =
● NotEqOperToken: !, then =
● PlusOperToken: +
● MinusOperToken: -
Tokenization of Keywords
L has two keywords: IF and GOTO. Two possible NFAs:
● IFToken: I, then F
● GOTOToken: G, then O, then T, then O
Tokenization of Literals
L has two literals: 0 and 1. Two possible NFAs:
● ZeroLitToken: 0
● OneLitToken: 1
Complete List of Tokens
1. InputVarToken
2. OutputVarToken
3. LocalVarToken
4. NExLblToken
5. ExLblToken
6. NExBrLblToken
7. ExBrLblToken
8. AssignOperToken
9. NotEqOperToken
10. PlusOperToken
11. MinusOperToken
12. IFToken
13. GOTOToken
14. ZeroLitToken
15. OneLitToken
Tokenization Algorithm: Outline
● Read in a line of text
● Partition the line into substrings on white space
● Run each substring through all possible NFAs
● Each substring can be recognized by at most one NFA
● If a substring is not recognized by an NFA, report an error; otherwise, create an appropriate token, depending on what NFA recognized the substring
● The output is a sequence of tokens
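The outline above can be sketched in Python with one regular expression standing in for each NFA. The names TOKEN_PATTERNS and tokenize_line are assumptions for this illustration, tokens are represented as (token type, substring) pairs, and the minus operator is assumed to be written as the ASCII character "-".

```python
import re

# One pattern per token type; each regular expression plays the role of an NFA.
TOKEN_PATTERNS = [
    ("InputVarToken", r"X[1-9][0-9]*"),
    ("OutputVarToken", r"Y"),
    ("LocalVarToken", r"Z[1-9][0-9]*"),
    ("NExLblToken", r"[ABCD][1-9][0-9]*"),
    ("ExLblToken", r"E1"),
    ("NExBrLblToken", r"\[[ABCD][1-9][0-9]*\]"),
    ("ExBrLblToken", r"\[E1\]"),
    ("AssignOperToken", r"<="),
    ("NotEqOperToken", r"!="),
    ("PlusOperToken", r"\+"),
    ("MinusOperToken", r"-"),
    ("IFToken", r"IF"),
    ("GOTOToken", r"GOTO"),
    ("ZeroLitToken", r"0"),
    ("OneLitToken", r"1"),
]


def tokenize_line(line):
    """Split a line on white space and map each substring to exactly one token."""
    tokens = []
    for substring in line.split():
        # Run the substring through every pattern (every "NFA").
        matches = [name for name, pattern in TOKEN_PATTERNS
                   if re.fullmatch(pattern, substring)]
        if len(matches) != 1:
            raise ValueError(f"cannot tokenize {substring!r}: matched {matches}")
        tokens.append((matches[0], substring))
    return tokens
```

This sketch keeps the substring exactly as written, so a bracketed label is stored with its brackets. Raising an error when more than one pattern matches is one way to detect a violation of the zero token ambiguity principle.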
Back to Sample L Program
Here is a sample program in L:
[A1] X1 <= X1 - 1
Y <= Y + 1
IF X1 != 0 GOTO A1
Tokenization of Line 1
● "[A1] X1 <= X1 - 1"
● White space partitioning gives us the following substrings: "[A1]", "X1", "<=", "X1", "-", "1"
● "[A1]" is recognized by the Non-Exit Bracketed Label NFA; so create NExBrLblToken("A1")
● "X1" is recognized by the Input Variable NFA; so create InputVarToken("X1")
● "<=" is recognized by the Assignment Operator NFA; so create AssignOperToken("<=")
● "X1" is recognized by the Input Variable NFA; so create InputVarToken("X1")
● "-" is recognized by the Minus Operator NFA; so create MinusOperToken("-")
● "1" is recognized by the One Literal NFA; so create OneLitToken("1")
● The output is this sequence of tokens: <NExBrLblToken("A1"), InputVarToken("X1"), AssignOperToken("<="), InputVarToken("X1"), MinusOperToken("-"), OneLitToken("1")>
Tokenization of Line 1
The line "[A1] X1 <= X1 - 1" gives us the following sequence of tokens:
NExBrLblToken "A1"
InputVarToken "X1"
AssignOperToken "<="
InputVarToken "X1"
MinusOperToken "-"
OneLitToken "1"
Tokenization of Line 2
The line "Y <= Y + 1" gives us the following sequence of tokens:
OutputVarToken "Y"
AssignOperToken "<="
OutputVarToken "Y"
PlusOperToken "+"
OneLitToken "1"
Tokenization of Line 3
The line "IF X1 != 0 GOTO A1" gives us the following sequence of tokens:
IFToken "IF"
InputVarToken "X1"
NotEqOperToken "!="
ZeroLitToken "0"
GOTOToken "GOTO"
NExLblToken "A1"
Parsing
Recursive Descent Parsing
● Recursive Descent Parsing is an algorithm that should be considered for any unambiguous CF grammar
● All programming languages are specified either with unambiguous CF grammars or with ambiguous CF grammars where ambiguity can be easily handled (e.g., lookahead)
● The basic step in designing a recursive-descent parser is to design a parsing procedure parseN for every nonterminal symbol N in the grammar
Developing a Recursive-Descent Parser for L
● To develop a recursive-descent parser for L we need to accomplish three tasks:
– Develop a CFG G for L
– Derive a set of RD parsing procedures from G
– Implement the procedures in a programming language (Java, Python, C/C++, C#, etc.)
A CFG Grammar for L
CFG Productions
● Incrmnt → VarToken AssignOperToken VarToken PlusOperToken OneLitToken
Note: this rule is simplified because, technically speaking, VarToken is not present in the list of tokens. So, we have to write additional productions of the form:
VarToken → InputVarToken | OutputVarToken | LocalVarToken
● Decrmnt → VarToken AssignOperToken VarToken MinusOperToken OneLitToken
● NOP → VarToken AssignOperToken VarToken
● CDisp → IFToken VarToken NotEqOperToken ZeroLitToken GOTOToken DispLBL
● DispLBL → NExLblToken | ExLblToken
CFG Productions
● LProgram → LInstructSEQ
To recognize an L program is to recognize a sequence of L instructions
● LInstructSEQ → ε
A sequence of L instructions can be empty
● LInstructSEQ → LInstruct LInstructSEQ
A nonempty sequence of L instructions starts with an L instruction and is followed by a sequence of L instructions
CFG Productions
● LInstruct → LblStmnt | Stmnt
To recognize an L instruction is to recognize a labeled statement or an unlabeled statement
● LblStmnt → BrLBL Stmnt
To recognize a labeled statement is to recognize a bracketed label and then to recognize a statement
● BrLBL → NExBrLblToken | ExBrLblToken
To recognize a bracketed label is to recognize a non-exit bracketed label token or to recognize an exit bracketed label token (note that NExBrLblToken and ExBrLblToken are tokens, not syntactic categories)
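One way to hold the grammar in a program is a dictionary from each nonterminal to its alternatives, sketched below in Python. The names GRAMMAR, TOKENS, undefined_symbols, and unused_tokens are assumptions for this illustration. The slides do not list a production for Stmnt; the one used here (Stmnt → Incrmnt | Decrmnt | NOP | CDisp) is inferred from the parse trees and should be read as an assumption.

```python
# Terminals: the 15 token types from the complete list of tokens.
TOKENS = {
    "InputVarToken", "OutputVarToken", "LocalVarToken", "NExLblToken",
    "ExLblToken", "NExBrLblToken", "ExBrLblToken", "AssignOperToken",
    "NotEqOperToken", "PlusOperToken", "MinusOperToken", "IFToken",
    "GOTOToken", "ZeroLitToken", "OneLitToken",
}

# Each nonterminal maps to a list of alternatives; [] is the empty string.
GRAMMAR = {
    "LProgram": [["LInstructSEQ"]],
    "LInstructSEQ": [[], ["LInstruct", "LInstructSEQ"]],
    "LInstruct": [["LblStmnt"], ["Stmnt"]],
    "LblStmnt": [["BrLBL", "Stmnt"]],
    "BrLBL": [["NExBrLblToken"], ["ExBrLblToken"]],
    # Assumed production: not stated on the slides, inferred from the parse trees.
    "Stmnt": [["Incrmnt"], ["Decrmnt"], ["NOP"], ["CDisp"]],
    "Incrmnt": [["VarToken", "AssignOperToken", "VarToken",
                 "PlusOperToken", "OneLitToken"]],
    "Decrmnt": [["VarToken", "AssignOperToken", "VarToken",
                 "MinusOperToken", "OneLitToken"]],
    "NOP": [["VarToken", "AssignOperToken", "VarToken"]],
    "CDisp": [["IFToken", "VarToken", "NotEqOperToken", "ZeroLitToken",
               "GOTOToken", "DispLBL"]],
    "DispLBL": [["NExLblToken"], ["ExLblToken"]],
    "VarToken": [["InputVarToken"], ["OutputVarToken"], ["LocalVarToken"]],
}


def _used_symbols(grammar):
    """Collect every symbol that appears on some right-hand side."""
    return {sym for alts in grammar.values() for alt in alts for sym in alt}


def undefined_symbols(grammar, tokens):
    """Right-hand-side symbols that are neither nonterminals nor tokens."""
    return sorted(_used_symbols(grammar) - set(grammar) - tokens)


def unused_tokens(grammar, tokens):
    """Tokens that no production mentions."""
    return sorted(tokens - _used_symbols(grammar))
```

Both checks return an empty list for this grammar: every symbol is defined, and each of the 15 tokens appears in some production.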
Recursive-Descent Parsing Procedures
Parsing Procedures for L
● Let us agree that each parsing procedure returns a ParseTree data structure (the base class)
● Consider the first rule in our grammar: LProgram → LInstructSEQ
ParseTree parseLProgram(input, start_pos)
{
    // An L program is a sequence of L instructions.
    ParseTree progTree = parseLInstructSEQ(input, start_pos);
    return progTree;
}
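ParseLInstructSEQ Procedure
● There are 2 productions:
1) LInstructSEQ → ε
2) LInstructSEQ → LInstruct LInstructSEQ
ParseTree parseLInstructSEQ(input, start_pos) {
    // Production 1: no tokens are left, so the sequence is empty.
    if ( start_pos is past the end of input )
        return the empty LInstructSEQ;
    else {
        // Production 2: one instruction followed by the rest of the sequence.
        ParseTree firstInstruct = parseLInstruct(input, start_pos);
        if ( firstInstruct == null )
            report a syntax error;
        ParseTree restInstructs = parseLInstructSEQ(input, firstInstruct.getNextPos());
        return new LInstructSEQ(firstInstruct, restInstructs);
    }
}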
ParseLInstruct Procedure
● Two productions for LInstruct: LInstruct → LblStmnt | Stmnt
ParseTree parseLInstruct(input, start_pos) {
    // Try the labeled alternative first; fall back to an unlabeled statement.
    ParseTree lblSt = parseLblStmnt(input, start_pos);
    if ( lblSt == null )
        return parseStmnt(input, start_pos);
    else
        return lblSt;
}
ParseLblStmnt Procedure
● G has one production for LblStmnt: LblStmnt → BrLBL Stmnt
ParseTree parseLblStmnt(input, start_pos) {
    ParseTree brLbl = parseBrLbl(input, start_pos);
    if ( brLbl == null ) return null;
    else {
        // The statement starts right after the bracketed label.
        ParseTree stmnt = parseStmnt(input, brLbl.getNextPos());
        if ( stmnt == null ) return null;
        else
            return new LblStmnt(brLbl, stmnt);
    }
}
ParseBrLbl Procedure
● G has two productions for BrLBL:
BrLBL → NExBrLblToken | ExBrLblToken
● Note that both right-hand sides consist of tokens; they do not need to be parsed, because they are terminals to the parser
● So, in this case, instead of parsing we have to make sure that these terminals are in the input
ParseBrLbl Procedure
ParseTree parseBrLbl(input, start_pos) {
    // Both alternatives are single tokens, so only the token at start_pos is checked.
    if ( input[start_pos] == NExBrLblToken )
        return new BrLbl(input[start_pos]);
    else if ( input[start_pos] == ExBrLblToken )
        return new BrLbl(input[start_pos]);
    else
        return null;
}
ParseIncrmnt Procedure
● The rest of the parsing procedures can be derived in a similar fashion
● There is one rule for Incrmnt: Incrmnt → VarToken AssignOperToken VarToken PlusOperToken OneLitToken
● This rule does not require any parsing; it requires only matching of tokens
ParseIncrmnt Procedure
ParseTree parseIncrmnt(input, start_pos) {
    // Match the five tokens of the rule one position at a time.
    if ( input[start_pos] != VarToken ) return null;
    else if ( input[start_pos+1] != AssignOperToken ) return null;
    else if ( input[start_pos+2] != VarToken ) return null;
    else if ( input[start_pos+3] != PlusOperToken ) return null;
    else if ( input[start_pos+4] != OneLitToken ) return null;
    else
        // Build the node from the matched tokens themselves.
        return new Incrmnt(input[start_pos], input[start_pos+1], input[start_pos+2],
                           input[start_pos+3], input[start_pos+4]);
}
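The same token matching can be sketched in Python. The (token type, substring) pair representation, the name parse_incrmnt, and the (tree, next position) return value are assumptions for this illustration. Like the procedure above, the sketch does not check that the variable is the same on both sides, since the grammar rule alone does not express that constraint.

```python
# VarToken -> InputVarToken | OutputVarToken | LocalVarToken
VAR_KINDS = {"InputVarToken", "OutputVarToken", "LocalVarToken"}

# Incrmnt -> VarToken AssignOperToken VarToken PlusOperToken OneLitToken
INCRMNT_RULE = [VAR_KINDS, {"AssignOperToken"}, VAR_KINDS,
                {"PlusOperToken"}, {"OneLitToken"}]


def parse_incrmnt(tokens, start_pos):
    """Return (tree, next_pos) if an increment starts at start_pos, else None."""
    window = tokens[start_pos:start_pos + len(INCRMNT_RULE)]
    if len(window) < len(INCRMNT_RULE):
        return None  # not enough tokens left
    for (kind, _text), allowed in zip(window, INCRMNT_RULE):
        if kind not in allowed:
            return None  # token mismatch
    return ("Incrmnt", window), start_pos + len(INCRMNT_RULE)
```

Returning the next position alongside the tree plays the role of getNextPos() in the procedures above.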
Back to Sample L Program
Let us parse the following L program:
[A1] X1 <= X1 - 1
Y <= Y + 1
IF X1 != 0 GOTO A1
Parsing Example: Line 1 Tokenized
The line "[A1] X1 <= X1 - 1" gives us the following sequence of tokens:
NExBrLblToken "A1"
InputVarToken "X1"
AssignOperToken "<="
InputVarToken "X1"
MinusOperToken "-"
OneLitToken "1"
Parsing Example: Line 1 ParseTree
LInstruct
  LblStmnt
    BrLbl
      NExBrLblToken "[A1]"
    Stmnt
      Decrmnt
        InputVarToken "X1"
        AssignOperToken "<="
        InputVarToken "X1"
        MinusOperToken "-"
        OneLitToken "1"
Parsing Example: Line 2 Tokenized
The line "Y <= Y + 1" gives us the following sequence of tokens:
OutputVarToken "Y"
AssignOperToken "<="
OutputVarToken "Y"
PlusOperToken "+"
OneLitToken "1"
Parsing Example: Line 2 ParseTree
LInstruct
  Stmnt
    Incrmnt
      OutputVarToken "Y"
      AssignOperToken "<="
      OutputVarToken "Y"
      PlusOperToken "+"
      OneLitToken "1"
Parsing Example: Line 3 Tokenized
The line "IF X1 != 0 GOTO A1" gives us the following sequence of tokens:
IFToken "IF"
InputVarToken "X1"
NotEqOperToken "!="
ZeroLitToken "0"
GOTOToken "GOTO"
NExLblToken "A1"
Parsing Example: Line 3 ParseTree
LInstruct
  Stmnt
    CDisp
      IFToken "IF"
      InputVarToken "X1"
      NotEqOperToken "!="
      ZeroLitToken "0"
      GOTOToken "GOTO"
      NExLblToken "A1"
Parsing Example: LProgram ParseTree
LProgram
  LInstructSEQ
    LInstruct "[A1] X1 <= X1 - 1"
    LInstruct "Y <= Y + 1"
    LInstruct "IF X1 != 0 GOTO A1"
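Putting the procedures together, here is a compact recursive-descent sketch in Python that builds such a tree from a token sequence. Tokens are assumed to be (token type, substring) pairs, trees are nested tuples, and all function names are illustrative. The statement rules are tried in a fixed order with the longer rules first, and Stmnt is assumed to cover Incrmnt, Decrmnt, NOP, and CDisp.

```python
VAR = {"InputVarToken", "OutputVarToken", "LocalVarToken"}
BR_LBL = {"NExBrLblToken", "ExBrLblToken"}
STMNT_RULES = [  # tried in order; NOP comes after the longer rules it prefixes
    ("Incrmnt", [VAR, {"AssignOperToken"}, VAR, {"PlusOperToken"}, {"OneLitToken"}]),
    ("Decrmnt", [VAR, {"AssignOperToken"}, VAR, {"MinusOperToken"}, {"OneLitToken"}]),
    ("NOP", [VAR, {"AssignOperToken"}, VAR]),
    ("CDisp", [{"IFToken"}, VAR, {"NotEqOperToken"}, {"ZeroLitToken"},
               {"GOTOToken"}, {"NExLblToken", "ExLblToken"}]),
]


def parse_stmnt(tokens, pos):
    """Stmnt: match one of the four primitive instruction shapes."""
    for name, rule in STMNT_RULES:
        window = tokens[pos:pos + len(rule)]
        if len(window) == len(rule) and all(
                kind in allowed for (kind, _), allowed in zip(window, rule)):
            return ("Stmnt", (name, window)), pos + len(rule)
    return None


def parse_lbl_stmnt(tokens, pos):
    """LblStmnt -> BrLBL Stmnt"""
    if pos >= len(tokens) or tokens[pos][0] not in BR_LBL:
        return None
    result = parse_stmnt(tokens, pos + 1)
    if result is None:
        return None
    stmnt, next_pos = result
    return ("LblStmnt", ("BrLbl", tokens[pos]), stmnt), next_pos


def parse_l_instruct(tokens, pos):
    """LInstruct -> LblStmnt | Stmnt"""
    result = parse_lbl_stmnt(tokens, pos) or parse_stmnt(tokens, pos)
    if result is None:
        raise SyntaxError(f"no L instruction starts at token {pos}")
    tree, next_pos = result
    return ("LInstruct", tree), next_pos


def parse_l_instruct_seq(tokens, pos=0):
    """LInstructSEQ -> empty | LInstruct LInstructSEQ"""
    if pos >= len(tokens):
        return ("LInstructSEQ",)
    first, next_pos = parse_l_instruct(tokens, pos)
    return ("LInstructSEQ", first, parse_l_instruct_seq(tokens, next_pos))


def parse_l_program(tokens):
    """LProgram -> LInstructSEQ"""
    return ("LProgram", parse_l_instruct_seq(tokens))
```

Unlike the flattened picture above, the tree this sketch returns keeps the right-recursive shape of LInstructSEQ: each sequence node holds one instruction and the rest of the sequence.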
References & Reading Suggestions
Hopcroft and Ullman. Introduction to Automata Theory, Languages, and Computation. Narosa Publishing House.
Moll, Arbib, and Kfoury. An Introduction to Formal Language Theory.
Davis, Weyuker, and Sigal. Computability, Complexity, and Languages, 2nd Edition. Academic Press.
Brooks Webber. Formal Language: A Practical Introduction. Franklin, Beedle & Associates, Inc.