CS 1622 Lecture 2 1
CS 1622
Lecture 2: Lexical Analysis
Lecture 2
• Review of last lecture and finish up the overview
• The first compiler phase: lexical analysis
• Reading: Chapter 2 in text (by 1/18)
The Front End
Responsibilities:
• Recognize legal and illegal programs
• Report errors meaningfully
• Produce IR and initial storage map
• Shape the code for the back end
Typically automatically constructed:
• From a lexical specification
• Based on finite automata (meet theory)
• Very well understood
[Figure: front-end pipeline. Source code → Scanner → tokens → Parser → IR; errors are reported along the way]
• Maps characters into tokens, the basic lexical units
  • x = y + z becomes <id> <assign> <id> <binop> <id>
• Lexeme = string that matches the token
  • x, y, and z are lexemes that match <id>
• Some tokens have attributes
  • <id, x> or <binop, plus>
• Eliminates whitespace
• In some languages performs preprocessing (in C done by the preprocessor)
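As a rough illustration of the mapping from characters to tokens, here is a minimal Python sketch. The token names follow the slide; the patterns, the `BINOP_NAMES` table, and the `scan` function are assumptions made for this example, not part of the course material.

```python
import re

# Hypothetical token classes and patterns, chosen to match the slide's example.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r"="),
    ("binop",  r"[+\-*/]"),
    ("ws",     r"\s+"),
]
# Attribute values for <binop, ...> tokens (names are illustrative).
BINOP_NAMES = {"+": "plus", "-": "minus", "*": "times", "/": "divide"}
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def scan(text):
    """Return a list of (token, attribute) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                      # the scanner eliminates whitespace
        if kind == "id":
            tokens.append((kind, lexeme))  # the lexeme is the attribute
        elif kind == "binop":
            tokens.append((kind, BINOP_NAMES[lexeme]))
        else:
            tokens.append((kind, None))    # no attribute needed
    return tokens


print(scan("x = y + z"))
```

The lexemes x, y, and z all map to the same token class `id`; only the attribute distinguishes them.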
• Recognizes syntactic structure & errors
• Directs semantic analysis (type checking)
• Builds IR for source program
• For some languages (more precisely: grammars) can be easily built by hand
• More flexible: use parser generators
• Can change language more easily
• Typically very fast
• Well understood theory ("push-down automata")
Grammars
• A concise and precise way to specify languages
• For context-free grammars can build efficient parsers
• Can typically write a CFG for a programming language
• Tool of choice for specifying syntactic structure
Grammars
Formally, a grammar G = (S, N, T, P)
• S is the start symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols or words
• P is a set of productions or rewrite rules (P : N → (N ∪ T)*)
CFG Example
1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7}
Deriving Sentences

Production   Result
             goal
1            expr
2            expr op term
5            expr op y
7            expr - y
2            expr op term - y
4            expr op 2 - y
6            expr + 2 - y
3            term + 2 - y
5            x + 2 - y

To recognize a valid sentence for some CFG, we reverse this process and build up a parse.
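The derivation can be replayed mechanically. The sketch below is a hedged illustration: it encodes the seven productions of the example grammar and always rewrites the rightmost non-terminal, which is the order the table follows. The names `PRODUCTIONS` and `derive` are invented for this example, and the result is shown at the token level (id, number) rather than with the lexemes x, 2, y.

```python
# The seven productions of the example grammar, keyed by rule number.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}


def derive(rule_numbers, start="goal"):
    """Apply each rule to the rightmost non-terminal; return every sentential form."""
    form = [start]
    steps = [list(form)]
    for n in rule_numbers:
        lhs, rhs = PRODUCTIONS[n]
        # Position of the rightmost non-terminal in the current form.
        idx = max(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[idx] != lhs:
            raise ValueError(f"rule {n} rewrites {lhs}, but found {form[idx]}")
        form[idx:idx + 1] = rhs
        steps.append(list(form))
    return steps


for step in derive([1, 2, 5, 7, 2, 4, 6, 3, 5]):
    print(" ".join(step))
```

The last line printed is `id + number - id`, the token-level form of x + 2 - y.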
Parse Tree
x + 2 - y
Lots of superfluous detail.
goal
  expr
    expr
      expr
        term
          <id,x>
      op
        +
      term
        <number,2>
    op
      -
    term
      <id,y>
Abstract Syntax Tree (AST)
AST for x + 2 - y (children indented under their operator):

-
  +
    <id,x>
    <number,2>
  <id,y>

• The AST summarizes grammatical structure, without including detail about the derivation
• This is much more concise
• ASTs are one form of intermediate representation (IR)
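One possible in-memory form of this AST is sketched below. The class names `Id`, `Number`, and `BinOp` and the helper functions are assumptions for illustration; a real compiler's IR would carry more information (types, source positions).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Id:
    name: str


@dataclass(frozen=True)
class Number:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Node"
    right: "Node"


Node = Union[Id, Number, BinOp]

# AST for x + 2 - y: the subtraction is the root, the addition its left child.
AST = BinOp("-", BinOp("+", Id("x"), Number(2)), Id("y"))


def to_source(node):
    """Flatten the tree back into the expression it represents."""
    if isinstance(node, Id):
        return node.name
    if isinstance(node, Number):
        return str(node.value)
    return f"{to_source(node.left)} {node.op} {to_source(node.right)}"


def count_nodes(node):
    """Number of nodes in the tree."""
    if isinstance(node, BinOp):
        return 1 + count_nodes(node.left) + count_nodes(node.right)
    return 1


print(to_source(AST), count_nodes(AST))
```

Only operators and operands appear as nodes; the goal, expr, term, and op nodes of the parse tree are gone.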
The Back End - instruction selection
Responsibilities:
• Translates IR to target code
• Selects target instructions for IR (trivial for RISC)
• Allocates machine resources (registers, memory)
• Typically implemented manually
  • For CISC, some automated pattern-matching approaches
  • Lots of hand-crafting done for good back ends: must know the target architecture well!
[Figure: back-end pipeline from IR to machine code, with Instruction Selection, Instruction Scheduling, and Register Allocation stages passing IR between them; errors are reported]
Back end - instruction scheduling
• Avoid hardware stalls and interlocks
• Use all functional units productively
• Can increase lifetime of variables
• Optimal scheduling is NP-Complete in nearly all cases, but good heuristic techniques are well understood
Back end - register allocation
• Have each value in a register when it is used
• Manage a limited set of resources
• Can change instruction choices & insert LOADs & STOREs
• Optimal allocation is NP-Complete, so compilers approximate
Traditional Three-pass Compiler
The middle end:
• Analyzes IR and rewrites (or transforms) IR
• Primary goal is to reduce running time of the compiled code
  • May also improve space, power consumption, …
• Must preserve "meaning" of the code
[Figure: source code → Front End → IR → Middle End → IR → Back End → machine code; errors are reported]
The Optimizer
• Discover & propagate some constant value
• Move a computation to a less frequently executed place
• Specialize some computation based on context
• Discover a redundant computation & remove it
• Remove useless or unreachable code
• Encode an idiom in some particularly efficient form
[Figure: IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR; errors are reported]
The Scanner: Overview
Task:
• Translate the sequence of characters into a corresponding sequence of tokens: essentially grouping characters into words and removing irrelevant characters, e.g., white space
Each time the scanner is called, it should
• find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and
• return that token.
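The "longest sequence" rule can be sketched as follows. This is a minimal illustration under stated assumptions: the token classes in `PATTERNS` are hypothetical, and ties between equally long matches are broken by list order, which is a common convention but not something the slide specifies.

```python
import re

# Hypothetical token classes; earlier entries win ties between equally long matches.
PATTERNS = [
    ("if",     re.compile(r"if")),
    ("id",     re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+")),
    ("eq",     re.compile(r"==")),
    ("assign", re.compile(r"=")),
]


def next_token(text, pos):
    """Return (kind, lexeme, new_pos) for the longest token starting at pos."""
    while pos < len(text) and text[pos].isspace():
        pos += 1                            # skip irrelevant characters
    if pos == len(text):
        return ("eof", "", pos)
    best = None
    for kind, pattern in PATTERNS:
        m = pattern.match(text, pos)
        # Keep a candidate only if it is strictly longer than the best so far.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    if best is None:
        raise ValueError(f"no token matches at position {pos}")
    return (best[0], best[1], pos + len(best[1]))


def tokens(text):
    """Call next_token repeatedly until the input is exhausted."""
    out, pos = [], 0
    while True:
        kind, lexeme, pos = next_token(text, pos)
        if kind == "eof":
            return out
        out.append((kind, lexeme))


print(tokens("if iffy == 10"))
```

Here `iffy` becomes a single `id` rather than the keyword `if` followed by `fy`, and `==` becomes one `eq` token rather than two `assign` tokens.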
How to write a scanner?
• Write it from scratch, or automatically generate it with a scanner generator
  • lex or flex (produce C code), or jlex (produces Java code)
• Input to a scanner generator: one regular expression for each token
• Output of a scanner generator: a finite state machine
• So, you need to understand:
  • regular expressions
  • finite automata
Lexical analyzers
Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies
[Figure: specifications → Scanner Generator → tables or code → Scanner; source code → Scanner → parts of speech]
Regular Expressions to Finite Automata
Generating a scanner:
Lexical specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA
Recognizing words
Example - "begin"

    c = next char;
    if c != 'b' then error;
    c = next char;
    if c != 'e' then error;
    c = next char;
    if c != 'g' then error;
    ....

Transition diagram: s0 --b--> s1 --e--> s2 --g--> s3 --i--> s4 --n--> s5

Transition diagrams serve as abstractions for the code that would be written: finite automata.
Finite Automata
• A compiler recognizes legal programs in some (source) language.
• A finite-state machine recognizes legal strings in some language.
• Example: Identifiers
  • sequences of one or more letters or digits, starting with a letter:

  S --letter--> A,  A --(letter | digit)--> A   (A is the accepting state)
Finite-Automata State Graphs
• A state
• The start state
• An accepting/final state
• A transition (labeled with an input symbol, e.g. a)
Finite Automata
• Transition: s1 →a s2
• Is read: in state s1 on input "a" go to state s2
• If end of input:
  • if in accepting state => accept
  • otherwise => reject
• If no transition possible => reject
Language defined by FSM
• The language defined by a FSM is the set of strings accepted by the FSM.
• In the language of the identifier FSM: x, tmp2, XyZzy, position27.
• Not in the language of the identifier FSM: 123, a?, 13apples.
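A minimal sketch of the identifier machine, hand-coded with its two states S and A, can be used to check these examples. Restricting letters and digits to ASCII is an assumption of this sketch, as is the function name.

```python
def accepts_identifier(text):
    """Simulate the identifier FSM: S --letter--> A, A --letter|digit--> A."""
    def is_letter(ch):
        return ch.isascii() and ch.isalpha()

    def is_digit(ch):
        return ch in "0123456789"

    state = "S"                           # start state
    for ch in text:
        if state == "S" and is_letter(ch):
            state = "A"
        elif state == "A" and (is_letter(ch) or is_digit(ch)):
            state = "A"
        else:
            return False                  # no transition possible: reject
    return state == "A"                   # accept only in the accepting state


for s in ["x", "tmp2", "XyZzy", "position27", "123", "a?", "13apples"]:
    print(s, accepts_identifier(s))
```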
Example: Integer Literals
• FA that accepts integer literals with an optional + or - sign:

  S --+--> A,  S -- - --> A,  S --digit--> B,  A --digit--> B,  B --digit--> B   (B is the accepting state)
Formal FSA Definition
A finite automaton is a 5-tuple (Σ, S, δ, s0, SF) where:
• Σ is an input alphabet
• S is a set of states
• s0 is a start state
• SF ⊆ S is a set of accepting states
• δ is the state transition function: S × Σ → S (i.e., encodes transitions state →input state)
FA for the integer-literal example
• Σ = {digit, +, -}
• A set of states S = {S, A, B}
• A start state s0 = S
• A set of accepting states SF = {B} ⊆ S
• δ is the state transition function:
    (S, digit) -> B
    (S, +) -> A
    (S, -) -> A
    (B, digit) -> B
    (A, digit) -> B
Two kinds of Automata
Deterministic (DFA):
• No state has more than one outgoing edge with the same label.
Non-Deterministic (NFA):
• States may have more than one outgoing edge with same label.
• Edges may be labeled with ε (epsilon), the empty string.
• The automaton can take an ε (epsilon) transition without looking at the current input character.
Example of NFA
• integer-literal example:

  S --+--> A,  S -- - --> A,  S --ε--> A,  A --digit--> B,  B --digit--> B   (B is the accepting state)
Non-deterministic automata (NFA)
• often simpler (e.g. smaller) than DFA
• can be in multiple states at the same time
• An NFA accepts a string if there exists a sequence of moves
  • starting in the start state,
  • ending in a final state,
  • that consumes the entire string.
• Think about it as pursuing all choices in parallel or having an oracle that says what to do.
• Example: the integer-literal NFA on input "+75": one accepting sequence of moves is S →+ A →7 A-to-B, i.e. S →+ A, A →7 B, B →5 B.
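The "all choices in parallel" view can be sketched by tracking the set of states the NFA could be in. The edge set below is an assumption reconstructed from the diagram labels (S goes to A on +, -, or ε; A and B go to B on a digit); the function names are invented for this example.

```python
EPSILON = ""  # label used in this sketch for ε-edges

# Assumed edges of the integer-literal NFA: (state, label) -> set of next states.
NFA = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", EPSILON): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
START, ACCEPTING = "S", {"B"}


def epsilon_closure(states):
    """All states reachable from `states` using only ε-edges."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in NFA.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return closure


def symbol_class(ch):
    """Map an input character to the edge label it matches."""
    return "digit" if ch in "0123456789" else ch


def trace(text):
    """Return the set of possible states before and after each character."""
    current = epsilon_closure({START})
    history = [current]
    for ch in text:
        moved = set()
        for s in current:
            moved |= NFA.get((s, symbol_class(ch)), set())
        current = epsilon_closure(moved)
        history.append(current)
    return history


def accepts(text):
    """Accept if some possible final state is an accepting state."""
    return bool(trace(text)[-1] & ACCEPTING)


print(trace("+75"), accepts("+75"))
```

On "+75" the possible states are {S, A} before any input, {A} after +, and {B} after each digit, so the string is accepted.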
Equivalence of DFA and NFA
• Theorem: For every non-deterministic finite-state machine M, there exists a deterministic machine M' such that M and M' accept the same language.
• Why is the theorem important for scanner generation?
• Theorem is not enough: what do we need for automatic scanner generation?
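One standard way to make the theorem constructive is the subset construction, in which each DFA state is a set of NFA states. The sketch below applies it to an assumed edge set for the integer-literal NFA; it is an illustration of the general idea, not necessarily the algorithm presented in this course.

```python
from collections import deque

EPSILON = ""  # label used in this sketch for ε-edges
ALPHABET = ["+", "-", "digit"]

# Assumed edges of the integer-literal NFA: (state, label) -> set of next states.
NFA = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", EPSILON): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}


def epsilon_closure(states):
    """All NFA states reachable from `states` using only ε-edges."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in NFA.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)


def move(states, label):
    """NFA states reachable from `states` on one edge with this label."""
    result = set()
    for s in states:
        result |= NFA.get((s, label), set())
    return result


def subset_construction(start, accepting):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = epsilon_closure({start})
    delta, seen, work = {}, {start_set}, deque([start_set])
    while work:
        current = work.popleft()
        for label in ALPHABET:
            target = epsilon_closure(move(current, label))
            if not target:
                continue                  # empty entry: the DFA gets stuck
            delta[(current, label)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return start_set, seen, delta, dfa_accepting


start, states, delta, final = subset_construction("S", {"B"})
for (src, label), dst in delta.items():
    print(sorted(src), label, sorted(dst))
```

Under these assumptions the result has three states, {S, A}, {A}, and {B}, and five transitions, which matches the shape of the deterministic integer-literal machine.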
How to Implement a FSM
A table-driven approach:
• table:
  • one row for each state in the machine, and
  • one column for each possible character.
• Table[j][k]
  • which state to go to from state j on character k,
  • an empty entry corresponds to the machine getting stuck.
The table-driven program for a DFA

    state = S    // S is the start state
    repeat {
        k = next character from the input
        if k == EOF then    // end of input
            if state is a final state then accept
            else reject
        state = T[state, k]
        if state == empty then reject    // got stuck
    }
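A runnable version of this loop, using the integer-literal DFA as the table, might look like the sketch below. Grouping all digits into one column and using `None` for an empty entry are choices made for this example.

```python
S, A, B = 0, 1, 2          # row indices for the three states
EMPTY = None               # an empty table entry: the machine gets stuck
FINAL = {B}

# One row per state, one column per character class.
TABLE = [
    # digit  '+'    '-'
    [B,      A,     A],      # state S
    [B,      EMPTY, EMPTY],  # state A
    [B,      EMPTY, EMPTY],  # state B
]


def column(ch):
    """Map a character to its table column, or None if it has no column."""
    if ch in "0123456789":
        return 0
    if ch == "+":
        return 1
    if ch == "-":
        return 2
    return None


def accepts(text):
    state = S                              # S is the start state
    for ch in text:
        k = column(ch)
        if k is None:
            return False                   # character outside the alphabet
        state = TABLE[state][k]
        if state is EMPTY:
            return False                   # got stuck
    return state in FINAL                  # end of input: accept only in a final state


for s in ["+75", "42", "-", "7+5"]:
    print(s, accepts(s))
```

Changing the recognized language only requires changing `TABLE`, `FINAL`, and `column`; the driver loop stays the same, which is what makes the approach suitable for generated scanners.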
Regular Expressions
• FAs are not a good way to specify tokens: diagrams are hard to write down
• Regular expressions are another specification technique
  • a compact way to define a language that can be accepted by an automaton
• Used as the input to a scanner generator
  • define each token, and
  • define white-space, comments, etc.
    • these do not correspond to tokens, but must be recognized and ignored.
Example: Simple identifier
• English: A letter, followed by zero or more letters or digits.
• RE: letter . (letter | digit)*
Operators:
  |   means "or"
  .   means "followed by" (usually just use position)
  *   means zero or more instances
  ()  are used for grouping
Operands of a regular expression
• Operands are the same as labels on the edges of an FSM
  • single characters, or
  • the special character ε (the empty string)
• "letter" is a shorthand for a | b | c | ... | z | A | ... | Z
• "digit" is a shorthand for 0 | 1 | … | 9
• Sometimes we put the characters in quotes
  • necessary when denoting the characters | . *
Precedence of | . * operators

Consider regular expressions:
• letter.letter | digit*
• letter.(letter | digit)*

Regular Expression Operator   Analogous Arithmetic Operator   Precedence
|                             plus                            lowest
.                             times                           middle
*                             exponentiation                  highest
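The effect of precedence can be checked with Python's `re` module, whose operators follow the same ordering (star binds tightest, then concatenation, then alternation). In this sketch `letter` and `digit` are written as character classes, and the names `UNGROUPED`, `GROUPED`, and `matches` are invented for illustration.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# letter.letter | digit*   parses as   (letter.letter) | (digit*)
UNGROUPED = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")
# letter.(letter | digit)*   is the identifier pattern
GROUPED = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")


def matches(pattern, text):
    """True if the whole string is in the language of the pattern."""
    return pattern.fullmatch(text) is not None


for s in ["ab", "123", "", "a1", "x"]:
    print(repr(s), matches(UNGROUPED, s), matches(GROUPED, s))
```

Without the parentheses the expression accepts exactly two letters or any run of digits (including the empty string); with them it accepts identifiers.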
Examples
Describe (in English) the language defined by each of the following regular expressions:
• letter (letter | digit*)
• digit digit* "." digit digit*
Example: Integer Literals
• An integer literal with an optional sign can be defined in English as: "(nothing or + or -) followed by one or more digits"
• The corresponding regular expression is: (+|-|epsilon).(digit.digit*)
• A new convenient operator '+': digit.digit* is the same as digit+, which means "one or more digits"
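As a quick check that digit.digit* and digit+ describe the same strings, the two forms can be written with Python's `re` module and compared on a few inputs. The empty alternative in `(\+|-|)` plays the role of epsilon; the names `LONG_FORM` and `SHORT_FORM` are assumptions of this sketch.

```python
import re

# (+|-|epsilon).(digit.digit*)
LONG_FORM = re.compile(r"(\+|-|)[0-9][0-9]*")
# (+|-|epsilon).(digit+)
SHORT_FORM = re.compile(r"(\+|-|)[0-9]+")

SAMPLES = ["75", "+75", "-0", "", "+", "7-5", "12a"]

for s in SAMPLES:
    long_ok = LONG_FORM.fullmatch(s) is not None
    short_ok = SHORT_FORM.fullmatch(s) is not None
    print(repr(s), long_ok, short_ok)
```

Agreement on a handful of samples is only a sanity check, not a proof; the equivalence itself follows from the definition of the + operator.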
Language Defined by a Regular Expression
• Recall: language = set of strings
• Language defined by an automaton / RE:
  • the set of strings accepted by the automaton
  • the set of strings that match the expression.
Regular Exp. Corresponding Set of Strings
epsilon {""}
a {"a"}
a.b.c {"abc"}
a | b | c {"a", "b", "c"}
(a | b | c)* {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}
REs describe regular languages
Patterns form a regular language
*** any finite language is regular ***
Regular Expression (RE) (over alphabet Σ)
ε is a RE denoting the set {ε}
If a is in Σ, then a is a RE denoting {a}
If x and y are REs denoting L(x) and L(y) then
x is an RE denoting L(x); y is a RE denoting L(y);
x | y is an RE denoting L(x) ∪ L(y)
xy is an RE denoting L(x)L(y)
x* is an RE denoting L(x)*
Can combine RE to form other REs
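The recursive definition translates almost line for line into code. The sketch below computes L(x) restricted to strings of at most `max_len` characters, since L(x)* is infinite in general; the class names and the length bound are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:                 # ε
    pass


@dataclass(frozen=True)
class Sym:                 # a single character a in Σ
    char: str


@dataclass(frozen=True)
class Alt:                 # x | y
    left: object
    right: object


@dataclass(frozen=True)
class Cat:                 # xy
    left: object
    right: object


@dataclass(frozen=True)
class Star:                # x*
    inner: object


def language(expr, max_len):
    """Strings of L(expr) whose length is at most max_len."""
    if isinstance(expr, Eps):
        return {""}
    if isinstance(expr, Sym):
        return {expr.char} if max_len >= 1 else set()
    if isinstance(expr, Alt):      # L(x) ∪ L(y)
        return language(expr.left, max_len) | language(expr.right, max_len)
    if isinstance(expr, Cat):      # L(x)L(y)
        return {u + v
                for u in language(expr.left, max_len)
                for v in language(expr.right, max_len)
                if len(u + v) <= max_len}
    if isinstance(expr, Star):     # L(x)*: repeat concatenation until nothing new appears
        base = language(expr.inner, max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {expr!r}")


abc = Alt(Alt(Sym("a"), Sym("b")), Sym("c"))
print(sorted(language(Star(abc), 2)))
print(language(Cat(Cat(Sym("a"), Sym("b")), Sym("c")), 3))
```

With `max_len = 2`, (a | b | c)* yields the empty string, the three one-character strings, and the nine two-character strings.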