34
Chapter 2: Finite-State Machines Heshaam Faili [email protected] University of Tehran

Chapter 2: Finite-State Machines Heshaam Faili [email protected] University of Tehran

Embed Size (px)

Citation preview

Page 1: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

Chapter 2: Finite-State Machines

Heshaam [email protected]

University of Tehran

Page 2: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

2

Overview Regular

Expressions FSAs Properties of

Regular Languages

Page 3: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

3

Regular Expressions A regular expression (RE) is a formula in a

specialized language, used to characterize strings. A string is a sequence of characters REs allow us to search for patterns

A finite-state machine is a device for recognizing/generating regular expressions

We’ll use a “Perlish” notation for writing regular expressions, based on regular expressions in the Perl programming language.

The concepts are the important thing NB: Perlish isn’t exactly the same as Perl

We will write REs between slashes: /…/

Page 4: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

Regular expression inventory (1) Character Literals and Classes

Characters: /abcd/ Set: /p[aeiou]p/ Range: /ab[a-z]d/

Operators (disjunction, negation) Disjunction:

Set elements: /[Aa]ardvark/ Sequences of characters: /ant(eater|farm)/

Negation: Single item: /[^a]/ (any character but a) Range: [^a-z] (not a lowercase letter)

Page 5: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

Regular expression inventory (2) Counters

?: Optionality (0 or 1 occurrence): /colou?r/ * (Kleene star): Any number of

occurrences: /[0-9]*/ +: At least one occurrence: /[0-9]+/ {n}: n number of occurrences: /[0-9]{4}/

Wildcard: matches any single character (.) /beg.n/

Page 6: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

6

Regular expression inventory (3)

Parentheses: used to group items together /ant(farm)?/: all of farm is optional

Escaped characters: needed to specify characters that have a special meaning: *, +, ?, (, ), |, [, ]: Use a backslash: /why\?/ Period expressed as: _

Page 7: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

7

Regular expression inventory (4)

Anchors: anchor expressions to various parts of the string ^ start of line

do not confuse with [^..] used to express negation; anywhere else it’s a start of line

$ end of line \b non-word character

word characters are digits, underscores, or letters, i.e., [0-9A-Za-z\_]

Page 8: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

8

Examples of Regular Expressions

/fire/ a sequence of f followed immediately by i, then immediately by r, then immediately by e

/fires?/ matches fire or fires /fires\?/ matches fires ? /[abcd]/ matches a, b, c, or d /[0-9]/ matches any character in the range 0 to 9

(inclusive) /[^0-9]/ matches any non-digit character, i.e., any

character except those in the set 0 thru 9 /[0-9]+/ matches 0, 1, 11, 12, 367, … /[0-9]*/ matches 0, 1, 11, 12, 367, … and matches no

string /fir./ matches fire, fir9, firm, firp, … /fir.*/ matches fir, fire, fir987, firppery, … /[fFHhs]ire/ matches fire, Fire, Hire, hire, sire /f|Fire/ matches f and Fire

Page 9: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

9

Precedence /fire|ings?/ the sequence fire or the

sequence ing (the latter optionally followed by s)

Why? Because sequences have precedence over

disjunction To override precedence, use parentheses /fir(e|ings)/ the sequence fire followed by

either the sequence e or the sequence ings

Page 10: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

10

Precedence Rules

1) Parentheses have the highest precedence.2) Then come counters, *, +, ?, {}3) Then come sequences and anchors

• so, /good.*/ matches goodies, etc., and not (just) goodgood

• /echo{3}/ the sequence ech followed by ooo

• /(echo){3}/ the sequence echoechoecho

4) Then comes disjunction

Page 11: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

11

Aliases Use aliases to designate particular recurrent

sets of characters \d [0-9]: digit \D [^\d]: non-digit \w [a-zA-Z0-9\_]: alphanumeric \W [^\w]: non-alphanumeric \s [~\r\t\n\f]: whitespace character

\r: space, \t: tab\n: newline, \f: formfeed

\S [^\s]: non-whitespace

Page 12: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

12

Example 1

/\$[0-9]+(\.[0-9][0-9])?/

Page 13: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

13

Example 2

Times on a digital watch (hours and minutes)

/[1-9]|(1[012]):[0-5][0-9]/

Page 14: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

14

Overgeneration

/\d\d:\d\d/

recognizes watch times, but also other sequences. In other words, the pattern over generates, covering expressions which aren’t in the target

Page 15: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

15

Undergeneration

/1[012]:[0-5][0-9]/

undergenerates, i.e., does not cover all watch times.

Page 16: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

16

Representing sentences

‘handling’ agreement:/the (student solves|students solve) the problem/

an optional adjective: /the clever?(student solves|students solve) the

problem/

generating an infinite number of sentences /the clever?(student solves|students solve) the

problem (and (the clever?(student solves|students solve) the problem)*/

NOTE: here the symbols are words, not characters! Be sure to define the symbol type

Page 17: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

17

Overview Regular

Expressions FSAs Properties of

Regular Languages

Page 18: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

18

A Simple Finite State Analyzer (or FSA) Example: FSA to recognize strings of the

form: /[ab]+/ i.e., L ={a, b, ab, ba, aab, bab, aba, bba, …}

Transition Tableinitial =0; final = {1}0–>a-> 1 0->b->1 1->a->11->b->1

Page 19: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

19

How an FSA accepts or rejects a string The behavior of an FSA is completely determined by its

transition table. The assumption is that there is a tape, with the input symbols are read off consecutive cells of the tape.

The machine starts in the start (initial) state, about to read the contents of the first cell on the input ‘tape’.

The FSA uses the transition table to decide where to go at each step

A string is rejected in exactly two cases: 1. a transition on an input symbol takes you nowhere 2. the state you’re in after processing the entire input is not

an accept (final) state Otherwise. the string is accepted.

Page 20: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

20

FSA formally

Finite state automaton defined by the following parameters: Q: finite set of (N) states: q0, q1, …,

qN : finite input alphabet q0: designated start state F: set of final states (subset of Q) (q, i): transition function

Page 21: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

21

More Examples of FSA’s

Let’s design FSA’s to recognize the set of zero or more a’s the set of all lowercase alphabetic

strings ending in a b. the set of all strings in [ab]* with

exactly two a’s. simple NPs, PPs, Ss etc.

Page 22: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

22

The set of zero or more a’s

L ={, a, aa, aaa, aaaa, …}

Transition Tableinitial =0; final = {0}0–>a-> 0

Page 23: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

23

FSA for set of all lowercase alphabetic strings ending in b

/[a-z]*b/ initial =0; final ={1} 0->[a, c-z]->0 0->b->1 1->b->1 1->[a, c-z]->0

Page 24: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

24

The set of all strings in [ab]* with exactly 2 a’s

Do this yourself It might help to first rewrite a more

precise regular expression for this

Page 25: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

25

FSA for simple NPs, PPs, S, …

initial=0; final ={2}0->D->10->->11->N->2

Another FSA for NPs:initial=0; final ={2}0->N->20->D->11->N->22->N->2

• D is an alias for [the, a, an, all,…], N for [dog, cat, robin,…]• What if we wanted to add adjectives? Or recognize PPs?• What about one for simple sentences?

• /(Prep D? A* N+)* (D? N) (Prep D? A* N+)* (V_tns|Aux V_ing) (Prep D? A* N+)*/• Note: FSA1 concat FSA2 recognizes L(FSA1) concat L(FSA2)

Page 26: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

26

Deterministic and Non-Deterministic FSA’s An FSA is non-deterministic (NFSA) when, for some state

and input, there is more than one state it can go to Occurs when transition table allows for a transition to

two or more states from one state on a given input symbol.

e.g., 1->a->2, 1->a->4 Whenever epsilon-transitions occur, these can be taken

without consuming input. So, whenever epsilon-transitions occur, the machine could

either take the epsilon-transition, or consume an input symbol, introducing non-determinism.

Any NFSA can be reduced to a DFSA (deterministic) (at the expense of possibly more states).

Page 27: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

27

FAQ: Why Are These Machines Finite-State? Finite number of states Number of states bounded in advance -- determined by its

transition table Therefore, the machine has a limit to the amount of memory

it uses. Its behavior at each stage is based on the transition table,

and depends just on the state it’s in, and the input. So, the current state reflects the history of the processing so far.

Certain classes of formal languages (and linguistic phenomena) which are not regular require additional memory to keep track of previous information (beyond current state and input)

e.g., center-embedding constructions (discussed later)

Page 28: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

28

Overview Regular

Expressions FSAs Properties of

Regular Languages

Page 29: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

29

Formal Languages Revisited We will view any formal language as a set of

expressions The language will use a finite vocabulary (called an

alphabet), and a set of expression-combining operations Regular languages are the simplest class of formal

languages

Note: Kleene closure of a setLet L = {a, b}. Then L* = the set of a’s and b’s concatenated zero or more

times = {, a, b, ab, aab, aaab, aaaab, ba, baa, ….}.

Page 30: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

30

Properties of Regular Languages The class of regular languages over is defined as

follows:1. (the empty set) is a regular language.2. a U , {a} is a regular language.

( = alphabet of symbols)3. If L1 and L2 are regular languages, so are:

a. L1 U L2, the union (or disjunction) of L1 and L2b. L1.L2 = {xy | x L1, yL2}, concatenation of L1 and L2c. L1*, the Kleene closure of L1 (set formed by concatenating members

of L1 zero or more times)

So, if the language L is a regular language, any expression in L must be expressible by the three operations of concatenation, disjunction, and Kleene closure.

Page 31: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

31

General Closure Properties of Regular Languages

Concatenation, Union, Kleene Closure Intersection: If L1 and L2 are regular

languages, so are L1 L2. Set Difference: If L1 and L2 are

regular languages, so are L1- L2. Reversal: If L1 is a regular language,

so is L1R, the language formed by reversing all the strings in L1

Page 32: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

32

What sorts of expressions aren’t regular In natural language, examples include

center-embedding constructions. The cat loves Mozart.The cat the dog chased loves Mozart.The cat the dog the rat bit chased loves Mozart.The cat the dog the rat the elephant admired bit

chased loves Mozart.(the noun)n (transitive-verb)n-1 loves Mozart

These aren’t regular though /A*B*loves Mozart/ is regular

Page 33: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

33

Regular Expressions and FSAs Regular expressions are equivalent to

FSA’s So, any FSA can be constructed by just

concatenation, union, and Kleene * Question: how would you (graphically)

combine FSA’s using: Concatenation Union Kleene *

Page 34: Chapter 2: Finite-State Machines Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

34

Exercises

2.1,2.4,2.8,2.10,2.11