Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © 2008-2014 – Curt Hill


Introduction

• Kleene showed that the languages recognized by Finite State Automata form one precisely defined class
– This is Kleene’s Theorem
• Every set in this class may be built up using only the following:
– The empty set
– The set containing only the empty string
– The sets containing a single character from the alphabet
– Union
– Concatenation
– Kleene closure
• Three starting points, three operations
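The three operations can be tried out directly on small sets of strings. The following is a minimal Python sketch, not part of the original slides; the function names and the `max_copies` cutoff are assumptions made for illustration, since the real Kleene closure is usually an infinite set.

```python
from itertools import product


def union(a, b):
    """Union: every string that is in a or in b."""
    return a | b


def concat(a, b):
    """Concatenation: each string of a followed by each string of b."""
    return {x + y for x, y in product(a, b)}


def kleene_closure(a, max_copies):
    """Strings formed by concatenating 0..max_copies members of a.

    The true closure has no upper limit; max_copies is a cutoff
    added only so that this sketch returns a finite set.
    """
    result = {""}  # zero copies always yields the empty string
    layer = {""}
    for _ in range(max_copies):
        layer = concat(layer, a)
        result |= layer
    return result


# The three starting points
EMPTY_SET = set()
EMPTY_STRING = {""}
a, b = {"a"}, {"b"}

print(sorted(union(concat(a, b), b)))   # ['ab', 'b']
print(sorted(kleene_closure(a, 3)))     # ['', 'a', 'aa', 'aaa']
print(concat(EMPTY_SET, a))             # set()
```

Concatenating anything with the empty set gives the empty set, while the closure of the empty set still contains the empty string.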


Regular Sets

• A regular set is any set that can be constructed using the three starting points and three operations just given
• Thus every regular set is the language generated by a regular grammar (type 3) and recognized by an FSA
• Another way to specify these regular sets is by using regular expressions


Regular Expressions

• There are two common understandings of regular expressions
– These two are fundamentally related but have different purposes
• A means of specifying a set of strings
– This will be the principal meaning for this class
• A means of specifying a string to be searched for within a document
– Much more common


Set of Strings

• In the text the operations are:
• Concatenation
– Merely the writing of two items next to each other
• Union
– Symbol: ∪, signifying that either of two sets may be used
• Kleene Closure
– Symbol: *, signifying that zero or more copies may be concatenated together
• Parentheses for grouping


Examples

• An alphabet contains a, b, c
• The string aac is the concatenation of three letters
• The expression a(b ∪ c) represents two strings, ab and ac
• The expression a(b)* represents every string starting with an a and followed by zero or more bs
• a(a ∪ b ∪ c)*c represents all the strings that start with a and end with c
• (a ∪ b ∪ c)* is the set of all strings over the alphabet
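One way to check examples like these is to list every short string over the alphabet and keep the ones an expression describes. The sketch below uses Python's `re` module, which is an assumption (the slides name no tool), and writes union as `|`, the search-tool spelling. The helper `language` and its `max_len` cutoff are hypothetical names introduced here.

```python
import re
from itertools import product

ALPHABET = "abc"


def language(pattern, max_len):
    """All strings over ALPHABET of length <= max_len that the
    pattern describes in full (not merely contains)."""
    compiled = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            s = "".join(letters)
            if compiled.fullmatch(s):
                found.append(s)
    return found


print(language("a(b|c)", 3))          # ['ab', 'ac']
print(language("ab*", 3))             # ['a', 'ab', 'abb']
print(language("a(a|b|c)*c", 3))      # ['ac', 'aac', 'abc', 'acc']
print(len(language("(a|b|c)*", 3)))   # 40 = 1 + 3 + 9 + 27
```

The last line counts every string of length 0 through 3, which is what "the set of all strings" means once a length limit is imposed.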


Search Strings

• Fundamentally the same but modified to the task at hand
– Mathematics is not concerned with beginning and end of lines, special characters or characters not on a keyboard
• The ∪ is replaced by the |
• Concatenation and Kleene Closure are similar
• Many special characters


Specials

• The special characters include
– [ ]\^|*$.?+(){}
• Any other character just matches itself
• Since many of these characters are valuable in strings, the escape is used to match them
• Most of these are for the special requirements of finding an element of this set in a much larger piece of text or a document


Escape

• The backslash character is the escape
• Thus to look for an asterisk (a special) in a string it must be escaped: \*
– This allows a search to find the asterisk
• The C family uses some of the same escape sequences:
– \n newline or linefeed
– \t tab
– \r carriage return
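The effect of the escape can be seen in a short experiment. This sketch assumes Python's `re` module as the search tool, which the slides do not specify; other tools behave similarly but not identically.

```python
import re

text = "2*3 = 6"

# An escaped asterisk matches the literal character.
stars = re.findall(r"\*", text)
print(stars)                       # ['*']

# An unescaped asterisk on its own has nothing to repeat,
# so Python rejects the pattern.
try:
    re.compile("*")
except re.error as exc:
    print("invalid pattern:", exc)

# re.escape adds the backslashes for you.
escaped = re.escape("a*b")
print(escaped)                     # a\*b

# \t inside a pattern matches a tab character.
tab_found = re.search(r"\t", "name\tvalue") is not None
print(tab_found)                   # True
```

Raw strings (the `r"..."` prefix) keep Python itself from interpreting the backslash before the pattern reaches `re`.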


Positioning

• There are two specials that force a position
• ^ matches the beginning of the line
• $ matches the end of the line
• Both of these match a position rather than a character
• Without these a pattern could match anywhere within a string
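A small sketch of how the two position specials narrow a search, assuming Python's `re` module (not named in the slides). The helper `starts` is a hypothetical name. In Python, ^ and $ refer to the start and end of the whole string unless the `re.MULTILINE` flag is given.

```python
import re

line = "cat concatenate cat"


def starts(pattern, text):
    """Start offsets of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


print(starts(r"cat", line))     # [0, 7, 16]  anywhere in the line
print(starts(r"^cat", line))    # [0]         only at the beginning
print(starts(r"cat$", line))    # [16]        only at the end
print(starts(r"^cat$", line))   # []          the line is not exactly "cat"
print(starts(r"^cat$", "cat"))  # [0]

# The specials match a position, so the match itself is empty.
print(re.search(r"$", "abc").span())  # (3, 3)
```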


Repetition

• There are three repetition characters, which together are more general than the closure alone
• Closure is the *
– It represents zero or more repetitions of the previous item
– Kleene star
• The + represents one or more repetitions of the previous item
• The ? represents zero or one occurrence of the previous item


Examples

• ~* matches any number (including zero) of successive tildes
• \-* matches zero or more dashes
• .+ matches one or more of any character
• hats? matches either hat or hats
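These four patterns can be checked mechanically. The sketch assumes Python's `re` module; `whole` is a hypothetical helper that asks whether a pattern describes an entire string.

```python
import re


def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# * allows zero or more
print(whole(r"~*", ""))         # True
print(whole(r"~*", "~~~"))      # True
print(whole(r"\-*", "---"))     # True

# + requires at least one
print(whole(r".+", ""))         # False
print(whole(r".+", "x?"))       # True

# ? makes the previous item optional
print(whole(r"hats?", "hat"))   # True
print(whole(r"hats?", "hats"))  # True
print(whole(r"hats?", "hatss")) # False
```

In Python the dot does not match a newline unless the `re.DOTALL` flag is used, so "any character" has that one exception by default.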


Grouping

• So far the repetitions could only be applied to a single character
• What is needed next is some type of grouping
• This is provided by parentheses
• Enclosing a pattern in parentheses makes it a group
• This group can then be followed by a repetition character


Examples

• (\*\-)* will match
– *-
– *-*-
– *-*-*- etc.
– and also the empty string, since * allows zero repetitions
• The * is greedy: it will try to match as many repetitions as possible
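The difference the parentheses make, and the greedy behavior of the star, can both be observed with a few calls. This is a sketch assuming Python's `re` module; the sample strings are made up for illustration.

```python
import re

grouped = r"(\*\-)*"    # the star repeats the whole pair *-
ungrouped = r"\*\-*"    # the star repeats only the dash

# With the group, the pair may repeat.
print(re.fullmatch(grouped, "*-*-") is not None)     # True
# Without it, only one asterisk followed by dashes is described.
print(re.fullmatch(ungrouped, "*-*-") is not None)   # False
print(re.fullmatch(ungrouped, "*---") is not None)   # True

# Greedy: the match takes all three pairs, not just one.
greedy = re.match(grouped, "*-*-*-x").group(0)
print(greedy)                                        # *-*-*-

# Zero repetitions is also a match, so the result can be empty.
empty = re.match(grouped, "x").group(0)
print(repr(empty))                                   # ''
```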


More interesting patterns

• A number is pretty easy to understand from our perspective but not so easy to describe
– Except in regular expressions
• An integer is a string of digits
– Possibly preceded by a plus or minus
• So how is this done?
• With sets and repetition


A set

• A pair of brackets may be filled with characters
• This will match any one of them
• Thus the digits could be done with: [0123456789]
• An integer could then be: [-+]?[0123456789]+
• Any single vowel is: [aeiouAEIOU]
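The integer pattern can be tested against a few candidate strings. This sketch assumes Python's `re` module and writes the pattern with no space between the sign and the digits, since a space inside a pattern would have to match a literal space. The names `INTEGER` and `is_integer` are hypothetical.

```python
import re

INTEGER = r"[-+]?[0123456789]+"   # optional sign, then one or more digits


def is_integer(text):
    """True when the whole text is an optionally signed string of digits."""
    return re.fullmatch(INTEGER, text) is not None


print(is_integer("42"))    # True
print(is_integer("-7"))    # True
print(is_integer("+"))     # False: a sign alone has no digits
print(is_integer(""))      # False: + requires at least one digit
print(is_integer("4-2"))   # False: the sign may only come first

# A set matches exactly one of its characters at a time.
vowels = re.findall(r"[aeiouAEIOU]", "Regular Expression")
print(vowels)              # ['e', 'u', 'a', 'E', 'e', 'i', 'o']
```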


Alternation

• A set provides intuitive alternation
• The match process may choose any character within the set to use
• This alternation applies only to single characters
• There is also an alternation character
– The vertical bar |
• This allows either simple or complicated patterns to alternate


Alternation

• Thus: A|E|I|O|U is equivalent to [AEIOU]
• However, more interesting alternations are possible and useful
– (abc)|(123) will match either of the two strings
– ([-+]?\d+)|(\w+) will match any string of characters that looks like a number or word
– Here \d stands for any digit and \w for any word character
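The equivalence and the two alternations can be checked as follows. The sketch assumes Python's `re` module. The number branch is written here as `[-+]?\d+`, an assumption chosen so that a sign can appear only once, at the front; the variable names are hypothetical.

```python
import re
import string

# A|E|I|O|U and [AEIOU] accept exactly the same single letters.
same = all(
    (re.fullmatch(r"A|E|I|O|U", ch) is None)
    == (re.fullmatch(r"[AEIOU]", ch) is None)
    for ch in string.ascii_uppercase
)
print(same)                                     # True

# Alternation between whole patterns, not just characters.
either = r"(abc)|(123)"
print(re.fullmatch(either, "abc") is not None)  # True
print(re.fullmatch(either, "123") is not None)  # True
print(re.fullmatch(either, "ab3") is not None)  # False

# Number or word: the group that matched shows which branch was used.
token = r"([-+]?\d+)|(\w+)"
number = re.fullmatch(token, "-42")
word = re.fullmatch(token, "hello")
print(number.group(1), number.group(2))         # -42 None
print(word.group(1), word.group(2))             # None hello
```

Because `\w` also matches digits, an unsigned number fits both branches; the engine reports the first branch that succeeds.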


Audience Participation

• Suppose the following expression: ^ab(cde)*f$
• Which of the following lines match this?
– abf
– abcdecdef
– abcdeaf
– abcdecdecdecdef
– acdef
– abcdefa
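After working the question by hand, the answers can be confirmed by running the expression over each line. This is a sketch assuming Python's `re` module; `PATTERN`, `candidates` and `results` are names introduced here.

```python
import re

PATTERN = re.compile(r"^ab(cde)*f$")

candidates = [
    "abf",
    "abcdecdef",
    "abcdeaf",
    "abcdecdecdecdef",
    "acdef",
    "abcdefa",
]

# search() is enough because ^ and $ already pin both ends of the line.
results = {line: PATTERN.search(line) is not None for line in candidates}

for line, matched in results.items():
    print(f"{line:16} {'match' if matched else 'no match'}")
```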


Limitations

• What kind of sets are not regular?
• Consider the following language: 0ⁿ1ⁿ
– The number of zeros and the number of ones are the same
• We know that 0ᵐ1ⁿ is regular, so why is 0ⁿ1ⁿ not?


We Really Do Know


[Figure: state diagram of an FSA with two states, s0 and s1; the transitions are labeled 0, 1 and 1]

• This accepts 0ᵐ1ⁿ and is clearly an FSA
• Why is 0ⁿ1ⁿ harder?
• Counter-intuitive since 0ⁿ1ⁿ is a subset of 0ᵐ1ⁿ
• Shouldn’t it be harder to generate a full set than a subset?
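A two-state machine of this kind is short enough to simulate. The sketch below is a guess at the machine rather than a transcription of the diagram: it assumes s0 loops on 0, a 1 moves s0 to s1, s1 loops on 1, and both states accept (so m and n may be zero). The table and function names are hypothetical.

```python
# (state, symbol) -> next state; a missing entry means the input is rejected.
TRANSITIONS = {
    ("s0", "0"): "s0",   # still reading zeros
    ("s0", "1"): "s1",   # first one seen
    ("s1", "1"): "s1",   # still reading ones
}
ACCEPTING = {"s0", "s1"}  # assumption: m = 0 and n = 0 are both allowed


def accepts(text):
    """Run the FSA over text; the state is all it remembers."""
    state = "s0"
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:      # no transition, e.g. a 0 after a 1
            return False
    return state in ACCEPTING


print(accepts("0011"))    # True
print(accepts("00111"))   # True: the counts need not agree
print(accepts(""))        # True under the assumption above
print(accepts("0101"))    # False: a 0 may not follow a 1
```

The machine accepts 00111 as readily as 0011, which is exactly why it does not recognize the equal-count language.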


Memory

• An FSA determines its next state based only on the input and the current state
• Its only memory is the current state, and it has finitely many states, so it cannot remember how many zeros were processed in order to process that many ones
• Next we consider machines stronger than these
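To see what the missing memory buys, here is a sketch of a recognizer for the equal-count language that is allowed one extra piece of storage: an integer counter that can grow without limit. This is an illustration, not a proof and not a construction from the slides; the function name is hypothetical.

```python
def accepts_equal(text):
    """True when text is n zeros followed by exactly n ones (n >= 0)."""
    count = 0            # unbounded counter: the memory an FSA lacks
    seen_one = False
    for symbol in text:
        if symbol == "0":
            if seen_one:         # a zero after a one breaks the shape
                return False
            count += 1
        elif symbol == "1":
            seen_one = True
            count -= 1
            if count < 0:        # more ones than zeros so far
                return False
        else:
            return False         # symbol outside the alphabet
    return count == 0


print(accepts_equal("0011"))    # True
print(accepts_equal("00111"))   # False
print(accepts_equal("001"))     # False
print(accepts_equal(""))        # True (n = 0)
```

A fixed set of states could stand in for the counter only up to some maximum value, and a longer run of zeros would always exceed it.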


Exercises

• 13.4
– 3, 5, 15
