4/14/20151 REGULAR EXPRESSIONS AND AUTOMATA Lecture 3: REGULAR EXPRESSIONS AND AUTOMATA Husni...

Preview:

Citation preview

04/18/23 1

REGULAR EXPRESSIONS AND AUTOMATA

Lecture 3: REGULAR EXPRESSIONS AND AUTOMATA

Husni Al-Muhtaseb

04/18/23 2

الرحيم الرحمن الله بسمICS 482: Natural Language

ProcessingLecture 3: REGULAR EXPRESSIONS

AND AUTOMATA

Husni Al-Muhtaseb

NLP Credits and Acknowledgment

These slides were adapted from presentations of the Authors of the

bookSPEECH and LANGUAGE PROCESSING:

An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition

and some modifications from presentations found in the WEB by

several scholars including the following

NLP Credits and AcknowledgmentIf your name is missing please contact me

muhtasebAt

Kfupm.Edu.sa

NLP Credits and AcknowledgmentHusni Al-Muhtaseb

James MartinJim MartinDan JurafskySandiway FongSong young inPaula MatuszekMary-Angela PapalaskariDick Crouch Tracy KinL. Venkata

SubramaniamMartin Volk Bruce R. MaximJan HajičSrinath SrinivasaSimeon NtafosPaolo PirjanianRicardo VilaltaTom Lenaerts

Heshaam Feili Björn GambäckChristian Korthals Thomas G.

DietterichDevika

SubramanianDuminda

Wijesekera Lee McCluskey David J. KriegmanKathleen McKeownMichael J. CiaraldiDavid FinkelMin-Yen KanAndreas Geyer-

Schulz Franz J. KurfessTim FininNadjet BouayadKathy McCoyHans Uszkoreit Azadeh Maghsoodi

Khurshid Ahmad

Staffan Larsson

Robert Wilensky

Feiyu Xu

Jakub Piskorski

Rohini Srihari

Mark Sanderson

Andrew Elks

Marc Davis

Ray Larson

Jimmy Lin

Marti Hearst

Andrew McCallum

Nick KushmerickMark CravenChia-Hui ChangDiana MaynardJames Allan

Martha Palmerjulia hirschbergElaine RichChristof Monz Bonnie J. DorrNizar HabashMassimo PoesioDavid Goss-GrubbsThomas K HarrisJohn HutchinsAlexandros PotamianosMike RosnerLatifa Al-Sulaiti Giorgio Satta Jerry R. HobbsChristopher ManningHinrich SchützeAlexander GelbukhGina-Anne Levow Guitao GaoQing MaZeynep Altan

04/18/23 6

Agenda: REGULAR EXPRESSIONS AND AUTOMATA

• Why to study it?– Talk to ALICE

• Regular expressions

• Finite State Automata

• Assignments

04/18/23 7

NLP Example: Chat with Alice• http://www.pandorabots.com/pandora/talk

?botid=f5d922d97e345aa1&skin=custom_input

• A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) is an award-winning free natural language artificial intelligence chat robot. The software used to create A.L.I.C.E. is available as free ("open source") Alicebot and AIML software.

• http://www.alicebot.org/about.html

04/18/23 8

NLP Representations

• State Machines– FSAs: Finite State Automata– FSTs: Finite State Transducers– HMMs: Hidden Markov Model– ATNs: Augmented Transition Network– RTNs: Recursive Transition Network

04/18/23 9

NLP Representations

• Rule Systems: – CFGs: Context Free Grammar– Unification Grammars– Probabilistic CFGs

• Logic-based Formalisms– 1st Order Predicate Calculus– Temporal and other Higher Order Logics

• Models of Uncertainty– Bayesian Probability Theory

04/18/23 10

NLP Algorithms

• Most are transducers: accept or reject input, and construct new structure from input– State space search

• To manage the problem of making choices during processing when we lack the information needed to make the right choice

– Dynamic programming• To avoid having to redo work during the course

of a state-space search

04/18/23 11

State Space Search

• States represent pairings of partially processed inputs with partially constructed answers

• Goals are exhausted inputs paired with complete answers that satisfy some criteria

• The spaces are normally too large to exhaustively explore

04/18/23 12

Dynamic Programming

• Don’t do the same work over and over

• Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space

04/18/23 13

Regular Expressions and Text Searching

• Regular expression (RE): A formula (in a special language) for specifying a set of strings

• String: A sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation)

04/18/23 14

Regular Expression Patterns

• Regular Expression can be considered as a pattern to specify text search strings to search a corpus of texts

• What is Corpus?

• For text search purpose: use Perl syntax

• Show the exact part of the string in a line that first matches a Regular Expression pattern

04/18/23 15

Regular Expression Patterns

RE String matched

/woodchucks/ “interesting links to woodchucks and lemurs”

/a/ “Sarah Ali stopped by Mona’s”

/Ali says,/ “My gift please,” Ali says,”

/book/ “all our pretty books”

/!/ “Leave him behind!” said Sami

04/18/23 16

RE Match

/[wW]oodchuck/ Woodchuck or woodchuck

/[abc]/ “a”, “b”, or “c”

/[0123456789]/ Any digit

04/18/23 17

RE Description

/a*/ Zero or more a’s

/a+/ One or more a’s

/a?/ Zero or one a’s

/cat|dog/ ‘cat’ or ‘dog’

/^cat$/ A line containing only ‘cat’

/\bun\B/ Beginnings of longer strings starts by ‘un’

04/18/23 18

Example

• Find all instances of the word “the” in a text.– /the/

•What About ‘The’

– /[tT]he/•What about ‘Theater”, ‘Another’

– /\b[tT]he\b/

04/18/23 19

Sidebar: Errors

• The process we just went through was based on two fixing kinds of errors– Matching strings that we should not have

matched (there, then, other)• False positives

– Not matching things that we should have matched (The)

• False negatives

04/18/23 20

Sidebar: Errors

• Reducing the error rate for an application often involves two efforts – Increasing accuracy (minimizing false

positives)– Increasing coverage (minimizing false

negatives)

04/18/23 21

Regular expressions• Basic regular expression patterns• Perl-based syntax (slightly different from

other notations for regular expressions)• Disjunctions [abc]• Ranges [A-Z]• Negations [^Ss]• Optional characters ? and *• Wild cards .• Anchors ^ and $, also \b and \B• Disjunction, grouping, and precedence |

04/18/23 22

04/18/23 23

04/18/23 24

04/18/23 25

Writing correct expressions

• Exercise: write a Perl regular expression to match the English article “the”:

/the/ missed ‘The’

included ‘the’ in ‘others’/[tT]he/

/\b[tT]he\b/ Missed ‘the25’ ‘the_’

/[^a-zA-Z][tT]he[^a-zA-Z]/Missed ‘The’ at the beginning of a line

/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/

04/18/23 26

A more complex example

• Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

04/18/23 27

Example• Price

– /$[0-9]+/ # whole dollars – /$[0-9]+\.[0-9][0-9]/ # dollars and cents – /$[0-9]+(\.[0-9][0-9])?/ #cents optional – /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries

• Specifications for processor speed – /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/

• Memory size – /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ – /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/

• Vendors – /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/

– /\b(Mac|Macintosh|Apple)\b/

04/18/23 28

Advanced Operators

Underscore: Correct figure 2.6

04/18/23 29

04/18/23 30

04/18/23 31

Assignment: Try regular expressions in MS WORD in both

Arabic & English

04/18/23 32

baa! baaa! baaaa! baaaaa! ...

Finite State Automata

• FSAs recognize the regular languages represented by regular expressions– SheepTalk: /baa+!/

• Directed graph with labeled nodes and arc transitions

•Five states: q0 the start state, q4 the final state, 5 transitions

q0 q4q1 q2 q3

b a aa !

04/18/23 33

Formally

• FSA is a 5-tuple consisting of– Q: set of states {q0,q1,q2,q3,q4} : an alphabet of symbols {a,b,!}

– q0: A start state

– F: a set of final states in Q {q4} (q,i): a transition function mapping Q x

to Q

q0 q4q1 q2 q3

b a

a

a !

04/18/23 34

• FSA recognizes (accepts) strings of a regular language– baa!– baaa!– baaaa!– …

• Tape Input: a rejected input

a b a ! b

q0 q4q1 q2 q3b a

aa !

04/18/23 35

State Transition Table for SheepTalk

StateInput

b a !

0 1 Ø Ø

1 Ø 2 Ø

2 Ø 3 Ø

3 Ø 3 4

4 Ø Ø Ø

baa! baaa! baaaa! baaaaa! ...

q0 q4q1 q2 q3

b aa

a !

04/18/23 36

Non-Deterministic FSAs for SheepTalk

q0 q4q1 q2 q3

b a a a !

q0 q4q1 q2 q3

b a a !

04/18/23 37

• A language is a set of strings

• String: A sequence of letters

– Examples: “cat”, “dog”, “house”, …

– Defined over an alphabet:

Languages

zcba ,,,,

04/18/23 38

Alphabets and Strings

• We will use small alphabets:

• Strings

abbaw

bbbaaav

abu

ba,

baaabbbaaba

baba

abba

ab

a

04/18/23 39

Finite AutomatonInput

String

Output

String

FiniteAutomaton

04/18/23 40

Finite Accepter

• Input

“Accept” or“Reject”

String

FiniteAutomaton

Output

04/18/23 41

Transition Graph

initialstate

final state“accept”state

transition

abba -Finite Accepter

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

04/18/23 42

Initial Configuration

1q 2q 3q 4qa b b a

5q

a a bb

ba,

Input Stringa b b a

ba,

0q

04/18/23 43

Reading the Input

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

04/18/23 44

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

04/18/23 45

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

04/18/23 46

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

04/18/23 47

0q 1q 2q 3q 4qa b b a

Output: “accept”

5q

a a bb

ba,

a b b a

ba,

04/18/23 48

Rejection

1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

0q

04/18/23 49

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

04/18/23 50

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

04/18/23 51

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

04/18/23 52

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

Output:“reject”

a b a

ba,

04/18/23 53

Another Example

a

b ba,

ba,

0q 1q 2q

a ba

04/18/23 54

a

b ba,

ba,

0q 1q 2q

a ba

04/18/23 55

a

b ba,

ba,

0q 1q 2q

a ba

04/18/23 56

a

b ba,

ba,

0q 1q 2q

a ba

04/18/23 57

a

b ba,

ba,

0q 1q 2q

a ba

Output: “accept”

04/18/23 58

Rejection

a

b ba,

ba,

0q 1q 2q

ab b

04/18/23 59

a

b ba,

ba,

0q 1q 2q

ab b

04/18/23 60

a

b ba,

ba,

0q 1q 2q

ab b

04/18/23 61

a

b ba,

ba,

0q 1q 2q

ab b

04/18/23 62

a

b ba,

ba,

0q 1q 2q

ab b

Output: “reject”

04/18/23 63

Formalities

• Deterministic Finite Accepter (DFA) FqQM ,,,, 0

Q

0q

F

: set of states

: input alphabet

: transition function

: initial state

: set of final states

04/18/23 64

About Alphabets

• Alphabets means we need a finite set of symbols in the input.

• These symbols can and will stand for bigger objects that can have internal structure.

04/18/23 65

Input Aplhabet

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

ba,

04/18/23 66

Set of States

Q

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

543210 ,,,,, qqqqqqQ

ba,

04/18/23 67

Initial State

0q

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

04/18/23 68

Set of Final States

F

0q 1q 2q 3qa b b a

5q

a a bb

ba,

4qF

ba,

4q

04/18/23 69

Transition Function

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

QQ :

ba,

04/18/23 70

10 , qaq

2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q 1q

04/18/23 71

50 , qbq

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

04/18/23 72

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

32 , qbq

04/18/23 73

Transition Function

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b

0q

1q

2q

3q

4q

5q

1q 5q

5q 2q

2q 3q

4q 5q

ba,5q5q5q5q

04/18/23 74

Extended Transition Function(Reads the entire string)

*

QQ *:*

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

04/18/23 75

20 ,* qabq

3q 4qa b b a

5q

a a bb

ba,

ba,

0q 1q 2q

04/18/23 76

40 ,* qabbaq

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

04/18/23 77

50 ,* qabbbaaq

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

04/18/23 78

50 ,* qabbbaaq

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

Observation: There is a walk from to with label

0q 5qabbbaa

04/18/23 79

Example

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

abbaML M

accept

04/18/23 80

Another Example

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

abbaabML ,, M

acceptacceptaccept

04/18/23 81

More Examples

a

b ba,

ba,

0q 1q 2q

}0:{ nbaML n

accept trap state

04/18/23 82

ML = { all substrings with prefix }ab

a b

ba,

0q 1q 2q

accept

ba,3q

ab

04/18/23 83

ML = { all strings without substring }001

0 00 001

1

0

1

10

0 1,0

04/18/23 84

Regular Languages

• A language is regular if there is

• a DFA such that

• All regular languages form a language family

LM MLL

04/18/23 85

Example• The language

• is regular:

*,: bawawaL

a

b

ba,

a

b

ba

0q 2q 3q

4q

04/18/23 86

Finite State Automata

• Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.

04/18/23 87

More Formally

• You can specify an FSA by enumerating the following things.– The set of states: Q– A finite alphabet: Σ– A start state– A set of accept/final states– A transition function that maps QxΣ to Q

04/18/23 88

Dollars and Cents

04/18/23 89

Assignment 2 - Part 1

• A windows-based version of Python interpreter is available at the supplementary material section of the course website. Please download the interpreter and practice it. Use the help, tutorials and available documentation to investigate the possibility of using Arabic text. summarize your findings.

04/18/23 90

Assignment 2 - Part 2

• Practice search in Ms Word using regular expressions (Wildcards) for both Arabic and English. Submit at least 5 nontrivial examples.

04/18/23 91

Assignment 2 - Part 3

• You have been asked to participate in writing an exam about chapter 2 of the textbook. Write one question to check student understanding of chapter two material. Include the answer in your submission.

04/18/23 92

Thank you

الله ورحمة عليكم السالم

Recommended