Computational Language Finite State Machines and Regular Expressions

Page 1: Computational Language Finite State Machines and Regular Expressions

Computational Language

Finite State Machines and Regular Expressions

Page 2: Computational Language Finite State Machines and Regular Expressions

Plan
  • Regular expressions
      Introduction
      Operators
      Disjunction, precedence, substitution
  • Finite State Machines
      Link with regular expressions
      Deterministic FSA
      Non-deterministic FSA
  • Lab session
      Regular expression implementation in UNIX (egrep)

Page 3: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Basis of all web-based and word-processor-based searches
  • Definition 1. An algebraic notation for describing a string
  • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Page 4: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Regular expression, text corpus
  • Regular expression algebra has variants: Perl, Unix tools
  • Unix tools: egrep, sed, awk

Page 5: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Find occurrences of /Nokia/ in the text

    egrep -n 'Nokia' nokia_corpus.txt

Page 6: Computational Language Finite State Machines and Regular Expressions

Regular Expressions

    egrep -n 'Nokia' nokia_corpus.txt

    1:.Nokia shares slide after warning
    4:HELSINKI (Reuters) - Nokia has cut its sales growth forecast for
    7:markets sharply down.Nokia warned group sales would grow only
    13:better than expected first-quarter profits from Nokia,
    15:Finland's Nokia and rivals have been hit by debt-laden telecoms
    19:Nokia said in a statement. "The speed of this transition has been
    20:slower than was anticipated earlier this year." Nokia saw its market
    26:"The problem with Nokia is that it looks like its going ex-growth,"
    29:with a raft of new functions, was hurting. "Nokia had been perceived
    36:Nokia cast another shadow over the sector by slashing its forecast for
    41:be sold this year. "Nokia now believes that general weakness in all key
    43:Nokia said. The market was caught by surprise, especially as Nokia had
    46:said Nokia had been "a bit optimistic overall" in its forecasts. "We
    49:adjust to weaker demand, Nokia followed the path of rivals in announcing
    51:thousands of jobs in the group last year. Despite the bleak outlook, Nokia
    57:Nokia also warned second quarter sales would grow only between two and
    61:operating efficiencies, strong brand and leading product portfolio," Nokia
    62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
    67:protecting the margins -- but Nokia has to be a top-line growth story as well,
    69:analyst Susan Anthony.But Nokia, known for its strength in forecasting the
    79:Nokia's own forecast. Nokia's January-March net sales came in worse than the

Page 7: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Suppress case distinctions: Nokia or nokia

Page 8: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Set operator

    egrep -n '[Nn]okia' nokia_corpus.txt
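The set operator can be tried without the full corpus. The following is a minimal sketch, assuming `grep -E` as the POSIX spelling of egrep. The `sample` function and its three lines are hypothetical stand-ins for nokia_corpus.txt (the lowercase "nokia" line is invented for illustration), so the line numbers do not correspond to the corpus.

```shell
# Hypothetical three-line stand-in for nokia_corpus.txt.
sample() {
  printf '%s\n' \
    'Nokia said in a statement.' \
    'shares in nokia fell sharply' \
    'European markets sharply down.'
}

# 'Nokia' is case-sensitive: only line 1 is printed.
sample | grep -En 'Nokia'

# '[Nn]okia' accepts either N or n as the first character: lines 1 and 2.
sample | grep -En '[Nn]okia'
```

A set matches exactly one character, so `[Nn]okia` stands for the two strings Nokia and nokia and nothing else.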

Page 9: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Suppress other features, for example singular share or plural shares

Page 10: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Optional operator

    egrep -n 'shares?' nokia_corpus.txt

Page 11: Computational Language Finite State Machines and Regular Expressions

Regular Expressions

    egrep -n 'shares?' nokia_corpus.txt

    1:.Nokia shares slide after warning
    6:weak demand, sending its shares 12 percent lower and European
    62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
    85:lion share of the company's sales and earnings, saw sales fall seven percent
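To see which text the optional operator actually matches, the sketch below prints only the matched part of each line. It assumes `grep -E` in place of egrep and the `-o` option, which GNU and BSD grep support but POSIX does not require. The `sample` function is a hypothetical stand-in built from fragments of the output above.

```shell
# Hypothetical stand-in for a few corpus lines.
sample() {
  printf '%s\n' \
    'Nokia shares slide after warning' \
    'earnings per share (EPS) of 0.18-0.20' \
    'sales fall seven percent'
}

# 's?' makes the final s optional, so lines with "share" or "shares" match.
sample | grep -En 'shares?'

# -o prints only the matched text: "shares" for line 1, "share" for line 2.
sample | grep -Eo 'shares?'
```

The third line contains "sales" but not "share", so it is not selected.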

Page 12: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Kleene operators:
      /string*/ “zero or more occurrences of previous character”
      /string+/ “one or more occurrences of previous character”
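A small sketch of the difference between * and +, assuming `grep -E` in place of egrep and the `-x` option so that the whole line must match. The `sheep` function and its four test strings are hypothetical.

```shell
# Hypothetical test strings, one per line.
sheep() { printf '%s\n' 'b!' 'ba!' 'baa!' 'baaaa!'; }

# 'ba*!' : b, then zero or more a, then !  (all four lines match)
sheep | grep -Ex 'ba*!'

# 'ba+!' : b, then one or more a, then !  ("b!" no longer matches)
sheep | grep -Ex 'ba+!'
```

In both patterns the operator applies only to the single character before it (the a), not to the whole string.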

Page 14: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Wildcard operator:
      /string./ “any single character after the previous character”
  • Combine wildcard and Kleene:
      /string.*/ “zero or more instances of any character after the previous character”
      /string.+/ “one or more instances of any character after the previous character”

Page 15: Computational Language Finite State Machines and Regular Expressions

Regular Expressions

    egrep -n 'profit.*' nokia_corpus.txt

    13:better than expected first-quarter profits from Nokia,
    52:remains the only profitable handset maker among the "big three" suppliers
    60:company's profitability outlook remains strong, driven by increasing
    81:Pre-tax profit was 1.31 billion euros.The company's struggling networks unit
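Because grep prints whole lines, 'profit.*' selects the same lines as plain 'profit'. The wildcard changes how much of the line is matched, not which lines are found. The sketch below makes this visible. It assumes `grep -E` in place of egrep and the `-o` option of GNU and BSD grep. The `sample` function is a hypothetical stand-in for the corpus.

```shell
# Hypothetical stand-in for a few corpus lines.
sample() {
  printf '%s\n' \
    'better than expected first-quarter profits from Nokia,' \
    'Pre-tax profit was 1.31 billion euros.' \
    'Nokia said in a statement.'
}

# Same two lines either way, since the whole line is printed.
sample | grep -En 'profit'
sample | grep -En 'profit.*'

# -o shows the matched text: '.*' runs on to the end of the line.
sample | grep -Eo 'profit.*'

# 'profit.' needs exactly one more character after "profit".
sample | grep -Eo 'profit.'
```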

Page 16: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Anchors
      Beginning-of-line operator: ^
          egrep '^said' nokia_corpus.txt
      End-of-line operator: $
          egrep 'said$' nokia_corpus.txt
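A short sketch of both anchors, assuming `grep -E` in place of egrep. The `sample` function is a hypothetical three-line stand-in for the corpus, so the line numbers are those of the sample only.

```shell
# Hypothetical stand-in: "said" at the start, at the end, and in the middle.
sample() {
  printf '%s\n' \
    'said Nokia had been "a bit optimistic overall" in its forecasts.' \
    'slower than anticipated, Nokia said' \
    'Nokia said in a statement.'
}

# ^ goes before the text: "said" must start the line (line 1 only).
sample | grep -En '^said'

# $ goes after the text: "said" must end the line (line 2 only).
sample | grep -En 'said$'

# Without anchors all three lines contain "said".
sample | grep -Ec 'said'
```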

Page 17: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Disjunction:
      Set operator: /[Ss]tring/ “a string which begins with either S or s”
      Range: /[A-Z]tring/ “a string beginning with a capital letter”
      Pipe |: /string1|string2/ “either string 1 or string 2”

Page 18: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Disjunction

    egrep -n 'weak|warning|drop' nokia_corpus.txt

    egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
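The two commands differ in which lines they select mainly because the second uses the shorter stem warn, which also matches "warned". The trailing .* does not add lines by itself. The sketch below checks this on a hypothetical `sample` function standing in for the corpus, assuming `grep -E` in place of egrep.

```shell
# Hypothetical stand-in for a few corpus lines.
sample() {
  printf '%s\n' \
    'Nokia shares slide after warning' \
    'weak demand, sending its shares 12 percent lower' \
    'Nokia warned group sales would grow only' \
    'Pre-tax profit was 1.31 billion euros.'
}

# Full words: lines 1 ("warning") and 2 ("weak").
sample | grep -En 'weak|warning|drop'

# Shorter stem 'warn' also catches "warned": lines 1, 2 and 3.
sample | grep -En 'weak.*|warn.*|drop.*'
```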

Page 19: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Negation:
      /[^a-z]tring/ “any string that does not begin with a lowercase letter”
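A negated set still has to match one character, which the sketch below illustrates. It assumes `grep -E` in place of egrep, `-x` for whole-line matching, and `LC_ALL=C` so that the range a-z means exactly the ASCII lowercase letters. The `words` function and its strings are hypothetical.

```shell
# Hypothetical test strings, one per line.
words() { printf '%s\n' 'String' 'string' '9tring' 'tring'; }

# [^a-z] matches exactly one character that is not a lowercase letter,
# so "String" and "9tring" match, "string" does not, and neither does
# "tring" (there is no character in front of the t).
words | LC_ALL=C grep -Ex '[^a-z]tring'
```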

Page 20: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Precedence
      1. Parentheses
      2. Kleene and optional operators: * + ?
      3. Anchors and sequences
      4. Disjunction operator: |

      (a) /supply|iers/ matches /supply/ or /iers/
      (b) /suppl(y|iers)/ matches /supply/ or /suppliers/
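Cases (a) and (b) can be checked directly. This is a minimal sketch, assuming `grep -E` in place of egrep and `-x` so that a line must match in full. The `words` function and its strings are hypothetical.

```shell
# Hypothetical test strings, one per line.
words() { printf '%s\n' 'supply' 'suppliers' 'iers' 'suppl'; }

# (a) Sequences bind tighter than |, so the choice is between the two
#     whole sequences "supply" and "iers".
words | grep -Ex 'supply|iers'

# (b) Parentheses confine the choice to the suffix: "supply" or "suppliers".
words | grep -Ex 'suppl(y|iers)'
```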

Page 22: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Substitution

    sed 's/word1/word2/' corpus.txt

    Me: I am feeling a bit depressed today

    sed 's/I am/sorry to hear that you are/' corpus.txt

    Eliza: sorry to hear that you are feeling a bit depressed today

Page 24: Computational Language Finite State Machines and Regular Expressions

Regular Expressions
  • Substitution

    sed 's/word1/word2/' corpus.txt

    Me: I wish I could shake this depression

    sed 's/wish I/am sure you/' corpus.txt

    Eliza: I am sure you could shake this depression
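The two substitutions from the Eliza exchanges can be combined into one sed call and fed from standard input instead of a file. This is only a sketch: the `eliza` function name is hypothetical, and the input lines are given without the "Me:" speaker label.

```shell
# Hypothetical helper: applies both substitution rules to each input line.
# s/old/new/ replaces the first match of old on a line with new.
eliza() {
  sed -e 's/I am/sorry to hear that you are/' \
      -e 's/wish I/am sure you/'
}

printf '%s\n' 'I am feeling a bit depressed today' | eliza
printf '%s\n' 'I wish I could shake this depression' | eliza
```

Each rule fires only on lines where its pattern occurs, so the two inputs produce the two Eliza replies.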

Page 25: Computational Language Finite State Machines and Regular Expressions

Finite State Transition Networks
  • Finite State Automata (FSA)
  • Like a regular expression, an FSA is used to recognise a set of strings
      e.g. egrep -n 'baa+!' corpus.txt

Page 29: Computational Language Finite State Machines and Regular Expressions

Finite State Transition Networks
  • Finite State Automata (FSA)
  • Like a regular expression, an FSA is used to recognise a set of strings
  • Represented as a directed graph
  • Set of nodes representing states
  • Set of arcs (links between nodes) representing transitions between states
  • Arcs are labelled

Page 34: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • How does it work?
  • Used to recognise a set of strings
  • Candidate input string represented as a segmented tape with a symbol for each cell
  • String fed into the machine one symbol at a time
  • If the symbol on the input matches the symbol on an arc, then
      A) move to the next state
      B) advance one symbol on the input string
      C) keep going until the input ends; accept if the machine is then in a final state
  • Otherwise: stop and reject the string

Page 39: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • State Transition Table

             Input
    State    b    a    !
    0        1    Ø    Ø
    1        Ø    2    Ø
    2        Ø    3    Ø
    3        Ø    3    4
    4:       Ø    Ø    Ø

Page 40: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Algorithm for FSA (Jurafsky and Martin, p. 37)

    function D-RECOGNIZE(tape, machine) returns accept or reject
      index <- Beginning of tape
      current-state <- Initial state of machine
      loop
        if End of input has been reached then
          if current-state is an accept state then
            return accept
          else
            return reject
        elseif transition-table[current-state, tape[index]] is empty then
          return reject
        else
          current-state <- transition-table[current-state, tape[index]]
          index <- index + 1
      end
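The pseudocode can be sketched in POSIX shell for the /baa+!/ machine (states 0 to 4, state 4 final). This is one possible rendering, not the textbook's code: the function names `next_state` and `d_recognize` are hypothetical, the transition table is hard-coded as a case statement, and the tape is consumed from the front rather than indexed.

```shell
# Transition table for the /baa+!/ machine.
# Prints the next state, or nothing when the table cell is empty.
next_state() {
  case "$1:$2" in
    '0:b') echo 1 ;;
    '1:a') echo 2 ;;
    '2:a') echo 3 ;;
    '3:a') echo 3 ;;
    '3:!') echo 4 ;;
  esac
}

# d_recognize TAPE: prints "accept" or "reject".
d_recognize() {
  tape=$1
  state=0                        # initial state of the machine
  while [ -n "$tape" ]; do       # loop until end of input
    rest=${tape#?}               # tape without its first symbol
    symbol=${tape%"$rest"}       # the first symbol
    state=$(next_state "$state" "$symbol")
    if [ -z "$state" ]; then     # empty table cell: reject
      echo reject
      return
    fi
    tape=$rest                   # advance one symbol
  done
  # End of input: accept only in the final state 4.
  if [ "$state" = 4 ]; then echo accept; else echo reject; fi
}

d_recognize 'baaa!'
d_recognize 'ba!'
```

Under these assumptions the first call prints accept and the second prints reject, because state 2 has no arc labelled !.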

Page 42: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • FSAs and recognition
  • FSAs and generation
      At each transition, print out the label of the arc
      At the final state, stop printing
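Generation can be sketched for the /baa+!/ machine by walking its arcs and printing each label. The slide does not say how a generator chooses between the a self-loop and the ! arc, so the number of loop traversals is passed in as a parameter here. That parameter and the function name `generate` are assumptions for illustration.

```shell
# generate N: follow the arcs 0 -b-> 1 -a-> 2 -a-> 3, take the a self-loop
# on state 3 N times, then the ! arc into final state 4.
# The label of each arc is printed as it is traversed.
generate() {
  n=$1
  printf 'b'
  printf 'a'
  printf 'a'
  while [ "$n" -gt 0 ]; do
    printf 'a'                 # one pass around the self-loop
    n=$((n - 1))
  done
  printf '!\n'                 # final state reached: stop printing
}

generate 0
generate 2
```

With these inputs the sketch prints baa! and baaaa!.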

Page 44: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Deterministic FSAs
      An FSA whose recognition behaviour is fully determined by the state it is in and the input symbol it is looking at
  • Non-deterministic FSAs
      An FSA with decision points

Page 45: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Deterministic FSAs
  • Non-deterministic FSAs
      An FSA with decision points
      A self-loop may sit on a state that also has an outgoing arc with the same label
      Arcs may have ε transitions

Page 46: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Deterministic FSAs
  • Non-deterministic FSA
      Backup: set a marker that can be returned to
      Look-ahead: look ahead at input
      Parallelism: look at alternative paths in parallel

Page 47: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Non-deterministic FSA: state transition table

             Input
    State    b    a       !    ε
    0        1    Ø       Ø    Ø
    1        Ø    2       Ø    Ø
    2        Ø    2, 3    Ø    Ø
    3        Ø    Ø       4    Ø
    4:       Ø    Ø       Ø    Ø
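One way to run this table is the parallelism strategy: keep the set of all states the machine could be in and advance every one of them on each symbol. The sketch below does this in POSIX shell. The function names `nd_next` and `nd_recognize` are hypothetical, the state set is held as a space-separated string, and ε transitions are not handled because the ε column of this table is empty.

```shell
# Non-deterministic transition table: may print more than one next state.
nd_next() {
  case "$1:$2" in
    '0:b') echo 1 ;;
    '1:a') echo 2 ;;
    '2:a') echo '2 3' ;;       # decision point: stay in 2 or move to 3
    '3:!') echo 4 ;;
  esac
}

# nd_recognize TAPE: prints "accept" or "reject".
nd_recognize() {
  tape=$1
  states=0                     # set of current states
  while [ -n "$tape" ]; do
    rest=${tape#?}             # tape without its first symbol
    symbol=${tape%"$rest"}     # the first symbol
    new=''
    for s in $states; do       # advance every alternative in parallel
      new="$new $(nd_next "$s" "$symbol")"
    done
    states=$new
    tape=$rest
  done
  for s in $states; do         # accept if any path ended in final state 4
    if [ "$s" = 4 ]; then echo accept; return; fi
  done
  echo reject
}

nd_recognize 'baaa!'
nd_recognize 'ba!'
```

Under these assumptions the calls print accept and reject, the same answers the deterministic machine for /baa+!/ would give.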

Page 49: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Formal language
      Set of strings
      Finite symbol set, alphabet
      Σ = {a, b, !}

Page 50: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Formal language
      Set of strings
      Finite symbol set, alphabet
      L(m) = {baa!, baaa!, baaaa!, …} “formal language characterised by m”
      m = model
      L = formal language

Page 51: Computational Language Finite State Machines and Regular Expressions

Finite State Automata
  • Formal language
      Set of strings
      Finite symbol set, alphabet
      L(m) = {baa!, baaa!, baaaa!, …}
  • A formal language models a fragment of a natural language