600.465 - Intro to NLP - J. Eisner 1 Finite-State Programming Some Examples


Finite-State Programming

Some Examples


Finite-state “programming”

Ordinary programming           Finite-state “programming”
Function                       Function on (set of) strings
Source code                    Regular expression
Object code                    Finite state machine
Compiler                       Regexp compiler
Optimization of object code    Determinization, minimization, pruning


Finite-state “programming”

Ordinary programming                        Finite-state “programming”
Function composition                        (Weighted) composition
Higher-order function                       Operator
Function inversion (available in Prolog)    Function inversion
Structured programming                      Ops + small regexps


Finite-state “programming”

Ordinary programming    Finite-state “programming”
Parallelism             Apply to set of strings
Nondeterminism          Nondeterminism
Stochasticity           Prob.-weighted arcs


Some Xerox Extensions

$          containment
=>         restriction
->  @->    replacement

Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

slide courtesy of L. Karttunen (modified)


Containment

$[ab*c]

“Must contain a substring that matches ab*c.”

Accepts xxxacyy
Rejects bcba

Equivalent expression: ?* [ab*c] ?*

[Figure: the finite-state machine for $[ab*c], with arcs labeled a, b, c and a,b,c,?.]

Warning: ? in regexps means “any character at all.”
But ? in machines means “any character not explicitly mentioned anywhere in the machine.”
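The equivalence of $[ab*c] and ?* [ab*c] ?* can be tried out with an ordinary regex library. The following is only a Python sketch, not xfst: the function names are invented for illustration, and Python's `.` (with `re.DOTALL`) plays the role of the regexp-level ?.

```python
import re

# Python stand-in for the slide's pattern a b* c.
INNER = r"ab*c"

def contains(s: str) -> bool:
    """$[ab*c]: some substring of s matches ab*c."""
    return re.search(INNER, s) is not None

def contains_expanded(s: str) -> bool:
    """?* [ab*c] ?*: the whole string is 'anything, ab*c, anything'."""
    pattern = r".*(?:" + INNER + r").*"
    return re.fullmatch(pattern, s, flags=re.DOTALL) is not None

for word in ["xxxacyy", "bcba"]:
    print(word, contains(word), contains_expanded(word))
```

Both tests agree on the slide's two examples: xxxacyy is accepted and bcba is rejected.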


Restriction

a => b _ c

“Any a must be preceded by b and followed by c.”

Accepts bacbbacde
Rejects baca

Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
  ~[?* b] a ?*     contains a not preceded by b
  ?* a ~[c ?*]     contains a not followed by c

[Figure: the finite-state machine for this restriction, with arcs labeled a, b, c and ?.]

slide courtesy of L. Karttunen (modified)
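The restriction and its equivalent expression can be compared in a small Python sketch. This is an illustration under assumptions, not the Xerox implementation: the function names are made up, and the two negated subexpressions are approximated with Python lookbehind and lookahead.

```python
import re

def restriction_direct(s: str) -> bool:
    """a => b _ c read literally: every a has b just before and c just after."""
    return all(
        i > 0 and s[i - 1] == "b" and i + 1 < len(s) and s[i + 1] == "c"
        for i, ch in enumerate(s)
        if ch == "a"
    )

def restriction_by_complement(s: str) -> bool:
    """~[~[?* b] a ?*] & ~[?* a ~[c ?*]] as two negated tests."""
    # Some a whose left side does not end in b (this includes an initial a).
    bad_left = re.search(r"(?<!b)a", s) is not None
    # Some a whose right side does not start with c (this includes a final a).
    bad_right = re.search(r"a(?!c)", s) is not None
    return not bad_left and not bad_right

for word in ["bacbbacde", "baca"]:
    print(word, restriction_direct(word), restriction_by_complement(word))
```

Both versions accept bacbbacde and reject baca, whose final a is not followed by c.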


Replacement

a b -> b a

“Replace ‘ab’ by ‘ba’.”

Transduces abcdbaba to bacdbbaa

Equivalent expression: [~$[a b] [[a b] .x. [b a]]]* ~$[a b]

[Figure: the transducer for this rule, with arcs labeled a, b, ?, a:b and b:a.]

slide courtesy of L. Karttunen (modified)


Replacement is Nondeterministic

a b -> b a | x

“Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”

Transduces abcdbaba to {bacdbbaa, bacdbxa, xcdbbaa, xcdbxa}
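The four outputs can be enumerated with a short Python sketch. It is an illustration only: the function name is invented, and it assumes the target string cannot overlap with itself (true for "ab"), so that splitting on it gives the unique factorization.

```python
from itertools import product

def replace_nondeterministically(s: str, target: str, choices: list[str]) -> set[str]:
    """Replace every occurrence of `target` by one of `choices`, independently."""
    pieces = s.split(target)              # the text between the occurrences
    outputs = set()
    for picks in product(choices, repeat=len(pieces) - 1):
        out = pieces[0]
        for pick, piece in zip(picks, pieces[1:]):
            out += pick + piece
        outputs.add(out)
    return outputs

print(sorted(replace_nondeterministically("abcdbaba", "ab", ["ba", "x"])))
# ['bacdbbaa', 'bacdbxa', 'xcdbbaa', 'xcdbxa']
```

Two occurrences of ab with two choices each give the 2 x 2 = 4 outputs listed on the slide.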


Replacement is Nondeterministic

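[ a b -> b a | x ] .o. [ x => _ c ]

“Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically,” then keep only the outputs in which every x is followed by c.

Without the restriction, abcdbaba is transduced to {bacdbbaa, bacdbxa, xcdbbaa, xcdbxa}. Composing with [ x => _ c ] rules out bacdbxa and xcdbxa, whose x is followed by a, leaving {bacdbbaa, xcdbbaa}.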
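The effect of composing with [ x => _ c ] can be sketched in Python as a filter on the outputs of the replacement rule. The function name is hypothetical, and the candidate list is simply copied from the replacement slide rather than computed by a transducer.

```python
def x_followed_by_c(s: str) -> bool:
    """x => _ c: every x must be immediately followed by c."""
    return all(s[i + 1 : i + 2] == "c" for i, ch in enumerate(s) if ch == "x")

# Outputs of a b -> b a | x applied to "abcdbaba".
candidates = ["bacdbbaa", "bacdbxa", "xcdbbaa", "xcdbxa"]

# Composition with the restriction keeps only the outputs that satisfy it.
survivors = [s for s in candidates if x_followed_by_c(s)]
print(survivors)  # ['bacdbbaa', 'xcdbbaa']
```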


Replacement is Nondeterministic

a b | b | b a | a b a -> x

applied to “aba”

Four overlapping substrings match; we haven’t told it which one to replace, so it chooses nondeterministically:

a b a  gives  a x a    (replacing b)
a b a  gives  a x      (replacing b a)
a b a  gives  x a      (replacing a b)
a b a  gives  x        (replacing a b a)

slide courtesy of L. Karttunen (modified)
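The four results can be reproduced by enumerating factorizations of the input in the shape of the equivalent expression [~$[A] [A .x. B]]* ~$[A]: stretches that are copied unchanged may not contain any match. This Python sketch is an illustration with an invented function name; it assumes the patterns are non-empty literal strings.

```python
def all_replacements(s: str, patterns: list[str], repl: str) -> set[str]:
    """Enumerate every output of `patterns -> repl` applied to s."""
    results = set()

    def has_match(chunk: str) -> bool:
        return any(p in chunk for p in patterns)

    def go(i: int, out: str) -> None:
        for j in range(i, len(s) + 1):
            gap = s[i:j]                  # stretch copied unchanged
            if has_match(gap):
                break                     # any longer gap contains it too
            if j == len(s):
                results.add(out + gap)    # input exhausted
            for p in patterns:
                if s.startswith(p, j):    # replace one match starting at j
                    go(j + len(p), out + gap + repl)

    go(0, "")
    return results

print(sorted(all_replacements("aba", ["ab", "b", "ba", "aba"], "x")))
# ['ax', 'axa', 'x', 'xa']
```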


More Replace Operators

Optional replacement: a b (->) b a

Directed replacement guarantees a unique result by constraining the factorization of the input string by
  Direction of the match (rightward or leftward)
  Length (longest or shortest)

slide courtesy of L. Karttunen


@-> Left-to-right, Longest-match Replacement

a b | b | b a | a b a @-> x

applied to “aba”

Of the four factorizations that plain -> allows (giving a x a, a x, x a and x), @-> keeps only the one that starts leftmost and matches longest: a b a gives x.

@->    left-to-right, longest match
@>     left-to-right, shortest match
->@    right-to-left, longest match
>@     right-to-left, shortest match

slide courtesy of L. Karttunen
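For literal patterns, left-to-right longest-match replacement can be imitated with Python's `re` module. This is a sketch under assumptions: the function name is invented, and the trick of sorting alternatives by length only works because the patterns are fixed strings (Python alternation picks the first alternative that matches, not the longest).

```python
import re

def replace_leftmost_longest(s: str, patterns: list[str], repl: str) -> str:
    """Scan left to right; at each position replace the longest literal match."""
    # Listing the literals longest-first makes the first matching
    # alternative at any position also the longest one.
    ordered = sorted(patterns, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(p) for p in ordered))
    return regex.sub(lambda m: repl, s)

print(replace_leftmost_longest("aba", ["ab", "b", "ba", "aba"], "x"))  # x
```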


Using “…” for marking

a|e|i|o|u -> [ ... ]

p o t a t o  becomes  p[o]t[a]t[o]

[Figure: the transducer for this rule, with arcs labeled 0:[, 0:], a, e, i, o, u, [, ] and ?.]

Note: actually have to write as -> %[ ... %] or -> “[” ... “]”, since [] are parens in the regexp language.

slide courtesy of L. Karttunen (modified)


Using “…” for marking

a|e|i|o|u -> [ ... ]

p o t a t o  becomes  p[o]t[a]t[o]

[Figure: the same transducer as on the previous slide.]

Which way does the FST transduce potatoe?

p o t a t o e  becomes  p[o]t[a]t[o][e]
vs.
p o t a t o e  becomes  p[o]t[a]t[o e]

How would you change it to get the other answer?

slide courtesy of L. Karttunen (modified)
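The two candidate answers for potatoe can be compared in a Python sketch. This is one possible reading of the question, not the slide's official answer: it assumes that a pattern of single vowels brackets each vowel separately, and that a longest-match rule over runs of vowels would give the other result. The function names are invented.

```python
import re

def mark_each_vowel(s: str) -> str:
    """Pattern a|e|i|o|u: every single vowel gets its own brackets."""
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", s)

def mark_vowel_runs(s: str) -> str:
    """Longest-match variant: each maximal run of vowels gets one pair of brackets."""
    return re.sub(r"[aeiou]+", lambda m: "[" + m.group(0) + "]", s)

print(mark_each_vowel("potato"))    # p[o]t[a]t[o]
print(mark_each_vowel("potatoe"))   # p[o]t[a]t[o][e]
print(mark_vowel_runs("potatoe"))   # p[o]t[a]t[oe]
```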


Example: Finnish Syllabification

define C [ b | c | d | f ...
define V [ a | e | i | o | u ];

[C* V+ C*] @-> ... "-" || _ [C V]

s t r u k t u r a l i s m i  becomes  s t r u k - t u - r a - l i s - m i

“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”

Why?

slide courtesy of L. Karttunen
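The syllabification rule can be approximated with a Python regex, using a lookahead for the right context _ [C V]. This is a sketch, not xfst: the consonant set is an assumption (the slide's definition of C is elided), and Python's greedy backtracking is not longest-match in general, although it coincides with it for this pattern.

```python
import re

V = "aeiou"                    # vowels, as defined on the slide
C = "bcdfghjklmnpqrstvwxz"     # assumption: stand-in for the elided consonant list

# C* V+ C*, but only where a C V sequence follows (lookahead = right context).
SYLLABLE = re.compile(rf"([{C}]*[{V}]+[{C}]*)(?=[{C}][{V}])")

def syllabify(word: str) -> str:
    """Insert "-" after each C* V+ C* stretch that stands in front of C V."""
    return SYLLABLE.sub(r"\1-", word)

print(syllabify("strukturalismi"))  # struk-tu-ra-lis-mi
```

The final C* has to give back one consonant to the lookahead, which is why the hyphen lands before the last consonant of each cluster (struk-tu rather than strukt-u).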


Conditional Replacement

The relation that replaces A by B between L and R leaving everything else unchanged.

A -> B    (replacement)
L _ R     (context)

Sources of complexity:
  Replacements and contexts may overlap
  Alternative ways of interpreting “between left and right.”

slide courtesy of L. Karttunen


Hand-Coded Example: Parsing Dates

Best result:
  Today is [Tuesday, July 25, 2000].

Bad results:
  Today is Tuesday, [July 25, 2000].
  Today is [Tuesday, July 25], 2000.
  Today is Tuesday, [July 25], 2000.
  Today is [Tuesday], July 25, 2000.

Need left-to-right, longest-match constraints.

slide courtesy of L. Karttunen


Source code: Language of Dates

Day = Monday | Tuesday | ... | Sunday

Month = January | February | ... | December

Date = 1 | 2 | 3 | ... | 3 1

Year = %0To9 (%0To9 (%0To9 (%0To9))) - %0 ?*    (from 1 to 9999)

AllDates = Day | (Day “, ”) Month “ ” Date (“, ” Year)

slide courtesy of L. Karttunen


Object code: All Dates from 1/1/1 to 12/31/9999

[Figure: the finite-state machine compiled from AllDates. Arcs carry day names (Mon … Sun), month names (Jan … Dec), digits, commas and spaces. Callout on one arc: “actually represents 7 arcs, each labeled by a string.”]

13 states, 96 arcs
29 760 007 date expressions

slide courtesy of L. Karttunen
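The figure of 29 760 007 date expressions can be checked by counting the strings that the AllDates definition generates. The following Python arithmetic is a sketch that assumes the definitions exactly as given on the source-code slide (7 days, 12 months, dates 1 to 31, years 1 to 9999, with the day and year parts optional).

```python
days = 7        # Monday ... Sunday
months = 12     # January ... December
dates = 31      # 1 ... 31
# Year: 1 to 4 digits with no leading zero, i.e. 1 ... 9999.
years = sum(9 * 10 ** (n - 1) for n in range(1, 5))

# AllDates = Day | (Day ", ") Month " " Date (", " Year)
with_month = (1 + days) * months * dates * (1 + years)
all_dates = days + with_month
print(all_dates)  # 29760007
```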


Parser for Dates

AllDates @-> “[DT ” ... “]”

(@-> is the Xerox left-to-right replacement operator.)

Compiles into an unambiguous transducer (23 states, 332 arcs).

Today is [DT Tuesday, July 25, 2000] because yesterday was [DT Monday] and it was [DT July 24] so tomorrow must be [DT Wednesday, July 26] and not [DT July 27] as it says on the program.

slide courtesy of L. Karttunen (modified)
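A rough Python analogue of this date marker is sketched below. It is not the Xerox transducer: the names are invented, the day and month lists spell out what the source-code slide abbreviates with "...", and longest match is obtained by ordering alternatives and relying on greedy optional groups.

```python
import re

DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DATE = r"(?:3[01]|[12][0-9]|[1-9])"   # 1 ... 31, longer alternatives first
YEAR = r"[1-9][0-9]{0,3}"             # 1 ... 9999

# Full dates are tried before a bare Day so that, together with the greedy
# optional parts, the leftmost match is also the longest one.
ALL_DATES = re.compile(rf"(?:{DAY}, )?{MONTH} {DATE}(?:, {YEAR})?|{DAY}")

def mark_dates(text: str) -> str:
    """Wrap every date expression in "[DT " ... "]"."""
    return ALL_DATES.sub(lambda m: "[DT " + m.group(0) + "]", text)

print(mark_dates("Today is Tuesday, July 25, 2000 because yesterday was Monday."))
# Today is [DT Tuesday, July 25, 2000] because yesterday was [DT Monday].
```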


Problem of Reference

Valid dates
  Tuesday, July 25, 2000
  Tuesday, February 29, 2000
  Monday, September 16, 1996

Invalid dates
  Wednesday, April 31, 1996
  Thursday, February 29, 1900
  Tuesday, July 26, 2000

slide courtesy of L. Karttunen
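The six examples can be checked against a calendar library. This Python sketch uses the standard `datetime` module instead of finite-state constraints; the function name is invented, and it assumes the proleptic Gregorian calendar that `datetime.date` implements.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def is_valid(day_name: str, month: str, day: int, year: int) -> bool:
    """True if the date exists and falls on the named weekday."""
    try:
        d = date(year, MONTHS.index(month) + 1, day)
    except ValueError:
        return False                        # e.g. April 31 or February 29, 1900
    return DAYS[d.weekday()] == day_name    # date.weekday(): Monday is 0

print(is_valid("Tuesday", "July", 25, 2000))   # True
print(is_valid("Tuesday", "July", 26, 2000))   # False
```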


Refinement by Intersection

[Figure: Venn diagram. ValidDates is the part of AllDates where MaxDaysInMonth, LeapYears and WeekdayDate overlap.]

MaxDaysInMonth:
  “ 31” => Jan|Mar|May|… _
  “ 30” => Jan|Mar|Apr|… _
  Q: Why do these rules start with spaces? (And is it enough?)

LeapYears:
  Feb 29, => _ …
  Q: Why does this rule end with a comma?
  Q: Can we write the whole rule?

(=> is the Xerox contextual restriction operator.)

slide courtesy of L. Karttunen (modified)


Defining Valid Dates

AllDates & MaxDaysInMonth & LeapYears & WeekdayDates = ValidDates

AllDates: 13 states, 96 arcs; 29 760 007 date expressions
ValidDates: 805 states, 6472 arcs; 7 307 053 date expressions

slide courtesy of L. Karttunen
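The count of 7 307 053 valid date expressions can be reproduced with a little arithmetic. The Python sketch below rests on two assumptions that are not stated on the slides but are consistent with the total: February 29 is allowed whenever no year is given, and the weekday is only checked when a year is present.

```python
import calendar

# Month/day/year triples that exist for years 1 ... 9999:
# 365 per year plus one extra day per leap year.
leap_years = sum(calendar.isleap(y) for y in range(1, 10000))
dated = 365 * 9999 + leap_years

bare_day = 7                   # "Tuesday"
month_date = 366               # "July 25" (February 29 allowed, no year to check)
day_month_date = 7 * 366       # "Tuesday, July 25" (no year, so any weekday)
month_date_year = dated        # "July 25, 2000"
day_month_date_year = dated    # exactly one correct weekday per dated triple

valid_dates = (bare_day + month_date + day_month_date
               + month_date_year + day_month_date_year)
print(valid_dates)  # 7307053
```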


Parser for Valid and Invalid Dates

[AllDates - ValidDates] @-> “[ID ” ... “]”,
ValidDates @-> “[VD ” ... “]”

The comma creates a single FST that does left-to-right longest match against either pattern (2688 states, 20439 arcs).

Today is [VD Tuesday, July 25, 2000], not [ID Tuesday, July 26, 2000].
(VD marks a valid date, ID an invalid date.)

slide courtesy of L. Karttunen


More Engineering Applications

Markup
  Dates, names, places, noun phrases; spelling/grammar errors?
  Hyphenation
  Informative templates for information extraction (FASTUS)
  Word segmentation (use probabilities!)
  Part-of-speech tagging (use probabilities – maybe!)

Translation
  Spelling correction / edit distance
  Phonology, morphology: series of little fixups? constraints?
  Speech
  Transliteration / back-transliteration
  Machine translation?

Learning …


FASTUS – Information Extraction Appelt et al, 1992-?

Input: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with …

Output:
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”
            “A local concern”
            “A Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Amount: NT$20000000


FASTUS: Successive Markups (details on subsequent slides)

Tokenization
.o.
Multiwords
.o.
Basic phrases (noun groups, verb groups …)
.o.
Complex phrases
.o.
Semantic Patterns
.o.
Merging different references


FASTUS: Tokenization

Spaces, hyphens, etc.
wouldn’t → would not
their → them ’s
company. → company .
but Co. → Co.


FASTUS: Multiwords

“set up” “joint venture” “San Francisco Symphony Orchestra,”

“Canadian Opera Company”

… use a specialized regexp to match musical groups.

... what kind of regexp would match company names?
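The first two stages of the cascade can be sketched as ordinary string rewriting composed in sequence, in the spirit of the .o. operator. This Python toy is an assumption-laden illustration, not FASTUS: the function names and the tiny rule lists are invented, and the underscore convention for multiwords is borrowed from the set_up and joint_venture tokens in the Basic phrases output.

```python
import re
from functools import reduce

def tokenize(text: str) -> str:
    """Toy tokenizer covering only the examples on the tokenization slide."""
    text = re.sub(r"\bwouldn['’]t\b", "would not", text)
    text = re.sub(r"\btheir\b", "them 's", text)
    # Split a period off the preceding word, but leave "Co." intact.
    text = re.sub(r"(?<!Co)\.(?=\s|$)", " .", text)
    return text

MULTIWORDS = ["set up", "joint venture"]   # hypothetical two-entry lexicon

def join_multiwords(text: str) -> str:
    """Glue each known multiword into a single token."""
    for phrase in MULTIWORDS:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

def compose(*stages):
    """Apply the stages left to right, like stage1 .o. stage2 .o. ..."""
    return lambda text: reduce(lambda acc, stage: stage(acc), stages, text)

pipeline = compose(tokenize, join_multiwords)
print(pipeline("Co. said it wouldn't set up their joint venture."))
# Co. said it would not set_up them 's joint_venture .
```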


FASTUS: Basic phrases

Output looks like this (no nested brackets!):
… [NG it] [VG had set_up] [NG a joint_venture] [Prep in]

Company Name: Bridgestone Sports Co.
Verb Group: said
Noun Group: Friday
Noun Group: it
Verb Group: had set up
Noun Group: a joint venture
Preposition: in
Location: Taiwan
Preposition: with
Noun Group: a local concern


FASTUS: Noun Groups

Build FSA to recognize phrases like
  approximately 5 kg
  more than 30 people
  the newly elected president
  the largest leftist political force
  a government and commercial project

Use the FSA for left-to-right longest-match markup

What does FSA look like? See next slide …


FASTUS: Noun Groups

Described with a kind of non-recursive CFG …
(a regexp can include names that stand for other regexps)

NG → Pronoun | Time-NP | Date-NP
NG → (Det) (Adjs) HeadNouns
…
Adjs → sequence of adjectives maybe with commas, conjunctions, adverbs
…
Det → DetNP | DetNonNP
DetNP → detailed expression to match “the only five, another three, this, many, hers, all, the most …”
…


FASTUS: Semantic patterns

BusinessRelationship = NounGroup(Company/ies) VerbGroup(Set-up) NounGroup(JointVenture) with NounGroup(Company/ies) | …

ProductionActivity = VerbGroup(Produce) NounGroup(Product)

NounGroup(Company/ies) NounGroup & … is made easy by the processing done at a previous level

Use this for spotting references to put in the database.