CMSC 723 / LING 723: Computational Linguistics I
September 5, 2007: Dorr
Part I: MT (cont.), MT Evaluation (J&M, Ch. 24)
Part II: Regular Expressions, Finite-State Automata (J&M, Ch. 2)
Prof. Bonnie J. Dorr
Co-Instructor: Nitin Madnani
TA: Hamid Shahri
MT Challenges: Ambiguity
Syntactic Ambiguity
– I saw the man on the hill with the telescope
Lexical Ambiguity
– E: book → S: libro, reservar
Semantic Ambiguity
– Homography: ball(E) = pelota, baile (S)
– Polysemy: kill(E) = matar, acabar (S)
– Semantic granularity:
esperar(S) = wait, expect, hope (E)
be(E) = ser, estar (S)
fish(E) = pez, pescado (S)
Language Typology: MT Divergences (Dorr, 1990)
Meaning of two translationally equivalent phrases is distributed differently in the two languages
Example:– English: [RUN INTO ROOM]– Spanish: [ENTER IN ROOM RUNNING]
Divergence examples, E → E′ (Spanish) and E → E′ (Arabic):
– Categorial: be jealous → have jealousy [tener celos]; when he returns → upon his return [عند رجوعه]
– Conflational: float → go floating [ir flotando]; come again → return [عاد]
– Structural: enter the house → enter in the house [entrar en la casa]; seek → search for [بحث عن]
– Head Swap: run in → enter running [entrar corriendo]; do something quickly → go-quickly in doing something [اسرع]
– Thematic: I have a headache → my-head hurts me [me duele la cabeza]; (no Arabic example)
Spanish/Arabic Divergences (Dorr, Habash, and Hwa, 2002)
[Arg1 [V]] → [Arg1 [MotionV] Modifier(V)]
“The boat floated” → “The boat went floating”
Approximate IL Approach
Tap into richness of TL resources
Use some, but not all, components of IL representation
Generate multiple sentences that are statistically pared down
Approximating IL: Handling Divergences
Primitives
Semantic Relations
Lexical Information
Generation Heavy Hybrid MT (GHMT): Nizar Habash 2003
Interlingual vs. Approximate IL
Interlingual MT:
– primitives & relations
– bi-directional lexicons
– analysis: compose IL
– generation: decompose IL
Approximate IL:
– hybrid symbolic/statistical design
– overgeneration with statistical ranking
– uses dependency-representation input and structural expansion for “deeper” overgeneration
Mapping from Input Dependency to English Dependency Tree
Knowledge Resources in English only: LVD (Dorr, 2001); CatVar (Habash & Dorr, 2003).
[Figure: input dependency tree, GIVE(V) with Agent MARY, Theme KICK(N), Goal JOHN and primitive [CAUSE GO], maps to English dependency tree KICK(V) with Agent MARY, Goal JOHN and primitive [CAUSE GO]]
Mary le dio patadas a John → Mary kicked John
Statistical Extraction
Mary kicked John . [-0.670270 ]
Mary gave a kick at John . [-2.175831]
Mary gave the kick at John . [-3.969686]
Mary gave an kick at John . [-4.489933]
Mary gave a kick by John . [-4.803054]
Mary gave a kick to John . [-5.045810]
Mary gave a kick into John . [-5.810673]
Mary gave a kick through John . [-5.836419]
Mary gave a foot wound by John . [-6.041891]
Mary gave John a foot wound . [-6.212851]
Benefits of Approximate IL Approach
Explaining behaviors that appear to be statistical in nature
“Re-sourceability”: Re-use of already existing components for MT from new languages.
Application to monolingual alternations
What Resources are Required?
Deep TL resources
– Requires SL parser and tralex
– TL resources are richer: LVD representations, CatVar database
– Constrained overgeneration
http://clipdemos.umiacs.umd.edu/catvar/
http://clipdemos.umiacs.umd.edu/englcslex/
Divergence Frequency (as measured by Habash and others, 2003)
32% of sentences in UN Spanish/English Corpus (5K)
35% of sentences in TREC El Norte Corpus (19K)
Divergence Types
– Categorial (X tener hambre → X have hunger) [98%]
– Conflational (X dar puñaladas a Z → X stab Z) [83%]
– Structural (X entrar en Y → X enter Y) [35%]
– Head Swapping (X cruzar Y nadando → X swim across Y) [8%]
– Thematic (X gustar a Y → Y like X) [6%]
Language Divergences Impact Bilingual Alignment for Statistical MT
Word-level alignments of bilingual texts are an integral part of Statistical MT models
Divergences present a great challenge to the alignment task
Common divergence types can be found in multiple language pairs, systematically identified, and resolved
Why is this a hard problem?
I run into the room
Yo entro en el cuarto corriendo
Divergences!
English: [RUN INTO ROOM]
Spanish: [ENTER IN ROOM RUNNING]
What can be done?
Divergence Detection:
– Increase the number of aligned words
– Decrease multiple alignments
DUSTer Approach: Divergence Unraveling
E: I run into the room
E′: I move-in running the room
S: Yo entro en el cuarto corriendo
Word-Level Alignment (1): Test Setup
[Figure: word-level alignment of the dependency trees for “John ran into the room” (run, John, into, room) and “John entered the room running” (John, enter, room, running)]
Ex: John ran into the room → John entered the room running
Divergence Detection: Categorize English sentences into one of 5 divergence types
Divergence Correction: Apply appropriate structural transformation [E → E′]
Word-Level Alignment Results
Number of aligned words:
– English-Spanish: aligned words increased from 82.8% to 86%
– English-Arabic: aligned words increased from 61.5% to 88.1%
Multiple alignments:
– English-Spanish: number of links went from 1.35 to 1.16
– English-Arabic: number of links increased from 1.48 to 1.72
Divergence Unraveling Conclusions
Divergence handling shows promise for improvement of automatic alignment
Conservative lower bound on divergence frequency
Effective solution: syntactic transformation of English
Validity of solution shown through alignment experiments
How do we evaluate MT?
Human-based Metrics:
– Invariance: Semantic, Pragmatic, Lexical, Structural, Spatial
– Fluency
– Accuracy
– Adequacy
– Edit cost of post-editing
– Informativeness: “Do you get it?”
Automatic Metrics:
– BLEU
– NIST
– METEOR
– Precision & Recall
– TER, HTER
– GTM
BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
Automatic technique, but…
Requires the pre-existence of human (reference) translations
Approach:
– Produce corpus of high-quality human translations
– Judge “closeness” numerically (word-error rate)
– Compare n-gram matches between candidate translation and 1 or more reference translations
http://www.research.ibm.com/people/k/kishore/RC22176.pdf
Bleu Comparison
Chinese-English Translation Example:
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
How Do We Compute Bleu Scores?
Key Idea: A reference word should be considered exhausted after a matching candidate word is identified.
• For each word compute: (1) candidate word count; (2) maximum reference count
• Add counts for each candidate word, using the lower of the two numbers
• Divide by the number of candidate words
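The counting procedure above can be sketched in Python (an illustrative fragment of BLEU only; the full metric also combines higher-order n-grams across a corpus and applies a brevity penalty):

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clipped unigram precision: each candidate word is credited at most
    as many times as it appears in any single reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = [Counter(ref.split()) for ref in references]
    clipped = sum(min(count, max(rc[word] for rc in ref_counts))
                  for word, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

# Illustration with a degenerate candidate (case-folded, no punctuation)
refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision("the the the the the the the", refs))  # (2, 7)
```

With the worked candidate sentences on the next slides (case-folded and with punctuation stripped), the same function reproduces the 17/18 and 8/14 scores.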
Modified Unigram Precision: Candidate #1
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the(4) commands(1) of(1) the(4) party(1)
What’s the answer??????
17/18
Modified Unigram Precision: Candidate #2
It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the(4) activity(0) guidebook(0) that(2) party(1) direct(0)
What’s the answer??????
8/14
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Modified Bigram Precision: Candidate #1
It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)
What’s the answer??????
10/17
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Modified Bigram Precision: Candidate #2
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)
What’s the answer??????
1/13
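The same clipping idea extends from unigrams to bigrams (and to any order n); a minimal sketch:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision over whitespace-tokenized strings."""
    cand = Counter(ngrams(candidate.split(), n))
    refs = [Counter(ngrams(r.split(), n)) for r in references]
    clipped = sum(min(c, max(rc[g] for rc in refs)) for g, c in cand.items())
    return clipped, sum(cand.values())

refs = ["the cat is on the mat", "there is a cat on the mat"]
# 7 tokens yield 6 bigrams, and "the the" occurs in no reference
print(modified_ngram_precision("the the the the the the the", refs, 2))  # (0, 6)
```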
Catching Cheaters
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
Candidate: the the the the the the the
the(2): the maximum reference count for “the” is 2, so the clipped count is min(7, 2) = 2
What’s the unigram answer?
2/7
What’s the bigram answer?
0/6 (seven “the” tokens yield six “the the” bigrams, none of which occurs in either reference)
Regular Expressions and Finite State Automata
REs: language for specifying text strings
Search for a document containing a string
– Searching for “woodchuck”
– Searching for “woodchucks” with an optional final “s”
• How much wood would a woodchuck chuck if a woodchuck would chuck wood?
• Finite-state automata (FSA) (singular: automaton)
Regular Expressions
Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
Disjunction: /[wW]oodchuck/
Regular Expressions
Optional characters: ?, *, and +
– ? (0 or 1)
• /colou?r/ → color or colour
– * (0 or more)
• /oo*h!/ → oh! or ooh! or oooh!
– + (1 or more)
• /o+h!/ → oh! or ooh! or oooh!
(* and + are known as Kleene operators, after Stephen Cole Kleene)
Wild card: .
– /beg.n/ → begin or began or begun
Regular Expressions
Anchors: ^ and $
– /^[A-Z]/ → “Ramallah, Palestine”
– /^[^A-Z]/ → “¿verdad?” “really?”
– /\.$/ → “It is over.”
– /.$/ → ?
Boundaries: \b and \B
– /\bon\b/ → “on my way” (but not “Monday”)
– /\Bon\b/ → “automaton”
Disjunction: |
– /yours|mine/ → “it is either yours or mine”
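These anchor and boundary patterns can be checked with Python's re module (a quick sketch; Perl's /…/ matching becomes re.search):

```python
import re

# Anchors: ^ matches at the start of the string, $ at the end
print(bool(re.search(r"^[A-Z]", "Ramallah, Palestine")))  # True: starts with a capital
print(bool(re.search(r"^[^A-Z]", "¿verdad?")))            # True: starts with a non-capital
print(bool(re.search(r"\.$", "It is over.")))             # True: ends with a literal period

# Boundaries: \b matches at a word edge, \B only where there is no word edge
print(bool(re.search(r"\bon\b", "on my way")))  # True: "on" is a whole word
print(bool(re.search(r"\bon\b", "Monday")))     # False: "on" sits inside "Monday"
print(bool(re.search(r"\Bon\b", "automaton")))  # True: "on" ends the word without beginning one
```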
Disjunction, Grouping, Precedence
Column 1 Column 2 Column 3 …
How do we express this?
/Column [0-9]+ */
/(Column [0-9]+ *)*/
Precedence (highest to lowest):
– Parentheses ()
– Counters * + ? {}
– Sequences and anchors: the ^my end$
– Disjunction |
REs are greedy!
Writing correct expressions
Exercise: Write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
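The refinement steps can be traced in Python (an illustrative sketch using re.findall; the sample sentence is invented for the demonstration):

```python
import re

# A made-up sentence with "the" as an article and inside other words
text = "The other day, the theology lecture ended."

print(len(re.findall(r"the", text)))     # 3: inside "other", "theology", plus the article
print(len(re.findall(r"[tT]he", text)))  # 4: now also catches sentence-initial "The"
print(re.findall(r"\b[tT]he\b", text))   # ['The', 'the']: only the standalone article
```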
A more complex example
Exercise: Write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
/$[0-9]+/
/$[0-9]+\.[0-9][0-9]/
/\b$[0-9]+(\.[0-9][0-9])?\b/
/\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/
/\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9](\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
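The price subpattern can be tested in Python (note: the slide's patterns leave $ unescaped; in an actual regex engine the dollar sign must be written \$ to avoid being parsed as the end-of-line anchor):

```python
import re

# Price of at most three dollar-digits, with optional cents
price = r"\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b"

print(bool(re.search(price, "on sale for $999.99")))  # True
print(bool(re.search(price, "costs $1000 or more")))  # False: four digits defeat the \b
```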
Substitutions and Memory
Substitutions:
s/colour/color/
s/colour/color/g
s/([0-9]+)/<\1>/
/the (.*)er they were, the \1er they will be/
/the (.*)er they (.*), the \1er they \2/
Substitute as many times as possible!
Case insensitive matching
s/colour/color/i
Memory (\1, \2, etc. refer back to matches)
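These Perl substitutions can be mirrored with Python's re.sub (a sketch; Python writes backreferences as \1 inside raw strings):

```python
import re

# s/colour/color/g  (global substitution)
print(re.sub(r"colour", "color", "colour is a colour"))  # color is a color

# s/([0-9]+)/<\1>/  with memory: wrap every integer in angle brackets
print(re.sub(r"([0-9]+)", r"<\1>", "page 12 of 250"))    # page <12> of <250>

# Backreference inside a pattern: \1 must repeat the first captured group
pat = r"the (.*)er they were, the \1er they will be"
print(bool(re.search(pat, "the bigger they were, the bigger they will be")))  # True
print(bool(re.search(pat, "the bigger they were, the faster they will be")))  # False
```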
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first-person references with second-person references
s/\bI(’m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
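The three steps can be combined into a tiny Eliza-style responder (an illustrative sketch, not Weizenbaum's program; the rules and their fixed ordering are a simplified stand-in for Step 3's scoring):

```python
import re

def eliza(utterance):
    """Tiny Eliza-style responder sketch with a hypothetical, hard-coded rule set."""
    s = utterance
    # Step 1: first-person -> second-person references
    s = re.sub(r"\bI am\b", "YOU ARE", s)
    s = re.sub(r"\bmy\b", "YOUR", s)
    s = re.sub(r"\bmine\b", "YOURS", s)
    # Step 2: reply rules, tried in order
    rules = [
        (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".*all.*", "IN WHAT WAY"),
        (r".*always.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    ]
    for pattern, reply in rules:
        if re.match(pattern, s):
            return re.sub(pattern, reply, s)
    return "PLEASE GO ON"

print(eliza("Men are all alike"))  # IN WHAT WAY
print(eliza("I am depressed"))     # I AM SORRY TO HEAR YOU ARE depressed
```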
Finite-state Automata (Machines)
/baa+!/
[Figure: FSA for /baa+!/ with states q0–q4; transitions: q0 to q1 on b, q1 to q2 on a, q2 to q3 on a, a self-loop on a at q3, q3 to q4 on !; q4 is the final state]
baa! baaa! baaaa! baaaaa! ...
Finite-state Automata
Q: a finite set of N states
– Q = {q0, q1, q2, q3, q4}
Σ: a finite input alphabet of symbols
– Σ = {a, b, !}
q0: the start state
F: the set of final states
– F = {q4}
δ(q,i): the transition function
– Given state q and input symbol i, return new state q′
– δ(q3,!) = q4
D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
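The pseudocode maps directly onto Python; below is a sketch using a dictionary as the transition table for the sheep language /baa+!/ (state numbers stand in for q0–q4):

```python
def d_recognize(tape, table, accept_states, start=0):
    """Deterministic FSA recognition: one transition per input symbol."""
    state = start
    for symbol in tape:
        if symbol not in table.get(state, {}):
            return False              # empty transition-table cell: reject
        state = table[state][symbol]
    return state in accept_states     # end of input: accept iff in an accept state

# Transition table for /baa+!/ ; state 3 carries the self-loop on 'a'
sheep = {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}}

print(d_recognize("baa!", sheep, {4}))    # True
print(d_recognize("baaaa!", sheep, {4}))  # True
print(d_recognize("ba!", sheep, {4}))     # False: needs at least two a's
```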
Languages and Automata
Can use FSA as a generator as well as a recognizer
Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. – L(M) = {baa!, baaa!, baaaa!, …}
Regular languages vs. non-regular languages
Using NFSAs to accept strings
Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.
Look-ahead: look ahead in input
Parallelism: look at alternatives in parallel