
Machine Translation: Challenges and Approaches


Page 1: Machine Translation: Challenges and Approaches

Machine Translation: Challenges and Approaches

Based on a talk by Nizar Habash
http://www.nizarhabash.com/

Associate Research Scientist

Center for Computational Learning Systems

Columbia University, New York

Natural Language Processing

Page 2: Machine Translation: Challenges and Approaches

• Google offers translations between many languages:

Afrikaans, Albanian, Arabic, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh, Yiddish

Page 3: Machine Translation: Challenges and Approaches

WorldMapper: The world as you’ve never seen it before

List of languages with maps showing world-wide distribution:

http://www.worldmapper.org/extraindex/text_language.html

Eg Arabic: [world map of the distribution of Arabic speakers]

Page 4: Machine Translation: Challenges and Approaches

“BBC found similar support”!!!

Page 5: Machine Translation: Challenges and Approaches

Road Map

• Multilingual Challenges for MT

• MT Approaches

• MT Evaluation

Page 6: Machine Translation: Challenges and Approaches

Multilingual Challenges

• Complex Orthography (writing system)
  – Ambiguous spelling, eg Arabic vowels omitted:
    كتب الاولاد اشعارا (unvowelled) vs. كَتَبَ الأَوْلادُ أَشْعَاراً (vowelled) “the boys wrote poems”
  – Ambiguous word boundaries, eg Chinese (no spaces between words)

• Lexical Ambiguity
  – Bank: بنك (financial) vs. ضفة (river)
  – Eat: essen (human) vs. fressen (animal)

Page 7: Machine Translation: Challenges and Approaches

Multilingual Challenges: Morphological Complexity and Variation

• Affixation vs. Root+Pattern

  write – written    كتب – مكتوب
  kill – killed      قتل – مقتول
  do – done          فعل – مفعول

  (English marks the participle with an affix; Arabic interleaves the root
  consonants with the pattern maCCuC.)

• Tokenization: a “word” can be a whole phrase

  والسيارات  w+Al+SyArAt  (conjunction + article + noun + plural)
  English: and the cars
  French: et les voitures

Page 8: Machine Translation: Challenges and Approaches

Translation Divergences: Conflation

Arabic: لست هنا
Gloss: I-am-not here
English: I am not here
(one Arabic word, لست, conflates “am” and “not”)

French: Je ne suis pas ici
Gloss: I not am not here
(French spreads the negation over two words, ne … pas)

[Dependency trees for the three sentences not reproduced.]

Page 9: Machine Translation: Challenges and Approaches

Translation Divergences: Head Swap and Categorial

English: John swam across the river quickly

Spanish: Juan cruzó rápidamente el río nadando
Gloss: John crossed fast the river swimming

Arabic: أسرع جون عبور النهر سباحة
Gloss: sped John crossing the-river swimming

Chinese: 约翰 快速 地 游 过 这 条 河
Gloss: John quickly (DE) swam cross the (Quantifier) river

Russian: Джон быстро переплыл реку
Gloss: John quickly cross-swam river

Page 10: Machine Translation: Challenges and Approaches

Corpus resources for training

Need a corpus (eg web-as-corpus: BootCaT) and a DICTIONARY, plus other NLP resources.

BUT: really need a PARALLEL corpus, with SOURCE and TARGET sentences ALIGNED.

Some languages have few resources, especially non-European languages: Bengali, Amharic, …

WorldMapper: many international languages

Page 11: Machine Translation: Challenges and Approaches

Road Map

• Multilingual Challenges for MT

• MT Approaches

• MT Evaluation

Page 12: Machine Translation: Challenges and Approaches

MT Approaches: MT Pyramid

[Pyramid diagram: source word → source syntax → source meaning ↔ target
meaning → target syntax → target word. Analysis climbs the source side,
generation descends the target side; direct word-to-word translation across
the base is labelled “Gisting”.]

Page 13: Machine Translation: Challenges and Approaches

MT Approaches: Gisting Example

Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología.

Gist (word-for-word): Envelope her basis out speak experiences them settle at 1988 one methodology.

Human translation: On the basis of these experiences, a methodology was arrived at in 1988.
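At its crudest, gisting is one-gloss-per-word dictionary lookup. A toy sketch of that idea (the mini-glossary below is invented so that it reproduces the slide’s output):

    # Toy word-for-word gisting: look each word up in a bilingual glossary,
    # ignoring context, agreement and word order. Glossary invented for
    # illustration; it deliberately picks the (wrong) glosses shown above.
    GLOSSARY = {
        "sobre": "envelope", "la": "her", "base": "basis", "de": "out",
        "dichas": "speak", "experiencias": "experiences", "se": "them",
        "estableció": "settle", "en": "at", "1988": "1988",
        "una": "one", "metodología": "methodology",
    }

    def gist(sentence):
        # One gloss per word; unknown words pass through unchanged.
        return " ".join(GLOSSARY.get(w.strip(".").lower(), w)
                        for w in sentence.split())

    print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
    # -> envelope her basis out speak experiences them settle at 1988 one methodology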

Page 14: Machine Translation: Challenges and Approaches

MT Approaches: MT Pyramid

[Same pyramid diagram, now adding “Transfer”: mapping at the syntax level,
one step above word-level gisting.]

Page 15: Machine Translation: Challenges and Approaches

MT Approaches: Transfer Example

• Transfer Lexicon – Map SL structure to TL structure

[Diagram: the Spanish dependency structure poner(:subj X, :obj mantequilla,
:mod en(:obj Y)) maps to the English structure butter(:subj X, :obj Y).]

X puso mantequilla en Y ↔ X buttered Y
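A toy, string-level stand-in for that mapping (the rule format and matcher are invented for illustration; a real transfer system matches dependency structures like the one above, and the X and Y arguments would themselves be translated recursively):

    # Toy transfer rule: "X puso mantequilla en Y" -> "X buttered Y".
    # String patterns stand in for the structural (tree) matching a real
    # transfer-based system performs.
    import re

    TRANSFER_RULES = [
        (re.compile(r"^(?P<X>\w+) puso mantequilla en (?P<Y>\w+)$"),
         "{X} buttered {Y}"),
    ]

    def transfer(sentence):
        for pattern, template in TRANSFER_RULES:
            m = pattern.match(sentence)
            if m:
                # X and Y are copied through here; a full system would
                # recursively translate the matched subphrases.
                return template.format(**m.groupdict())
        return None  # no rule applies

    print(transfer("Juan puso mantequilla en pan"))  # -> Juan buttered pan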

Page 16: Machine Translation: Challenges and Approaches

MT Approaches: MT Pyramid

[Same pyramid diagram, now adding “Interlingua” at the apex: analysis up to a
language-independent meaning representation, then generation down from it.]

Page 17: Machine Translation: Challenges and Approaches

MT Approaches: Interlingua Example: Lexical Conceptual Structure

[LCS representation diagram not reproduced.]

(Dorr, 1993)

Page 18: Machine Translation: Challenges and Approaches

MT Approaches: MT Pyramid

[The complete pyramid: Gisting at the word level, Transfer at the syntax
level, Interlingua at the meaning level.]

Page 19: Machine Translation: Challenges and Approaches

MT Approaches: MT Pyramid

[Pyramid annotated with the resources each level needs: dictionaries and
parallel corpora at the word level, transfer lexicons at the syntax level,
interlingual lexicons at the meaning level.]

Page 20: Machine Translation: Challenges and Approaches

MT Approaches: Statistical vs. Rule-based

[Pyramid diagram again, contrasting where statistical and rule-based systems
operate along the analysis/generation levels.]

Page 21: Machine Translation: Challenges and Approaches

Statistical MT: Noisy Channel Model

Portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
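The body of this slide is an image in the original. For reference, the standard noisy-channel formulation of statistical MT (a textbook identity, not copied from the slide) is:

    \hat{e} = \arg\max_{e} P(e \mid f)
            = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \;
                           \underbrace{P(e)}_{\text{language model}}

The translation model is learned from a parallel corpus; the language model from monolingual target-language text.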

Page 22: Machine Translation: Challenges and Approaches

Statistical MT: Automatic Word Alignment

• GIZA++
  – A statistical machine translation toolkit used to train word alignments.
  – Uses Expectation-Maximization with various constraints to bootstrap alignments.

[Alignment diagram: English “Mary did not slap the green witch” aligned with
Spanish “Maria no dio una bofetada a la bruja verde”.]

Slide based on Kevin Knight’s http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt
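GIZA++ itself is a full toolkit; as a rough illustration of the EM bootstrapping it builds on, here is a minimal IBM Model 1 sketch over an invented two-sentence toy corpus (not GIZA++’s actual interface):

    # Minimal IBM Model 1 EM sketch: estimate word translation probabilities
    # t(e|f) from sentence-aligned data, with alignments as hidden variables.
    from collections import defaultdict

    corpus = [  # (Spanish, English) toy pairs, invented for illustration
        ("maria no dio una bofetada".split(), "mary did not slap".split()),
        ("maria dio una bofetada".split(), "mary slapped".split()),
    ]

    t = defaultdict(lambda: 1.0)  # uniform (unnormalized) initialization

    for _ in range(10):  # EM iterations
        count = defaultdict(float)
        total = defaultdict(float)
        for f_sent, e_sent in corpus:
            for e in e_sent:
                # E-step: distribute each English word's count over the
                # Spanish words it might align to, in proportion to t(e|f).
                z = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: re-estimate t(e|f) from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]

    # Highest-probability entries; "mary" should now strongly prefer "maria".
    print(sorted(((round(p, 2), e, f) for (e, f), p in t.items()), reverse=True)[:5])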

Page 23: Machine Translation: Challenges and Approaches

Statistical MT: IBM Model (Word-based Model)

http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf

Page 24: Machine Translation: Challenges and Approaches

Phrase-Based Statistical MT

• Foreign input is segmented into phrases
  – a “phrase” is any sequence of words

• Each phrase is probabilistically translated into English
  – P(to the conference | zur Konferenz)
  – P(into the meeting | zur Konferenz)

• Phrases are probabilistically re-ordered

See [Koehn et al., 2003] for an intro.

This is state-of-the-art!

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt
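A minimal sketch of how those phrase probabilities can be estimated by relative frequency over extracted phrase pairs (the toy counts are invented; real phrase tables are built from millions of pairs):

    # Relative-frequency phrase translation probabilities:
    # P(e|f) = count(f, e) / count(f), as in phrase-based SMT.
    from collections import Counter

    phrase_pairs = [  # (foreign, english), toy data
        ("zur Konferenz", "to the conference"),
        ("zur Konferenz", "to the conference"),
        ("zur Konferenz", "into the meeting"),
    ]

    pair_counts = Counter(phrase_pairs)
    f_counts = Counter(f for f, _ in phrase_pairs)

    def phrase_prob(e, f):
        return pair_counts[(f, e)] / f_counts[f]

    print(phrase_prob("to the conference", "zur Konferenz"))  # 2/3
    print(phrase_prob("into the meeting", "zur Konferenz"))   # 1/3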

Page 25: Machine Translation: Challenges and Approaches

Word Alignment Induced Phrases

[Alignment grid: “Mary did not slap the green witch” aligned with “Maria no
dio una bofetada a la bruja verde”.]

(Maria, Mary) (no, did not) (slap, dio una bofetada) (la, the) (bruja, witch) (verde, green)

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

Page 26: Machine Translation: Challenges and Approaches

Word Alignment Induced Phrases (cont.)

[Same alignment grid.]

In addition to the pairs above: (a la, the) (dio una bofetada a, slap the)

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

Page 27: Machine Translation: Challenges and Approaches

Word Alignment Induced Phrases (cont.)

[Same alignment grid.]

Adding longer pairs: (Maria no, Mary did not) (no dio una bofetada, did not slap)
(dio una bofetada a la, slap the) (bruja verde, green witch)

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

Page 28: Machine Translation: Challenges and Approaches

Word Alignment Induced Phrases (cont.)

[Same alignment grid.]

Adding: (Maria no dio una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

Page 29: Machine Translation: Challenges and Approaches

Word Alignment Induced Phrases (cont.)

[Same alignment grid.]

Finally, the whole sentence pair:
(Maria no dio una bofetada a la bruja verde, Mary did not slap the green witch)

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt
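A minimal sketch of the consistency-based phrase extraction these slides walk through, assuming the alignment points suggested by the grid. The exact output depends on how unaligned words (here Spanish “a”) are handled, so this reproduces most, but not necessarily all, of the pairs listed above:

    # Extract all phrase pairs consistent with a word alignment: a source
    # span and the target span it aligns to form a pair iff no alignment
    # point links a word inside the box to a word outside it.
    src = "maria no dio una bofetada a la bruja verde".split()
    tgt = "mary did not slap the green witch".split()
    # (source index, target index) alignment points; "a" (index 5) unaligned.
    align = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (6, 4), (7, 6), (8, 5)}

    phrases = []
    for i1 in range(len(src)):
        for i2 in range(i1, len(src)):
            # Target positions aligned to the source span [i1, i2].
            ts = [t for s, t in align if i1 <= s <= i2]
            if not ts:
                continue  # spans of only unaligned words yield nothing
            j1, j2 = min(ts), max(ts)
            # Consistency: nothing in tgt[j1..j2] aligns outside [i1, i2].
            if all(i1 <= s <= i2 for s, t in align if j1 <= t <= j2):
                phrases.append((" ".join(src[i1:i2 + 1]),
                                " ".join(tgt[j1:j2 + 1])))

    for pair in phrases:
        print(pair)  # (maria, mary), (no, did not), ..., full sentence pair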

Page 30: Machine Translation: Challenges and Approaches

Advantages of Phrase-Based SMT

• Many-to-many mappings can handle non-compositional phrases

• Local context is very useful for disambiguating
  – “interest rate” …
  – “interest in” …

• The more data, the longer the learned phrases
  – Sometimes whole sentences

Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

Page 31: Machine Translation: Challenges and Approaches

MT Approaches: Statistical vs. Rule-based vs. Hybrid

[Pyramid diagram once more, with hybrid approaches combining statistical and
rule-based components across the levels.]

Page 32: Machine Translation: Challenges and Approaches

MT Approaches: Practical Considerations

• Resource availability
  – Parsers and generators
    • Input/output compatibility
  – Translation lexicons
    • Word-based vs. transfer/interlingua
  – Parallel corpora
    • Domain of interest
    • Bigger is better

• Time availability
  – Statistical training, resource building

Page 33: Machine Translation: Challenges and Approaches

Road Map

• Multilingual Challenges for MT

• MT Approaches

• MT Evaluation

Page 34: Machine Translation: Challenges and Approaches

MT Evaluation

• More art than science

• Wide range of metrics/techniques: interface, …, scalability, …, faithfulness, …, space/time complexity, etc.

• Automatic vs. human-based
  – Dumb machines vs. slow humans

Page 35: Machine Translation: Challenges and Approaches

Human-based Evaluation Example: Accuracy Criteria

5 – contents of original sentence conveyed (might need minor corrections)
4 – contents of original sentence conveyed BUT errors in word order
3 – contents of original sentence generally conveyed BUT errors in relationship between phrases, tense, singular/plural, etc.
2 – contents of original sentence not adequately conveyed, portions of original sentence incorrectly translated, missing modifiers
1 – contents of original sentence not conveyed, missing verbs, subjects, objects, phrases or clauses

Page 36: Machine Translation: Challenges and Approaches

Human-based Evaluation Example: Fluency Criteria

5 – clear meaning, good grammar, terminology and sentence structure
4 – clear meaning BUT bad grammar, bad terminology or bad sentence structure
3 – meaning graspable BUT ambiguities due to bad grammar, bad terminology or bad sentence structure
2 – meaning unclear BUT inferable
1 – meaning absolutely unclear

Page 37: Machine Translation: Challenges and Approaches

Automatic Evaluation Example: Bleu Metric

(Papineni et al., 2001)

• Bleu – BiLingual Evaluation Understudy

– Modified n-gram precision with length penalty

– Quick, inexpensive and language independent

– Correlates highly with human evaluation

– Bias against synonyms and inflectional variations

Page 38: Machine Translation: Challenges and Approaches

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately
drab emerald concepts sleep furiously

colorless immature thoughts nap angrily

Automatic Evaluation Example: Bleu Metric

Page 39: Machine Translation: Challenges and Approaches

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately
drab emerald concepts sleep furiously

colorless immature thoughts nap angrily

Unigram precision = 4/5

Automatic Evaluation Example: Bleu Metric

Page 40: Machine Translation: Challenges and Approaches

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately
drab emerald concepts sleep furiously

colorless immature thoughts nap angrily

Unigram precision = 4 / 5 = 0.8
Bigram precision = 2 / 4 = 0.5

Bleu Score = (a1 × a2 × … × an)^(1/n)
           = (0.8 × 0.5)^(1/2) = 0.6325, i.e. 63.25

Automatic Evaluation Example: Bleu Metric
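A minimal sketch of the modified n-gram precision computation behind this worked example (brevity penalty omitted; not the official reference implementation):

    # Bleu-style modified n-gram precision: clip each candidate n-gram's
    # count by its maximum count in any single reference, then combine the
    # per-order precisions by geometric mean.
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_precision(candidate, references, n):
        cand = Counter(ngrams(candidate, n))
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        return clipped / sum(cand.values())

    cand = "colorless green ideas sleep furiously".split()
    refs = [r.split() for r in (
        "all dull jade ideas sleep irately",
        "drab emerald concepts sleep furiously",
        "colorless immature thoughts nap angrily",
    )]

    p1 = modified_precision(cand, refs, 1)   # 4/5 = 0.8
    p2 = modified_precision(cand, refs, 2)   # 2/4 = 0.5
    print((p1 * p2) ** 0.5)                  # 0.6325, as on the slide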

Page 41: Machine Translation: Challenges and Approaches

Summary

• Multilingual challenges for MT:
  – Different writing systems, morphology, segmentation, word order
  – Need parallel corpora, dictionaries, NLP tools

• MT approaches: statistical vs. rule-based
  – Gisting: word-for-word
  – Transfer: source-to-target phrase mapping
  – Interlingua: map to/from a “semantic representation”

• MT evaluation: accuracy, fluency
  – IBM’s Bleu method: count overlaps with reference (human) translations

Page 42: Machine Translation: Challenges and Approaches

Interested in MT??

• Contact [email protected]
• Research courses, projects
• Languages of interest:
  – English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, …
• Topics
  – Statistical, Hybrid MT
    • Phrase-based MT with linguistic extensions
    • Component improvements or full-system improvements
  – MT Evaluation
  – Multilingual computing