57
Machine Translation: Learning without Word Alignments Introduction to Natural Language Processing Computer Science 585—Fall 2009 University of Massachusetts Amherst David Smith With slides from Charles Schafer 1

Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Machine Translation:Learning without Word

Alignments

Introduction to Natural Language ProcessingComputer Science 585—Fall 2009

University of Massachusetts Amherst

David SmithWith slides from Charles Schafer

1

Page 2: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Specialized Translation Models:Named Entities

2

Page 3: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Translating Words in a Sentence

- Models will automatically learn entries in probabilistic translation dictionaries, for instance p(elle|she), from co-occurrences in aligned sentences of a parallel text.

- For some kinds of words/phrases, thisis less effective. For example: numbers dates named entities (NE)The reason: these constitute a large open class of words that will not all occur even in the largest bitext. Plus, there are regularities in translation of numbers/dates/NE.

3

Page 4: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Handling Named Entities

- For many language pairs, and particularlythose which do not share an alphabet, transliteration of person and place names is the desired method of translation.

- General Method: 1. Identify NE’s via classifier 2. Transliterate name 3. Translate/reorder honorifics

- Also useful for alignment. Consider thecase of Inuktitut-English alignment, whereInuktitut renderings of European names are highly nondeterministic.

4

Page 5: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Transliteration

Inuktitut rendering of English names changes the string significantly but not deterministically

5

Page 6: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Transliteration

Inuktitut rendering of English names changes the string significantly but not deterministically

Train a probabilistic finite-state transducer to model this ambiguous transformation

6

Page 7: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Transliteration

Inuktitut rendering of English names changes the string significantly but not deterministically

… Mr. Williams … … mista uialims …

7

Page 8: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Useful Types of Word Analysis

- Number/Date Handling

- Named Entity Tagging/Transliteration

- Morphological Analysis - Analyze a word to its root form (at least for word alignment) was -> is believing -> believe ruminerai -> ruminer ruminiez -> ruminer - As a dimensionality reduction technique - To allow lookup in existing dictionary

8

Page 9: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Learning Word Translation DictionariesUsing Minimal Resources

9

Page 10: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Learning Translation Lexicons for Low-Resource Languages

{Serbian Uzbek Romanian Bengali} English

Problem: Scarce resources . . . –Large parallel texts are very helpful, but often unavailable–Often, no “seed” translation lexicon is available–Neither are resources such as parsers, taggers, thesauri

Solution: Use only monolingual corpora in source, target languages–But use many information sources to propose and rank

translation candidates

10

Page 11: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Bridge Languages

Punjabi

Gujarati

UkrainianSerbian

SloveneBulgarian

Nepali

Marathi

Bengali

HINDI

Polish

Russian

Slovak

CZECH

ENGLISH

Dictionary Intra-family stringtransduction

11

Page 12: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

* Constructing translation candidate sets12

Page 13: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

some cognates

Cognate Selection

Spanish

Galician

RomanianCatalan

Italian

Tasks

13

Page 14: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Arabic

Inuktitut

The Transliteration ProblemTasks

14

Page 15: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

(Ristad & Yianilos 1997)

Memoryless TransducerExample Models for Cognate and Transliteration Matching

15

Page 16: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Two-State Transducer (“Weak Memory”)

Example Models for Cognate and Transliteration Matching

16

Page 17: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Unigram Interlingua TransducerExample Models for Cognate and Transliteration Matching

17

Page 18: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Examples: Possible Cognates Ranked byVarious String Models

Romanian inghiti (ingest)Uzbek avvalgi (previous/former)

* Effectiveness of cognate models18

Page 19: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Uzbek

Turkish

Kazakh

Kyrgyz

Farsi

Russian

ENGLISH

* Multi-family bridge languages19

Page 20: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Similarity Measuresfor re-ranking cognate/transliteration hypotheses

1. Probabilistic string transducers

2. Context similarity

3. Date distribution similarity

4. Similarities based on monolingual word properties

20

Page 21: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Similarity Measures

2. Context similarity

3. Date distribution similarity

4. Similarities based on monolingual word properties

1. Probabilistic string transducers

21

Page 22: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Compare Vectors

independence vectorConstruction of context term vector

3 010 479 836 01911

justice

expression

majesty

sovereignty

country

declaration

ornamental

religion

0 1.52 1.5 1.5 1.540justice

expression

majesty

sovereignty

country

declaration

ornamental

religion

nezavisnost vector Projection of context vector from Serbian toEnglish term space

681 0104 21 4 0141184freedom vectorConstruction of context term vector

Compute cosine similarity between nezavisnost and “independence” … and between nezavisnost and “freedom”

22

Page 23: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Similarity Measures

2. Context similarity

3. Date distribution similarity

4. Similarities based on monolingual word properties

1. Probabilistic string transducers

23

Page 24: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Date Distribution Similarity

• Topical words associated with real-world events appear within news articles in bursts following the date of the event

• Synonymous topical words in different languages, then, display similar distributions across dates in news text: this can be measured

• We use cosine similarity on date term vectors, with term values p(word|date), to quantify this notion of similarity

24

Page 25: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Date Distribution Similarity - Examplep(word|date)

p(word|date)

DATE (200-Day Window)

independence

freedom

nezavisnost

nezavisnost

(correct)

(incorrect)

25

Page 26: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Similarity Measures

2. Context similarity

3. Date distribution similarity

4. Similarities based on monolingual word properties

1. Probabilistic string transducers

26

Page 27: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Relative Frequency

Precedent in Yarowsky & Wicentowski (2000);used relative frequency similarity for morphological analysis

fCF(wF)

|CF|

fCE(wE)

|CE|

rf(wF)=

rf(wE)=

Cross-Language Comparison:

rf(wF) rf(wE)rf(wE) rf(wF)( )min ,

[min-ratio method]

27

Page 28: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Combining Similarities: Uzbek

28

Page 29: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Combining Similarities:Romanian, Serbian, & Bengali

29

Page 30: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Observations

* With no Uzbek-specific supervision,we can produce an Uzbek-Englishdictionary which is 14% exact-match correct

* Or, we can put a correct translation in the top-10 list 34% of the time (useful for end-to-end machine translation or cross-language information retrieval)

* Adding more bridge languageshelps

30

Page 31: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Polylingual Topic Models

31

Page 32: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Automated Analysis of Text

! Previously: analyzing trends in text collections (Hall et al., ’08)

! Monolingual models often work well: collections in English only

! Multilingual text collections are increasingly common

! Automated tools are most important for multilingual collections:

! Don’t know the language ! cannot eyeball the data! Humans typically only know a few languages! New data will appear in other languages

! Simultaneously analyze document content in many languages

Polylingual Topic Models Hanna M. Wallach

32

Page 33: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Multiple Languages

! Most statistical topic models are implicitly monolingual

! Why model multiple languages explicitly?

graph problem rendering algebra und lagraphs problems graphics algebras von desedge optimization image ring die le

vertices algorithm texture rings der duedges programming scene modules im les

! Hodgepodge of English, German, French topics

! Imbalanced corpus: maybe only one or two French topics

Polylingual Topic Models Hanna M. Wallach

33

Page 34: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Parallel vs. Comparable Corpora

! A set of aligned documents is a “document tuple”

! Fully parallel corpora: documents are direct translations

! Corpora with a few parallel “glue” document tuples

! Comparable corpora: documents have similar semantic content

Polylingual Topic Models Hanna M. Wallach

34

Page 35: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Polylingual Topic Model

! Generates a document tuple w = w1, . . . ,wL by drawing...

! For real-world data, only the word tokens are observed

Polylingual Topic Models Hanna M. Wallach

35

Page 36: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Key Characteristics

! A topic is a set of distributions over words, e.g., !t = !1t , . . . ,!

Lt

! Works on aligned document tuples, rather than documents

! Each tuple can consist of only a subset of languages

! Tuple-specific distributions over topics

! Ensure cross-language consistency: e.g., topic 13 in French issemantically similar to topic 13 in English

! Simple, Gibbs sampling inference algorithm

! No more complicated than latent Dirichlet allocation

Polylingual Topic Models Hanna M. Wallach

36

Page 37: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Example Topics (T = 400)

Polylingual Topic Models Hanna M. Wallach

37

Page 38: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Example Topics (T = 400)

Polylingual Topic Models Hanna M. Wallach

38

Page 39: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Example Topics (T = 400)

Polylingual Topic Models Hanna M. Wallach

39

Page 40: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Analysis of Trained Model

! Is the model genuinely learning mixtures of topics?

Polylingual Topic Models Hanna M. Wallach

40

Page 41: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Analysis of Trained Model

! Is the model genuinely learning mixtures of topics?

Polylingual Topic Models Hanna M. Wallach

41

Page 42: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Analysis of Trained Model

! Is the model genuinely learning mixtures of topics?

Polylingual Topic Models Hanna M. Wallach

42

Page 43: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

EuroParl: Analysis of Trained Model

Polylingual Topic Models Hanna M. Wallach

43

Page 44: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Parallel Corpora: “Glue” Tuples

! How many aligned documents are needed to get aligned topics?

topics with 1% “glue” tuples

DE rußland russland russischen tschetschenien sicherheitEN china rights human country s burmaIT ho presidente mi perche relazione votato

topics with 25% “glue” tuples

DE rußland russland russischen tschetschenien ukraineEN russia russian chechnya cooperation region belarusIT russia unione cooperazione cecenia regione russa

Polylingual Topic Models Hanna M. Wallach

44

Page 45: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Generating Bilingual Lexica

! Bilingual lexicon: word pairs (e.g., English word, translation)

! High probability words in di!erent languages for a topic are likely toinclude translations – can use these to generate lexica

! Form candidate translations: Cartesian product of most probable Kwords in English and in each translation language

! No morphological variants: e.g., rules/vorschriften, rule/vorschrift

! Count # of lexicon pairs that are in the candidate set

! Advantages: unsupervised; all words, not just nouns

Polylingual Topic Models Hanna M. Wallach

45

Page 46: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Generating Bilingual Lexica (K = 1)

Polylingual Topic Models Hanna M. Wallach

46

Page 47: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Finding Translations

! Train model on aligned document tuples

! Output: set of polylingual topics, e.g., !t = !1t , . . . ,!

Lt

! Map each test document to the low-dimensional space defined by thepolylingual topics ! document-topic distributions

! For each query/target language pair:

! Compute similarities for all query/target document pairs! For each query document, rank target documents by similarity

! Jensen-Shannon divergence, cosine distance

Polylingual Topic Models Hanna M. Wallach

47

Page 48: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Finding Translations (Jensen-Shannon)

Polylingual Topic Models Hanna M. Wallach

48

Page 49: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Comparable Corpora

! Directly parallel translations are rare, expensive to produce

! Comparable corpora more common: e.g., Wikipedia, web pages

! Our data set: all Wikipedia articles in English, Farsi, Finnish, French,German, Greek, Hebrew, Italian, Polish, Russian, Turkish, Welsh

! Documents are topically similar but not direct translations

! More interesting questions, more real-world applications:

! Do comparable document tuples support alignment of topics?! Do di!erent languages have di!erent perspectives?! Which topics do particular languages focus on?

Polylingual Topic Models Hanna M. Wallach

49

Page 50: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Wikipedia: Example Topics (T = 400)

Polylingual Topic Models Hanna M. Wallach

50

Page 51: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Wikipedia: Example Topics (T = 400)

Polylingual Topic Models Hanna M. Wallach

51

Page 52: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Wikipedia: Example Topics (T = 400)

Polylingual Topic Models Hanna M. Wallach

52

Page 53: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

Topic Divergence Between Languages

! Estimate document-specific distributions over topics

! Compute Jensen-Shannon divergence between documents in a tuple

! Average document-document divergences for each language pair:

! “Disagreement” score for each language pair

! Almost all language pairs have divergences consistent with EuroParldata, even languages that have historically been in conflict

! Although individual articles may have high between-languagedivergence, Wikipedia is on average consistent between languages

Polylingual Topic Models Hanna M. Wallach

53

Page 54: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

54

Page 55: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

world ski km won

54

Page 56: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

actor role television actressworld ski km won

54

Page 57: Machine Translation: Learning without Word Alignmentsdasmith/inlp/lect21-cs585.pdf · -Named Entity Tagging/Transliteration-Morphological Analysis - Analyze a word to its root form

actor role television actressworld ski km won

ottoman empire khan byzantine

54