
Source Language Adaptation for Resource-Poor Machine Translation

Pidong Wang, National University of Singapore

Preslav Nakov, QCRI, Qatar Foundation

Hwee Tou Ng, National University of Singapore

Introduction

EMNLP-CoNLL 2012, July 12, 2012, Jeju, Korea


Overview

Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).

Problem: Such training bi-texts do not exist for most languages.

Idea: Adapt a bi-text for a related resource-rich language.



Idea & Motivation

Idea: reuse bi-texts from related resource-rich languages to improve resource-poor SMT

Related languages have
- overlapping vocabulary (cognates), e.g., casa (‘house’) in Spanish, Portuguese
- similar word order
- similar syntax



Resource-rich vs. Resource-poor Languages

Related EU – non-EU languages
- Swedish – Norwegian
- Bulgarian – Macedonian

Related EU languages
- Spanish – Catalan
- Czech – Slovak
- Irish – Scottish Gaelic
- Standard German – Swiss German

Related languages outside Europe
- MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
- Hindi – Urdu
- Turkish – Azerbaijani
- Russian – Ukrainian
- Malay – Indonesian

We will explore two of these pairs: Malay – Indonesian and Bulgarian – Macedonian.


Our main focus: improving Indonesian-English SMT using Malay-English



Malay vs. Indonesian

Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.

Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.

~50% exact word overlap (from Article 1 of the Universal Declaration of Human Rights)
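The slide does not say how "exact word overlap" is measured. A minimal sketch, assuming it is the fraction of source tokens that also occur in the target sentence after lowercasing and stripping punctuation (the function name and the tokenization are illustrative assumptions):

```python
import string


def exact_word_overlap(source: str, target: str) -> float:
    """Fraction of source tokens that also appear in the target.

    Assumed definition, not taken from the slides.
    """
    def tokenize(text):
        # Crude tokenizer: lowercase, split on whitespace, strip edge punctuation.
        tokens = (t.strip(string.punctuation) for t in text.lower().split())
        return [t for t in tokens if t]

    source_tokens = tokenize(source)
    target_vocab = set(tokenize(target))
    if not source_tokens:
        return 0.0
    shared = sum(1 for t in source_tokens if t in target_vocab)
    return shared / len(source_tokens)


# First four words of each sentence: "semua" and "dilahirkan" are shared.
malay = "Semua manusia dilahirkan bebas"
indonesian = "Semua orang dilahirkan merdeka"
overlap = exact_word_overlap(malay, indonesian)  # 2 of 4 tokens -> 0.5
```

Other definitions (type-level overlap, symmetric measures) would give different numbers on the same sentences.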



Malay Can Look “More Indonesian”…

Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.

“Indonesian” (Malay post-edited by an Indonesian speaker to look Indonesian): Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.

~75% exact word overlap (from Article 1 of the Universal Declaration of Human Rights)

We attempt to do this automatically: adapt Malay to look Indonesian. Then, use it to improve SMT…



Method at a Glance

Starting point: Indonesian-English bi-text (poor) and Malay-English bi-text (rich).

Step 1: Adaptation. Adapt the Malay-English bi-text into an “Indonesian”-English bi-text.

Step 2: Combination. Combine Indonesian-English + “Indonesian”-English.

Note that we have no Malay-Indonesian bi-text!


Step 1: Adapting Malay-English to “Indonesian”-English



Word-Level Bi-text Adaptation: Overview

Given a Malay-English sentence pair:

1. Adapt the Malay sentence to “Indonesian”
   - Word-level paraphrases
   - Phrase-level paraphrases
   - Cross-lingual morphology

2. Pair the adapted “Indonesian” with the English from the Malay-English sentence pair

Thus, we generate a new “Indonesian”-English sentence pair.



Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.

English: Malaysia’s GDP is expected to reach 8 per cent in 2010.

Build a confusion network over the Malay sentence:
- Arcs: Indonesian translations for each Malay word, obtained by pivoting over English
- Weights on the arcs

Decode using a large Indonesian LM.

Pair each adapted sentence with the English counterpart.

Thus, we generate a new “Indonesian”-English bi-text.
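The decoding step can be sketched as a beam search over a confusion network scored by a language model. This is a toy illustration only: the function name, the arc weights, the bigram scores and the candidate Indonesian words are all invented for the example, and a real system would use an SMT decoder with a large n-gram LM.

```python
import heapq
import math


def nbest_decode(network, bigram_logprob, n=3, beam_size=10, unk_logprob=-5.0):
    """Beam search over a confusion network, scored with a toy bigram LM.

    network: list of slots; each slot is a list of (word, weight), weight > 0.
    bigram_logprob: dict (previous_word, word) -> log-probability.
    Returns up to n (score, sentence) pairs, best first.
    """
    beam = [(0.0, ["<s>"])]  # (log-score, words so far)
    for slot in network:
        expanded = []
        for score, words in beam:
            for word, weight in slot:
                lm = bigram_logprob.get((words[-1], word), unk_logprob)
                expanded.append((score + math.log(weight) + lm, words + [word]))
        # Keep only the best partial hypotheses (approximate search).
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return [(score, " ".join(words[1:])) for score, words in beam[:n]]


# Hypothetical network for the Malay example sentence (weights are made up).
network = [
    [("kdnk", 0.5), ("pdb", 0.5)],
    [("malaysia", 1.0)],
    [("dijangka", 0.5), ("diperkirakan", 0.5)],
    [("cecah", 0.5), ("mencapai", 0.5)],
    [("8", 1.0)],
    [("peratus", 0.5), ("persen", 0.5)],
    [("pada", 1.0)],
    [("tahun", 1.0)],
    [("2010", 1.0)],
]
# Toy "Indonesian LM": only bigrams of one fluent sentence are known.
fluent = "<s> pdb malaysia diperkirakan mencapai 8 persen pada tahun 2010".split()
lm = {(a, b): -1.0 for a, b in zip(fluent, fluent[1:])}

nbest = nbest_decode(network, lm, n=3)
english = "Malaysia's GDP is expected to reach 8 per cent in 2010."
# Pair each adapted sentence with the English side of the original pair.
adapted_pairs = [(sentence, english) for _, sentence in nbest]
```

With these toy scores the top hypothesis is the sentence whose bigrams the LM knows; the lower-ranked hypotheses still contain some Malay words, which is why the n-best list rather than only the 1-best is paired with the English side.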


Word-Level Adaptation: Extracting Paraphrases

ML-EN bi-text:
  Malay sentence:      ML1 ML2 ML3 ML4 ML5
  English sentence:    EN1 EN2 EN3 EN4

IN-EN bi-text:
  English sentence:    EN11 EN3 EN12
  Indonesian sentence: IN1 IN2 IN3 IN4

Note: we have no Malay-Indonesian bi-text, so we pivot.
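The slides do not give the pivoting formula. A common choice, assumed here, is to marginalize over the English pivot word: p(in | ml) = Σ_en p(in | en) · p(en | ml). A minimal sketch with invented toy probabilities (the function name and the numbers are illustrative):

```python
from collections import defaultdict


def pivot_paraphrases(p_en_given_ml, p_in_given_en):
    """Induce p(in | ml) by summing over shared English pivot words.

    p_en_given_ml: dict ml_word -> {en_word: prob}
    p_in_given_en: dict en_word -> {in_word: prob}
    """
    table = {}
    for ml_word, en_dist in p_en_given_ml.items():
        scores = defaultdict(float)
        for en_word, p_en in en_dist.items():
            # English words unseen in the IN-EN bi-text contribute nothing.
            for in_word, p_in in p_in_given_en.get(en_word, {}).items():
                scores[in_word] += p_in * p_en
        table[ml_word] = dict(scores)
    return table


# Toy lexical translation probabilities (made up for illustration).
p_en_given_ml = {"peratus": {"percent": 0.8, "per": 0.2}}
p_in_given_en = {
    "percent": {"persen": 0.9, "peratus": 0.1},
    "per": {"per": 1.0},
}
paraphrases = pivot_paraphrases(p_en_given_ml, p_in_given_en)
```

In practice the input distributions would be estimated from word alignments of the two bi-texts, so their quality depends directly on the alignment quality.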



Word-Level Adaptation: Issue 1

IN-EN bi-text is small, thus:
- Unreliable IN-EN word alignments → bad ML-IN paraphrases

Solution: improve IN-EN alignments using the ML-EN bi-text
- concatenate: IN-EN * k + ML-EN, where k ≈ |ML-EN| / |IN-EN|
- run word alignment
- get the alignments for one copy of IN-EN only

Works because of cognates between Malay and Indonesian.
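The concatenation recipe can be sketched as follows. It assumes bi-text sizes are counted in sentence pairs (the slide does not say whether |·| means sentences or tokens) and that the aligner returns one alignment per input pair, in input order; both function names are hypothetical.

```python
def balanced_concatenation(small_bitext, large_bitext):
    """Repeat the small bi-text k times, then append the large one.

    k is the rounded size ratio, at least 1 (sizes in sentence pairs: an assumption).
    """
    k = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * k + large_bitext, k


def alignments_for_one_copy(alignments, small_size):
    """Keep only the alignments of the first copy of the small bi-text."""
    return alignments[:small_size]


# Toy bi-texts: lists of (source_sentence, english_sentence) pairs.
in_en = [("in 1", "en 1"), ("in 2", "en 2")]
ml_en = [(f"ml {i}", f"en {i}") for i in range(6)]

combined, k = balanced_concatenation(in_en, ml_en)
# Stand-in for the word aligner's output: one entry per sentence pair.
fake_alignments = [f"alignment {i}" for i in range(len(combined))]
in_en_alignments = alignments_for_one_copy(fake_alignments, len(in_en))
```

Repeating the small bi-text keeps it from being swamped by the large one during alignment, while its alignments are read off only once so that counts are not inflated.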



IN-EN bi-text is small, thus:
- Small IN vocabulary for the ML-IN paraphrases

Solution: add cross-lingual morphological variants
- Given ML word: seperminuman
- Find ML lemma: minum
- Propose all known IN words sharing the same lemma:
  diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum

Note: The IN variants are from a larger monolingual IN text.
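The lookup above can be sketched as an index from lemma to known Indonesian word forms. The lemmatizer here is a hand-written toy dictionary standing in for a real morphological analyzer, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Toy lemma dictionary (hypothetical stand-in for a morphological analyzer).
TOY_LEMMAS = {
    "seperminuman": "minum",
    "diminum": "minum",
    "meminum": "minum",
    "minuman": "minum",
    "peminum": "minum",
    "minum": "minum",
    "makanan": "makan",
}


def lemmatize(word):
    """Return the toy lemma, or the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)


def build_lemma_index(vocabulary):
    """Map each lemma to the set of known word forms that share it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[lemmatize(word)].add(word)
    return index


def propose_variants(ml_word, in_index):
    """Propose all known IN words sharing the lemma of the ML word."""
    return sorted(in_index.get(lemmatize(ml_word), set()))


# Vocabulary that would come from a large monolingual Indonesian text.
in_vocabulary = ["diminum", "meminum", "minuman", "peminum", "minum", "makanan"]
in_index = build_lemma_index(in_vocabulary)
variants = propose_variants("seperminuman", in_index)
```

Because the index is built from monolingual text, it can propose Indonesian forms that never occur in the small IN-EN bi-text.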



Word-level pivoting
- Ignores context, and relies on LM
- Cannot drop/insert/merge/split/reorder words

Solution: phrase-level pivoting
- Build ML-EN and EN-IN phrase tables
- Induce ML-IN phrase table (pivoting over EN)
- Adapt the ML side of ML-EN to get “IN”-EN bi-text, using the Indonesian LM and n-best “IN” as before
- Also, use cross-lingual morphological variants

Phrase-level pivoting:
- Models context better: not only Indonesian LM, but also phrases.
- Allows many word operations, e.g., insertion, deletion.


Step 2: Combining IN-EN + “IN”-EN



Combining IN-EN and “IN”-EN bi-texts
- Simple concatenation: IN-EN + “IN”-EN
- Balanced concatenation: IN-EN * k + “IN”-EN
- Sophisticated phrase table combination (Nakov and Ng, EMNLP 2009; Nakov and Ng, JAIR 2012):
  - Improved word alignments for IN-EN
  - Phrase table combination with extra features

Reference: Preslav Nakov, Hwee Tou Ng. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. EMNLP 2009.
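The slides do not spell out what the "extra features" are. One plausible reading, assumed here, is an indicator feature that records which phrase table an entry came from, with the original IN-EN table taking precedence on conflicts. A rough sketch (the function name, the table layout and the feature values are illustrative, not the cited method):

```python
def combine_phrase_tables(primary, secondary):
    """Merge two phrase tables, preferring entries from the primary table.

    Each table maps (source_phrase, target_phrase) -> list of feature scores.
    One origin feature is appended: 1.0 for primary entries, 0.0 for entries
    found only in the secondary table (placeholder values).
    """
    combined = {}
    for pair, features in secondary.items():
        combined[pair] = list(features) + [0.0]
    # Primary entries overwrite secondary ones for the same phrase pair.
    for pair, features in primary.items():
        combined[pair] = list(features) + [1.0]
    return combined


# Toy phrase tables with a single translation probability per entry.
in_en_table = {("8 persen", "8 per cent"): [0.7]}
adapted_table = {
    ("8 persen", "8 per cent"): [0.5],
    ("pdb malaysia", "malaysia 's gdp"): [0.6],
}
merged = combine_phrase_tables(in_en_table, adapted_table)
```

The origin feature lets tuning learn how much to trust phrase pairs that exist only in the adapted bi-text.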


Experiments & Evaluation



Data (sizes in tokens)

Translation data (for IN-EN)
- IN2EN-train: 0.9M
- IN2EN-dev: 37K
- IN2EN-test: 37K
- EN-monoling.: 5M

Adaptation data (for ML-EN → “IN”-EN)
- ML2EN: 8.6M
- IN-monoling.: 20M



Isolated Experiments: Training on “IN”-EN only

[Figure: BLEU scores; system combination using MEMT (Heafield and Lavie, 2010)]


Combined Experiments: Training on IN-EN + “IN”-EN

[Figure: BLEU scores]



Experiments: Improvements

[Figure: BLEU scores]



Application to Other Languages & Domains

Improve Macedonian-English SMT by adapting Bulgarian-English bi-text
- Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
- OPUS movie subtitles

[Figure: BLEU scores]



Conclusion



Conclusion & Future Work

Adapt bi-texts for related resource-rich languages, using
- confusion networks
- word-level & phrase-level paraphrasing
- cross-lingual morphological analysis

Achieved:
- +6.7 BLEU over ML2EN
- +2.6 BLEU over IN2EN
- +1.5-3.0 BLEU over comb(IN2EN, ML2EN)

Future work
- add split/merge as word operations
- better integrate word-level and phrase-level methods
- apply our methods to other languages & NLP problems

Thank you!

Supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.



Further Analysis



Paraphrasing Non-Indonesian Malay Words Only

So, we do need to paraphrase all words.



Human Judgments

Is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)

Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.



Reverse Adaptation

Idea: adapt dev/test Indonesian input to “Malay”, then translate with a Malay-English system.

Input to SMT:
- “Malay” lattice
- 1-best “Malay” sentence from the lattice

Adapting dev/test is worse than adapting the training bi-text. So, we need both n-best and LM.



Related Work



Related Work (1)

Machine translation between related languages, e.g.:

Cantonese–Mandarin (Zhang, 1998)

Czech–Slovak (Hajic & al., 2000)

Turkish–Crimean Tatar (Altintas & Cicekli, 2002)

Irish–Scottish Gaelic (Scannell, 2006)

Bulgarian–Macedonian (Nakov & Tiedemann, 2012)

We do not translate (no training data), we “adapt”.



Related Work (2)

Adapting dialects to standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
- manual rules

Normalizing Tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011)
- informal text: spelling, abbreviations, slang
- same language



Related Work (3)

Adapt Brazilian to European Portuguese (Marujo & al., 2011)
- rule-based, language-dependent
- tiny improvements for SMT

Reuse bi-texts between related languages (Nakov & Ng, 2009)
- no language adaptation (just transliteration)
