Source Language Adaptation for Resource-Poor Machine Translation

Pidong Wang, National University of Singapore

Preslav Nakov, QCRI, Qatar Foundation

Hwee Tou Ng, National University of Singapore

Introduction

EMNLP-CoNLL 2012, July 12, 2012, Jeju, Korea

Overview

Statistical Machine Translation (SMT) systems: need large sentence-aligned bilingual corpora (bi-texts).

Problem: Such training bi-texts do not exist for most languages.

Idea: Adapt a bi-text for a related resource-rich language.

Idea & Motivation

Idea: reuse bi-texts from related resource-rich languages to improve resource-poor SMT

Related languages have
- overlapping vocabulary (cognates), e.g., casa (‘house’) in Spanish, Portuguese
- similar word order and syntax

Resource-rich vs. Resource-poor Languages

Related EU – non-EU languages
- Swedish – Norwegian
- Bulgarian – Macedonian

Related EU languages
- Spanish – Catalan
- Czech – Slovak
- Irish – Scottish Gaelic
- Standard German – Swiss German

Related languages outside Europe
- MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
- Hindi – Urdu
- Turkish – Azerbaijani
- Russian – Ukrainian
- Malay – Indonesian

We will explore two of these pairs: Malay – Indonesian and Bulgarian – Macedonian.

Our main focus: improving Indonesian-English SMT using Malay-English

Malay vs. Indonesian

Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.

Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.

~50% exact word overlap

(from Article 1 of the Universal Declaration of Human Rights)
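The overlap figure can be approximated with a short script. The slides do not define how "exact word overlap" is measured, so this is only a sketch under one assumption: the share of Malay token occurrences that also occur, with multiplicity, in the Indonesian text. The function names are illustrative and not from the talk.

```python
import re
from collections import Counter

MALAY = (
    "Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan "
    "hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah "
    "bertindak di antara satu sama lain dengan semangat persaudaraan."
)
INDONESIAN = (
    "Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang "
    "sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu "
    "sama lain dalam semangat persaudaraan."
)


def tokenize(text):
    """Lowercase and split into word tokens; hyphenated words stay whole."""
    return re.findall(r"[\w-]+", text.lower())


def exact_word_overlap(source, target):
    """Fraction of source token occurrences that also occur in target.

    Assumption: overlap is counted with multiplicity (multiset intersection).
    """
    src = Counter(tokenize(source))
    tgt = Counter(tokenize(target))
    total = sum(src.values())
    if total == 0:
        return 0.0
    shared = sum((src & tgt).values())
    return shared / total


if __name__ == "__main__":
    print(f"Malay vs. Indonesian overlap: {exact_word_overlap(MALAY, INDONESIAN):.2f}")
```

Other definitions (type-level overlap, or overlap measured from the Indonesian side) would give somewhat different numbers.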

Malay Can Look “More Indonesian”…

Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.

Post-edited Malay to look “Indonesian” (by an Indonesian speaker): Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.

~75% exact word overlap

(from Article 1 of the Universal Declaration of Human Rights)

We attempt to do this automatically: adapt Malay to look Indonesian. Then, use it to improve SMT…

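Method at a Glance

[Diagram: Indonesian-English is the resource-poor bi-text; Malay-English is the resource-rich bi-text.]

Step 1: Adaptation. Adapt the Malay side of the Malay-English bi-text to obtain an “Indonesian”-English bi-text.

Step 2: Combination. Combine Indonesian-English + “Indonesian”-English.

Note that we have no Malay-Indonesian bi-text!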

Step 1: Adapting Malay-English to “Indonesian”-English

Word-Level Bi-text Adaptation: Overview

Given a Malay-English sentence pair

1. Adapt the Malay sentence to “Indonesian”
   • Word-level paraphrases
   • Phrase-level paraphrases
   • Cross-lingual morphology

2. Pair the adapted “Indonesian” with the English from the Malay-English sentence pair

Thus, we generate a new “Indonesian”-English sentence pair.

Word-Level Bi-text Adaptation: Overview

Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.

Decode using a large Indonesian LM

Word-Level Bi-text Adaptation: Overview

English: Malaysia’s GDP is expected to reach 8 per cent in 2010.

Pair each adapted “Indonesian” sentence with the English counterpart

Thus, we generate a new “Indonesian”-English bi-text.
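The pairing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `adapt_nbest` stands in for the real decoder (paraphrase options scored with an Indonesian LM), and `toy_adapt_nbest`, its substitution table, and the default `n` are hypothetical placeholders used only to make the example run.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

SentencePair = Tuple[str, str]


def build_adapted_bitext(
    ml_en_pairs: Iterable[SentencePair],
    adapt_nbest: Callable[[str, int], List[str]],
    n: int = 10,  # assumption: the slides do not state the n-best size
) -> Iterator[SentencePair]:
    """Pair every adapted "Indonesian" candidate with the original English."""
    for malay, english in ml_en_pairs:
        for adapted in adapt_nbest(malay, n):
            yield adapted, english


# Hypothetical word substitutions, for illustration only.
TOY_SUBSTITUTIONS = {"KDNK": "PDB", "peratus": "persen"}


def toy_adapt_nbest(malay: str, n: int) -> List[str]:
    """Toy stand-in for the decoder: fully substituted sentence, then the original."""
    substituted = " ".join(TOY_SUBSTITUTIONS.get(w, w) for w in malay.split())
    candidates = list(dict.fromkeys([substituted, malay]))  # de-duplicate, keep order
    return candidates[:n]


if __name__ == "__main__":
    ml_en = [(
        "KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.",
        "Malaysia’s GDP is expected to reach 8 per cent in 2010.",
    )]
    for adapted, english in build_adapted_bitext(ml_en, toy_adapt_nbest, n=2):
        print(adapted, "|||", english)
```

Each Malay sentence contributes up to n new training pairs, all sharing the same English side.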

Word-Level Adaptation: Extracting Paraphrases

Indonesian translations for Malay: pivoting over English

ML-EN bi-text
- Malay sentence: ML1 ML2 ML3 ML4 ML5
- English sentence: EN1 EN2 EN3 EN4

IN-EN bi-text
- English sentence: EN11 EN3 EN12
- Indonesian sentence: IN1 IN2 IN3 IN4

[Formula: paraphrase weights]

Note: we have no Malay-Indonesian bi-text, so we pivot.
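Pivoting over English can be illustrated with a small sketch. The weight formula from the slide did not survive extraction, so the code assumes the common pivot estimate Pr(in | ml) = Σ_en Pr(en | ml) · Pr(in | en); the function name and the toy probabilities are made up for illustration.

```python
from collections import defaultdict
from typing import Dict

ProbTable = Dict[str, Dict[str, float]]


def pivot_paraphrases(p_en_given_ml: ProbTable, p_in_given_en: ProbTable) -> ProbTable:
    """Induce Malay -> Indonesian word options by summing over English pivots.

    Assumed estimate: Pr(in | ml) = sum over en of Pr(en | ml) * Pr(in | en).
    """
    result: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for ml_word, en_probs in p_en_given_ml.items():
        for en_word, p_en in en_probs.items():
            # English words unseen in the IN-EN bi-text contribute nothing.
            for in_word, p_in in p_in_given_en.get(en_word, {}).items():
                result[ml_word][in_word] += p_en * p_in
    return {ml: dict(options) for ml, options in result.items()}


if __name__ == "__main__":
    # Toy lexical translation tables (made-up numbers).
    p_en_given_ml = {"peratus": {"percent": 0.5, "per": 0.5}}
    p_in_given_en = {
        "percent": {"persen": 1.0},
        "per": {"per": 0.5, "persen": 0.5},
    }
    print(pivot_paraphrases(p_en_given_ml, p_in_given_en))
```

In the talk, the resulting options are later disambiguated by decoding with a large Indonesian LM, which matters because the pivot estimate itself ignores context.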

Word-Level Adaptation: Issue 1

IN-EN bi-text is small, thus:
- Unreliable IN-EN word alignments → bad ML-IN paraphrases

Solution: improve IN-EN alignments using the ML-EN bi-text
- concatenate: IN-EN*k + ML-EN
  » k ≈ |ML-EN| / |IN-EN|
- word alignment
- get the alignments for one copy of IN-EN only

Works because of cognates between Malay and Indonesian.

[Diagram: IN-EN is the resource-poor bi-text; ML-EN is the resource-rich bi-text.]
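The balanced concatenation can be sketched in a few lines. This is an illustration under assumptions: corpus size is measured in sentence pairs (the slides do not say whether |·| counts pairs or tokens), k is rounded to the nearest integer, and the word aligner itself is out of scope, so `alignments` is just a list with one entry per sentence pair.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def repetition_factor(n_large: int, n_small: int) -> int:
    """k ≈ |ML-EN| / |IN-EN|, rounded and kept at 1 or more."""
    if n_small <= 0:
        raise ValueError("the small bi-text must not be empty")
    return max(1, round(n_large / n_small))


def balanced_concatenation(in_en: Sequence[T], ml_en: Sequence[T]) -> Tuple[List[T], int]:
    """Return IN-EN repeated k times followed by ML-EN, together with k."""
    k = repetition_factor(len(ml_en), len(in_en))
    return list(in_en) * k + list(ml_en), k


def alignments_for_one_copy(alignments: Sequence[T], n_in_en: int) -> List[T]:
    """Keep the alignments of the first IN-EN copy only."""
    return list(alignments[:n_in_en])


if __name__ == "__main__":
    in_en = ["in-en-1", "in-en-2"]
    ml_en = [f"ml-en-{i}" for i in range(1, 8)]
    corpus, k = balanced_concatenation(in_en, ml_en)
    print(k, len(corpus))
    # Pretend the aligner returned one alignment per sentence pair, in order.
    print(alignments_for_one_copy(corpus, len(in_en)))
```

If sizes were counted in tokens instead, the figures on the Data slide (8.6M for ML2EN, 0.9M for IN2EN-train) would give k = 10 under this rounding.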

Word-Level Adaptation: Issue 2

IN-EN bi-text is small, thus:
- Small IN vocabulary for the ML-IN paraphrases

Solution: add cross-lingual morphological variants
- Given ML word: seperminuman
- Find ML lemma: minum
- Propose all known IN words sharing the same lemma:
  » diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum

Note: The IN variants are from a larger monolingual IN text.
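The variant lookup can be sketched as an index from lemma to known Indonesian word forms. This is only an illustration: the talk does not say which lemmatizer is used, so `TOY_LEMMAS` is a hypothetical lookup table standing in for a real morphological analyzer, and the vocabulary is a small subset of the forms listed on the slide.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {
    "seperminuman": "minum",
    "diminum": "minum",
    "meminum": "minum",
    "minuman": "minum",
    "peminum": "minum",
}


def toy_lemmatize(word: str) -> str:
    """Return the lemma from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


def index_by_lemma(in_vocab: Iterable[str], lemmatize: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Group the Indonesian vocabulary (from monolingual text) by lemma."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in in_vocab:
        index[lemmatize(word)].add(word)
    return dict(index)


def morphological_variants(ml_word: str, lemmatize: Callable[[str], str],
                           in_index: Dict[str, Set[str]]) -> List[str]:
    """Propose all known Indonesian words sharing the Malay word's lemma."""
    return sorted(in_index.get(lemmatize(ml_word), set()))


if __name__ == "__main__":
    in_vocab = ["diminum", "meminum", "minum", "minuman", "peminum", "makan"]
    index = index_by_lemma(in_vocab, toy_lemmatize)
    print(morphological_variants("seperminuman", toy_lemmatize, index))
```

The sketch relies on Malay and Indonesian sharing lemmas, which is plausible for closely related languages but would need checking for other pairs.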

Word-Level Adaptation: Issue 3

Word-level pivoting
- Ignores context, and relies on LM
- Cannot drop/insert/merge/split/reorder words

Solution: phrase-level pivoting
- Build ML-EN and EN-IN phrase tables
- Induce ML-IN phrase table (pivoting over EN)
- Adapt the ML side of ML-EN to get “IN”-EN bi-text:
  » using Indonesian LM and n-best “IN” as before
- Also, use cross-lingual morphological variants

Phrase-level pivoting:
- Models context better: not only Indonesian LM, but also phrases.
- Allows many word operations, e.g., insertion, deletion.

Step 2: Combining IN-EN + “IN”-EN

Combining IN-EN and “IN”-EN bi-texts

- Simple concatenation: IN-EN + “IN”-EN
- Balanced concatenation: IN-EN * k + “IN”-EN
- Sophisticated phrase table combination (Nakov and Ng, EMNLP 2009), (Nakov and Ng, JAIR 2012):
  » Improved word alignments for IN-EN
  » Phrase table combination with extra features

Preslav Nakov, Hwee Tou Ng. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. EMNLP 2009.

Experiments & Evaluation

Data (sizes in tokens)

Translation data (for IN-EN)
- IN2EN-train: 0.9M
- IN2EN-dev: 37K
- IN2EN-test: 37K
- EN-monoling.: 5M

Adaptation data (for ML-EN → “IN”-EN)
- ML2EN: 8.6M
- IN-monoling.: 20M

Isolated Experiments: Training on “IN”-EN only

[Figure: BLEU scores]

System combination using MEMT (Heafield and Lavie, 2010)

Combined Experiments: Training on IN-EN + “IN”-EN

[Figure: BLEU scores]

Experiments: Improvements

[Figure: BLEU scores]

Application to Other Languages & Domains

Improve Macedonian-English SMT by adapting Bulgarian-English bi-text
- Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
- OPUS movie subtitles

[Figure: BLEU scores]

Conclusion

Conclusion & Future Work

Adapt bi-texts for related resource-rich languages, using
- confusion networks
- word-level & phrase-level paraphrasing
- cross-lingual morphological analysis

Achieved:
- +6.7 BLEU over ML2EN
- +2.6 BLEU over IN2EN
- +1.5-3.0 BLEU over comb(IN2EN,ML2EN)

Future work
- add split/merge as word operations
- better integrate word-level and phrase-level methods
- apply our methods to other languages & NLP problems

Thank you!

Supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

Further Analysis

Paraphrasing Non-Indonesian Malay Words Only

So, we do need to paraphrase all words.

Human Judgments

Is the adapted sentence better Indonesian than the original Malay sentence?

100 random sentences

Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.

Reverse Adaptation

Idea: adapt dev/test Indonesian input to “Malay”, then translate with a Malay-English system

Input to SMT:
- “Malay” lattice
- 1-best “Malay” sentence from the lattice

Adapting dev/test is worse than adapting the training bi-text. So, we need both n-best and LM.

Related Work

Related Work (1)

Machine translation between related languages, e.g.:

Cantonese–Mandarin (Zhang, 1998)

Czech–Slovak (Hajic & al., 2000)

Turkish–Crimean Tatar (Altintas & Cicekli, 2002)

Irish–Scottish Gaelic (Scannell, 2006)

Bulgarian–Macedonian (Nakov & Tiedemann, 2012)

We do not translate (no training data), we “adapt”.

Related Work (2)

Adapting dialects to standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
- manual rules

Normalizing Tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011)
- informal text: spelling, abbreviations, slang
- same language

Related Work (3)

Adapt Brazilian to European Portuguese (Marujo & al., 2011)
- rule-based, language-dependent
- tiny improvements for SMT

Reuse bi-texts between related languages (Nakov & Ng, 2009)
- no language adaptation (just transliteration)