
Using Comparable Corpora to Adapt a Translation Model to Domains

Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada

Department of Computer Science, Shizuoka University

The 7th International Conference on Language Resources and Evaluation, Malta, May 2010.

Overview

1. Motivation and goal
2. Proposed method
   a. Estimating noun translation pseudo-probabilities
   b. Estimating noun-sequence translation pseudo-probabilities
   c. Phrase-based SMT using translation pseudo-probabilities
3. Experiments
4. Discussion
5. Related work
6. Summary

Statistical machine translation
– Able to learn a translation model from a parallel corpus
– Suffers from the limited availability of large parallel corpora

Use comparable corpora for SMT
– Estimate translation pseudo-probabilities from a bilingual dictionary and comparable corpora
– Use the pseudo-probabilities estimated from in-domain comparable corpora to
  • Adapt a translation model learned from an out-of-domain parallel corpus, or
  • Augment a translation model learned from a small in-domain parallel corpus

Motivation and goal


Basic idea for estimating word translation pseudo-probabilities from comparable corpora

Word associations suggest particular senses or translations of a polysemous word (Yarowsky 1993)
– (tank, soldier) → the “military vehicle” sense, or the translation “戦車 [SENSHA]”, of “tank”
– (tank, gasoline) → the “container for liquid or gas” sense, or the translation “タンク [TANKU]”, of “tank”

Comparable corpora allow us to determine which word associations suggest which translations of a polysemous word (Kaji & Morimoto 2002)

Assumption: the more word associations suggest a translation, the higher the probability of that translation

Naive method for estimating word translation pseudo-probabilities

[Figure: processing flow]
Inputs: English corpus, Japanese corpus, English-Japanese dictionary
1. Extract word associations from each corpus
2. Align word associations via the dictionary
3. Calculate the percentage of associated words suggesting each translation

Aligned word associations
(tank, soldier) ↔ (戦車 [SENSHA], 兵士 [HEISHI])
(tank, missile) ↔ (戦車 [SENSHA], ミサイル [MISAIRU])
(tank, gasoline) ↔ (タンク [TANKU], ガソリン [GASORIN])
(tank, fuel) ↔ (タンク [TANKU], 燃料 [NENRYOU])

“Missile”, “soldier”, and others suggest “戦車 [SENSHA]”
“Fuel”, “gasoline”, and others suggest “タンク [TANKU]”

Pps(戦車 [SENSHA] | tank) = |{missile, soldier, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)

Pps(タンク [TANKU] | tank) = |{fuel, gasoline, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)
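The two ratios above can be illustrated with a minimal Python sketch. It assumes the alignment step has already produced a mapping from each associated word to the single translation it suggests; the function name and the four-word toy mapping are hypothetical and only mirror the words named on the slide.

```python
from collections import Counter


def naive_pseudo_probabilities(suggestions):
    """Share of associated words that suggest each translation.

    `suggestions` maps an associated word to the translation it suggests
    (an assumed input format, not taken from the slides).
    """
    counts = Counter(suggestions.values())
    total = sum(counts.values())
    return {translation: n / total for translation, n in counts.items()}


# Toy data: only the four associated words named on the slide.
suggestions = {
    "missile": "戦車",
    "soldier": "戦車",
    "fuel": "タンク",
    "gasoline": "タンク",
}

probs = naive_pseudo_probabilities(suggestions)
print(probs)
```

With two associated words on each side, both translations of “tank” receive a pseudo-probability of 0.5; in a real corpus the sets would be much larger and unbalanced.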

Difficulties the naive method suffers from

Failure in word-association alignment
– (tank, Chechen) ↔ ? : due to the disparity in topical coverage between the two language corpora
– (tank, Chechen) ↔ ? (戦車 [SENSHA], チェチェン [CHECHEN]) : due to the incomplete coverage of the intermediary bilingual dictionary

Incorrect word-association alignment
– (tank, troop) ↔ (水槽 [SUISOU], 群れ [MURE]) : due to incidental word-for-word correspondence between word associations that do not really correspond to each other

How to overcome the difficulties

Two words associated with a third word are likely to suggest the same sense or translation of the third word when they are also associated with each other
– “Soldier” and “troop”, both of which are associated with “tank”, are associated with each other
– Hence “soldier” and “troop” suggest the same translation “戦車 [SENSHA]”

Define a correlation between an associated word and a translation using the correlations between other associated words and the translation
– C(troop, 戦車 [SENSHA]) ∝ MI(troop, tank) × {MI(troop, soldier) × C(soldier, 戦車 [SENSHA]) + MI(troop, missile) × C(missile, 戦車 [SENSHA]) + …}
– C(troop, タンク [TANKU]) ∝ MI(troop, tank) × {MI(troop, soldier) × C(soldier, タンク [TANKU]) + MI(troop, missile) × C(missile, タンク [TANKU]) + …}

Calculate the correlations iteratively, starting with the initial values determined according to the results of word-association alignment via a bilingual dictionary

• Alignment
[Figure: the associated words of “tank” (Chechen, fuel, gasoline, missile, soldier, troop) linked to the translations 戦車 [SENSHA] and タンク [TANKU]]

• C0(associated_word, translation)

tank        戦車 [SENSHA]   タンク [TANKU]
Chechen     0.0             0.0
fuel        0.0             1.0
gasoline    0.0             1.0
missile     1.0             0.0
soldier     1.0             0.0
troop       0.5             0.5
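The iteration can be sketched in Python as follows. This is only an illustration of the idea, not the authors' implementation: the initial matrix is the C0 table for “tank”, but all mutual-information values are invented, the per-word normalisation (dividing by the best-supported translation) is an assumption made so that values stay bounded, and the function and variable names are hypothetical.

```python
SENSHA, TANKU = "戦車", "タンク"


def iterate_correlations(c0, mi_target, mi_pair, n_iter=3):
    """Refine C(associated_word, translation) starting from c0."""
    c = {a: dict(row) for a, row in c0.items()}
    for _ in range(n_iter):
        updated = {}
        for a, row in c.items():
            # Support for each translation from the other associated words,
            # weighted by how strongly they are associated with `a`.
            support = {
                t: sum(mi_pair.get(frozenset((a, b)), 0.0) * c[b][t]
                       for b in c if b != a)
                for t in row
            }
            top = max(support.values())
            # Assumed normalisation: scale by the best-supported translation.
            updated[a] = {
                t: (mi_target[a] * s / top if top > 0 else 0.0)
                for t, s in support.items()
            }
        c = updated
    return c


# Initial values C0 from the alignment step (as in the table for "tank").
c0 = {
    "Chechen": {SENSHA: 0.0, TANKU: 0.0},
    "fuel": {SENSHA: 0.0, TANKU: 1.0},
    "gasoline": {SENSHA: 0.0, TANKU: 1.0},
    "missile": {SENSHA: 1.0, TANKU: 0.0},
    "soldier": {SENSHA: 1.0, TANKU: 0.0},
    "troop": {SENSHA: 0.5, TANKU: 0.5},
}
# Hypothetical MI(associated_word, tank) values.
mi_target = {"Chechen": 1.0, "fuel": 2.0, "gasoline": 2.0,
             "missile": 1.5, "soldier": 1.8, "troop": 1.2}
# Hypothetical MI values between associated words (symmetric pairs).
mi_pair = {
    frozenset(("fuel", "gasoline")): 2.0,
    frozenset(("missile", "soldier")): 1.5,
    frozenset(("soldier", "troop")): 2.0,
    frozenset(("missile", "troop")): 1.0,
    frozenset(("Chechen", "soldier")): 1.0,
    frozenset(("Chechen", "troop")): 1.0,
}

c = iterate_correlations(c0, mi_target, mi_pair)
for word, row in c.items():
    print(word, row)
```

In this toy setting “Chechen”, which could not be aligned, and “troop”, which was aligned ambiguously, both end up more strongly correlated with 戦車 [SENSHA], because the words they co-occur with (soldier, missile) already point to that translation.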

Overview of our method for estimating noun translation pseudo-probabilities

[Figure: processing flow]
Inputs: English corpus, Japanese corpus, English-Japanese dictionary
1. For each corpus, extract pairs of words co-occurring in a window* and calculate pointwise mutual information → English word associations, Japanese word associations
2. Align the word associations via the dictionary → initial value of the correlation matrix of English associated words vs. Japanese translations for an English noun
3. Calculate pairwise correlations between associated words and translations iteratively → correlation matrix of associated words vs. translations
4. Assign each associated word to the translation with which it has the highest correlation, and calculate the percentage of associated words assigned to each translation → noun translation pseudo-probabilities

* Window size = 10 content words
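The first step of the flow (window co-occurrence followed by pointwise mutual information) might look like the Python sketch below. It assumes the corpus has already been reduced to a flat list of content words, interprets the window as a span of `window` consecutive tokens, and estimates probabilities by simple relative frequency; these choices, the function name, and the six-token toy corpus are assumptions, and the frequency and PMI thresholds used in the actual method are not applied.

```python
import math
from collections import Counter


def pmi_from_window(tokens, window=10):
    """PMI for unordered word pairs co-occurring within `window` tokens."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, w in enumerate(tokens):
        # Pair w with the following words that fall inside the window.
        for v in tokens[i + 1:i + window]:
            if v != w:
                pair_freq[frozenset((w, v))] += 1

    n_words = len(tokens)
    n_pairs = sum(pair_freq.values())
    pmi = {}
    for pair, f in pair_freq.items():
        x, y = tuple(pair)
        p_xy = f / n_pairs
        p_x = word_freq[x] / n_words
        p_y = word_freq[y] / n_words
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Toy corpus of content words; a window of 3 keeps the example readable.
tokens = ["tank", "soldier", "missile", "tank", "fuel", "gasoline"]
pmi = pmi_from_window(tokens, window=3)
for pair, value in sorted(pmi.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), round(value, 2))
```

Pairs are stored as `frozenset` objects so that (tank, soldier) and (soldier, tank) count as the same association.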

Example correlation matrix and estimated noun translation pseudo-probabilities

Translations of “plant”: 装置 [SOUCHI], 設備 [SETSUBI], 植物 [SHOKUBUTSU], 工場 [KOUJOU], プラント [PURANTO], 苗 [NAE], 植木 [UEKI]

plant         SOUCHI  SETSUBI  SHOKUBUTSU  KOUJOU  PURANTO  NAE    UEKI
activity      0.02    0.03     2.10        0.20    0.03     0.01   0.02
bacteria      0.02    0.03     1.98        0.01    0.02     0.27   0.02
boiler        0.05    2.70     0.05        0.03    2.73     0.03   0.04
coal          0.87    2.35     1.70        0.68    2.06     0.65   0.99
computer      0.55    0.71     0.02        0.49    0.73     0.01   0.01
control       0.47    0.51     0.17        0.15    0.62     0.06   0.01
culture       0.03    0.05     3.26        0.23    0.12     0.77   0.88
environment   0.76    1.25     1.32        0.03    0.05     0.23   0.03
failure       0.93    1.22     0.03        0.53    1.43     0.01   0.01
flower        0.04    0.06     4.02        0.04    0.04     1.23   1.70
:             :       :        :           :       :        :      :

Translation pseudo-probabilities
              .047    .241     .423        .022    .223     .022   .022
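The final step, assigning each associated word to its highest-correlation translation and taking percentages, can be sketched in Python using only the ten rows shown above. Because the full matrix has more rows, the result necessarily differs from the pseudo-probabilities reported on the slide; the slide also gives non-zero values to every translation, which this plain counting sketch does not attempt to reproduce. The function and constant names are hypothetical.

```python
from collections import Counter

TRANSLATIONS = ["装置", "設備", "植物", "工場", "プラント", "苗", "植木"]

# Only the ten rows visible in the example (the full matrix is larger).
CORRELATIONS = {
    "activity":    [0.02, 0.03, 2.10, 0.20, 0.03, 0.01, 0.02],
    "bacteria":    [0.02, 0.03, 1.98, 0.01, 0.02, 0.27, 0.02],
    "boiler":      [0.05, 2.70, 0.05, 0.03, 2.73, 0.03, 0.04],
    "coal":        [0.87, 2.35, 1.70, 0.68, 2.06, 0.65, 0.99],
    "computer":    [0.55, 0.71, 0.02, 0.49, 0.73, 0.01, 0.01],
    "control":     [0.47, 0.51, 0.17, 0.15, 0.62, 0.06, 0.01],
    "culture":     [0.03, 0.05, 3.26, 0.23, 0.12, 0.77, 0.88],
    "environment": [0.76, 1.25, 1.32, 0.03, 0.05, 0.23, 0.03],
    "failure":     [0.93, 1.22, 0.03, 0.53, 1.43, 0.01, 0.01],
    "flower":      [0.04, 0.06, 4.02, 0.04, 0.04, 1.23, 1.70],
}


def pseudo_probabilities(correlations, translations):
    """Fraction of associated words assigned to each translation."""
    votes = Counter()
    for row in correlations.values():
        # Index of the translation with the highest correlation in this row.
        best = max(range(len(row)), key=row.__getitem__)
        votes[translations[best]] += 1
    total = sum(votes.values())
    return {t: votes[t] / total for t in translations}


probs = pseudo_probabilities(CORRELATIONS, TRANSLATIONS)
for translation, p in probs.items():
    print(translation, p)
```

On these ten rows, five words are assigned to 植物, four to プラント, and one (coal) to 設備, so the ordering of the top translations agrees with the slide even though the exact values do not.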


Our method for estimating noun-sequence translation pseudo-probabilities

[Figure: processing flow]
Inputs: English corpus, Japanese corpus, English-Japanese dictionary
1. Extract a noun sequence F = f1 f2 … fm with its frequency
2. Generate all compositional translations E(1) = e1(1) e2(1) … em(1), E(2) = e1(2) e2(2) … em(2), …, E(n) = e1(n) e2(n) … em(n)
3. Retrieve the compositional translations and count their frequencies
4. Estimate P1(E(j)|F) according to the occurrence frequencies g(E(j)) of the compositional translations
5. Estimate P2(E(j)|F) according to the constituent-word translation pseudo-probabilities P(ei(j)|fi)
6. Combine the two estimates into Pps(E(j)|F)
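A rough Python sketch of this flow is given below. The slide's exact formulas are not reproduced here, so several details are assumptions: the frequency-based estimate is taken to be a relative frequency, the constituent-based estimate is taken to be a normalised product of word pseudo-probabilities, the two are combined by a renormalised product, and candidates that never occur in the target corpus are discarded. All names and the toy data are hypothetical.

```python
from itertools import product
from math import prod


def compositional_candidates(source_nouns, word_probs):
    """Yield (candidate, score) for every word-by-word translation.

    The score is the product of the constituent pseudo-probabilities.
    """
    options = [list(word_probs[f].items()) for f in source_nouns]
    for combo in product(*options):
        words = tuple(w for w, _ in combo)
        yield words, prod(p for _, p in combo)


def noun_sequence_probs(source_nouns, word_probs, target_freq):
    candidates = dict(compositional_candidates(source_nouns, word_probs))
    # Assumption: keep only candidates attested in the target corpus.
    attested = {e: s for e, s in candidates.items() if target_freq.get(e, 0) > 0}
    if not attested:
        return {}
    total_freq = sum(target_freq[e] for e in attested)
    total_score = sum(attested.values())
    p1 = {e: target_freq[e] / total_freq for e in attested}    # frequency-based
    p2 = {e: s / total_score for e, s in attested.items()}      # constituent-based
    # Assumed combination: product of the two estimates, renormalised.
    combined = {e: p1[e] * p2[e] for e in attested}
    z = sum(combined.values())
    return {e: v / z for e, v in combined.items()}


# Toy data: a two-noun Japanese sequence and invented statistics.
word_probs = {
    "燃料": {"fuel": 1.0},
    "タンク": {"tank": 0.7, "reservoir": 0.3},
}
target_freq = {("fuel", "tank"): 8, ("fuel", "reservoir"): 2}

result = noun_sequence_probs(("燃料", "タンク"), word_probs, target_freq)
for candidate, p in result.items():
    print(" ".join(candidate), round(p, 3))
```

Other ways of combining the two estimates (for example, interpolation) would fit the same skeleton by changing only the `combined` line.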


Phrase-based SMT using translation pseudo-probabilities

[Figure: system configuration]
– Out-of-domain (or in-domain) parallel corpus → Giza++ & heuristics → basic phrase table
– Bilingual dictionary, in-domain source-language corpus, and in-domain target-language corpus → estimate translation pseudo-probabilities → in-domain phrase table (pseudo-probabilities)
– Basic phrase table and in-domain phrase table → merge → adapted (or augmented) phrase table
– In-domain target-language corpus → SRILM → in-domain language model
– Source language text → Moses decoder, using the adapted (or augmented) phrase table and the in-domain language model → target language text


Experimental setting

Experiment A: Adapt a phrase table learned from an out-of-domain parallel corpus by using in-domain comparable corpora
Experiment B: Augment a phrase table learned from a small in-domain parallel corpus by using larger in-domain comparable corpora

Training parallel corpus
– Experiment A: 20,000 pairs of Japanese and English patent abstracts in the physics domain
– Experiment B: 20,000 pairs of highly similar Japanese and English sentences extracted from scientific-paper abstracts in the chemistry domain

Training comparable corpora (both experiments): scientific-paper abstracts in the chemistry domain
– Japanese: 151,958 abstracts (90.8 Mbytes)
– English: 102,730 abstracts (64.9 Mbytes)

Test corpus (both experiments): 1,000 Japanese sentences, each having one reference English translation, from scientific-paper abstracts in the chemistry domain

Bilingual dictionary (both experiments): 333,656 pairs of translation equivalents between 163,247 Japanese and 93,727 English nouns from the EDR, EIJIRO, and EDICT dictionaries

Our method in four cases using different volumes of comparable corpora
1. Japanese: all, English: all
2. Japanese: half, English: all
3. Japanese: all, English: half
4. Japanese: half, English: half

Two baseline methods using the phrase table learned from the parallel corpus
1. Baseline without dictionary
2. Baseline with dictionary: the phrase table was augmented with the bilingual dictionary

[Note] The TL language model learned from the whole TL monolingual corpus was used in all cases, for both our method and the baseline methods

Evaluation metric: BLEU-4

Experimental results

BLEU-4 score

Method                          Experiment A   Experiment B
Our method    J:all,  E:all     13.30          16.82
              J:half, E:all     13.19          16.70
              J:all,  E:half    13.21          16.78
              J:half, E:half    13.27          16.71
Baseline w/o dictionary         11.42          16.37
Baseline w/ dictionary          12.94          16.32

– Our method slightly improved the BLEU score
– The effect of the difference in volume of comparable corpora remains unclear
– Simply adding a bilingual dictionary improved the out-of-domain phrase table, but did not improve the in-domain phrase table


Discussions

1. Optimization of the parameters
– Parameters, including the window size and the thresholds for word occurrence frequency, co-occurrence frequency, and pointwise mutual information, affect the correlation matrix of associated words vs. translations
– How to optimize the values of the parameters remains unsolved

2. Alternatives for the word-association measure
– Pointwise mutual information, which tends to overestimate low-frequency words, is not the most suitable measure for acquiring word associations
– Need to compare with alternatives such as the log-likelihood ratio and the Dice coefficient

3. Refinement of the definition of translation pseudo-probability
– Need to consider the frequencies of associated words as well as the dependence among associated words
– Need to reconsider the strategy of assigning an associated word to only one translation

4. Estimation of verb translation pseudo-probabilities
– Need to use syntactic co-occurrence, instead of co-occurrence in a window, to extract verb-noun associations from corpora
– Need to define the pairwise correlation between associated nouns and translations recursively, based on the heuristic that two nouns associated with a verb are likely to suggest the same sense of the verb when they belong to the same semantic class


Related work

Many studies on bilingual lexicon acquisition from bilingual comparable corpora have been reported since the mid-1990s, but few have addressed the estimation of word translation probabilities from bilingual comparable corpora

Estimates of word translation probabilities obtained from comparable corpora with an EM algorithm (Koehn & Knight 2000) could be greatly affected by the occurrence frequencies of translation candidates in the TL corpus

In contrast, our method produces translation pseudo-probabilities that reflect the distribution of the senses of the SL word in the SL corpus

Methods for extracting parallel sentence pairs from bilingual comparable corpora (Zhao & Vogel 2002; Utiyama & Isahara 2003; Fung & Cheung 2004; Munteanu & Marcu 2005): the extracted parallel sentences could be used to learn a translation model with a conventional method based on word-for-word alignment, but this approach is applicable only to closely comparable corpora

In contrast, our method is applicable even to a pair of unrelated monolingual corpora


Summary

A method for estimating translation pseudo-probabilities from a bilingual dictionary and bilingual comparable corpora was created
– Assumption: The more associated words a translation is correlated with, the higher its translation probability
– Essence of the method: Calculate pairwise correlations between associated words of an SL word and its TL translations

A phrase-based SMT framework using an out-of-domain parallel corpus and in-domain comparable corpora was proposed
– An experiment showed promising results; the BLEU score was improved by using the translation pseudo-probabilities estimated from in-domain comparable corpora

Future work includes optimizing the parameters and extending the method to estimate translation pseudo-probabilities for verbs.
