Using Comparable Corpora to Adapt
a Translation Model to Domains
Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada
Department of Computer Science, Shizuoka University
The 7th International Conference on Language Resources and Evaluation, Malta, May 2010.
1. Motivation and goal
2. Proposed method
a. Estimating noun translation pseudo-probabilities
b. Estimating noun-sequence translation pseudo-probabilities
c. Phrase-based SMT using translation pseudo-probabilities
3. Experiments
4. Discussion
5. Related work
6. Summary
Overview
Statistical machine translation
– Able to learn a translation model from a parallel corpus
– Suffers from the limited availability of large parallel corpora
Use comparable corpora for SMT
– Estimate translation pseudo-probabilities from a bilingual dictionary and comparable corpora
– Use the pseudo-probabilities estimated from in-domain comparable corpora to
  • Adapt a translation model learned from an out-of-domain parallel corpus, or
  • Augment a translation model learned from a small in-domain parallel corpus
Motivation and goal
Word associations suggest particular senses or translations of a polysemous word (Yarowsky 1993)
– (tank, soldier) → the “military vehicle” sense or translation “戦車 [SENSHA]” of “tank”
– (tank, gasoline) → the “container for liquid or gas” sense or translation “タンク [TANKU]” of “tank”
Comparable corpora allow us to determine which word associations suggest which translations of a polysemous word (Kaji & Morimoto 2002)
Assumption: the more word associations suggest a translation, the higher the probability of that translation
Basic idea for estimating word translation pseudo-probabilities from comparable corpora
Naive method for estimating word translation pseudo-probabilities
Inputs: English corpus, Japanese corpus, English-Japanese dictionary
1. Extract word associations from each corpus
– English: (tank, soldier), (tank, gasoline), (tank, missile), (tank, fuel)
– Japanese: (戦車 [SENSHA], 兵士 [HEISHI]), (タンク [TANKU], ガソリン [GASORIN]), (戦車 [SENSHA], ミサイル [MISAIRU]), (タンク [TANKU], 燃料 [NENRYOU])
2. Align word associations via the dictionary
– “Missile”, “soldier”, and others suggest “戦車 [SENSHA]”
– “Fuel”, “gasoline”, and others suggest “タンク [TANKU]”
3. Calculate the percentage of associated words suggesting each translation
– Pps(戦車 [SENSHA]|tank) = |{missile, soldier, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)
– Pps(タンク [TANKU]|tank) = |{fuel, gasoline, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)
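The naive estimate can be illustrated with a minimal Python sketch. It is only an illustration under assumptions: the function name `naive_pseudo_probabilities`, the toy association lists, and the one-entry-per-word dictionary are hypothetical, and a word association is treated as aligned when the dictionary translation of the associated word co-occurs with a candidate translation in the Japanese associations.

```python
def naive_pseudo_probabilities(source_assocs, target_assocs, dictionary, translations):
    """Share of associated words that suggest each candidate translation."""
    supporters = {t: set() for t in translations}
    for assoc_word in source_assocs:
        for t in translations:
            # (source word, assoc_word) aligns with (t, x) when the dictionary
            # maps assoc_word to x and (t, x) is a target-language association.
            if any((t, x) in target_assocs for x in dictionary.get(assoc_word, ())):
                supporters[t].add(assoc_word)
    total = sum(len(words) for words in supporters.values())
    return {t: (len(words) / total if total else 0.0)
            for t, words in supporters.items()}


# Toy data taken from the "tank" example; "Chechen" has no dictionary entry.
english_assocs = ["soldier", "gasoline", "missile", "fuel", "Chechen"]
japanese_assocs = {("戦車", "兵士"), ("タンク", "ガソリン"),
                   ("戦車", "ミサイル"), ("タンク", "燃料")}
dictionary = {"soldier": ["兵士"], "gasoline": ["ガソリン"],
              "missile": ["ミサイル"], "fuel": ["燃料"]}

probs = naive_pseudo_probabilities(
    english_assocs, japanese_assocs, dictionary, ["戦車", "タンク"])
print(probs)
```

With this toy input, two associated words support each translation, and “Chechen” is simply ignored because it cannot be aligned.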
Failure in word-association alignment
– (tank, Chechen) ?, due to the disparity in topical coverage between the two language corpora
– (tank, Chechen) ? (戦車 [SENSHA], チェチェン [CHECHEN]), due to the incomplete coverage of the intermediary bilingual dictionary
Incorrect word-association alignment
– (tank, troop) aligned with (水槽 [SUISOU], 群れ [MURE]), due to incidental word-for-word correspondence between word associations that do not really correspond to each other
Difficulties the naive method suffers from
Two words associated with a third word are likely to suggest the same sense or translation of the third word when they are also associated with each other
– “Soldier” and “troop”, both of which are associated with “tank”, are associated with each other
– “Soldier” and “troop” suggest the same translation “戦車 [SENSHA]”
Define a correlation between an associated word and a translation using the correlations between other associated words and the translation
– C(troop, 戦車 [SENSHA]) = MI(troop, tank) × {MI(troop, soldier) × C(soldier, 戦車 [SENSHA]) + MI(troop, missile) × C(missile, 戦車 [SENSHA]) + …}
– C(troop, タンク [TANKU]) = MI(troop, tank) × {MI(troop, soldier) × C(soldier, タンク [TANKU]) + MI(troop, missile) × C(missile, タンク [TANKU]) + …}
How to overcome the difficulties
Calculate the correlations iteratively, starting with the initial values determined according to the results of word-association alignment via a bilingual dictionary
• Alignment
[Figure: the associated words of “tank” (Chechen, fuel, gasoline, missile, soldier, troop) linked to the translations 戦車 [SENSHA] and タンク [TANKU]]
• C0(associated_word, translation)
tank        戦車 [SENSHA]   タンク [TANKU]
Chechen     0.0             0.0
fuel        0.0             1.0
gasoline    0.0             1.0
missile     1.0             0.0
soldier     1.0             0.0
troop       0.5             0.5
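One update step of the correlation can be sketched as follows. This is a hedged illustration, not the authors' implementation: it reads the slide formula as C(a, t) = MI(a, tank) × Σ MI(a, b) × C(b, t) over the other associated words b, applies no normalization (none is shown on the slides), and uses MI values that are invented for the example. Only the C0 values come from the table above.

```python
def update_correlations(target, assoc_words, translations, mi, corr):
    """One step: C(a, t) = MI(a, target) * sum over b != a of MI(a, b) * C(b, t)."""
    def mi_of(x, y):
        # Missing pairs are treated as unassociated (MI = 0), an assumption.
        return mi.get(frozenset((x, y)), 0.0)

    updated = {}
    for a in assoc_words:
        for t in translations:
            support = sum(mi_of(a, b) * corr[(b, t)]
                          for b in assoc_words if b != a)
            updated[(a, t)] = mi_of(a, target) * support
    return updated


translations = ["戦車", "タンク"]
# Initial values C0 from the alignment table.
c0_rows = {
    "Chechen": (0.0, 0.0), "fuel": (0.0, 1.0), "gasoline": (0.0, 1.0),
    "missile": (1.0, 0.0), "soldier": (1.0, 0.0), "troop": (0.5, 0.5),
}
c0 = {(w, t): v for w, row in c0_rows.items() for t, v in zip(translations, row)}

# Hypothetical MI values; only pairs involving "troop" are supplied here.
mi = {
    frozenset(("troop", "tank")): 2.0,
    frozenset(("troop", "soldier")): 3.0,
    frozenset(("troop", "missile")): 1.5,
    frozenset(("troop", "fuel")): 0.2,
}

c1 = update_correlations("tank", list(c0_rows), translations, mi, c0)
print(c1[("troop", "戦車")], c1[("troop", "タンク")])
```

Because “troop” is strongly associated with “soldier” and “missile” in this toy data, its correlation with 戦車 [SENSHA] ends up far larger than its correlation with タンク [TANKU], even though the dictionary alignment alone left it undecided at 0.5 each. Repeating the step with `c1` in place of `c0` would give further iterations.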
Overview of our method for estimating noun translation pseudo-probabilities
Inputs: English corpus, Japanese corpus, English-Japanese dictionary
1. Extract pairs of words co-occurring in a window* from each corpus
2. Calculate pointwise mutual information to obtain the English word associations and the Japanese word associations
3. Align the word associations via the dictionary to obtain the initial value of the correlation matrix of English associated words vs. Japanese translations for an English noun
4. Calculate pairwise correlations between associated words and translations iteratively to obtain the correlation matrix of associated words vs. translations
5. Assign each associated word to the translation with which it has the highest correlation and calculate the percentage of associated words assigned to each translation, which gives the noun translation pseudo-probabilities
* Window size = 10 content words
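The extraction of word associations (window co-occurrence followed by pointwise mutual information) can be sketched as below. This is a minimal illustration under assumptions: the function name `pmi_from_windows` is hypothetical, the input is assumed to be already reduced to content words, probabilities are plain relative frequencies, and the frequency and MI thresholds mentioned in the discussion are omitted.

```python
import math
from collections import Counter


def pmi_from_windows(tokens, window=10):
    """PMI for unordered word pairs co-occurring within `window` tokens."""
    word_count = Counter(tokens)
    pair_count = Counter()
    for i, w in enumerate(tokens):
        # Pair w with the following tokens inside the same window.
        for v in tokens[i + 1:i + window]:
            if v != w:
                pair_count[frozenset((w, v))] += 1

    n_words = len(tokens)
    n_pairs = sum(pair_count.values())
    pmi = {}
    for pair, count in pair_count.items():
        x, y = tuple(pair)
        p_xy = count / n_pairs
        p_x = word_count[x] / n_words
        p_y = word_count[y] / n_words
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Tiny toy "corpus"; a window of 2 pairs each token with its right neighbour.
tokens = "tank soldier tank soldier fuel pump fuel pump".split()
pmi = pmi_from_windows(tokens, window=2)
for pair, score in sorted(pmi.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), round(score, 3))
```

Pairs that never co-occur inside a window, such as (tank, pump) here, get no entry at all, which corresponds to having no word association.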
Example correlation matrix and estimated noun translation pseudo-probabilities
Translations of “plant”: (1) 装置 [SOUCHI], (2) 設備 [SETSUBI], (3) 植物 [SHOKUBUTSU], (4) 工場 [KOUJOU], (5) プラント [PURANTO], (6) 苗 [NAE], (7) 植木 [UEKI]
plant         (1)    (2)    (3)    (4)    (5)    (6)    (7)
activity      0.02   0.03   2.10   0.20   0.03   0.01   0.02
bacteria      0.02   0.03   1.98   0.01   0.02   0.27   0.02
boiler        0.05   2.70   0.05   0.03   2.73   0.03   0.04
coal          0.87   2.35   1.70   0.68   2.06   0.65   0.99
computer      0.55   0.71   0.02   0.49   0.73   0.01   0.01
control       0.47   0.51   0.17   0.15   0.62   0.06   0.01
culture       0.03   0.05   3.26   0.23   0.12   0.77   0.88
environment   0.76   1.25   1.32   0.03   0.05   0.23   0.03
failure       0.93   1.22   0.03   0.53   1.43   0.01   0.01
flower        0.04   0.06   4.02   0.04   0.04   1.23   1.70
:             :      :      :      :      :      :      :
Translation pseudo-probabilities
              .047   .241   .423   .022   .223   .022   .022
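The final step, assigning each associated word to its highest-correlation translation and taking percentages, can be sketched on the ten rows shown for “plant”. This is an illustration only: the function name is hypothetical, and because the slide's pseudo-probabilities were computed over the full matrix (the remaining rows are elided), the numbers produced here are not expected to match them.

```python
from collections import Counter

translations = ["装置", "設備", "植物", "工場", "プラント", "苗", "植木"]

# The ten visible rows of the example correlation matrix for "plant".
matrix = {
    "activity":    [0.02, 0.03, 2.10, 0.20, 0.03, 0.01, 0.02],
    "bacteria":    [0.02, 0.03, 1.98, 0.01, 0.02, 0.27, 0.02],
    "boiler":      [0.05, 2.70, 0.05, 0.03, 2.73, 0.03, 0.04],
    "coal":        [0.87, 2.35, 1.70, 0.68, 2.06, 0.65, 0.99],
    "computer":    [0.55, 0.71, 0.02, 0.49, 0.73, 0.01, 0.01],
    "control":     [0.47, 0.51, 0.17, 0.15, 0.62, 0.06, 0.01],
    "culture":     [0.03, 0.05, 3.26, 0.23, 0.12, 0.77, 0.88],
    "environment": [0.76, 1.25, 1.32, 0.03, 0.05, 0.23, 0.03],
    "failure":     [0.93, 1.22, 0.03, 0.53, 1.43, 0.01, 0.01],
    "flower":      [0.04, 0.06, 4.02, 0.04, 0.04, 1.23, 1.70],
}


def pseudo_probabilities(matrix, translations):
    """Fraction of associated words whose best translation is each candidate."""
    counts = Counter()
    for row in matrix.values():
        best = max(range(len(row)), key=row.__getitem__)
        counts[translations[best]] += 1
    n_words = len(matrix)
    return {t: counts[t] / n_words for t in translations}


probs = pseudo_probabilities(matrix, translations)
print(probs)
```

On these ten rows, five words go to 植物 [SHOKUBUTSU], four to プラント [PURANTO], and one to 設備 [SETSUBI].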
Our method for estimating noun-sequence translation pseudo-probabilities
Inputs: English corpus, Japanese corpus, English-Japanese dictionary
1. Extract a noun sequence $F = f_1 f_2 \dots f_m$ with its frequency
2. Generate all compositional translations
$E^{(1)} = e_1^{(1)} e_2^{(1)} \dots e_m^{(1)}$, $E^{(2)} = e_1^{(2)} e_2^{(2)} \dots e_m^{(2)}$, …, $E^{(n)} = e_1^{(n)} e_2^{(n)} \dots e_m^{(n)}$
3. Retrieve compositional translations and count their frequencies
4. Estimate according to occurrence frequencies
$P_1(E^{(j)} \mid F) = \dfrac{g(E^{(j)})}{\sum_{k=1}^{n} g(E^{(k)})}$
5. Estimate according to constituent-word translation pseudo-probabilities
$P_2(E^{(j)} \mid F) = \dfrac{\prod_{i=1}^{m} P_{ps}(e_i^{(j)} \mid f_i)}{\sum_{k=1}^{n} \prod_{i=1}^{m} P_{ps}(e_i^{(k)} \mid f_i)}$
6. Combine two estimates: $P_{ps}(E^{(j)} \mid F)$ is obtained from $P_1(E^{(j)} \mid F)$ and $P_2(E^{(j)} \mid F)$
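The two estimates for a noun sequence can be sketched as follows. This is a hedged illustration, not the authors' code: g is taken to be the raw corpus frequency of a candidate, the English-to-Japanese direction and all counts and word-level pseudo-probabilities are invented for the example, and the combining step is left out because the extracted slide text does not show how the two estimates are combined.

```python
from itertools import product
from math import prod


def compositional_translations(source_nouns, dictionary):
    """All word-by-word translations of the source noun sequence."""
    return [tuple(c) for c in product(*(dictionary[f] for f in source_nouns))]


def estimate_by_frequency(candidates, freq):
    """P1: relative corpus frequency of each candidate translation."""
    total = sum(freq.get(e, 0) for e in candidates)
    return {e: (freq.get(e, 0) / total if total else 0.0) for e in candidates}


def estimate_by_constituents(candidates, source_nouns, word_pps):
    """P2: normalized product of constituent-word pseudo-probabilities."""
    scores = {e: prod(word_pps[(e_i, f_i)] for e_i, f_i in zip(e, source_nouns))
              for e in candidates}
    total = sum(scores.values())
    return {e: (s / total if total else 0.0) for e, s in scores.items()}


# Hypothetical toy data.
source = ("fuel", "tank")
dictionary = {"fuel": ["燃料"], "tank": ["タンク", "戦車"]}
target_freq = {("燃料", "タンク"): 9, ("燃料", "戦車"): 1}
word_pps = {("燃料", "fuel"): 1.0, ("タンク", "tank"): 0.6, ("戦車", "tank"): 0.4}

candidates = compositional_translations(source, dictionary)
p1 = estimate_by_frequency(candidates, target_freq)
p2 = estimate_by_constituents(candidates, source, word_pps)
print(p1)
print(p2)
```

In this toy case the frequency-based estimate favors 燃料タンク more strongly than the constituent-based one, which shows why the two sources of evidence are worth combining.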
Phrase-based SMT using translation pseudo-probabilities
– Out-of-domain (or in-domain) parallel corpus → GIZA++ & heuristics → Basic phrase table
– In-domain source-language corpus, in-domain target-language corpus, and bilingual dictionary → Estimate translation pseudo-probabilities → In-domain phrase table (pseudo-probabilities)
– Basic phrase table and in-domain phrase table → Merge → Adapted (or augmented) phrase table
– In-domain target-language corpus → SRILM → In-domain language model
– Source language text → Moses decoder, using the adapted (or augmented) phrase table and the in-domain language model → Target language text
Experiment A: Adapt a phrase table learned from an out-of-domain parallel corpus by using in-domain comparable corpora
Experiment B: Augment a phrase table learned from a small in-domain parallel corpus by using larger in-domain comparable corpora
Experimental setting
Training parallel corpus
– Experiment A: 20,000 pairs of Japanese and English patent abstracts in the physics domain
– Experiment B: 20,000 pairs of highly similar Japanese and English sentences extracted from scientific-paper abstracts in the chemistry domain
Training comparable corpora (both experiments)
– Scientific-paper abstracts in the chemistry domain
  ・Japanese: 151,958 abstracts (90.8 Mbytes)
  ・English: 102,730 abstracts (64.9 Mbytes)
Test corpus (both experiments)
– 1,000 Japanese sentences, each having one reference English translation, from scientific-paper abstracts in the chemistry domain
Bilingual dictionary (both experiments)
– 333,656 pairs of translation equivalents between 163,247 Japanese and 93,727 English nouns from the EDR, EIJIRO, and EDICT dictionaries
Our method in four cases using different volumes of comparable corpora
1. Japanese: all, English: all
2. Japanese: half, English: all
3. Japanese: all, English: half
4. Japanese: half, English: half
Two baseline methods using the phrase table learned from the parallel corpus
1. Baseline without dictionary
2. Baseline with dictionary: the phrase table was augmented with the bilingual dictionary
[Note] The TL language model learned from the whole TL monolingual corpus was used in all cases, for both our method and the baseline methods
Evaluation metric: BLEU-4
BLEU-4 score
Method                          Experiment A   Experiment B
Our method (J:all, E:all)       13.30          16.82
Our method (J:half, E:all)      13.19          16.70
Our method (J:all, E:half)      13.21          16.78
Our method (J:half, E:half)     13.27          16.71
Baseline w/o dictionary         11.42          16.37
Baseline w/ dictionary          12.94          16.32
– Our method slightly improved the BLEU score
– The effect of the difference in volume of comparable corpora remains unclear
– Simply adding a bilingual dictionary improved the out-of-domain phrase table, but did not improve the in-domain phrase table
Experimental results
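The size of the improvements can be read off the reported scores with a little arithmetic. The sketch below only subtracts the BLEU-4 values listed in the results; the dictionary keys and the helper name `gain` are hypothetical labels chosen for the example.

```python
# BLEU-4 scores as reported (A: out-of-domain, B: small in-domain parallel corpus).
bleu = {
    "ours (J:all, E:all)":     {"A": 13.30, "B": 16.82},
    "baseline w/o dictionary": {"A": 11.42, "B": 16.37},
    "baseline w/ dictionary":  {"A": 12.94, "B": 16.32},
}


def gain(system, baseline, experiment):
    """BLEU difference between two systems, rounded to two decimals."""
    return round(bleu[system][experiment] - bleu[baseline][experiment], 2)


for exp in ("A", "B"):
    print(exp,
          "ours vs. no dictionary:", gain("ours (J:all, E:all)", "baseline w/o dictionary", exp),
          "| ours vs. dictionary:", gain("ours (J:all, E:all)", "baseline w/ dictionary", exp),
          "| dictionary alone:", gain("baseline w/ dictionary", "baseline w/o dictionary", exp))
```

In Experiment A, the method gains 1.88 BLEU over the baseline without a dictionary and 0.36 over the baseline with one. In Experiment B, the gains are 0.45 and 0.50, and the dictionary alone changes the score by -0.05.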
1. Optimization of the parameters
– Parameters, including the window size and thresholds for word occurrence frequency, co-occurrence frequency, and pointwise mutual information, affect the correlation matrix of associated words vs. translations
– How to optimize the values of the parameters remains unsolved
2. Alternatives for word-association measure
– Pointwise mutual information, which tends to overestimate low-frequency words, is not the most suitable for acquiring word associations
– Need to compare with alternatives such as log-likelihood ratio and the Dice coefficient
Discussions
3. Refinement of the definition of translation pseudo-probability
– Need to consider the frequencies of associated words as well as the dependence among associated words
– Need to reconsider the strategy of assigning an associated word to only one translation
4. Estimation of verb translation pseudo-probabilities
– Need to use syntactic co-occurrence, instead of co-occurrence in a window, to extract verb-noun associations from corpora
– Need to define pairwise correlation between associated nouns and translations recursively, based on the heuristic that two nouns associated with a verb are likely to suggest the same sense of the verb when they belong to the same semantic class
Related work
Many studies on bilingual lexicon acquisition from bilingual comparable corpora have been reported since the mid-1990s, but few have addressed estimating word translation probabilities from bilingual comparable corpora
Estimation of word translation probabilities from comparable corpora using an EM algorithm (Koehn & Knight 2000) could be greatly affected by the occurrence frequencies of translation candidates in the TL corpus
In contrast, our method produces translation pseudo-probabilities that reflect the distribution of the senses of the SL word in the SL corpus
Methods for extracting parallel sentence pairs from bilingual comparable corpora (Zhao & Vogel, 2002; Utiyama & Isahara 2003; Fung & Cheung, 2004; Munteanu & Marcu, 2005); extracted parallel sentences could be used to learn a translation model with a conventional method based on word-for-word alignment. This approach is applicable only to closely comparable corpora.
In contrast, our method is applicable even to a pair of unrelated monolingual corpora.
Summary
A method for estimating translation pseudo-probabilities from a bilingual dictionary and bilingual comparable corpora was created
– Assumption: The more associated words a translation is correlated with, the higher its translation probability
– Essence of the method: Calculate pairwise correlations between associated words of an SL word and its TL translations
A phrase-based SMT framework using an out-of-domain parallel corpus and in-domain comparable corpora was proposed
– An experiment showed promising results; the BLEU score was improved by using the translation pseudo-probabilities estimated from in-domain comparable corpora
Future work includes optimizing the parameters and extending the method to estimate translation pseudo-probabilities for verbs.