Modality-Preserving Phrase-Based Statistical Machine Translation
Masamichi Ideue, Masao Utiyama, Eiichiro Sumita and Kazuhide Yamamoto
(Nagaoka University of Technology and NICT)
Purpose of our study
Japanese-to-English translation that preserves negation and question modality with phrase-based SMT.

Input: 私はりんごが好きではありません。
Correct translation: I don't like apples.
MT translation: I like apples.

MT users would not be able to detect such a modality error from the output alone.
Related Studies
• Class-Dependent Modeling for Dialog Translation [Finch et al., 2009]
• Discriminative Reranking for SMT using Various Global Features [Goh et al., 2010]

Neither of these studies discussed which expressions influence modality. Our study focuses on characteristic modality words in negations and questions.
Proposed Method
Add feature functions that consider characteristic words of negation and question.
Added feature functions
Each added feature counts the number of phrase pairs whose Japanese phrase and English phrase both include a characteristic word of question (or negation).

Input f: 財布 は どこ に あり ます か ?
Hypothesis e: Where is the purse ?
Feature value (question): 2
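A minimal sketch of such a feature function, assuming whitespace-tokenized phrase pairs and small illustrative characteristic-word sets (the names and word lists are assumptions, not the paper's actual Moses implementation):

```python
# Sketch of the proposed feature: count phrase pairs in a derivation whose
# source AND target sides both contain a characteristic word of the modality.
# Word sets here are illustrative, not the full extracted lists.

QUESTION_JA = {"か", "?", "どこ"}       # characteristic question words (Japanese)
QUESTION_EN = {"?", "where", "what"}    # characteristic question words (English)

def question_feature(phrase_pairs):
    """phrase_pairs: list of (japanese_phrase, english_phrase) strings."""
    count = 0
    for f_phrase, e_phrase in phrase_pairs:
        f_hit = any(w in f_phrase.split() for w in QUESTION_JA)
        e_hit = any(w.lower() in e_phrase.lower().split() for w in QUESTION_EN)
        if f_hit and e_hit:
            count += 1
    return count

# Example derivation for "財布 は どこ に あり ます か ?" → "Where is the purse ?"
pairs = [("財布 は", "the purse"), ("どこ に あり ます", "Where is"), ("か ?", "?")]
print(question_feature(pairs))  # → 2
```

With this segmentation, two of the three phrase pairs contain a question word on both sides, giving the feature value 2 shown on the slide.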
Characteristic Words Extraction
Characteristic words are extracted from a parallel corpus in the travel domain.
• Manual extraction
• Automatic extraction using the LLR (log-likelihood ratio) score
Manual Extraction (English)
Negation: not, 't, don, Don, haven, isn, No, won, wasn, doesn, didn, cannot, hadn
Question: ?, Why, Will, What, Could, Is, How, Does, Can, Do, Are, Which, When, Where, Have, Did, Was, May
Manual Extraction (Japanese)
Negation: ない (nai), ません (masen)
Question: ?, か 。 (ka.)

• Few characteristic words express the modalities clearly.
• Whether a word expresses a modality tends to depend on the domain.
Automatic Extraction
• Automatic extraction is based on LLR.
• LLR is convenient for extracting characteristic words in the travel domain (Chujo et al., 2006).

Ranking by LLR score (question):
1. ?
2. Will
3. Could
4. How
5. Can
...

The top N words of the LLR ranking are extracted as the characteristic words.
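The top-N selection itself is straightforward; a sketch, assuming the per-word LLR scores have already been computed (the scores below are made-up illustrative values, not from the paper):

```python
# Take the top-N words of the LLR ranking as characteristic words.
# Scores are illustrative placeholders.
llr_scores = {"?": 2100.0, "Will": 830.5, "Could": 640.2, "How": 590.8,
              "Can": 515.3, "the": 1.2, "hotel": 0.8}

def top_n_characteristic_words(scores, n):
    """Return the n words with the highest LLR score."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]

print(top_n_characteristic_words(llr_scores, 5))
# → ['?', 'Will', 'Could', 'How', 'Can']
```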
Calculation of LLR (in case of negation)
If a word tends to occur only in negation, its LLR score becomes high.

        Negation   Affirmation   Total
w = 1   a          b             a+b
w = 0   c          d             c+d
Total   a+c        b+d           n

(a, b, c, d: occurrence frequency under each condition)
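A sketch of the score computed from this 2x2 contingency table, written in Dunning's G-squared form (an assumption; the slides do not give the formula explicitly):

```python
from math import log

def llr(a, b, c, d):
    """Log-likelihood ratio from the 2x2 contingency table:
    a = freq(w present, negation), b = freq(w present, affirmation),
    c = freq(w absent, negation),  d = freq(w absent, affirmation)."""
    n = a + b + c + d
    score = 0.0
    # Sum observed * log(observed / expected) over the four cells.
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        if obs > 0:
            expected = row * col / n
            score += obs * log(obs / expected)
    return 2 * score

# A word occurring almost only in negation sentences gets a much higher
# score than a word spread evenly over both sentence types.
print(llr(90, 5, 10, 95) > llr(60, 40, 40, 60))  # skewed table scores higher
```

A perfectly even table such as (50, 50, 50, 50) scores exactly 0, since every observed count equals its expected count.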
Sentence type classification
To build the contingency table, we divided the sentences in the parallel corpus using the manually extracted English characteristic words.

English               Japanese                Type
He is not an artist.  彼は芸術家ではない。    negation
I like apples.        私はりんごが好きです。  affirmation
Are you a doctor?     あなたは医者ですか。    question
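A minimal sketch of this split, assuming small illustrative word sets and a simple negation-before-question precedence (both are assumptions; the slides do not specify tie-breaking or the full lists):

```python
# Classify English sentences by presence of manually extracted
# characteristic words, to build the LLR contingency table.
NEGATION_WORDS = {"not", "'t", "don", "isn", "wasn", "didn", "cannot", "no"}
QUESTION_WORDS = {"?", "what", "where", "how", "can", "could", "do", "are"}

def sentence_type(english_sentence):
    # Split "?" off the last word so it matches as its own token.
    tokens = {t.lower() for t in english_sentence.replace("?", " ?").split()}
    if tokens & NEGATION_WORDS:   # assumed precedence: negation first
        return "negation"
    if tokens & QUESTION_WORDS:
        return "question"
    return "affirmation"

print(sentence_type("He is not an artist."))  # → negation
print(sentence_type("Are you a doctor?"))     # → question
print(sentence_type("I like apples."))        # → affirmation
```

The bare-token matching is deliberately crude (e.g. "can" in a declarative sentence would misfire); it only illustrates the corpus-splitting step.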
Extracted Words by LLR (English)
Negation: can, yet, any, but, know, worry, I, anything, it, so, afraid, understand, what, enough
Question: do, any, there, have, this, don, long, it, isn, did, your, much, how, time
Extracted Words by LLR (Japanese)
Negation: ませ, ない, ん, は, なかっ, あまり, まだ, あり, でき, じゃ, いいえ, そんなに, そんな, たく
Question: か, どこ, 何, どう, いくら, は, いただけ, どの, 何時, あり, でしょ, もらえ, いかが, どんな
Experiments
SMT toolkit: Moses
Tuning: Minimum Error Rate Training (MERT)
Parallel corpus: Basic Travel Expression Corpus (BTEC; 70,000 sentence pairs)
Test set: 1,500 sentences (500 each for negation, question, and affirmation)
Development set: 1,500 sentences (composed in the same way as the test set)
Experiments
• Based on a preliminary BLEU evaluation, N was set to 30 (LLR30).
• The baseline is the same system with no additional features.
Manual Evaluation
• To verify the effect of the proposed features on translation quality.
• To verify the accuracy of each modality.

We randomly extracted 90 sentence pairs per modality (270 pairs in total) to evaluate the methods.
Translation Quality (number of sentences)

Method                             Good (S+A+B)   S    A    B    C    D
Baseline (no additional features)  151            60   57   34   26   93
Manual extraction                  153            55   54   44   29   88
LLR30                              154            60   56   38   28   88

All methods have comparable translation quality when grades S, A, and B are counted as good translations.
Accuracy of each modality
(Percentage of outputs that preserved the modality of the input.)

Method             Affirmation   Negation   Question
Baseline           86.67         39.22      90.48
Manual extraction  87.41         64.71      90.48
LLR30              87.41         62.75      95.24

• The proposed methods markedly improved the negation modality.
• LLR30 outperformed the baseline in every modality.
Translation Example
Input (question): サーカスと動物園、どっちに行こうか。
Baseline: Let's go to the circus and, the zoo? (X)
Proposed method (manual extraction): Which one shall we go to the circus and zoo? (O)
Translation Example
Input (question): キャンセルしてもかまいませんか。
Baseline: May I cancel? (O)
Proposed method (manual extraction): I don't mind if you cancel it? (X)

masen expresses negation, but masen ka expresses a question.
We have to handle word combinations, not only single words.
Conclusion
• We proposed additional features that consider characteristic words for modality-preserving phrase-based SMT.
• The proposed methods produced more translations that preserved the modality of the input than the baseline, without decreasing translation quality.
• Automatic extraction performed as well as or better than manual extraction.
Translation Example
Input (affirmation): やさしく打ってくださいね。
Proposed method (manual extraction): Please go easy. (O)
Proposed method (English side only): Please go easy, isn't it? (X)
Calculation of LLR (in case of negation)
• Pr(D|H_indep) is the probability of the data under the null hypothesis that the occurrences of a word w are independent of the sentence type (negation vs. affirmation).
• Pr(D|H_dep) is the probability of the data under the hypothesis that the occurrences depend on the sentence type.
If a word tends to occur only in negation, its LLR score becomes high.

        Negation   Affirmation   Total
w = 1   a          b             a+b
w = 0   c          d             c+d
Total   a+c        b+d           n

(a, b, c, d: occurrence frequency under each condition)
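Under these definitions, the score compares the two hypotheses; one common way to write it, consistent with Dunning's log-likelihood ratio formulation (which the slides appear to follow, though they do not state the formula):

```latex
\mathrm{LLR}(w) \;=\; -2 \log \frac{\Pr(D \mid H_{\mathrm{indep}})}{\Pr(D \mid H_{\mathrm{dep}})}
\;=\; 2\bigl[\log \Pr(D \mid H_{\mathrm{dep}}) - \log \Pr(D \mid H_{\mathrm{indep}})\bigr]
```

Both likelihoods are evaluated on the counts a, b, c, d of the contingency table above; the larger the gap between the dependent and independent fits, the higher the score.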
Related Studies
• Class-Dependent Modeling for Dialog Translation [Finch et al., 2009]: two models are trained, one for question sentences and one for the other sentences.
• Discriminative Reranking for SMT using Various Global Features [Goh et al., 2010]: probabilities of sentence types such as negation and question are used as features.

Neither of these studies discussed which expressions influence modality. Our study focuses on characteristic modality words in negations and questions.