Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Automatic Evaluationof Translation Quality forDistant Language Pairs
Stanislav Reichert
University of Tübingen
14th June 2018
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
OutlineIntroduction
NTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-English Translation ResultsEnglish-Japanese Translation Results
From BLEU to RIBESIdentifying BLEU’s weaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
Reordering StrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Good translations: I know one when I see one
Figure 1: From Koehn (2009, p. 218)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Good translations: I know one when I see one
Figure 2: From Goto, Chow, et al. (2013)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
NTCIR Workshops & Patent Tasks
▶ Reliable automatic evaluation of MT is a desideratum▶ The NTCIR Workshop is a series of evaluation
workshops designed to enhance research in informationaccess technologies (including languages like Japaneseand Chinese)
▶ Patents are a challenging domain for MT due to thelength and complexity of the patent sentences (JP-EN& EN-JP)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
NTCIR-10 Patent MT Task
▶ NTCIR-8 and NTCIR-9 PatentMT Test CollectionJapanese-to-English Machine Translation Data package1
▶ Training data for JP-EN and EN-JP subtasks:▶ Approximately 3.2 million patent parallel sentence pairs▶ Monolingual patent corpus in Japanese spanning 13
years
1http://research.nii.ac.jp/ntcir/permission/ntcir-9/perm-en-PatentMT.html
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Japanese-English Translation Results
Figure 3: From Goto, Chow, et al. (2013)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Japanese-English Translation Results
▶ 4 top systems are RBMT or make use of commercialRBMT systems
▶ NTITI systems → post-ordering method andpre-ordering methods
▶ RWTH-1 → phrase-based SMT with a hierarchicalphrase reordering model
▶ The source sentence meaning could be understood(awarded C rank or better) for
▶ 55% of the sentences for the best RBMT (JAPIO-1)▶ 38% of the sentences for the best SMT (NTITI-1)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
English-Japanese Translation Results
Figure 4: From (Goto, Chow, et al., 2013)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
English-Japanese Translation Results
▶ EN-JP is a more difficult case for SMT systems▶ Yet at the NTCIR-9, NTITI-2 (pre-ordering) was
significantly better than the best-ranked RBMT system▶ The source sentence meaning could be understood
(awarded C rank or better) for▶ 70% of the translated sentences (vs. 58% for RBMT)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
BLEU singing the blues?
▶ N-gram evaluation metric with modified n-gramprecision to deal with variation in phrase order
▶ Demonstrably correlates with human judgments inmany cases
▶ Can be used for tracking broad, incremental changes toa single system
▶ However: NIST Machine Translation Evaluationexercise 2005 revealed a major mismatch between theBLEU evaluation and human judgments
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
BLEU singing the blues?
▶ Changes in translation quality are reported in opaqueBLEU scores, yet actual examples are rarely provided
▶ Echizen-ya et al. (2009) show that the popular BLEUand NIST metrics do not work well.Alternatives: IMPACT
▶ The global word order is essential for an SMT systemworking with distant language pairs:
▶ Common error: A because of B ⇒ B because of A▶ BLEU disregards global word order
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Disassembling BLEU scores
▶ “Millions of variations on a hypothesis translation [...]receive the same Bleu score” (Callison-Burch, Osborne,& Koehn, 2006, p. 1)
▶ Orejuela seemed quite calm as he was being led to theAmerican plane that would take him to Miami inFlorida.
▶ Appeared calm | when | he was | taken | to theAmerican plane | , | which will | to Miami , Florida .
▶ which will | he was | , | when | taken | Appeared calm |to the American plane | to Miami , Florida .
▶ was being led to the | calm as he was | would take |carry him | seemed quite | when | taken
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Disassembling BLEU scores
▶ Equal weighting of all items in the reference sentences▶ Words in the hypothesis sentence that did not appear in
the reference sentences can be substituted with anything▶ BLEU may not be appropriate for comparing systems
which employ different translation strategies
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
The RIBES metric
▶ Adequacy is more important for distant language pairs(Isozaki, Hirao, et al., 2010)
▶ NIST, PER, and TER turn out to be inadequatemeasures
▶ Solution: rank correlation coefficients compare wordranks in the reference with those in the hypothesissentences
▶ RIBES: Rank-based Intuitive Bilingual Evaluation Score▶ Defined for each test sentence and averaged score is
used for evaluating the entire test corpus(Isozaki et al.,2014)
▶ Significantly penalizes global word order mistakes
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Refining RIBES
▶ Rank correlation metrics are defined only when there isone-to-one correspondence
▶ In many cases, Spearman’s ρ and Kendall’s τ cannot bedetermined
▶ Solution:▶ find one-to-one corresponding higher-order n-grams▶ ρ and τ seem to be better metrics than BLEU
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Refining RIBES
▶ ρ and τ are normalized (NSR and NKT)▶ Precision (P), recall, and F-measure to reduce the
overestimation of NSR and NKT▶ Precision correlates best with adequacy among these
three metrics▶ New metrics for (0 ≤ α ≤ 1):
▶ NSRPα
▶ NKTPα
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Evaluating RIBES: Results
▶ NTCIR-7 PATMT JP-EN translation data
Figure 5: From Isozaki, Hirao, et al. (2010)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Evaluating RIBES: Discussion
▶ Results for new metrics are comparable to conventionalmethods even for similar language pairs (European)
▶ Tested on WMT-07 data (Callison-Burch et al., 2007)▶ With the precision-based penalty, both ρ and τ
outperform conventional methods
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
English vs. Japanese
▶ English is a head-initial language (SVO)▶ Japanese is a head-final language (SOV). It is also a
topic-prominent language▶ Scrambling of syntactic chunks called bunsetsu is a
frequent phenomenon▶ Deterministic rules fail at some sentence sentence
constructions▶ RIBES performs worse on scrambled sentences
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Difficult sentences
(1) KareHe
waTOP
kōhicoffee
woOBJ
noma-naidrink-NEG
He doesn’t drink coffee(2) Kōhi
CoffeewaTOP
noma-naidrink-NEG
I don’t drink coffee(3) Watashi
IwaTOP
jikantime
gaSUB
naiNEG
I don’t have time(4) Ima
NowwaTOP
jikantime
gaSUB
naiNEG
Now I don’t have time
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Reordering Strategies
▶ Conducting target word selection and reordering jointly:phrase-based SMT, hierarchical phrase-based SMT,syntax-based SMT
▶ Pre-ordering: reorder the source language sentence intoa target language word order. Then, translate thereordered source word sequence using SMT methods
▶ Post-ordering: reverse of the pre-ordering
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Single reordering rule: Head Final English
▶ Head Finalization: move syntactic heads to the end ofthe corresponding syntactic constituents (Isozaki,Sudoh, et al., 2010)
▶ A HPSG parser ENJU outputs syntactic heads▶ Particle seeds can be generated from the output
▶ Difficulties:▶ The sentence pair is not an exact one-to-one translation▶ Mistakes in ENJU’s tagging or parsing (esp. VBN/VBD
and VBZ/NNS mistakes) or in the GIZA++ automaticword alignment
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Rule-based pre-ordering model for JP-EN SMT
▶ Hoshino et al. (2013) put forth four rules:▶ Modify Japanese dependency trees: every head chunk
comes first and its dependent children follow▶ Convert SOV into SVO: place V after the S or the O, if
no S. Move V before the rightmost chunk if there isonly V
▶ Keep the word order of coordinate clauses unchangedand put them into the leftmost slot (punctuation comesinto the rightmost slot)
▶ For every chunk, swap function and content words →pseudo-prepositional phrases
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Rule-based pre-ordering model for JP-EN SMT
Figure 6: From (Hoshino et al., 2013)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Post-ordering framework for JP-EN translation
▶ Reordering model makes use of syntactic structures(relying on ENJU’s output) (Goto, Utiyama, & Sumita,2013)
▶ The model is based on:▶ parsing using probabilistic context free grammar (PCFG)▶ inversion transduction grammar (ITG)
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Bibliography I
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., &Schroeder, J. (2007). (meta-)evaluation of MachineTranslation. In Proceedings of the second workshop onstatistical machine translation (pp. 136–158).Association for Computational Linguistics.
Callison-Burch, C., Osborne, M., & Koehn, P. (2006).Re-evaluation the role of BLEU in machine translationresearch. In 11th conference of the european chapterof the association for computational linguistics.
Echizen-ya, H., Ehara, T., Shimohata, S., Fujii, A., Utiyama,M., Yamamoto, M., … Kando, N. (2009).Meta-evaluation of automatic evaluation methods formachine translation using patent translation data inNTCIR-7. In Proceedings of the 3rd workshop onpatent translation (p. 9-16).
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Bibliography II
Goto, I., Chow, K.-P., Lu, B., Sumita, E., & Tsou, B. K.(2013). Overview of the patent machine translationtask at the NTCIR-10 workshop. In Proceedings of the10th ntcir conference (p. 260-286).
Goto, I., Utiyama, M., & Sumita, E. (2013). Post-orderingby parsing with ITG for Japanese-English statisticalmachine translation. ACM Transactions on AsianLanguage Information Processing (TALIP), 12(4),17:1–17:22.
Hoshino, S., Miyao, Y., Sudoh, K., & Nagata, M. (2013).Two-stage pre-ordering for Japanese-to-Englishstatistical machine translation. In Proceedings of the6th international joint conference on natural languageprocessing (p. 1062-1066).
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Bibliography III
Isozaki, H., Hirao, T., Duh, K., Sudoh, K., & Tsukada, H.(2010). Automatic evaluation of translation quality fordistant language pairs. In Proceedings of the 2010conference on empirical methods in natural languageprocessing (pp. 944–952).
Isozaki, H., Kouchi, N., & Hirao, T. (2014).Dependency-based automatic enumeration ofsemantically equivalent word orders for evaluatingjapanese translations. In Proceedings of the 9thworkshop on statistical machine translation(p. 287-292).
Isozaki, H., Sudoh, K., Tsukada, H., & Duh, K. (2010).Head finalization: A simple reordering rule for SOVlanguages. In Proceedings of the joint 5th workshopon statistical machine translation and metricsmatr(p. 244-251).
RIBES
Stanislav Reichert
IntroductionNTCIR WorkshopsNTCIR-10 Patent MT TaskJapanese-EnglishTranslation ResultsEnglish-JapaneseTranslation Results
From BLEU toRIBESIdentifying BLEU’sweaknesses
The RIBES metricIntroducing RIBESRefining RIBESEvaluating RIBES
ReorderingStrategiesEnglish vs. JapanesePre-ordering modelsPost-ordering models
References
Bibliography IV
Koehn, P. (2009). Statistical machine translation.Cambridge University Press.