RIBES

Stanislav Reichert

Automatic Evaluation of Translation Quality for Distant Language Pairs

Stanislav Reichert

University of Tübingen

14th June 2018

Outline

Introduction
  NTCIR Workshops
  NTCIR-10 Patent MT Task
  Japanese-English Translation Results
  English-Japanese Translation Results

From BLEU to RIBES
  Identifying BLEU’s weaknesses

The RIBES metric
  Introducing RIBES
  Refining RIBES
  Evaluating RIBES

Reordering Strategies
  English vs. Japanese
  Pre-ordering models
  Post-ordering models

References

Good translations: I know one when I see one

Figure 1: From Koehn (2009, p. 218)

Figure 2: From Goto, Chow, et al. (2013)

NTCIR Workshops & Patent Tasks

▶ Reliable automatic evaluation of MT is a desideratum
▶ The NTCIR Workshop is a series of evaluation workshops designed to enhance research in information access technologies (including languages like Japanese and Chinese)
▶ Patents are a challenging domain for MT due to the length and complexity of the patent sentences (JP-EN & EN-JP)

NTCIR-10 Patent MT Task

▶ NTCIR-8 and NTCIR-9 PatentMT Test Collection: Japanese-to-English Machine Translation Data package¹
▶ Training data for JP-EN and EN-JP subtasks:
  ▶ Approximately 3.2 million patent parallel sentence pairs
  ▶ Monolingual patent corpus in Japanese spanning 13 years

¹ http://research.nii.ac.jp/ntcir/permission/ntcir-9/perm-en-PatentMT.html

Japanese-English Translation Results

Figure 3: From Goto, Chow, et al. (2013)

▶ The top 4 systems are RBMT systems or make use of commercial RBMT systems
▶ NTITI systems → post-ordering method and pre-ordering methods
▶ RWTH-1 → phrase-based SMT with a hierarchical phrase reordering model
▶ The source sentence meaning could be understood (awarded C rank or better) for
  ▶ 55% of the sentences for the best RBMT (JAPIO-1)
  ▶ 38% of the sentences for the best SMT (NTITI-1)

English-Japanese Translation Results

Figure 4: From Goto, Chow, et al. (2013)

▶ EN-JP is a more difficult case for SMT systems
▶ Yet at NTCIR-9, NTITI-2 (pre-ordering) was significantly better than the best-ranked RBMT system
▶ The source sentence meaning could be understood (awarded C rank or better) for
  ▶ 70% of the translated sentences (vs. 58% for RBMT)

BLEU singing the blues?

▶ N-gram evaluation metric with modified n-gram precision to deal with variation in phrase order
▶ Demonstrably correlates with human judgments in many cases
▶ Can be used for tracking broad, incremental changes to a single system
▶ However: the NIST Machine Translation Evaluation exercise 2005 revealed a major mismatch between the BLEU evaluation and human judgments

▶ Changes in translation quality are reported in opaque BLEU scores, yet actual examples are rarely provided
▶ Echizen-ya et al. (2009) show that the popular BLEU and NIST metrics do not work well on the NTCIR-7 patent translation data
  ▶ Alternatives: IMPACT
▶ The global word order is essential for an SMT system working with distant language pairs:
  ▶ Common error: A because of B ⇒ B because of A
  ▶ BLEU disregards global word order
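The clause-swap error can be made concrete with a minimal sketch of clipped ("modified") n-gram precision against a single reference. The example sentences and the helper names `ngrams` and `modified_precision` are made up for illustration, and this is not full BLEU: there is no brevity penalty and no geometric mean over n-gram orders.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against one reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Counter intersection keeps the minimum count per n-gram (the clipping).
    clipped = sum((hyp_counts & ref_counts).values())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


# Hypothetical sentence pair: cause and effect are swapped in the hypothesis.
ref = "the game was cancelled because of the heavy rain".split()
hyp = "the heavy rain was cancelled because of the game".split()

print(modified_precision(hyp, ref, 1))  # 1.0
print(modified_precision(hyp, ref, 2))  # 0.875
```

The hypothesis reverses the meaning, yet every unigram and seven of its eight bigrams still match the reference, so local n-gram statistics barely register the error.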

Disassembling BLEU scores

▶ “Millions of variations on a hypothesis translation [...] receive the same Bleu score” (Callison-Burch, Osborne, & Koehn, 2006, p. 1)
▶ Orejuela seemed quite calm as he was being led to the American plane that would take him to Miami in Florida.
▶ Appeared calm | when | he was | taken | to the American plane | , | which will | to Miami , Florida .
▶ which will | he was | , | when | taken | Appeared calm | to the American plane | to Miami , Florida .
▶ was being led to the | calm as he was | would take | carry him | seemed quite | when | taken

▶ Equal weighting of all items in the reference sentences
▶ Words in the hypothesis sentence that did not appear in the reference sentences can be substituted with anything
▶ BLEU may not be appropriate for comparing systems which employ different translation strategies

The RIBES metric

▶ Adequacy is more important for distant language pairs (Isozaki, Hirao, et al., 2010)
▶ NIST, PER, and TER turn out to be inadequate measures
▶ Solution: rank correlation coefficients compare word ranks in the reference with those in the hypothesis sentences
▶ RIBES: Rank-based Intuitive Bilingual Evaluation Score
▶ Defined for each test sentence; the average score is used for evaluating the entire test corpus (Isozaki et al., 2014)
▶ Significantly penalizes global word order mistakes
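A minimal sketch of the rank-correlation idea, assuming every word occurs exactly once in both sentences so that the correspondence is trivially one-to-one. The function names and the sentence pair are illustrative; the formulas are the textbook definitions of Kendall's τ and Spearman's ρ for a sequence without ties.

```python
from itertools import combinations


def word_ranks(hyp, ref):
    """Reference position of each hypothesis word (words assumed unique)."""
    position = {word: i for i, word in enumerate(ref)}
    return [position[word] for word in hyp]


def kendall_tau(ranks):
    """2 * (share of word pairs kept in reference order) - 1; needs >= 2 ranks."""
    pairs = list(combinations(ranks, 2))
    increasing = sum(1 for a, b in pairs if a < b)
    return 2 * increasing / len(pairs) - 1


def spearman_rho(ranks):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)); needs >= 2 distinct ranks."""
    n = len(ranks)
    # Re-rank to 0..n-1 so the formula also works for a subset of positions.
    order = {r: k for k, r in enumerate(sorted(ranks))}
    d_squared = sum((order[r] - i) ** 2 for i, r in enumerate(ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical pair with the "A because of B => B because of A" error.
ref = "the game was cancelled because of heavy rain".split()
hyp = "heavy rain was cancelled because of the game".split()

ranks = word_ranks(hyp, ref)
print(ranks)                # [6, 7, 2, 3, 4, 5, 0, 1]
print(kendall_tau(ranks))   # about -0.43
print(spearman_rho(ranks))  # about -0.71
```

Both coefficients turn negative for the swapped clauses, whereas a hypothesis in reference order would score 1.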

Refining RIBES

▶ Rank correlation metrics are defined only when there is a one-to-one correspondence between the words of the hypothesis and the reference
▶ In many cases, Spearman’s ρ and Kendall’s τ cannot be determined
▶ Solution:
  ▶ find one-to-one corresponding higher-order n-grams
▶ ρ and τ seem to be better metrics than BLEU
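One possible reading of the n-gram step, sketched under stated assumptions: a word that occurs more than once is aligned through the shortest surrounding n-gram that is unique in both sentences, and words without such an n-gram are dropped. The helper names, the `max_n` limit, and the example are hypothetical, and this is not necessarily the exact procedure of Isozaki, Hirao, et al. (2010).

```python
def occurrences(tokens, gram):
    """Start indices at which the n-gram `gram` occurs in `tokens`."""
    n = len(gram)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == gram]


def unique_ref_position(hyp, ref, i, max_n):
    """Reference position for hyp[i], or None if no n-gram around it is unique."""
    for n in range(1, max_n + 1):
        for offset in range(n):  # where hyp[i] sits inside the n-gram
            start = i - offset
            if start < 0 or start + n > len(hyp):
                continue
            gram = tuple(hyp[start:start + n])
            ref_hits = occurrences(ref, gram)
            if len(ref_hits) == 1 and len(occurrences(hyp, gram)) == 1:
                return ref_hits[0] + offset
    return None


def align(hyp, ref, max_n=4):
    """Rank list over the aligned words; unaligned words are skipped."""
    ranks, used = [], set()
    for i in range(len(hyp)):
        pos = unique_ref_position(hyp, ref, i, max_n)
        # Simplification: a reference position is used at most once.
        if pos is not None and pos not in used:
            used.add(pos)
            ranks.append(pos)
    return ranks


ref = "he said that he was tired".split()
hyp = "he was tired , he said".split()
print(align(hyp, ref))  # [3, 4, 5, 0, 1]
```

Both occurrences of "he" are disambiguated by their bigrams ("he was", "he said"), and the comma, which has no counterpart in the reference, is left out of the rank list.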

▶ ρ and τ are normalized (NSR and NKT)
▶ Precision (P), recall, and F-measure are candidates for reducing the overestimation of NSR and NKT
▶ Precision correlates best with adequacy among these three metrics
▶ New metrics, for 0 ≤ α ≤ 1:
  ▶ $\mathrm{NSR} \cdot P^{\alpha}$
  ▶ $\mathrm{NKT} \cdot P^{\alpha}$
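A small sketch of how the normalized scores and the precision penalty combine, assuming the usual definitions: NSR = (ρ + 1) / 2, NKT = (τ + 1) / 2, and P = number of aligned words divided by the hypothesis length. The function names, the value of α, and the numbers in the example are illustrative only.

```python
def normalize(coefficient):
    """Map a rank correlation from [-1, 1] to [0, 1] (NSR or NKT)."""
    return (coefficient + 1) / 2


def penalized_score(coefficient, n_aligned, hyp_len, alpha):
    """Normalized correlation times P ** alpha, with P = n_aligned / hyp_len."""
    precision = n_aligned / hyp_len
    return normalize(coefficient) * precision ** alpha


# Hypothetical sentence: 5 of 6 hypothesis words were aligned and
# 4 of the 10 aligned word pairs are in reference order, so tau = -0.2.
tau = 2 * 4 / 10 - 1
print(penalized_score(tau, n_aligned=5, hyp_len=6, alpha=0.0))   # NKT alone
print(penalized_score(tau, n_aligned=5, hyp_len=6, alpha=0.25))  # NKT * P^0.25
```

With α = 0 the penalty vanishes and the score is plain NKT; larger α punishes hypotheses in which few words could be aligned, which is where an unpenalized correlation would be overestimated.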

Evaluating RIBES: Results

▶ NTCIR-7 PATMT JP-EN translation data

Figure 5: From Isozaki, Hirao, et al. (2010)

Evaluating RIBES: Discussion

▶ Results for new metrics are comparable to conventional methods even for similar language pairs (European)
▶ Tested on WMT-07 data (Callison-Burch et al., 2007)
▶ With the precision-based penalty, both ρ and τ outperform conventional methods

English vs. Japanese

▶ English is a head-initial language (SVO)
▶ Japanese is a head-final language (SOV). It is also a topic-prominent language
▶ Scrambling of syntactic chunks called bunsetsu is a frequent phenomenon
▶ Deterministic rules fail at some sentence constructions
▶ RIBES performs worse on scrambled sentences

Difficult sentences

(1) Kare wa  kōhi   wo  noma-nai
    He   TOP coffee OBJ drink-NEG
    He doesn’t drink coffee

(2) Kōhi   wa  noma-nai
    Coffee TOP drink-NEG
    I don’t drink coffee

(3) Watashi wa  jikan ga  nai
    I       TOP time  SUB NEG
    I don’t have time

(4) Ima wa  jikan ga  nai
    Now TOP time  SUB NEG
    Now I don’t have time

Reordering Strategies

▶ Conducting target word selection and reordering jointly: phrase-based SMT, hierarchical phrase-based SMT, syntax-based SMT
▶ Pre-ordering: reorder the source language sentence into a target language word order. Then, translate the reordered source word sequence using SMT methods
▶ Post-ordering: the reverse of pre-ordering. First translate the source sentence into target words in source word order, then reorder them into target word order

Single reordering rule: Head Final English

▶ Head Finalization: move syntactic heads to the end of the corresponding syntactic constituents (Isozaki, Sudoh, et al., 2010)
▶ The HPSG parser ENJU outputs syntactic heads
▶ Particle seeds can be generated from the output
▶ Difficulties:
  ▶ The sentence pair is not an exact one-to-one translation
  ▶ Mistakes in ENJU’s tagging or parsing (esp. VBN/VBD and VBZ/NNS mistakes) or in the GIZA++ automatic word alignment
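The reordering rule itself can be sketched on a toy tree. The `Node` class, the hand-annotated head indices, and the example sentence are assumptions for illustration; in the setup described above the heads come from ENJU, and this sketch leaves out particle seeds and any exceptions the actual rule may make.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    children: List[Union["Node", str]]  # sub-constituents or words
    head: int                           # index of the head child


def head_finalize(tree):
    """Return the words with every head moved to the end of its constituent."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for index, child in enumerate(tree.children):
        if index != tree.head:
            words.extend(head_finalize(child))
    # The head child is emitted last, after all of its siblings.
    words.extend(head_finalize(tree.children[tree.head]))
    return words


# Hand-built tree for "John hit a ball": S -> NP VP (head VP),
# VP -> V NP (head V), NP -> Det N (head N).
sentence = Node(
    children=["John", Node(children=["hit", Node(children=["a", "ball"], head=1)], head=0)],
    head=1,
)
print(" ".join(head_finalize(sentence)))  # John a ball hit
```

The output has the verb-final order of the Japanese counterpart, which is what makes a subsequent monotone translation step plausible.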

Rule-based pre-ordering model for JP-EN SMT

▶ Hoshino et al. (2013) put forth four rules:
  ▶ Modify Japanese dependency trees: every head chunk comes first and its dependent children follow
  ▶ Convert SOV into SVO: place V after S, or after O if there is no S. Move V before the rightmost chunk if there is only V
  ▶ Keep the word order of coordinate clauses unchanged and put them into the leftmost slot (punctuation comes into the rightmost slot)
  ▶ For every chunk, swap function and content words → pseudo-prepositional phrases

Figure 6: From Hoshino et al. (2013)

Post-ordering framework for JP-EN translation

▶ Reordering model makes use of syntactic structures (relying on ENJU’s output) (Goto, Utiyama, & Sumita, 2013)
▶ The model is based on:
  ▶ parsing using probabilistic context free grammar (PCFG)
  ▶ inversion transduction grammar (ITG)
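How an ITG-style binary tree fixes the output order can be shown in a few lines. The node labels `straight` and `inverted`, the tuple encoding, and the tree itself are assigned by hand here as assumptions; in the framework described above a parsing model would have to predict them, and that part is not sketched.

```python
def itg_yield(tree):
    """Read off the word order defined by a binary tree with ITG-style labels."""
    if isinstance(tree, str):
        return [tree]
    label, left, right = tree
    left_words, right_words = itg_yield(left), itg_yield(right)
    # "straight" keeps the two spans in order, "inverted" swaps them.
    return left_words + right_words if label == "straight" else right_words + left_words


# Hypothetical head-final word sequence "John a ball hit" with hand-set labels.
tree = ("straight", "John", ("inverted", ("straight", "a", "ball"), "hit"))
print(" ".join(itg_yield(tree)))  # John hit a ball
```

Because every node only chooses between keeping and swapping two adjacent spans, the reordering problem reduces to labelling the nodes of a binary parse.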

Bibliography I

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., & Schroeder, J. (2007). (Meta-)evaluation of machine translation. In Proceedings of the second workshop on statistical machine translation (pp. 136–158). Association for Computational Linguistics.

Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In 11th conference of the European chapter of the Association for Computational Linguistics.

Echizen-ya, H., Ehara, T., Shimohata, S., Fujii, A., Utiyama, M., Yamamoto, M., … Kando, N. (2009). Meta-evaluation of automatic evaluation methods for machine translation using patent translation data in NTCIR-7. In Proceedings of the 3rd workshop on patent translation (pp. 9–16).

Bibliography II

Goto, I., Chow, K.-P., Lu, B., Sumita, E., & Tsou, B. K. (2013). Overview of the patent machine translation task at the NTCIR-10 workshop. In Proceedings of the 10th NTCIR conference (pp. 260–286).

Goto, I., Utiyama, M., & Sumita, E. (2013). Post-ordering by parsing with ITG for Japanese-English statistical machine translation. ACM Transactions on Asian Language Information Processing (TALIP), 12(4), 17:1–17:22.

Hoshino, S., Miyao, Y., Sudoh, K., & Nagata, M. (2013). Two-stage pre-ordering for Japanese-to-English statistical machine translation. In Proceedings of the 6th international joint conference on natural language processing (pp. 1062–1066).

Bibliography III

Isozaki, H., Hirao, T., Duh, K., Sudoh, K., & Tsukada, H. (2010). Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 conference on empirical methods in natural language processing (pp. 944–952).

Isozaki, H., Kouchi, N., & Hirao, T. (2014). Dependency-based automatic enumeration of semantically equivalent word orders for evaluating Japanese translations. In Proceedings of the 9th workshop on statistical machine translation (pp. 287–292).

Isozaki, H., Sudoh, K., Tsukada, H., & Duh, K. (2010). Head finalization: A simple reordering rule for SOV languages. In Proceedings of the joint 5th workshop on statistical machine translation and MetricsMATR (pp. 244–251).

Bibliography IV

Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
