
MT Summit 2013 Poster Boaster Slides: Language-independent Model for Machine Translation Evaluation with Reinforced Factors



Language-independent Model for Machine Translation Evaluation with Reinforced Factors. International Association for Machine Translation, 2013. Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Xiaodong (Samuel) Zeng. Proceedings of the 14th biennial Machine Translation Summit (MT Summit 2013), Nice, France, 2-6 September 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)


Page 1

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng

NLP2CT Lab / Department of Computer and Information Science, University of Macau, Macau

Machine Translation Summit XIV, Nice, France, 2 - 6 September 2013

Conventional machine translation evaluation metrics tend to perform well on certain language pairs but weakly on others.

Some metrics work only on certain language pairs and are not language-independent.

Omitting linguistic information usually yields evaluation results with low correlation to human judgments, while relying on linguistic features or external resources makes metrics complicated and hard to replicate.

A novel language-independent evaluation metric is proposed to address these problems with enhanced factors and (optional) linguistic resources.

Problem and Motivation

The factor $NPosPenal$ denotes the n-gram based position difference penalty, $NPD$ represents the n-gram position difference value, $Length_{hyp}$ denotes the length of the hypothesis sentence, and $|PD_i|$ is the position difference value of the $i$-th token.

We extract the POS sequences of the corresponding word sentences using multilingual parsers and POS taggers, then measure the system-level $hLEPOR_{POS}$ value with the same algorithm that is applied to the lexical form $hLEPOR_{word}$. The final score $hLEPOR_E$ is the combination of the two system-level scores.

Proposed Method

The augmented factors designed in the proposed models are as follows. The parameters $c$ and $r$ represent the lengths of the candidate and reference sentences, respectively.


Enhanced Length Penalty:

$$ELP = \begin{cases} e^{1 - r/c} & : c < r \\ e^{1 - c/r} & : c \ge r \end{cases} \qquad (1)$$
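A minimal Python sketch of Eq. (1), assuming $c$ and $r$ are token counts (the function name is illustrative, not from the released tool):

```python
import math

def enhanced_length_penalty(c: int, r: int) -> float:
    """Enhanced length penalty, Eq. (1): e^(1 - r/c) if c < r,
    else e^(1 - c/r), where c and r are the candidate and
    reference sentence lengths in tokens."""
    if c < r:
        return math.exp(1 - r / c)
    return math.exp(1 - c / r)
```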

Precision and Recall:

$$Precision = \frac{Aligned_{num}}{Length_{output}} \qquad (2)$$

$$Recall = \frac{Aligned_{num}}{Length_{reference}} \qquad (3)$$

HPR: harmonic mean of precision and recall.

$$HPR = \frac{(\alpha + \beta) \times Precision \times Recall}{\alpha \times Precision + \beta \times Recall} \qquad (4)$$

N-gram Position Difference Penalty:

$$NPosPenal = e^{-NPD} \qquad (5)$$

$$NPD = \frac{1}{Length_{hyp}} \sum_{i=1}^{Length_{hyp}} |PD_i| \qquad (6)$$
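A sketch of Eqs. (5)-(6), assuming the $|PD_i|$ values have already been produced by the n-gram alignment (the alignment step itself is not shown):

```python
import math

def n_pos_penal(pd_values: list[float], len_hyp: int) -> float:
    """N-gram position difference penalty, Eqs. (5)-(6).
    pd_values holds |PD_i| for each aligned hypothesis token, i.e.
    |pos_in_hyp/len_hyp - pos_in_ref/len_ref|."""
    npd = sum(pd_values) / len_hyp    # Eq. (6)
    return math.exp(-npd)             # Eq. (5)
```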

Reference: Two(1/7) clever(2/7) boys(3/7) sat(4/7) in(5/7) two(6/7) chairs(7/7)

Output: Two(1/6) chairs(2/6) and(3/6) two(4/6) clever(5/6) boys(6/6)

$$NPD = \frac{1}{6} \times \left( \left|\frac{1}{6}-\frac{6}{7}\right| + \left|\frac{2}{6}-\frac{7}{7}\right| + \left|\frac{4}{6}-\frac{1}{7}\right| + \left|\frac{5}{6}-\frac{2}{7}\right| + \left|\frac{6}{6}-\frac{3}{7}\right| \right) = \frac{1}{2}$$
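The worked example can be checked with the inputs of the sketch above; the unaligned output token "and" (3/6) contributes no $|PD|$ term:

```python
import math

# |PD_i| terms for the five aligned tokens in the example
pd = [abs(1/6 - 6/7), abs(2/6 - 7/7), abs(4/6 - 1/7),
      abs(5/6 - 2/7), abs(6/6 - 3/7)]
npd = sum(pd) / 6                 # = 0.5, as in the worked example
print(math.exp(-npd))             # NPosPenal = e^(-0.5) ~ 0.6065
```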

Sentence level score:

$$hLEPOR = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}} \qquad (7)$$

System level score:

$$hLEPOR = \frac{1}{Sent_{num}} \sum_{i=1}^{Sent_{num}} hLEPOR_i \qquad (8)$$

Enhanced version with linguistic information:

$$hLEPOR_E = \frac{1}{w_{hw} + w_{hp}} \times \left( w_{hw} \, hLEPOR_{word} + w_{hp} \, hLEPOR_{POS} \right) \qquad (9)$$
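And a sketch of Eq. (9), combining the word-level and POS-level system scores (names are illustrative):

```python
def hlepor_enhanced(hlepor_word: float, hlepor_pos: float,
                    w_hw: float, w_hp: float) -> float:
    """Enhanced score hLEPOR_E, Eq. (9): weighted combination of the
    word-level and POS-level system scores."""
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)
```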

Page 2


Experiments

                     CS-EN  DE-EN  ES-EN  FR-EN  EN-CS  EN-DE  EN-ES  EN-FR
HPR:ELP:NPP (word)   7:2:1  3:2:1  7:2:1  3:2:1  3:2:1  1:3:7  3:2:1  3:2:1
HPR:ELP:NPP (POS)    N/A    3:2:1  N/A    3:2:1  N/A    7:2:1  N/A    3:2:1
α:β (word)           1:9    9:1    1:9    9:1    9:1    9:1    9:1    9:1
α:β (POS)            N/A    9:1    N/A    9:1    N/A    9:1    N/A    9:1
w_hw : w_hp          N/A    1:9    N/A    9:1    N/A    1:9    N/A    9:1

Table 1. Values of weights used in other-EN and EN-other evaluations.

          CS-EN  DE-EN  ES-EN  FR-EN  EN-CS  EN-DE  EN-ES  EN-FR  Mean
hLEPOR_E  0.93   0.86   0.88   0.92   0.56   0.82   0.85   0.83   0.83
MPF       0.95   0.69   0.83   0.87   0.72   0.63   0.87   0.89   0.81
ROSE      0.88   0.59   0.92   0.86   0.65   0.41   0.90   0.86   0.76
METEOR    0.93   0.71   0.91   0.93   0.65   0.30   0.74   0.85   0.75
BLEU      0.88   0.48   0.90   0.85   0.65   0.44   0.87   0.86   0.74
TER       0.83   0.33   0.89   0.77   0.50   0.12   0.81   0.84   0.64

Table 2. Spearman correlation with human judgments on WMT11.

                    EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Avg
LEPOR_v3.1          .91    .94    .91    .76    .77    .86
nLEPOR_baseline     .92    .92    .90    .82    .68    .85
SIMPBLEU_RECALL     .95    .93    .90    .82    .63    .84
SIMPBLEU_PREC       .94    .90    .89    .82    .65    .84
NIST-mteval-inter   .91    .83    .84    .79    .68    .81
Meteor              .91    .88    .88    .82    .55    .81
BLEU-mteval-inter   .89    .84    .88    .81    .61    .80
BLEU-moses          .90    .82    .88    .80    .62    .80
BLEU-mteval         .90    .82    .87    .80    .62    .80
CDER-moses          .91    .82    .88    .74    .63    .80
NIST-mteval         .91    .79    .83    .78    .68    .79
PER-moses           .88    .65    .88    .76    .62    .76
TER-moses           .91    .73    .78    .70    .61    .75
WER-moses           .92    .69    .77    .70    .61    .74
TerrorCat           .94    .96    .95    n/a    n/a    .95
SEMPOS              n/a    n/a    n/a    .72    n/a    .72
ACTa                .81    -.47   n/a    n/a    n/a    .17
ACTa5+6             .81    -.47   n/a    n/a    n/a    .17

Table 3. System-level Pearson correlation scores on WMT13 (EN-other).

                    FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Avg
Meteor              .98    .96    .97    .99    .84    .95
SEMPOS              .95    .95    .96    .99    .82    .93
Depref-align        .97    .97    .97    .98    .74    .93
Depref-exact        .97    .97    .96    .98    .73    .92
SIMPBLEU_RECALL     .97    .97    .96    .94    .78    .92
UMEANT              .96    .97    .99    .97    .66    .91
MEANT               .96    .96    .99    .96    .63    .90
CDER-moses          .96    .91    .95    .90    .66    .88
SIMPBLEU_PREC       .95    .92    .95    .91    .61    .87
LEPOR_v3.1          .96    .96    .90    .81    .71    .87
nLEPOR_baseline     .96    .94    .94    .80    .69    .87
BLEU-mteval-inter   .95    .92    .94    .90    .61    .86
NIST-mteval-inter   .94    .91    .93    .84    .66    .86
BLEU-moses          .94    .91    .94    .89    .60    .86
BLEU-mteval         .95    .90    .94    .88    .60    .85
NIST-mteval         .94    .90    .93    .84    .65    .85
TER-moses           .93    .87    .91    .77    .52    .80
WER-moses           .93    .84    .89    .76    .50    .78
PER-moses           .84    .88    .87    .74    .45    .76
TerrorCat           .98    .98    .97    n/a    n/a    .98

Table 4. System-level Pearson correlation scores on WMT13 (other-EN).

Training data (WMT08): 2,028 sentences for each document, English vs Spanish / German / French / Czech.

Testing data (WMT11): 3,003 sentences for each document, English vs Spanish / German / French / Czech.

Testing data (WMT13): 3,000 sentences for each document, English vs Spanish / German / French / Czech / Russian.