
Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 702–722, Brussels, Belgium, October 31 – November 1, 2018. © 2018 Association for Computational Linguistics

https://doi.org/10.18653/v1/W18-6452

Findings of the WMT 2018 Shared Task on Quality Estimation

Lucia Specia and Frédéric Blain
Department of Computer Science
University of Sheffield, UK
{l.specia,f.blain}@sheffield.ac.uk

Varvara Logacheva
Neural Networks and Deep Learning Lab
MIPT, Moscow, Russia
[email protected]

Ramón F. Astudillo
L2F, INESC-ID-Lisboa
Lisbon, Portugal
[email protected]

André Martins
Unbabel & Instituto de Telecomunicações
Lisbon, Portugal
[email protected]

Abstract

We report the results of the WMT18 shared task on Quality Estimation, i.e. the task of predicting the quality of the output of machine translation systems at various granularity levels: word, phrase, sentence and document. This year we include four language pairs, three text domains, and translations produced by both statistical and neural machine translation systems. Participating teams from ten institutions submitted a variety of systems to different task variants and language pairs.

1 Introduction

This shared task builds on its previous six editions to further examine automatic methods for estimating the quality of machine translation (MT) output at run-time, without the use of reference translations. It includes the (sub)tasks of word-level, phrase-level, sentence-level and document-level estimation. In addition to advancing the state of the art at all prediction levels, our goals include:

• To study the performance of quality estimation approaches on the output of neural MT systems. We do so by providing datasets for two language pairs where source segments were translated by both statistical phrase-based and neural MT systems.

• To study the predictability of missing words in the MT output. To do so, for the first time we provide data annotated for such errors at training time.

• To study the predictability of source words that lead to errors in the MT output. To do so, for the first time we provide source segments annotated for such errors at the word level.

• To study the effectiveness of manually assigned labels for phrases. For that we provide a dataset where each phrase was annotated by human translators.

• To investigate the utility of detailed information logged during post-editing. We do so by providing post-editing time, keystrokes, as well as post-editor ID.

• To study quality prediction for documents from errors annotated at word level with added severity judgements. This is done using a new corpus manually annotated with a fine-grained error taxonomy, from which document-level scores are derived.

This year’s shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. Tasks make use of large datasets produced either from post-editions or annotations by professional translators, or from direct human annotations. The following text domains are available for different languages and tasks: information technology (IT), life sciences, and product titles and descriptions on sports and outdoor activities. In-house statistical and neural MT systems were built to produce translations for the first two domains, while an online system was used for the third domain.

The four tasks are defined as follows: Task 1 aims at predicting post-editing effort at sentence level (Section 5); Task 2 aims at predicting words that need editing, as well as missing words and incorrect source words (Section 6); Task 3 aims at predicting phrases that need editing, as well as missing phrases and incorrect source phrases (Section 7); and Task 4 (Section 8) aims at predicting a score for an entire document as a function of the proportion of incorrect words in the document, weighted by the severity of the different errors.


Five datasets and language pairs are used for different tasks (Section 4): English-German (Tasks 1, 2) and English-Czech (Tasks 1, 2) in the IT domain; English-Latvian (Tasks 1, 2) and German-English (Tasks 1, 2, 3), both in the life sciences domain; and English-French (Task 4), with product titles and descriptions in the sports and outdoor activities domain.

Participants are provided with a baseline set of features for each task, and a software package to extract these and other quality estimation features and perform model learning (Section 2). Participants (Section 3) could submit up to two systems for each task and language pair. A discussion of the main goals and findings from this year’s task is given in Section 9.

2 Baseline systems

Sentence-level baseline system: For Task 1, QUEST++1 (Specia et al., 2015) was used to extract 17 MT system-independent features from the source and translation (target) files and parallel corpora:

• Number of tokens in the source and target sentences.
• Average source token length.
• Average number of occurrences of the target word within the target sentence.
• Number of punctuation marks in source and target sentences.
• Language model (LM) probability of source and target sentences based on models built using the source or target sides of the parallel corpus used to train the SMT system.
• Average number of translations per source word in the sentence as given by IBM Model 1 extracted using the SMT parallel corpus, and thresholded such that P(t|s) > 0.2 or P(t|s) > 0.01.
• Percentage of unigrams, bigrams and trigrams in frequency quartiles 1 (lower frequency words) and 4 (higher frequency words) in the source language, extracted from the source side of the SMT parallel corpus.
• Percentage of unigrams in the source sentence seen in the source side of the SMT parallel corpus.

These features were used to train a Support Vector Regression (SVR) algorithm using a Radial Basis Function (RBF) kernel within the SCIKIT-LEARN toolkit.2 The γ, ε and C parameters were optimised via grid search with 5-fold cross-validation on the training set, resulting in γ=0.01, ε=0.0825, C=20. This baseline system has been used consistently as the baseline for all editions of the sentence-level task (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015, 2016, 2017), and has proved strong enough for predicting various forms of post-editing effort across a range of language pairs and text domains for statistical MT systems. This year it is also benchmarked on neural MT outputs.

1 https://github.com/ghpaetzold/questplusplus
2 http://scikit-learn.org/
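The sketch below illustrates, under the assumption that the 17 features have already been extracted into a feature matrix, how such an RBF-kernel SVR can be trained with scikit-learn; it is a minimal illustration rather than the official QUEST++ pipeline, and the placeholder arrays X_train and y_train stand in for the real features and HTER labels.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# placeholders: 100 sentences, 17 baseline features each, with HTER labels in [0, 1]
X_train = np.random.rand(100, 17)
y_train = np.random.rand(100)

# grid search over gamma, epsilon and C with 5-fold cross-validation,
# mirroring the procedure described above
param_grid = {"gamma": [0.001, 0.01, 0.1],
              "epsilon": [0.05, 0.0825, 0.1],
              "C": [1, 10, 20]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # the task baseline settled on gamma=0.01, epsilon=0.0825, C=20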

Word-level baseline system: For Task 2, the baseline features were extracted with the MARMOT tool (Logacheva et al., 2016). These are 28 features that have been deemed the most informative in previous research on word-level QE, mostly inspired by (Luong et al., 2014). This is the same baseline system used in WMT17:

• Word count in the source and target sentences, and source and target token count ratio. Although these features are sentence-level (i.e. their values will be the same for all words in a sentence), the length of a sentence might influence the probability of a word being wrong.
• Target token, its left and right contexts of one word.
• Source word aligned to the target token, its left and right contexts of one word. The alignments were given by the SMT system that produced the automatic translations.
• Boolean dictionary features: target token is a stop word, a punctuation mark, a proper noun, or a number.
• Target language model features:
  – The order of the highest order ngram which starts and ends with the target token.
  – The order of the highest order ngram which starts and ends with the source token.
  – The part-of-speech (POS) tags of the target and source tokens.
  – Backoff behaviour of the ngrams (ti−2, ti−1, ti), (ti−1, ti, ti+1), (ti, ti+1, ti+2), where ti is the target token (backoff behaviour is computed as described by (2011)).

In addition to that, six new features were included which contain combinations of other features, and which proved useful in (Kreutzer et al., 2015; Martins et al., 2016):

• Target word + left context.
• Target word + right context.
• Target word + aligned source word.
• POS of target word + POS of aligned source word.
• Target word + left context + source word.
• Target word + right context + source word.

The baseline system models the task as a sequence prediction problem using the Linear-Chain Conditional Random Fields (CRF) algorithm within the CRFSuite tool.3 The model was trained using the passive-aggressive optimisation algorithm.
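As a rough illustration of this setup (the shared task used the CRFSuite command-line tool; the sketch below assumes the python-crfsuite bindings instead), a linear-chain CRF can be trained with the passive-aggressive ("pa") algorithm on OK/BAD sequences as follows; the toy feature dictionaries stand in for the MARMOT features listed above.

import pycrfsuite

# toy feature dictionaries for the tokens of one translated sentence
xseq = [{"token": "this", "left": "<s>",  "right": "is"},
        {"token": "is",   "left": "this", "right": "a"},
        {"token": "a",    "left": "is",   "right": "tset"},
        {"token": "tset", "left": "a",    "right": "</s>"}]
yseq = ["OK", "OK", "OK", "BAD"]

# train a linear-chain CRF with the passive-aggressive algorithm
trainer = pycrfsuite.Trainer(algorithm="pa", verbose=False)
trainer.append(xseq, yseq)
trainer.train("word_level_qe.crfsuite")

# tag a (here, the same) sequence with the trained model
tagger = pycrfsuite.Tagger()
tagger.open("word_level_qe.crfsuite")
print(tagger.tag(xseq))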

We note that this baseline system was only used to predict OK/BAD classes for existing words in the MT output. No baseline system was provided for predicting missing words or erroneous source words.

Phrase-level baseline system: The phrase-level system is identical to the one used in last year’s shared task. The phrase-level features were also extracted with MARMOT, but they are different from the word-level features. They are based on the sentence-level features in QUEST++.4 These are the so-called “black-box” features – features that do not use internal information from the MT system. The baseline uses the following 72 features:

• Source phrase frequency features:
  – average frequency of ngrams (unigrams, bigrams, trigrams) in different quartiles of frequency (the low and high frequency ngrams) in the source side of the SMT parallel corpus.
  – percentage of distinct source ngrams (unigrams, bigrams, trigrams) seen in the source side of the SMT parallel corpus.

3 http://www.chokkan.org/software/crfsuite/
4 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox

• Translation probability features:
  – average number of translations per source word in the phrase as given by IBM Model 1 extracted using the SMT parallel corpus (with different translation probability thresholds: 0.01, 0.05, 0.1, 0.2, 0.5).
  – average number of translations per source word in the phrase as given by IBM Model 1 extracted using the SMT parallel corpus (with different translation probability thresholds: 0.01, 0.05, 0.1, 0.2, 0.5), weighted by the frequency of each word in the source side of the SMT parallel corpus.

• Punctuation features:
  – difference between the numbers of various punctuation marks (periods, commas, colons, semicolons, question and exclamation marks) in the source and the target phrases.
  – difference between the numbers of various punctuation marks normalised by the length of the target phrase.
  – percentage of punctuation marks in the target or source phrases.

• Language model features:
  – log probability of the source or target phrases based on models built using the source or target sides of the parallel corpus used to train the SMT system.
  – perplexity of the source and the target phrases using the same models as above.

• Phrase statistics:
  – lengths of the source or target phrases.
  – ratio between the source and target phrase lengths.
  – average length of tokens in the source or target phrases.
  – average occurrence of the target word within the target phrase.

• Alignment features:
  – number of unaligned target words, using the word alignment provided by the SMT decoder.
  – number of target words aligned to more than one source word.
  – average number of alignments per word in the target phrase.

• Part-of-speech features:
  – percentage of content words in the source or target phrases.
  – percentage of words of a particular part-of-speech tag (verb, noun, pronoun) in the source or target phrases.
  – ratio of the numbers of words of a particular part of speech (verb, noun, pronoun) between the source and target phrases.
  – percentage of numbers and alphanumeric tokens in the source or target phrases.
  – ratio between the percentage of numbers and alphanumeric tokens in the source and target phrases.

Analogously to the baseline word-level system, we treat phrase-level QE as a sequence labelling task, and model it using the CRF algorithm from the CRFSuite toolkit with the passive-aggressive optimisation algorithm.

Once more, this baseline system was only used to predict OK/BAD classes for existing phrases in the MT output. No baseline system was provided for predicting missing phrases or erroneous source phrases.

3 Participants

Table 1 lists all participating teams submitting systems to any of the tasks. Each team was allowed up to two submissions for each task variant and language pair. In the descriptions below, participation in specific tasks is denoted by a task identifier (T1 = Task 1, T2 = Task 2, T3 = Task 3, T4 = Task 4).

CMU-LTI (T2):

The CMU-LTI team proposes a Contextual Encoding model for QE. The model consists of three major parts that encode the local and global context information for each target word. The first part uses an embedding layer to represent words and their POS tags in both languages. The second part leverages a one-dimensional convolution layer to integrate local context information for each target word. The third part applies a stack of feed-forward and recurrent neural networks to further encode the global context in the sentence before making the predictions. Syntactic features, such as ngrams, are then integrated into the final feed-forward layer of the neural model. This model achieves competitive results on the English-Czech and English-Latvian word-level QE tasks.

JU-USAAR (T2):

JU-USAAR presents two approaches to word-level QE: (i) a Bag-of-Words (BoW) model, and (ii) a Paragraph Vector (Doc2Vec) model (Le and Mikolov, 2014). In the BoW model, bags of words are prepared from source sentences for each target word appearing in both the MT and PE output in the training data. For every target word appearing in the MT output in the development set, the cosine similarity between the corresponding source sentence and the bag of words for the same target word is computed. From this result, a threshold (for the target word) is defined above which the word is retained (i.e., considered ‘OK’). In the Doc2Vec-based approach, for each target word appearing in both MT and PE output in the training data, two document vectors are prepared from (i) the corresponding source sentences and (ii) the bag of words (as in the BoW model) of the target word. Next, the similarity between these two document vectors for every target word is computed. From the Doc2Vec similarity score and the corresponding PE decision (i.e., whether or not the target word is retained in the PE in the training dataset), a system-level threshold is defined. For the test set sentences, if the Doc2Vec similarity score for a target word exceeds this threshold value, then the target word is labelled as ‘OK’; otherwise it is labelled as ‘BAD’.

MQE (T1):

The Vicomtech team submitted two approaches. uMQE is an unsupervised minimalist approach based on two simple measures of accuracy and fluency, respectively. Accuracy is computed via overlapping lexical translation bags of words, with a set expansion mechanism based on longest common prefixes and surface-defined named entities. Fluency is computed by taking the inverse of cross-entropy, according to an in-domain language model. Both measures are combined via simple arithmetic means on rescaled values, i.e., no machine learning is used.


ID        Participating team
CMU-LTI   Carnegie Mellon University, US (Hu et al., 2018)
JU-USAAR  Jadavpur University, India & University of Saarland, Germany (Basu et al., 2018)
MQE       Vicomtech, Spain (Etchegoyhen et al., 2018)
QEbrain   Alibaba Group Inc, US (Wang et al., 2018)
RTM       Referential Translation Machines, Turkey (Biçici, 2018)
SHEF      University of Sheffield, UK (Ive et al., 2018b)
TSKQE     University of Hamburg (Duma and Menzel, 2018)
UAlacant  University of Alacant, Spain (Sánchez-Martínez et al., 2018)
UNQE      Jiangxi Normal University, China
UTartu    University of Tartu, Estonia (Yankovskaya et al., 2018)

Table 1: Participants in the WMT18 Quality Estimation shared task.

Since it is unsupervised, the method can only be meaningfully evaluated on the ranking task. sMQE uses the same two features as uMQE, but with supervision: a Support Vector Regressor based on these two features is trained on the available data and used to predict QE scores.

QEbrain (T1, T2):

QEbrain uses a conditional target language model as a robust feature extractor with a novel bidirectional transformer which is pre-trained on a large parallel corpus filtered to contain “in-domain like” sentences. For QE inference, the feature extraction model can produce not only a high-level joint latent semantic representation of the source and the machine translation, but also real-valued measurements of possible erroneous tokens based on the prior knowledge learned from the parallel data. More specifically, it uses the multi-head self-attention mechanism and transformer neural networks (Vaswani et al., 2017) to build the language model. It contains one transformer encoder for the source and a bidirectional transformer encoder for the target. After the feature extraction model is trained, the features are extracted, combined with human-crafted features from the QE baseline system, and fed into a Bi-LSTM predictive model for QE. A greedy ensemble selection method is used to decrease the individual model errors and increase model diversity. The Bi-LSTM QE model is trained on the official QE data plus artificially generated data and fine-tuned with only the official WMT18 QE data.

RTM (T1, T2, T3, T4):

These submissions build on the previous year’s Referential Translation Machine (RTM) approach (Biçici, 2017). RTMs predict data translation between the instances in the training set and the test set using interpretants, data close to the task instances. Interpretants provide context for the prediction task and are used during the derivation of features measuring the closeness of the test sentences to the training data and the difficulty of translating them, and to identify translation acts between any two datasets for building prediction models. Task-specific quality prediction RTM models are built using the WMT News translation task corpora, taking MT models as a black box and predicting translation scores independently of the MT model. Multiple machine learning techniques are used and averaged based on their training set performance for label prediction. For sequence classification tasks (T2 and T3), Global Linear Models with dynamic learning (Biçici, 2013) are used.

SHEF (T1, T2, T3, T4):

SHEF submitted two systems per task variant: SHEF-PT and SHEF-bRNN. SHEF-PT is based on a re-implementation of the POSTECH system of (Kim et al., 2017); SHEF-bRNN uses a bidirectional recurrent neural network (bRNN) (Ive et al., 2018a). PT systems are pre-trained using in-domain corpora provided by the organisers. bRNN systems use two encoders to learn representations of <source, MT> sentence pairs. These representations are used directly to make word-level predictions. A weighted sum over word representations, as defined by an attention mechanism, is used to make sentence-level predictions. For phrase level, a standard attention-based neural MT architecture is used. Different parts of the source sentence are attended to produce MT word vectors. Phrase-level predictions are based on representations computed as the sum of their word vectors. For predicting source tags, the source and MT inputs to the models are swapped. The document-level architecture wraps the sentence-level PT and bRNN architectures. PT systems are pre-trained using either additional in-domain or out-of-domain Europarl data. For the multi-task learning system (SHEF-mtl), the weights of the sentence-level modules are pre-trained to predict sentence MQM scores.

TSKQE (T1):

The TSKQE submissions represent an extension of the previous UHH-STK submissions to the WMT17 QE shared task, which combine the power of sequence and tree kernels applied to source segments, candidate translations and back-translations of the MT output into the source language. In addition, in order to predict the HTER scores, one of the current submissions also explores pseudo-references, which were obtained by translating the source sentences into the target language using an online MT system. The sequence kernels were applied to the tokenised data, while tree kernels were applied to dependency trees.

UAlacant (T1, T2):

The UAlacant submissions use phrase tables from OPUS5 and a two-hidden-layer feed-forward neural network for word-level MT QE. Phrase tables are used to extract features for each word and gap in the machine-translated segment for which quality is estimated. These features are then used together with the baseline features for predicting the need for a deletion or an insertion. The neural network takes as input not only the features for the word and the gap on which a decision is to be made, but also the features of the surrounding gaps and words, in a sliding-window fashion within a context window of size three. The predictions made at the word level allow an approximate HTER score to be obtained, which is used for the submissions to the sentence-level task.

5 http://opus.nlpl.eu/

UNQE (T1):

The UNQE submissions employ a unified neural network architecture (UNQE) for sentence-level QE (Li et al., 2018). The approach combines a bidirectional RNN encoder-decoder with an attention-mechanism sub-network and an RNN into a single large neural network, which extracts quality vectors for the translation outputs through the bidirectional RNN encoder-decoder and predicts the HTER value of the translation output with the RNN. The input text goes through tokenisation, true-casing and sub-word unit segmentation. The models are pre-trained with a large parallel bilingual corpus and fine-tuned with the training data of the sentence-level QE shared task. The results submitted are averages of the predicted HTER scores under different dimension settings.

UTartu (T1):

UTartu proposes two methods for the sentence-level task. The first method uses the attention weights of a neural MT system applied to each sentence pair to compute the probability of the output sentence under the model (forced decoding). The confidence of the model is computed via metrics of the average entropy of the attention weights per input/output token. The second method computes the bleu2vec metric, which extends BLEU with token or n-gram embeddings, but here the metric is made cross-lingual by means of an unsupervised cross-lingual mapping between the source and target language embedding spaces. Three versions of the resulting metric are used: one based on 3-grams, one with tokens (unigrams) and one with byte-pair encoded sub-words (also unigrams). Both submissions use the 17 standard black-box features implemented in QuEst. QuEst+Attention combines them with the first approach and QuEst+Att+CrEmb3 combines QuEst and both approaches together.


4 Datasets

This year we further expand the datasets used in WMT17 by adding: more instances (see Table 2), more languages (four language pairs), more MT architectures (neural and statistical MT), and different types of annotation (manual and extracted from manual post-editing). In addition, new data was collected and provided for Task 4, on a fifth language pair and third text domain.

4.1 Tasks 1 and 2

The initial data was collected as part of the QT21 project6 and is fully described in (Specia et al., 2017). However, for all language pairs and MT system types, we filtered this data to remove most cases with no edits performed. A skewed distribution towards good quality translations has been shown to be a problem in previous years, and is even more critical with NMT outputs, where up to about half of the MT sentences require no post-editing at all. We kept only a small proportion of HTER=0 sentences in the training, development and test sets.

The structure used for the data has been the same since WMT15. Each data instance consists of (i) a source sentence, (ii) its automatic translation into the target language, (iii) the manually post-edited version of the automatic translation, and (iv) one or more post-editing effort scores as labels. Professional post-edits are used to extract labels for the two different levels of granularity (word and sentence). Table 2 shows the various resulting datasets for English-German (EN-DE), German-English (DE-EN), English-Latvian (EN-LV) and English-Czech (EN-CS), for both statistical (SMT) and neural (NMT) outputs.

English-German and English-Czech sentences are from the IT domain and were translated by an in-house phrase-based SMT system, and in addition by an in-house encoder-decoder attention-based NMT system for English-German. We note that the original dataset sizes for these languages were 30,000 sentences in total for English-German (per MT system type), and 45,000 for English-Czech. The large reduction in the NMT version of the English-German data indicates the high quality of the NMT system used to produce these sentences: a large number of sentences was filtered out for having undergone no edits by translators.

6 http://www.qt21.eu/

German-English and English-Latvian sentences are from the life sciences (pharmaceutical) domain and were translated by an in-house phrase-based SMT system, and in addition by an in-house encoder-decoder attention-based NMT system for English-Latvian. The original sentence numbers for these languages were 45,000 and 20,738, respectively (per MT system type).

4.2 Task 3

This task uses a subset of the German-English SMT data from Task 1 (5,921 sentences for training, 1,000 for development and 543 for test) where each phrase (as produced by the SMT decoder) has been annotated (as a phrase) by humans with four labels (see Section 7). This subset was selected after post-editing by filtering out translations with HTER=0 and translations with HTER of 0.30 and above, and then randomly selecting a subset large enough while fitting the annotation budget. The latter criterion was used to rule out sentences with too many errors, since these are generally too hard or impossible for humans to annotate for errors.

We used BRAT7 to perform the phrase labelling. The annotator – a professional translator – was given the translations to annotate, along with their respective source sentences. We provided them with a preset environment where all translations were pre-labelled at phrase level beforehand as OK. The annotator’s task was then to change the labels of the incorrect phrases. The labelling was done following a ‘pessimistic’ approach, where we requested the annotator to only consider a phrase to be OK if all its words were OK. This task has two variants, as we describe later: Task3a, where a phrase annotation is propagated to all of its words and the task is framed as a word-level prediction task; and Task3b, where prediction is done at the phrase level. Table 3 shows the statistics of the resulting datasets for these variants of the task.

Since the data used for this task is a subset of the dataset used for Task 1, the test sentences are also a subset of the Task 1 test set.

4.3 Task 4

The document-level task data consists of short product descriptions translated from English into French, extracted from the Amazon Product Reviews dataset (McAuley et al., 2015; He and McAuley, 2016).8

7 https://brat.nlplab.org/

                 Train.                    Dev.                      Test
Language pair    # Sentences   # Words     # Sentences   # Words     # Sentences   # Words
DE-EN            25,963        493,010     1,000         18,817      1,254         23,522
EN-DE-SMT        26,273        442,074     1,000         16,565      1,926         32,151
EN-DE-NMT        13,442        234,725     1,000         17,669      1,023         17,649
EN-LV-SMT        11,251        225,347     1,000         20,588      1,315         26,661
EN-LV-NMT        12,936        258,125     1,000         19,791      1,448         28,945
EN-CS            40,254        728,815     1,000         18,315      1,920         34,606

Table 2: Statistics of the datasets used for Tasks 1 and 2: total number of (source) sentences and words (after tokenisation) for training, development and test for each language pair and MT system type.

Task3a    # Sentences   # Words     # BAD
Train.    5,921         126,508     35,532
Dev.      1,000         28,710      6,153
Test      543           7,464       3,089

Task3b    # Sentences   # Phrases   # BAD
Train.    5,921         50,834      10,451
Dev.      1,000         8,566       1,795
Test      543           4,391       868

Table 3: Statistics of the data used for Task 3. Number of sentences, phrases, words and BAD labels for training, development and test.

          # Documents   # Sentences   # Words
Train.    1,000         6,003         129,099
Dev.      200           1,301         28,071
Test      269           1,652         39,049

Table 4: Statistics of the data used for Task 4. Number of documents, sentences and (target) words for training, development and test.

More specifically, the data is a selection of Sports and Outdoors product titles and descriptions in English which have been machine translated into French using a state-of-the-art online neural MT system. The most popular products (those with more reviews) were chosen. This data poses interesting challenges for machine translation: titles and descriptions are often short and not always a complete sentence. Spans covering one or more tokens were annotated with error labels following a fine-grained error taxonomy, as described in more detail in Section 8. The dataset statistics are presented in Table 4. This is the largest collection with manually annotated word-level errors released to date.

8http://jmcauley.ucsd.edu/data/amazon/.

5 Task 1: Predicting sentence-level quality

This task consists in scoring (and ranking) translation sentences according to the proportion of their words that need to be fixed. HTER is used as the quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version.

Labels Three labels were available: the percentage of edits needed to fix the sentence (HTER) (primary label), post-editing time in seconds, and counts of various types of keystrokes. The PET tool (Aziz et al., 2012)9 was used to collect various types of information during post-editing. HTER labels were computed using the TERCOM tool10 with default settings (tokenised, case insensitive, exact matching only), with scores capped to 1.
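As a minimal illustration of how the HTER label relates to the TER computation (TERCOM reports the number of edits and the length of the post-edited reference; the helper below is an assumed simplification, not the official scripts):

def hter(num_edits, num_pe_tokens):
    # proportion of post-edited tokens that had to be changed, capped at 1
    return min(1.0, num_edits / num_pe_tokens)

print(hter(num_edits=3, num_pe_tokens=12))  # 0.25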

Evaluation Evaluation was performed against the true HTER label and/or ranking, using the following metrics:

• Scoring: Pearson’s r correlation score (primary metric, official score for ranking system submissions), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

• Ranking: Spearman’s ρ rank correlation.

Statistical significance on Pearson r was computed using the Williams test.11
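A minimal sketch of these scoring and ranking metrics, assuming gold and predicted HTER values are available as plain Python lists (this is only an illustration, not the official evaluation script):

import numpy as np
from scipy.stats import pearsonr, spearmanr

def task1_metrics(gold, pred):
    gold, pred = np.asarray(gold, dtype=float), np.asarray(pred, dtype=float)
    return {
        "pearson_r": pearsonr(gold, pred)[0],      # primary scoring metric
        "mae": float(np.mean(np.abs(gold - pred))),
        "rmse": float(np.sqrt(np.mean((gold - pred) ** 2))),
        "spearman_rho": spearmanr(gold, pred)[0],  # primary ranking metric
    }

print(task1_metrics([0.0, 0.2, 0.5, 1.0], [0.1, 0.25, 0.4, 0.8]))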

Results For Task 1, Tables 5, 6, 7 and 8 summarise the results for English–German, German–English, English–Latvian and English–Czech, respectively, ranking participating systems from best to worst using Pearson’s r correlation as the primary key.

9 https://github.com/ghpaetzold/PET
10 https://github.com/jhclark/tercom
11 https://github.com/ygraham/mt-qe-eval


Spearman’s ρ correlation scores should be used to rank systems for the ranking variant of the evaluation.

The top two systems for this task, the QEBrain and UNQE models, show a large performance gap with respect to the rest of the systems, for both SMT and NMT data. It is interesting to note that both systems outperform the SHEF-PT system by a large margin. SHEF-PT is a reimplementation of the POSTECH system, which showed the top performance in 2017.

6 Task 2: Predicting word-level quality

This task evaluates the extent to which we can detect word-level errors in MT output. Often the overall quality of a translated segment is significantly harmed by specific errors in a small number of words. As in previous years, each token of the target sentence is labelled as OK/BAD based on an available post-edited sentence. In addition to this, this year we also took into consideration word omission errors and the detection of words in the source related to target-side errors. These types of errors become particularly relevant in the context of NMT systems. The code to produce this new set of tags from any prior WMT corpora is available for download.12

Target word labels As in previous years, the binary labels for each target token (OK and BAD) were derived automatically by aligning each machine-translated sentence with its post-edited counterpart. The alignment at token level was performed using the TERCOM tool. Default settings were used and shifts were disabled. Target tokens originating from insertion or substitution errors were labelled as BAD. All other tokens were labelled as OK.
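The following is a rough, illustrative approximation of this labelling step. The shared task derives the alignment with TERCOM; the sketch below instead uses Python's difflib as a stand-in aligner (an assumption made purely for illustration), marking tokens covered by exact matching blocks as OK and everything else as BAD.

import difflib

def ok_bad_labels(mt_tokens, pe_tokens):
    # every MT token starts as BAD; tokens inside exact matching blocks become OK
    labels = ["BAD"] * len(mt_tokens)
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "OK"
    return labels

print(ok_bad_labels("this is a tset".split(), "this is a test".split()))
# ['OK', 'OK', 'OK', 'BAD']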

Gap and source word labels To annotate deletion errors, gap ‘tokens’ between each pair of words and at the beginning of each target sentence were introduced. These gap tokens were labelled as BAD in the presence of one or more deletion errors and OK otherwise. To annotate the source words related to insertion or substitution errors in the machine-translated sentence, the IBM Model 2 alignments from fast_align (Dyer et al., 2013) were used. Each token in the source sentence was aligned to the post-edited sentence. For each token in the post-edited sentence deleted or substituted in the machine-translated text, the corresponding aligned source tokens were labelled as BAD. In this way, deletion errors also result in BAD tokens in the source, related to the missing words. All other words were labelled as OK.

12 https://github.com/Unbabel/word-level-qe-corpus-builder
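A minimal sketch of the source-side part of this procedure, under the assumption that the source-to-post-edit alignments (as produced by fast_align) and the post-edit token positions involved in deletions or substitutions have already been computed:

def source_labels(num_src_tokens, alignments, bad_pe_positions):
    # alignments: list of (source_index, post_edit_index) pairs
    # bad_pe_positions: post-edit token positions deleted or substituted in the MT output
    labels = ["OK"] * num_src_tokens
    for src_i, pe_j in alignments:
        if pe_j in bad_pe_positions:
            labels[src_i] = "BAD"
    return labels

# source has 4 tokens; post-edit token 2 was substituted in the MT output
print(source_labels(4, alignments=[(0, 0), (1, 1), (2, 2), (3, 3)], bad_pe_positions={2}))
# ['OK', 'OK', 'BAD', 'OK']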

Evaluation Analogously to last year’s task, the primary evaluation metric is the multiplication of the F1-scores for the OK and BAD classes, denoted as F1-Mult. The same metric was applied to gap and source token labels. We also report F1-scores for the individual classes for completeness. We test the significance of the results using randomisation tests (Yeh, 2000) with Bonferroni correction (Abdi, 2007).
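A minimal sketch of the F1-Mult metric, assuming gold and predicted labels are given as lists of "OK"/"BAD" strings (illustrative only, not the official evaluation script):

from sklearn.metrics import f1_score

def f1_mult(gold, pred):
    # product of the per-class F1 scores; high only when both classes are predicted well
    f1_bad = f1_score(gold, pred, pos_label="BAD")
    f1_ok = f1_score(gold, pred, pos_label="OK")
    return f1_bad * f1_ok

gold = ["OK", "BAD", "OK", "OK", "BAD"]
pred = ["OK", "BAD", "BAD", "OK", "OK"]
print(f1_mult(gold, pred))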

Results The results for Task 2 are summarised in Tables 9, 10, 11 and 12, ordered by the F1-Mult metric.

The number of submissions per language pair was different, which limits any conclusions that can be made with respect to general rankings of systems. The English-German and German-English tasks – Tables 9, 10 – had the most participating systems. As in previous years, results in Task 1 and Task 2 are correlated. In this case the same system, QEBrain, wins both tasks for these language pairs. Since some of the other systems for Task 1 were specific to sentence-level prediction, the next system in the ranking is SHEF-PT, which lags behind by a margin slightly smaller than in Task 1. Another interesting result for this year is the difference between SMT and NMT datasets. For English-German, there is a clear drop in performance from SMT to NMT. This can be due to changes in the type of errors, or to the size of the training sets, as we discuss in Section 9.

Regarding the novel task variants of detecting gaps and source words that lead to errors, only a few teams submitted systems. The performance for these tasks is lower, but correlated with the performance on the main word-level task – prediction of target word errors. It is worth noting that the QEBrain system obtains notable performance for gap error detection, almost doubling the performance of the other (few) participating systems on SMT data.

The English-Latvian and English-Czech tasks had a lower number of participants, potentially due to the smaller number of resources available to pre-process data and pre-train models.


SMT DATASET
Model                                          Pearson r  MAE   RMSE  Spearman ρ
• QEBrain DoubleBi w/ BPE+word-tok (ensemble)  0.74       0.09  0.14  0.75
QEBrain DoubleBi w/ BPE-tok                    0.73       0.10  0.14  0.75
UNQE                                           0.70       0.10  0.14  0.72
TSKQE2                                         0.49       0.13  0.17  0.00
SHEF-PT                                        0.49       0.13  0.17  0.51
TSKQE1                                         0.48       0.13  0.17  0.00
UTartu/QuEst+Attention                         0.43       0.14  0.17  0.42
UTartu/QuEst+Att+CrEmb3                        0.42       0.14  0.17  0.42
sMQE                                           0.40       0.19  0.22  0.40
RTM MIX7                                       0.39       0.14  0.18  0.40
RTM MIX6                                       0.39       0.14  0.18  0.40
SHEF-bRNN                                      0.37       0.14  0.18  0.38
BASELINE                                       0.37       0.14  0.18  0.38
uMQE                                           –          –     –     0.38
UAlacant**                                     0.39       0.18  0.23  0.39

NMT DATASET
Model                                          Pearson r  MAE   RMSE  Spearman ρ
• UNQE                                         0.51       0.11  0.17  0.61
• QEBrain DoubleBi w/ BPE+word-tok (ensemble)  0.50       0.11  0.17  0.60
• QEBrain DoubleBi w/ word-tok                 0.50       0.11  0.17  0.60
TSKQE1                                         0.42       0.14  0.18  0.00
TSKQE2                                         0.41       0.14  0.18  0.00
SHEF-bRNN                                      0.38       0.13  0.18  0.48
SHEF-PT                                        0.38       0.13  0.18  0.47
UTartu/QuEst+Attention                         0.37       0.13  0.18  0.44
sMQE                                           0.37       0.21  0.24  0.44
UTartu/QuEst+Att+CrEmb3                        0.37       0.13  0.18  0.44
BASELINE                                       0.29       0.13  0.19  0.42
uMQE                                           –          –     –     0.40
UAlacant**                                     0.23       0.21  0.26  0.24
RTM MIX5**                                     0.47       0.12  0.17  0.55

Table 5: Official results of the WMT18 Quality Estimation Task 1 for the English–German dataset. The winning submission is indicated by a •. Baseline systems are highlighted in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.

Model                                          Pearson r  MAE   RMSE  Spearman ρ
• UNQE                                         0.77       0.09  0.13  0.73
• QEBrain DoubleBi w/ BPE+word-tok (ensemble)  0.76       0.10  0.13  0.73
• QEBrain DoubleBi w/ word-tok                 0.75       0.10  0.14  0.72
sMQE                                           0.65       0.12  0.15  0.60
UTartu/QuEst+Att+CrEmb3                        0.57       0.14  0.18  0.47
SHEF-PT                                        0.55       0.13  0.17  0.50
UTartu/QuEst+Attention                         0.55       0.14  0.17  0.47
SHEF-bRNN                                      0.48       0.14  0.19  0.44
BASELINE                                       0.33       0.15  0.19  0.32
UAlacant**                                     0.63       0.12  0.17  0.60
RTM MIX5**                                     0.54       0.13  0.17  0.49

Table 6: Official results of the WMT18 Quality Estimation Task 1 for the German–English dataset. The winning submission is indicated by a •. Baseline systems are highlighted in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.


SMT DATASET
Model                       Pearson r  MAE   RMSE  Spearman ρ
• UNQE                      0.62       0.12  0.16  0.58
sMQE                        0.46       0.13  0.18  0.41
UTartu/QuEst+Att+CrEmb3     0.40       0.16  0.20  0.32
UTartu/QuEst+Attention      0.40       0.15  0.19  0.32
SHEF-bRNN                   0.40       0.14  0.19  0.33
SHEF-PT                     0.38       0.14  0.19  0.33
BASELINE                    0.35       0.16  0.19  0.35
uMQE                        –          –     –     0.40
UAlacant**                  0.36       0.20  0.26  0.34
RTM MIX**                   0.35       0.14  0.19  0.28

NMT DATASET
Model                       Pearson r  MAE   RMSE  Spearman ρ
• UNQE                      0.68       0.13  0.17  0.67
sMQE                        0.58       0.15  0.19  0.57
UTartu/QuEst+Att+CrEmb3     0.54       0.16  0.20  0.50
UTartu/QuEst+Attention      0.53       0.16  0.20  0.49
SHEF-PT                     0.46       0.17  0.22  0.45
BASELINE                    0.44       0.16  0.22  0.46
SHEF-bRNN                   0.42       0.17  0.22  0.41
uMQE                        –          –     –     0.54
UAlacant**                  0.56       0.17  0.22  0.55
RTM MIX**                   0.54       0.16  0.20  0.50

Table 7: Official results of the WMT18 Quality Estimation Task 1 for the English–Latvian dataset. The winning submission is indicated by a •. Baseline systems are highlighted in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.

Model                       Pearson r  MAE   RMSE  Spearman ρ
• UNQE                      0.69       0.12  0.17  0.71
SHEF-PT                     0.53       0.15  0.19  0.54
SHEF-bRNN                   0.50       0.16  0.20  0.51
UTartu/QuEst+Attention      0.45       0.16  0.20  0.46
UTartu/QuEst+Att+CrEmb3     0.41       0.17  0.21  0.40
BASELINE                    0.39       0.17  0.21  0.41
sMQE                        0.39       0.16  0.21  0.42
uMQE                        –          –     –     0.42
UAlacant**                  0.44       0.18  0.23  0.46
RTM MIX**                   0.52       0.15  0.20  0.53

Table 8: Official results of the WMT18 Quality Estimation Task 1 for the English–Czech dataset. The winning submission is indicated by a •. Baseline systems are highlighted in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.

It is interesting to note the generally low performance of systems on the English-Latvian NMT data: all systems are tied with the baseline in terms of F1-Mult. The reimplementation of the POSTECH system shows poor results on the NMT dataset; in this case it is unable to outperform the baseline. Results for English-Czech are very similar across systems.

7 Task 3: Predicting phrase-level quality

This level of granularity was first introduced in the shared task at WMT16. The goal is to predict MT quality at the level of phrases. In the 2016 edition, the data annotation was done automatically based on post-edits, as in Task 2, but this year humans directly labelled each phrase in context.


SMT DATASET
                                               Words in MT            GAPs in MT             Words in SRC
Model                                          F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult
• QEBrain DoubleBi w/ BPE+word-tok (ensemble)  0.68   0.92  0.62      –      –     –         –      –     –
QEBrain DoubleBi w/ word-tok                   0.66   0.92  0.61      0.51   0.98  0.50      –      –     –
SHEF-PT                                        0.51   0.85  0.43      0.29   0.96  0.28      0.42   0.80  0.34
CMU-LTI                                        0.48   0.82  0.39      –      –     –         –      –     –
SHEF-bRNN                                      0.45   0.81  0.37      0.27   0.96  0.26      0.41   0.82  0.34
BASELINE                                       0.41   0.88  0.36      –      –     –         –      –     –
Doc2Vec                                        0.29   0.75  0.22      –      –     –         –      –     –
BagOfWords                                     0.28   0.73  0.20      –      –     –         –      –     –
UAlacant**                                     0.35   0.81  0.29      0.33   0.96  0.32      –      –     –
RTM**                                          0.33   0.88  0.29      0.26   0.98  0.25      0.17   0.86  0.14

NMT DATASET
                                               Words in MT            GAPs in MT             Words in SRC
Model                                          F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult
• QEBrain DoubleBi w/ word-tok (using voting)  0.48   0.91  0.44      –      –     –         –      –     –
• QEBrain DoubleBi w/ word-tok                 0.48   0.92  0.43      –      –     –         –      –     –
CMU-LTI                                        0.36   0.85  0.30      –      –     –         –      –     –
SHEF-bRNN                                      0.35   0.86  0.30      0.12   0.98  0.12      0.33   0.87  0.29
SHEF-PT                                        0.34   0.87  0.29      0.11   0.98  0.11      0.31   0.84  0.26
BASELINE                                       0.20   0.92  0.18      –      –     –         –      –     –
UAlacant**                                     0.23   0.86  0.19      0.12   0.98  0.12      –      –     –
RTM**                                          0.14   0.99  0.13      0.14   0.99  0.13      0.03   0.92  0.03

Table 9: Official results of the WMT18 Quality Estimation Task 2 for the English–German dataset. The winning submission is indicated by a •. Baseline systems are in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.


                                               Words in MT            GAPs in MT             Words in SRC
Model                                          F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult
• QEBrain DoubleBi w/ BPE+word-tok (ensemble)  0.65   0.92  0.60      –      –     –         –      –     –
• QEBrain DoubleBi w/ word-tok                 0.65   0.92  0.59      –      –     –         –      –     –
BASELINE                                       0.49   0.90  0.44      –      –     –         –      –     –
SHEF-PT                                        0.49   0.87  0.42      0.21   0.97  0.20      0.39   0.89  0.35
CMU-LTI                                        0.49   0.85  0.42      –      –     –         –      –     –
SHEF-bRNN                                      0.45   0.87  0.39      0.20   0.97  0.19      0.37   0.87  0.32
UAlacant**                                     0.43   0.87  0.37      0.33   0.97  0.32      –      –     –
RTM**                                          0.38   0.90  0.34      0.15   0.98  0.14      0.12   0.90  0.11

Table 10: Official results of the WMT18 Quality Estimation Task 2 for the German–English dataset. The winning submission is indicated by a •. Baseline systems are in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.

                                               Words in MT            GAPs in MT             Words in SRC
Model                                          F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult
• CMU-LTI                                      0.56   0.80  0.45      –      –     –         –      –     –
BASELINE                                       0.53   0.83  0.44      –      –     –         –      –     –
• SHEF-PT                                      0.56   0.80  0.44      0.17   0.98  0.17      0.49   0.80  0.39
SHEF-bRNN                                      0.55   0.79  0.44      0.18   0.97  0.17      0.49   0.81  0.40
UAlacant**                                     0.42   0.75  0.32      0.15   0.95  0.15      –      –     –
RTM**                                          0.53   0.83  0.44      0.11   0.98  0.10      0.32   0.80  0.26

Table 11: Official results of the WMT18 Quality Estimation Task 2 for the English–Czech dataset. The winning submission is indicated by a •. Baseline systems are in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.


SMT DATASET
                   Words in MT            GAPs in MT             Words in SRC
Model              F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult
• SHEF-PT          0.42   0.87  0.36      0.14   0.97  0.14      0.35   0.86  0.30
• SHEF-bRNN        0.41   0.86  0.35      0.12   0.98  0.11      0.36   0.86  0.31
BASELINE           0.38   0.91  0.34      –      –     –         –      –     –
CMU-LTI            0.22   0.85  0.19      –      –     –         –      –     –
UAlacant**         0.27   0.82  0.22      0.11   0.96  0.11      –      –     –
RTM**              0.37   0.90  0.33      0.13   0.99  0.13      0.12   0.89  0.11

NMT DATASET
                   Words in MT            GAPs in MT             Words in SRC
Model              F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult   F1-BAD F1-OK F1-mult
• CMU-LTI          0.52   0.83  0.43      –      –     –         –      –     –
BASELINE           0.49   0.86  0.42      –      –     –         –      –     –
• SHEF-PT          0.52   0.81  0.42      0.13   0.97  0.13      0.44   0.81  0.36
• SHEF-bRNN        0.50   0.83  0.42      0.12   0.94  0.11      0.44   0.80  0.36
UAlacant**         0.45   0.80  0.36      0.17   0.95  0.16      –      –     –
RTM**              0.43   0.85  0.37      0.08   0.98  0.08      0.20   0.84  0.17

Table 12: Official results of the WMT18 Quality Estimation Task 2 for the English–Latvian dataset. The winning submission is indicated by a •. Baseline systems are in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.


Labels We used the phrase segmentation produced by the SMT decoder which generated the translations for the dataset. The phrases were annotated for errors using four classes: ‘OK’, ‘BAD’ – the phrase contains one or more errors, ‘BAD word order’ – the phrase is in an incorrect position in the sentence, and ‘BAD omission’ – a word is missing before/after a phrase. This task is further subdivided into two subtasks: word-level prediction (Task3a), and phrase-level prediction (Task3b).

The data for Task3a propagates the annotation of each phrase to its words, and thus uses word-level segmentation for both source and machine-translated sentences, such that the task can be addressed as a word-level prediction task. In other words, all tokens in the target sentence are labelled according to the label of the phrase they belong to: if the phrase is annotated as ‘OK’, ‘BAD’ or ‘BAD word order’, all tokens (and gap tokens) within that phrase receive that same label. To annotate omission errors, a gap token is inserted after each token and at the start of the sentence.
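A minimal sketch of this propagation step, under the assumption that each annotated phrase is available as a (tokens, label) pair and ignoring the gap tokens for simplicity:

def propagate_phrase_labels(phrases):
    # phrases: list of (token_list, phrase_label) pairs in sentence order
    word_labels = []
    for tokens, label in phrases:
        word_labels.extend([label] * len(tokens))
    return word_labels

phrases = [(["the", "house"], "OK"), (["is", "grand"], "BAD")]
print(propagate_phrase_labels(phrases))  # ['OK', 'OK', 'BAD', 'BAD']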

The data for Task3b has phrase-level segmentation with the labels assigned by the human annotator to each phrase. A gap token is inserted after each phrase and at the start of the sentence. The gap is labelled either ‘OK’ or ‘BAD omission’, where the latter indicates that one or more words are missing.

Evaluation Similarly to Task 2, our primary metric for predictions at word level (Task3a) is the multiplication of the F1 scores of the OK and BAD classes, F1-Mult, while for predictions at phrase level (Task3b), our primary metric is the phrase-level version of F1-Mult. The same metrics were applied to gap and source token labels for both sub-tasks, along with F1 scores for individual classes for completeness. We also report the F1 score for BAD word order labels on the target tokens for Task3b. We computed statistical significance of the results using randomisation tests with Bonferroni correction, as in Task 2.

Results The results of the phrase-level task are given in Tables 13 (Task3a) and 14 (Task3b), ordered by the F1-Mult metric.

Comparing the results for Task3a with the German-English results for Task 2 (Table 10), a general degradation of the F1 score on the BAD class can be observed, including for the baseline system. We attribute this phenomenon to the way the data for this task was created: for Task 2, the token labels were produced from post-editing, where each word was labelled independently from the others, while for this task the token labels are deduced from a labelling at a more coarse level (phrase), i.e. where words were not considered as individual tokens. Consequently, words that would be considered correct during post-editing are here labelled as BAD, like the BAD phrase they belong to. The only two official submissions to this subtask (SHEF-PT and SHEF-bRNN) slightly outperform the baseline system, although the difference is not statistically significant.

For the phrase-level predictions, the baseline system remains ahead of the only two official submissions, both from the University of Sheffield (SHEF-ATT-SUM and SHEF-PT), by a significant margin. The overall performance in predicting phrases that are in an incorrect position in the sentence (i.e. BAD word order) shows that this remains a very challenging task, as none of the submissions were able to obtain a competitive F1 score.

8 Task 4: Predicting document-level QE

This task consists in estimating a document-level quality score according to the amount of minor, major, and critical errors present in the translation. The predictions are compared to a ground truth obtained from annotations produced by crowd-sourced human translators from Unbabel.13

Labels The data was annotated for errors at the word level using a fine-grained error taxonomy – Multidimensional Quality Metrics (MQM) (Lommel et al., 2014) – similar to the one used in (Sanchez-Torron and Koehn, 2016). MQM is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (the translation affects the reading of the text) and style (the translation has stylistic problems, like the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides the identification of an error and its classification according to this typology (by applying a specific tag), the errors receive a severity scale that reflects the impact of each error on the overall meaning, style, and fluency of the translation. An error can be minor (if it does not lead to a loss of meaning and it does not confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).

13http://www.unbabel.com.


                          Words in MT               GAPs in MT               Words in SRC
Model            F1-BAD  F1-OK  F1-mult    F1-BAD  F1-OK  F1-mult    F1-BAD  F1-OK  F1-mult
• SHEF-PT          0.33   0.83    0.28       0.27   0.88    0.24       0.50   0.81    0.41
• SHEF-bRNN        0.33   0.82    0.27       0.26   0.88    0.23       0.49   0.79    0.39
BASELINE           0.27   0.91    0.25         –      –       –          –      –       –
RTM**              0.16   0.90    0.15       0.10   0.94    0.10       0.10   0.84    0.08

Table 13: Official results of the WMT18 Quality Estimation Task 3a (word-level) for the German–English dataset. The winning submission is indicated by a •. Baseline system is in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.

                              Words in MT                            GAPs in MT               Words in SRC
Model            F1-BAD  F1-OK  F1-mult  F1-BAD word order    F1-BAD  F1-OK  F1-mult    F1-BAD  F1-OK  F1-mult
BASELINE           0.39   0.92    0.36         0.02              –      –       –          –      –       –
• SHEF-ATT-SUM     0.29   0.76    0.22         0.11            0.10   0.94    0.10         –      –       –
SHEF-PT            0.23   0.81    0.18         0.08            0.11   0.93    0.10         –      –       –
RTM**              0.27   0.92    0.24         0.04            0.05   0.98    0.05       0.10   0.90    0.09

Table 14: Official results of the WMT18 Quality Estimation Task 3b (phrase-level) for the German–English dataset. The winning submission is indicated by a •. Baseline system is in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.



Document-level scores were then generated from the word-level errors and their severity using the method described in Sanchez-Torron and Koehn (2016, footnote 6). Namely, denoting by n the number of words in the document, and by n_min, n_maj, and n_cri the number of annotated minor, major, and critical errors, the final quality scores were computed as:

\text{MQM Score} = 1 - \frac{n_{\mathrm{min}} + 5\,n_{\mathrm{maj}} + 10\,n_{\mathrm{cri}}}{n} \qquad (1)

Note that MQM values can be negative if the total severity exceeds the number of words.
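A minimal sketch of this computation follows; the function and argument names are illustrative, not part of the task's released tooling.

```python
# Document-level MQM score of Equation (1): each minor error counts 1, each
# major error 5, each critical error 10, normalised by the document length.
# The result can be negative when the total severity exceeds the word count.
def mqm_score(n_words, n_minor, n_major, n_critical):
    penalty = n_minor + 5 * n_major + 10 * n_critical
    return 1.0 - penalty / n_words

# A 200-word document with 4 minor, 2 major and 1 critical error:
print(mqm_score(200, 4, 2, 1))  # 1 - 24/200 = 0.88
```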

Evaluation Submissions are evaluated as in Task 1 (see Section 5), in terms of Pearson's correlation r between the true and predicted document-level scores.
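For illustration, Pearson's r can be computed with scipy as below; the official evaluation script may differ in implementation details, and the scores shown are made-up examples.

```python
# Pearson correlation between gold and predicted document-level scores.
from scipy.stats import pearsonr

gold = [0.88, 0.95, 0.40, 0.72, 0.15]
pred = [0.80, 0.90, 0.55, 0.70, 0.25]
r, p_value = pearsonr(gold, pred)
print(round(r, 2))
```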

Results The results of the document-level task are shown in Table 15. Due to the different numeric range of the scores, only the Pearson correlation values are comparable to those of Task 1. Comparing with the results for Task 1, it can be observed that the baseline system already obtains a very high correlation. The neural model SHEF-PT-indomain outperforms the baseline by a modest margin, compared to the results obtained in Task 1.

Model                    Pearson r   MAE
• SHEF-PT-indomain         0.53      0.56
BASELINE                   0.51      0.56
SHEF-mtl-bRNN              0.47      0.56
SHEF-mtl-PT-indomain**     0.52      0.57
RTM MIX1**                 0.11      0.58

Table 15: Official results of the WMT18 Quality Estimation Task 4 for the English–French dataset. The winning submission is indicated by a •. Baseline system is in grey, and ** indicates late submissions that were not considered for the official ranking of participating systems.

9 Discussion

In what follows, we discuss the main findings of this year's shared task based on the goals we had previously identified for it.

Performance of QE approaches on the output of neural MT systems. As previously mentioned, some of the data used for Tasks 1 and 2 is translated by both an SMT and an NMT system: the English–German and English–Latvian data. In Task 1, for English–German, the numbers of translations in the QE training data from the two systems are very different (26,273 for SMT and 13,442 for NMT, after sentences with HTER=0 were filtered out) and thus no direct comparison can be made. This shows that the NMT system was of much higher quality than the SMT one, producing many more sentences that led to HTER=0. Of the sentences that remained, the average NMT quality in the training data is still higher: HTER=0.154 versus 0.253 for SMT. From the results, the top systems do considerably better on the SMT data (r=0.74 for SMT vs r=0.51 for NMT translations). This difference is also noticeable for the baseline system (r=0.37 for SMT vs r=0.29 for NMT translations). This could, however, be because of the difference in the number of samples and/or significant differences in the distributions of HTER scores in the two datasets. It is worth pointing out that the winning submissions are the same for both SMT and NMT translations: QEBrain and UNQE. In fact, QE systems are ranked very similarly for the two types of translation.

For English–Latvian, the number of NMT and SMT QE training sentences is similar (12,936 for NMT and 11,251 for SMT). Their average HTER scores are also more comparable: 0.278 for NMT and 0.215 for SMT. The difference in QE system performance for this language pair is not as marked, but the trend is inverted when compared to English–German: QE systems do better on the NMT data (the top system, UNQE, achieves r=0.62 for SMT vs r=0.68 for NMT translations, while the baseline achieves r=0.44 for SMT vs r=0.35 for NMT translations). This could be because of the lower differences in the distribution of HTER scores in both sets. The ranking of QE systems is exactly the same for both SMT and NMT translations.

In both cases, it is important to note that even though the initial datasets contained exactly the same source sentences for SMT and NMT, the sentences in the two final versions of the datasets for each language are not all the same, i.e. some NMT sentences may have been filtered for having HTER=0 while their SMT counterparts were not, and vice-versa. The main finding is that QE models seem to be robust to different types of translation, since their rankings are the same across datasets.



For Task 2, the trend is similar: QE systems for English–German also perform better on SMT translations than on NMT translations (F1-Mult=0.62 for SMT vs F1-Mult=0.44 for NMT), and the inverse is observed for English–Latvian (F1-Mult=0.36 for SMT vs F1-Mult=0.43 for NMT). The ranking of QE systems for the two types of translations differs more than for Task 1, especially for English–Latvian.

Task 4 uses NMT output only, and it is hard to draw any conclusions about whether the performance of the systems is good enough because this is the first time this task has been organised. Generally speaking, this task proved hard, with the baseline system performing as well as or better than the other submissions.

Predictability of missing words in the MT output. Only a subset of the systems that participated in Task 2 submitted results for missing word detection. From the results obtained, it seems clear that while this task is more difficult than target word error detection, high scores could be attained for the SMT data. Due to the small number of submitted systems, it is unclear whether or not gap detection is more difficult for NMT data.

Predictability of source words that lead to errors in the MT output. Only a small set of teams submitted predictions for source words. From the submitted results, it can be observed that predicting source words related to errors is a harder problem than detecting errors in the target language. This may be due to the fact that there may be more ambiguity with regard to which words should be related to errors in the target. In other words, in some cases a source word in a given context leads to incorrect translations, while in other cases the same source word in the same context will not lead to errors.

Effectiveness of manually assigned labels for phrases. With only one official (and one late) submission to the phrase-level QE task this year, it is hard to conclude whether having manual labels makes the task harder (although the baseline system performs as well as in the last edition), or whether the reason lies in the design of the neural models, which may not be suitable for this task.

Quality prediction for documents from errors annotated at word level with added severity judgements. Since this is a new task, not many systems were submitted. Results show, however, that it is possible to attain Pearson correlation scores that are comparable with those of sentence-level post-editing effort prediction. The performance gap between the neural model and the SVM baseline is smaller than in Task 1, which may be an indication of further potential gains from new deep learning architectures tailored for document-level prediction.

Utility of additional evidence. To investigate the utility of detailed information logged during post-editing, we offered participants other sources of information: post-editing time, keystrokes, and actual edits. Surprisingly, no participating system requested these additional labels, and therefore this remains an open question.

10 Conclusions

This year's edition of the QE shared task was the largest ever organised in many respects: number of tasks, number of languages, variety of tasks (three granularity levels), types of annotation (derived from post-editing or manual, source or target), and number of samples annotated.

Over the years, we have attempted to find a balance between keeping the shared task as close as possible to previous editions – so as to make some form of comparison across years possible – and proposing new tasks and new interesting challenges – so as to keep up to date with new developments in the field, such as neural machine translation. We believe the current set of tasks covers a broad enough range of challenges that are far from solved, such as improving performance given smaller sets of instances, predicting source words that lead to errors, predicting gaps, and the use of additional evidence from post-editing.

In order to allow for future benchmarking on a 'blind' basis without access to the gold-standard labels, we have set up CodaLab competitions that will remain open after this shared task. Any team can register and submit any number of systems (limited to five submissions per day per task and language pair) and get immediate feedback through the official evaluation metrics, as well as a comparison to the top submissions from other teams (on the leaderboard). Each team's best submission per task and language pair will feature on the leaderboard. The submission pages for each task are as follows, where languages and task variants are framed as 'phases':



• Sentence level: https://competitions.codalab.org/competitions/19316

• Word level: https://competitions.codalab.org/competitions/19306

• Phrase level: https://competitions.codalab.org/competitions/19308

• Document level: https://competitions.codalab.org/competitions/19309

Acknowledgments

The data and annotations collected for Tasks 1, 2 and 3 were supported by the EC H2020 QT21 project (grant agreement no. 645452). The shared task organisation was also supported by the QT21 project, national funds through Fundação para a Ciência e Tecnologia (FCT), with references UID/CEC/50021/2013 and UID/EEA/50008/2013, and by the European Research Council (ERC StG DeepSPIN 758969). We would also like to thank Julie Beliao and the Unbabel Quality Team for coordinating the annotation of the dataset used in Task 4.

References

Hervé Abdi. 2007. The Bonferroni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics, 3:103–107.

Wilker Aziz, Sheila Castilho Monteiro de Sousa, and Lucia Specia. 2012. PET: a tool for post-editing and assessing machine translation. In Eighth International Conference on Language Resources and Evaluation, LREC, pages 3982–3987, Istanbul, Turkey.

Prasenjit Basu, Santanu Pal, and Sudip Kumar Naskar. 2018. Keep it or not: Word level quality estimation for post-editing. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Ergun Bicici. 2017. Predicting translation performance with referential translation machines. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 540–544, Copenhagen, Denmark. Association for Computational Linguistics.

Ergun Bicici. 2018. RTM results for predicting translation performance. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Ergun Bicici. 2013. Referential translation machines for quality estimation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 343–351, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Eighth Workshop on Statistical Machine Translation, WMT, pages 1–44, Sofia, Bulgaria.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Hervé Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Ninth Workshop on Statistical Machine Translation, WMT, pages 12–58, Baltimore, Maryland.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Tasks Papers, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Seventh Workshop on Statistical Machine Translation, WMT, pages 10–51, Montreal, Canada.

Melania Duma and Wolfgang Menzel. 2018. The benefit of pseudo-reference translations in quality estimation of MT output. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.

Thierry Etchegoyhen, Eva Martínez Garcia, and Andoni Azpeitia. 2018. Supervised and unsupervised minimalist quality estimators: Vicomtech's participation in the WMT 2018 quality estimation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517. International World Wide Web Conferences Steering Committee.

Junjie Hu, Wei-Cheng Chang, Yuexin Wu, and Graham Neubig. 2018. Contextual encoding for translation quality estimation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Julia Ive, Frédéric Blain, and Lucia Specia. 2018a. DeepQuest: a framework for neural-based quality estimation. In Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics: Technical Papers, Santa Fe, New Mexico.

Julia Ive, Carolina Scarton, Frédéric Blain, and Lucia Specia. 2018b. Sheffield submissions for the WMT18 quality estimation shared task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Hyun Kim, Jong-Hyeok Lee, and Seung-Hoon Na. 2017. Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Tasks Papers, pages 562–568, Copenhagen, Denmark. Association for Computational Linguistics.

Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. 2015. QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 297–303, Lisboa, Portugal. Association for Computational Linguistics.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages 1188–1196.

Maoxi Li, Qingyu Xiang, Zhiming Chen, and Mingwen Wang. 2018. A unified neural network for quality estimation of machine translation. IEICE Transactions on Information and Systems, E101-D(9).

Varvara Logacheva, Chris Hokamp, and Lucia Specia. 2016. Marmot: A toolkit for translation quality estimation at the word level. In Tenth International Conference on Language Resources and Evaluation, LREC, pages 3671–3674, Portorož, Slovenia.

Arle Richard Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0(12):455–463.

Ngoc Quang Luong, Laurent Besacier, and Benjamin Lecouteux. 2014. Word confidence estimation for SMT n-best list re-ranking. In Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation, pages 1–9, Gothenburg, Sweden. Association for Computational Linguistics.

André F. T. Martins, Ramón Astudillo, Chris Hokamp, and Fabio Kepler. 2016. Unbabel's participation in the WMT16 word-level translation quality estimation shared task. In Proceedings of the First Conference on Machine Translation, pages 806–811, Berlin, Germany. Association for Computational Linguistics.

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM.

Sylvain Raybaud, David Langlois, and Kamel Smaïli. 2011. "This sentence is wrong." Detecting errors in machine-translated sentences. Machine Translation, 25(1):1–34.

Felipe Sánchez-Martínez, Miquel Esplà-Gomis, and Mikel L. Forcada. 2018. UAlacant machine translation quality estimation at WMT 2018: a simple approach using phrase tables and feed-forward neural networks. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.


Marina Sanchez-Torron and Philipp Koehn. 2016. Machine translation quality and post-editor productivity. In AMTA 2016, Vol., page 16.

Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadiņa, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Lucia Specia, Gustavo Paetzold, and Carolina Scarton. 2015. Multi-level translation quality prediction with QuEst++. In ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Jiayi Wang, Kai Fan, Bo Li, Fengming Zhou, Boxing Chen, Yangbin Shi, and Luo Si. 2018. Alibaba submission for WMT18 quality estimation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Elizaveta Yankovskaya, Andre Tattar, and Mark Fishel. 2018. Quality estimation with force-decoded attention and cross-lingual embeddings. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Tasks Papers, Brussels, Belgium. Association for Computational Linguistics.

Alexander Yeh. 2000. More Accurate Tests for the Statistical Significance of Result Differences. In Coling-2000: the 18th Conference on Computational Linguistics, pages 947–953, Saarbrücken, Germany.
