
Source: guzmanhe.github.io/media/NAACL2016-Sajjad.pdf


Eyes Don’t Lie: Predicting Machine Translation Quality Using Eye Movement Features

Hassan Sajjad, Francisco Guzman, Nadir Durrani, Houda Bouamor*, Ahmed Abdelali, Irina Temnikova, Stephan Vogel
Qatar Computing Research Institute, HBKU, Qatar; Carnegie Mellon University, Qatar*

Introduction & Motivation

Problem: Human evaluation suffers from low inter- and intra-annotator agreement. Evaluation scores are too subjective.

Hypothesis: Reading patterns from evaluators can help to
•  shed light on the evaluation process
•  understand which parts of the sentences are difficult to evaluate
•  develop a semi-automatic evaluation system based on reading patterns

Our Solution: use reading patterns as a method to distinguish between good and bad translations

In addition:
•  identify novel features from gaze data
•  model and predict the quality of translations as perceived by evaluators

Gaze features:
1.   Jump features (word transitions): word-level forward and backward gaze jumps
2.   Total jump distance: total gaze distance covered while evaluating
3.   Inter-region jumps: gaze jumps between the translation and the reference
4.   Dwell time: how long the eyes spend on a region
5.   Lexicalized features:
     •  extract streams of lexical sequences
     •  score using a trigram language model
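As an illustration of gaze features 1 to 4 (jump features, total jump distance, inter-region jumps, dwell time), here is a minimal sketch. It assumes the raw eye-tracker samples have already been grouped into fixations and mapped to words. The `Fixation` record, its field names, and the region labels `"translation"` and `"reference"` are hypothetical and not taken from the poster.

```python
from collections import Counter
from math import hypot
from typing import NamedTuple


class Fixation(NamedTuple):
    """Hypothetical fixation record; not the authors' data format."""
    region: str         # "translation" or "reference"
    word: int           # index of the fixated word within its region
    x: float            # screen coordinates of the fixation
    y: float
    duration_ms: float  # how long the fixation lasted


def jump_features(fixations):
    """Count forward, backward, and inter-region jumps between consecutive fixations."""
    counts = Counter()
    for prev, cur in zip(fixations, fixations[1:]):
        if prev.region != cur.region:
            counts["inter_region"] += 1
        elif cur.word > prev.word:
            counts["forward"] += 1
        elif cur.word < prev.word:
            counts["backward"] += 1
    return counts


def total_jump_distance(fixations):
    """Sum of Euclidean distances between consecutive fixations."""
    return sum(hypot(c.x - p.x, c.y - p.y) for p, c in zip(fixations, fixations[1:]))


def dwell_time(fixations):
    """Total fixation duration per region, in milliseconds."""
    dwell = Counter()
    for f in fixations:
        dwell[f.region] += f.duration_ms
    return dwell


# Toy scanpath: three fixations on the translation, then one on the reference.
scanpath = [
    Fixation("translation", 0, 0.0, 0.0, 100.0),
    Fixation("translation", 2, 30.0, 40.0, 200.0),
    Fixation("translation", 1, 30.0, 0.0, 150.0),
    Fixation("reference", 0, 0.0, 0.0, 50.0),
]
```

Each function returns plain counts or totals, so the outputs can be concatenated into one feature vector per evaluation task. The lexicalized features (5) are not covered by this sketch.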

Model
•  Linear regression model with ridge regularization
•  The ridge coefficients minimize the penalized squared error
•  Parameter λ controls the amount of shrinkage applied to the regression coefficients
•  Used the glmnet package of R with cross-validation to find the best value of λ on the training data
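The poster fits the model with the glmnet package of R. To show what ridge regression and the cross-validated choice of λ involve, the sketch below swaps in a small closed-form implementation in pure Python. It is not glmnet and will not reproduce glmnet's scaling of λ; the function names and the candidate λ grid are assumptions made for illustration.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Minimize sum((y - b0 - x.w)^2) + lam * ||w||^2; the intercept b0 is not penalized.

    Assumes lam > 0, which keeps the linear system non-singular.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations with the ridge penalty added on the diagonal.
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve(A, b)
    b0 = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return b0, w


def predict(model, row):
    b0, w = model
    return b0 + sum(wj * xj for wj, xj in zip(w, row))


def cv_select_lambda(X, y, lambdas, k=10, seed=0):
    """Return the candidate lambda with the lowest k-fold mean squared error."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    best = None
    for lam in lambdas:
        sse = 0.0
        for fold in folds:
            held = set(fold)
            X_train = [X[i] for i in idx if i not in held]
            y_train = [y[i] for i in idx if i not in held]
            model = ridge_fit(X_train, y_train, lam)
            sse += sum((y[i] - predict(model, X[i])) ** 2 for i in fold)
        mse = sse / len(X)
        if best is None or mse < best[1]:
            best = (lam, mse)
    return best[0]
```

A larger λ pulls the weights toward zero: with inputs 0, 1, 2, 3 and targets 1, 3, 5, 7 the unpenalized slope is 2, while `ridge_fit` with `lam=5` returns a slope of 1.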

Experimental Setup

Data
•  Subset of the Spanish-English WMT’12 Evaluation task
•  Selected 60 medium-length sentences, evaluated by at least 2 different annotators
•  For each sentence, selected the best and the worst translations, according to a human evaluation score based on expected wins
•  Total: 60 sentences x 2 translations = 120 evaluation tasks x 6 different evaluators = 720 evaluations

Eye-tracking Annotations
•  Present evaluators with a translation-reference pair
•  The best and worst translations of the same sentence are shown with at least 40 different tasks in between
•  Assign a 0-100 score to each task
•  Inter-annotator kappa = 0.321 (slightly higher than the overall IAA of 0.284 in WMT’12 for Spanish-English)
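The poster does not say how the task order was generated. One simple way to meet the spacing constraint above is to show one translation of every sentence in a first block and the other translation in a second block, using the same sentence order in both. The sketch below is a hypothetical scheme along those lines, not the authors' procedure; `schedule_tasks` and `min_gap` are illustrative names.

```python
import random


def schedule_tasks(sentence_ids, seed=0):
    """Order (sentence_id, kind) tasks so both translations of a sentence are far apart.

    Each sentence appears once in the first block and once in the second, in the
    same shuffled order, so the gap between its two tasks is len(sentence_ids) - 1.
    """
    rng = random.Random(seed)
    order = list(sentence_ids)
    rng.shuffle(order)
    first, second = [], []
    for sid in order:
        # Randomize whether the best or the worst translation is seen first.
        kind_a, kind_b = rng.sample(["best", "worst"], 2)
        first.append((sid, kind_a))
        second.append((sid, kind_b))
    return first + second


def min_gap(tasks):
    """Smallest number of tasks shown between two tasks of the same sentence."""
    last_seen, gap = {}, None
    for pos, (sid, _) in enumerate(tasks):
        if sid in last_seen:
            between = pos - last_seen[sid] - 1
            gap = between if gap is None else min(gap, between)
        last_seen[sid] = pos
    return gap


tasks = schedule_tasks(range(60))
```

With 60 sentences this yields 120 tasks and 59 tasks between the two translations of every sentence, which satisfies the minimum of 40.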

Evaluation Protocol (similar to WMT’12)
•  Pairwise evaluation
•  Computed Kendall’s tau coefficient
•  Evaluated using 10-fold cross-validation
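For the pairwise setting described above, Kendall's tau can be computed from the number of pairs on which the predicted scores agree or disagree with the human preference. The sketch below assumes each pair holds the predicted score of the translation that humans ranked better, followed by the predicted score of the one ranked worse. Skipping tied predictions is an assumption of this sketch, not a detail taken from the poster.

```python
def kendall_tau(pairs):
    """Kendall's tau over (score_of_better, score_of_worse) prediction pairs.

    tau = (concordant - discordant) / (concordant + discordant)
    Tied predictions are skipped (an assumption of this sketch).
    """
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


# Toy predictions for four best/worst pairs: three agree with the humans, one does not.
example_pairs = [(0.8, 0.3), (0.6, 0.7), (0.9, 0.1), (0.5, 0.2)]
tau = kendall_tau(example_pairs)
```

A value of 1 means the model orders every pair as the human evaluators did, and -1 means it reverses every pair.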

Tool
•  EyeTribe eye-tracker
•  Sampling frequency of 30 Hz
•  Evaluation environment: iAppraise

Results

•  Lack predictive power
•  Reading patterns on the translation and inter-region jumps bring useful information
•  Lexicalized gaze jumps bring additional value over an LM

Combinations with BLEU
B_bleu                   0.34
B_bleu + EyeTra_bj       0.38
B_bleu + EyeLex_all      0.42

Combining the best features with BLEU improves over BLEU alone (0.34 to 0.42): reading patterns capture more than just fluency and adequacy

Conclusion

Conclusions: Eye-tracking features capture additional information and can complement traditional measures (BLEU).
Future work: more users, language pairs, early termination features, deeper analysis.