What machine translation developers are doing to make post-editors happy

Page 1: What machine translation developers are doing to make post-editors happy

What MT developers are doing… …to make post-editors happy

John Tinsley, CEO and Co-founder

WPTP4 @ MT Summit. Miami. 3rd November 2015

Page 2: What machine translation developers are doing to make post-editors happy

We provide Machine Translation solutions with Subject Matter Expertise

An MT solutions and services provider, specialising in customised solutions with subject matter expertise for specific technical sectors such as patents/IP, life sciences, and finance.

Page 3: What machine translation developers are doing to make post-editors happy

MT Application Areas

MT for Information Purposes
•  Development focuses on improving key information translation
•  Terminology is important
•  Evaluation driven by “usability”

MT for Post-editing Productivity
•  Development focuses on reducing edits required
•  Feedback loop is crucial
•  Evaluation through practical translation tasks

Page 4: What machine translation developers are doing to make post-editors happy

Use cases in practice

•  Product descriptions to open new markets
•  MT for post-editing productivity across industries
•  Developer and user for web content
•  Tens of thousands of people using online tools daily

Page 5: What machine translation developers are doing to make post-editors happy

TRANSLATION

Page 6: What machine translation developers are doing to make post-editors happy

“Four Pillars of Happiness”

QUALITY: Ensuring the output is the highest quality possible!

EVALUATION: Letting users know how good to expect the output to be

INTEGRATION: Making sure the MT fits seamlessly into the workflow

FEEDBACK: Bringing the translator into the loop to effect change

Page 7: What machine translation developers are doing to make post-editors happy

Quality

There’s no silver bullet when it comes to improving MT quality

Page 8: What machine translation developers are doing to make post-editors happy

Quality

•  What is being done to improve MT*
a)  on a broader, technology level?
b)  on a lower level for specific languages / domains?

*not with the express purpose of making post-editors happy :)

Page 9: What machine translation developers are doing to make post-editors happy

Quality

•  What is being done to improve MT*
a)  on a broader, technology level?
b)  on a lower level for specific languages / domains?

*not with the express purpose of making post-editors happy :)

–  Neural networks and deep learning
•  something new, totally different, the future?

–  Online adaptive MT
•  improving specific engines rapidly [feedback]

–  Syntax-based MT (tree-to-string, etc.)
•  incorporating elements of linguistics

Page 10: What machine translation developers are doing to make post-editors happy

Quality

•  What is being done to improve MT*
a)  on a broader, technology level?
b)  on a lower level for specific languages / domains?

*not with the express purpose of making post-editors happy :)

–  Chinese
•  segmentation (see the sketch after this list), 的 (de) particle

–  German
•  long-distance verb movements, compound splitting / joining

–  Irish
•  more fundamental: data collection, resource development
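To see why segmentation matters: Chinese is written without spaces between words, so an MT engine must first decide where words begin and end, and different splits lead to different translations. A quick illustration using the open-source jieba segmenter (an example tool, not necessarily what any given engine uses):

```python
# Chinese word segmentation demo: the raw string carries no word boundaries,
# so the segmenter must infer them from its dictionary and statistics.
import jieba  # pip install jieba

text = "机器翻译系统"  # "machine translation system", written with no spaces
print(jieba.lcut(text))  # e.g. ['机器翻译', '系统'] -- the split depends on the dictionary
```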


Page 15: What machine translation developers are doing to make post-editors happy

Quality

•  What is being done to improve MT*
a)  on a broader, technology level?
b)  on a lower level for specific languages / domains?

*not with the express purpose of making post-editors happy :)

–  MT for User Generated Content @ …
•  how to handle misspellings, text speak, etc.

–  Patent-focused MT @ Iconic
•  concentrating on the mix of technical language and style

–  MT for online course materials @ TraMOOC
•  European H2020 project

Page 16: What machine translation developers are doing to make post-editors happy

Evaluation

•  Objectively provide stakeholders with information such as:
a)  general quality expectations of an MT engine
b)  how it’s impacting individual translators’ performance
c)  what specific areas could be improved

Page 17: What machine translation developers are doing to make post-editors happy

MT Evaluation – where do we start!?

Lots of different ways to do evaluation:
–  automatic scores
•  BLEU, METEOR, GTM, TER
–  fluency, adequacy, comparative ranking
–  task-based evaluation
•  error analysis, post-edit productivity

Different metrics, different intelligence:
–  what does each type of metric tell us?
–  which ones are usable at which stage of evaluation?

e.g. can we really use automatic scores to assess productivity?
e.g. does productivity delta really tell us how good the output is?
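To make these metrics concrete, here is a minimal sketch of segment-level TER in Python. Real TER (Snover et al., 2006) also counts block shifts; this simplification uses plain word-level edit distance normalised by reference length, which is enough to show why TER maps naturally onto post-editing effort.

```python
# Simplified segment-level TER: how many word edits (insert / delete /
# substitute) turn the MT output into the reference, per reference word.
# Real TER also allows block shifts; this sketch omits them.

def edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance."""
    m, n = len(hyp_tokens), len(ref_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def segment_ter(hypothesis, reference):
    """0.0 means the MT output already matches the reference exactly."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

print(segment_ter("the patent claims a new method",
                  "the patent claims a novel method"))  # 1 edit / 6 words ~ 0.17
```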

Page 18: What machine translation developers are doing to make post-editors happy

Evaluation Case Study – RWS

- UK-headquartered public company
- Founded 1958
- 9th largest LSP (CSA 2013 report)
- Leader in specialist IP translations

Problem
Large Chinese to English patent translation project. Challenging content and language.

Question
What efficiencies, if any, can machine translation add to the workflow of RWS translators?

How we applied different types of MT evaluation at different stages in the process, at various go/no-go points, to help RWS assess whether MT is viable for this project.

Page 19: What machine translation developers are doing to make post-editors happy

Step 1: Are the engines any good?
Can we improve our baseline engines through customisation?

[Chart: BLEU and TER scores (0–0.8) for the Iconic Baseline vs. Iconic Customised engines]

-  Huge improvement
-  Intuitively, the scores reflect well but don’t really say anything
-  Let’s dig deeper

What next?
How good is the output relative to the task, i.e. post-editing?
- fluency/adequacy not going to tell us
- let’s start with segment-level TER

Page 20: What machine translation developers are doing to make post-editors happy

If we look deeper, what can we learn?

Translation Edit Rate: correlates well with practical evaluations

INTELLIGENCE
• Proportion of full matches (i.e. big savings)
• Proportion of close matches (i.e. faster than fuzzy matches)
• Proportion of poor matches

ACTIONABLE INFORMATION
• Type of sentence with high/low matches
• Weaknesses and gaps
• Segments to compare and analyse in translation memory
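This “intelligence” can be computed directly from segment-level TER scores. A minimal sketch; the band thresholds (0.0 for a full match, 0.3 for a close match) are illustrative assumptions, not figures from the talk.

```python
# Bucket segment-level TER scores into the match bands described above.
from collections import Counter

def band(ter_score):
    if ter_score == 0.0:
        return "full match"   # no edits needed: big savings
    if ter_score <= 0.3:      # assumed threshold for "light post-editing"
        return "close match"  # faster than working from fuzzy matches
    return "poor match"       # heavy editing: flag these segments for analysis

ter_scores = [0.0, 0.12, 0.25, 0.8, 0.0, 0.45]  # hypothetical engine output
counts = Counter(band(s) for s in ter_scores)
for name in ("full match", "close match", "poor match"):
    print(f"{name}: {counts[name] / len(ter_scores):.0%}")
```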

Page 21: What machine translation developers are doing to make post-editors happy

Step 2: Are they any good for post-editing?

[Chart: distribution of segment-level TER scores plotted against segment length]

Page 22: What machine translation developers are doing to make post-editors happy

Step 3: Quantifying with ACTUAL translators

Productivity Test

With MT experience and previous MT integration, productivity testing can be run in the production environment. In this case, we used the TAUS Dynamic Quality Framework.

Page 23: What machine translation developers are doing to make post-editors happy

Productivity Test

Page 24: What machine translation developers are doing to make post-editors happy

Step 3: Productivity testing

Beware the variables!

•  Translators: different experience, speed, perceptions of MT
–  24 translators: senior, staff, and interns

•  Test sets: not representative; particularly difficult
–  2 test sets, comprising 5 documents, and cross-fold validation

•  Environment and task: inexperience and unfamiliarity
–  Training materials, videos, and “dummy” segments

Page 25: What machine translation developers are doing to make post-editors happy

Findings and Learnings

Overall average: 25% productivity gain
-  Correlates with TER

By Translator Profile
-  Experienced: 22%
-  Staff: 23%
-  Interns: 30%
-  Rollout with junior staff for more immediate impact on the bottom line?

By Test Set
-  Test set 1.1: 25%
-  Test set 1.2: 35%
-  Test set 2.1: 6%
-  Test set 2.2: 35%
-  Don’t be overly concerned by outliers. Use data to facilitate source content profiling?
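For context, the productivity delta behind figures like these is a plain throughput comparison: words per hour when post-editing MT versus translating from scratch, grouped by translator profile. A minimal sketch; the session records and numbers below are hypothetical, and a real test (e.g. run through the TAUS Dynamic Quality Framework) would log time at segment level.

```python
# Compute per-profile productivity gain from (hypothetical) session logs.
from collections import defaultdict

# (translator_profile, condition, words_translated, hours)
sessions = [
    ("intern", "scratch", 1800, 1.0), ("intern", "post-edit", 2340, 1.0),
    ("staff",  "scratch", 2400, 1.0), ("staff",  "post-edit", 2950, 1.0),
]

throughput = defaultdict(lambda: defaultdict(list))
for profile, condition, words, hours in sessions:
    throughput[profile][condition].append(words / hours)

for profile, conds in throughput.items():
    scratch = sum(conds["scratch"]) / len(conds["scratch"])
    post_edit = sum(conds["post-edit"]) / len(conds["post-edit"])
    gain = (post_edit - scratch) / scratch
    print(f"{profile}: {gain:+.0%} productivity gain")  # intern: +30%, staff: +23%
```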

Page 26: What machine translation developers are doing to make post-editors happy

Evaluation

•  Objectively provide stakeholders with information such as:
a)  general quality expectations of an MT engine ✔
b)  how it’s impacting individual translators’ performance ✔
c)  what specific areas could be improved ✔

Now we actually talk to the translators to get their feedback on the task and the MT output, and start that virtuous loop… we’ll come back to this.

Metrics
•  WMT metrics shared task
•  New(er) metrics designed to correlate with post-editing effort
•  Optimising MT engines on new / different metrics

Page 27: What machine translation developers are doing to make post-editors happy

Quality Estimation and other features

Estimating the quality of MT output in real time, at runtime:

•  Binary classification (good/bad)
•  Multi-label classification, scores
•  Word level, error categorisation
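As a rough illustration of the binary (good/bad) case, the sketch below trains a classifier on shallow features of the source and MT output, with no reference translation in sight. Real QE systems use far richer features and much more data; the feature set, training examples, and labels here are illustrative assumptions only.

```python
# Toy binary quality estimation: predict whether an MT output is good enough
# to hand to a post-editor, without seeing a reference translation.
from sklearn.linear_model import LogisticRegression

def features(source, mt_output):
    src, tgt = source.split(), mt_output.split()
    return [
        len(src),                      # source length
        len(tgt) / max(len(src), 1),   # target/source length ratio
        sum(t == s for t, s in zip(tgt, src)) / max(len(tgt), 1),  # copied-word proxy
    ]

# Hypothetical training data: 1 = good enough to post-edit, 0 = bad
train = [
    ("das Patent beschreibt ein Verfahren", "the patent describes a method", 1),
    ("das Patent beschreibt ein Verfahren", "das Patent beschreibt ein Verfahren", 0),
    ("ein neues System", "a new system", 1),
    ("ein neues System", "ein neues system new", 0),
]

X = [features(src, mt) for src, mt, _ in train]
y = [label for _, _, label in train]
clf = LogisticRegression().fit(X, y)

print(clf.predict([features("ein Verfahren", "a method")]))  # e.g. [1] = good
```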

Page 28: What machine translation developers are doing to make post-editors happy

Feedback

Engaging end users (post-editors, LSPs) both directly and indirectly, taking feedback on board for the betterment of MT.

Direct Feedback
•  talking to the translators (imagine!)
•  collecting structured feedback
–  error categorisation
–  correction
–  severity
•  commenting on error types and actions

Establish a relationship and understanding to foster acceptance.

Page 29: What machine translation developers are doing to make post-editors happy

Understanding the MT developer

The machine translation engine will never be 100% perfect. Certain types of sentences will always lend themselves better to MT than others. Our joint goal is to get the machine translation quality to a level where the majority of sentences are translated well, and the process of post-editing is faster and more efficient than piecing together translations from a combination of fuzzy matches, terminology, and reference translations.

There are certain types of MT output errors that can be fixed quickly and easily, while others are more fundamental issues that will be fixed through general improvement of the engines and the technology itself over time. Here are some examples of each:

Quick Fixes
- Technical terminology
- Frequent, consistent set phrases
- Stylistic/formatting errors

Fixed Over Time
- General grammatical errors
- Sentence-level disfluency
- Noun phrase ordering

If we encounter an error that is just a “minor” mistake and, in general, the context around it is OK, sometimes the best approach is to simply leave it for post-editing.

Page 30: What machine translation developers are doing to make post-editors happy

Feedback

Indirect Feedback
•  terminology management
•  automatic post-edit rules (sketched below)
•  templates for generalisation

Empowering the translator to effect change themselves.
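Automatic post-edit rules are a concrete example of this kind of indirect feedback: deterministic corrections harvested from repeated translator edits and applied to the raw MT output before it ever reaches the post-editor. A minimal sketch; the rules themselves are hypothetical.

```python
# Apply deterministic post-edit rules (terminology, style, spacing) to MT output.
import re

# (pattern, replacement) pairs, e.g. enforced terminology or client style rules
POST_EDIT_RULES = [
    (re.compile(r"\bpatent right\b", re.IGNORECASE), "patent"),  # terminology fix
    (re.compile(r"\bfigure (\d+)\b"), r"Fig. \1"),               # client style guide
    (re.compile(r"\s+([,.;:])"), r"\1"),                         # no space before punctuation
]

def apply_rules(mt_output):
    for pattern, replacement in POST_EDIT_RULES:
        mt_output = pattern.sub(replacement, mt_output)
    return mt_output

print(apply_rules("The patent right covers figure 3 , as claimed ."))
# -> "The patent covers Fig. 3, as claimed."
```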

Page 31: What machine translation developers are doing to make post-editors happy

Integration

•  Make MT fit as seamlessly as possible into the translator workflow
a)  directly into existing CAT tools
b)  new CAT tools
c)  what else would you like? :)

•  Most CAT tools have MT plugins for most MT vendors (see the sketch below)
–  Studio, MemoQ, Wordfast, MultiTrans

•  Matecat making MT more central
–  facilitating online learning technology too

•  Highlighting, instrumentation, TM / MT cooperation
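Under the hood, a CAT-tool MT plugin mostly just requests a translation for the current segment over an API and shows it alongside TM matches. A minimal sketch; the endpoint, parameters, and response shape below are hypothetical, not any particular vendor’s API.

```python
# Fetch an MT suggestion for the segment currently open in the editor.
import requests

def fetch_mt_suggestion(segment, engine_id):
    response = requests.post(
        "https://mt.example.com/api/translate",  # hypothetical MT API endpoint
        json={"engine": engine_id, "source": segment},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()["translation"]  # assumed response field

# The suggestion would appear next to fuzzy matches, ideally with a quality
# estimate attached so the post-editor knows what to expect.
suggestion = fetch_mt_suggestion("本发明涉及一种方法", "zh-en-patents")
```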

Page 32: What machine translation developers are doing to make post-editors happy

“The biggest room in the world is the room for improvement”

Page 33: What machine translation developers are doing to make post-editors happy

Thank You! [email protected]

@IconicTrans