Personalized Deep Learning with Incremental Adaptation
Joern Wuebker (joern@lilt.com)
NVIDIA GTC, Washington DC, Oct 24, 2018
What will you learn in this talk?
● How to adapt neural machine translation models in real time, to learn domain-specific terminology, translator word choice and writing style
● How to encourage sparsity in personalized models using structured regularization (here: 70% reduction in network size)
● How to make personalized models available in a large-scale distributed environment
● Applicable to all tasks in which users are generating supervised data as they work
What is Lilt?
● Browser-based Computer-Aided Translation (CAT) tool
● Predictive typing / interactive machine translation (sketched below)
○ Input: source language sentence, target language prefix
○ Predict: target language sentence completion
● Difference to autocomplete on the phone:
○ Larger context: source language sentence, target language prefix
○ Prediction of the full sentence completion
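To make the input/output contract concrete, here is a minimal sketch of prefix-constrained greedy decoding. `next_token_probs` is a hypothetical stand-in for the NMT decoder, not Lilt's actual API:

```python
def complete_prefix(source, prefix, next_token_probs, max_len=50, eos="</s>"):
    """Extend the translator's typed prefix into a full target sentence.

    `next_token_probs(source, target_so_far)` is assumed to return a
    {token: probability} dict from the underlying NMT model.
    """
    target = list(prefix)  # the suggestion must keep the user's prefix verbatim
    while len(target) < max_len:
        probs = next_token_probs(source, target)
        token = max(probs, key=probs.get)  # greedy; production would use beam search
        if token == eos:
            break
        target.append(token)
    return target
```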
Short history of neural machine translation
System (English > German)                            BLEU[%] (newstest2014)   GPU training hours
Statistical MT (Sennrich & Haddow, 2015)             22.6                     n/a
Attention-based Neural MT (Bahdanau et al., 2014)    19.9                     252 (K6000)
+ Monolingual training data (Sennrich et al., 2016)  22.7                     670 (Titan Black)
+ Ensemble of neural models (Sennrich et al., 2016)  23.8                     670 (Titan Black)
+ Deep network (Wu et al., 2016, Google)             26.3                     18,000 (K80)
Transformer network (Vaswani et al., 2017)           28.4                     670 (P100)
Sennrich et al. Improving Neural Machine Translation Models with Monolingual Data.
Bojar et al. Findings of the 2016 Conference on Machine Translation.
Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Vaswani et al. Attention is All You Need.
Why Predictive Typing?
Post-editing vs. predictive typing
Post-editing:
1. enter/upload source sentence (user action)
2. full sentence suggestion (machine action)
3. edit full sentence (user action)

Predictive typing:
1. enter/upload source sentence (user action)
2. full sentence suggestion (machine action)
3. correct first error (user action)
4. sentence completion suggestion (machine action; steps 3-4 repeat until the sentence is accepted)
Post-editing machine translation
● Post-edited translations are generated more quickly and ranked as more accurate than unaided translations by professionals (Green et al., 2013).
○ Comparison of professional translators for English to French, Arabic, and German.
● But: translators hate post-editing!
● Expert translators make more edits in less time (Moorkens & O'Brien, 2015).
○ Professional translators were 3x more productive at post-editing than translation students.
● NMT doesn't speed up post-editing much vs. statistical MT (Castilho et al., 2017)
○ For En-{De,Pt,El,Ru}, post-editing NMT output was only ~5% faster, even though it required ~15% fewer keystrokes.
○ Participants indicated that they found NMT errors more difficult to identify.
Green, Spence, Jeffrey Heer, and Christopher D. Manning. "The efficacy of human post-editing for language translation."
Moorkens, Joss, and Sharon O’Brien. "Post-editing evaluations: Trade-offs between novice and professional participants."
Castilho, Sheila, et al. "A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators."
Predictive typing
● NMT helps for full-sentence MT, but even more for prefix-constrained MT (Wuebker et al., 2016)
● Predictive typing leads to more edits and higher quality (Green et al., 2014)
○ Comparison of professional translators for English to French and German.
○ Predictive typing did take ~20% longer than post-editing.
○ When asked, "I would use interactive translation features if they were integrated into a CAT product," 20 out of 25 translators responded "agree" or "strongly agree."
● End translation quality is higher with predictive typing (Client Evaluation, 2017)
○ Error frequency, detected by review, was 1.1% for post-editing and 0.3% for predictive typing.
○ Throughput with predictive typing was 700+ words/hour, double a typical unassisted speed.

Wuebker, Joern, et al. "Models and Inference for Prefix-Constrained Machine Translation."
Green, Spence, et al. "Predictive Translation Memory: A mixed-initiative system for human language translation." Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. ACM, 2014.
Transformer Network Architecture
[Figure: Transformer network architecture. The encoder maps the source sentence ("Eine Glühstiftkerze (1) dient ...") through an embedding lookup (~10.3M parameters) into a 4× stack of self-attention + filter (feed-forward) layers (~526K parameters per layer). The decoder maps the target prefix ("<s> A sheathed-element glow plug ...") through an embedding lookup (~10.3M parameters) into a stack of self-attention, encoder-attention and filter layers (~788K parameters per layer), followed by an output projection (~10.3M parameters) that predicts "A sheathed-element glow plug ...".]
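As a reference point for the attention blocks in the figure, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of both self-attention and encoder attention (single head, no masking or learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query/key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mixture of values
```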
Adaptation
Incremental adaptation: Document context
Example: a patent (https://www.google.com/patents/WO2007000372A1)
Sheathed-element glow plug
A sheathed-element glow plug (1) is to be placed inside a chamber (3) of an internal combustion engine. The sheathed-element glow plug (1) comprises a heating body (2) that has a glow tube (6) connected to a housing (4). The heating body (2) also comprises a ceramic heating element (15), which is placed inside the glow tube (6) and which serves to heat the glow tube (6). The glow tube (6) guarantees a thermal and mechanical protection for the ceramic heating element (15).
sheathed-element glow plug ↔ Glühstiftkerze
https://translate.google.com
Incremental adaptation: Example
[Diagram: the user and the personalized MT system in a feedback loop]
1. Initial MT suggestion
2. User correction
3. Learn from the correction
4. Improved suggestion
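A schematic of this loop in Python; `translate`, `get_correction` and `sgd_step` are hypothetical placeholders for the decoder, the user interaction, and a single online gradient update, not Lilt's actual interfaces:

```python
def interactive_adaptation_loop(model, translate, get_correction, sgd_step, sources):
    """One pass over a document, learning from each confirmed segment."""
    for src in sources:
        suggestion = translate(model, src)           # 1. initial MT suggestion
        reference = get_correction(src, suggestion)  # 2. user correction
        sgd_step(model, src, reference)              # 3. learn from the correction
        # 4. the next suggestion is already produced by the updated model
```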
Personalized MT: Translation process
1. Incoming translation request for User X
2. Load User X's model from cache or persistent storage
3. Apply model parameters to the computation graph in TensorFlow
4. Generate translation
5. Respond to the translation request (max. response time: ~300 ms)
Full model: ~36M parameters
Personalized model: to keep steps (2) and (3) within the latency budget, max. ~10M parameters
Solution (sketched below):
- Store personalized models as offsets from the baseline model: W = W_b + W_u
- Intelligent selection of a sparse parameter subset W_u ⇒ Group Lasso
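A minimal sketch of assembling W = W_b + W_u at request time, assuming parameters are kept as name → array dicts; the tensor names and storage layout are illustrative, not Lilt's actual format:

```python
import numpy as np

def personalize(baseline, user_offsets):
    """W = W_b + W_u: overlay a user's sparse offset tensors on the baseline.

    Tensors without a stored offset are shared with the baseline, which is
    what keeps a personalized model far smaller than the full ~36M parameters.
    """
    personalized = dict(baseline)  # shallow copy: unchanged tensors are shared
    for name, offset in user_offsets.items():
        personalized[name] = baseline[name] + offset
    return personalized

# Example: this user's adaptation only touched the output projection.
baseline = {"encoder/embedding": np.zeros((4, 2)), "output/projection": np.ones((2, 4))}
user_offsets = {"output/projection": 0.1 * np.random.randn(2, 4)}
model = personalize(baseline, user_offsets)
```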
Structured Sparsity - “Group Lasso”
● Simultaneous regularization and tensor selection
● Treat entire tensors/columns as individual parameters w.r.t. L1 regularization (penalty sketched below)
● Can be easily implemented with any neural model
● Applicable to any interactive machine assistance task
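A sketch of the penalty in its standard group-lasso form, applied to the offset tensors: an L1 norm over per-group L2 norms, which drives whole tensors exactly to zero. The √(group size) scaling is the usual convention (Yuan & Lin, 2006) and an assumption here, not quoted from the paper:

```python
import numpy as np

def group_lasso_penalty(offsets, lam=0.1):
    """lam * sum_g sqrt(|g|) * ||W_g||_2, with one group per offset tensor.

    L2 within a group, L1 across groups: tensors whose offsets contribute
    little are pruned to exactly zero, shrinking the stored user model.
    """
    return lam * sum(np.sqrt(w.size) * np.linalg.norm(w)
                     for w in offsets.values())
```

Adding a term of this kind to the adaptation loss is how the ~70% reduction in network size mentioned earlier is obtained.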
Adaptation results
Wuebker et al. "Compact Personalized Models for Neural Machine Translation." To appear in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, October 2018.
[Figure: adaptation results for varying Group Lasso model sizes]
Learning Curve
Word prediction accuracy (WPA): evaluated by predicting each word of each segment, conditioned on all previous words in the segment (a minimal sketch of the metric follows the table below).
[Figure: learning curve; the y-axis shows the performance difference between an incrementally adapted model and a static baseline model.]
Approach     WPA
Unadapted    34.8%
Adapted      40.3%
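A minimal sketch of WPA as defined above; `model_predict(source, prefix)` is a hypothetical single-word predictor standing in for the real model:

```python
def word_prediction_accuracy(model_predict, segments):
    """Predict each reference word from the gold prefix; report the match rate.

    `segments` is an iterable of (source, reference_tokens) pairs.
    """
    correct, total = 0, 0
    for source, reference in segments:
        for i, word in enumerate(reference):
            total += 1
            if model_predict(source, reference[:i]) == word:
                correct += 1
    return correct / max(total, 1)
```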
Lilt (Demo)
Availability of personalized models
● Translation services are deployed as auto-scalable Kubernetes pods
● Personalized models are stored in a three-level cache (sketched below):
○ Local LRU (least-recently-used) cache on each translator node
○ Region-specific high-availability in-memory database (Redis)
○ Permanent cloud data storage
● Provides a balance between availability, memory footprint and performance
● Multiple users can work together using the same personalized model
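A sketch of the three-level lookup, assuming a redis-py-style client with `get`/`set` and a `cloud_fetch` callable for permanent storage; the names and capacity are illustrative, not Lilt's actual code:

```python
from collections import OrderedDict

class ThreeLevelModelCache:
    """Lookup order: node-local LRU -> regional Redis -> permanent cloud storage."""

    def __init__(self, redis_client, cloud_fetch, capacity=100):
        self.local = OrderedDict()      # LRU cache local to this translator node
        self.redis = redis_client       # region-specific in-memory database
        self.cloud_fetch = cloud_fetch  # e.g. a download from cloud storage
        self.capacity = capacity

    def get(self, user_id):
        if user_id in self.local:             # level 1: local LRU hit
            self.local.move_to_end(user_id)
            return self.local[user_id]
        blob = self.redis.get(user_id)        # level 2: regional Redis
        if blob is None:
            blob = self.cloud_fetch(user_id)  # level 3: permanent storage
            self.redis.set(user_id, blob)     # warm the regional cache
        self.local[user_id] = blob            # warm the local cache
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)    # evict the least-recently used entry
        return blob
```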
Conclusions
Lilt research team
Summary
● If end-to-end translation quality is a primary concern, interactive human translation with predictive typing appears to be the most cost-efficient option.
● Online incremental adaptation is very effective, even with small data sets.
● The impact of adaptation is on par with, or larger than, the difference between neural MT and statistical MT.
● Sparse personalized models via structured regularization: ~70% reduction in model size.
Thank you!
Joern Wuebker (joern@lilt.com)
NVIDIA GTC, Washington DC, Oct 24, 2018
Production Architecture
General architecture
[Diagram: general architecture. The browser sends queries to and receives responses from a front end / app server, which communicates with the services backbone via a message queue. The backbone comprises the Translation, Converter, Updater, Lexicon and Translation Memory services, backed by a MySQL DB and other storage types (GCS, Redis).]
Why we care about human translators
● The majority of translations (perhaps 99.7%) are generated by computers
● 1000× price ratio: 15 ¢/word from an LSP (language service provider) vs. 0.015 ¢/word from an MT API
● The volume of human translation is still large and growing
○ An estimated $21 billion was spent on text translation in 2017 (Common Sense Advisory)
○ Year-over-year growth in the language services industry is 7%
○ ~130 billion words/year ÷ (2,500 words/day × 250 days/year = 625,000 words/year per translator) ⇒ 200k+ people