Personalized Deep Learning with Incremental Adaptation
Joern Wuebker (joern@lilt.com)
NVIDIA GTC, Washington DC, Oct 24, 2018
What will you learn in this talk?
● How to adapt neural machine translation models in real time, to learn domain-specific terminology, translator word choice and writing style
● How to encourage sparsity in personalized models using structured regularization (here: 70% reduction in network size)
● How to make personalized models available in a large-scale distributed environment
● Applicable to all tasks in which users are generating supervised data as they work
What is Lilt?
● Browser-based Computer-Aided Translation (CAT) tool
● Predictive typing / interactive machine translation (sketched below)
○ Input: source language sentence, target language prefix
○ Predict: target language sentence completion
● Difference to autocomplete on the phone:
○ Larger context: source language sentence, target language prefix
○ Prediction of the full sentence completion
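To make the input/output contract concrete, here is a minimal sketch of prefix-constrained greedy decoding. `next_token_probs` is a hypothetical stand-in for the NMT decoder, not Lilt's actual API:

```python
def complete_prefix(source, prefix, next_token_probs, max_len=50, eos="</s>"):
    """Extend the translator's typed prefix into a full target sentence.

    `next_token_probs(source, target_so_far)` is assumed to return a
    {token: probability} dict from the underlying NMT model.
    """
    target = list(prefix)  # the suggestion must keep the user's prefix verbatim
    while len(target) < max_len:
        probs = next_token_probs(source, target)
        token = max(probs, key=probs.get)  # greedy; production would use beam search
        if token == eos:
            break
        target.append(token)
    return target
```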
Short history of neural machine translation
System (English > German)                            BLEU[%] (newstest2014)   GPU training hours
Statistical MT (Sennrich & Haddow, 2015)             22.6                     n/a
Attention-based Neural MT (Bahdanau et al., 2014)    19.9                     252 (K6000)
+ Monolingual training data (Sennrich et al., 2016)  22.7                     670 (Titan Black)
+ Ensemble of neural models (Sennrich et al., 2016)  23.8                     670 (Titan Black)
+ Deep network (Wu et al., 2016, Google)             26.3                     18,000 (K80)
Transformer network (Vaswani et al., 2017)           28.4                     670 (P100)
Sennrich et al. Improving Neural Machine Translation Models with Monolingual Data.
Bojar et al. Findings of the 2016 Conference on Machine Translation.
Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Vaswani et al. Attention is All You Need.
Why Predictive Typing?
Post-editing vs. predictive typing
Post-editing:
1. enter/upload source sentence (user action)
2. full sentence suggestion (machine action)
3. edit full sentence (user action)

Predictive typing:
1. enter/upload source sentence (user action)
2. full sentence suggestion (machine action)
3. correct first error (user action)
4. sentence completion suggestion (machine action; steps 3-4 repeat until the sentence is accepted)
Post-editing machine translation
● Post-edited translations are generated more quickly and ranked as more accurate than unaided translations by professionals (Green et al., 2013).
○ Comparison of professional translators for English to French, Arabic, and German.
● But: translators hate post-editing!
● Expert translators make more edits in less time (Moorkens & O'Brien, 2015).
○ Professional translators were 3x more productive at post-editing than translation students.
● NMT doesn't speed up post-editing much vs. statistical MT (Castilho et al., 2017)
○ For En-{De,Pt,El,Ru}, post-editing NMT output was only ~5% faster, even though it required ~15% fewer keystrokes.
○ Participants indicated that they found NMT errors more difficult to identify.
Green, Spence, Jeffrey Heer, and Christopher D. Manning. "The efficacy of human post-editing for language translation."
Moorkens, Joss, and Sharon O’Brien. "Post-editing evaluations: Trade-offs between novice and professional participants."
Castilho, Sheila, et al. "A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators."
Predictive typing
● NMT helps for full-sentence MT, but even more for prefix-constrained MT (Wuebker et al., 2016)
● Predictive typing leads to more edits and higher quality (Green et al., 2014)
○ Comparison of professional translators for English to French and German.
○ Predictive typing did take ~20% longer than post-editing.
○ When asked, "I would use interactive translation features if they were integrated into a CAT product," 20 out of 25 translators responded "agree" or "strongly agree."
● End translation quality is higher with predictive typing (Client Evaluation, 2017)
○ Error frequency, detected by review, was 1.1% for post-editing and 0.3% for predictive typing.
○ Throughput with predictive typing was 700+ words/hour, double a typical unassisted speed.

Wuebker, Joern, et al. "Models and Inference for Prefix-Constrained Machine Translation."
Green, Spence, et al. "Predictive Translation Memory: A mixed-initiative system for human language translation." Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. ACM, 2014.
Transformer Network Architecture
[Figure: Transformer network architecture. The encoder maps the source sentence ("Eine Glühstiftkerze (1) dient ...") through an embedding lookup (~10.3M parameters) into a 4× stack of self-attention + filter (feed-forward) layers (~526K parameters per layer). The decoder maps the target prefix ("<s> A sheathed-element glow plug ...") through an embedding lookup (~10.3M parameters) into a stack of self-attention, encoder-attention and filter layers (~788K parameters per layer), followed by an output projection (~10.3M parameters) that predicts "A sheathed-element glow plug ...".]
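As a reference point for the attention blocks in the figure, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of both self-attention and encoder attention (single head, no masking or learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query/key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mixture of values
```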
Adaptation
Incremental adaptation: Document context
Example: a patent (https://www.google.com/patents/WO2007000372A1)
Sheathed-element glow plug
A sheathed-element glow plug (1) is to be placed inside a chamber (3) of an internal combustion engine. The sheathed-element glow plug (1) comprises a heating body (2) that has a glow tube (6) connected to a housing (4). The heating body (2) also comprises a ceramic heating element (15), which is placed inside the glow tube (6) and which serves to heat the glow tube (6). The glow tube (6) guarantees a thermal and mechanical protection for the ceramic heating element (15).
sheathed-element glow plug ↔ Glühstiftkerze
https://translate.google.com
Incremental adaptation: Example
[Diagram: the user and the personalized MT system in a feedback loop]
1. Initial MT suggestion
2. User correction
3. Learn from the correction
4. Improved suggestion
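A schematic of this loop in Python; `translate`, `get_correction` and `sgd_step` are hypothetical placeholders for the decoder, the user interaction, and a single online gradient update, not Lilt's actual interfaces:

```python
def interactive_adaptation_loop(model, translate, get_correction, sgd_step, sources):
    """One pass over a document, learning from each confirmed segment."""
    for src in sources:
        suggestion = translate(model, src)           # 1. initial MT suggestion
        reference = get_correction(src, suggestion)  # 2. user correction
        sgd_step(model, src, reference)              # 3. learn from the correction
        # 4. the next suggestion is already produced by the updated model
```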
Personalized MT: Translation process
1. Incoming translation request for User X
2. Load User X's model from cache or persistent storage
3. Apply model parameters to the computation graph in TensorFlow
4. Generate translation
5. Respond to the translation request (max. response time: ~300 ms)
Full model: ~36M parameters
Personalized model: to keep steps (2) and (3) within the latency budget, max. ~10M parameters
Solution (sketched below):
- Store personalized models as offsets from the baseline model: W = W_b + W_u
- Intelligent selection of a sparse parameter subset W_u ⇒ Group Lasso
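A minimal sketch of assembling W = W_b + W_u at request time, assuming parameters are kept as name → array dicts; the tensor names and storage layout are illustrative, not Lilt's actual format:

```python
import numpy as np

def personalize(baseline, user_offsets):
    """W = W_b + W_u: overlay a user's sparse offset tensors on the baseline.

    Tensors without a stored offset are shared with the baseline, which is
    what keeps a personalized model far smaller than the full ~36M parameters.
    """
    personalized = dict(baseline)  # shallow copy: unchanged tensors are shared
    for name, offset in user_offsets.items():
        personalized[name] = baseline[name] + offset
    return personalized

# Example: this user's adaptation only touched the output projection.
baseline = {"encoder/embedding": np.zeros((4, 2)), "output/projection": np.ones((2, 4))}
user_offsets = {"output/projection": 0.1 * np.random.randn(2, 4)}
model = personalize(baseline, user_offsets)
```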
Structured Sparsity - “Group Lasso”
● Simultaneous regularization and tensor selection
● Treat entire tensors/columns as individual parameters w.r.t. L1 regularization (penalty sketched below)
● Can be easily implemented with any neural model
● Applicable to any interactive machine assistance task
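A sketch of the penalty in its standard group-lasso form, applied to the offset tensors: an L1 norm over per-group L2 norms, which drives whole tensors exactly to zero. The √(group size) scaling is the usual convention (Yuan & Lin, 2006) and an assumption here, not quoted from the paper:

```python
import numpy as np

def group_lasso_penalty(offsets, lam=0.1):
    """lam * sum_g sqrt(|g|) * ||W_g||_2, with one group per offset tensor.

    L2 within a group, L1 across groups: tensors whose offsets contribute
    little are pruned to exactly zero, shrinking the stored user model.
    """
    return lam * sum(np.sqrt(w.size) * np.linalg.norm(w)
                     for w in offsets.values())
```

Adding a term of this kind to the adaptation loss is how the ~70% reduction in network size mentioned earlier is obtained.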
Adaptation results
Wuebker et al. "Compact Personalized Models for Neural Machine Translation." To appear in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, October 2018.
[Figure: adaptation results for varying Group Lasso model sizes]
Learning Curve
Word prediction accuracy (WPA): evaluated by predicting each word of each segment, conditioned on all previous words in the segment (a minimal sketch of the metric follows the table below).
[Figure: learning curve; the y-axis shows the performance difference between an incrementally adapted model and a static baseline model.]
Approach     WPA
Unadapted    34.8%
Adapted      40.3%
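A minimal sketch of WPA as defined above; `model_predict(source, prefix)` is a hypothetical single-word predictor standing in for the real model:

```python
def word_prediction_accuracy(model_predict, segments):
    """Predict each reference word from the gold prefix; report the match rate.

    `segments` is an iterable of (source, reference_tokens) pairs.
    """
    correct, total = 0, 0
    for source, reference in segments:
        for i, word in enumerate(reference):
            total += 1
            if model_predict(source, reference[:i]) == word:
                correct += 1
    return correct / max(total, 1)
```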
Lilt (Demo)
Availability of personalized models
● Translation services are deployed as auto-scalable Kubernetes pods
● Personalized models are stored in a three-level cache (sketched below):
○ Local LRU (least-recently-used) cache on each translator node
○ Region-specific high-availability in-memory database (Redis)
○ Permanent cloud data storage
● Provides a balance between availability, memory footprint and performance
● Multiple users can work together using the same personalized model
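A sketch of the three-level lookup, assuming a redis-py-style client with `get`/`set` and a `cloud_fetch` callable for permanent storage; the names and capacity are illustrative, not Lilt's actual code:

```python
from collections import OrderedDict

class ThreeLevelModelCache:
    """Lookup order: node-local LRU -> regional Redis -> permanent cloud storage."""

    def __init__(self, redis_client, cloud_fetch, capacity=100):
        self.local = OrderedDict()      # LRU cache local to this translator node
        self.redis = redis_client       # region-specific in-memory database
        self.cloud_fetch = cloud_fetch  # e.g. a download from cloud storage
        self.capacity = capacity

    def get(self, user_id):
        if user_id in self.local:             # level 1: local LRU hit
            self.local.move_to_end(user_id)
            return self.local[user_id]
        blob = self.redis.get(user_id)        # level 2: regional Redis
        if blob is None:
            blob = self.cloud_fetch(user_id)  # level 3: permanent storage
            self.redis.set(user_id, blob)     # warm the regional cache
        self.local[user_id] = blob            # warm the local cache
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)    # evict the least-recently used entry
        return blob
```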
Conclusions
Lilt research team
Summary
● If end-to-end translation quality is a primary concern, interactive human translation with predictive typing appears to be the most cost-efficient option.
● Online incremental adaptation is very effective, even with small data sets.
● The impact of adaptation is on par with, or larger than, the difference between neural MT and statistical MT.
● Sparse personalized models via structured regularization: ~70% reduction in model size.
Thank you!
Joern Wuebker (joern@lilt.com)
NVIDIA GTC, Washington DC, Oct 24, 2018
Production Architecture
General architecture
[Diagram: general architecture. The browser sends queries to and receives responses from a front end / app server, which communicates with the services backbone via a message queue. The backbone comprises the Translation, Converter, Updater, Lexicon and Translation Memory services, backed by a MySQL DB and other storage types (GCS, Redis).]
Why we care about human translators
● The majority of translations (perhaps 99.7%) are generated by computers
● 1000× price ratio: 15 ¢/word from an LSP (language service provider) vs. 0.015 ¢/word from an MT API
● The volume of human translation is still large and growing
○ An estimated $21 billion was spent on text translation in 2017 (Common Sense Advisory)
○ Year-over-year growth in the language services industry is 7%
○ ~130 billion words/year ÷ (2,500 words/day × 250 days/year = 625,000 words/year per translator) ⇒ 200k+ people