Research and Practice of Simultaneous Speech Translation in Huawei … · 2021. 1. 29. · Simultaneous Speech Translation in Huawei Noah’s Ark Lab Qun Liu Background A General

Research and Practice of

Simultaneous Speech Translation

in Huawei Noah’s Ark Lab

Prof. Dr. Qun LiuChief Scientist of Speech and Language Computing

Huawei Noah’s Ark Lab

10 July 2020, AutoSimTrans Workshop

SimultaneousSpeech Translation

in HuaweiNoah’s Ark Lab

Qun Liu

Background

A General SSTFramework

Adapting ASR &NMT to SST

Implementationand Optimization

VideoDemonstrations

References

Content

1 Background

2 A General SST Framework

3 Adapting ASR & NMT to SST

4 Implementation and Optimization

5 Video Demonstrations



Qun Liu

Background




VideoDemonstrations

References

Content

1 Background







Qun Liu

Background




VideoDemonstrations

References

Background

• Along with the success of deep learning in MT, ASR and TTS,

Simultaneous Speech Translation (SST) became commercially feasible:

Microsoft Skype Translator Google Pixel Buds Tencent Conference Interpretor



Qun Liu

Background




VideoDemonstrations

References

Potential SST Scenarios in Huawei

• Huawei has a very long product line.

• There are many potential scenarios where SST can be applied:

Mobile phones Conference systems Customer service systems



Qun Liu

Background




VideoDemonstrations

References

SST Research in Huawei

• We started our research on SST in 2019.

• We propose a General SST Framework

• We adopt techniques to adapt our existing ASR and NMT systems to a

SST system

• We implement and optimize our SST systems in the followingscenarios:

• Conference speech translator on the cloud• Travel assistant on mobile phones



Qun Liu

Background




VideoDemonstrations

References

Content

1 Background







Qun Liu

Background




VideoDemonstrations

References

A General Framework for Simultaneous NMT [1]

• Motivation:

• Adapt an existing NMT system into a simultaneous translation system,

without retraining the NMT model.

• Components:

• An external ASR system• A pretrained NMT System• An input buffer• An output buffer

• Preprint: https://arxiv.org/abs/1911.03154

https://arxiv.org/abs/1911.03154



Qun Liu

Background




VideoDemonstrations

References

A General Framework for Simultaneous NMT [1]

• Procedure as two nested loops:

• Outer loop: update input buffer with ASR streaming output• Inner loop: update output buffer with NMT model.• The input buffer is updated only when the inner loop stops to update the

output buffer.





Qun Liu

Background




VideoDemonstrations

References

Key Problems

• How to continue translation

• Streaming source prefix→ rebuild encoder hidden states• Force decoding with the NMT model

• How to terminate

• Terminate when predicting the EOS token• Terminate when the system is not confident and prefer to wait for more

input tokens

• A Controller is used to determinate when to terminate• We define two controllers for this purpose



Qun Liu

Background




VideoDemonstrations

References

Length and EOS (LE) Controller

• The inner loop stops to generate output tokens when:

• A EOS token is generated;• The output buffer is full;• The delay between the output sequence and the input sequence is

smaller than a predefined threshold (e.g. 3 words).



Qun Liu

Background




VideoDemonstrations

References

Trainable (TN) Controller

• The inner loop stops to generate output tokens when:

• The output buffer is full;• The TN Controller output a CONTINUE signal.



Qun Liu

Background




VideoDemonstrations

References

TN Controller Training: Policy Gradient

• Fix the NMT model and train the TN controller with policy gradient:



Qun Liu

Background




VideoDemonstrations

References

TN Controller Training: Reward

• The reward function trades off quality and delay:



Qun Liu

Background




VideoDemonstrations

References

Experiments

• baselines:

• test-time-waitk: Ma et al. [2018]• SL: Zheng et al. [2019]• RWAgent: Gu et al. [2017]



Qun Liu

Background



Accurate Alignment for

Terminology Translation

Punctuation Prediction

Disfluency Detection and

Correction

Data Augmentation for

Robust Speech Translation


VideoDemonstrations

References

Content

1 Background







Qun Liu

Background







Correction




VideoDemonstrations

References

Content


Accurate Alignment for Terminology Translation in NMT


Disfluency Detection and Correction

Data Augmentation for Robust Speech Translation



Qun Liu

Background







Correction




VideoDemonstrations

References

Terminology Translation in NMT

• Terminology (including user-defined terms, entities, acronyms, etc.)

translation is crutial in many scenarios.

• It is not a trivial task in NMT because there is no explicit phrase table

used in NMT process.

• Solutions:• Without word alignment: Constrained Decoding

• Time consuming or inaccurate [6; 9].

• With word alignment: terminology replacement in automatic post-editing

• Word alignment induced from NMT is not accurate enough [5; 15; 8; 3].



Qun Liu

Background







Correction




VideoDemonstrations

References

Accurate Alignment for Terminology Translation [2]

• Our observation:

The hidden states at decoding step i + 1 are better representing target

word yi than the hidden states at step i for inducing word alignment





Qun Liu

Background







Correction




VideoDemonstrations

References

Accurate Alignment for Terminology Translation [2]

• Our work:• We introduce SHIFT-ATT, a pure interpretation method to induce

alignments from attention weights of vanilla Transformer.

• To select the best layer to induce alignments, we propose a surrogate layer

selection criterion, to ensure the induced word alignments agree best in

both translation directions, without manually labelled word alignments.

• We further propose SHIFT-AET, which extracts alignments from an

additional alignment module.





Qun Liu

Background







Correction




VideoDemonstrations

References

Experiments and Results



Qun Liu

Background







Correction




VideoDemonstrations

References

Content








Qun Liu

Background







Correction




VideoDemonstrations

References


• Models

• Baseline: BiLSTM Yi et al. [13]• BERT with Fine-tune• TinyBERT [7] (Distilled from BERT)

• Data: IWSLT2011 TED Corpus

• Labels:4 labels 3 labels

Comma(,)

Period(.) Period(.)

Question(?) Question(?)

None(O) None(O)



Qun Liu

Background







Correction




VideoDemonstrations

References




Qun Liu

Background







Correction




VideoDemonstrations

References

Content








Qun Liu

Background







Correction




VideoDemonstrations

References


• English Switchboard Corpus:

(from Wang et al. [11])

• IWSLT2020 TED Corpus (Chinese):

• ASR:黑胡桃木，黑胡桃木是我们店内最好的木材。• REF:黑胡桃木是我们店内最好的木材。• ASR:啊它的性质很好不容易开裂，• REF:它的性质很好不容易开裂，



Qun Liu

Background







Correction




VideoDemonstrations

References


• BERT with Multitask Fine-tuning

• follow Wang et al. [11]

• Task1: Sequence labeling

• RM and IM→ D (DELETE)• RP and Others→ O (OTHER)

• Task2: Sentence pair classification

• Smooth/Disfluent→ DF_0• Disfluent/Smooth→ DF_1

• Training Data: English Switchboard

• Data augmentation:

• Randomly insert IM• Swap RM & RP



Qun Liu

Background







Correction




VideoDemonstrations

References


• Experiment (English Switchboard Corpus):P R F1

UBT [12] 90.3 80.5 85.1

Semi-CRF [4] 90 81.2 85.4

Bi-LSTM [14] 91.8 80.6 85.9

Transition-based [10] 91.1 84.1 87.5

MTL Self-supervised [11] 93.4 87.3 90.2Our MTL-DA 91.3 87.7 89.4

• Experiment (IWSLT2020 TED Corpus, Chinese):TER (Translation Error Rate) Train Dev Test

Baseline 35.2 36.3 33.1

+ Expert Rules (pre-processing) 31.8(-3.4) 31.9(-4.4) 30.1(-3.0)

+ BERT Fine-tuning 28.3(-3.5) 28.3(-3.6) 56.7(-3.4)

+ Dictionary 26.4(-1.9) 25.9(-2.4) 24.9(-1.8)



Qun Liu

Background







Correction




VideoDemonstrations

References

Content








Qun Liu

Background







Correction




VideoDemonstrations

References

Motivation

• ASR outputs always contain errors which severely harms the quality of

the downstream MT system.

• Ideally, fine-tuning the MT system with the pairs of noisy ASR outputs

and their translations will benefit.

• However, there is very little parallel data with ASR-output in the source

side.

• Our solution:

• Using ASR training data (SPEECH, TEXT) and an existing ASR system

to train a GPT-2 system to generate ASR-like texts from normal texts.• Using the obtained system to transfer the source text of a bilingual

courpus to ASR-like text.



Qun Liu

Background







Correction




VideoDemonstrations

References

Robust MT Training



Qun Liu

Background







Correction




VideoDemonstrations

References

Noisy Data Filtering

• The noisy data generated by the GPT-2 system may be TOO noisy.

• We use an edit rate threshold on pronunciations to filter the GPT-2outputs.

• Sentence→ pronunciation using cmudict1.• Edit Rate: Edit distance normalized by the original length.

1http://www.speech.cs.cmu.edu/cgi-bin/cmudict

http://www.speech.cs.cmu.edu/cgi-bin/cmudict



Qun Liu

Background







Correction




VideoDemonstrations

References

Experiments

• Data: EN→ZH, Huawei STW meeting dataset and MSLT v1.1 2

• Models:

• transformer-small, emb size:256, intermediate size:1024, attention

heads:4, training data: 2M pairs• transformer-big, emb size:1024, intermediate size:4096, attention

heads:16, training data: 1.4B pairs

2https://github.com/MicrosoftTranslator/MSLT-Corpus

https://github.com/MicrosoftTranslator/MSLT-Corpus



Qun Liu

Background




VideoDemonstrations

References

Content

1 Background







Qun Liu

Background




VideoDemonstrations

References

System Architecture



Qun Liu

Background




VideoDemonstrations

References

Adapting ASR from Cloud to Mobile Devices



Qun Liu

Background




VideoDemonstrations

References

Conv-Transformer Transducer

• Streamble: unidirectional Transformer for audio encoding



Qun Liu

Background




VideoDemonstrations

References


• Low Frame Rate: interleaved convolutions for gradually downsampling



Qun Liu

Background




VideoDemonstrations

References


• Reduced Computation Cost: self-attention with fixed context window



Qun Liu

Background




VideoDemonstrations

References


• Reduced Computation Cost: hidden state reuse with relative positional

encoding



Qun Liu

Background




VideoDemonstrations

References

Adapting NMT from Cloud to Mobile Devices



Qun Liu

Background




VideoDemonstrations

References

Model Simplication on Model Devices

• Use RNN layers instead of masked multi-head self-attention layers;

• Decrease number of decoder model stacks, from N layers on cloud, to

M layers on devides;

• Share model parameters for different languages;

• Use smaller vocab size, from x to y .



Qun Liu

Background




VideoDemonstrations

References

Content

1 Background







Qun Liu

Background




VideoDemonstrations

References

Video Demonstrations



Qun Liu

Background




VideoDemonstrations

References

References I

[1] Yun Chen, Liangyou Li, Xin Jiang, Xiao Chen, and Qun Liu. How to do simultaneous translation better with consecutive neuralmachine translation?, 2019.

[2] Yun Chen, Yang Liu, Guanhua Chen, Xin Jiang, and Qun Liu. Accurate word alignment induction from neural machine translation,2020.

[3] Shuoyang Ding, Hainan Xu, and Philipp Koehn. Saliency-driven word alignment interpretation for neural machine translation. InProceedings of WMT 2019), pages 1–12, Florence, Italy, August 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-5201. URL https://www.aclweb.org/anthology/W19-5201.

[4] James Ferguson, Greg Durrett, and Dan Klein. Disfluency detection with a semi-markov model and prosodic features. InProceedings of the 2015 Conference of NAACL-HLT 2015, pages 257–262, 2015.

[5] Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. Jointly learning to align and translate with transformermodels. In Proceedings of EMNLP-IJCNLP 2019, pages 4453–4462, Hong Kong, China, November 2019. Association forComputational Linguistics. doi: 10.18653/v1/D19-1453. URL https://www.aclweb.org/anthology/D19-1453.

[6] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search. arXiv preprintarXiv:1704.07138, 2017.

[7] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for naturallanguage understanding. In proceedings of ICLR 2020, 2020.

[8] Xintong Li, Guanlin Li, Lemao Liu, Max Meng, and Shuming Shi. On the word alignment from neural machine translation. InProceedings of ACL 2019, pages 1293–1303, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-1124. URL https://www.aclweb.org/anthology/P19-1124.

https://www.aclweb.org/anthology/W19-5201

https://www.aclweb.org/anthology/D19-1453

https://www.aclweb.org/anthology/P19-1124



Qun Liu

Background




VideoDemonstrations

References

References II

[9] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. arXivpreprint arXiv:1804.06609, 2018.

[10] Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu. Transition-based disfluency detection using lstms. InProceedings of EMNLP 2017, pages 2785–2794, 2017.

[11] Shaolei Wang, Wanxiang Che, Qi Liu, Pengda Qin, Ting Liu, and William Yang Wang. Multi-task self-supervised learning fordisfluency detection. In Proceedings of AAAI 2020, 2020.

[12] Shuangzhi Wu, Dongdong Zhang, Ming Zhou, and Tiejun Zhao. Efficient disfluency detection with transition-based parsing. InProceedings of ACL-IJCNLP 2015, pages 495–503, 2015.

[13] Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Ya Li, et al. Distilling knowledge from an ensemble of models for punctuation prediction. InProceedings of Interspeech 2017, 2017.

[14] Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. Disfluency detection using a bidirectional lstm. arXiv preprintarXiv:1604.03209, 2016.

[15] Thomas Zenkel, Joern Wuebker, and John DeNero. End-to-end neural word alignment outperforms giza++, 2020.



Qun Liu

Background




VideoDemonstrations

References

Thank for listening!

Team members:

CHEN Xiao, CHEN Yun, CUI Tong, DENG Yao, FU Xu, ZHANG Guchun,

GUO Weisheng, HU Wenchao, HUANG Wenyong, JIANG Xin, LI Liangyou,

LIN Fuguo, LIN Xia, LIU Jianfeng, LIU Kai, LIU Qun, LIU Xin, PENG Wei,

WANG Minghan, XIAO Jinghui, XIA Hairong, YANG Hao, ZHOU Huan.

Documents

Research and Practice of Simultaneous Speech Translation in Huawei … · 2021. 1. 29. · Simultaneous Speech Translation in Huawei Noah’s Ark Lab Qun Liu Background A General