Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machine Translation
by Shixiang Lu, Zhenbiao Chen, Bo Xu
Presented By V B Wickramasinghe (148245F)
Overview
● Introduction
● Input features for DNN feature learning
● Semi-supervised deep auto-encoder features learning for SMT
● Experiments and Results
● Conclusion
Introduction
● The paper describes a novel approach to statistical machine translation (SMT).
● It uses two deep neural network architectures:
  ○ Deep belief networks (DBN)
  ○ Deep auto-encoders (DAE)
● The goal is to extract useful language features automatically using DAEs instead of engineering them manually.
● The approach achieves statistically significant improvements over unsupervised DBN features and the baseline features.
Input features for DNN feature learning
● Uses a phrase-based translation model.
● Four phrase features are used as the baseline: with f as the source phrase and e as the target phrase, these are the bidirectional phrase translation probabilities and the bidirectional lexical weights.
Other features:
● Bidirectional phrase pair similarity.
● Bidirectional phrase generative probability.
Input features for DNN feature learning
● Phrase frequency.
● Phrase length.
In total there are 16 input features, represented by 16 input nodes in the DAE.
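Some of these input features can be computed directly from phrase-pair counts. The sketch below is illustrative only: the toy counts, phrase strings, and function names are assumptions, not from the paper; only the relative-frequency estimate P(e|f) = count(f, e) / count(f) and the phrase-length features follow standard phrase-based SMT practice.

```python
from collections import Counter

# Toy phrase-pair counts; all names and numbers here are illustrative,
# not taken from the paper's corpora.
pair_counts = Counter({("ni hao", "hello"): 8, ("ni hao", "hi"): 2})
src_counts = Counter({"ni hao": 10})

def phrase_translation_prob(f, e):
    # P(e|f) = count(f, e) / count(f): the standard relative-frequency
    # estimate used for the baseline phrase translation probability.
    return pair_counts[(f, e)] / src_counts[f]

def phrase_length_feature(f, e):
    # Source/target phrase lengths in words, usable as simple input features.
    return (len(f.split()), len(e.split()))

print(phrase_translation_prob("ni hao", "hello"))  # 0.8
print(phrase_length_feature("ni hao", "hello"))    # (2, 1)
```

The reverse direction P(f|e) and the lexical weights would be computed analogously from target-side and word-alignment counts.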
Semi-supervised deep auto-encoder features learning for SMT
● The introduced set of features (X) is fed to a stack of RBMs.
● Combined, these RBMs form a DBN.
● The RBMs are pretrained layer-wise to learn deep higher-order correlations between the input features.
● The DBN is then unrolled to form a DAE.
● The DAE is fine-tuned using back-propagation.
● The final step is to stack a number of these trained DAEs to form a 16-32-32-32-16-16-8 architecture after tuning.
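The pretrain-then-unroll pipeline above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: the RBMs use one-step contrastive divergence (CD-1), the layer sizes follow the slide's 16-32-32-32-16-16-8 encoder, the input data is random, and the back-propagation fine-tuning step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase, one Gibbs step, negative phase.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        # CD-1 gradient approximations.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

X = rng.random((100, 16))            # stand-in for the 16 input features
sizes = [16, 32, 32, 32, 16, 16, 8]  # encoder layer sizes from the slide

# 1) Greedy layer-wise pretraining: each RBM is trained on the hidden
#    activations of the one below it, and the stack forms a DBN.
rbms, data = [], X
for n_v, n_h in zip(sizes, sizes[1:]):
    rbm = RBM(n_v, n_h)
    for _ in range(50):
        rbm.cd1_step(data)
    rbms.append(rbm)
    data = rbm.hidden_probs(data)

# 2) "Unrolling": the DBN weights initialize an encoder and their
#    transposes a mirrored decoder, yielding a deep auto-encoder
#    (back-propagation fine-tuning is omitted here for brevity).
def encode(x):
    for rbm in rbms:
        x = sigmoid(x @ rbm.W + rbm.b_h)
    return x

def decode(code):
    for rbm in reversed(rbms):
        code = sigmoid(code @ rbm.W.T + rbm.b_v)
    return code

codes = encode(X)    # 8-dimensional learned features
recon = decode(codes)
print(codes.shape, recon.shape)  # (100, 8) (100, 16)
```

In the paper the 8-dimensional codes would then serve as additional features for the SMT log-linear model, alongside the original 16 inputs.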
Experiments & Results
● Experimental Setup
  ○ IWSLT: the bilingual corpus is the Chinese-English part of the Basic Traveling Expression Corpus (BTEC) and the China-Japan-Korea (CJK) corpus (0.38M sentence pairs with 3.5M/3.8M Chinese/English words).
  ○ NIST: the bilingual corpus is LDC (3.4M sentence pairs with 64M/70M Chinese/English words). The LM corpus is the English side of the parallel data plus the English Gigaword corpus (LDC2007T07) (11.3M sentences).
Thank you