
CS6101 Deep Learning for NLP

Recess Week: Machine Translation, Seq2Seq and Attention

Presenters: Yesha, Gary, Vikash, Nick, Hoang


Outline

• Introduction - the evolution of machine translation systems (Yesha)

• Statistical machine translation (Gary)

• Neural machine translation and sequence-to-sequence architecture (Vikash)

• Attention (Nick)

• Other applications of seq2seq and attention (Hoang)

Evolution of Machine Translation

Early Experiments (1920s to 1950s)

● The first known machine translation proposal was made in 1924 in Estonia and involved a typewriter-translator

● In 1933, French-Armenian inventor Georges Artsrouni received a patent for a mechanical machine translator that looked like a typewriter


● From then to 1946, numerous proposals were made.

● Russian scholar Peter Petrovich Troyanskii got a patent for proposing a mechanised dictionary

● In 1946, Booth and Richens in Britain did a demo on automated dictionary

● In 1949, Warren Weaver wrote a memorandum that first mentioned the possibility of using digital computers to translate documents between natural human languages

Evolution of Machine Translation

Birth of Rule-Based Translation (1954 to 1966)

● In 1954, IBM and Georgetown University conducted a public demo
  ○ Direct translation systems for pairs of languages
  ○ Example: Russian-English systems for the US Atomic Energy Commission

● Direct translation - limited scope in terms of languages and grammar rules used


Evolution of Machine Translation

Diversification of Strategies (1966 to 1975)

● Research continues - better understanding of linguistics and indirect approaches to system design

● Different types of rule-based translation systems:
  ○ Interlingual MT
  ○ Transfer MT

Evolution of Machine Translation

Modern MT (1975 to present)

● Statistical Machine Translation
● Example-Based Machine Translation
● Neural Machine Translation

Statistical Machine Translation

Basic Approach

Language model

Translation model

General Formulation
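The formulation itself did not survive the text export of this slide; SMT is standardly framed as a noisy-channel model, which is what the P(F|E) / P(E) labels in the example below refer to:

$$ \hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(F \mid E)\, P(E) $$

where P(F|E) is the translation model ("does E explain the French sentence F?") and P(E) is the language model ("is E fluent English?").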

E.g. On voit Jon à la télévision (French)

Candidate English translations, judged on "good match to French?" P(F|E) and "good English?" P(E):

● Jon appeared in TV. ✓
● It back twelve saw.
● In Jon appeared TV. ✓
● Jon is happy today. ✓
● Jon appeared on TV. ✓ ✓
● TV appeared on Jon. ✓
● Jon was not happy. ✓

Overview of SMT (from http://courses.engr.illinois.edu/cs447; lecture 22 slides)

Language model: estimated via the chain rule, a Markov assumption, and Maximum Likelihood Estimation.
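The corresponding equations are not preserved in this text; in the standard n-gram form they are:

$$ P(E) = P(e_1, \ldots, e_m) = \prod_{i=1}^{m} P(e_i \mid e_1, \ldots, e_{i-1}) \quad \text{(chain rule)} $$

$$ P(e_i \mid e_1, \ldots, e_{i-1}) \approx P(e_i \mid e_{i-n+1}, \ldots, e_{i-1}) \quad \text{(Markov assumption)} $$

$$ P(e_i \mid e_{i-n+1}, \ldots, e_{i-1}) = \frac{\mathrm{count}(e_{i-n+1}, \ldots, e_i)}{\mathrm{count}(e_{i-n+1}, \ldots, e_{i-1})} \quad \text{(Maximum Likelihood Estimation)} $$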

Lexical mapping between languages

● Learns the lexical mapping between languages + ordering

● Typically focuses on word-level or phrase-level mappings (insufficient data to learn entire-sentence mappings directly)

● Chicken-and-egg problem:
  ○ Given a translation table, we can easily extract the mappings
  ○ Given the mappings, we can easily estimate the probabilities

Translation probability table (columns: English, Norwegian, Probability)
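The table's entries did not survive the export; purely as an illustration of the data structure, a lexical translation table t(f | e) can be held as a nested dictionary (all words and probabilities below are invented):

```python
# Hypothetical lexical translation table t(f | e): for each English word e,
# a probability distribution over Norwegian words f. Values are invented.
t_table = {
    "house": {"hus": 0.8, "bolig": 0.2},
    "the":   {"det": 0.4, "den": 0.4, "de": 0.2},
}

def translation_prob(f_word: str, e_word: str) -> float:
    """Look up t(f_word | e_word); unseen pairs get probability 0.0."""
    return t_table.get(e_word, {}).get(f_word, 0.0)

print(translation_prob("hus", "house"))  # 0.8
```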

Word Alignments

● 1-1 mapping, 1-many mapping, many-1 mapping, many-many mapping

● Spurious words: aligning to position 0 (i = 0 or j = 0) means alignment to NULL

IBM Translation Models

IBM Translation Model Example

● Produces 3 (a source word with fertility 3 yields three target words)

● Produces 0 (a source word with fertility 0 yields no target words)

● Inserts (spurious target words are inserted from NULL)

● Distorts (the produced words are reordered)

IBM SMT Translation Models

● Models are typically built progressively, starting from the simplest to the most complex

○ IBM1 – lexical translation only
○ IBM2 – lexicon plus absolute position
○ HMM – lexicon plus relative position
○ IBM3 – plus fertilities
○ IBM4 – inverted relative position alignment
○ IBM5 – non-deficient version of model 4

IBM1 Translation Model
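The worked example from the slide is not reproduced here; for reference, the standard IBM Model 1 likelihood of a foreign sentence f (length l_f) with alignment a, given an English sentence e (length l_e, plus NULL), is:

$$ P(f, a \mid e) = \frac{\epsilon}{(l_e + 1)^{l_f}} \prod_{j=1}^{l_f} t\big(f_j \mid e_{a(j)}\big) $$

i.e. every alignment is equally likely and only the lexical translation probabilities t matter.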

Training Translation Models

● How can t, n, d and p1 be estimated?

● Assume the corpus is already sentence-aligned

● Use the Expectation-Maximization (EM) method to iteratively infer the parameters:
  ○ Start with uniform translation probabilities
  ○ Iterate until convergence:
    ▪ Expectation step
    ▪ Maximization step
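To make the loop concrete, here is a minimal Python sketch of EM for the simplest case, IBM Model 1, estimating only the lexical table t (the fertility, distortion and insertion parameters n, d, p1 of the higher models are left out); the toy sentence-aligned corpus is invented for illustration:

```python
from collections import defaultdict

# Toy sentence-aligned corpus, invented for illustration: (foreign, english) pairs.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Start with uniform translation probabilities t(f | e).
t = {f: {e: 1.0 / len(f_vocab) for e in e_vocab} for f in f_vocab}

for _ in range(10):  # iterate until convergence (fixed number of rounds here)
    count = defaultdict(float)  # expected counts c(f, e)
    total = defaultdict(float)  # expected counts c(e)
    # Expectation step: collect expected alignment counts under the current t.
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[f][e] for e in es)
            for e in es:
                delta = t[f][e] / norm
                count[(f, e)] += delta
                total[e] += delta
    # Maximization step: re-estimate t(f | e) from the expected counts.
    for (f, e), c in count.items():
        t[f][e] = c / total[e]

print(round(t["haus"]["house"], 3))  # converges towards 1.0 on this toy corpus
```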

Decoder – Search problem

● Finds a list of possible target translation candidates with the highest probabilities

● Exhaustive search is infeasible

● Typically, search heuristics are used instead, e.g. beam search or greedy decoding

Other things not discussed

● Phrase-based alignment

● Sentence alignment

● Syntax-based or grammar-based models

● Hierarchical phrase-based translation

● …

Shortcomings of SMT

● Models can grow to be overly complex

● Requires a lot of hand-designed feature engineering in order to be effective

● Difficult to manage out-of-vocabulary words

● Components are independently trained; no synergy between models vs. end-to-end training

● Memory-intensive, as the decoder relies on huge lookup tables → not mobile-tech friendly

➔ Let's look into how Neural Machine Translation (NMT) models solve the above issues

Neural machine translation and sequence-to-sequence architecture

What is Neural Machine Translation?

● Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network.

● The neural network architecture is called sequence-to-sequence (aka seq2seq) and it involves two RNNs.

The sequence-to-sequence model

Simplest Model

Model Extensions

6. Use more complex hidden units: GRU or LSTM.
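To ground the architecture (not taken from the slides themselves), here is a minimal PyTorch-style encoder-decoder sketch using LSTM units as in extension 6; vocabulary sizes, dimensions, and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder's final state conditions the decoder."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source; keep only the final (hidden, cell) state.
        _, enc_state = self.encoder(self.src_emb(src_ids))
        # Decode the (teacher-forced) target conditioned on that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Toy usage with random token ids.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # corresponding target prefixes, length 5
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```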

Greedy Decoding

● Target sentence is generated by taking argmax on each step of the decoder.
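A minimal sketch of that greedy loop, assuming a hypothetical step_log_probs(prefix) function that returns one log-probability per target-vocabulary token for the next position (this interface is an assumption, not from the slides):

```python
def greedy_decode(step_log_probs, bos_id, eos_id, max_len=50):
    """Pick the argmax token at every step until EOS or max_len.

    step_log_probs(prefix) is assumed to return a list of log-probabilities,
    one per target-vocabulary entry, for the next token given the prefix.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        scores = step_log_probs(prefix)
        next_id = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]  # drop BOS
```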

Greedy Decoding Problems

Beam Search Decoding
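The slide's illustration is not preserved in this text; under the same assumed step_log_probs interface as above, beam search keeps the k best partial hypotheses at every step instead of committing to a single argmax:

```python
def beam_search_decode(step_log_probs, bos_id, eos_id, beam_size=4, max_len=50):
    """Keep the beam_size best partial hypotheses, ranked by total log-probability."""
    beams = [([bos_id], 0.0)]   # live hypotheses: (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            scores = step_log_probs(prefix)
            # Expand each live hypothesis with its beam_size best next tokens.
            top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:beam_size]
            for tok in top:
                candidates.append((prefix + [tok], score + scores[tok]))
        # Prune back to the beam_size best candidates overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:   # every surviving hypothesis has ended with EOS
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0][1:]  # drop BOS
```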

Multilingual NMT

Google’s Multilingual NMT

Google’s Multilingual NMT Architecture

Four big wins of NMT

So is Machine Translation solved?

Zero Shot Word Prediction

Predicting Unseen Words

Pointer Details

Attention please!

Seq2Seq: the bottleneck problem

● The encoder has to compress the entire source sentence into a single fixed-size vector, which becomes an information bottleneck for long inputs.

Attention: thinking process

● How to solve the bottleneck problem?

● How do we deal with a variable-length input sequence?

● Let's do a weighted sum of all encoder hidden states! → the context vector

● Step 1: dot product

● Step 2: softmax

● Q1: Why dot product?
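Putting those two steps plus the weighted sum together, here is a minimal NumPy sketch of the attention computation at a single decoder step (shapes and names are illustrative, not from the slides):

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (src_len, d) -> context vector (d,)."""
    # Step 1: dot product of the decoder state with every encoder hidden state.
    scores = encoder_states @ decoder_state            # (src_len,)
    # Step 2: softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of all encoder hidden states -> the context vector.
    context = weights @ encoder_states                 # (d,)
    return context, weights

# Toy usage: a source sentence of 6 positions with hidden size 4.
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))
dec = rng.normal(size=(4,))
ctx, attn = dot_product_attention(dec, enc)
print(attn.round(2), ctx.shape)
```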

Seq2Seq with Attention

Q3: How does decoder state participate in the attention mechanism?

Attention: in equations
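The equations themselves are not included in this text export; the standard formulation matching the steps above, with encoder hidden states h_1, ..., h_N and decoder state s_t, is:

$$ e_{t,i} = s_t^{\top} h_i, \qquad \alpha_t = \mathrm{softmax}(e_t), \qquad c_t = \sum_{i=1}^{N} \alpha_{t,i}\, h_i $$

The context vector c_t is then concatenated with the decoder state, [c_t; s_t], before predicting the next word. This also answers Q3: the decoder state s_t supplies one side of the dot product at every step.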


Attention is great


Seq2Seq Applications

● Summarization
● Dialogue systems
● Speech recognition
● Image captioning

Seq2seq and attention applications

“Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document.”

Text summarization

volume of transactions at the nigerian stock exchange has continued its decline since last week , a nse official said thursday . the latest statistics showed that a total of ##.### million shares valued at ###.### million naira -lrb- about #.### million us dollars -rrb- were traded on wednesday in #,### deals.

transactions dip at nigerian stock exchange

Allahyari, Mehdi, et al. "Text summarization techniques: a brief survey."

Extractive Topic Representation Approaches:

● Latent Semantic Analysis

● Bayesian Topic Models
  ○ Latent Dirichlet Allocation

Text summarization (before seq2seq)

Their drawbacks:

● They consider the sentences as independent of each other

● Many of these previous text summarization techniques do not consider the semantics of words.

● The soundness and readability of generated summaries are not satisfactory.

Text summarization (before seq2seq)

Nallapati, Ramesh, et al. "Abstractive text summarization using sequence-to-sequence rnns and beyond."

Abstractive Summary - not a mere selection of a few existing passages or sentences extracted from the source

Text summarization (seq2seq)

Neural Machine Translation vs. Abstractive Text Summarization:

● NMT: output depends on source length | Summarization: output is short

● NMT: retains the original content | Summarization: compresses the original content

● NMT: strong notion of one-to-one word-level alignment | Summarization: less obvious notion of such one-to-one alignment

Text summarization (seq2seq) – feature-rich encoder:

● POS, NER, TF, IDF

Text summarization (seq2seq) – switch generator/pointer for OOV

● Switch is on → produces a word from its target vocabulary

● Switch is off → generates a pointer to a word position in the source, which is copied into the summary
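A toy sketch of that decision (not the paper's exact mechanism: in the model the switch is a learned sigmoid trained jointly with the rest of the network; the hard 0.5 threshold and all names below are illustrative assumptions):

```python
import numpy as np

def switch_generate_or_point(switch_prob, vocab_logits, attn_weights, src_tokens):
    """Either generate from the target vocabulary or copy a source token.

    switch_prob:  probability that the switch is 'on' (generate).
    vocab_logits: scores over the target vocabulary.
    attn_weights: attention weights over the source positions.
    src_tokens:   the source sentence as a list of tokens.
    """
    if switch_prob >= 0.5:                      # switch on -> generate from vocabulary
        return "generate", int(np.argmax(vocab_logits))
    pointed = int(np.argmax(attn_weights))      # switch off -> point into the source
    return "copy", src_tokens[pointed]

# Toy usage with invented values.
print(switch_generate_or_point(0.2, np.array([0.1, 2.0]), np.array([0.7, 0.2, 0.1]),
                               ["nigerian", "stock", "exchange"]))  # ('copy', 'nigerian')
```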

Text summarization (seq2seq) – identify the key sentences:

● 2 levels of importance → 2 BiRNNs

● Attention mechanisms operate at both levels simultaneously.

● Word-level attention is reweighted:
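The reweighting formula did not survive the export; in Nallapati et al. the word-level attention at position j is rescaled by the sentence-level attention of the sentence s(j) containing it, roughly:

$$ \alpha(j) = \frac{\alpha_w(j)\, \alpha_s(s(j))}{\sum_{k} \alpha_w(k)\, \alpha_s(s(k))} $$

so that words in unimportant sentences receive little attention even if they score well at the word level.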

“Dialogue systems, also known as interactive conversational agents, virtual agents and sometimes chatterbots, are used in a wide set of applications ranging from technical support services to language learning tools and entertainment.”

Dialogue systems

Dialogue systems (before seq2seq)

Young, Steve, et al. "POMDP-based statistical spoken dialog systems: A review."

Goal-driven system:

● At each turn t, the SLU (spoken language understanding) component converts the input to a semantic representation u_t

● The system updates its internal state and determines the next action a_t, which is then converted to output speech

Their drawbacks:

● Use handcrafted features for the state and action space representations

● Require large corpora of annotated, task-specific simulated conversations
  ○ → Time-consuming to deploy

Serban, Iulian Vlad, et al. "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."

● End-to-end trainable, non-goal-driven systems based on generative probabilistic models.

Dialogue systems (seq2seq)

Dialogue systems (seq2seq) – Hierarchical Recurrent Encoder-Decoder

● The encoder maps each utterance (via its last token) to an utterance vector u_n

● A higher-level context RNN tracks the dialogue: c_n = f(c_{n-1}, u_n)
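A minimal PyTorch-style sketch of that two-level idea (dimensions and module choices are illustrative assumptions, not the exact configuration from the paper):

```python
import torch
import torch.nn as nn

# Utterance-level encoder: its final hidden state is the utterance vector u_n.
utt_encoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
# Higher-level context RNN: c_n = f(c_{n-1}, u_n).
context_rnn = nn.GRUCell(input_size=128, hidden_size=128)

# Two toy utterances as sequences of (already embedded) tokens.
dialogue = [torch.randn(1, 5, 64), torch.randn(1, 8, 64)]
c = torch.zeros(1, 128)        # initial context state c_0
for utt in dialogue:
    _, h = utt_encoder(utt)    # h[-1] is the utterance vector u_n
    c = context_rnn(h[-1], c)  # context update: c_n = f(c_{n-1}, u_n)
print(c.shape)  # torch.Size([1, 128])
```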

From RNN to Speech Recognition:

● RNN requires pre-segmented input data

● RNNs can only be trained to make a series of independent label classifications

Speech recognition

Prabhavalkar, Rohit, et al. "A comparison of sequence-to-sequence models for speech recognition."

Image captioning

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention."

Location variable s_t: where to put attention when generating the t-th word

http://kelvinxu.github.io/projects/capgen.html

Image captioning


Thank You!
