
Page 1:

CS6101 Deep Learning for NLP

Recess Week Machine Translation, Seq2Seq and Attention

Presenters: Yesha, Gary, Vikash, Nick, Hoang

Page 2:

Outline

• Introduction - the evolution of machine translation systems (Yesha)

• Statistical machine translation (Gary)

• Neural machine translation and sequence-to-sequence architecture (Vikash)

• Attention (Nick)

• Other applications of seq2seq and attention (Hoang)

Page 3:

Evolution of Machine Translation

Early Experiments: 1920s to 1950s

● The first known machine translation proposal was made in 1924 in Estonia and involved a typewriter-translator

● In 1933, French-Armenian inventor Georges Artsrouni received a patent for a mechanical machine translator that looked like a typewriter

Page 4:

Evolution of Machine Translation

Early Experiments: 1920s to 1950s

● From then until 1946, numerous proposals were made

● Russian scholar Peter Petrovich Troyanskii received a patent for a mechanised dictionary

● In 1946, Booth and Richens in Britain demonstrated an automated dictionary

● In 1949, Warren Weaver wrote a memorandum that first raised the possibility of using digital computers to translate between natural human languages

Page 5:

Evolution of Machine Translation

Birth of Rule-Based Translation: 1954 to 1966

● In 1954, IBM and Georgetown University conducted a public demo
○ Direct translation systems for pairs of languages
○ Example: Russian-English systems for the US Atomic Energy Commission

● Direct translation - limited scope in terms of languages and grammar rules used

Page 6:

Evolution of Machine Translation

Diversification of Strategies: 1966 to 1975

● Research continues - better understanding of linguistics and indirect approaches to system design

● Different types of rule-based translation systems:
○ Interlingual MT

Page 7:

Evolution of Machine Translation

Diversification of Strategies: 1966 to 1975

● Different types of rule-based translation systems:
○ Transfer MT

Page 8:

Evolution of Machine Translation

Modern MT: 1975 to present

● Statistical Machine Translation
● Example-Based Machine Translation
● Neural Machine Translation

Page 9:

Statistical Machine Translation

Page 10:

Basic Approach

Language model

Translation model


Page 13:

General Formulation
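The formula on this slide was shown as an image; a standard reconstruction of the noisy-channel formulation, consistent with the P(F|E) / P(E) example a few slides below:

$$ \hat{E} \;=\; \arg\max_{E} P(E \mid F) \;=\; \arg\max_{E} \; \underbrace{P(F \mid E)}_{\text{translation model}} \; \underbrace{P(E)}_{\text{language model}} $$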


Page 16:

E.g. On voit Jon à la télévision (French)

Candidate English        good match to French? P(F|E)    good English? P(E)
Jon appeared in TV.      ✓
It back twelve saw.
In Jon appeared TV.      ✓
Jon is happy today.                                      ✓
Jon appeared on TV.      ✓                               ✓
TV appeared on Jon.      ✓
Jon was not happy.                                       ✓

Page 17:

Overview of SMT (from http://courses.engr.illinois.edu/cs447; lecture 22 slides)

Page 18:

via chain rule

Markov assumption

Maximum Likelihood Estimation
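The equations these labels annotated were images; a standard reconstruction for the language model P(E), shown for the bigram case in the MLE step:

$$ P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \quad \text{(chain rule)} $$

$$ P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-k}, \ldots, w_{i-1}) \quad \text{(Markov assumption)} $$

$$ \hat{P}(w_i \mid w_{i-1}) \;=\; \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} \quad \text{(maximum likelihood estimation)} $$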

Page 19:

● Learns the lexical mapping between languages + ordering

● Typically focuses on word-level or phrase-level mappings (insufficient data to learn entire-sentence mappings directly)

● Chicken-and-egg problem:

○ Given a translation table, we can easily extract the mappings

○ Given the mappings, we can easily estimate the probabilities

(Figure: translation probability table with columns English, Norwegian, Probability - the lexical mapping between languages)


Page 22:

Word Alignments

1-1 mapping, 1-many mapping, many-1 mapping, many-many mapping

Spurious words

j = 0 → means NULL (i = 0)

Page 26:

IBM Translation Models

Page 27:

IBM Translation Model Example

Produces 3 (a source word with fertility 3 generates three target words)

Produces 0 (fertility 0: the word generates nothing)

Inserts (spurious target words are inserted from NULL)

Distorts (the generated words are reordered)

Page 31:

IBM SMT Translation Models

● Models are typically built progressively, starting from the simplest to the most complex

○ IBM1 – lexical translation only
○ IBM2 – lexicon plus absolute position
○ HMM – lexicon plus relative position
○ IBM3 – plus fertilities
○ IBM4 – inverted relative position alignment
○ IBM5 – non-deficient version of model 4

Page 32:

IBM1 Translation Model

e.g.
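The equation and worked example on this slide were images; the standard IBM Model 1 formulation, with source sentence e = (e_1, ..., e_l), target sentence f = (f_1, ..., f_m) and alignment a:

$$ P(f, a \mid e) \;=\; \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t\big(f_j \mid e_{a_j}\big) $$

where t(f | e) is the lexical translation probability and each target position j aligns to one source position a_j ∈ {0, 1, ..., l}, with 0 denoting NULL.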


Page 34:

Training Translation Models

● How can t, n, d and p1 be estimated?

● Assume the corpus is already sentence-aligned

● Use Expectation-Maximization (EM) method to iteratively infer the parameters

Start with uniform translation probabilities

Iterate until convergence:

Expectation step

Maximization step
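As a concrete illustration of this loop, here is a minimal sketch of EM for the IBM Model 1 lexical probabilities t(f|e) (our code, not the deck's; the toy corpus is an assumption, and NULL alignment is omitted for brevity):

```python
from collections import defaultdict

# Toy sentence-aligned corpus: (source tokens, target tokens).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# Start with uniform translation probabilities t(f|e).
src_vocab = {e for src, _ in corpus for e in src}
t = defaultdict(lambda: 1.0 / len(src_vocab))

for _ in range(10):  # iterate until convergence (fixed iteration count here)
    count = defaultdict(float)  # expected counts c(e, f)
    total = defaultdict(float)  # expected counts c(e)
    # Expectation step: expected alignment counts under the current t.
    for src, tgt in corpus:
        for f in tgt:
            norm = sum(t[(e, f)] for e in src)
            for e in src:
                delta = t[(e, f)] / norm
                count[(e, f)] += delta
                total[e] += delta
    # Maximization step: re-estimate t(f|e) from the expected counts.
    for (e, f), c in count.items():
        t[(e, f)] = c / total[e]

print(round(t[("haus", "house")], 3))  # approaches 1.0 over iterations
```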


Page 36:

Decoder – Search problem

● Finds a list of possible target translation candidates, with the highest probabilities

● Exhaustive search is infeasible

● Typically, search heuristics are used instead, e.g. beam search, greedy decoding

Page 37:

Other things not discussed

● Phrase-based alignment

● Sentence alignment

● Syntax-based or grammar-based models

● Hierarchical phrase-based translation

● …

Page 38:

Shortcomings of SMT

● Models can grow to be overly complex

● Requires a lot of hand-designed feature engineering in order to be effective

● Difficult to manage out-of-vocabulary words

● Components are independently trained; no synergy between models vs. end-to-end training

● Memory-intensive, as the decoder relies on huge lookup tables → not mobile-tech friendly

➔ Let’s look into how Neural Machine Translation (NMT) models solve the above issues

Page 39:

Neural machine translation and sequence-to-sequence architecture

Page 40:

What is Neural Machine Translation?

● Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network.

● The neural network architecture is called sequence-to-sequence (aka seq2seq) and it involves two RNNs.

Page 41:

The sequence-to-sequence model

Page 42:

Simplest Model
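The architecture diagram on this slide was an image; below is a minimal sketch of the simplest encoder-decoder, assuming a PyTorch implementation (class and parameter names are our own):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Simplest encoder-decoder: the encoder's final hidden state is the
    single sentence vector that conditions the decoder."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source; keep only the final hidden state.
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode with teacher forcing: feed the gold target tokens.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_states)  # per-step logits over the target vocab

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1200, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1200])
```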

Page 43:

Model Extensions

6. Use more complex hidden units: GRU or LSTM

Page 48:

Greedy Decoding

● Target sentence is generated by taking the argmax at each step of the decoder
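A sketch of this procedure, reusing the hypothetical Seq2Seq class above (bos_id and eos_id are assumed special-token ids):

```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    """Pick the argmax token at every step; decisions are never revisited."""
    _, h = model.encoder(model.src_emb(src_ids))  # encode one source sentence
    token = torch.tensor([[bos_id]])
    output = []
    for _ in range(max_len):
        dec, h = model.decoder(model.tgt_emb(token), h)
        token = model.out(dec[:, -1]).argmax(dim=-1, keepdim=True)
        if token.item() == eos_id:
            break
        output.append(token.item())
    return output
```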

Page 49:

Greedy Decoding Problems

● Greedy decoding has no way to undo decisions: one bad argmax early on can derail the rest of the output

Page 50:

Beam Search Decoding
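The illustration here was an image. The idea: instead of committing to the single best token, keep the k most probable partial hypotheses at every step. A minimal sketch (function and parameter names are our assumptions; step_logprobs stands in for one decoder step):

```python
def beam_search(step_logprobs, bos_id, eos_id, beam_size=4, max_len=50):
    """step_logprobs(prefix) -> list of (token_id, logprob) for the next token."""
    beams = [([bos_id], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:  # prune to the top k
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses still open at max_len
    # Length-normalize so longer hypotheses are not unfairly penalized.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```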

Page 51:

Multilingual NMT

Page 52:

Google’s Multilingual NMT

Page 53:

Google’s Multilingual NMT Architecture

● A single shared encoder-decoder serves all language pairs; an artificial token at the start of the input sentence (e.g. <2es>) specifies the target language

Page 54:

Four big wins of NMT

Page 55:

So is Machine Translation solved?

Page 56:

Zero Shot Word Prediction

Page 57:

Predicting Unseen Words

Page 58:

Pointer Details
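The details on this slide were figures; schematically (a hedged sketch in the spirit of pointer-softmax models, not necessarily the slide's exact parameterization), a switch probability p_t interpolates between generating word w from the vocabulary softmax and copying source word x_i via the attention weights α_{t,i}:

$$ P(y_t = w) \;=\; p_t \, P_{\text{vocab}}(w) \;+\; (1 - p_t) \sum_{i \,:\, x_i = w} \alpha_{t,i} $$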

Page 59:

Attention please!

Page 60:

Seq2Seq: the bottleneck problem

● The encoder must compress the entire source sentence into a single fixed-size vector (its final hidden state); everything the decoder knows about the source must pass through this bottleneck

Page 63:

Attention: thinking process

How to solve the bottleneck problem?

How do we deal with variable-length input sequences?

Let’s do a weighted sum of all encoder hidden states! → context vector

Step 1: dot product

Step 2: softmax

Q1: Why dot product?
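To make the two steps concrete, a small sketch in plain numpy (shapes and names are our assumptions): score each encoder hidden state against the current decoder state with a dot product, softmax the scores into attention weights, and take the weighted sum as the context vector.

```python
import numpy as np

def dot_product_attention(dec_state, enc_states):
    """dec_state: (hid,); enc_states: (src_len, hid)."""
    scores = enc_states @ dec_state          # Step 1: dot products, shape (src_len,)
    weights = np.exp(scores - scores.max())  # Step 2: softmax (numerically stabilized)
    weights /= weights.sum()
    context = weights @ enc_states           # weighted sum of encoder hidden states
    return context, weights

context, weights = dot_product_attention(np.random.randn(4), np.random.randn(6, 4))
print(weights.round(2), context.shape)  # 6 weights summing to 1; context (4,)
```

This also answers the variable-length question above: the softmax produces one weight per source position, however many positions there are.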

Page 72:

Seq2Seq with Attention

Q3: How does decoder state participate in the attention mechanism?

Page 85:

Attention: in equations
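The equations on this slide were images; a reconstruction consistent with the dot-product and softmax steps described above, for encoder hidden states h_1, ..., h_N and decoder state s_t:

$$ e_t = [\, s_t^\top h_1, \; \ldots, \; s_t^\top h_N \,] \qquad \text{(attention scores)} $$

$$ \alpha_t = \mathrm{softmax}(e_t) \qquad \text{(attention distribution)} $$

$$ a_t = \sum_{i=1}^{N} \alpha_{t,i} \, h_i \qquad \text{(context vector)} $$

The context vector a_t is then combined (typically concatenated) with s_t to predict the next word.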

Page 86:

Attention is great

Page 87:

Seq2Seq Applications

Page 88:

Seq2seq and attention applications

● Summarization
● Dialogue systems
● Speech recognition
● Image captioning

Page 89:

Text summarization

“Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document.”

Example (input → headline):

volume of transactions at the nigerian stock exchange has continued its decline since last week , a nse official said thursday . the latest statistics showed that a total of ##.### million shares valued at ###.### million naira -lrb- about #.### million us dollars -rrb- were traded on wednesday in #,### deals.

→ transactions dip at nigerian stock exchange

Page 90:

Text summarization (before seq2seq)

Allahyari, Mehdi, et al. "Text summarization techniques: a brief survey."

Extractive topic representation approaches:

● Latent Semantic Analysis

● Bayesian Topic Models
○ Latent Dirichlet Allocation

Page 91:

Text summarization (before seq2seq)

Their drawbacks:

● They consider the sentences as independent of each other

● Many of these previous text summarization techniques do not consider the semantics of words

● The soundness and readability of generated summaries are not satisfactory

Page 92:

Text summarization (seq2seq)

Nallapati, Ramesh, et al. "Abstractive text summarization using sequence-to-sequence RNNs and beyond."

Abstractive summary - not a mere selection of a few existing passages or sentences extracted from the source

Neural Machine Translation                     Abstractive Text Summarization
Output depends on source length                Output is short
Retains the original content                   Compresses the original content
Strong notion of one-to-one                    Less obvious notion of such
word-level alignment                           one-to-one alignment

Page 93:

Text summarization (seq2seq)

Feature-rich encoder:

● POS, NER, TF, IDF

Page 94:

Text summarization (seq2seq)

Switch generator/pointer for OOV words:

● Switch is on → produces a word from its target vocabulary

● Switch is off → generates a pointer to a word position in the source, which is copied into the summary

Page 95:

Text summarization (seq2seq)

Identify the key sentences:

● 2 levels of importance → 2 BiRNNs (word level and sentence level)

● Attention mechanisms operate at both levels simultaneously

● Word-level attention is re-weighted by the corresponding sentence-level attention
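The re-weighting formula on the slide was an image; in the paper this deck cites (Nallapati et al.), word attention is rescaled by the attention of the sentence containing the word and renormalized:

$$ \alpha_j \;=\; \frac{\alpha_j^{w}\, \alpha_{s(j)}^{s}}{\sum_{k} \alpha_k^{w}\, \alpha_{s(k)}^{s}} $$

where α^w is the word-level attention, α^s the sentence-level attention, and s(j) the sentence containing word j.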

Page 96:

Dialogue systems

“Dialogue systems, also known as interactive conversational agents, virtual agents and sometimes chatterbots, are used in a wide set of applications ranging from technical support services to language learning tools and entertainment.”

Page 97:

Dialogue systems (before seq2seq)

Young, Steve, et al. "POMDP-based statistical spoken dialog systems: A review."

Goal-driven system:

● At each turn t, the SLU (spoken language understanding) component converts the input to a semantic representation u_t

● The system updates its internal state and determines the next action a_t, which is then converted to output speech

Page 98:

Dialogue systems (before seq2seq)

Their drawbacks:

● Use handcrafted features for the state and action space representations

● Require a large corpus of annotated task-specific simulated conversations
○ → Time-consuming to deploy

Page 99:

Dialogue systems (seq2seq)

Serban, Iulian Vlad, et al. "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."

● End-to-end trainable, non-goal-driven systems based on generative probabilistic models

Page 100:

Dialogue systems (seq2seq)

Hierarchical Recurrent Encoder-Decoder (HRED)

● An utterance-level encoder maps each utterance (via the hidden state at its last token) to an utterance vector u_n

● A higher-level context RNN tracks the dialogue so far:
○ c_n = f(c_{n-1}, u_n)
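A minimal sketch of the two-level encoder (our illustration, with assumed names and sizes): a GRU encodes each utterance into a vector, and a context GRU consumes one utterance vector per turn.

```python
import torch
import torch.nn as nn

class HREDEncoder(nn.Module):
    """Hierarchical encoder: utterance RNN -> context RNN, c_n = f(c_{n-1}, u_n)."""
    def __init__(self, vocab, emb=128, utt_hid=256, ctx_hid=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.utt_rnn = nn.GRU(emb, utt_hid, batch_first=True)
        self.ctx_rnn = nn.GRU(utt_hid, ctx_hid, batch_first=True)

    def forward(self, dialogue):  # dialogue: (n_turns, turn_len) token ids
        # Each utterance's final hidden state is its utterance vector u_n.
        _, u = self.utt_rnn(self.emb(dialogue))  # u: (1, n_turns, utt_hid)
        # The context RNN reads the utterance vectors as a sequence of turns.
        contexts, _ = self.ctx_rnn(u)            # (1, n_turns, ctx_hid)
        return contexts  # context state c_n after each turn

enc = HREDEncoder(vocab=500)
print(enc(torch.randint(0, 500, (3, 12))).shape)  # torch.Size([1, 3, 256])
```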

Page 101:

Speech recognition

From RNN to Speech Recognition:

● RNN requires pre-segmented input data

● RNNs can only be trained to make a series of independent label classifications

Page 102:

Speech recognition

Prabhavalkar, Rohit, et al. "A comparison of sequence-to-sequence models for speech recognition."

Page 103:

Image captioning

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention."

Page 104:

Image captioning

Location variable s_t: where to put attention when generating the t-th word

http://kelvinxu.github.io/projects/capgen.html

Page 105:

Thank You!