
Page 1: Lecture 24: Attention - GitHub Pages

Harvard IACS CS109B
Pavlos Protopapas, Mark Glickman, and Chris Tanner

NLP Lectures: Part 3 of 4

Lecture 24: Attention

Page 2: Lecture 24: Attention - GitHub Pages

2

Outline

How to use embeddings

seq2seq

seq2seq + Attention

Transformers (preview)


Page 4: Lecture 24: Attention - GitHub Pages

Previously, we learned about word embeddings

4

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

(Diagram: millions of books → word2vec)

Page 5: Lecture 24: Attention - GitHub Pages

Previously, we learned about word embeddings

5

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

(Diagram: millions of books → word2vec → one word embedding per vocabulary type: aardvark, apple, before, …, zoo)

Page 6: Lecture 24: Attention - GitHub Pages

6

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

“The food was delicious. Amazing!” 4.8/5

Page 7: Lecture 24: Attention - GitHub Pages

7

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

“The food was delicious. Amazing!” 4.8/5

(Diagram: the + food + was + delicious + amazing → average embedding → feed-forward neural net → 4.8/5)
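To make this pipeline concrete, here is a minimal sketch of the averaging approach. The embedding table, dimensions, and regressor below are illustrative stand-ins, not the lecture's code; real systems would load word2vec or GloVe vectors instead of the toy random table.

```python
# Average pre-trained word embeddings, then regress a review score with a FFNN.
# `embeddings` is a toy stand-in for real word2vec/GloVe vectors.
import numpy as np
import torch
import torch.nn as nn

EMB_DIM = 300
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(EMB_DIM).astype(np.float32)
              for w in ["the", "food", "was", "delicious", "amazing"]}

def average_embedding(text):
    """Average the embeddings of in-vocabulary tokens; OOV tokens are dropped."""
    tokens = text.lower().replace(".", " ").replace("!", " ").split()
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:                                  # every token was OOV
        return np.zeros(EMB_DIM, dtype=np.float32)
    return np.mean(vecs, axis=0).astype(np.float32)

# Feed-forward net on top of the single averaged embedding.
regressor = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.from_numpy(average_embedding("The food was delicious. Amazing!"))
predicted_rating = regressor(x.unsqueeze(0))      # shape (1, 1); ~4.8 after training
```

Note that the OOV handling above simply drops unknown tokens, which is exactly the weakness discussed on the next slides.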

Page 8: Lecture 24: Attention - GitHub Pages

8

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

“Waste of money. Tasteless!” 2.4/5

(Diagram: waste + of + money + tasteless → average embedding → feed-forward neural net → 2.4/5)

Page 9: Lecture 24: Attention - GitHub Pages

9

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

“Daaang. What?! Supa Lit” 4.9/5

(Diagram: daaang + what + supa + lit → average embedding)

Strengths and weaknesses of word embeddings (type-based)?

Page 10: Lecture 24: Attention - GitHub Pages

10

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

(Diagram: daaang + what + supa + lit → average embedding)

Strengths:

• Leverages tons of existing data

• Don’t need to depend on our data to create embeddings

Page 11: Lecture 24: Attention - GitHub Pages

11

word embeddings (type-based)
approaches:
• count-based/DSMs (e.g., SVD, LSA)
• Predictive models (e.g., word2vec, GloVe)

(Diagram: daaang + what + supa + lit → average embedding)

Issues:

• Out-of-vocabulary (OOV) words

• Not tailored to this dataset

Page 12: Lecture 24: Attention - GitHub Pages

Previously, we learned about word embeddings

12

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

(Diagram: an LSTM unrolled over inputs x₁, x₂, x₃: input layer, hidden layer with weights W and U, output layer with weights V producing ŷ₁, ŷ₂, ŷ₃)

Page 13: Lecture 24: Attention - GitHub Pages

13

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #1
(Diagram: the LSTM reads “the food was delicious amazing” and predicts a rating of 4.8)

Page 14: Lecture 24: Attention - GitHub Pages

14

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #2
(Diagram: the LSTM reads “it was cold and tasteless” and predicts a rating of 2.5)

Page 15: Lecture 24: Attention - GitHub Pages

15

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: the LSTM reads “found a hair in the …” and predicts a rating of 1.0)

Page 16: Lecture 24: Attention - GitHub Pages

16

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: as above, “found a hair in the …” → 1.0)

Every token in the corpus has a contextualized embedding

Page 17: Lecture 24: Attention - GitHub Pages

17

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: as above, “found a hair in the …” → 1.0)

The hidden states (the contextualized embeddings) are where the “meaning” is captured.

Page 18: Lecture 24: Attention - GitHub Pages

18

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: as above, “found a hair in the …” → 1.0)

Strengths and weaknesses of contextualized (token-based) embeddings?

Page 19: Lecture 24: Attention - GitHub Pages

19

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: as above, “found a hair in the …” → 1.0)

Strengths:

• Tailored to your particular corpus

• No out-of-vocabulary (OOV) words

Page 20: Lecture 24: Attention - GitHub Pages

20

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: as above, “found a hair in the …” → 1.0)

Weaknesses:

• May not have enough data to produce good results

• Have to train new model for each use case

• Can’t leverage the wealth of existing text data (millions of books)???

Page 21: Lecture 24: Attention - GitHub Pages

21

contextualized embeddings (token-based)
approaches:
• Predictive models (e.g., BiLSTMs, GPT-2, BERT)

Review #53,781
(Diagram: as above, “found a hair in the …” → 1.0)

Weaknesses:

• May not have enough data to produce good results

• Have to train new model for each use case

• Can’t leverage the wealth of existing text data (millions of books)???

WRONG! We can leverage millions of books!

Page 22: Lecture 24: Attention - GitHub Pages

22

Language Modelling

(Diagram: an RNN language model reads “Call me Ishmael. It” and predicts the next tokens “me Ishmael. It was”)

(let’s input 1 million documents)

Page 23: Lecture 24: Attention - GitHub Pages

23

Language Modelling

(Diagram: an RNN language model reads “An oak tree belongs” and predicts the next tokens “oak tree belongs to”)

(let’s input 1 million documents)

Page 24: Lecture 24: Attention - GitHub Pages

24

(Diagram: an RNN language model reads “An oak tree belongs” and predicts the next tokens “oak tree belongs to”)

The contextualized embeddings for 1 million docs aren’t useful to us for a new task (e.g., predicting Yelp reviews), but the learned weights could be!

Learn a rich, robust 𝑊 and 𝑉.

Page 25: Lecture 24: Attention - GitHub Pages

25

(Diagram: an RNN language model reads “An oak tree belongs” and predicts the next tokens “oak tree belongs to”)

Using these “pre-trained” 𝑊 and 𝑉, we can possibly increase our performance on other tasks (e.g., Yelp reviews), since they’re very experienced with producing/capturing “meaning”.
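As a rough sketch of this kind of reuse (the module names, sizes, and rating head below are hypothetical, not the lecture's code): pre-train an LSTM language model on a large corpus, then copy its embedding and LSTM weights into a new review-rating model and fine-tune.

```python
# Sketch of weight reuse: the pre-trained embedding + LSTM (loosely, the
# lecture's W and V) are copied into a new task-specific model.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256       # illustrative sizes

class LSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.next_word = nn.Linear(HID, VOCAB)   # predicts the next token

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.next_word(h)

class ReviewRegressor(nn.Module):
    def __init__(self, pretrained: LSTMLanguageModel):
        super().__init__()
        self.embed = pretrained.embed            # reuse pre-trained weights
        self.lstm = pretrained.lstm              # reuse pre-trained weights
        self.rating = nn.Linear(HID, 1)          # new, task-specific head

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.rating(h[:, -1, :])          # rating from the last hidden state

lm = LSTMLanguageModel()
# ... pre-train `lm` with a next-token cross-entropy objective on 1M documents ...
model = ReviewRegressor(lm)                       # then fine-tune on labeled reviews
```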

Page 26: Lecture 24: Attention - GitHub Pages

26

• Language Modelling may help us for other tasks

• LSTMs do a great job of capturing “meaning”, which can be used for almost every task

• Given a sequence of N words, we can produce 1 output

• Given a sequence of N words, we can produce N outputs

RECAP

Page 27: Lecture 24: Attention - GitHub Pages

27

• Language Modelling may help us for other tasks

• LSTMs do a great job of capturing “meaning”, which can be used for almost every task

• Given a sequence of N words, we can produce 1 output

• Given a sequence of N words, we can produce N outputs

• What if we wish to have M outputs?

RECAP

Page 28: Lecture 24: Attention - GitHub Pages

28

We want to produce a variable-length output (e.g., n → m predictions)

Thank you for visiting! Děkujeme za návštěvu!

Page 29: Lecture 24: Attention - GitHub Pages

29

Outline

How to use embeddings

seq2seq

seq2seq + Attention

Transformers (preview)


Page 31: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

• If our input is a sentence in Language A, and we wish to translate it to Language B, it is clearly sub-optimal to translate word by word (as our current models are suited to do).

• Instead, let a sequence of tokens be the unit that we ultimately wish to work with (a sequence of length N may emit a sequence of length M). A minimal sketch follows below.

• seq2seq models are comprised of 2 RNNs: 1 encoder and 1 decoder.
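Here is a minimal PyTorch sketch of such an encoder-decoder pair, with teacher forcing during training. All sizes, names, and the toy inputs are illustrative, not the lecture's code.

```python
# Minimal seq2seq sketch: encode the source, then decode with the encoder's
# final state as the decoder's initial state (teacher forcing at train time).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 8_000, 8_000, 128, 256

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        # The encoder's final (h, c) initializes the decoder.
        _, (h, c) = self.encoder(self.src_embed(src_ids))
        dec_h, _ = self.decoder(self.tgt_embed(tgt_ids), (h, c))
        return self.out(dec_h)          # one distribution over TGT_VOCAB per step

model = Seq2Seq()
src = torch.randint(0, SRC_VOCAB, (1, 4))   # e.g., "The brown dog ran"
tgt = torch.randint(0, TGT_VOCAB, (1, 5))   # e.g., "<s> Le chien brun a"
logits = model(src, tgt)                     # shape (1, 5, TGT_VOCAB)
```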

Page 32: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: ENCODER RNN, input layer “The brown dog ran”, hidden states h₁ᵉ, h₂ᵉ, h₃ᵉ, h₄ᵉ)

Page 33: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: ENCODER RNN over “The brown dog ran”, hidden states h₁ᵉ…h₄ᵉ)

The final hidden state of the encoder RNN is the initial state of the decoder RNN.

Page 34: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: ENCODER RNN over “The brown dog ran”; the DECODER RNN begins from <s> with hidden state h₁ᵈ)

The final hidden state of the encoder RNN is the initial state of the decoder RNN.

Page 35: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the decoder’s first step, fed <s>, predicts “Le”)

The final hidden state of the encoder RNN is the initial state of the decoder RNN.

Page 36: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the predicted “Le” is fed back as the decoder’s next input, producing hidden state h₂ᵈ)

Page 37: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the decoder’s second step predicts “chien”; output so far: “Le chien”)

Page 38: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: “chien” is fed back in, producing hidden state h₃ᵈ)

Page 39: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the decoder’s third step predicts “brun”; output so far: “Le chien brun”)

Page 40: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: “brun” is fed back in, producing hidden state h₄ᵈ)

Page 41: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the decoder’s fourth step predicts “a”; output so far: “Le chien brun a”)

Page 42: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: “a” is fed back in, producing hidden state h₅ᵈ)

Page 43: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the decoder’s fifth step predicts “couru”; output so far: “Le chien brun a couru”)

Page 44: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: “couru” is fed back in, producing hidden state h₆ᵈ)

Page 45: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the decoder’s sixth step predicts the end token <s>; full output: “Le chien brun a couru <s>”)

Page 46: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the full seq2seq model: ENCODER RNN over “The brown dog ran”, DECODER RNN emitting ŷ₁…ŷ₆ = “Le chien brun a couru <s>”)

Page 47: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the full seq2seq model, emitting ŷ₁…ŷ₆)

Training occurs like RNNs typically do; the loss (from the decoder outputs) is calculated, and we update weights all the way to the beginning (encoder)

Page 48: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the full seq2seq model, emitting ŷ₁…ŷ₆)

At test time, the decoder generates outputs one word at a time, until it generates a <s> token. Each decoder output ŷᵢ becomes the next input xᵢ₊₁.
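A sketch of that test-time loop, reusing the hypothetical Seq2Seq module from the earlier sketch; the start/end token ids and maximum length are illustrative.

```python
# Greedy decoding sketch: feed each predicted token back in as the next input,
# stopping when the end token is produced.
import torch

SOS, EOS, MAX_LEN = 1, 2, 50   # illustrative special-token ids

@torch.no_grad()
def greedy_decode(model, src_ids):
    _, state = model.encoder(model.src_embed(src_ids))    # encoder's final (h, c)
    token = torch.tensor([[SOS]])
    output = []
    for _ in range(MAX_LEN):
        dec_h, state = model.decoder(model.tgt_embed(token), state)
        token = model.out(dec_h[:, -1, :]).argmax(dim=-1, keepdim=True)
        if token.item() == EOS:                            # stop at the end token
            break
        output.append(token.item())
    return output
```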

Page 49: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

See any issues with this traditional seq2seq paradigm?

Page 50: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the full seq2seq model, as above)

Page 51: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

(Diagram: the full seq2seq model; the single hand-off from the encoder’s final hidden state to the decoder is highlighted)

It’s crazy that the entire “meaning” of the 1st sequence is expected to be packed into this one embedding, and that the encoder then never interacts w/ the decoder again. Hands free.

Page 52: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

Instead, what if the decoder, at each step, pays attention to a distribution of all of the encoder’s hidden states?

Page 53: Lecture 24: Attention - GitHub Pages

Sequence-to-Sequence (seq2seq)

Instead, what if the decoder, at each step, pays attention to a distribution of all of the encoder’s hidden states?

Intuition: when we (humans) translate a sentence, we don’t just consume the original sentence then regurgitate in a new language; we continuously look back at the original while focusing on different parts.

Page 54: Lecture 24: Attention - GitHub Pages

54

Outline

How to use embeddings

seq2seq

seq2seq + Attention

Transformers (preview)

Page 55: Lecture 24: Attention - GitHub Pages

55

Outline

How to use embeddings

seq2seq

seq2seq + Attention

Transformers (preview)

Page 56: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: ENCODER RNN over “The brown dog ran”, hidden states h₁ᵉ…h₄ᵉ, with candidate attention weights .4? .3? .1? .2?)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

Page 57: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: ENCODER RNN over “The brown dog ran”, hidden states h₁ᵉ…h₄ᵉ; DECODER RNN at <s> with hidden state h₁ᵈ)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Page 58: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: a separate FFNN takes [h₁ᵉ; h₁ᵈ] and outputs a raw score e₁ = 1.5)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Page 59: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: the same FFNN takes [h₂ᵉ; h₁ᵈ] and outputs e₂ = 0.9)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Page 60: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: the same FFNN takes [h₃ᵉ; h₁ᵈ] and outputs e₃ = 0.2)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Page 61: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: the same FFNN takes [h₄ᵉ; h₁ᵈ] and outputs e₄ = −0.5)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Page 62: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: encoder states h₁ᵉ…h₄ᵉ and decoder state h₁ᵈ)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Attention (raw scores): e₁ = 1.5, e₂ = 0.9, e₃ = 0.2, e₄ = −0.5

Page 63: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: encoder states h₁ᵉ…h₄ᵉ and decoder state h₁ᵈ)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Attention (raw scores): e₁ = 1.5, e₂ = 0.9, e₃ = 0.2, e₄ = −0.5

Attention (softmax’d): aᵢ = exp(eᵢ) / Σⱼ exp(eⱼ)

Page 64: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: encoder states h₁ᵉ…h₄ᵉ and decoder state h₁ᵈ)

Q: How do we determine how much to pay attention to each of the encoder’s hidden layers?

A: Let’s base it on our decoder’s current hidden state (our current representation of meaning) and all of the encoder’s hidden layers!

Attention (raw scores): e₁ = 1.5, e₂ = 0.9, e₃ = 0.2, e₄ = −0.5

Attention (softmax’d): a₁ = 0.51, a₂ = 0.28, a₃ = 0.14, a₄ = 0.07

Page 65: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: encoder states h₁ᵉ…h₄ᵉ weighted by a₁…a₄; DECODER RNN at <s> with state h₁ᵈ)

Attention (softmax’d): a₁ = 0.51, a₂ = 0.28, a₃ = 0.14, a₄ = 0.07

We multiply each encoder hidden state by its attention weight and sum them to create a context vector: c₁ = Σᵢ aᵢ hᵢᵉ
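Here is a small numeric sketch of exactly this computation (raw scores, softmax, context vector, concatenation with the decoder state). For brevity a dot product stands in for the separate FFNN scorer, and the tensors are random placeholders.

```python
# Attention sketch: score each encoder state against the decoder state,
# softmax the scores, and take the weighted sum as the context vector.
import torch
import torch.nn.functional as F

enc_states = torch.randn(4, 256)       # h1e..h4e for "The brown dog ran"
dec_state = torch.randn(256)           # h1d, the decoder's current state

scores = enc_states @ dec_state        # e1..e4, one raw score per encoder state
weights = F.softmax(scores, dim=0)     # a1..a4, non-negative and summing to 1
context = weights @ enc_states         # c1 = sum_i a_i * h_i^e, shape (256,)

decoder_input = torch.cat([dec_state, context])   # [h1d ; c1], used to predict y1
```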

Page 66: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: the concatenation [h₁ᵈ; c₁] is used to predict ŷ₁ = “Le”)

REMEMBER: each attention weight aᵢ is based on the decoder’s current hidden state, too.

Page 67: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: at decoder step 2 (input “Le”), new attention weights over h₁ᵉ…h₄ᵉ produce context c₂; [h₂ᵈ; c₂] predicts ŷ₂ = “chien”)

REMEMBER: each attention weight aᵢ is based on the decoder’s current hidden state, too.

Page 68: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: at decoder step 3 (input “chien”), new attention weights produce context c₃; [h₃ᵈ; c₃] predicts ŷ₃ = “brun”)

REMEMBER: each attention weight aᵢ is based on the decoder’s current hidden state, too.

Page 69: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: at decoder step 4 (input “brun”), new attention weights produce context c₄; [h₄ᵈ; c₄] predicts ŷ₄ = “a”)

REMEMBER: each attention weight aᵢ is based on the decoder’s current hidden state, too.

Page 70: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

(Diagram: at decoder step 5 (input “a”), new attention weights produce context c₅; [h₅ᵈ; c₅] predicts ŷ₅ = “couru”)

REMEMBER: each attention weight aᵢ is based on the decoder’s current hidden state, too.

Page 71: Lecture 24: Attention - GitHub Pages

Photo credit: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

For convenience, here’s the Attention calculation summarized on 1 slide

Page 72: Lecture 24: Attention - GitHub Pages

Photo credit: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

For convenience, here’s the Attention calculation summarized on 1 slide.

The Attention mechanism that produces scores doesn’t have to be a FFNN like I illustrated. It can be any function you wish.

Page 73: Lecture 24: Attention - GitHub Pages

Photo credit: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

Popular Attention Scoring functions:
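The slide’s table of scoring functions is an image and is not reproduced here; three commonly used forms are the dot-product, bilinear (multiplicative), and additive/MLP (Bahdanau-style) scorers. A sketch of all three, with illustrative sizes:

```python
# Common attention scoring functions. Shapes: s is the decoder state (d,),
# H holds the n encoder states as rows (n, d). Sizes are illustrative.
import torch
import torch.nn as nn

d, n = 256, 4
s, H = torch.randn(d), torch.randn(n, d)

# 1) dot-product:  e_i = h_i . s
e_dot = H @ s

# 2) bilinear (multiplicative):  e_i = h_i^T W s
W = nn.Linear(d, d, bias=False)
e_bilinear = H @ W(s)

# 3) additive / MLP:  e_i = v^T tanh(W1 h_i + W2 s)
W1, W2, v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, 1)
e_additive = v(torch.tanh(W1(H) + W2(s))).squeeze(-1)
```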

Page 74: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

Attention:

• greatly improves seq2seq results

• allows us to visualize the contribution each encoder word gave to each decoder word

Image source: Fig 3 in Bahdanau et al., 2015

Page 75: Lecture 24: Attention - GitHub Pages

seq2seq + Attention

Attention:

• greatly improves seq2seq results

• allows us to visualize the contribution each encoder word gave to each decoder word

Image source: Fig 3 in Bahdanau et al., 2015

Takeaway:

Having a separate encoder and decoder allows for n → m length predictions.

Attention is powerful; it allows us to conditionally weight our focus.

Page 76: Lecture 24: Attention - GitHub Pages

76

• LSTMs yielded state-of-the-art results on most NLP tasks (2014-2018)

• seq2seq+Attention was an even more revolutionary idea (Google Translate used it)

• Attention allows us to place appropriate weight to the encoder’s hidden states

• But, LSTMs require us to iteratively scan each word and wait until we’re at the end before we can do anything

SUMMARY

Page 77: Lecture 24: Attention - GitHub Pages

77

Outline

How to use embeddings

seq2seq

seq2seq + Attention

Transformers (preview)


Page 79: Lecture 24: Attention - GitHub Pages

(Diagram: Transformer Encoder: inputs x1…x4 (“The brown dog ran”) pass through a Self-attention Head producing z1…z4, then per-position FFNNs producing representations r1…r4)

The Transformer Encoder uses attention on itself (self-attention) to create very rich embeddings which can be used for any task.

BERT is a BidirectionalTransformer Encoder. You can attach a final layer that performs whatever task you’re interested in (e.g., Yelp reviews).

Its results are unbelievably good.
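A minimal sketch of one self-attention head (scaled dot-product attention over learned query/key/value projections). The dimensions are illustrative, and a real Transformer block adds multiple heads, residual connections, and layer normalization.

```python
# Minimal single-head self-attention sketch: every position attends to every
# position in the same sequence.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64
x = torch.randn(4, d_model)            # x1..x4 for "The brown dog ran"

Wq, Wk, Wv = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = Wq(x), Wk(x), Wv(x)

scores = Q @ K.T / math.sqrt(d_model)  # (4, 4): each word scores every word
weights = F.softmax(scores, dim=-1)    # one attention distribution per position
z = weights @ V                        # z1..z4, the attended representations
# In a full Transformer block, each z_i then goes through a FFNN to give r_i.
```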

Page 80: Lecture 24: Attention - GitHub Pages

BERT is trained on a lot of text data:

• BooksCorpus (800M words)

• English Wikipedia (2.5B words)

BERT-Base has 12 transformer blocks, 12 attention heads, and 110M parameters!

BERT-Large has 24 transformer blocks, 16 attention heads, and 340M parameters!

BERT (a Transformer variant)

Yay, for transfer learning!

Page 81: Lecture 24: Attention - GitHub Pages

BERT (a Transformer variant)

(Diagram: inputs x1…x4 (“The brown dog ran”) pass through a stack of Transformer encoders (Encoder #1 … Encoder #8), producing representations r1…r4 and a task-specific output y)

Typically, one uses BERT’s awesome embeddings to fine-tune toward a different NLP task (this is called Sequential Transfer Learning).

Takeaway: BERT is incredible for learning context-aware representations of words and using transfer learning for other tasks (e.g., classification).

It can’t generate new sentences, though, since it has no decoder.
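A sketch of that fine-tuning setup using the Hugging Face transformers library; the "bert-base-uncased" checkpoint is real, while the 5-class rating setup and the single training example are illustrative.

```python
# Sequential transfer learning sketch: load pre-trained BERT, add a small
# classification head, and fine-tune on your own labeled reviews.
# Requires: pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5      # e.g., 1-5 star Yelp ratings
)

batch = tokenizer(["The food was delicious. Amazing!"],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([4])                  # 0-indexed: a 5-star review

outputs = model(**batch, labels=labels)     # loss + logits from the new head
outputs.loss.backward()                     # fine-tunes BERT's weights end-to-end
```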

Page 82: Lecture 24: Attention - GitHub Pages

Transformer

What if we want to generate a new output sequence?

The GPT-2 model (Generative Pre-trained Transformer 2) to the rescue!

Page 83: Lecture 24: Attention - GitHub Pages

GPT-2 (a Transformer variant)

• GPT-2 uses only Transformer Decoders (no Encoders) to generate new sequences.

• As it processes each word/token, it cleverly masks the “future” words and conditions itself on the previous words.

• It can generate text from scratch or from a starting sequence.

• It is easy to fine-tune on your own dataset (language).
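A sketch of generating text from a starting sequence with the Hugging Face implementation of GPT-2; the sampling settings are illustrative.

```python
# Text generation sketch with GPT-2: condition on a prompt, then sample a
# continuation token by token (the library handles the future-masking).
# Requires: pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Call me Ishmael. It"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    max_length=40,        # prompt plus generated tokens
    do_sample=True,       # sample instead of greedy decoding
    top_k=50,             # restrict sampling to the 50 most likely tokens
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```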

Page 84: Lecture 24: Attention - GitHub Pages

GPT-2

• GPT-2 uses only Transformer Decoders (no Encoders) to generate new sequences.

• As it processes each word/token, it cleverly masks the “future” words and conditions itself on the previous words.

• It can generate text from scratch or from a starting sequence.

• It is easy to fine-tune on your own dataset (language).

• GPT-3 is an even bigger version of GPT-2, but isn’t open-source.

Takeaway: GPT-2 is astounding for generating realistic-looking new text. It can be fine-tuned toward other tasks, too.

Page 85: Lecture 24: Attention - GitHub Pages

GPT-2 is:

• trained on 40GB of text data (8M webpages)!

• 1.5B parameters

GPT-3 is an even bigger version (175B parameters) of GPT-2, but isn’t open-source.

Yay, for transfer learning!

GPT-2 (a Transformer variant)

Page 86: Lecture 24: Attention - GitHub Pages

QUESTIONS?