
Lecture 16: Seq2Seq and Attention (CIS 522, Penn Engineering, March 24, 2020)


Page 1

Page 2: Lecture 16: Seq2Seq and Attention

Page 3: Agenda
● Continuing CIS 522 at this point in time
● (Review) Seq2Seq
● Attention Models
  ○ Seq2Seq
  ○ Image Captioning

Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 4: Questions about the course
Please use the "show hand" option.

Page 5: Breakout rooms
How are things going in life? How are the projects going?

Page 6

Page 7: Part One: Seq2Seq

Page 8: Problems we would like to solve
• Parse language
  – Understand the role of all the words
• Translate language
  – Incidentally, would someone please solve this for computer languages
• Summarize text
  – See next slide
• Dialogue
  – Remember AIDungeon?

All of these have a Seq2Seq flavor.

Page 9: Typical automatic summary, pre deep learning
"Deep learning techniques now touch on data systems of all varieties. The purpose of this course is to deconstruct the hype by teaching deep learning theories, models, skills, and applications that are useful for applications. Suppose a problem has variable-length input. We will cover popular models, discuss how to find interesting niche research, and gain intuition for the best approaches. The success of a deep learning project often depends on the interpretation and presentation of results."

Page 10: Setting
[Diagram: a sequence-to-sequence model maps an input sequence x to an output sequence y.]
Diagram from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 11: Which y is best?
● We will produce a sequence of length T
● This sequence y should be likely given x
● The sequence y itself should be probable (e.g., a good sentence)
● We want to evaluate the probability of the whole sequence, p(y), but we can only readily evaluate one token at a time; by the chain rule, p(y | x) = p(y_1 | x) · p(y_2 | y_1, x) · … · p(y_T | y_1, …, y_{T-1}, x)

Logic adapted from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 12: Perfect algorithm
Enumerate all possible output token sequences and choose the most probable one. With vocabulary size V and output length T this means scoring V^T candidates, which is intractable for any realistic vocabulary, so we need approximations.
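
A minimal sketch of this exhaustive search, for intuition only; seq_logprob is a hypothetical stand-in for a model that scores a whole output sequence:

from itertools import product

def exhaustive_decode(seq_logprob, vocab, T):
    # Score every possible length-T output sequence and return the best.
    # There are len(vocab)**T candidates, so this only runs for toy sizes.
    return max(product(vocab, repeat=T), key=seq_logprob)

# Toy usage: 10 tokens and length 4 already mean 10,000 candidates.
vocab = [f"tok{i}" for i in range(10)]
best = exhaustive_decode(lambda ys: -len(set(ys)), vocab, T=4)  # toy scorer
print(best)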

Page 13

Page 14: Let us zoom in and summarize (one word at a time)

Page 15: Naive Decoder
● The "greedy" decoder takes the most probable token and returns it
● It then uses this outcome to infer the next word in the sequence
● What if it gets one of the earlier inferences wrong? It has no way to undo that choice: every later word is conditioned on the mistake.

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm
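
A minimal greedy-decoding sketch, assuming a hypothetical next_token_logprobs(prefix) interface that returns {token: log-probability} for the next position (a real Seq2Seq decoder step would also condition on the encoder states); the toy stub below only exists to make the example runnable:

def next_token_logprobs(prefix):
    # Hypothetical toy model: a fixed distribution per decoding step.
    steps = [
        {"he": -0.7, "I": -0.9},
        {"hit": -1.0, "struck": -2.2},
        {"me": -0.8, "a": -1.1},
        {"<END>": -0.5, "with": -1.9},
    ]
    return steps[min(len(prefix) - 1, len(steps) - 1)]

def greedy_decode(max_len=20):
    prefix = ["<START>"]
    for _ in range(max_len):
        logprobs = next_token_logprobs(prefix)
        best = max(logprobs, key=logprobs.get)  # take the single most probable token
        prefix.append(best)
        if best == "<END>":
            break  # an early mistake can never be revisited
    return prefix

print(greedy_decode())  # ['<START>', 'he', 'hit', 'me', '<END>']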

Page 16: Beam search as an alternative
● Beam Search: keep track of the k most probable partial translations (hypotheses) at any given point in time (k from roughly 5 to 10, for example)
  ○ A hypothesis's score is its sum of log probabilities, so scores are all negative, and a higher score is better
  ○ We search for high-scoring hypotheses, tracking the top k on each step
● Not guaranteed to find the optimal sequence, but a good start!

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 17: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: starting from <START>, calculate the probability distribution of the next word.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 18: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: take the top k words and compute their scores.]
-0.7 = log PLM(he | <START>)
-0.9 = log PLM(I | <START>)
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 19: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: for each of the k hypotheses, find the top k next words and calculate their scores.]
-1.7 = log PLM(hit | <START> he) + -0.7
-2.9 = log PLM(struck | <START> he) + -0.7
-1.6 = log PLM(was | <START> I) + -0.9
-1.8 = log PLM(got | <START> I) + -0.9
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 20: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: of these k² hypotheses, keep only the k with the highest scores: "he hit" (-1.7) and "I was" (-1.6).]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 21: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: for each of the k hypotheses, find the top k next words and calculate their scores.]
-2.8 = log PLM(a | <START> he hit) + -1.7
-2.5 = log PLM(me | <START> he hit) + -1.7
-2.9 = log PLM(hit | <START> I was) + -1.6
-3.8 = log PLM(struck | <START> I was) + -1.6
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 22: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: of these k² hypotheses, keep only the k with the highest scores: "he hit me" (-2.5) and "he hit a" (-2.8).]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 23: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: for each of the k hypotheses, find the top k next words (tart, pie, with, on) and calculate their scores: -3.3, -3.4, -3.5, and -4.0.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 24: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: the k² hypotheses and their scores from the previous step.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 25: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: for each of the k hypotheses, find the top k next words (in, with, a, one) and calculate their scores: -3.7, -4.3, -4.5, and -4.8.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 26: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: of these k² hypotheses, keep only the k with the highest scores (-3.7 and -4.3).]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 27: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: for each of the k hypotheses, find the top k next words (pie, tart) and calculate their scores: -4.3, -4.6, -5.0, and -5.3.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 28: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: the hypothesis ending in "pie", with score -4.3, is the top-scoring hypothesis!]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 29: Beam search decoding: example
Beam size = k = 2. Blue numbers = scores.
[Diagram: backtrack through the tree to obtain the full hypothesis: "he hit me with a pie".]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 30: Stopping Criterion: Greedy vs. Beam
● In greedy decoding, we usually decode until the model produces an <END> token (for example: <START> he hit me with a pie <END>)
● In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  ○ When a hypothesis produces <END>, that hypothesis is complete. Place it aside and continue exploring other hypotheses via beam search.
● Usually we continue beam search until we reach timestep T (where T is some pre-defined cutoff), or we have at least n completed hypotheses (where n is a pre-defined cutoff)

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 31: Final Touches: Beam Search
● Our scores are biased towards shorter sequences, since each added token makes the total log probability lower
● We must therefore normalize by the length of the sequence:
  score(y_1, …, y_t) = (1/t) Σ_{i=1..t} log PLM(y_i | y_1, …, y_{i-1}, x)

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm
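
Putting the last few slides together, here is a minimal beam-search sketch under the same assumption as the greedy sketch earlier (a hypothetical next_token_logprobs(prefix) interface): it keeps the top k hypotheses per step, sets completed hypotheses aside, stops at a cutoff, and ranks finished hypotheses by length-normalized score:

def beam_search(next_token_logprobs, k=5, max_len=20):
    beams = [(["<START>"], 0.0)]   # (tokens, running sum of log-probs)
    finished = []                  # completed hypotheses, set aside
    for _ in range(max_len):       # max_len plays the role of the cutoff T
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # of these (up to) k*k hypotheses, keep only the k with highest scores
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == "<END>":
                finished.append((tokens, score))  # hypothesis complete
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # include still-open hypotheses at the cutoff
    # normalize by length so longer sequences are not unfairly penalized
    return max(finished, key=lambda c: c[1] / len(c[0]))

With the toy stub from the greedy sketch, beam_search(next_token_logprobs, k=2) recovers the same sentence; with a real model the two can diverge whenever greedy commits to an early mistake.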

Page 32: Part Two: Attention

Page 33: Sequence-to-sequence: the bottleneck problem
[Diagram: an encoder RNN reads the source sentence (input) "il a m'entarté"; its final hidden state is the only thing passed to the decoder RNN, which generates the target sentence (output) "he hit me with a pie <END>". The encoding of the source sentence must retain all the information.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 34: Maybe attention might help?
● We want to be able to focus on different parts of the sequence as we decode it
  ○ Different contextual cues in our input sequence can imply different outputs (e.g. "Let's table this discussion" vs. "I put my cup on the table")
● But first, is it possible to draw inspiration from neuroscience?

Page 35: Overt Attention

Page 36: Covert Attention

Page 37: Object-based Attention

Page 38: Spotlight

Page 39: Attention in Deep Learning
● Big Idea: we can use a weighted sum of our encoder hidden states to inform our current prediction in the decoding phase
  ○ N → one and one → M is too constricting; we need a "bridge" of information between N and M
  ○ We can give the network the ability to selectively focus on different parts of the input
  ○ This solves the bottleneck problem because we have a direct connection to our encoding phase

Page 40: Sequence-to-sequence with attention
[Diagram: on the first decoder timestep, the decoder hidden state is compared with each encoder hidden state over the source sentence (input) "il a m'entarté" via dot products, yielding attention scores.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 41: Sequence-to-sequence with attention
[Diagram: as on the previous page; dot-product attention scores between the decoder state and each encoder hidden state.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 42: Sequence-to-sequence with attention
[Diagram: as on the previous page; dot-product attention scores between the decoder state and each encoder hidden state.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 43: Sequence-to-sequence with attention
[Diagram: as on the previous page; dot-product attention scores between the decoder state and each encoder hidden state.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 44: Sequence-to-sequence with attention
[Diagram: take a softmax of the attention scores to turn them into a probability distribution (the attention distribution). On this decoder timestep, we're mostly focusing on the first encoder hidden state ("he").]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 45: Sequence-to-sequence with attention
[Diagram: use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 46: Sequence-to-sequence with attention
[Diagram: concatenate the attention output with the decoder hidden state, then use it to compute y1 = "he", as before.]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 47: Sequence-to-sequence with attention
[Diagram: the decoder produces y2 = "hit". Sometimes we take the attention output from the previous step and also feed it into the decoder (along with the usual decoder input).]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 48: Sequence-to-sequence with attention
[Diagram: the decoder produces y3 = "me".]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 49: Sequence-to-sequence with attention
[Diagram: the decoder produces y4 = "with".]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 50: Sequence-to-sequence with attention
[Diagram: the decoder produces y5 = "a".]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 51: Sequence-to-sequence with attention
[Diagram: the decoder produces y6 = "pie".]
Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 52: The Mathematics Behind Attention (1)
● We have encoder hidden states h_1, …, h_N
● On timestep t, we have some decoder hidden state s_t
● We get the attention scores e^t = [s_t · h_1, …, s_t · h_N] (here, dot products)
● We'll take a softmax of these scores to get a probability distribution α^t = softmax(e^t)

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 53: The Mathematics Behind Attention (2)
● We use this attention distribution to generate a weighted sum of the encoder hidden states: a_t = Σ_{i=1..N} α_i^t · h_i
● We concatenate the output of the attention with the decoder hidden state, [a_t; s_t], and do the same operations as in the non-attention model

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm
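
A minimal NumPy sketch of these two slides (dot-product attention for one decoder timestep); the shapes and names are illustrative assumptions, not from the original slides:

import numpy as np

def dot_product_attention(encoder_states, decoder_state):
    # encoder_states: (N, d) array of h_1..h_N; decoder_state: (d,) vector s_t
    scores = encoder_states @ decoder_state        # e^t: one score per source position
    scores = scores - scores.max()                 # stabilize the softmax numerically
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention distribution alpha^t
    attn_output = alpha @ encoder_states           # a_t: weighted sum of encoder states
    return alpha, attn_output

# Toy usage: 4 source positions, hidden size 3.
h = np.random.randn(4, 3)                # stand-in encoder hidden states
s = np.random.randn(3)                   # stand-in decoder hidden state at timestep t
alpha, a_t = dot_product_attention(h, s)
combined = np.concatenate([a_t, s])      # [a_t; s_t], used to predict y_t as before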

Page 54: Benefits of Attention
● Improves performance: allows the decoder to focus on certain parts of the source input
● Solves the bottleneck problem: allows the decoder to look directly at the source, bypassing the bottleneck
● Helps with vanishing gradients: similar to skip connections
● Provides interpretability
  ○ We can see what the decoder is focusing on
  ○ We get (soft) alignment between terms for free

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 55: Attention, Generalized
● More broadly, attention is defined in the following way:
  ○ Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
● There are several variants of attention, many of which change how the attention scores are computed (dot-product, multiplicative, additive, etc.); a small sketch of these scoring functions follows.

Slides and Diagrams from: Stanford CS224N/Ling284 by Abigail See, Matthew Lamm
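
For concreteness, a sketch of the three common scoring functions; the weight shapes here are illustrative assumptions (dot-product requires matching dimensions; the multiplicative form is often attributed to Luong et al., the additive form to Bahdanau et al.):

import numpy as np

d_h, d_s, d_a = 3, 3, 4                      # hidden sizes (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((d_s, d_h))          # multiplicative (bilinear) weight
W1 = rng.standard_normal((d_a, d_h))         # additive-attention weights
W2 = rng.standard_normal((d_a, d_s))
v = rng.standard_normal(d_a)

def score_dot(h_i, s):
    return h_i @ s                           # dot-product attention

def score_multiplicative(h_i, s):
    return s @ W @ h_i                       # s^T W h_i (bilinear)

def score_additive(h_i, s):
    return v @ np.tanh(W1 @ h_i + W2 @ s)    # v^T tanh(W1 h_i + W2 s)

# Any of these can replace the dot product in the attention sketch above.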

Page 56: Show, Attend, Tell: Image Captioning
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning, 2015.

Page 57: Show, Attend, Tell
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning, 2015.

Page 58: Show, Attend, Tell
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning, 2015.

Page 59: Show, Attend, Tell
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning, 2015.

Page 60: What we learned today
● Clarity about how we are going forward
● Seq2Seq
● Attention Models
  ○ Seq2Seq
  ○ Image Captioning

Stanford CS224N/Ling284 by Abigail See, Matthew Lamm

Page 61

Page 62