Transcript
Page 1: 344.063/163 KV Special Topic: Natural Language Processing

Institute of Computational Perception

344.063/163 KV Special Topic: Natural Language Processing with Deep Learning
Neural Machine Translation with Attention Networks

Navid [email protected]

Institute of Computational Perception

Page 2: 344.063/163 KV Special Topic: Natural Language Processing

Agenda

• Machine Translation
• Attention Networks
• Attention in practice
  • Seq2seq with attention
  • Hierarchical document classification

Page 3: 344.063/163 KV Special Topic: Natural Language Processing

Agenda

• Machine Translation
• Attention Networks
• Attention in practice
  • Seq2seq with attention
  • Hierarchical document classification

Slides of this section are mostly adapted from https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/cs224n-2020-lecture08-nmt.pdf

Page 4: 344.063/163 KV Special Topic: Natural Language Processing

4

Machine Translation (MT)

§ Machine Translation is the task of translating a sentence X in the source language into a sentence Y in the target language

§ A long history (since the 1950s)
- Early systems were mostly rule-based

§ Challenges:
- Common sense
- Idioms!
- Typological differences between the source and target language
- Alignment
- Low-resource language pairs

Page 5: 344.063/163 KV Special Topic: Natural Language Processing

5

Statistical Machine Translation (SMT)

§ Statistical Machine Translation (roughly 1990-2010) learns a probabilistic model from large amounts of parallel data

§ The model aims to find the best target-language sentence Y*, given the source-language sentence X:

Y* = argmax_Y P(Y|X)

§ SMT uses Bayes' rule to split this probability into two components that can be learnt separately:

Y* = argmax_Y P(X|Y) P(Y)

(the denominator P(X) is dropped, since X is fixed and does not affect the argmax over Y)

https://en.wikipedia.org/wiki/Rosetta_Stone

Translation Model: the statistical model that defines how words and phrases should be translated (learnt from parallel data)

Language Model: the statistical model that tells us how to write good sentences in the target language (learnt from monolingual data)
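To make the decomposition concrete, here is a minimal Python sketch of noisy-channel decoding over a tiny candidate set; the sentences and probabilities are invented purely for illustration and are not from the slides.

# Toy noisy-channel decoding: pick the target sentence Y that maximizes P(X|Y) * P(Y).
# All candidates and probabilities below are made up for illustration.
source = "das Haus ist klein"

candidates = {
    # candidate Y: (translation model P(X|Y), language model P(Y))
    "the house is small": (0.20, 0.010),
    "the house is little": (0.15, 0.006),
    "small is the house": (0.20, 0.001),
}

best_y = max(candidates, key=lambda y: candidates[y][0] * candidates[y][1])
print(best_y)  # -> "the house is small"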

Page 6: 344.063/163 KV Special Topic: Natural Language Processing

6

Learning the Translation model

§ To learn the Translation model P(X|Y), we need to break X and Y down into aligned words and phrases

§ To this end, the alignment latent variable a is added to the formulation of the Translation model:

P(X, a|Y)

§ Alignment …
- is a latent variable → it is not explicitly defined in the data!
- defines the correspondence between particular words/phrases in the translation sentence pair

Page 7: 344.063/163 KV Special Topic: Natural Language Processing

7

Alignment!

Examples originally from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993.

Page 8: 344.063/163 KV Special Topic: Natural Language Processing

8

Alignment!

Examples originally from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993.

Page 9: 344.063/163 KV Special Topic: Natural Language Processing

9

Alignment!

Examples originally from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993.

Page 10: 344.063/163 KV Special Topic: Natural Language Processing

10

Alignment!

Examples originally from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993.

Page 11: 344.063/163 KV Special Topic: Natural Language Processing

11

SMT – summary

§ Defining alignment is complex!
- The Translation model has to jointly estimate the distributions of both variables (X and a)

§ SMT systems …
- were extremely complex, with lots of feature engineering
- required extra resources such as dictionaries and mapping tables between phrases and words
- required "special attention" for each language pair and lots of human effort

Page 12: 344.063/163 KV Special Topic: Natural Language Processing

12

MT – Evaluation

§ BLEU (Bilingual Evaluation Understudy)

§ BLEU computes a similarity score between the machine-written translation and one or several human-written reference translation(s), based on:
- n-gram precision (usually for 1-, 2-, 3- and 4-grams)
- plus a penalty for too-short machine translations

§ BLEU is precision-based, while ROUGE is recall-based

Details of how to calculate BLEU: https://www.coursera.org/lecture/nlp-sequence-models/bleu-score-optional-kC2HD
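A minimal Python sketch of the BLEU idea described above (clipped n-gram precision combined with a brevity penalty); it is a simplified sentence-level version against a single reference, not the official corpus-level metric, and the tiny epsilon stands in for proper smoothing.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())  # clipped n-gram matches
        precisions.append(max(clipped, 1e-9) / max(sum(cand.values()), 1))
    # Brevity penalty: punish machine translations that are shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)  # geometric mean of precisions

print(simple_bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))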

Page 13: 344.063/163 KV Special Topic: Natural Language Processing

Agenda

• Machine Translation
• Attention Networks
• Attention in practice
  • Seq2seq with attention
  • Hierarchical document classification

Page 14: 344.063/163 KV Special Topic: Natural Language Processing

14

Attention Networks

§ Attention is a generic deep learning method …
- to obtain a composed representation (output o) …
- from an arbitrary number of input representations (values V) …
- based on another given representation (query q)

§ General form of an attention network:

o = ATT(q, V)

§ When we have a set of queries, it becomes:

O = ATT(Q, V)

[Figure: the ATT block, once with a single query q and output o, once with a query set Q and output set O, in both cases attending to the values V]

Page 15: 344.063/163 KV Special Topic: Natural Language Processing

15

Attention Networks

๐‘ธ ร—๐‘‘!

๐‘ฝ ร—๐‘‘"

๐‘ธ ร—๐‘‘"

๐‘ฝ

ATT

๐‘ถ๐‘ธ

๐‘ถ = ATT(๐‘ธ, ๐‘ฝ)

We sometime say, each query vector ๐’’ โ€œattendsโ€ to value vectors

โ€ข ๐‘‘!, ๐‘‘" are embedding dimensions of query and value vectors, respectively

ATT

๐’—$ ๐’—% ๐’—& ๐’—'

๐’’%

๐’’$

๐’$ ๐’%

Page 16: 344.063/163 KV Special Topic: Natural Language Processing

16

Attention Networks – definition

Formal definition:
§ Given a set of value vectors V and queries Q …
- for each query in Q, attention computes a weighted sum of the values V as the output for that query

§ To this end, each query vector puts some amount of attention on each value vector
- This attention is used as the weight in the weighted sum

§ The weighted sum in attention networks can be seen as a selective summary of the information in the value vectors, where the query defines what portion (attention weight) of each value should be taken

Page 17: 344.063/163 KV Special Topic: Natural Language Processing

17

Attentions!

[Figure: the ATT block with queries q_1, q_2 attending to values v_1, …, v_4 and producing outputs o_1, o_2]

Page 18: 344.063/163 KV Special Topic: Natural Language Processing

18

Attentions!

α_{i,j} is the attention of query q_i on value v_j
α_i is the vector of attentions of query q_i on the value vectors V
α_i is a probability distribution
f is the attention function

[Figure: the attention function f computes the weights α_{1,1}, …, α_{1,4} and α_{2,1}, …, α_{2,4} for the queries q_1, q_2 over the values v_1, …, v_4]

Page 19: 344.063/163 KV Special Topic: Natural Language Processing

19

Attention Networks – formulation

§ Given the query vector q_i, an attention network assigns an attention weight α_{i,j} to each value vector v_j using the attention function f:

α_{i,j} = f(q_i, v_j)

where α_i forms a probability distribution over the value vectors:

Σ_{j=1}^{|V|} α_{i,j} = 1

§ The output for each query is the weighted sum of the value vectors, using the attentions as weights:

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j

Page 20: 344.063/163 KV Special Topic: Natural Language Processing

20

Attention – first formulation

Basic dot-product attention
§ First, non-normalized attention scores:

α̃_{i,j} = q_i v_j^T

- In this variant d_q = d_v
- There is no parameter to learn!

§ Then, softmax over the values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j

[Figure: the attention function f producing the weight vectors α_1 and α_2 for queries q_1, q_2 over values v_1, …, v_4]

Page 21: 344.063/163 KV Special Topic: Natural Language Processing

21

Example

๐‘“

๐‘“

0.0000.007 0.004 0.989

๐’—$ ๐’—% ๐’—& ๐’—'

๐’’%

๐’’$

๐’$ ๐’%

3โˆ’10

14โˆ’3

222

0.5โˆ’21

301

2.9830.0061.007

>๐œถ$ =

๐’’$๐’—$% = โˆ’1๐’’$๐’—&% = 4๐’’$๐’—'% = 3.5๐’’$๐’—(% = 9

โ†’ ๐œถ$ =

0.0000.0070.0040.989

๐’$ = 0.00014โˆ’3

+ 0.007222+ 0.004

0.5โˆ’21

+ 0.989301

๐’$ =2.9830.0061.007

Page 22: 344.063/163 KV Special Topic: Natural Language Processing

22

Example

๐‘“

๐‘“

0.0000.007 0.004 0.989

0.0000.2680.0050.727

๐’—$ ๐’—% ๐’—& ๐’—'

๐’’%

๐’’$

๐’$ ๐’%2013โˆ’10

14โˆ’3

222

0.5โˆ’21

301

2.9830.0061.007

2.7190.5261.268

>๐œถ& =

๐’’&๐’—$% = โˆ’1๐’’&๐’—&% = 6๐’’&๐’—'% = 2๐’’&๐’—(% = 7

โ†’ ๐œถ& =

0.0000.2680.0050.727

๐’& = 0.00014โˆ’3

+ 0.268222+ 0.005

0.5โˆ’21

+ 0.727301

๐’& =2.7190.5261.268

Page 23: 344.063/163 KV Special Topic: Natural Language Processing

23

Attention variants

Basic dot-product attention (recap)
§ First, non-normalized attention scores:

α̃_{i,j} = q_i v_j^T

- In this variant d_q = d_v
- There is no parameter to learn!

§ Then, softmax over the values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j

[Figure: same attention diagram as before]

Page 24: 344.063/163 KV Special Topic: Natural Language Processing

24

Attention variants

Multiplicative attention
§ First, non-normalized attention scores:

α̃_{i,j} = q_i W v_j^T

- W is a matrix of model parameters
- adds a linear function to measure the similarity between query and value

§ Then, softmax over the values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j

[Figure: same attention diagram as before]

Page 25: 344.063/163 KV Special Topic: Natural Language Processing

25

Attention variants

Additive attention
§ First, non-normalized attention scores:

α̃_{i,j} = u^T tanh(q_i W_1 + v_j W_2)

- W_1, W_2, and u are model parameters
- adds a non-linear function to measure the similarity between query and value

§ Then, softmax over the values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j

[Figure: same attention diagram as before]

Page 26: 344.063/163 KV Special Topic: Natural Language Processing

26

Attention in practice

§ Attention is used to create a compositional embedding of the value vectors according to a query
- E.g., in seq2seq models or in document classification (both discussed in this lecture)

[Figure: a single query q attending to the values v_1, …, v_4 through the ATT block to produce the output o]

Page 27: 344.063/163 KV Special Topic: Natural Language Processing

27

Self-attention (next lectures)

§ No query is given; the queries are the same as the values: Q = V
§ Self-attention is used to encode a sequence V into a contextualized sequence Ṽ
- Each encoded vector ṽ_i is a contextual embedding of the corresponding input vector v_i

[Figure: the Self-ATT block maps v_1, v_2, v_3 to contextualized vectors ṽ_1, ṽ_2, ṽ_3; it is equivalent to ATT with Q = V]

Page 28: 344.063/163 KV Special Topic: Natural Language Processing

28

Attention – summary

§ Attention is a way to define the distribution of focus on inputs based on a query, and to create a compositional embedding of the inputs

§ Attention networks define an attention distribution over the inputs and calculate their weighted sum

§ The original definition of an attention network has two inputs: key vectors K and value vectors V
- Key vectors are used to calculate the attentions
- the output is the weighted sum of the value vectors
- In practice, in most cases K = V
- In this course, we use our slightly simplified definition

Page 29: 344.063/163 KV Special Topic: Natural Language Processing

Agenda

• Machine Translation
• Attention Networks
• Attention in practice
  • Seq2seq with attention
  • Hierarchical document classification

Page 30: 344.063/163 KV Special Topic: Natural Language Processing

30

Neural Machine Translation (NMT)

§ Given the source-language sentence X and the target-language sentence Y, NMT uses a seq2seq model to calculate the conditional language model:

P(Y|X)
- A language model of the target language
- Conditioned on the source language

§ In contrast to SMT, there is no need for pre-defined alignments! 🎉

§ We can simply use a seq2seq model with two RNNs

Page 31: 344.063/163 KV Special Topic: Natural Language Processing

31

๐‘ฌ๐’™($)

๐’‰($)

RNN' RNN' RNN' RNN'

๐‘ฌ๐’™(%)

๐‘ฅ(%)๐‘ฌ

๐’™(&)

๐‘ฅ(&)๐‘ฌ

๐’™(')

๐’‰(%) ๐’‰(&) ๐’‰(')

๐‘ฅ(')<eos>

๐‘ฅ($)<bos>

J๐’š(&)

๐‘พ

RNN( RNN( RNN(

๐‘ผ๐’š($)

๐’”($)

๐‘ฆ($)<bos>

๐‘ผ๐’š(%)

๐‘ฆ(%)๐‘ผ

๐’š(&)

๐‘ฆ(&)

๐’”(%) ๐’”(&)

โ€ฆ

D๐’š(!): predicted probability distribution of the next target word, given the source sequence and previous target words

ENCODER DECODER

Seq2seq with two RNNs (recap)

Page 32: 344.063/163 KV Special Topic: Natural Language Processing

32

Seq2seq with two RNNs – training (recap)

Look here for more: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

Page 33: 344.063/163 KV Special Topic: Natural Language Processing

33

Seq2seq with two RNNs – decoding / beam search (recap)

Look here for more: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

Page 34: 344.063/163 KV Special Topic: Natural Language Processing

34

Sentence-level semantic representations (recap)

§ 2-dimensional projection of the last hidden states (h^(L)) of the encoder RNN (RNN_e), obtained from different phrases

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." Advances in Neural Information Processing Systems 27 (2014): 3104-3112.

Page 35: 344.063/163 KV Special Topic: Natural Language Processing

35

Bottleneck problem in seq2seq with two RNNs

[Figure: the same encoder-decoder diagram as before (encoder states h^(1), …, h^(4), decoder states s^(1), s^(2), s^(3)); only the last encoder hidden state is passed on to the decoder]

All information about the source sequence must be embedded in the last hidden state. Information bottleneck!

Page 36: 344.063/163 KV Special Topic: Natural Language Processing

36

Seq2seq + Attention

§ It can be useful to give the decoder direct access to all elements of the source sequence
- The decoder can then decide which elements of the source sequence it wants to put attention on

§ Attention is a solution to the bottleneck problem

§ Seq2seq with attention
- adds an attention network to the architecture of the basic seq2seq (two RNNs)
- at each time step, the decoder uses the attention network to attend to all contextualized vectors of the source sequence
- training and inference (decoding) work the same way as in the basic seq2seq

Bahdanau, Dzmitry, Kyung Hyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." 3rd International Conference on Learning Representations, ICLR. 2015.

Page 37: 344.063/163 KV Special Topic: Natural Language Processing

37

Seq2seq with attention

[Figure: the encoder hidden states h^(1), …, h^(4) are the values and the first decoder state s^(1) is the query of the ATT block; the resulting context vector h*^(1) (context vector of time step 1 of decoding) is concatenated (⊕ denotes vector concatenation) with s^(1) and projected through W to the prediction ŷ^(1)]

Page 38: 344.063/163 KV Special Topic: Natural Language Processing

38

Seq2seq with attention

[Figure: the same architecture at decoding time step 2; the decoder state s^(2) is the query, the encoder states h^(1), …, h^(4) are the values, and the context vector h*^(2) is concatenated with s^(2) to predict ŷ^(2)]

Page 39: 344.063/163 KV Special Topic: Natural Language Processing

39

Seq2seq with attention

[Figure: the same architecture at decoding time step 3; the decoder state s^(3) is the query, the encoder states h^(1), …, h^(4) are the values, and the context vector h*^(3) is concatenated with s^(3) to predict ŷ^(3)]

Page 40: 344.063/163 KV Special Topic: Natural Language Processing

40

Seq2seq with attention – formulation

§ Two vocabularies
- 𝕍_x is the vocabulary of the source sequences
- 𝕍_y is the vocabulary of the target sequences

ENCODER is the same as in seq2seq with two RNNs
§ Encoder embedding
- Encoder embedding matrix for source words (𝕍_x) → E
- The embedding of the source word x^(l) (at time step l) is denoted x^(l)

§ Encoder RNN:

h^(l) = RNN_e(h^(l-1), x^(l))

Parameters are shown in red

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." Advances in Neural Information Processing Systems 27 (2014): 3104-3112.

Page 41: 344.063/163 KV Special Topic: Natural Language Processing

41

Seq2seq with attention – formulation

DECODER – input

§ Decoder embedding
- Decoder embedding matrix at the input, for target words (𝕍_y) → U
- The embedding of the target word y^(t) (at time step t) is denoted y^(t)

§ Decoder RNN

s^(t) = RNN_d(s^(t-1), y^(t))

- The last hidden state of the encoder RNN is passed as the initial hidden state of the decoder RNN:

s^(0) = h^(L)

Parameters are shown in red

Page 42: 344.063/163 KV Special Topic: Natural Language Processing

42

Seq2seq with attention – formulation

DECODER – attention

§ Attention context vector

h*^(t) = ATT(s^(t), [h^(1), …, h^(L)])

For instance, if ATT is a "basic dot-product attention", this is done by:
- First calculating the non-normalized attentions:

α̃^(t)_l = s^(t) (h^(l))^T

- Then normalizing the attentions:

α^(t) = softmax(α̃^(t))

- and finally calculating the weighted sum of the encoder hidden states:

h*^(t) = Σ_{l=1}^{L} α^(t)_l h^(l)

Parameters are shown in red

Page 43: 344.063/163 KV Special Topic: Natural Language Processing

43

Seq2seq with attention – formulation

DECODER – output

§ Decoder output prediction
- Predicted probability distribution of the word at the next time step:

ŷ^(t) = softmax(W [s^(t); h*^(t)] + b) ∈ ℝ^|𝕍_y|

; denotes the concatenation of two vectors

Parameters are shown in red
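A minimal NumPy sketch of one decoding time step combining the pieces from the last three slides (decoder RNN, dot-product attention over the encoder states, output projection); the simple tanh cell and all parameter names are my own simplifications, not the exact model of the slides.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step_with_attention(s_prev, y_emb, H, W_s, W_y, W_out, b):
    # s_prev: previous decoder state s^(t-1), shape (d,)
    # y_emb:  embedding of the current target input word y^(t), shape (d,)
    # H:      encoder hidden states h^(1)..h^(L), shape (L, d)
    s_t = np.tanh(W_s @ s_prev + W_y @ y_emb)  # s^(t) = RNN_d(s^(t-1), y^(t)), toy tanh cell
    scores = H @ s_t                           # non-normalized attentions of s^(t) on each h^(l)
    alpha = softmax(scores)                    # attention distribution over source positions
    h_star = alpha @ H                         # context vector h*^(t)
    y_hat = softmax(W_out @ np.concatenate([s_t, h_star]) + b)  # distribution over target vocabulary
    return s_t, y_hat, alpha

The per-step attention weights alpha are what the alignment visualization on the next slide is built from.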

Page 44: 344.063/163 KV Special Topic: Natural Language Processing

44

Alignment in NMT (seq2seq with attention)

§ Attention automatically learns a (soft, approximate) alignment

Bahdanau et al. [2015]. Try more here: https://distill.pub/2016/augmented-rnns/#attentional-interfaces

English to French translation

Page 45: 344.063/163 KV Special Topic: Natural Language Processing

45

Seq2seq with attention – summary

§ Attention on the source sequence facilitates focusing on the relevant words and a better flow of information
- It also helps against the vanishing gradient problem by providing a shortcut to faraway states

§ Attention provides some interpretability
- Looking at the attention distributions, we can guess what the decoder is focusing on
- It is, however, disputable whether attention distributions should be taken as model explanations (particularly in Transformers)!
  • Jain, Sarthak, and Byron C. Wallace. "Attention is not Explanation." In Proc. of NAACL-HLT 2019. https://www.aclweb.org/anthology/N19-1357.pdf
  • Wiegreffe, Sarah, and Yuval Pinter. "Attention is not not Explanation." In Proc. of EMNLP-IJCNLP 2019. https://www.aclweb.org/anthology/D19-1002/

Page 46: 344.063/163 KV Special Topic: Natural Language Processing

Agenda

• Machine Translation
• Attention Networks
• Attention in practice
  • Seq2seq with attention
  • Hierarchical document classification

Page 47: 344.063/163 KV Special Topic: Natural Language Processing

47

Hierarchical document classification with attention

§ Document classification with attention
- An attention network is applied to the word embeddings as values (inputs) to compose a document vector (output)
  • The document embedding is then used as the feature vector for classification
- The query of the attention network is a randomly initialized parameter vector, whose weights are trained end-to-end with the model

§ Hierarchical document classification (see the sketch after the reference below)
- Split the document into sentences
- Use word-level attention to create a sentence embedding from the word embeddings of each sentence
- Use sentence-level attention to create the document embedding from the sentence embeddings

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
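A minimal NumPy sketch of the hierarchical attention idea described above: attention with a learned query vector is applied first over the words of each sentence, then over the sentence embeddings. The dot-product scoring and the omission of the GRU encoders are my simplifications of Yang et al.'s architecture; all names and sizes are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(values, query):
    # Learned-query attention: weights from dot products, output is the weighted sum of the values
    alpha = softmax(values @ query)
    return alpha @ values

rng = np.random.default_rng(0)
d = 16
u_word = rng.normal(size=d)  # word-level query vector (a trainable parameter in the real model)
u_sent = rng.normal(size=d)  # sentence-level query vector (also trainable)

# A toy "document": 3 sentences with 5, 7 and 4 word embeddings each (random stand-ins)
document = [rng.normal(size=(n, d)) for n in (5, 7, 4)]

sentence_embs = np.stack([attend(words, u_word) for words in document])  # words -> sentence embeddings
doc_emb = attend(sentence_embs, u_sent)                                  # sentences -> document embedding
print(doc_emb.shape)  # (16,) -- fed to a classifier on top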

Page 48: 344.063/163 KV Special Topic: Natural Language Processing

48

Architecture

[Figure: word-level part of the hierarchical attention network]
- GRU over the word embeddings (creates contextualized word embeddings)
- Query vector of the word-level attention (trained end-to-end with the other parameters)
- Output: the sentence embedding

Page 49: 344.063/163 KV Special Topic: Natural Language Processing

49

Architecture

[Figure: full hierarchical attention network]
- GRU over the word embeddings (creates contextualized word embeddings)
- Query vector of the word-level attention (trained end-to-end with the other parameters)
- GRU over the sentence embeddings (creates contextualized sentence embeddings)
- Query vector of the sentence-level attention (trained end-to-end with the other parameters)
- Output: the document embedding

Page 50: 344.063/163 KV Special Topic: Natural Language Processing

50

Examples

https://www.aclweb.org/anthology/N16-1174.pdf

Page 51: 344.063/163 KV Special Topic: Natural Language Processing

51

Example

https://www.aclweb.org/anthology/N16-1174.pdf

Page 52: 344.063/163 KV Special Topic: Natural Language Processing

52

Recap

§ Attention is a general deep learning approach that learns to distribute the focus over certain parts of the input and to compose an output

§ Attention significantly helps seq2seq models in machine translation!

§ Attention can also be used for encoding text, e.g., in document classification

[Figure: the ATT block with queries Q and values V producing outputs O]


Recommended