Institute of Computational Perception
344.063/163 KV Special Topic: Natural Language Processing with Deep Learning
Neural Machine Translation with Attention Networks
Navid [email protected]
Agenda
• Machine Translation
• Attention Networks
• Attention in practice
  - Seq2seq with attention
  - Hierarchical document classification
Slides of this section are mostly adapted from https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/cs224n-2020-lecture08-nmt.pdf
Machine Translation (MT)
§ Machine Translation is the task of translating a sentence x in a source language to a sentence y in a target language
§ A long history (since the 1950s)
- Early systems were mostly rule-based
§ Challenges:
- Common sense
- Idioms!
- Typological differences between the source and target language
- Alignment
- Low-resource language pairs
Statistical Machine Translation (SMT)
§ Statistical Machine Translation (1990-2010) learns a probabilistic model from large amounts of parallel data
§ The model aims to find the best target language sentence y*, given the source language sentence x:

y* = argmax_y P(y|x)

§ SMT uses Bayes' rule to split this probability into two components that can be learnt separately:

y* = argmax_y P(x|y) P(y)

- Translation model P(x|y): the statistical model that defines how words and phrases should be translated (learnt from parallel data)
- Language model P(y): the statistical model that tells us how to write good sentences in the target language (learnt from monolingual data)

https://en.wikipedia.org/wiki/Rosetta_Stone
Learning the Translation model
§ To learn the translation model P(x|y), we need to break x and y down into aligned words and phrases
§ To this end, the alignment latent variable a is added to the formulation of the translation model:

P(x, a|y)

§ Alignment …
- is a latent variable → it is not explicitly defined in the data!
- defines the correspondence between particular words/phrases in the translation sentence pair
Alignment!
[Figure: word alignment examples between translation sentence pairs]
Examples originally from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993.
SMT – summary
§ Defining alignment is complex!
- The translation model has to jointly estimate the distributions of both variables (x and a)
§ SMT systems …
- were extremely complex, with lots of feature engineering
- required extra resources like dictionaries and mapping tables between phrases and words
- required "special attention" for each language pair and a lot of human effort
MT – Evaluation
§ BLEU (Bilingual Evaluation Understudy)
§ BLEU computes a similarity score between the machine-written translation and one or several human-written translation(s), based on:
- n-gram precision (usually for 1-, 2-, 3- and 4-grams)
- plus a penalty for too-short machine translations
§ BLEU is precision-based, while ROUGE is recall-based
Details of how to calculate BLEU: https://www.coursera.org/lecture/nlp-sequence-models/bleu-score-optional-kC2HD
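As a rough illustration of the idea (this is a simplified sentence-level sketch against a single reference, not the full corpus-level BLEU definition), the clipped n-gram precisions and the brevity penalty might be computed like this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference.

    Geometric mean of clipped n-gram precisions for n = 1..max_n,
    times a brevity penalty for too-short candidates.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Note how precision alone would reward degenerate short outputs; the brevity penalty is what makes too-short translations score low.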
Agenda
• Machine Translation
• Attention Networks
• Attention in practice
  - Seq2seq with attention
  - Hierarchical document classification
Attention Networks
§ Attention is a generic deep learning method …
- to obtain a composed representation (output o) …
- from an arbitrary number of input representations (values V) …
- based on another given representation (query q)
§ General form of an attention network:

o = ATT(q, V)

§ When we have a set of queries Q, it becomes:

O = ATT(Q, V)
Attention Networks

O = ATT(Q, V)

• Q has shape |Q| × d_q, V has shape |V| × d_v, and the output O has shape |Q| × d_v
• d_q, d_v are the embedding dimensions of query and value vectors, respectively
We sometimes say that each query vector q "attends" to the value vectors
[Figure: queries q1, q2 attending to values v1 … v4, producing outputs o1, o2]
Attention Networks – definition
Formal definition:
§ Given a set of value vectors V and query vectors Q …
- for each query in Q, attention computes a weighted sum of the values V as the output for that query
§ To this end, each query vector puts some amount of attention on each value vector
- This attention is used as the weight in the weighted sum
§ The weighted sum in attention networks can be seen as a selective summary of the information in the value vectors, where the query defines what portion (attention weight) of each value should be taken
Attentions!
[Figure: attention weights of queries q1, q2 over values v1 … v4]
Attentions!
• α_{i,j} is the attention of query q_i on value v_j
• α_i is the vector of attentions of query q_i on the value vectors V
• α_i is a probability distribution
• f is the attention function
[Figure: attention weights α_{1,1} … α_{1,4} and α_{2,1} … α_{2,4} of queries q1, q2 over values v1 … v4]
Attention Networks – formulation
§ Given the query vector q_i, an attention network assigns attention α_{i,j} to each value vector v_j using the attention function f:

α_{i,j} = f(q_i, v_j)

where α_i forms a probability distribution over the value vectors:

Σ_{j=1}^{|V|} α_{i,j} = 1

§ The output for each query is the weighted sum of the value vectors, using the attentions as weights:

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j
Attention – first formulation
Basic dot-product attention
§ First, non-normalized attention scores:

α̃_{i,j} = q_i^T v_j

- In this variant d_q = d_v
- There is no parameter to learn!
§ Then, softmax over values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j
Example
§ Values: v1 = [1, 4, −3], v2 = [2, 2, 2], v3 = [0.5, −2, 1], v4 = [3, 0, 1]; query: q1 = [3, −1, 0]
§ Non-normalized scores:

α̃_1: q1^T v1 = −1, q1^T v2 = 4, q1^T v3 = 3.5, q1^T v4 = 9

§ Softmax:

α_1 = [0.000, 0.007, 0.004, 0.989]

§ Weighted sum:

o_1 = 0.000·[1, 4, −3] + 0.007·[2, 2, 2] + 0.004·[0.5, −2, 1] + 0.989·[3, 0, 1] = [2.983, 0.006, 1.007]
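The worked example above can be reproduced in a few lines of NumPy; this is a direct sketch of basic dot-product attention, with the values and the query taken from the slide:

```python
import numpy as np

def dot_product_attention(q, V):
    """Basic dot-product attention: scores q·v_j, softmax, weighted sum."""
    scores = V @ q                           # non-normalized attentions α̃
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # α: a probability distribution
    return weights @ V, weights              # output o and attentions α

# Values and query from the slide's example
V = np.array([[1.0, 4.0, -3.0],
              [2.0, 2.0, 2.0],
              [0.5, -2.0, 1.0],
              [3.0, 0.0, 1.0]])
q1 = np.array([3.0, -1.0, 0.0])

o1, alpha1 = dot_product_attention(q1, V)
# o1 ≈ [2.983, 0.006, 1.007], matching the slide (slide values use rounded weights)
```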
Example
§ Same values; second query: q2 = [2, 0, 1]
§ Non-normalized scores:

α̃_2: q2^T v1 = −1, q2^T v2 = 6, q2^T v3 = 2, q2^T v4 = 7

§ Softmax:

α_2 = [0.000, 0.268, 0.005, 0.727]

§ Weighted sum:

o_2 = 0.000·[1, 4, −3] + 0.268·[2, 2, 2] + 0.005·[0.5, −2, 1] + 0.727·[3, 0, 1] = [2.719, 0.526, 1.268]
Attention variants
Basic dot-product attention (recap)
§ First, non-normalized attention scores:

α̃_{i,j} = q_i^T v_j

- In this variant d_q = d_v
- There is no parameter to learn!
§ Then, softmax over values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j
Attention variants
Multiplicative attention
§ First, non-normalized attention scores:

α̃_{i,j} = q_i^T W v_j

- W is a matrix of model parameters
- adds a linear function to measure the similarity between query and value
§ Then, softmax over values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j
Attention variants
Additive attention
§ First, non-normalized attention scores:

α̃_{i,j} = u^T tanh(W_1 q_i + W_2 v_j)

- W_1, W_2, and u are model parameters
- adds a non-linear function to measure the similarity between query and value
§ Then, softmax over values:

α_i = softmax(α̃_i)

§ Output (weighted sum):

o_i = Σ_{j=1}^{|V|} α_{i,j} v_j
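To make the three scoring variants concrete, here is a small NumPy sketch (the dimensions and randomly initialized parameters W, W1, W2, u are illustrative, not from the slides; in a real model they would be trained). Each scorer produces the non-normalized α̃_{i,j}; the rest of the pipeline (softmax + weighted sum) is shared:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, V, score):
    """Shared pipeline: score -> softmax -> weighted sum of values."""
    scores = np.array([score(q, v) for v in V])  # α̃_i
    alpha = softmax(scores)                      # α_i: probability distribution
    return alpha @ V                             # o_i

d, n = 4, 5                       # d_q = d_v = 4, five value vectors
q = rng.normal(size=d)
V = rng.normal(size=(n, d))

# Basic dot-product: α̃ = qᵀ v  (requires d_q = d_v, no parameters)
dot = lambda q, v: q @ v

# Multiplicative: α̃ = qᵀ W v  (W is a learned matrix)
W = rng.normal(size=(d, d))
mult = lambda q, v: q @ W @ v

# Additive: α̃ = uᵀ tanh(W1 q + W2 v)  (W1, W2, u are learned)
d_a = 6
W1 = rng.normal(size=(d_a, d))
W2 = rng.normal(size=(d_a, d))
u = rng.normal(size=d_a)
add = lambda q, v: u @ np.tanh(W1 @ q + W2 @ v)

outputs = {name: attend(q, V, f) for name, f in
           [("dot", dot), ("mult", mult), ("add", add)]}
```

All three variants produce an output of dimension d_v; they differ only in how the similarity between query and value is measured.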
Attention in practice
§ Attention is used to create a compositional embedding of value vectors according to a query
- E.g., in seq2seq models or document classification (both will be discussed in this lecture)
[Figure: an ATT block composing values v1 … v4 into output o according to query q]
Self-attention (next lectures)
§ No query is given; the queries are the same as the values: Q = V
§ Self-attention is used to encode a sequence V into a contextualized sequence Ṽ
- Each encoded vector ṽ_i is a contextual embedding of the corresponding input vector v_i
[Figure: Self-ATT block mapping values v1 … v3 to contextualized vectors ṽ1 … ṽ3; equivalent to ATT with Q = V]
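A minimal NumPy sketch of this idea, using basic dot-product scores with no learned parameters (real self-attention layers, covered in the next lectures, add learned projections on top of this):

```python
import numpy as np

def self_attention(V):
    """Each vector attends to all vectors of the same sequence (Q = V)."""
    scores = V @ V.T                              # α̃: |V| × |V| dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # row-wise softmax
    return alpha @ V                              # contextualized sequence Ṽ

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V_ctx = self_attention(V)   # same shape as V; each row mixes in its context
```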
Attention – summary
§ Attention is a way to define a distribution of focus over inputs based on a query, and to create a compositional embedding of the inputs
§ Attention networks define an attention distribution over inputs and calculate their weighted sum
§ The original definition of an attention network has two inputs: key vectors K and value vectors V
- Key vectors are used to calculate the attentions
- The output is the weighted sum of the value vectors
- In practice, in most cases K = V
- In this course, we use our slightly simplified definition
Agenda
• Machine Translation
• Attention Networks
• Attention in practice
  - Seq2seq with attention
  - Hierarchical document classification
Neural Machine Translation (NMT)
§ Given the source language sentence x and target language sentence y, NMT uses seq2seq models to calculate the conditional language model:

P(y|x)

- A language model of the target language
- Conditioned on the source language
§ In contrast to SMT, no need for pre-defined alignments!
§ We can simply use a seq2seq model with two RNNs
Seq2seq with two RNNs (recap)
[Figure: ENCODER – embeddings E of the source words x^(1) (<bos>) … x^(N) (<eos>) are fed through RNN_E, producing hidden states h^(1) … h^(N); the last hidden state initializes the DECODER – embeddings U of the target words y^(1) (<bos>), y^(2), … are fed through RNN_D, producing hidden states s^(1), s^(2), … and, through W, the predictions ŷ^(t)]
ŷ^(t): predicted probability distribution of the next target word, given the source sequence and the previous target words
Seq2seq with two RNNs – training (recap)
Look here for more: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Seq2seq with two RNNs – decoding / beam search (recap)
Look here for more: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Sentence-level semantic representations (recap)
§ 2-dimensional projection of the last hidden states (h^(N)) of RNN_E obtained from different phrases
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." Advances in Neural Information Processing Systems 27 (2014): 3104-3112.
Bottleneck problem in seq2seq with two RNNs
[Figure: the same encoder-decoder architecture; the only connection between ENCODER and DECODER is the last encoder hidden state]
All information of the source sequence must be embedded in the last hidden state. Information bottleneck!
Seq2seq + Attention
§ It can be useful to allow the decoder direct access to all elements of the source sequence
- The decoder can decide which element of the source sequence it wants to put attention on
§ Attention is a solution to the bottleneck problem
§ Seq2seq with attention …
- adds an attention network to the architecture of the basic seq2seq model (two RNNs)
- at each time step, the decoder uses the attention network to attend to all contextualized vectors of the source sequence
- training and inference (decoding) are the same as in the basic seq2seq model
Bahdanau, Dzmitry, Kyung Hyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." 3rd International Conference on Learning Representations, ICLR. 2015.
Seq2seq with attention
[Figure: decoding time step 1 – the decoder hidden state s^(1) (query) attends via ATT to the encoder hidden states h^(1) … h^(N) (values); the resulting context vector c^(1) is concatenated (⊕) with s^(1) and fed through W to produce ŷ^(1)]
Seq2seq with attention
[Figure: decoding time step 2 – s^(2) (query) attends to h^(1) … h^(N) (values), yielding the context vector c^(2), concatenated with s^(2) to produce ŷ^(2)]
Seq2seq with attention
[Figure: decoding time step 3 – s^(3) (query) attends to h^(1) … h^(N) (values), yielding the context vector c^(3), concatenated with s^(3) to produce ŷ^(3)]
Seq2seq with attention – formulation
§ Two sets of vocabularies
- V_x is the vocabulary of source sequences
- V_y is the vocabulary of target sequences
ENCODER is the same as in seq2seq with two RNNs
§ Encoder embedding
- Encoder embeddings for source words (V_x) → E
- Embedding of the source word x^(t) (at time step t) → e^(t)
§ Encoder RNN:

h^(t) = RNN_E(h^(t−1), e^(t))

Parameters are shown in red
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." Advances in Neural Information Processing Systems 27 (2014): 3104-3112.
Seq2seq with attention – formulation
DECODER – input
§ Decoder embedding
- Decoder embeddings at input for target words (V_y) → U
- Embedding of the target word y^(t) (at time step t) → u^(t)
§ Decoder RNN:

s^(t) = RNN_D(s^(t−1), u^(t))

- The values of the last hidden state of the encoder RNN are passed to the initial hidden state of the decoder RNN:

s^(0) = h^(N)
Seq2seq with attention – formulation
DECODER – attention
§ Attention context vector

c^(t) = ATT(s^(t), {h^(1), …, h^(N)})

For instance, if ATT is a "basic dot-product attention", this is done by:
- First calculating non-normalized attentions:

α̃_{t,i} = s^(t)T h^(i)

- Then, normalizing the attentions:

α^(t) = softmax(α̃^(t))

- and finally calculating the weighted sum of the encoder hidden states:

c^(t) = Σ_{i=1}^{N} α_{t,i} h^(i)
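Putting the decoder-side attention together, here is a NumPy sketch of a single decoding step with dot-product scores (the encoder hidden states and the decoder state are toy values made up for illustration; in the real model they come from the two RNNs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_t, H):
    """Context vector c^(t): weighted sum of encoder states H, with weights
    given by the softmax over dot-product scores with decoder state s_t."""
    scores = H @ s_t          # α̃_{t,i} = s^(t)ᵀ h^(i)
    alpha = softmax(scores)   # α^(t)
    return alpha @ H, alpha   # c^(t) and the attention distribution

# Toy encoder hidden states h^(1)..h^(4) and a decoder state s^(t)
H = np.array([[0.1, 0.9],
              [1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
s_t = np.array([2.0, -1.0])

c_t, alpha_t = attention_context(s_t, H)
# The output prediction would then be softmax(W @ np.concatenate([s_t, c_t]) + b)
```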
Seq2seq with attention – formulation
DECODER – output
§ Decoder output prediction
- Predicted probability distribution of words at the next time step:

ŷ^(t) = softmax(W[s^(t); c^(t)] + b) ∈ ℝ^|V_y|

; denotes the concatenation of two vectors
Alignment in NMT (seq2seq with attention)
§ Attention automatically learns a (soft) alignment
[Figure: attention heatmap for an English-to-French translation, Bahdanau et al. [2015]]
Try more here: https://distill.pub/2016/augmented-rnns/#attentional-interfaces
Seq2seq with attention – summary
§ Attention on the source sequence facilitates focus on the relevant words and a better flow of information
- It also helps avoid the vanishing gradient problem by providing a shortcut to faraway states
§ Attention provides some interpretability
- Looking at the attention distributions, we can guess what the decoder is focusing on
- It is however disputable whether attention distributions should be taken as model explanations (particularly in Transformers)!
• Jain, Sarthak, and Byron C. Wallace. "Attention is not Explanation." In proc. of NAACL-HLT 2019. https://www.aclweb.org/anthology/N19-1357.pdf
• Wiegreffe, Sarah, and Yuval Pinter. "Attention is not not Explanation." In proc. of EMNLP-IJCNLP 2019. https://www.aclweb.org/anthology/D19-1002/
Agenda
• Machine Translation
• Attention Networks
• Attention in practice
  - Seq2seq with attention
  - Hierarchical document classification
Hierarchical document classification with attention
§ Document classification with attention
- An attention network is applied to word embeddings as values (inputs) to compose a document vector (output)
  • The document embedding is then used as features for classification
- The query of the attention network is a randomly initialized parameter vector, whose weights are trained end-to-end with the model
§ Hierarchical document classification
- Split the document into sentences
- Use a word-level attention to create a sentence embedding from the word embeddings of each sentence
- Use a sentence-level attention to create the document embedding from the sentence embeddings
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016, June). Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies.
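A minimal NumPy sketch of the hierarchical composition (the query vectors `u_word` and `u_sent` are just randomly initialized here to stand in for trainable parameters; in the actual model of Yang et al. they are learned end-to-end, and GRU layers contextualize the embeddings first):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, values):
    """Compose a single vector from `values` by attention to `query`."""
    alpha = softmax(values @ query)  # attention distribution over values
    return alpha @ values            # weighted sum

d = 8
# A toy document: 3 sentences with 5, 2 and 4 word embeddings each
doc = [rng.normal(size=(n_words, d)) for n_words in (5, 2, 4)]

# Stand-ins for the trainable query vectors (learned end-to-end in practice)
u_word = rng.normal(size=d)
u_sent = rng.normal(size=d)

# Word-level attention: one embedding per sentence
sent_embs = np.stack([attend(u_word, sent) for sent in doc])
# Sentence-level attention: one embedding for the whole document
doc_emb = attend(u_sent, sent_embs)
# doc_emb would then be fed to a classifier, e.g. softmax(W @ doc_emb + b)
```

Note that sentences of different lengths are handled naturally: attention composes an arbitrary number of value vectors into one fixed-size output.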
Architecture
[Figure: word-level part of the Hierarchical Attention Network – a GRU over word embeddings creates contextualized word embeddings; the query vector of the word-level attention (trained end-to-end with the other parameters) composes them into a sentence embedding]
Architecture
[Figure: full Hierarchical Attention Network – a GRU over word embeddings creates contextualized word embeddings, which the word-level attention query composes into sentence embeddings; a GRU over sentence embeddings creates contextualized sentence embeddings, which the sentence-level attention query (trained end-to-end with the other parameters) composes into the document embedding]
Examples
[Figure: attention weight visualizations from the paper]
https://www.aclweb.org/anthology/N16-1174.pdf
Recap
§ Attention is a general deep learning approach that learns to distribute focus over certain parts of the input and to compose outputs
§ Attention significantly helps seq2seq models in machine translation!
§ Attention can also be used for encoding text, e.g., in document classification