
Page 1

Recurrent Neural Networks

Instructor: Yoav Artzi

CS5740: Natural Language Processing

Adapted from Yoav Goldberg’s Book and slides by Sasha Rush

Page 2

Overview

• Finite state models
• Recurrent neural networks (RNNs)
• Training RNNs
• RNN Models
• Long short-term memory (LSTM)
• Attention

Page 3

Text Classification
• Consider the example:
  – Goal: classify sentiment

  How can you not see this movie?
  You should not see this movie.

• Model: bag of words
• How well will the classifier work?
  – Similar unigrams and bigrams

• Generally: need to maintain a state to capture distant influences

Page 4

Finite State Machines

• Simple, classical way of representing state

• Current state: saves necessary past information

• Example: email address parsing

Page 5

Deterministic Finite State Machines
• S – states
• Σ – vocabulary
• s_0 ∈ S – start state
• R: S × Σ → S – transition function
• What does it do?
  – Maps input w_1, …, w_n to states s_1, …, s_n
  – For all i ∈ 1, …, n: s_i = R(s_{i-1}, w_i)
• Can we use it for POS tagging? Language modeling?

Page 6

Types of State Machines
• Acceptor
  – Compute final state s_n and make a decision based on it: y = O(s_n)
• Transducers
  – Apply function y_i = O(s_i) to produce output for each intermediate state
• Encoders
  – Compute final state s_n, and use it in another model

Page 7

Recurrent Neural Networks

• Motivation:
  – Neural network model, but with state
  – How can we borrow ideas from FSMs?
• RNNs are FSMs …
  – … with a twist
  – No longer finite in the same sense

Page 8

RNN

• S = ℝ^d_hid – hidden state space
• Σ = ℝ^d_in – input state space
• s_0 ∈ S – initial state vector
• R: ℝ^d_in × ℝ^d_hid → ℝ^d_hid – transition function
• Simple definition of R:
  R_simple(s, x) = tanh([x; s] W + b)

Elman (1990)
* Notation: vectors and matrices are bold
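A minimal PyTorch sketch of this transition, with illustrative names and sizes (not code from the course):

```python
import torch
import torch.nn as nn

class ElmanCell(nn.Module):
    """Simple RNN transition: R_simple(s, x) = tanh([x; s] W + b)."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in + d_hid, d_hid) * 0.1)
        self.b = nn.Parameter(torch.zeros(d_hid))

    def forward(self, s, x):
        # x: (batch, d_in), s: (batch, d_hid) -> next state: (batch, d_hid)
        return torch.tanh(torch.cat([x, s], dim=-1) @ self.W + self.b)
```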

Page 9

RNN
• Map from dense sequence to dense representation
  – x_1, …, x_n → s_1, …, s_n
  – For all i ∈ 1, …, n: s_i = R(s_{i-1}, x_i)
  – R is parameterized, and parameters are shared between all steps
  – Example: s_4 = R(s_3, x_4) = ⋯ = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
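Running the shared transition over a sequence is just a loop over time steps; a sketch reusing the hypothetical ElmanCell above:

```python
def run_rnn(cell, s0, xs):
    """Map inputs x_1, ..., x_n to states s_1, ..., s_n with a shared transition."""
    states, s = [], s0
    for x in xs:               # the same parameters are reused at every step
        s = cell(s, x)
        states.append(s)
    return states
```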

Page 10

RNNs
• Hidden states s_i can be used in different ways
• Similar to finite state machines
  – Acceptor
  – Transducer
  – Encoder
• Output function maps vectors to symbols: O: ℝ^d_hid → ℝ^d_out
• For example: single layer + softmax
  O(s_i) = softmax(s_i W + b)

Page 11

Graphical Representation

Recursive Representation vs. Unrolled Representation

Page 12

Graphical Representation

Page 13

Training
• RNNs are trained with SGD and backprop
• Define loss over outputs
  – Depends on supervision and task
• Backpropagation through time (BPTT)
  – Use unrolled representation
  – Run forward propagation
  – Run backward propagation
  – Update all weights
• Weights are shared between time steps
  – Sum the contributions of each time step to the gradient
• Inefficient
  – Batching helps; it is common, but tricky to implement with variable-size models (good helper methods in PyTorch, non-issue with auto-batching in DyNet)
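A rough BPTT sketch for a single, unbatched example; `cell`, `out`, and `data` are assumed stand-ins rather than course code:

```python
import torch
import torch.nn as nn

cell = ElmanCell(d_in=100, d_hid=64)                 # hypothetical sizes
out = nn.Linear(64, 2)                               # e.g., binary sentiment
params = list(cell.parameters()) + list(out.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for xs, y in data:                # `data`: assumed iterable of (sequence, label)
    s = torch.zeros(1, 64)        # initial state s_0
    for x in xs:                  # forward pass over the unrolled graph
        s = cell(s, x)
    loss = loss_fn(out(s), y)
    optimizer.zero_grad()
    loss.backward()               # BPTT: each step's contribution to the shared
                                  # weights is summed automatically
    optimizer.step()
```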

Page 14

RNN: Acceptor Architecture
• Only care about the output from the last hidden state
• Train: supervised, loss on prediction
• Example:
  – Text classification
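A sketch of an acceptor text classifier using PyTorch's built-in `nn.RNN`; vocabulary size, label set, and dimensions are assumptions:

```python
import torch.nn as nn

class RNNAcceptor(nn.Module):
    """Classify a token sequence from the last hidden state only."""
    def __init__(self, vocab_size, n_classes, d_emb=64, d_hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)   # Elman-style RNN
        self.out = nn.Linear(d_hid, n_classes)

    def forward(self, tokens):                  # tokens: (batch, n) token ids
        _, s_n = self.rnn(self.embed(tokens))   # s_n: (1, batch, d_hid)
        return self.out(s_n.squeeze(0))         # prediction from the final state
```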

Page 15

Language Modeling
• Input: X = x_1, …, x_n
• Goal: compute p(X)
• Bi-gram decomposition:
  p(X) = ∏_{i=1}^{n} p(x_i | x_{i-1})
• With RNNs, can do non-Markovian models:
  p(X) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})

Page 16

RNN: Transducer Architecture

• Predict output for every time step

Page 17

Language Modeling
• Input: X = x_1, …, x_n
• Goal: compute p(X)
• Model:
  p(X) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})
  p(x_i | x_1, …, x_{i-1}) = O(s_i) = O(R(s_{i-1}, x_{i-1}))
  O(s_i) = softmax(s_i W + b)
• Predict next token ŷ_i as we go: ŷ_i = argmax O(s_i)

Page 18

RNN: Transducer Architecture
• Predict output for every time step
• Examples:
  – Language modeling
  – POS tagging
  – NER

Page 19

RNN: Transducer Architecture

X = x_1, …, x_n
s_i = R(s_{i-1}, x_i), i = 1, …, n
O(s_i) = softmax(s_i W + b)
y_i = argmax O(s_i)
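A transducer sketch in the same spirit: one output per time step, usable for tagging (or for language modeling when the label set is the vocabulary). All names and sizes are illustrative:

```python
import torch.nn as nn

class RNNTransducer(nn.Module):
    """Predict an output symbol at every time step."""
    def __init__(self, vocab_size, n_labels, d_emb=64, d_hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_labels)               # O(s_i)

    def forward(self, tokens):                    # tokens: (batch, n)
        states, _ = self.rnn(self.embed(tokens))  # s_1..s_n: (batch, n, d_hid)
        return self.out(states)                   # one score vector per step
```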

Page 20

RNN: Encoder Architecture
• Similar to acceptor
• Difference: last state is used as input to another model and not for prediction
  O(s_n) = s_n → y_n = s_n
• Example:
  – Sentence embedding
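A minimal encoder sketch: the same machinery as the acceptor, except the final state itself is returned as the sentence embedding for some downstream model (names and sizes assumed):

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Return the final RNN state as a fixed-size sentence embedding."""
    def __init__(self, vocab_size, d_emb=64, d_hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)

    def forward(self, tokens):                  # tokens: (batch, n)
        _, s_n = self.rnn(self.embed(tokens))   # s_n: (1, batch, d_hid)
        return s_n.squeeze(0)                   # y = s_n, no prediction layer
```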

Page 21

Bidirectional RNNs
• RNN decisions are based on historical data only
  – How can we account for future input?
• When is it relevant? Feasible?

Page 22

Bidirectional RNNs
• RNN decisions are based on historical data only
  – How can we account for future input?
• When is it relevant? Feasible?
• When all the input is available. Not for real-time input.
• Probabilistic model, for example for language modeling:
  p(X) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1}, x_{i+1}, …, x_n)
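A bidirectional sketch: one RNN reads left-to-right, another right-to-left, and their states are concatenated per position (PyTorch's `nn.RNN(..., bidirectional=True)` packages the same idea); sizes are assumptions:

```python
import torch
import torch.nn as nn

class BiRNN(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.fwd = nn.RNN(d_in, d_hid, batch_first=True)
        self.bwd = nn.RNN(d_in, d_hid, batch_first=True)

    def forward(self, xs):                          # xs: (batch, n, d_in)
        f_states, _ = self.fwd(xs)                  # reads x_1 .. x_n
        b_states, _ = self.bwd(xs.flip(dims=[1]))   # reads x_n .. x_1
        b_states = b_states.flip(dims=[1])          # realign to positions 1..n
        return torch.cat([f_states, b_states], dim=-1)
```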

Page 23

Deep RNNs
• Can also make RNNs deeper (vertically) to increase model capacity

Page 24

RNN: Generator
• Special case of the transducer architecture
• Generation conditioned on s_0
• Probabilistic model:
  p(X | s_0) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1}, s_0)

Page 25

RNN: Generator

• Stop when generating the STOP token
• During learning (usually): force predicting the annotated token and compute loss

s_j = R(s_{j-1}, E(t_{j-1}))
O(s_j) = softmax(s_j W + b)
t_j = argmax O(s_j)
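A greedy decoding sketch matching these equations; the transition `R`, output layer `O`, embedding table `E`, and the start/stop token ids are assumed to exist with the shapes noted in the comments:

```python
import torch

def generate(R, O, E, s0, start_id, stop_id, max_len=50):
    """Greedily generate token ids until STOP (or max_len)."""
    s, t = s0, start_id                       # s: (1, d_hid)
    tokens = []
    for _ in range(max_len):
        s = R(s, E(torch.tensor([t])))        # s_j = R(s_{j-1}, E(t_{j-1}))
        t = O(s).argmax(dim=-1).item()        # t_j = argmax O(s_j)
        if t == stop_id:                      # stop on the STOP token
            break
        tokens.append(t)
    return tokens
```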

Page 26

Example: Caption Generation
• Given: image I
• Goal: generate caption
• Set s_0 = CNN(I)
• Model:
  p(X | I) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1}, I)

Examples from Karpathy and Fei-Fei 2015

Page 27

Sequence-to-Sequence
• Connect encoder and generator
• Many alternatives:
  – Set generator s_0^D to encoder output s_n^E
  – Concatenate s_n^E with each step input during generation
• Examples:
  – Machine translation
  – Chatbots
  – Dialog systems
• Can also generate other sequences – not only natural language!

Page 28

Sequence-to-Sequence

X = x_1, …, x_n
s_i^E = R^E(s_{i-1}^E, x_i), i = 1, …, n
c = O^E(s_n^E)
s_j^D = R^D(s_{j-1}^D, [E(t_{j-1}); c])
O^D(s_j^D) = softmax(s_j^D W + b)
t_j = argmax O^D(s_j^D)

Page 29

Sequence-to-Sequence Training Graph

Page 30

Long-range Interactions

• Promise: Learn long-range interactions of language from data

• Example:
  How can you not see this movie?
  You should not see this movie.
• Sometimes: requires "remembering" early state
  – Key signal here is at s_1, but gradient is at s_n

Page 31

Long-term Gradients
• Gradients go through (many) multiplications
• OK at end layers → close to the loss
• But: issue with early layers
• For example, derivative of tanh:
  d/dx tanh(x) = 1 − tanh²(x)
  – Large activation → gradient disappears (vanishes)
• In other activation functions, values can become larger and larger (explode)

Page 32

Exploding Gradients
• Common when there is no saturation in the activation (e.g., ReLU) and we get exponential blowup
• Result: reasonable short-term gradients, but bad long-term ones
• Common heuristic:
  – Gradient clipping: bounding all gradients by a maximum value
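A sketch of the clipping heuristic inside one update step; `model`, `optimizer`, and `loss` are assumed to come from a training loop like the earlier BPTT sketch:

```python
import torch

def clipped_step(model, optimizer, loss, max_norm=5.0):
    """One parameter update with gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```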

Page 33

Vanishing Gradients

• Occurs when multiplying small values
  – For example: when tanh saturates
• Mainly affects long-term gradients
• Solving this is more complex

Page 34

Long Short-term Memory (LSTM)

Hochreiter and Schmidhuber (1997)
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 35

LSTM vs. Elman RNN

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 36

LSTM

Image by Tim Rocktäschel

[LSTM diagram: cell state, input, output, and gates σ(·)]

f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_c [h_{t-1}, x_t] + b_c)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t ∘ tanh(c_t)
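A direct, unoptimized rendering of these equations as a sketch (in practice one would use `nn.LSTM`); it only makes the gating explicit:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.Wf = nn.Linear(d_hid + d_in, d_hid)   # forget gate
        self.Wi = nn.Linear(d_hid + d_in, d_hid)   # input gate
        self.Wc = nn.Linear(d_hid + d_in, d_hid)   # candidate cell update
        self.Wo = nn.Linear(d_hid + d_in, d_hid)   # output gate

    def forward(self, h, c, x):
        hx = torch.cat([h, x], dim=-1)             # [h_{t-1}, x_t]
        f = torch.sigmoid(self.Wf(hx))             # f_t
        i = torch.sigmoid(self.Wi(hx))             # i_t
        c = f * c + i * torch.tanh(self.Wc(hx))    # c_t
        o = torch.sigmoid(self.Wo(hx))             # o_t
        h = o * torch.tanh(c)                      # h_t
        return h, c
```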

Page 37

Attention
• In seq-to-seq models, a single vector connects encoding and decoding
  – Any concern?
  – All the input string information must be encoded into a fixed-length vector
  – The decoder must recover all this information from a fixed-length vector
• Attention relaxes the assumption that a single vector must be used to encode the input sentence regardless of length

Page 38

Attention

• Encode input sentence as a sequence of vectors

• At each step: pick what vector to use
• But: discrete choice is not differentiable
  – Make the choice soft

Page 39

Attention

[Figure (from Goldberg, "Conditioned Generation with Attention"): a sequence-to-sequence RNN generator with attention. A biRNN encodes the conditioning sequence; at each decoder step, an attend operation over the encoded vectors produces a context vector c_j, which is concatenated with the step input before prediction.]

Page 40

Attention

[Same figure as the previous slide, with the attend operation made explicit below.]

Attend:
X = x_1, …, x_n
s_i^E = R^E(s_{i-1}^E, x_i), i = 1, …, n
c_i = O^E(s_i^E)
α_{j,i} = s_{j-1}^D · c_i
α_j = softmax(α_{j,1}, …, α_{j,n})
c^j = Σ_{i=1}^{n} α_{j,i} c_i
s_j^D = R^D(s_{j-1}^D, [E(t_{j-1}); c^j])
O^D(s_j^D) = softmax(s_j^D W + b)
t_j = argmax O^D(s_j^D)

Page 41

Attention

• Many variants of attention function
  – Dot product (previous slide)
  – MLP
  – Bi-linear transformation

• Various ways to combine context vector into decoder computation

• See Luong et al. 2015