
Page 1

Recurrent Neural Networks

Instructor: Yoav Artzi

CS5740: Natural Language Processing

Adapted from Yoav Goldberg’s Book and slides by Sasha Rush

Page 2

Overview

• Finite state models
• Recurrent neural networks (RNNs)
• Training RNNs
• RNN Models
• Long short-term memory (LSTM)
• Attention

Page 3

Text Classification
• Consider the example:
  – Goal: classify sentiment

  How can you not see this movie?
  You should not see this movie.

• Model: bag of words
• How well will the classifier work?
  – Similar unigrams and bigrams

• Generally: need to maintain a state to capture distant influences

Page 4

Finite State Machines

• Simple, classical way of representing state

• Current state: saves necessary past information

• Example: email address parsing

Page 5

Deterministic Finite State Machines
• S – states
• Σ – vocabulary
• s_0 ∈ S – start state
• R: S × Σ → S – transition function
• What does it do?
  – Maps input w_1, …, w_n to states s_1, …, s_n
  – For all i ∈ 1, …, n: s_i = R(s_{i-1}, w_i)
• Can we use it for POS tagging? Language modeling?

Page 6

Types of State Machines
• Acceptor
  – Compute final state s_n and make a decision based on it: y = O(s_n)
• Transducers
  – Apply function y_i = O(s_i) to produce output for each intermediate state
• Encoders
  – Compute final state s_n, and use it in another model

Page 7

Recurrent Neural Networks

• Motivation:
  – Neural network model, but with state
  – How can we borrow ideas from FSMs?
• RNNs are FSMs …
  – … with a twist
  – No longer finite in the same sense

Page 8

RNN

• S = ℝ^d_hid – hidden state space
• Σ = ℝ^d_in – input state space
• s_0 ∈ S – initial state vector
• R: ℝ^d_in × ℝ^d_hid → ℝ^d_hid – transition function
• Simple definition of R:
  R_simple(s, x) = tanh([x; s] W + b)

Elman (1990)
* Notation: vectors and matrices are bold
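A minimal PyTorch sketch of this transition, with illustrative names and sizes (not code from the course):

```python
import torch
import torch.nn as nn

class ElmanCell(nn.Module):
    """Simple RNN transition: R_simple(s, x) = tanh([x; s] W + b)."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in + d_hid, d_hid) * 0.1)
        self.b = nn.Parameter(torch.zeros(d_hid))

    def forward(self, s, x):
        # x: (batch, d_in), s: (batch, d_hid) -> next state: (batch, d_hid)
        return torch.tanh(torch.cat([x, s], dim=-1) @ self.W + self.b)
```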

Page 9

RNN
• Map from dense sequence to dense representation
  – x_1, …, x_n → s_1, …, s_n
  – For all i ∈ 1, …, n: s_i = R(s_{i-1}, x_i)
  – R is parameterized, and parameters are shared between all steps
  – Example: s_4 = R(s_3, x_4) = ⋯ = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
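Running the shared transition over a sequence is just a loop over time steps; a sketch reusing the hypothetical ElmanCell above:

```python
def run_rnn(cell, s0, xs):
    """Map inputs x_1, ..., x_n to states s_1, ..., s_n with a shared transition."""
    states, s = [], s0
    for x in xs:               # the same parameters are reused at every step
        s = cell(s, x)
        states.append(s)
    return states
```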

Page 10

RNNs
• Hidden states s_i can be used in different ways
• Similar to finite state machines
  – Acceptor
  – Transducer
  – Encoder
• Output function maps vectors to symbols: O: ℝ^d_hid → ℝ^d_out
• For example: single layer + softmax
  O(s_i) = softmax(s_i W + b)

Page 11

Graphical Representation

Recursive Representation vs. Unrolled Representation

Page 12

Graphical Representation

Page 13

Training
• RNNs are trained with SGD and backprop
• Define loss over outputs
  – Depends on supervision and task
• Backpropagation through time (BPTT)
  – Use unrolled representation
  – Run forward propagation
  – Run backward propagation
  – Update all weights
• Weights are shared between time steps
  – Sum the contributions of each time step to the gradient
• Inefficient
  – Batching helps; it is common, but tricky to implement with variable-size models (good helper methods in PyTorch, non-issue with auto-batching in DyNet)
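A rough BPTT sketch for a single, unbatched example; `cell`, `out`, and `data` are assumed stand-ins rather than course code:

```python
import torch
import torch.nn as nn

cell = ElmanCell(d_in=100, d_hid=64)                 # hypothetical sizes
out = nn.Linear(64, 2)                               # e.g., binary sentiment
params = list(cell.parameters()) + list(out.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for xs, y in data:                # `data`: assumed iterable of (sequence, label)
    s = torch.zeros(1, 64)        # initial state s_0
    for x in xs:                  # forward pass over the unrolled graph
        s = cell(s, x)
    loss = loss_fn(out(s), y)
    optimizer.zero_grad()
    loss.backward()               # BPTT: each step's contribution to the shared
                                  # weights is summed automatically
    optimizer.step()
```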

Page 14

RNN: Acceptor Architecture
• Only care about the output from the last hidden state
• Train: supervised, loss on prediction
• Example:
  – Text classification
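A sketch of an acceptor text classifier using PyTorch's built-in `nn.RNN`; vocabulary size, label set, and dimensions are assumptions:

```python
import torch.nn as nn

class RNNAcceptor(nn.Module):
    """Classify a token sequence from the last hidden state only."""
    def __init__(self, vocab_size, n_classes, d_emb=64, d_hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)   # Elman-style RNN
        self.out = nn.Linear(d_hid, n_classes)

    def forward(self, tokens):                  # tokens: (batch, n) token ids
        _, s_n = self.rnn(self.embed(tokens))   # s_n: (1, batch, d_hid)
        return self.out(s_n.squeeze(0))         # prediction from the final state
```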

Page 15

Language Modeling
• Input: X = x_1, …, x_n
• Goal: compute p(X)
• Bi-gram decomposition:
  p(X) = ∏_{i=1}^{n} p(x_i | x_{i-1})
• With RNNs, can do non-Markovian models:
  p(X) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})

Page 16

RNN: Transducer Architecture

• Predict output for every time step

Page 17

Language Modeling
• Input: X = x_1, …, x_n
• Goal: compute p(X)
• Model:
  p(X) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})
  p(x_i | x_1, …, x_{i-1}) = O(s_i) = O(R(s_{i-1}, x_{i-1}))
  O(s_i) = softmax(s_i W + b)
• Predict next token ŷ_i as we go: ŷ_i = argmax O(s_i)

Page 18

RNN: Transducer Architecture
• Predict output for every time step
• Examples:
  – Language modeling
  – POS tagging
  – NER

Page 19

RNN: Transducer Architecture

X = x_1, …, x_n
s_i = R(s_{i-1}, x_i), i = 1, …, n
O(s_i) = softmax(s_i W + b)
y_i = argmax O(s_i)
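A transducer sketch in the same spirit: one output per time step, usable for tagging (or for language modeling when the label set is the vocabulary). All names and sizes are illustrative:

```python
import torch.nn as nn

class RNNTransducer(nn.Module):
    """Predict an output symbol at every time step."""
    def __init__(self, vocab_size, n_labels, d_emb=64, d_hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_labels)               # O(s_i)

    def forward(self, tokens):                    # tokens: (batch, n)
        states, _ = self.rnn(self.embed(tokens))  # s_1..s_n: (batch, n, d_hid)
        return self.out(states)                   # one score vector per step
```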

Page 20

RNN: Encoder Architecture
• Similar to acceptor
• Difference: last state is used as input to another model and not for prediction
  O(s_n) = s_n → y_n = s_n
• Example:
  – Sentence embedding
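A minimal encoder sketch: the same machinery as the acceptor, except the final state itself is returned as the sentence embedding for some downstream model (names and sizes assumed):

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Return the final RNN state as a fixed-size sentence embedding."""
    def __init__(self, vocab_size, d_emb=64, d_hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)

    def forward(self, tokens):                  # tokens: (batch, n)
        _, s_n = self.rnn(self.embed(tokens))   # s_n: (1, batch, d_hid)
        return s_n.squeeze(0)                   # y = s_n, no prediction layer
```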

Page 21

Bidirectional RNNs
• RNN decisions are based on historical data only
  – How can we account for future input?
• When is it relevant? Feasible?

Page 22

Bidirectional RNNs
• RNN decisions are based on historical data only
  – How can we account for future input?
• When is it relevant? Feasible?
• When all the input is available. Not for real-time input.
• Probabilistic model, for example for language modeling:
  p(X) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1}, x_{i+1}, …, x_n)
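A bidirectional sketch: one RNN reads left-to-right, another right-to-left, and their states are concatenated per position (PyTorch's `nn.RNN(..., bidirectional=True)` packages the same idea); sizes are assumptions:

```python
import torch
import torch.nn as nn

class BiRNN(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.fwd = nn.RNN(d_in, d_hid, batch_first=True)
        self.bwd = nn.RNN(d_in, d_hid, batch_first=True)

    def forward(self, xs):                          # xs: (batch, n, d_in)
        f_states, _ = self.fwd(xs)                  # reads x_1 .. x_n
        b_states, _ = self.bwd(xs.flip(dims=[1]))   # reads x_n .. x_1
        b_states = b_states.flip(dims=[1])          # realign to positions 1..n
        return torch.cat([f_states, b_states], dim=-1)
```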

Page 23

Deep RNNs
• Can also make RNNs deeper (vertically) to increase model capacity

Page 24

RNN: Generator
• Special case of the transducer architecture
• Generation conditioned on s_0
• Probabilistic model:
  p(X | s_0) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1}, s_0)

Page 25

RNN: Generator

• Stop when generating the STOP token
• During learning (usually): force predicting the annotated token and compute loss

s_j = R(s_{j-1}, E(t_{j-1}))
O(s_j) = softmax(s_j W + b)
t_j = argmax O(s_j)
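A greedy decoding sketch matching these equations; the transition `R`, output layer `O`, embedding table `E`, and the start/stop token ids are assumed to exist with the shapes noted in the comments:

```python
import torch

def generate(R, O, E, s0, start_id, stop_id, max_len=50):
    """Greedily generate token ids until STOP (or max_len)."""
    s, t = s0, start_id                       # s: (1, d_hid)
    tokens = []
    for _ in range(max_len):
        s = R(s, E(torch.tensor([t])))        # s_j = R(s_{j-1}, E(t_{j-1}))
        t = O(s).argmax(dim=-1).item()        # t_j = argmax O(s_j)
        if t == stop_id:                      # stop on the STOP token
            break
        tokens.append(t)
    return tokens
```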

Page 26

Example: Caption Generation
• Given: image I
• Goal: generate caption
• Set s_0 = CNN(I)
• Model:
  p(X | I) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1}, I)

Examples from Karpathy and Fei-Fei 2015

Page 27

Sequence-to-Sequence
• Connect encoder and generator
• Many alternatives:
  – Set generator s_0^D to encoder output s_n^E
  – Concatenate s_n^E with each step input during generation
• Examples:
  – Machine translation
  – Chatbots
  – Dialog systems
• Can also generate other sequences – not only natural language!

Page 28

Sequence-to-Sequence

X = x_1, …, x_n
s_i^E = R^E(s_{i-1}^E, x_i), i = 1, …, n
c = O^E(s_n^E)
s_j^D = R^D(s_{j-1}^D, [E(t_{j-1}); c])
O^D(s_j^D) = softmax(s_j^D W + b)
t_j = argmax O^D(s_j^D)

Page 29

Sequence-to-Sequence Training Graph

Page 30

Long-range Interactions

• Promise: Learn long-range interactions of language from data

• Example:
  How can you not see this movie?
  You should not see this movie.
• Sometimes: requires "remembering" early state
  – Key signal here is at s_1, but gradient is at s_n

Page 31

Long-term Gradients
• Gradients go through (many) multiplications
• OK at end layers → close to the loss
• But: issue with early layers
• For example, derivative of tanh:
  d/dx tanh(x) = 1 − tanh²(x)
  – Large activation → gradient disappears (vanishes)
• In other activation functions, values can become larger and larger (explode)

Page 32

Exploding Gradients
• Common when there is no saturation in the activation (e.g., ReLU) and we get exponential blowup
• Result: reasonable short-term gradients, but bad long-term ones
• Common heuristic:
  – Gradient clipping: bounding all gradients by a maximum value
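A sketch of the clipping heuristic inside one update step; `model`, `optimizer`, and `loss` are assumed to come from a training loop like the earlier BPTT sketch:

```python
import torch

def clipped_step(model, optimizer, loss, max_norm=5.0):
    """One parameter update with gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```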

Page 33

Vanishing Gradients

• Occurs when multiplying small values
  – For example: when tanh saturates
• Mainly affects long-term gradients
• Solving this is more complex

Page 34

Long Short-term Memory (LSTM)

Hochreiter and Schmidhuber (1997)
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 35

LSTM vs. Elman RNN

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 36

LSTM

Image by Tim Rocktäschel

[LSTM diagram: cell state, input, output, and gates σ(·)]

f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_c [h_{t-1}, x_t] + b_c)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t ∘ tanh(c_t)
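A direct, unoptimized rendering of these equations as a sketch (in practice one would use `nn.LSTM`); it only makes the gating explicit:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.Wf = nn.Linear(d_hid + d_in, d_hid)   # forget gate
        self.Wi = nn.Linear(d_hid + d_in, d_hid)   # input gate
        self.Wc = nn.Linear(d_hid + d_in, d_hid)   # candidate cell update
        self.Wo = nn.Linear(d_hid + d_in, d_hid)   # output gate

    def forward(self, h, c, x):
        hx = torch.cat([h, x], dim=-1)             # [h_{t-1}, x_t]
        f = torch.sigmoid(self.Wf(hx))             # f_t
        i = torch.sigmoid(self.Wi(hx))             # i_t
        c = f * c + i * torch.tanh(self.Wc(hx))    # c_t
        o = torch.sigmoid(self.Wo(hx))             # o_t
        h = o * torch.tanh(c)                      # h_t
        return h, c
```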

Page 37

Attention
• In seq-to-seq models, a single vector connects encoding and decoding
  – Any concern?
  – All the input string information must be encoded into a fixed-length vector
  – The decoder must recover all this information from a fixed-length vector
• Attention relaxes the assumption that a single vector must be used to encode the input sentence regardless of length

Page 38

Attention

• Encode input sentence as a sequence of vectors

• At each step: pick what vector to use
• But: discrete choice is not differentiable
  – Make the choice soft

Page 39

Attention

[Figure (from Goldberg, "Conditioned Generation with Attention"): a sequence-to-sequence RNN generator with attention. A biRNN encodes the conditioning sequence; at each decoder step, an attend operation over the encoded vectors produces a context vector c_j, which is concatenated with the step input before prediction.]

Page 40

Attention

[Same figure as the previous slide, with the attend operation made explicit below.]

Attend:
X = x_1, …, x_n
s_i^E = R^E(s_{i-1}^E, x_i), i = 1, …, n
c_i = O^E(s_i^E)
α_{j,i} = s_{j-1}^D · c_i
α_j = softmax(α_{j,1}, …, α_{j,n})
c^j = Σ_{i=1}^{n} α_{j,i} c_i
s_j^D = R^D(s_{j-1}^D, [E(t_{j-1}); c^j])
O^D(s_j^D) = softmax(s_j^D W + b)
t_j = argmax O^D(s_j^D)

Page 41

Attention

• Many variants of attention function
  – Dot product (previous slide)
  – MLP
  – Bi-linear transformation

• Various ways to combine context vector into decoder computation

• See Luong et al. 2015