Conditional Language Modeling
Chris Dyer
DeepMind / Carnegie Mellon University
August 30, 2017, MT Marathon 2017
Review: Unconditional LMs

A language model assigns probabilities to sequences of words, w = (w_1, w_2, …, w_ℓ).

We saw that it is helpful to decompose this probability using the chain rule, as follows:

p(w) = p(w_1) × p(w_2 | w_1) × p(w_3 | w_1, w_2) × ⋯ × p(w_ℓ | w_1, …, w_{ℓ-1})
     = ∏_{t=1}^{|w|} p(w_t | w_1, …, w_{t-1})

This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.
Unconditional LMs with RNNs

(Figure, built up over several slides: an RNN unrolled over time. The initial state h_0 and the start symbol <s> produce h_1; a softmax over h_1 gives a distribution p_1, from which "tom" is sampled. Feeding each sampled word back in produces h_2, h_3, h_4, from which "likes", "beer", and </s> are sampled in turn. The sampled sequence has probability p(tom | <s>) × p(likes | <s>, tom) × p(beer | <s>, tom, likes) × p(</s> | <s>, tom, likes, beer).)
Unconditional LMs with RNNs: Training

(Figure: the same unrolled network, now with the observed words "tom likes beer </s>" as targets. Each time step incurs a cost, cost_1, …, cost_4: the log loss / cross entropy of the observed word under the predicted distribution. The per-step costs are summed into the training objective F.)
Unconditional LMs with RNNs

(Figure: the unrolled network annotated. Each input w_1, …, w_4 is an observed context word mapped to a vector (word embedding); h_1, …, h_4 are RNN hidden state vectors; the softmax output, a vector of length |vocab|, gives p(W_5 | w_1, w_2, w_3, w_4), the distribution over the next-word random variable.)
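The unrolled sampling computation above can be sketched in code. Below is a minimal, untrained numpy version with a toy vocabulary and random parameters; all concrete names and dimensions are illustrative, not from the slides.

```python
import numpy as np

# Sketch of an unconditional RNN LM: h_t = g(W[h_{t-1}; x_t] + b),
# p_t = softmax(P h_t + b'), then sample w_t ~ p_t and feed it back in.
rng = np.random.default_rng(0)
vocab = ["<s>", "tom", "likes", "beer", "</s>"]
V, n, d = len(vocab), 8, 8          # vocab size, embedding dim, hidden dim

E = rng.normal(0, 0.1, (V, n))      # word embeddings x_t
W = rng.normal(0, 0.1, (d, d + n))  # recurrence weights
b = np.zeros(d)
P = rng.normal(0, 0.1, (V, d))      # output projection
b_out = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def sample(max_len=10):
    """Sample a sentence and accumulate its log probability."""
    h = np.zeros(d)
    w, words, logp = "<s>", [], 0.0
    for _ in range(max_len):
        x = E[vocab.index(w)]
        h = np.tanh(W @ np.concatenate([h, x]) + b)   # recurrence
        p = softmax(P @ h + b_out)                    # next-word distribution
        w = rng.choice(vocab, p=p)                    # sample
        logp += np.log(p[vocab.index(w)])
        if w == "</s>":
            break
        words.append(w)
    return words, logp

words, logp = sample()
print(words, logp)
```

Training (the slide after this one) would instead feed in the observed words and sum the per-step cross-entropy costs, i.e. the negative of the accumulated log probability here.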
Conditional LMs

A conditional language model assigns probabilities to sequences of words, w = (w_1, w_2, …, w_ℓ), given some conditioning context, x.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability:

p(w | x) = ∏_{t=1}^{ℓ} p(w_t | x, w_1, w_2, …, w_{t-1})

What is the probability of the next word, given the history of previously generated words and conditioning context x?
Conditional LMs

“input” x                             “text output” w
An author                             A document written by that author
A topic label                         An article about that topic
{SPAM, NOT_SPAM}                      An email
A sentence in Portuguese              Its English translation
A sentence in English                 Its Portuguese translation
A sentence in English                 Its Chinese translation
An image                              A text description of the image
A document                            Its summary
A document                            Its translation
Meteorological measurements           A weather report
Acoustic signal                       Transcription of speech
Conversational history + database     Dialogue system response
A question + a document               Its answer
A question + an image                 Its answer
Data for training conditional LMs

To train conditional language models, we need paired samples, {(x_i, w_i)}_{i=1}^N.

Data availability varies. It is easy to think of tasks that could be solved by conditional language models for which the data simply does not exist.

Relatively large amounts of data exist for: translation, summarisation, caption generation, speech recognition.
Section overview

The rest of this section will look at “encoder-decoder” models that learn a function that maps x into a fixed-size vector and then uses a language model to “decode” that vector into a sequence of words, w.

x: Kunst kann nicht gelehrt werden… (German)
w: Artistry can’t be taught…

x: (an image)
w: A dog is playing on the beach.
Section overview

• Two questions
  • How do we encode x as a fixed-size vector, c?
    (Problem-specific, or at least modality-specific; think about assumptions.)
  • How do we condition on c in the decoding model?
    (Less problem-specific; we will review solutions/architectures.)
Kalchbrenner and Blunsom 2013

Encoder (of the source sentence):  c = embed(x),  s = Vc

Recurrent decoder. Recall the unconditional RNN:

h_t = g(W[h_{t-1}; w_{t-1}] + b)

The conditional version adds s, which enters like a learnt bias (w_{t-1} is the embedding of the previous word):

h_t = g(W[h_{t-1}; w_{t-1}] + s + b)
u_t = P h_t + b′
p(W_t | x, w_{<t}) = softmax(u_t)
K&B 2013: Encoder

How should we define c = embed(x)? The simplest model possible: sum the word vectors,

c = Σ_i x_i

What do you think of this model?
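A numpy sketch of this additive encoder, with random embeddings standing in for learned parameters, makes its main weakness easy to see: summing discards word order entirely.

```python
import numpy as np

# Sketch of the simplest encoder: embed each source word and sum,
# c = sum_i x_i; then s = Vc conditions the decoder. The embeddings and V
# are random stand-ins for learned parameters.
rng = np.random.default_rng(1)
n, d = 6, 4
source = ["Ich", "moechte", "ein", "Bier"]
embed = {w: rng.normal(size=n) for w in source}

c = np.sum([embed[w] for w in source], axis=0)   # c = sum_i x_i
V = rng.normal(size=(d, n))
s = V @ c                                        # conditioning vector

# The model is order-invariant: a permuted sentence gets the same encoding.
c_perm = np.sum([embed[w] for w in reversed(source)], axis=0)
assert np.allclose(c, c_perm)
print(c.shape, s.shape)
```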
K&B 2013: CSM Encoder

How should we define c = embed(x)? Use a convolutional sentence model (CSM).
K&B 2013: RNN Decoder

(Figure, built up over several slides: the conditional RNN decoder unrolled, with the source encoding s feeding into every hidden state. Starting from h_0 and <s>, each softmax produces a distribution from which a word is sampled and fed back in, giving the product p(tom | s, <s>) × p(likes | s, <s>, tom) × p(beer | s, <s>, tom, likes) × p(</s> | s, <s>, tom, likes, beer).)
Sutskever et al. (2014)

LSTM encoder:

(c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1}),  where (c_0, h_0) are parameters.

The encoding is (c_ℓ, h_ℓ), where ℓ = |x|.

LSTM decoder, with w_0 = <s>:

(c_{t+ℓ}, h_{t+ℓ}) = LSTM(w_{t-1}, c_{t+ℓ-1}, h_{t+ℓ-1})
u_t = P h_{t+ℓ} + b
p(W_t | x, w_{<t}) = softmax(u_t)
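The wiring of this setup can be sketched with a single hand-rolled LSTM cell that first reads the source and then keeps running to emit the target. Parameters are random, so the output is meaningless; the point is the shared recurrent state, not a trained model. The vocabularies and greedy decoding loop are illustrative additions.

```python
import numpy as np

# Sketch of the Sutskever et al. (2014) setup: one LSTM reads the source,
# then continues from its final (c, h) to decode the target greedily.
rng = np.random.default_rng(2)
n, d = 8, 8
src_vocab = ["Aller", "Anfang", "ist", "schwer"]
tgt_vocab = ["START", "Beginnings", "are", "difficult", "STOP"]
E_src = {w: rng.normal(0, 0.1, n) for w in src_vocab}
E_tgt = {w: rng.normal(0, 0.1, n) for w in tgt_vocab}
Wg = rng.normal(0, 0.1, (4 * d, d + n))   # gates: input, forget, output, candidate
P = rng.normal(0, 0.1, (len(tgt_vocab), d))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm(x, c, h):
    z = Wg @ np.concatenate([h, x])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return c, sigmoid(o) * np.tanh(c)

c, h = np.zeros(d), np.zeros(d)           # (c_0, h_0): parameters, here zeros
for w in reversed(src_vocab):             # trick: read the input "backwards"
    c, h = lstm(E_src[w], c, h)

out, w = [], "START"
for _ in range(10):                       # greedy decode from the final state
    c, h = lstm(E_tgt[w], c, h)
    w = tgt_vocab[int(np.argmax(P @ h))]
    if w == "STOP":
        break
    out.append(w)
print(out)
```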
Sutskever et al. (2014)

(Figure, built up over several slides: the LSTM reads “Aller Anfang ist schwer” (German: “All beginnings are difficult”) followed by START; its final state c then decodes “Beginnings are difficult STOP” one word at a time, each output fed back in as the next input.)
Sutskever et al. (2014)

• Good
  • RNNs deal naturally with sequences of various lengths
  • LSTMs can, in principle, propagate gradients a long distance
  • Very simple architecture!
• Bad
  • The hidden state has to remember a lot of information!

Sutskever et al. (2014): Tricks

Read the input sequence “backwards”: +4 BLEU
Sutskever et al. (2014): Tricks

Use an ensemble of J independently trained models. An ensemble of 2 models gives +3 BLEU; an ensemble of 5 models gives +4.5 BLEU.

Decoder:

(c^{(j)}_{t+ℓ}, h^{(j)}_{t+ℓ}) = LSTM^{(j)}(w_{t-1}, c^{(j)}_{t+ℓ-1}, h^{(j)}_{t+ℓ-1})
u^{(j)}_t = P h^{(j)}_t + b^{(j)}
u_t = (1/J) Σ_{j=1}^{J} u^{(j)}_t
p(W_t | x, w_{<t}) = softmax(u_t)
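The ensemble combination is just logit averaging before the softmax, which a few lines of numpy make concrete (the per-model logits here are random stand-ins for J trained decoders):

```python
import numpy as np

# Sketch of the ensemble trick: J models each produce logits u_t^(j) at
# step t; average them, then apply one softmax.
rng = np.random.default_rng(3)
J, V = 5, 4                      # 5 models, toy vocabulary of 4 words

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u_j = rng.normal(size=(J, V))    # u_t^(j) from each model at step t
u_t = u_j.mean(axis=0)           # u_t = (1/J) sum_j u_t^(j)
p = softmax(u_t)
print(p)
```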
A word about decoding

In general, we want to find the most probable (MAP) output given the input, i.e.

w* = argmax_w p(w | x)
   = argmax_w Σ_{t=1}^{|w|} log p(w_t | x, w_{<t})

This is, for general RNNs, a hard problem (undecidable, in the unbounded case). We therefore approximate it with a greedy search:

w*_1 = argmax_{w_1} p(w_1 | x)
w*_2 = argmax_{w_2} p(w_2 | x, w*_1)
⋮
w*_t = argmax_{w_t} p(w_t | x, w*_{<t})
A word about decoding

A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses.

E.g., for b = 2, with x = “Bier trinke ich” (gloss: beer drink I), the search tree (reconstructed from the figure) is:

w_0: <s>  (logprob = 0)
w_1: beer (logprob = −1.82); I (logprob = −2.11)
w_2: expanding “beer”: I (−5.80), drink (−6.93); expanding “I”: drink (−2.87), beer (−8.66). The beam keeps “I drink” and “beer I”.
w_3: expanding “I drink”: beer (−3.04), wine (−5.12); expanding “beer I”: drink (−6.28), like (−7.31). The best hypothesis is “I drink beer” with logprob −3.04.
Sutskever et al. (2014): Tricks

Use beam search: +1 BLEU
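Beam search itself is a short algorithm once the next-word model is abstracted away. The sketch below uses a hand-made table of conditional log probabilities chosen so that the cumulative scores reproduce the numbers in the “Bier trinke ich” figure; the table entries themselves are illustrative.

```python
# Sketch of beam search over any conditional next-word model. `step` maps a
# prefix to {word: log p(word | x, prefix)}; here a tiny hand-made table
# stands in for an RNN, with values picked so cumulative scores match the
# slides' example (e.g. "I drink beer" ends at -3.04).
TABLE = {
    (): {"beer": -1.82, "I": -2.11},
    ("beer",): {"I": -3.98, "drink": -5.11},
    ("I",): {"drink": -0.76, "beer": -6.55},
    ("I", "drink"): {"beer": -0.17, "wine": -2.25},
    ("beer", "I"): {"drink": -0.48, "like": -1.51},
    ("I", "drink", "beer"): {"</s>": 0.0},
    ("I", "drink", "wine"): {"</s>": 0.0},
    ("beer", "I", "drink"): {"</s>": 0.0},
    ("beer", "I", "like"): {"</s>": 0.0},
}

def step(prefix):
    return TABLE[tuple(prefix)]

def beam_search(step, b=2, max_len=4):
    beam = [([], 0.0)]                    # (prefix, total log prob)
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beam:
            if prefix and prefix[-1] == "</s>":
                candidates.append((prefix, lp))   # finished: carry forward
                continue
            for w, wlp in step(prefix).items():
                candidates.append((prefix + [w], lp + wlp))
        beam = sorted(candidates, key=lambda c: -c[1])[:b]   # keep top b
    return beam[0]

best, lp = beam_search(step)
print(best, lp)
```

With b = 1 this reduces to the greedy search from the previous slide; with b equal to the vocabulary size raised to the length, it would be exact (and intractable).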
A conceptual digression

Consider translating “In der Innenstadt explodierte eine Autobombe” into “A car bomb exploded downtown”. The classical picture offers several levels at which the transfer can happen:

• Direct transfer between the surface strings.
• Syntax: transfer between syntactic analyses of the two sentences.
• Semantics (“logical form”): transfer between language-specific semantic representations, e.g.
  German: explodieren :arg0 Bombe :arg1 Auto :loc Innenstadt :tempus imperf
  English: detonate :arg0 bomb :arg1 car :loc downtown :time past
• Interlingua (“meaning”): a single language-independent representation, e.g.
  report_event[ factivity=true explode(e, bomb, car) loc(e, downtown) ]

Is the hidden vector in an encoder-decoder model an interlingua?
Conditioning with vectors

We are compressing a lot of information into a finite-sized vector.

“You can’t cram the meaning of a whole %&!$# sentence into a single $&!#* vector!” (Prof. Ray Mooney)

Gradients have a long way to travel. Even LSTMs forget!

What is to be done?
Outline of Final Section

• Machine translation with attention
• Image caption generation with attention
Solving the Vector Problem in Translation

• Represent a source sentence as a matrix
• Generate a target sentence from a matrix
• This will
  • solve the capacity problem
  • solve the gradient flow problem
Sentences as Matrices

• Problem with the fixed-size vector model
  • Sentences are of different sizes but vectors are of the same size
• Solution: use matrices instead
  • Fixed number of rows, but the number of columns depends on the number of words
  • Usually #cols = |f|
Sentences as Matrices

(Figure: matrices of different widths for “Ich möchte ein Bier”, “Mach’s gut”, and “Die Wahrheiten der Menschen sind die unwiderlegbaren Irrtümer”.)

Question: How do we build these matrices?
With Concatenation

• Each word type is represented by an n-dimensional vector
• Take all of the vectors for the sentence and concatenate them into a matrix
• Simplest possible model
• So simple, no one has bothered to publish how well/badly it works!

f_i = x_i,  F ∈ ℝ^{n×|f|}

(Figure: the embeddings x_1, …, x_4 of “Ich möchte ein Bier” stacked as the columns of F.)
With Convolutional Nets

• Apply convolutional networks to transform the naive concatenated matrix into a context-dependent matrix
• Explored in a recent ICLR submission by Gehring et al., 2016 (from FAIR)
• Closely related to the neural translation model proposed by Kalchbrenner and Blunsom, 2013
• Note: convnets usually have a “pooling” operation at the top level that results in a fixed-size representation. For sentences, leave this out.

F ∈ ℝ^{f(n)×g(|f|)}

(Figure: filters convolved over the embedding matrix of “Ich möchte ein Bier” produce F.)
With Bidirectional RNNs

• By far the most widely used matrix representation, due to Bahdanau et al. (2015)
• One column per word
• Each column (word) has two halves concatenated together:
  • a “forward representation”, i.e., a word and its left context
  • a “reverse representation”, i.e., a word and its right context
• Implementation: bidirectional RNNs (GRUs or LSTMs) read f from left to right and from right to left; concatenate the representations

f_i = [←h_i; →h_i],  F ∈ ℝ^{2n×|f|}

(Figure: forward states →h_1, …, →h_4 and backward states ←h_1, …, ←h_4 over “Ich möchte ein Bier”, concatenated column-wise to form F.)
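A numpy sketch of this construction, with a plain tanh RNN standing in for the GRU/LSTM the slide recommends (embeddings and weights are random stand-ins for learned parameters):

```python
import numpy as np

# Sketch of the Bahdanau-style matrix encoder: run a simple RNN left-to-right
# and right-to-left, concatenate the two states for each word, and stack the
# results as the columns of F in R^{2n x |f|}.
rng = np.random.default_rng(4)
n = 5
f = ["Ich", "moechte", "ein", "Bier"]
E = {w: rng.normal(0, 0.1, n) for w in f}
W_fwd = rng.normal(0, 0.1, (n, 2 * n))
W_bwd = rng.normal(0, 0.1, (n, 2 * n))

def run(words, W):
    """Run a tanh RNN over the words, returning one state per word."""
    h, states = np.zeros(n), []
    for w in words:
        h = np.tanh(W @ np.concatenate([h, E[w]]))
        states.append(h)
    return states

fwd = run(f, W_fwd)                # ->h_i: word plus its left context
bwd = run(f[::-1], W_bwd)[::-1]    # <-h_i: word plus its right context
F = np.stack([np.concatenate([b_i, f_i]) for b_i, f_i in zip(bwd, fwd)], axis=1)
print(F.shape)                     # (2n, |f|)
```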
Where are we in 2017?

• There are lots of ways to construct F
• And currently lots of work looking at alternatives!
• Some particularly exciting trends want to get rid of recurrences when processing the input.
  • Convolutions are making a comeback (especially at Facebook)
  • “Stacked self-attention” (the “Attention Is All You Need” model from Google Brain) is another strategy.
• Still many innovations are possible, particularly targeting lower-resource scenarios and domain adaptation (two problems the big players aren’t as interested in)
Generation from Matrices

• We have a matrix F representing the input; now we need to generate from it
• Bahdanau et al. (2015) were the first to propose using attention for translating from matrix-encoded sentences
• High-level idea
  • Generate the output sentence word by word using an RNN
  • At each output position t, the RNN receives two inputs (in addition to any recurrent inputs)
    • a fixed-size vector embedding of the previously generated output symbol e_{t-1}
    • a fixed-size vector encoding a “view” of the input matrix
  • How do we get a fixed-size vector from a matrix that changes over time?
    • Bahdanau et al.: take a weighted sum of the columns of F (i.e., words) based on how important they are at the current time step (i.e., just a matrix-vector product F a_t)
    • The weighting of the input columns at each time step (a_t) is called attention
Recall RNNs…

(Figure: an RNN decoder unrolled, generating “I’d”, then feeding “I’d” back in to generate “like”, and so on.)
(Figure, built up over several slides: the decoder attends over the columns of F for “Ich möchte ein Bier”. At each output step t it computes attention weights a_t over the source words, forms the context c_t = F a_t, and combines it with the recurrent state and the previous output word to generate the next word, producing “I’d like a beer” and then STOP. The attention history a_1^⊤, …, a_5^⊤ can be stacked and inspected.)
Attention

• How do we know what to attend to at each time step?
• That is, how do we compute a_t?
Computing Attention

• At each time step (one time step = one output word), we want to be able to “attend” to different words in the source sentence
• We need a weight for every column: this is an |f|-length vector a_t (called α_t in the paper)
• Here is a simplified version of Bahdanau et al.’s solution
  • Use an RNN to predict model output; call its hidden states s_t (s_t has a fixed dimensionality, call it m)
  • At time t, compute the query embedding r_t = V s_{t-1} (V is a learned parameter)
  • Take the dot product with every column in the source matrix to compute the attention energies: u_t = F^⊤ r_t (called e_t in the paper; since F has |f| columns, u_t has |f| rows)
  • Exponentiate and normalize to 1: a_t = softmax(u_t)
  • Finally, the input source vector for time t is c_t = F a_t
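These five steps are a handful of matrix operations; here is a numpy sketch of one attention step, with F and s_{t-1} as random stand-ins for the encoder matrix and decoder state:

```python
import numpy as np

# Sketch of the simplified attention above: r_t = V s_{t-1}, u_t = F^T r_t,
# a_t = softmax(u_t), c_t = F a_t.
rng = np.random.default_rng(5)
two_n, f_len, m = 10, 4, 6
F = rng.normal(size=(two_n, f_len))   # one column per source word
s_prev = rng.normal(size=m)           # decoder state s_{t-1}
V = rng.normal(size=(two_n, m))       # learned projection

r = V @ s_prev                        # query embedding r_t
u = F.T @ r                           # one energy per source column
a = np.exp(u - u.max())
a /= a.sum()                          # softmax -> attention weights a_t
c = F @ a                             # context c_t: weighted sum of columns

print(a, c.shape)
```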
Nonlinear Attention-Energy Model

• In the actual model, Bahdanau et al. replace the dot product between the columns of F and r_t with an MLP:

u_t = F^⊤ r_t              (simple model)
u_t = v^⊤ tanh(WF + r_t)   (Bahdanau et al.)

• Here, W and v are learned parameters of appropriate dimension and + “broadcasts” r_t over the |f| columns of WF
• This can learn more complex interactions
• It is unclear if the added complexity is necessary for good performance
Putting it all together

F = EncodeAsMatrix(f)      (Part 1 of lecture)
X = WF                     (doesn’t depend on output decisions, so precompute it once)
s_0 = w                    (learned initial state; Bahdanau uses U ←h_1)
e_0 = <s>
t = 0
while e_t ≠ </s>:
    t = t + 1
    r_t = V s_{t-1}
    u_t = v^⊤ tanh(X + r_t)               ⟍
    a_t = softmax(u_t)                     ⟩ (compute attention)
    c_t = F a_t                           ⟋
    s_t = RNN(s_{t-1}, [e_{t-1}; c_t])    (e_{t-1} is a learned embedding of the word e_{t-1})
    y_t = softmax(P s_t + b)              (P and b are learned parameters)
    e_t | e_{<t} ~ Categorical(y_t)
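The whole loop can be transcribed almost line for line into numpy. Everything below is an illustrative stand-in, not a trained model: EncodeAsMatrix is faked with a random F, the RNN is a plain tanh cell, the vocabulary is a toy one, and the loop is bounded so the sketch always terminates.

```python
import numpy as np

# Runnable transcription of the slide's pseudocode with random parameters.
rng = np.random.default_rng(6)
vocab = ["<s>", "I'd", "like", "a", "beer", "</s>"]
n_src, f_len, m, n_emb = 8, 4, 8, 6
F = rng.normal(0, 0.5, (n_src, f_len))            # F = EncodeAsMatrix(f), faked
W = rng.normal(0, 0.5, (n_src, n_src))
X = W @ F                                         # X = WF, precomputed once
v = rng.normal(0, 0.5, n_src)
Vp = rng.normal(0, 0.5, (n_src, m))               # V in the slide
Wr = rng.normal(0, 0.5, (m, m + n_emb + n_src))   # decoder "RNN" weights
E = rng.normal(0, 0.5, (len(vocab), n_emb))       # output-word embeddings
P = rng.normal(0, 0.5, (len(vocab), m))
b = np.zeros(len(vocab))

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

s = rng.normal(0, 0.5, m)                         # s_0 = w (learned initial state)
e, out = "<s>", []
for t in range(1, 10):                            # bounded stand-in for the while loop
    r = Vp @ s                                    # r_t = V s_{t-1}
    u = v @ np.tanh(X + r[:, None])               # broadcast r_t over the columns
    a = softmax(u)                                # a_t = softmax(u_t)
    c = F @ a                                     # c_t = F a_t
    s = np.tanh(Wr @ np.concatenate([s, E[vocab.index(e)], c]))
    y = softmax(P @ s + b)                        # y_t = softmax(P s_t + b)
    e = rng.choice(vocab, p=y)                    # e_t ~ Categorical(y_t)
    if e == "</s>":
        break
    out.append(e)
print(out)
```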
Attention in MT

Add attention to seq2seq translation: +11 BLEU
A word about gradients

(Figure: the attention-based decoder for “Ich möchte ein Bier” generating “I’d like a beer”, with the attention connections highlighted: each output position has a short, direct path back to the source encodings, so gradients do not have to travel through the entire recurrence.)
Attention and Translation

• Cho’s question: does a translator read and memorize the input sentence/document and then generate the output?
  • Compressing the entire input sentence into a vector basically says “memorize the sentence”
  • Common-sense experience says translators refer back and forth to the input (also backed up by eye-tracking studies)
• Should humans be a model for machines?
Summary

• Attention
  • provides the ability to establish information flow directly from distant positions
  • is closely related to “pooling” operations in convnets (and other architectures)
• The traditional attention model seems to care only about “content”
  • No obvious bias in favor of diagonals, short jumps, fertility, etc.
  • Some work has begun to add other “structural” biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities
  • Factorization into keys and values (Miller et al., 2016; Ba et al., 2016; Gulcehre et al., 2016)
• Attention weights provide an interpretation you can look at
Questions? (Há perguntas?)
Thank you! (Obrigado!)