
Page 1: Conditional Language Modeling

Conditional Language Modeling

Chris Dyer

Page 2: Conditional Language Modeling

A language model assigns probabilities to sequences of words, $\mathbf{w} = (w_1, w_2, \ldots, w_\ell)$.

It is convenient to decompose this probability using the chain rule, as follows:

$p(\mathbf{w}) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times \cdots \times p(w_\ell \mid w_1, \ldots, w_{\ell-1}) = \prod_{t=1}^{|\mathbf{w}|} p(w_t \mid w_1, \ldots, w_{t-1})$

This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.

Unconditional LMs
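To make the chain-rule decomposition concrete, here is a minimal sketch in Python. It assumes a hypothetical `next_word_distribution(history)` function (not part of the lecture) that returns the model's distribution over the next word as a dict; the only point is how the per-word conditionals multiply (add, in log space) into the sequence probability.

```python
import math

def sequence_log_prob(words, next_word_distribution):
    """Chain rule: log p(w) = sum_t log p(w_t | w_1, ..., w_{t-1})."""
    log_p = 0.0
    for t, word in enumerate(words):
        history = words[:t]                      # w_1, ..., w_{t-1}
        dist = next_word_distribution(history)   # assumed: dict word -> probability
        log_p += math.log(dist[word])            # accumulate log p(w_t | history)
    return log_p
```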

Page 3: Conditional Language Modeling

Evaluating unconditional LMs

How good is our unconditional language model?

1. Held-out per-word cross-entropy or perplexity. This is the same as the training criterion: how uncertain is the model at each time position, on average?

2. Task-based evaluation. Use the language model inside a task model, in place of some other language model. Does task performance improve?

$H = -\frac{1}{|\mathbf{w}|}\sum_{i=1}^{|\mathbf{w}|} \log_2 p(w_i \mid w_{<i})$  (units: bits per word)

$\mathrm{ppl} = b^{-\frac{1}{|\mathbf{w}|}\sum_{i=1}^{|\mathbf{w}|} \log_b p(w_i \mid w_{<i})}$  (units: uncertainty per word)
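A minimal sketch of both evaluation quantities, assuming we already have the model's per-word probabilities $p(w_i \mid w_{<i})$ on a held-out sequence; note that perplexity is just the exponentiated per-word cross-entropy.

```python
import math

def cross_entropy_and_perplexity(word_probs):
    """word_probs: p(w_i | w_<i) for each position i of a held-out sequence."""
    n = len(word_probs)
    # H = -(1/|w|) * sum_i log2 p(w_i | w_<i)          (bits per word)
    H = -sum(math.log2(p) for p in word_probs) / n
    # ppl = b^{-(1/|w|) sum_i log_b p(w_i | w_<i)} = 2^H for b = 2
    ppl = 2.0 ** H
    return H, ppl
```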

Page 4: Conditional Language Modeling

History-based LMs

$p(\mathbf{w}) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times p(w_4 \mid w_1, w_2, w_3) \times \cdots$

A common strategy is to make a Markov assumption, which is a conditional independence assumption. Markov: forget the distant past. Is this valid for language? No… Is it practical? Often!

Why RNNs are great for language: no more Markov assumptions!
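For contrast with the RNN approach that follows, here is a toy sketch of a count-based n-gram model that makes the Markov assumption explicit: the conditioning context is truncated to the last n−1 words. This is an illustrative relative-frequency estimator only (no smoothing), not the Kneser-Ney models referred to later.

```python
from collections import Counter, defaultdict

def train_ngram_lm(corpus, n=3):
    """Estimate p(w_t | w_{t-n+1}, ..., w_{t-1}) by relative frequency.
    corpus: list of sentences, each a list of words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        padded = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for t in range(n - 1, len(padded)):
            context = tuple(padded[t - n + 1:t])   # Markov assumption: keep only n-1 words
            counts[context][padded[t]] += 1
    return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for ctx, ctr in counts.items()}
```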


Page 7: Conditional Language Modeling

History-based LMs with RNNs

[Diagram: an RNN reads the observed context words w_1 … w_4 (each mapped to a word-embedding vector) through hidden states h_0 … h_4; a softmax over the final hidden state produces a vector of length |vocab|, the distribution p(W_5 | w_1, w_2, w_3, w_4) over the random variable W_5.]


Page 9: Conditional Language Modeling

Distributions over words

[Diagram: a hidden vector h is passed through a softmax layer to produce a distribution over the words of a vocabulary: the, a, and, cat, dog, horse, runs, says, walked, walks, walking, pig, Lisbon, sardines, …]

$\mathbf{u} = \mathbf{W}\mathbf{h} + \mathbf{b}$

$p_i = \frac{\exp u_i}{\sum_j \exp u_j}$

Each dimension corresponds to a word in a closed vocabulary, V.

The $p_i$'s form a distribution, i.e. $p_i > 0 \; \forall i$ and $\sum_i p_i = 1$.

Bridle (1990), "Probabilistic interpretation of feedforward classification…"


Page 13: Conditional Language Modeling

Distributions over words

Histories are sequences of words…

$p(\mathbf{w}) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times p(w_4 \mid w_1, w_2, w_3) \times \cdots$

With the same softmax layer as before ($\mathbf{u} = \mathbf{W}\mathbf{h} + \mathbf{b}$, $p_i = \frac{\exp u_i}{\sum_j \exp u_j}$), suppose $\mathbf{h} \in \mathbb{R}^d$ and $|V| = 100{,}000$. What are the dimensions of $\mathbf{W}$ and $\mathbf{b}$?
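The dimension question has a concrete answer: for $\mathbf{u} = \mathbf{W}\mathbf{h} + \mathbf{b}$ to produce one score per vocabulary item, $\mathbf{W}$ must be $|V| \times d$ and $\mathbf{b}$ must have length $|V|$. A small NumPy sketch (with toy sizes in place of the slide's $|V| = 100{,}000$):

```python
import numpy as np

d, V = 8, 20                         # toy sizes; on the slide h is d-dimensional, |V| = 100,000
W = np.random.randn(V, d) * 0.01     # W has shape |V| x d
b = np.zeros(V)                      # b has length |V|

def word_distribution(h):
    """Map a hidden state h (length d) to a distribution over the vocabulary."""
    u = W @ h + b                    # one unnormalised score per word
    u = u - u.max()                  # subtract the max for numerical stability
    p = np.exp(u) / np.exp(u).sum()  # softmax: p_i = exp(u_i) / sum_j exp(u_j)
    return p                         # p_i > 0 for all i and sum_i p_i = 1
```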

Page 14: Conditional Language Modeling

RNN language models


Page 22: Conditional Language Modeling

RNN language models

[Diagram: the RNN language model unrolled over time. Starting from h_0 and the start symbol <s>, each step feeds the embedding x_t of the previous word into the RNN to obtain h_t, applies a softmax to get p_t, and samples (~) the next word. The sampled sequence "tom likes beer </s>" has probability p(tom | <s>) × p(likes | <s>, tom) × p(beer | <s>, tom, likes) × p(</s> | <s>, tom, likes, beer).]
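A minimal sketch of the unrolled computation in the diagram, using a simple tanh recurrence and randomly initialised toy parameters as stand-ins for trained weights (the lecture does not commit to a particular RNN cell; an LSTM would be substituted in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "</s>", "tom", "likes", "beer"]      # toy vocabulary
d = 8                                                # toy hidden/embedding size
E   = rng.normal(0, 0.1, (len(vocab), d))            # word embeddings x_t
W_h = rng.normal(0, 0.1, (d, d))                     # recurrent weights
W_x = rng.normal(0, 0.1, (d, d))                     # input weights
W_o = rng.normal(0, 0.1, (len(vocab), d))            # output (softmax) weights

def softmax(u):
    u = u - u.max()
    return np.exp(u) / np.exp(u).sum()

def sample_sequence(max_len=20):
    """Sample w ~ p(w): feed the previous word in, take a softmax, draw the next word."""
    h = np.zeros(d)                                  # h_0
    prev = vocab.index("<s>")
    words, log_prob = [], 0.0
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_x @ E[prev])         # h_t from h_{t-1} and x_t
        p = softmax(W_o @ h)                         # p_t = p(W_t | w_<t)
        w = int(rng.choice(len(vocab), p=p))         # ~ sample the next word
        log_prob += np.log(p[w])
        if vocab[w] == "</s>":
            break
        words.append(vocab[w])
        prev = w
    return words, log_prob
```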


Page 27: Conditional Language Modeling

Training RNN language models

[Diagram: the RNN language model unrolled over the training sentence "tom likes beer </s>". At each position the predicted distribution p_t is scored against the observed next word, giving per-word log loss / cross-entropy terms cost_1 … cost_4; the total training objective F is their sum.]


Page 29: Conditional Language Modeling

Training RNN language models

Minimising the cross-entropy objective is the same as maximising the likelihood (the MLE objective): "Find the parameters that make the training data most likely."

You will overfit. Standard remedies:
1. Stop training early, based on a validation set
2. Weight decay / other regularizers
3. "Dropout" during training

In contrast to count-based models, zeroes aren't a problem.
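The per-position costs in the training diagram are negative log-probabilities of the observed next words, and F is their sum; minimising F over the training set is exactly MLE. A minimal sketch, assuming a hypothetical `step(h, prev_index)` function (e.g. the toy RNN above) that advances the network one position and returns the new hidden state together with the softmax distribution:

```python
import numpy as np

def sentence_loss(words, vocab, step, d):
    """Cross-entropy (negative log-likelihood) of one training sentence.
    F = cost_1 + ... + cost_|w|, with cost_t = -log p(w_t | w_<t)."""
    targets = list(words) + ["</s>"]
    h, prev = np.zeros(d), vocab.index("<s>")
    F = 0.0
    for w in targets:
        h, p = step(h, prev)                    # unroll the RNN one position
        F += -np.log(p[vocab.index(w)])         # cost_t: log loss at position t
        prev = vocab.index(w)                   # condition on the observed word, not a sample
    return F
```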

Page 30: Conditional Language Modeling

• Unlike Markov (n-gram) models, RNNs never forget

• However, they don’t always remember so well (recall Felix’s lectures on RNNs vs. LSTMs)

• Algorithms

• Sample a sequence from the probability distribution defined by the RNN

• Train the RNN to minimize cross entropy (aka MLE)

• What about: what is the most probable sequence?

RNN language models

Page 31: Conditional Language Modeling

How well do RNN LMs do?

                                          perplexity   Word Error Rate (WER)
  order=5 Markov, Kneser-Ney freq. est.   221          13.5
  RNN, 400 hidden                         171          12.5
  3x RNN interpolation                    151          11.6

Mikolov et al. (2010, Interspeech), "Recurrent neural network based language model"


Page 33: Conditional Language Modeling

Conditional LMs

A conditional language model assigns probabilities to sequences of words, $\mathbf{w} = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $\mathbf{x}$.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability:

$p(\mathbf{w} \mid \mathbf{x}) = \prod_{t=1}^{\ell} p(w_t \mid \mathbf{x}, w_1, w_2, \ldots, w_{t-1})$

What is the probability of the next word, given the history of previously generated words and the conditioning context $\mathbf{x}$?

Page 34: Conditional Language Modeling

Conditional LMs

  "input" x                            "text output" w
  An author                            A document written by that author
  A topic label                        An article about that topic
  {SPAM, NOT_SPAM}                     An email
  A sentence in French                 Its English translation
  A sentence in English                Its French translation
  A sentence in English                Its Chinese translation
  An image                             A text description of the image
  A document                           Its summary
  A document                           Its translation
  Meteorological measurements          A weather report
  Acoustic signal                      Transcription of speech
  Conversational history + database    Dialogue system response
  A question + a document              Its answer
  A question + an image                Its answer


Page 37: Conditional Language Modeling

Data for training conditional LMs

To train conditional language models, we need paired samples, $\{(\mathbf{x}_i, \mathbf{w}_i)\}_{i=1}^{N}$.

Data availability varies. It's easy to think of tasks that could be solved by conditional language models, but for which the data just doesn't exist.

Relatively large amounts of data exist for: translation, summarisation, caption generation, speech recognition.

Page 38: Conditional Language Modeling

Evaluating conditional LMs

How good is our conditional language model?

• These are language models, so we can use cross-entropy or perplexity. (Okay to implement, hard to interpret.)

• Task-specific evaluation. Compare the model's most likely output, $\mathbf{w}^* = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{x})$, to a human-generated reference output using a task-specific evaluation metric $L$, i.e. $L(\mathbf{w}^*, \mathbf{w}_{\text{ref}})$. Examples of $L$: BLEU, METEOR, WER, ROUGE. (Easy to implement, okay to interpret.)

• Human evaluation. (Hard to implement, easy to interpret.)


Page 40: Conditional Language Modeling

Lecture overview

The rest of this lecture will look at "encoder-decoder" models that learn a function that maps $\mathbf{x}$ into a fixed-size vector and then use a language model to "decode" that vector into a sequence of words, $\mathbf{w}$.

[Diagram: the source sentence x = "Kunst kann nicht gelehrt werden…" is fed to an encoder, producing a fixed-size representation; a decoder then generates the output w = "Artistry can't be taught…".]

Page 41: Conditional Language Modeling

Lecture overview

[Diagram: the same encoder-decoder pipeline, now with an image as the input x; the decoder generates the caption w = "A dog is playing on the beach."]

Page 42: Conditional Language Modeling

Lecture overview

• Two questions:

  • How do we encode $\mathbf{x}$ as a fixed-size vector, $\mathbf{c}$? (Problem, or at least modality, specific. Think about assumptions.)

  • How do we condition on $\mathbf{c}$ in the decoding model? (Less problem specific. We will review one standard solution: RNNs.)


Page 45: Conditional Language Modeling

Kalchbrenner and Blunsom 2013

Encoder:
$\mathbf{c} = \mathrm{embed}(\mathbf{x})$
$\mathbf{s} = \mathbf{V}\mathbf{c}$

Recurrent decoder:
$\mathbf{h}_t = g(\mathbf{W}[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}] + \mathbf{s} + \mathbf{b})$
$\mathbf{u}_t = \mathbf{P}\mathbf{h}_t + \mathbf{b}'$
$p(W_t \mid \mathbf{x}, \mathbf{w}_{<t}) = \mathrm{softmax}(\mathbf{u}_t)$

Here $\mathbf{s}$ is the source-sentence representation, $\mathbf{w}_{t-1}$ is the embedding of the previous word, $\mathbf{h}_{t-1}$ is the recurrent connection, and $\mathbf{b}$ is a learnt bias. Recall the unconditional RNN: $\mathbf{h}_t = g(\mathbf{W}[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}] + \mathbf{b})$.

Page 46: Conditional Language Modeling

K&B 2013: Encoder

How should we define $\mathbf{c} = \mathrm{embed}(\mathbf{x})$?

The simplest model possible: sum the embeddings of the source words $x_1, x_2, \ldots, x_6$,

$\mathbf{c} = \sum_i \mathbf{x}_i$

What do you think of this model?
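A sketch of the whole K&B-style model under the same toy-parameter assumptions as the earlier RNN sketches: the encoder simply sums the source word embeddings and projects the result, and the decoder is the unconditional RNN recurrence with s added at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 5                       # toy sizes

# Toy parameters (stand-ins for learned weights)
V_mat = rng.normal(0, 0.1, (d, d))         # V in s = V c
W     = rng.normal(0, 0.1, (d, 2 * d))     # recurrence weights over [h_{t-1}; w_{t-1}]
b     = np.zeros(d)                        # learnt bias
P     = rng.normal(0, 0.1, (vocab_size, d))
b_out = np.zeros(vocab_size)

def softmax(u):
    u = u - u.max()
    return np.exp(u) / np.exp(u).sum()

def encode(source_embeddings):
    """Encoder: c = sum_i x_i, then s = V c."""
    c = source_embeddings.sum(axis=0)
    return V_mat @ c

def decoder_step(h_prev, w_prev_embed, s):
    """h_t = g(W [h_{t-1}; w_{t-1}] + s + b), u_t = P h_t + b', p = softmax(u_t)."""
    h = np.tanh(W @ np.concatenate([h_prev, w_prev_embed]) + s + b)
    p = softmax(P @ h + b_out)
    return h, p

# Usage: encode a 6-word source, then take one decoder step from h_0 and <s>.
s = encode(rng.normal(0, 0.1, (6, d)))
h1, p1 = decoder_step(np.zeros(d), rng.normal(0, 0.1, d), s)
```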



Page 53: Conditional Language Modeling

K&B 2013: RNN Decoder

[Diagram: the decoder unrolled over time. It is the RNN language model from before, except that the source representation s enters the recurrence at every step: starting from h_0 and <s>, each step produces h_t, applies a softmax to get p_t, and samples (~) the next word. The sampled output "tom likes beer </s>" has probability p(tom | s, <s>) × p(likes | s, <s>, tom) × p(beer | s, <s>, tom, likes) × p(</s> | s, <s>, tom, likes, beer).]


Page 55: Conditional Language Modeling

A word about decoding

In general, we want to find the most probable (MAP) output given the input, i.e.

$\mathbf{w}^* = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{x}) = \arg\max_{\mathbf{w}} \sum_{t=1}^{|\mathbf{w}|} \log p(w_t \mid \mathbf{x}, \mathbf{w}_{<t})$

This is, for general RNNs, a hard problem (undecidable, in fact). We therefore approximate it with a greedy search:

$w_1^* \approx \arg\max_{w_1} p(w_1 \mid \mathbf{x})$
$w_2^* \approx \arg\max_{w_2} p(w_2 \mid \mathbf{x}, w_1^*)$
$\quad\vdots$
$w_t^* \approx \arg\max_{w_t} p(w_t \mid \mathbf{x}, \mathbf{w}_{<t}^*)$
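A sketch of the greedy approximation, assuming a hypothetical `next_word_distribution(x, prefix)` interface that returns $p(\cdot \mid \mathbf{x}, \mathbf{w}_{<t})$ as a dict; it simply commits to the single best word at each position.

```python
def greedy_decode(x, next_word_distribution, max_len=100):
    """Approximate w* = argmax_w p(w | x) one position at a time."""
    prefix = []
    for _ in range(max_len):
        dist = next_word_distribution(x, prefix)   # assumed: dict word -> probability
        best = max(dist, key=dist.get)             # w*_t = argmax_{w_t} p(w_t | x, w*_<t)
        if best == "</s>":
            break
        prefix.append(best)
    return prefix
```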


Page 63: Conditional Language Modeling

A word about decoding

A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses.

[Diagram: beam search with b = 2 for x = "Bier trinke ich" (gloss: beer drink I). At the first step the two best hypotheses are "beer" (log prob -1.82) and "I" (-2.11). Expanding both and keeping the top two partial hypotheses at every step, "I drink" (-2.87) overtakes the hypotheses beginning with "beer", and the best complete hypothesis is "I drink beer" (log prob -3.04); greedy search, by contrast, would have committed to "beer" as the first word.]
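A simplified sketch of beam search under the same hypothetical `next_word_distribution(x, prefix)` interface as the greedy sketch: every surviving hypothesis is extended by every word, scores are summed log-probabilities, and only the b best partial hypotheses survive each step.

```python
import math

def beam_search(x, next_word_distribution, b=2, max_len=100):
    """Keep the top-b partial hypotheses (prefix, summed log prob) at every step."""
    beam = [([], 0.0)]                                  # start from <s> with log prob 0
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, log_p in beam:
            dist = next_word_distribution(x, prefix)    # assumed: dict word -> probability
            for word, p in dist.items():
                candidates.append((prefix + [word], log_p + math.log(p)))
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = []
        for prefix, log_p in candidates[:b]:            # keep only the b best hypotheses
            if prefix[-1] == "</s>":
                finished.append((prefix, log_p))        # completed hypotheses leave the beam
            else:
                beam.append((prefix, log_p))
        if not beam:
            break
    return max(finished + beam, key=lambda h: h[1])     # best-scoring hypothesis found
```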

Page 64: Conditional Language Modeling

How well does this model do?

                                          perplexity (2011)   perplexity (2012)
  order=5 Markov, Kneser-Ney freq. est.   222                 225
  RNN LM                                  178                 181
  RNN LM + x                              140                 142


Page 71: Conditional Language Modeling

How well does this model do?

(Literal: I will feel bad if you do not find a solution.)


Page 73: Conditional Language Modeling

Summary

  • Conditional language modeling provides a convenient formulation for a lot of practical applications

• Two big problems:

• Model expressivity

• Decoding difficulties

• Next time

• A better encoder for vector to sequence models

• “Attention” for better learning

• Lots of results on machine translation

Page 74: Conditional Language Modeling

Questions?