Discrete and Continuous Language Models. Ashish Vaswani, University of Southern California, Information Sciences Institute.


Page 1: Discrete and Continuous Language


Discrete and Continuous Language Models

University of Southern California, Information Sciences Institute

Ashish Vaswani

Page 2: Discrete and Continuous Language

Statistical Machine Translation (SMT)

2

source target

Page 3: Discrete and Continuous Language

Statistical Machine Translation (SMT)

2

source target

la kato sidis sur mato | the cat sat on a mat

Page 4: Discrete and Continuous Language

Statistical Machine Translation (SMT)

2

source target

la kato sidis sur mato | the cat sat on a mat

the cat sat on a mat | वह िबल्ली चटाई पर बैठी

Page 5: Discrete and Continuous Language

Statistical Machine Translation (SMT)

2

source target

la kato sidis sur mato | the cat sat on a mat

the cat sat on a mat | वह िबल्ली चटाई पर बैठी

set a as the sigmoid of y: a = 1 / (1 + e^{-y})

Page 6: Discrete and Continuous Language

Statistical Machine Translation (SMT)

2

source target

la kato sidis sur mato | the cat sat on a mat

the cat sat on a mat | वह िबल्ली चटाई पर बैठी

set a as the sigmoid of y: a = 1 / (1 + e^{-y})

Page 7: Discrete and Continuous Language

Statistical Machine Translation (SMT)

2

source target

la kato sidis sur mato | the cat sat on a mat

the cat sat on a mat | वह िबल्ली चटाई पर बैठी

set a as the sigmoid of y: a = 1 / (1 + e^{-y})

loop over all people: for (int i = 0; i < personlist.size(); i++)

Page 8: Discrete and Continuous Language

Language models in Machine Translation

3

Translation grammars

Language model | Decoder

input sentence

input sentence

Page 9: Discrete and Continuous Language

Language models in Machine Translation

3

Translation grammars

Language model | Decoder

la kato , the cat

चटाई पर बैठी , sat on a mat

sigmoid of y , 1 / (1 + e^{-y})

input sentence

input sentence

Page 10: Discrete and Continuous Language

Language models in Machine Translation

3

Translation grammars

Language model | Decoder

SMT Parameters: 1.5 billion, 0.9 billion

la kato , the cat

चटाई पर बैठी , sat on a mat

sigmoid of y , 1 / (1 + e^{-y})

input sentence

input sentence

Page 11: Discrete and Continuous Language

Applications of Language Models

4

Natural Language Applications

• Machine Translation

• Speech Recognition

• Spelling Correction

• Document Summarization

• Sentence Completion

Programming Languages

• Code Completion

• Retrieval of code from natural language

• Retrieval of natural language from code snippets

• Identifying buggy code

Page 12: Discrete and Continuous Language

5

Outline

What are Language Models?

• Definition

• Markovization

Page 13: Discrete and Continuous Language

5

Outline

What are Language Models?

• Definition

• Markovization

Discrete Language Models

• Estimating Probabilities

• Maximum Likelihood Estimation

• Overfitting

• Smoothing

Page 14: Discrete and Continuous Language

5

Outline

What are Language Models?

• Definition

• Markovization

Discrete Language Models

• Estimating Probabilities

• Maximum Likelihood Estimation

• Overfitting

• Smoothing

Continuous Language Models

• Feed Forward Neural Network LM

• Recurrent Neural Network LM

  A. Vanilla RNN

  B. Long Short Term Memory RNN

• Visualizing LSTMs

Page 15: Discrete and Continuous Language

What are Language Models ?

Page 16: Discrete and Continuous Language

7

Language Models

the cat sat on the mat

Page 17: Discrete and Continuous Language

8

Language Models

P( the cat sat on the mat )

Page 18: Discrete and Continuous Language

9

Chain Rule

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Page 19: Discrete and Continuous Language

9

Chain Rule

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

P(w_1, w_2, \ldots, w_n) = \prod_i P(w_i \mid w_1, w_2, \ldots, w_{i-1})

Page 20: Discrete and Continuous Language

10

n-gram Language Model

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Andrei Markov

Page 21: Discrete and Continuous Language

10

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Andrei Markov

Bigram Language Model


Page 24: Discrete and Continuous Language

11

Trigram Language Model

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Andrei Markov

Page 25: Discrete and Continuous Language

11

Trigram Language Model

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Andrei Markov

P(w_1, w_2, \ldots, w_n) \approx \prod_i P(w_i \mid w_{i-k+1}, \ldots, w_{i-1})
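To make the Markov approximation concrete, here is a small sketch added for this write-up (not part of the original slides): score a sentence as a product of conditional probabilities that look at only the previous k−1 words. The uniform dummy model is purely an illustrative stand-in.

# Minimal sketch (illustrative assumption): scoring a sentence with a k-th
# order Markov (n-gram) factorization, given any conditional probability function.
import math

def sentence_log_prob(words, cond_prob, k=2):
    """Sum of log P(w_i | previous k-1 words)."""
    total = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - (k - 1)):i])
        total += math.log(cond_prob(w, context))
    return total

# Example with a uniform dummy model over a 10-word vocabulary.
uniform = lambda w, ctx: 1.0 / 10
print(sentence_log_prob("the cat sat on the mat".split(), uniform, k=2))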

Page 26: Discrete and Continuous Language

Discrete Space Language Models

Page 27: Discrete and Continuous Language

13

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Estimating Probabilities

P(on | sat) = count(sat on) / count(sat)

Page 28: Discrete and Continuous Language

13

P( the cat sat on the mat )

= P( the ) P( cat | the ) P( sat | the cat ) P( on | the cat sat ) P( the | the cat sat on ) P( mat | the cat sat on the )

Estimating Probabilities

P(on | sat) = c(sat on) / c(sat)

Page 29: Discrete and Continuous Language

14

Estimating Probabilities

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

P(on | sat) = c(sat on) / c(sat)

Page 30: Discrete and Continuous Language

15

Estimating Probabilities

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

Page 31: Discrete and Continuous Language

15

Estimating Probabilities

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

P(cat | the) = 3/6 = 0.5

Page 32: Discrete and Continuous Language

15

Estimating Probabilities

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

P(cat | the) = 3/6 = 0.5

P(dog | the) = 1/6 = 0.167

Page 33: Discrete and Continuous Language

15

Estimating Probabilities

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

P(cat | the) = 3/6 = 0.5

P(dog | the) = 1/6 = 0.167

P(mat | the) = 1/6 = 0.167

Page 34: Discrete and Continuous Language

15

Estimating Probabilities

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

P(the | on) = 2/2 = 1.0

<s> the cat caught the mouse </s>

P(cat | the) = 3/6 = 0.5

P(dog | the) = 1/6 = 0.167

P(mat | the) = 1/6 = 0.167

Page 35: Discrete and Continuous Language

16

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

P(the | on) = 2/2 = 1.0

<s> the cat caught the mouse </s>

P(cat | the) = 3/6 = 0.5

P(dog | the) = 1/6 = 0.167

P(mat | the) = 1/6 = 0.167

Maximum Likelihood Estimation

Page 36: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) =

Page 37: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

Page 38: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) ×

Page 39: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) × P(sat | cat) ×

Page 40: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) × P(sat | cat) × P(on | sat) ×

Page 41: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) × P(sat | cat) × P(on | sat) ×

P(the | on) ×

Page 42: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) × P(sat | cat) × P(on | sat) ×

P(the | on) ×

P(mat | the) ×

Page 43: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) × P(sat | cat) × P(on | sat) ×

P(the | on) ×

P(mat | the) × P(</s> | mat)

Page 44: Discrete and Continuous Language

17

Maximum Likelihood Estimation (MLE)

P( <s> the cat sat on the mat </s> ) = P(the | <s>) ×

P(cat | the) × P(sat | cat) × P(on | sat) ×

P(the | on) ×

P(mat | the) × P(</s> | mat)

= 0.084
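As an added illustration (not from the original slides), the sketch below estimates bigram MLE probabilities from the three training sentences and multiplies them for this test sentence. The exact value depends on how <s> and </s> are counted, so it may differ from the 0.084 shown on the slide.

# Sketch (assumed boundary handling): bigram MLE estimation on the toy corpus
# and the product of bigram probabilities for one sentence.
from collections import Counter

train = ["<s> the cat sat on the mat </s>",
         "<s> the dog sat on the cat </s>",
         "<s> the cat caught the mouse </s>"]

unigrams, bigrams = Counter(), Counter()
for line in train:
    toks = line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

sentence = "<s> the cat sat on the mat </s>".split()
prob = 1.0
for prev, w in zip(sentence, sentence[1:]):
    prob *= p_mle(w, prev)
print(prob)   # product of bigram MLE probabilities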

Page 45: Discrete and Continuous Language

18

MLE with higher order n-grams ?

P( <s> <s> the cat sat on the mat </s> )

= P(the | <s> <s>) × P(cat | <s> the) × P(sat | the cat) × … × P(</s> | the mat)

= 0.33

Page 46: Discrete and Continuous Language

19

MLE with higher order n-grams ?

If you increase the n-gram order, you will improve your training data probability with MLE

Page 47: Discrete and Continuous Language

19

MLE with higher order n-grams ?

If you increase the n-gram order, you will improve your training data probability with MLE

P(\text{DATA})_{n\text{-gram}} \ge P(\text{DATA})_{(n-1)\text{-gram}}

Page 48: Discrete and Continuous Language

19

MLE with higher order n-grams ?

If you increase the n-gram order, you will improve your training data probability with MLE

P(\text{DATA})_{n\text{-gram}} \ge P(\text{DATA})_{(n-1)\text{-gram}}

Log Sum Inequality:

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}

Page 49: Discrete and Continuous Language

Comparing Language Models

Page 50: Discrete and Continuous Language

21

Data Splits

• Divide your data into train, development, and test.

• Train on the training set.

• Adjust hyperparameters on the dev set (length of n-grams).

• Test on the test set.

Page 51: Discrete and Continuous Language

22

Extrinsic Evaluation

• Imagine you have two models A and B.

• Use them on an end-to-end task: Machine Translation, speech recognition.

• Evaluate the accuracy on the task.

• Improve the language model.

• This process can take a while.

Page 52: Discrete and Continuous Language

23

Intrinsic Evaluation

Log Likelihood

Bigram Model: \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})

Trigram Model: \sum_{i=1}^{N} \log P(w_i \mid w_{i-2}, w_{i-1})

(N = number of tokens)

Higher is better

Page 53: Discrete and Continuous Language

23

Intrinsic Evaluation

Cross Entropy

Bigram Model: -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})

Trigram Model: -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-2}, w_{i-1})

(N = number of tokens)

Lower is better

Page 54: Discrete and Continuous Language

24

Intrinsic Evaluation

Perplexity

Bigram Model: 2^{-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})}

Trigram Model: 2^{-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-2}, w_{i-1})}

(N = number of tokens)

Lower is better
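The following sketch, added for illustration, computes log-likelihood, cross entropy, and perplexity from a list of per-token model probabilities (the example probabilities are made up).

# Sketch (assumed per-token probabilities): intrinsic evaluation of a model.
# Log-likelihood is higher-is-better; cross entropy and perplexity are lower-is-better.
import math

def evaluate(token_probs):
    n = len(token_probs)
    log_likelihood = sum(math.log2(p) for p in token_probs)
    cross_entropy = -log_likelihood / n
    perplexity = 2 ** cross_entropy
    return log_likelihood, cross_entropy, perplexity

print(evaluate([0.5, 0.1, 0.25, 0.2]))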

Page 55: Discrete and Continuous Language

MLE and Overfitting

Page 56: Discrete and Continuous Language

26

MLE recap

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

P(the | on) = 2/2 = 1.0

<s> the cat caught the mouse </s>

P(cat | the) = 3/6 = 0.5

P(dog | the) = 1/6 = 0.167

P(mat | the) = 1/6 = 0.167

Page 57: Discrete and Continuous Language

27

MLE recap

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

P(w_i \mid w_{i-1})_{MLE} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

P(cat | the)_{MLE} = 3/6 = 0.5

P(dog | the)_{MLE} = 1/6 = 0.167

P(mat | the)_{MLE} = 1/6 = 0.167

P(the | on)_{MLE} = 2/2 = 1.0

Page 58: Discrete and Continuous Language

28

Overfitting

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

• You can achieve very low perplexity on training.

• But test data can be different from training.

Page 59: Discrete and Continuous Language

28

Overfitting

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

• You can achieve very low perplexity on training.

• But test data can be different from training.

<s> the dog caught the cat </s>

TEST TRAIN

Page 60: Discrete and Continuous Language

29

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

Overfitting

P(caught | cat)_{MLE} = 1/3

P(caught | dog)_{MLE} = 0

Page 61: Discrete and Continuous Language

29

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

Test probability is 0!

Overfitting

P(caught | cat)_{MLE} = 1/3

P(caught | dog)_{MLE} = 0

Page 62: Discrete and Continuous Language

30

Generalization

We need to train models that generalize well.

• Train n-gram models where 'n' is small

• Train n-gram models on large collections of data.

Page 63: Discrete and Continuous Language

Tokens Vs Types

31

[Figure: number of types (y-axis, 0 to 160,000) versus millions of tokens (x-axis, 1 to 8) for the WSJ and Eclipse corpora.]

Page 64: Discrete and Continuous Language

32

n-gram Language Models

1-gram

sharon, talks, bush, iPhone, mercurial, ...

Number of unique 1-grams in the language: 60 × 10^3

Page 65: Discrete and Continuous Language

33

n-gram Language Models

2-gram

(sharon, talks, bush, iPhone, mercurial, ...) × (sharon, talks, bush, iPhone, mercurial, ...)

Number of unique 2-grams in the language: 3.6 × 10^9

Page 66: Discrete and Continuous Language

34

n-gram Language Models

3-gram

(sharon, talks, bush, iPhone, mercurial, ...) × (sharon, talks, bush, iPhone, mercurial, ...) × (sharon, talks, bush, iPhone, mercurial, ...)

Number of unique 3-grams in the language: 2.16 × 10^14

Page 67: Discrete and Continuous Language

35

How about Google n-grams ?

Number of unique 3-grams: 2.16 × 10^14


Page 69: Discrete and Continuous Language

36

How about Google n-grams ?

Running text in Google n-grams / number of unique 3-grams in the language = (1 × 10^12) / (2.16 × 10^14) = 0.0046

Page 70: Discrete and Continuous Language

37

The Web ?

Number of words in the indexed web / number of unique 3-grams in the language = (4.4927 × 10^14) / (2.16 × 10^14) = 2.07

Page 71: Discrete and Continuous Language

38

The Web ?

Number of words in the indexed web / number of unique 3-grams in code = (4.4927 × 10^14) / (3.375 × 10^15) = 0.133

Page 72: Discrete and Continuous Language

39

Better Solution: Smoothing

• Just collecting a lot of data is not enough.

• Smoothing methods differ in how they move probability from seen n-grams to unseen n-grams.

•Neural network language models have a more elegant solution for smoothing.

Page 73: Discrete and Continuous Language

40

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

Test probability is 0!

Better Solution: Smoothing

P(caught | cat)_{MLE} = 1/3

P(caught | dog)_{MLE} = 0

Page 74: Discrete and Continuous Language

40

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

Better Solution: Smoothing

Test probability is > 0

P(caught | cat)_{smooth} = (1 − x) / 3

P(caught | dog)_{smooth} = x / 3

Page 75: Discrete and Continuous Language

41

Data Preprocessing

• Split data into train/dev/test

• Get word frequencies on train and keep top V most frequent words

• Replace the remaining words with <unk>. This accounts for Out Of Vocabulary (OOV) words at test time; a small preprocessing sketch follows.
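Sketch added for illustration (not from the slides): build a top-V vocabulary from the training data and map everything else to <unk>. The toy sentences and V are arbitrary.

# Sketch (assumed toy data): keep the top-V training words, map the rest to <unk>.
from collections import Counter

def build_vocab(train_sentences, top_v):
    counts = Counter(w for s in train_sentences for w in s.split())
    return {w for w, _ in counts.most_common(top_v)}

def apply_vocab(sentence, vocab):
    return " ".join(w if w in vocab else "<unk>" for w in sentence.split())

vocab = build_vocab(["the cat sat on the mat", "the dog sat on the cat"], top_v=5)
print(apply_vocab("the dog caught the mouse", vocab))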

Page 76: Discrete and Continuous Language

42

Add One (Laplace) Smoothing

MLE estimate of probabilities:

P(w_i \mid w_{i-1})_{MLE} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

• Pretend that we saw every word once more than we actually did

• Add one to all n-gram counts, seen or unseen

Page 77: Discrete and Continuous Language

42

Add One (Laplace) Smoothing

MLE estimate of probabilities:

P(w_i \mid w_{i-1})_{MLE} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

• Pretend that we saw every word once more than we actually did

• Add one to all n-gram counts, seen or unseen

Add-1 estimate of probabilities:

P(w_i \mid w_{i-1})_{Add-1} = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}

Page 78: Discrete and Continuous Language

43

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

P(caught | cat)_{MLE} = 1/3

P(caught | dog)_{MLE} = 0

Add One (Laplace) Smoothing

Page 79: Discrete and Continuous Language

43

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

P(caught | cat)_{MLE} = 1/3

P(caught | dog)_{MLE} = 0

P(caught | dog)_{Add-1} = \frac{c(dog, caught) + 1}{c(dog) + |V|} = \frac{0 + 1}{1 + 8} = 0.11

Add One (Laplace) Smoothing

Page 80: Discrete and Continuous Language

43

<s> the cat sat on the mat </s>

<s> the dog sat on the cat </s>

<s> the cat caught the mouse </s>

<s> the dog caught the cat </s>

TEST TRAIN

P(caught | cat)_{MLE} = 1/3

P(caught | dog)_{MLE} = 0

P(caught | dog)_{Add-1} = \frac{c(dog, caught) + 1}{c(dog) + |V|} = \frac{0 + 1}{1 + 8} = 0.11

P(caught | cat)_{Add-1} = \frac{c(cat, caught) + 1}{c(cat) + |V|} = \frac{1 + 1}{3 + 8} = 0.18

Add One (Laplace) Smoothing
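A small sketch added here for illustration, reproducing the add-1 numbers above with c(dog) = 1, c(cat) = 3, and |V| = 8.

# Sketch: add-1 (Laplace) smoothed bigram probability.
def p_add1(count_bigram, count_context, vocab_size):
    return (count_bigram + 1) / (count_context + vocab_size)

print(round(p_add1(0, 1, 8), 2))  # P(caught | dog) = 1/9  = 0.11
print(round(p_add1(1, 3, 8), 2))  # P(caught | cat) = 2/11 = 0.18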


Page 82: Discrete and Continuous Language

44

Add One (Laplace) Smoothing

• Add-1 smoothing can smooth excessively.

• Not used for language modeling.

Page 83: Discrete and Continuous Language

45

Good Turing Smoothing

Let n_c be the number of n-grams seen c times (frequency of frequencies).

For every n-gram with count c, the new count is c* = (c + 1) \frac{n_{c+1}}{n_c}

• Move probability mass from n-grams that occur c + 1 times to those that occur c times.

• Move probability from n-grams occurring once to unseen n-grams.

Page 84: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

Page 85: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

Page 86: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

How likely is the next species a trout ?

Page 87: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish
• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

How likely is the next species a trout ? 1/18

Page 88: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish
• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

How likely is the next species a trout ? 1/18

How likely is the next species new ?

Page 89: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish
• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

How likely is the next species a trout ? 1/18

How likely is the next species new ? n_1 / N = 3/18

Page 90: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish
• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

How likely is the next species a trout ? 1/18

How likely is the next species new ? n_1 / N = 3/18

In that case, what is the probability of the next species being trout ?

Page 91: Discrete and Continuous Language

46

Good Turing Smoothing (Josh Goodman)

You’re fishing in a lake and there are 8 species of fish• carp, perch, whitefish, trout, salmon, eel, catfish, bass

You catch 18 fish
• 10 carp, 3 perch, 2 whitefish, 1 trout, salmon and eel

How likely is the next species a trout ? 1/18

How likely is the next species new ? n_1 / N = 3/18

In that case, what is the probability of the next species being trout ? < 1/18

Page 92: Discrete and Continuous Language

47

Good Turing Smoothing (Josh Goodman)

P*_{GT}(things with zero frequency) = n_1 / N = 3/18

c* = (c + 1) \frac{n_{c+1}}{n_c}

Probability of unseen (bass or catfish) | Probability of things seen once

Page 93: Discrete and Continuous Language

47

Good Turing Smoothing (Josh Goodman)

P*_{GT}(things with zero frequency) = n_1 / N = 3/18

c* = (c + 1) \frac{n_{c+1}}{n_c}

Probability of unseen (bass or catfish) | Probability of things seen once

P_{MLE}(bass) = 0/18

P*_{GT}(bass) = 3/18

Page 94: Discrete and Continuous Language

47

Good Turing Smoothing (Josh Goodman)

P*_{GT}(things with zero frequency) = n_1 / N = 3/18

c* = (c + 1) \frac{n_{c+1}}{n_c}

Probability of unseen (bass or catfish) | Probability of things seen once

P_{MLE}(bass) = 0/18

P*_{GT}(bass) = 3/18

P_{MLE}(trout) = 1/18

c*(trout) = (1 + 1) × (1/3) = 2/3

P*_{GT}(trout) = (2/3) / 18 = 1/27
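A sketch added for illustration: the Good-Turing adjusted counts for the fishing example above, computed directly from the catch.

# Sketch: Good-Turing adjusted counts c* = (c+1) * n_{c+1} / n_c for the
# fishing example (10 carp, 3 perch, 2 whitefish, 1 each of trout, salmon, eel).
from collections import Counter

catch = ["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2 + ["trout", "salmon", "eel"]
counts = Counter(catch)
freq_of_freq = Counter(counts.values())   # n_c: how many species were seen c times
N = len(catch)

# Probability mass reserved for unseen species: n_1 / N
print(freq_of_freq[1] / N)                # 3/18

# Adjusted count and probability for a species seen once (e.g. trout)
c = 1
c_star = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
print(c_star, c_star / N)                 # 2/3 and 1/27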

Page 95: Discrete and Continuous Language

48

Backoff and interpolation

If the context is less frequent, then use shorter contexts.

Backoff:

• If n-gram is frequent, use it.

• else, use lower order n-gram.

Interpolation

• Mix all n-gram contexts.

• Works better in practice than backoff.

Page 96: Discrete and Continuous Language

49

Interpolation

Simple interpolation:

P_{interp}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)

where \lambda_1 + \lambda_2 + \lambda_3 = 1

Page 97: Discrete and Continuous Language

49

Interpolation

Simple interpolation:

P_{interp}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)

where \lambda_1 + \lambda_2 + \lambda_3 = 1

Conditioning interpolation weights on context:

P_{interp}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1(w_{i-2}, w_{i-1}) P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2(w_{i-2}, w_{i-1}) P(w_i \mid w_{i-1}) + \lambda_3(w_{i-2}, w_{i-1}) P(w_i)
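A minimal sketch added for illustration: simple interpolation with fixed weights (the example weights and probabilities are arbitrary).

# Sketch: linear interpolation of trigram, bigram, and unigram probabilities
# with fixed weights that sum to 1.
def p_interp(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(p_interp(p_tri=0.0, p_bi=0.2, p_uni=0.05))  # nonzero even when the trigram is unseen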

Page 98: Discrete and Continuous Language

50

Interpolation

• Jelinek-Mercer

• Absolute discounting

• Modified Kneser-Ney

Page 99: Discrete and Continuous Language

51

Performance on the Penn Treebank Dataset

[Chart: perplexity on the Penn Treebank for Good-Turing 3-gram, Good-Turing 5-gram, Kneser-Ney 3-gram, and Kneser-Ney 5-gram models; perplexity axis roughly 140 to 170.]

Page 100: Discrete and Continuous Language

52

Cache Language Models

• Work well for code.

• Intuition is that recently used words (variables) are likely to appear.

On the “Naturalness” of Buggy Code. (Ray et al., 2015)

Page 101: Discrete and Continuous Language

52

Cache Language Models

• Work well for code.

• Intuition is that recently used words (variables) are likely to appear.

P_{CACHE}(w_i \mid history) = \lambda P(w_i \mid w_{i-2}, w_{i-1}) + (1 - \lambda) \frac{c(w_i \in history)}{|history|}

On the “Naturalness” of Buggy Code. (Ray et al., 2015)
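A small sketch added for illustration (the weight, n-gram probability, and code-like history are made up): a cache language model interpolates the n-gram probability with the word's relative frequency in the recent history.

# Sketch: cache language model probability.
def p_cache(p_ngram, word, history, lam=0.8):
    cache_prob = history.count(word) / len(history) if history else 0.0
    return lam * p_ngram + (1 - lam) * cache_prob

history = "int i = 0 ; while ( i <".split()
print(p_cache(p_ngram=0.01, word="i", history=history))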

Page 102: Discrete and Continuous Language

53

Cache Language Models

Stefan Fiott, 2015 (Masters Thesis)

Page 103: Discrete and Continuous Language

54

General approach for training n-gram models

• Collect n-gram counts

• Smooth for unseen n-grams

• Large number of parameters

• Fast probability computation

Page 104: Discrete and Continuous Language

So Far…

[Table: Standard n-gram rated on Model Size, Training Time, and Query Time.]

55

Page 105: Discrete and Continuous Language

Continuous Space Language Models

Page 106: Discrete and Continuous Language

Feed Forward Neural Network Language

Models

Page 107: Discrete and Continuous Language

Motivation for Neural Network Language Models

58

<s> ibm was bought </s>

TEST TRAIN

<s> apple was sold </s>

c(ibm was bought) = 0

Page 108: Discrete and Continuous Language

Motivation for Neural Network Language Models

58

<s> ibm was bought </s>

TEST TRAIN

<s> apple was sold </s>

c(ibm was bought) = 0

bought ≈ sold

ibm ≈ apple

Page 109: Discrete and Continuous Language

Neural Probabilistic Language Models

59

input words

input embeddings

u1 u2

D

Bengio et al., 2003

the cat

P(sat | the cat)

Page 110: Discrete and Continuous Language

Neural Probabilistic Language Models

60

input words

input embeddings

u1 u2

D

C1 C2

C1 D u1

the cat

P(sat | the cat)

Page 111: Discrete and Continuous Language

Neural Probabilistic Language Models

61

input words

input embeddings

u1 u2

D

C1 C2

C1 D u1 + C2 D u2

P(sat | the cat)

the cat

Page 112: Discrete and Continuous Language

Neural Probabilistic Language Models

62

input words

input embeddings

u1 u2

D

C1 C2

C1 D u1 + C2 D u2

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

P(sat | the cat)

the cat

Page 113: Discrete and Continuous Language

Neural Probabilistic Language Models

63

input words

input embeddings

u1 u2

D

C1 C2

M

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

P(sat | the cat)

Page 114: Discrete and Continuous Language

Neural Probabilistic Language Models

64

input words

input embeddings

u1 u2

D

C1 C2

M

hidden: h_2 = \phi(M h_1)

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

P(sat | the cat)

Page 115: Discrete and Continuous Language

Neural Probabilistic Language Models

65

u1 u2

D

C1 C2

M

D'

|V|

input words

input embeddings

hidden: h_2 = \phi(M h_1)

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

p(w \mid u) = \exp(D'_w h_2)

the cat

sat

P(sat | the cat)

Page 116: Discrete and Continuous Language

Neural Probabilistic Language Models

66

u1 u2

D

C1 C2

M

D'

|V|

normalization: Z(u) = \sum_{w' \in V} p(w' \mid u)

input words

input embeddings

hidden: h_2 = \phi(M h_1)

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

p(w \mid u) = \exp(D'_w h_2)

P(sat | the cat)

sat

Page 117: Discrete and Continuous Language

Neural Probabilistic Language Models

67

u1 u2

D

C1 C2

M

D'

|V|

normalization: Z(u) = \sum_{w' \in V} p(w' \mid u)

\phi(x) = \max(0, x)

Nair and Hinton, 2010

input words

input embeddings

hidden: h_2 = \phi(M h_1)

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

p(w \mid u) = \exp(D'_w h_2)

P(sat | the cat)

Page 118: Discrete and Continuous Language

Neural Probabilistic Language Models

68

|V|

normalization: Z(u) = \sum_{w' \in V} p(w' \mid u)

p(w \mid u) = \exp(D'_w h_2)

Page 119: Discrete and Continuous Language

Neural Probabilistic Language Models

69

\log P(w \mid u) = \log \frac{\exp(D'_w h_2)}{Z(u)}
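To make the feed-forward model concrete, here is a sketch added for this write-up. The dimensions are invented for illustration; it follows the layers above: embed the context words, two hidden layers with a rectifier, then a normalized softmax over the vocabulary.

# Sketch (assumed dimensions): forward pass of the feed-forward NPLM.
import numpy as np

rng = np.random.default_rng(0)
V, d, h1_dim, h2_dim, n = 1000, 50, 100, 75, 3   # vocab, embedding, hidden sizes, n-gram order

D  = rng.normal(size=(V, d))                     # input embeddings
C  = [rng.normal(size=(h1_dim, d)) for _ in range(n - 1)]
M  = rng.normal(size=(h2_dim, h1_dim))
Dp = rng.normal(size=(V, h2_dim))                # output embeddings

phi = lambda x: np.maximum(0.0, x)               # max(0, x)

def p_next_word(context_ids):
    h1 = phi(sum(C[j] @ D[u] for j, u in enumerate(context_ids)))
    h2 = phi(M @ h1)
    scores = Dp @ h2
    scores -= scores.max()                       # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                   # Z(u) normalization

probs = p_next_word([12, 7])                     # "the cat" as word ids
print(probs.shape, probs.sum())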

Page 120: Discrete and Continuous Language

Maximum Likelihood Training

70

\theta_{ML} = \arg\max_\theta \log P(w \mid u)


Page 122: Discrete and Continuous Language

71

\theta_{ML} = \arg\max_\theta \log \frac{\exp(D'_w h_2)}{Z(u)}

Maximum Likelihood Training

Page 123: Discrete and Continuous Language

Maximum Likelihood Training

72

w

Page 124: Discrete and Continuous Language

Maximum Likelihood Training

73

w

Page 125: Discrete and Continuous Language

74

• Perform stochastic gradient descent :

• Compute P(w|u) using Forward Propagation

• Compute gradient with Backward Propagation

• Very slow training and decoding times

Maximum Likelihood Training

Page 126: Discrete and Continuous Language

Maximum Likelihood Training

75

observed data: u w

true distribution: P_true(w | u)

training data

The cat sat

Page 127: Discrete and Continuous Language

Noise Contrastive Estimation

76

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat

Page 128: Discrete and Continuous Language

Noise Contrastive Estimation

76

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

Page 129: Discrete and Continuous Language

Noise Contrastive Estimation

77

w

Page 130: Discrete and Continuous Language

Noise Contrastive Estimation

77

k noise samples

w

Page 131: Discrete and Continuous Language

Noise Contrastive Estimation

78

k noise samples

w

Page 132: Discrete and Continuous Language

Noise Contrastive Estimation

79

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

Vaswani, Zhao, Fossum and Chiang, 2013

Page 133: Discrete and Continuous Language

Noise Contrastive Estimation

79

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

P(w is true | u, w)

Vaswani, Zhao, Fossum and Chiang, 2013

Page 134: Discrete and Continuous Language

Noise Contrastive Estimation

79

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

P(w is true | u, w) = \frac{P(w \mid u)}{P(w \mid u) + k\,q(w)}

Vaswani, Zhao, Fossum and Chiang, 2013

Page 135: Discrete and Continuous Language

Noise Contrastive Estimation

79

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

P(w is true | u, w) = \frac{P(w \mid u)}{P(w \mid u) + k\,q(w)}

P(w̄_j is noise | u, w̄_j)

Vaswani, Zhao, Fossum and Chiang, 2013

Page 136: Discrete and Continuous Language

Noise Contrastive Estimation

79

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

P(w is true | u, w) = \frac{P(w \mid u)}{P(w \mid u) + k\,q(w)}

P(w̄_j is noise | u, w̄_j) = \frac{k\,q(\bar{w}_j)}{P(\bar{w}_j \mid u) + k\,q(\bar{w}_j)}

Vaswani, Zhao, Fossum and Chiang, 2013

Page 137: Discrete and Continuous Language

Noise Contrastive Estimation

79

observed data: u w

true distribution: P_true(w | u)

training data

noise distribution: q(w); noise data: u w̄

1/(1 + k)

k/(1 + k)

The cat sat | The cat pig, The cat hat, The cat on, …

P(C = 0 \mid u, w) = \frac{P(w \mid u)}{P(w \mid u) + k\,q(w)}

P(C = 1 \mid u, \bar{w}_j) = \frac{k\,q(\bar{w}_j)}{P(\bar{w}_j \mid u) + k\,q(\bar{w}_j)}

Vaswani, Zhao, Fossum and Chiang, 2013

Page 138: Discrete and Continuous Language

Noise Contrastive Estimation

80

L = \log P(C = 0 \mid u, w) + \sum_{j=1}^{k} \log P(C = 1 \mid u, \bar{w}_j)

= \log \frac{P(w \mid u)}{P(w \mid u) + k\,q(w)} + \sum_{j=1}^{k} \log \frac{k\,q(\bar{w}_j)}{P(\bar{w}_j \mid u) + k\,q(\bar{w}_j)}

Page 139: Discrete and Continuous Language

Noise Contrastive Estimation

81

For each training example (u, w):

generate k noise samples

train model to classify real examples and noise samples

Gutmann and Hyvarinen, 2010, Mnih and Teh, 2012
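A sketch added for illustration of the NCE objective for a single training pair. The uniform noise distribution and the stand-in model scores are assumptions; a real trainer would also compute gradients and update parameters.

# Sketch (assumed toy model): NCE loss for one pair (u, w) with k noise samples from q.
import math, random

def nce_loss(p_model, q, u, w, k, vocab):
    noise = [random.choice(vocab) for _ in range(k)]
    pos = p_model(w, u) / (p_model(w, u) + k * q(w))          # P(true | u, w)
    loss = -math.log(pos)
    for wb in noise:
        neg = k * q(wb) / (p_model(wb, u) + k * q(wb))        # P(noise | u, w̄)
        loss -= math.log(neg)
    return loss

vocab = ["the", "cat", "sat", "pig", "hat", "on", "mat"]
q = lambda w: 1.0 / len(vocab)                                 # uniform noise distribution
p_model = lambda w, u: 0.5 if w == "sat" else 0.05             # stand-in model scores
print(nce_loss(p_model, q, u=("the", "cat"), w="sat", k=3, vocab=vocab))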

Page 140: Discrete and Continuous Language

Advantages of NCE

82

• Much Faster training time.

• You can significantly speed up test time by encouraging the model to have a normalization constant of 1.

Page 141: Discrete and Continuous Language

83

Better perplexity

[Chart: perplexity for Log Bilinear, Kneser-Ney, NPLM with MLE, NPLM with NCE, and RNN models; perplexity axis roughly 120 to 150.]

Page 142: Discrete and Continuous Language

84

NNLM on the Android Dataset

• Dataset of Android Apps

• 11 million tokens

• 90/10 split for Training set and Development

• 50k Vocabulary

• Cross entropy of 2.95 on Dev.

Results from Saheel Godane

Page 143: Discrete and Continuous Language

85

So Far…

[Table: Standard n-gram, NPLM MLE, and NPLM NCE compared on Model Size, Training Time, and Query Time.]

Page 144: Discrete and Continuous Language

86

Faster training times

[Chart: training time in seconds (0 to 4000) versus vocabulary size (10k to 70k) for CSLM and NPLM with k = 1000, 100, and 10 noise samples.]

Page 145: Discrete and Continuous Language

D

input words

input embeddings

C1 C2

M

D'

u1 u2

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

hidden: h_2 = \phi(M h_1)

output: P(w \mid u) = \frac{\exp(D' h_2 + b)}{Z(u)}

|V|

Neural Network Architecture

Sizes: 100k × 150, 1000 × 150, 750 × 1000, 100k × 750

Total: 90.9 × 10^6

Page 146: Discrete and Continuous Language

D

input words

input embeddings

C1 C2

M

D'

u1 u2

hidden: h_1 = \phi\left( \sum_{j=1}^{n-1} C_j D u_j \right)

hidden: h_2 = \phi(M h_1)

output: P(w \mid u) = \frac{\exp(D' h_2 + b)}{Z(u)}

|V|

Neural Network Architecture

Sizes: 100k × 150, 1000 × 150, 750 × 1000, 514k × 750

Total: 310 × 10^6

Page 147: Discrete and Continuous Language

Nearest Neighbors

88

doctor hospital apple bought

physician medical ibm purchased

dentist clinic intel sold

pharmacist hospitals seagate procured

psychologist hospice hp scooped

neurosurgeon mortuary netflix cashed

veterinarian sanatorium kmart reaped

physiotherapist orphanage tivo fetched

Page 148: Discrete and Continuous Language

2-dim projection

89

Slide from David Chiang

Page 149: Discrete and Continuous Language

Perplexity is robust to learning rate

90

[Chart: NPLM perplexities (about 45 to 54) over training epochs 1 to 4 for learning rates 1, 0.5, 0.25, and 0.1.]

Page 150: Discrete and Continuous Language

Other approaches

91

• Class based softmax

• Hierarchical Softmax

• Automatically discovering the right size of the network (Murray and Chiang, 2015)

• NCE has been used in modeling Code and Language (Allamanis et al., 2015)

Page 151: Discrete and Continuous Language

Recurrent Neural Network Language

Models

Page 152: Discrete and Continuous Language

Feed Forward NNLM

93

Context Vector

The Cat

Sat

Page 153: Discrete and Continuous Language

Recurrent NNLMs

94

Context Vector

output

input

Page 154: Discrete and Continuous Language

95

The cat sat on a mat ∩ Context Vector

output

input

Recurrent NNLMs

Page 155: Discrete and Continuous Language

96

The cat sat on a

cat

Context Vector

Recurrent NNLMs

Page 156: Discrete and Continuous Language

96

The cat sat on a

cat sat

Context Vector

Context Vector

Context Vector

on

Context Vector

a

Context Vector

mat

Recurrent NNLMs


Page 158: Discrete and Continuous Language

96

The cat sat on a

cat sat

Context Vector

Context Vector

Context Vector

on

Context Vector

a

Context Vector

mat

… caught a

Context Vector

a

Context Vector

mouse

Recurrent NNLMs

Page 159: Discrete and Continuous Language

Recurrent Neural Networks

97

The cat sat on a

cat sat

Context Vector

Context Vector

Context Vector

on

Context Vector

a

Context Vector

mat

… caught a

Context Vector

a

Context Vector

mouse

Recurrent NNLMs

Page 160: Discrete and Continuous Language

98

on

Recurrent NNLMsP(a)

σ(x)

Page 161: Discrete and Continuous Language

98

on

0.25

-0.1

Recurrent NNLMsP(a)

σ(x)

Page 162: Discrete and Continuous Language

98

on

0.25

-0.1

0.54

0.54

Recurrent NNLMsP(a)

σ(x)

Page 163: Discrete and Continuous Language

99

on

P(a|context) = 0.9

0.25

-0.1

0.149

0.149

Recurrent NNLMs

tanh(x)
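A sketch added for illustration (shapes and word ids are invented): one step of a vanilla recurrent NNLM, where the context vector mixes the current word embedding with the previous hidden state and a softmax gives the next-word distribution.

# Sketch (assumed shapes): one step of a vanilla RNN language model.
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 32, 64
E, W_x, W_h, W_out = (rng.normal(scale=0.1, size=s)
                      for s in [(V, d), (h, d), (h, h), (V, h)])

def rnn_step(word_id, h_prev):
    h_new = np.tanh(W_x @ E[word_id] + W_h @ h_prev)   # context vector
    scores = W_out @ h_new
    probs = np.exp(scores - scores.max())
    return h_new, probs / probs.sum()                   # P(next word | history)

h_state = np.zeros(h)
for wid in [3, 17, 42]:                                  # e.g. "the cat sat"
    h_state, p_next = rnn_step(wid, h_state)
print(p_next.argmax(), p_next.sum())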

Page 164: Discrete and Continuous Language

100

The cat sat on a

Context Vector

Context Vector

Context Vector

Context Vector

Context Vector

Training Recurrent NNLMs

Forward Propagation

P(cat) P(sat) P(on) P(a) P(mat)

TIME

Page 165: Discrete and Continuous Language

101

The cat sat on a

Context Vector

Context Vector

Context Vector

Context Vector

Context Vector

Back Propagation through time

∂logP(a)∂θ

∂logP(on)∂θ

∂logP(sat)∂θ

∂logP(cat)∂θ

∂logP(mat)∂θ

Training Recurrent NNLMs

TIME

Page 166: Discrete and Continuous Language

102

The cat sat on a

Vanishing gradients

Back Propagation through time

∂logP(a)∂θ

∂logP(on)∂θ

∂logP(sat)∂θ

∂logP(cat)∂θ

∂logP(mat)∂θ

Page 167: Discrete and Continuous Language

102

The cat sat on a

Vanishing gradients

Back Propagation through time

∂logP(a)∂θ

∂logP(on)∂θ

∂logP(sat)∂θ

∂logP(cat)∂θ

∂logP(mat)∂θ

\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))

Page 168: Discrete and Continuous Language

102

The cat sat on a

Vanishing gradients

Back Propagation through time

∂logP(a)∂θ

∂logP(on)∂θ

∂logP(sat)∂θ

∂logP(cat)∂θ

∂logP(mat)∂θ

\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))

\sigma(x)(1 - \sigma(x)) \cdot \sigma(x)(1 - \sigma(x)) \cdots
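A small numeric illustration added here (not from the slides): each sigmoid-derivative factor is at most 0.25, so a product of T such factors shrinks at least as fast as 0.25^T, which is why gradients through long time spans vanish.

# Sketch: product of T sigmoid-derivative factors, using the largest possible value.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for T in [1, 5, 10, 20]:
    product = 1.0
    for _ in range(T):
        product *= sigmoid_grad(0.0)   # 0.25, the maximum of sigma'(x)
    print(T, product)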

Page 169: Discrete and Continuous Language

103

Solution: Long Short Term MemoryP(a)

0.25

-0.1

0.149

0.149

on

Page 170: Discrete and Continuous Language

104

Solution: Long Short Term MemoryP(a)

on

ct

ft itc′t

ot

σ σtanh

×

× +ct−1

ht−1 ht−1

ht−1

ct

tanh

× σht

Page 171: Discrete and Continuous Language

104

Solution: Long Short Term MemoryP(a)

on

ct

ft itc′t

ot

σ σtanh

×

× +ct−1

ht−1 ht−1

ht−1

ct

tanh

× σht

LSTM BLOCK

Page 172: Discrete and Continuous Language

105

Forward Propagation in LSTM BlockP(a)

on

ct

ft itc′t

ot

σ σtanh

×

× +ct−1

ht−1 ht−1

ht−1

ct

tanh

× σht

Page 173: Discrete and Continuous Language

106

Forward Propagation in LSTM BlockP(a)

on

ht

ct−1

ht−1

ct

inputs

outputs

Page 174: Discrete and Continuous Language

107

1. Forgetting memoryP(a)

on

ct−1

ht−1

ht

ct

Page 175: Discrete and Continuous Language

107

1. Forgetting memoryP(a)

on

ftσ

ct−1

ht−1

ht

ct

Forget Gate

Page 176: Discrete and Continuous Language

107

1. Forgetting memoryP(a)

on

ftσ

×ct−1

ht−1

ht

ct

Forget Gate

Page 177: Discrete and Continuous Language

108

2. Adding new memoriesP(a)

on

ftσ

×ct−1

ht−1

ct

ht

Page 178: Discrete and Continuous Language

108

2. Adding new memoriesP(a)

on

ft c′tσ tanh

×ct−1

ht−1

ct

ht

Input Gate

Page 179: Discrete and Continuous Language

108

2. Adding new memoriesP(a)

on

ft itc′tσ σtanh

×ct−1

ht−1 ht−1

ct

ht

Input Gate

Page 180: Discrete and Continuous Language

108

2. Adding new memoriesP(a)

on

ft itc′tσ σtanh

×

×ct−1

ht−1 ht−1

ct

ht

Input Gate

Page 181: Discrete and Continuous Language

109

3. Updating the memoryP(a)

on

ft itc′tσ σtanh

×

×ct−1

ht−1 ht−1

ct

ht

Page 182: Discrete and Continuous Language

109

3. Updating the memoryP(a)

on

ft itc′tσ σtanh

×

× +ct−1

ht−1 ht−1

ct

ht

Page 183: Discrete and Continuous Language

109

3. Updating the memoryP(a)

on

ct

ft itc′tσ σtanh

×

× +ct−1

ht−1 ht−1

ct

ht

Cell Value

Page 184: Discrete and Continuous Language

110

4. Calculating hidden stateP(a)

on

ct

ft itc′tσ σtanh

×

× +ct−1

ht−1 ht−1

ct

ht

Page 185: Discrete and Continuous Language

110

4. Calculating hidden stateP(a)

on

ct

ft itc′tσ σtanh

×

× +ct−1

ht−1 ht−1

ct

tanh

ht

Page 186: Discrete and Continuous Language

110

4. Calculating hidden stateP(a)

on

ct

ft itc′t

ot

σ σtanh

×

× +ct−1

ht−1 ht−1

ht−1

ct

tanh

× σht

Output gate

Page 187: Discrete and Continuous Language

111

5. Computing probabilitiesP(a)

on

ct

ft itc′t

ot

σ σtanh

×

× +ct−1

ht−1 ht−1

ht−1

ct

tanh

× σht
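A sketch added for illustration (weights and sizes are invented): the forward pass of one LSTM block following the steps above, from forgetting memory through the output gate.

# Sketch (assumed shapes): forward pass of one LSTM block.
import numpy as np

rng = np.random.default_rng(0)
d, h = 32, 64                                   # input and hidden sizes
W = {g: rng.normal(scale=0.1, size=(h, d + h)) for g in "fico"}
b = {g: np.zeros(h) for g in "fico"}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(W["f"] @ z + b["f"])          # 1. forget old memory
    i_t = sigmoid(W["i"] @ z + b["i"])          # 2. input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      #    candidate memory c'_t
    c_t = f_t * c_prev + i_t * c_tilde          # 3. update the cell
    o_t = sigmoid(W["o"] @ z + b["o"])          # 4. output gate
    h_t = o_t * np.tanh(c_t)                    #    new hidden state
    return h_t, c_t

h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h))
print(h_t.shape, c_t.shape)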

Page 188: Discrete and Continuous Language

112

No more vanishing gradient∂logP(a)∂ct−1

ft

ot

× +ct−1

tanh

×

ht

ct

Page 189: Discrete and Continuous Language

113

No more vanishing gradient

ft

ot

× +ct−1

tanh

×

ht

ct

∂logP(a)∂ct−1

Page 190: Discrete and Continuous Language

114

No more vanishing gradient

ft

ot

× +

tanh

×

ht

ct

∂logP(a)∂ct−1

ft

ot

× +

tanh

×

ht

ct

∂logP(on)∂ct−1

ft

ot

× +ct−1

tanh

×

ht

ct

∂logP(sat)∂ct−1

TIME


Page 192: Discrete and Continuous Language

115

Improved Perplexity on Penn Treebank

[Chart: perplexity on the Penn Treebank for Log Bilinear, Kneser-Ney, NNLM with MLE, NNLM with NCE, RNN, and LSTM models; perplexity axis roughly 110 to 150.]

Zaremba et al., 2014

Page 193: Discrete and Continuous Language

116

LSTM Successes in NLP

•Language Modeling

•Neural Machine Translation

•Natural Language Parsing

•Tagging

Page 194: Discrete and Continuous Language

117

•Language Modeling

•Neural Machine Translation

•Natural Language Parsing

Large Vocabulary LSTMs

Page 195: Discrete and Continuous Language

Visualizing Simple LSTMs

Joint work with Jon May

Page 196: Discrete and Continuous Language

119

Encoder Decoder Framework for Sequence to Sequence Learning

the cat sat on a

cat sat on a mat

Language Modeling: Input: English, Output: English

Page 197: Discrete and Continuous Language

120

Encoder Decoder Framework for Sequence to Sequence Learning

the cat sat on a

cat sat on a mat

Machine Translation: Input: Esperanto, Output: English

acxeto kato sur mato

Page 198: Discrete and Continuous Language

120

Encoder Decoder Framework for Sequence to Sequence Learning

the cat sat on a

cat sat on a mat

Machine Translation: Input: Esperanto, Output: English

acxeto kato sur mato

ENCODER

Page 199: Discrete and Continuous Language

120

Encoder Decoder Framework for Sequence to Sequence Learning

the cat sat on a

cat sat on a mat

Machine Translation: Input: Esperanto, Output: English

acxeto kato sur mato

ENCODER DECODER

Page 200: Discrete and Continuous Language

120

Encoder Decoder Framework for Sequence to Sequence Learning

the cat sat on a

cat sat on a mat

Machine Translation: Input: Esperanto, Output: English

acxeto kato sur mato

ENCODER DECODER

SENTENCEVECTOR

Page 201: Discrete and Continuous Language

121

Visualization Approach

• Train on input/output sequences

• Look at LSTM internals for test input/output sequences

P(a)

ct

ft itc′t

ot

σ σtanh

×

× +ct−1

ht−1 ht−1

ht−1

ct

tanh

× σht

Page 202: Discrete and Continuous Language

122

Counting a Symbol

SENTENCEVECTOR

a a a a a a a a a a

a a

a a a a a a a a a a a a a a a a a a

Input #(a) = Output #(a)

Input: Output:

Page 203: Discrete and Continuous Language

123

Counting a Symbol

SENTENCEVECTOR

a a a a a a a a a a

Input #(a) = Output #(a)

Input: Output:

Page 204: Discrete and Continuous Language

123

Counting a Symbol

SENTENCEVECTOR

a a a a a a a a a a

Input #(a) = Output #(a)

a a a a<s> a a a a<s>

a a a </s>a

Input: Output:

Page 205: Discrete and Continuous Language

Counting:

124

Page 206: Discrete and Continuous Language

Counting:

125

Time

Page 207: Discrete and Continuous Language

Counting:

126

Page 208: Discrete and Continuous Language

Counting:

127

Time

Page 209: Discrete and Continuous Language

128

Counting a or b

SENTENCEVECTOR

Input: Output:a a a a a a a a a a

b ba a a a a a a a a a a a a a a a a a

Input #(a) = Output #(a)

b b b b b b b b b b b b b b b b b b b b b b b b b b

OR

Input #(b) = Output #(b)

Page 210: Discrete and Continuous Language

129

Counting a or b

Counting a

Counting b

Page 211: Discrete and Continuous Language

130

Counting a or b

Counting a Counting b

Time Time

Page 212: Discrete and Continuous Language

131

Counting a or b

Counting a Counting b

Time Time

Page 213: Discrete and Continuous Language

132

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

Page 214: Discrete and Continuous Language

132

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

‘a’ and ‘b’ have flipped embeddings

Page 215: Discrete and Continuous Language

133

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

Counting a

Time Time

Page 216: Discrete and Continuous Language

133

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

p(symbol) ∝ embedding(symbol)hT

Counting a

Time Time

Page 217: Discrete and Continuous Language

133

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

‘a’ regionCounting a

Time Time

Page 218: Discrete and Continuous Language

133

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

‘a’ region </s>Counting a

Time Time

Page 219: Discrete and Continuous Language

134

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

</s>Counting b

Time Time

Page 220: Discrete and Continuous Language

134

Counting a or b

Output embeddings

[Chart: output embedding values from −5.25 to 5.25 for the symbols a, b, and </s>.]

</s>‘b’ regionCounting b

Time Time

Page 221: Discrete and Continuous Language

135

Counting a or b

Counting a‘a’ region </s> ‘a’ region </s>

Time Time

Page 222: Discrete and Continuous Language

136

Counting a or b

Counting b‘b’ region </s> ‘b’ region </s>

Page 223: Discrete and Continuous Language

137

Counting log(a)

SENTENCEVECTOR

Input: Output:a a a a a 5*log(5) number of a’s

a a a a a a a a a 5*log(9) number of a’s

Page 224: Discrete and Continuous Language

138

Counting log(a)

Page 225: Discrete and Continuous Language

139

Counting log(a)

Time

Page 226: Discrete and Continuous Language

140

Counting log(a)

Page 227: Discrete and Continuous Language

141

Counting log(a)

Time

Page 228: Discrete and Continuous Language

Downloadable Tools

Page 229: Discrete and Continuous Language

143

Downloadable Tools

http://www.speech.sri.com/projects/srilm/

Page 230: Discrete and Continuous Language

144

Feed Forward Neural Language Model

http://nlg.isi.edu/software/nplm/

Page 231: Discrete and Continuous Language

145

Yandex: Training RNNs with NCE

https://github.com/yandex/faster-rnnlm

Page 232: Discrete and Continuous Language

146

Training LSTMs

http://nlg.isi.edu/software/EUREKA.tar.gz

https://github.com/isi-nlp/Zoph_RNN

NCE based training

GPU based training


Page 234: Discrete and Continuous Language

148

RNNLM Toolkit

http://rnnlm.org/

Page 235: Discrete and Continuous Language

149

Penne

https://bitbucket.org/ndnlp/penne

Page 236: Discrete and Continuous Language

150

Contributors and Collaborators

Page 237: Discrete and Continuous Language

Thanks!