
Page 1:

Recurrent Neural Network Based Language Model

Author: Tomáš Mikolov et al., Johns Hopkins University, USA

Presented by: Vicky Xuening Wang

ECS 289G, Nov 2015, UC Davis

1

Tomáš Mikolov¹,², Martin Karafiát¹, Lukáš Burget¹, Jan “Honza” Černocký¹, Sanjeev Khudanpur²

Page 2:

2

Language Model Tasks

Page 3:

• Statistical/Probabilistic Language Models

• Goal: compute the probability of a sentence or sequence of words:

• P(W) = P(w1, w2, w3, …, wn)

• Related task: predict the probability of an upcoming word:

• P(wn | w1, w2, …, wn−1)

3

Introduction - Language model

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

Page 4:

• Chain rule of probability

• Markov assumption

• N-gram model

4

Introduction

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
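These three bullets were equations in the original slides; a standard reconstruction: the chain rule factors the sentence probability into per-word conditionals, and the Markov assumption truncates each history to the last N − 1 words, which is exactly the N-gram model.

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$$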

Page 5:

• Typical tasks:

• Machine Translation: P(high winds tonite) > P(large winds tonite)

• Spell Correction: P(about fifteen minutes from) > P(about fifteen minuets from)

• Speech Recognition: P(I saw a van) >> P(eyes awe of an)

• Summarization, question answering, etc.

5

Introduction - LM tasks

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

Page 6:

6

Introduction - Bigram model

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

Maximum Likelihood Estimation
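The MLE formula on this slide was an image; for a bigram model it is P(wi | wi−1) = count(wi−1, wi) / count(wi−1). A minimal Python sketch (function and variable names are illustrative):

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Maximum likelihood bigram estimates:
    P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # context counts (everything but </s>)
        bigrams.update(zip(tokens, tokens[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = train_bigram_mle(["i saw a van", "i saw a cat"])
print(probs[("i", "saw")])  # 1.0
print(probs[("a", "van")])  # 0.5
```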

Page 7:

7

Introduction - Perplexity

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

Lower is better!
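Concretely, perplexity is the inverse probability of the test set, normalized by the number of words: PPL(W) = P(w1 … wN)^(−1/N). A small sketch building on the bigram estimator above (no smoothing, so any unseen test bigram would fail):

```python
import math

def perplexity(test_sentences, bigram_probs):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_{i-1})); lower is better."""
    log_prob, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for bg in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_probs[bg])  # KeyError on unseen bigrams
            n += 1
    return math.exp(-log_prob / n)

print(perplexity(["i saw a van"], probs))  # reuses `probs` from the sketch above
```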

Page 8:

8

Introduction - WER

Lower is better!
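The definition on this slide was also an image; the standard word error rate counts substitutions S, deletions D, and insertions I of the recognizer output against the N words of the reference transcript:

$$\mathrm{WER} = \frac{S + D + I}{N}$$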

Page 9:

• Recurrent Neural Network based language model (RNN-LM) outperforms standard backoff N-gram models

• Words are projected into low dimensional space, similar words are automatically clustered together.

• Smoothing is solved implicitly.

• Backpropagation is used for training.

9

Overview

Page 10:

10

Fixed-length

Page 11:

11

Page 12:

12

• Input layer x

• Hidden/context layer s

• Output layer y
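For reference, the update equations from the paper: the input x(t) concatenates the current word vector w(t) with the previous context s(t − 1); the hidden layer applies a sigmoid f and the output layer a softmax g.

$$x(t) = [\,w(t);\; s(t-1)\,], \qquad s_j(t) = f\Big(\sum_i x_i(t)\, u_{ji}\Big), \qquad y_k(t) = g\Big(\sum_j s_j(t)\, v_{kj}\Big)$$

where f(z) = 1/(1 + e^{−z}) and g is the softmax over the vocabulary.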

Page 13:

13

Model Description - RNN Cont’d

• An RNN can be seen as a chain of NNs

• Intimately related to sequences and lists

• In the last few years, RNNs have been successfully applied to: speech recognition, language modeling, translation, image captioning…
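A minimal numpy sketch of one forward step of the model above. The paper’s single input matrix is split here into a word part U and a recurrent part W (mathematically equivalent to concatenating the inputs); sizes and initializations are illustrative, not the paper’s.

```python
import numpy as np

def rnn_lm_step(w_onehot, s_prev, U, W, V):
    """One forward step: new hidden state from the current word and the
    previous state, then a softmax distribution over the next word."""
    s = 1.0 / (1.0 + np.exp(-(U @ w_onehot + W @ s_prev)))  # sigmoid hidden layer s(t)
    z = V @ s
    y = np.exp(z - z.max())                                 # numerically stable softmax
    return s, y / y.sum()

vocab, hidden = 10, 4
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (hidden, vocab))   # word -> hidden weights
W = rng.normal(0, 0.1, (hidden, hidden))  # recurrent hidden -> hidden weights
V = rng.normal(0, 0.1, (vocab, hidden))   # hidden -> output weights
w = np.zeros(vocab); w[3] = 1.0           # one-hot vector for the current word
s, p = rnn_lm_step(w, np.full(hidden, 0.1), U, W, V)
print(p.sum())                            # probabilities sum to 1.0
```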

Page 14:

14

RNN vs. FF

• Parameters to tune or select:

• RNN

• Size of hidden layer

• FF

• size of layer that projects words to low dimensional space

• size of hidden layer

• size of context-length

Page 15:

15

RNN vs. FF

• In feedforward networks, history is represented by a context of N − 1 words; it is limited in the same way as in N-gram backoff models.

• In recurrent networks, history is represented by neurons with recurrent connections, so the history length is unlimited.

• Also, recurrent networks can learn to compress the whole history into a low-dimensional space, while feedforward networks compress (project) just a single word.

• Recurrent networks can form a short-term memory, so they can better deal with position invariance; feedforward networks cannot.

Page 16:

16

Comparison of models

Simple experiment on 4M words from the Switchboard corpus (PPL, lower is better):

Model                  PPL
KN 5-gram (baseline)   93.7
FF                     85.1
RNN                    80
4×RNN + KN5            73.5

Page 17:

17

Model setting

• Standard backpropagation algorithm + SGD

• Train in several epochs:

• α = 0.1

• if the log-likelihood of the validation data increases, continue

• else set α ← 0.5·α and continue

• terminate if there is no significant improvement

• Convergence is usually reached after 10-20 epochs (a sketch of this schedule follows)
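A hypothetical sketch of this schedule; `model` and its methods are invented placeholders for illustration, not the paper’s toolkit API.

```python
def train_rnn_lm(model, train_data, valid_data, alpha=0.1, min_gain=1e-3, max_epochs=50):
    """Start with alpha = 0.1; once validation log-likelihood stops improving
    significantly, halve alpha each epoch; terminate when it stalls again."""
    best = model.validation_log_likelihood(valid_data)   # hypothetical method
    halving = False
    for epoch in range(max_epochs):
        model.train_one_epoch(train_data, alpha)         # SGD + backpropagation
        ll = model.validation_log_likelihood(valid_data)
        if ll - best <= min_gain:                        # no significant improvement
            if halving:
                break                                    # stalled twice: terminate
            halving = True
        best = max(best, ll)
        if halving:
            alpha *= 0.5                                 # alpha <- 0.5 * alpha
    return model
```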

Page 18:

18

Page 19:

19

Model setting - Optimization

• Rare token: merge all words occurring less often than a threshold in the training data into a single rare token, whose probability is distributed uniformly among them (see the formula below)
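Reconstructing the paper’s formulation: the output probability assigned to the rare token is shared uniformly among the C_rare vocabulary words below the threshold.

$$P(w_i(t+1) \mid w(t), s(t-1)) = \begin{cases} \dfrac{y_{\text{rare}}(t)}{C_{\text{rare}}} & \text{if } w_i \text{ is rare} \\ y_i(t) & \text{otherwise} \end{cases}$$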

Page 20:

20

Experiments

• WSJ (source: read text only)

• training corpus consists of 37M words

• baseline KN5: modified Kneser-Ney smoothed 5-gram

• RNN LMs: trained on a 6.4M-word subset (300K sentences)

• combined model: 0.75 RNN + 0.25 backoff (see the interpolation formula below)

• NIST RT05 (115 hours of meeting speech + web data)

• baseline trained on more than 1.3G words

• RNN LMs: trained on a 5.4M-word subset
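The combination is a simple linear interpolation of the two models’ word probabilities, with the weights given above:

$$P(w \mid h) = 0.75\, P_{\text{RNN}}(w \mid h) + 0.25\, P_{\text{backoff}}(w \mid h)$$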

Page 21:

21

Best perplexity result: 112, for a mixture of static and dynamic RNN LMs with a larger learning rate of 0.3.

[Results table: ~50% perplexity reduction; 18% WER reduction]

Page 22:

22

[Results table: 12% improvement]

Page 23:

23

• RNNs are trained only on in-domain data (5.4M words)

• The RT05 and RT09 baseline LMs are trained on more than 1.3G words

Page 24:

24

Summary

• The RNN LM is conceptually simple yet powerful.

• RNN LMs can be competitive with backoff LMs that are trained on much more data.

• Results show interesting improvements both for ASR and MT.

• A simple toolkit has been developed that can be used to train RNN LMs.

• This work provides a clear connection between machine learning, data compression, and language modeling.

Page 25:

25

Future work

• Clustering of the vocabulary to speed up training

• Parallel implementation of the neural network training algorithm

• Online learning and dynamic models

• BPTT algorithm for large amounts of training data

• Go beyond BPTT? LSTM

• Extensions to OCR, data compression, cognitive science…

Page 26:

26

Page 27:

–Xuening

Thanks!

27