EN.601.467/667 Introduction to Human Language Technology: Deep Learning I (Shinji Watanabe)



Page 1

EN.601.467/667 Introduction to Human Language Technology
Deep Learning I

Shinji Watanabe

1

Page 2

Today’s agenda

• Big impact in HLT technologies including speech recognition
  • Before/after deep neural network
• Brief introduction of deep neural network
• Why deep neural network after 2010?
• Success in the other HLT areas (image and text processing)

2

Page 3

Today’s agenda

• Big impact in HLT technologies including speech recognition
  • Before/after deep neural network
• Brief introduction of deep neural network
• Why deep neural network after 2010?
• Success in the other HLT areas (image and text processing)

No math, no theory, based on my personal experience

3

Page 4

Short bio

• Research interests
  • Automatic speech recognition (ASR), speech enhancement, application of machine learning to speech processing

• Around 20 years of ASR experience since 2001

Page 5

Automatic speech recognition?

5

Page 6

Speech recognition evaluation metric

• Word error rate (WER)
  • Computed with word-by-word edit distance:

    Reference)          I want to go to the Johns Hopkins campus
    Recognition result) I want to go to the 10 top kids campus

  • # insertion errors = 1, # substitution errors = 2, # deletion errors = 0 ➡ edit distance = 3
  • Word error rate (%): edit distance (= 3) / # reference words (= 9) * 100 = 33.3%
• How to compute WERs for languages that do not have word boundaries?
  • Chunking or using character error rate
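As a concrete illustration, here is a minimal sketch of WER as a dynamic-programming edit distance over word sequences (plain Python; the function name word_error_rate is just for this example, not from any toolkit):

```python
# Minimal WER sketch: edit distance over word sequences (illustrative only).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I want to go to the Johns Hopkins campus",
                      "I want to go to the 10 top kids campus"))  # -> 33.3
```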

Page 7

2001: when I started speech recognition….

[Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), plotted on a log scale from 1% to 100%, 1995–2016 (Pallett'03, Saon'15, Xiong'16)]

Page 8

Really bad age….

• No application
• No breakthrough technologies
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 9

Page 10

2001: when I started speech recognition….

[Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), plotted on a log scale from 1% to 100%, 1995–2016 (Pallett'03, Saon'15, Xiong'16)]

Page 11

Now we are at

[Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), 1995–2016 (Pallett'03, Saon'15, Xiong'16); deep learning brings the WER down to 5.9% in 2016]

Page 12

Everything changed

• No application
• No breakthrough technologies
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 13

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 14

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies → deep neural network
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 15

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies → deep neural network
• Everyone outside speech research criticized it… → many people outside speech research know/respect it
• The general public didn't know what speech recognition was

Page 16

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies → deep neural network
• Everyone outside speech research criticized it… → many people outside speech research know/respect it
• The general public didn't know what speech recognition was → now my kids know what I'm doing

Page 17

Page 18

Today’s agenda

• Big impact in HLT technologies including speech recognition
  • Before/after deep neural network
• Brief introduction of deep neural network
• Why deep neural network after 2010?
• Success in the other HLT areas (image and text processing)

18

Page 19

Supervised training

• Classification
  • Phoneme or word recognition given speech
  • Word prediction given history (language modeling)
  • Image classification

• Binary classification
  • Special case of classification problems (only two cases)
  • Segmentation

• Regression
  • Predict a clean speech spectrogram from a noisy speech spectrogram

19

Please list an HLT-related example of each of these problems

Page 20

Supervised training

• Classification (mostly in HLT applications)
  • Phoneme or word recognition given speech
  • Word prediction given history (language modeling)
  • Image classification

• Binary classification
  • Special case of classification problems (only two cases)
  • Segmentation

• Regression
  • Predict a clean speech spectrogram from a noisy speech spectrogram

20

Page 21

Learning-based (data-driven) HLT techniques

21

Page 22

Speech recognition as a classification problem

[Figure: spoken commands such as "Yes", "No", "one", "four" being classified into word labels]

22

Google speech command database https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html

• Use the probability distribution p(Y|X) and pick the most probable word

• Note that this is a very simple problem; real speech recognition has to handle a sequence problem
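A minimal sketch of this "pick the most probable word" step, assuming a hypothetical classifier over a small command vocabulary (PyTorch; the model below is a stand-in for illustration, not the actual system):

```python
import torch
import torch.nn as nn

# Hypothetical word vocabulary for a speech-command classifier.
vocab = ["yes", "no", "one", "four"]

# Stand-in model: a single linear layer mapping a 40-dim feature vector to word scores.
model = nn.Linear(40, len(vocab))

features = torch.randn(1, 40)                    # placeholder for acoustic features X
probs = torch.softmax(model(features), dim=-1)   # p(Y | X) over the vocabulary
predicted = vocab[int(probs.argmax(dim=-1))]     # pick the most probable word
print(predicted, probs)
```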

Page 23

Language model as a classification problem

23

• “I want to go to my office”: correct sentence

• “I want to go to my office”: language modeling predicts “office” given the history

• Classification problem
  • X: “I want to go to my”
  • Y: “office”
  • Use the probability distribution p(Y|X) and pick the most probable word

• p(Y|X) is obtained by a deep neural network

Page 24

Binary classification

• Two-category classification problems

• Paper submission

24

Page 25

Binary classification (we will have more discussion later)

25

from http://cs.jhu.edu/~kevinduh/a/deep2014/140114-ResearchSeminar.pdf

• We can use a linear classifier
  • ○: a1·o1 + a2·o2 + b > 0
  • ×: a1·o1 + a2·o2 + b < 0

• We can also make a probability with the sigmoid function σ
  • p(○ | o1, o2) = σ(a1·o1 + a2·o2 + b)
  • p(× | o1, o2) = 1 − σ(a1·o1 + a2·o2 + b)

• Sigmoid function: σ(x) = 1 / (1 + e^(−x))

• Decision boundary: o1 = −(a2/a1)·o2 − b/a1  (a1: scaling)
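A minimal sketch of this sigmoid-based binary classifier in plain Python (the weights a1, a2, b are made up for illustration):

```python
import math

def sigmoid(x: float) -> float:
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative parameters of the linear classifier a1*o1 + a2*o2 + b.
a1, a2, b = 2.0, -1.0, 0.5

def p_circle(o1: float, o2: float) -> float:
    # p(circle | o1, o2); p(cross | o1, o2) = 1 - p(circle | o1, o2)
    return sigmoid(a1 * o1 + a2 * o2 + b)

print(p_circle(1.0, 0.5))   # > 0.5 -> classify as circle
print(p_circle(-1.0, 2.0))  # < 0.5 -> classify as cross
```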

Page 26

Binary classification

30

• We can use a GMM (although not suitable) to model the data distribution
  • p(o1, o2 | ○) = Σ_k ω_k N(o | μ_k, Σ_k)
  • p(o1, o2 | ×) = Σ_k ω'_k N(o | μ'_k, Σ'_k)

• We can build a classifier from Bayes' theorem (speech recognition before 2010)
  • p(○ | o1, o2) = p(o1, o2 | ○) / (p(o1, o2 | ○) + p(o1, o2 | ×))
  • p(× | o1, o2) = p(o1, o2 | ×) / (p(o1, o2 | ○) + p(o1, o2 | ×))
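A minimal sketch of this GMM-plus-Bayes classifier with made-up mixture parameters (numpy/scipy; everything here is illustrative, not the actual acoustic-model setup):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component GMMs for the two classes: (weight, mean, covariance).
gmm_circle = [(0.6, [0.0, 0.0], np.eye(2)), (0.4, [2.0, 2.0], np.eye(2))]
gmm_cross  = [(0.5, [4.0, 0.0], np.eye(2)), (0.5, [4.0, 4.0], np.eye(2))]

def gmm_likelihood(o, gmm):
    # p(o | class) = sum_k w_k * N(o | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(o, mean=mu, cov=cov) for w, mu, cov in gmm)

def p_circle(o):
    # Bayes rule with equal class priors, as on the slide.
    lc, lx = gmm_likelihood(o, gmm_circle), gmm_likelihood(o, gmm_cross)
    return lc / (lc + lx)

print(p_circle([0.5, 0.5]))  # close to 1 -> circle
print(p_circle([4.0, 2.0]))  # close to 0 -> cross
```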

Page 27

Getting more difficult with the GMM classifier or linear classifier

31

Page 28

Getting more difficult with the GMM classifier or linear classifier

32

Page 29

Neural network

33

• Combination of linear classifiers to classify complicated patterns
• More layers, more complicated patterns

Page 30

Neural network

34

• Combination of linear classifiers to classify complicated patterns
• More layers or more classifiers, more complicated patterns

Page 31

Going deeper -> more accurate

35

Page 32

Neural network used in speech recognition

• Very large combination of linear classifiers

36

[Figure: feed-forward DNN acoustic model. Input: speech features, ~50 to ~1000 dim. About 7 hidden layers with 2048 units each. Output: HMM state or phoneme (a, i, u, w, N, …), 30 to ~10,000 units]
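A minimal PyTorch sketch of such a feed-forward acoustic model, using the rough sizes from the figure (the exact dimensions and the sigmoid nonlinearity are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Roughly the shape described above: feature input, ~7 hidden layers of 2048 units,
# softmax over HMM states / phonemes.
input_dim, hidden_dim, num_layers, num_states = 440, 2048, 7, 9000

layers = []
prev = input_dim
for _ in range(num_layers):
    layers += [nn.Linear(prev, hidden_dim), nn.Sigmoid()]
    prev = hidden_dim
layers.append(nn.Linear(prev, num_states))
dnn = nn.Sequential(*layers)

features = torch.randn(8, input_dim)                     # a mini-batch of 8 frames
state_posteriors = torch.softmax(dnn(features), dim=-1)  # p(state | features)
print(state_posteriors.shape)                            # torch.Size([8, 9000])
```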

Page 33

Difficulties of training

• Blue: accuracy on the training data (higher is better)
• Orange: accuracy on the validation data (higher is better)

37

Which one is better?
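One common answer, shown as a minimal sketch: prefer the model whose validation accuracy is highest (early stopping), even if a later model fits the training data better. The accuracy numbers below are hypothetical placeholders:

```python
# Hypothetical per-epoch accuracies (training keeps rising, validation peaks then drops).
train_acc = [0.60, 0.75, 0.85, 0.92, 0.96, 0.98]
valid_acc = [0.58, 0.70, 0.78, 0.80, 0.79, 0.76]

# Early stopping: keep the model from the epoch with the best *validation* accuracy;
# a model that only looks better on training data is overfitting.
best_epoch = max(range(len(valid_acc)), key=lambda e: valid_acc[e])
print(f"keep model from epoch {best_epoch} "
      f"(train={train_acc[best_epoch]:.2f}, valid={valid_acc[best_epoch]:.2f})")
```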

Page 34

Today’s agenda

• Big impact in HLT technologies including speech recognition
  • Before/after deep neural network
• Brief introduction of deep neural network
• Why deep neural network after 2010?
• Success in the other HLT areas (image and text processing)

38

Page 35

Why neural networks were not the focus

1. Very difficult to train (a small training-loop sketch follows below)
   • Batch? Online? Mini-batch?
   • Stochastic gradient descent
   • Learning rate? Scheduling?
   • What kind of topologies?
   • Large computational cost
2. The amount of training data is very critical
3. CPU -> GPU

39

[Figure: the same DNN acoustic model as before, with speech features (~50 to ~1000 dim) as input, ~7 hidden layers of 2048 units, and 30 to ~10,000 output units (a, i, u, w, N, …)]
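To make the first point concrete, here is a minimal PyTorch sketch of mini-batch stochastic gradient descent with a learning-rate schedule; the random data, batch size, and schedule are placeholder choices, not the settings actually used at the time:

```python
import torch
import torch.nn as nn

# Placeholder data: 1000 random feature vectors with random class labels.
features, labels = torch.randn(1000, 40), torch.randint(0, 10, (1000,))
model = nn.Sequential(nn.Linear(40, 256), nn.Sigmoid(), nn.Linear(256, 10))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)                          # learning rate?
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)   # scheduling?
loss_fn = nn.CrossEntropyLoss()

batch_size = 32                                                                  # mini-batch?
for epoch in range(10):
    perm = torch.randperm(len(features))           # shuffle each epoch
    for i in range(0, len(features), batch_size):
        idx = perm[i:i + batch_size]
        loss = loss_fn(model(features[idx]), labels[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                            # one SGD update
    scheduler.step()                                # decay the learning rate
```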

Page 36

Before deep learning (2002 – 2009)

• The successes of neural networks were from a much earlier period
• People believed that GMMs were better
• Neural networks gave only a very small gain over standard GMMs

40

Page 37

41

[Image from https://en.wikipedia.org/wiki/Geoffrey_Hinton]

Page 38

When I noticed deep learning (2010)

• A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.

• This still did not fully convince me (I introduced it at NTT’s reading group)

42

• Using a deep belief network for pre-training
• Fine-tuning the deep neural network → provides stable estimation

Page 39

Pre-training and fine-tuning

• First, train the neural-network parameters with a deep belief network or an autoencoder
• Then, run standard deep neural network training (fine-tuning; see the sketch below)

43

[Figure: the same DNN as before: speech features (~50 to ~1000 dim) in, ~7 hidden layers of 2048 units, 30 to ~10,000 output units (a, i, u, w, N, …)]
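A minimal sketch of the idea using an autoencoder for layer initialization followed by supervised fine-tuning (the deep-belief-network variant is analogous; this is a simplified illustration on random placeholder data, not the original recipe):

```python
import torch
import torch.nn as nn

feats = torch.randn(1000, 40)  # placeholder unlabeled speech features

# 1) Pre-training: train one hidden layer as an autoencoder (reconstruct the input).
hidden = nn.Linear(40, 256)
decoder = nn.Linear(256, 40)
opt = torch.optim.SGD(list(hidden.parameters()) + list(decoder.parameters()), lr=0.1)
for _ in range(20):
    recon = decoder(torch.sigmoid(hidden(feats)))
    loss = ((recon - feats) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Fine-tuning: reuse the pre-trained hidden layer inside a classifier
#    and train the whole network on labeled data.
labels = torch.randint(0, 10, (1000,))
model = nn.Sequential(hidden, nn.Sigmoid(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(20):
    loss = nn.functional.cross_entropy(model(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```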

Page 40

Interspeech 2011 at Florence

• The following three papers convinced me
  • Feature extraction: Valente, Fabio / Magimai-Doss, Mathew / Wang, Wen (2011): "Analysis and comparison of recent MLP features for LVCSR systems", In INTERSPEECH-2011, 1245-1248.
  • Acoustic model: Seide, Frank / Li, Gang / Yu, Dong (2011): "Conversational speech transcription using context-dependent deep neural networks", In INTERSPEECH-2011, 437-440.
  • Language model: Mikolov, Tomáš / Deoras, Anoop / Kombrink, Stefan / Burget, Lukáš / Černocký, Jan (2011): "Empirical evaluation and combination of advanced language modeling techniques", In INTERSPEECH-2011, 605-608.

• I discussed this potential with the NLP folks at NTT, but they did not believe it (they preferred SVMs and log-linear models)

44

Page 41

Late 2012

• My first deep learning (Kaldi nnet)
  • Kaldi has supported DNNs since 2012 (mainly developed by Karel Vesely)
  • Deep belief network based pre-training
  • Feed-forward neural network
  • Sequence-discriminative training

45

                                           Hub5 '00 (SWB)   WSJ
GMM                                        18.6             5.6
DNN                                        14.2             3.6
DNN with sequence-discriminative training  12.6             3.2

Page 42

46

Page 43

Build speech recognition with public tools and resources

• TED-LIUM (~100 hours)
• LibriSpeech (~1000 hours)

• We can combine Kaldi + DNN + TED-LIUM to build an English speech recognition system on a single machine (a GPU plus a many-core machine)
• Before this, only big companies could do it

47

Page 44

Same things happened in computer vision

48

Page 45

49

[Image from https://en.wikipedia.org/wiki/Geoffrey_Hinton]

Page 46

ImageNet challenge (Large scale data)

50

L. Fei-Fei and O. Russakovsky, Analysis of Large-Scale Visual Recognition, Bay Area Vision Meeting, October 2013

Page 47

ImageNet challenge
AlexNet, GoogLeNet, VGG, ResNet, …

51

[Figure: ImageNet challenge error rate over the years, with a sharp drop marked "Deep learning!"]

Page 48

Same things happened in text processing

• Recurrent neural network language model (RNNLM) [Mikolov+ (2010)]

52

[Figure: RNNLM unrolled over "<s> I want to go to my"; at each step the network reads the current word (w) and predicts the next word (y)]

                       Perplexity
N-gram (conventional)  336
RNNLM                  156
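A minimal PyTorch sketch of an RNN language model in this spirit (toy vocabulary and sizes chosen for illustration; not Mikolov's original implementation):

```python
import torch
import torch.nn as nn

# Toy vocabulary for the running example sentence.
vocab = ["<s>", "I", "want", "to", "go", "my", "office"]
word2id = {w: i for i, w in enumerate(vocab)}

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):
        # word_ids: (batch, time); returns next-word logits at every time step.
        h, _ = self.rnn(self.embed(word_ids))
        return self.out(h)

model = RNNLM(len(vocab))
history = torch.tensor([[word2id[w] for w in ["<s>", "I", "want", "to", "go", "to", "my"]]])
logits = model(history)                          # (1, 7, vocab_size)
next_word = vocab[int(logits[0, -1].argmax())]   # untrained, so this prediction is random
print(next_word)
```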

Page 49

Word embedding example
https://www.tensorflow.org/tutorials/word2vec

53
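A minimal sketch of the kind of geometry word embeddings exhibit, using tiny made-up vectors and cosine similarity (numpy; the vectors are invented for illustration, not trained word2vec embeddings):

```python
import numpy as np

# Invented 3-dim "embeddings" that only illustrate the idea: related words end up
# close together, and analogies look like vector arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen (with real embeddings).
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # "queen" with these toy vectors
```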

Page 50

Neural machine translation (New York Times, December 2016)

54

Page 51

55

from https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html

Page 52

Deep neural network toolkit (2013-)

• Theano
• Caffe
• Torch
• CNTK
• Chainer
• Keras

Later:
• TensorFlow
• PyTorch (used in this course)
• MXNet
• Etc.

56

Page 53

Summary

• Before 2011
  • GMM/HMM, limited performance, a bit boring
  • This is because they are linear models…

• After 2011
  • DNN/HMM
  • Toolkits
  • Large public data
  • GPUs
  • NLP and image/vision also moved to DNNs
  • Always something exciting

57

Page 54

Any questions?

58