Deep learning for (more than) speech recognition
IndabaX Western Cape, UCT, Apr. 2018
Herman Kamper
E&E Engineering, Stellenbosch University
http://www.kamperh.com/
Success in automatic speech recognition (ASR)
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
Talk outline
1. State-of-the-art automatic speech recognition (ASR)
2. Examples of non-ASR speech processing (the first rant)
3. Examples of local work (a second rant)
State-of-the-art speech recognition
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Feature extraction for speech processing
[Figure: the waveform is cut into overlapping 25 ms frames with a 10 ms shift; each frame n is represented by a D-dimensional feature vector x(n) = (x1, x2, . . . , xD), giving the sequence X = (x(1), x(2), . . . , x(N)).]
Feature extraction for speech processing
[Figure: spectrograms of the example utterances, frequency (0 to 8000 Hz) against time (s).]
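To make this concrete, here is a minimal sketch of the feature extraction step, assuming Python with librosa (any STFT library would do). The 25 ms window and 10 ms shift come from the slide; the 40 mel bins and the helper name `logmel_frames` are illustrative assumptions.

```python
import numpy as np
import librosa  # assumed available; any STFT implementation works

def logmel_frames(wav_path, sr=16000, n_mels=40):
    """25 ms windows, 10 ms shifts: the X = (x(1), ..., x(N)) on the slide."""
    y, sr = librosa.load(wav_path, sr=sr)
    mels = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        n_mels=n_mels,               # D-dimensional feature vector per frame
    )
    return np.log(mels + 1e-8).T     # shape (N, D): one row per frame x(n)
```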
Name these networks
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Name these networks
[Figure: a feedforward network mapping input x(i) to output y(i).]
Name these networks
Image: http://deeplearning.net/tutorial/lenet.html
Name these networks
p(x(1), x(2), . . . , x(N) | [ih])
Hidden Markov models (HMMs)

W* = argmax_W P(W = w(1), w(2), . . . , w(M) | X = x(1), x(2), . . . , x(N))
   = argmax_W P(W | X)
   = argmax_W ∑_U P(W, U | X)        [ “without” = /w ih th aw t/ ]
   = argmax_W ∑_U p(W, U, X) / p(X)
   = argmax_W ∑_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | U) P(U | W) P(W)

p(X | U): acoustic model    P(U | W): pronunciation dictionary    P(W): language model
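The last two lines replace the sum over pronunciations U with a max, which is exactly what Viterbi decoding computes. A minimal numpy sketch of Viterbi over HMM states (toy log-probabilities assumed; a real recogniser runs this over the composed decoder network):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence: the max in the final approximation above.

    log_init:  (S,)    log initial state probabilities
    log_trans: (S, S)  log transition probabilities
    log_obs:   (N, S)  per-frame log observation likelihoods log p(x(n)|s)
    """
    N, S = log_obs.shape
    delta = log_init + log_obs[0]        # best log score ending in each state
    back = np.zeros((N, S), dtype=int)   # backpointers
    for n in range(1, N):
        scores = delta[:, None] + log_trans  # (previous state, next state)
        back[n] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[n]
    states = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):
        states.append(int(back[n][states[-1]]))
    return states[::-1], float(delta.max())
```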
Hidden Markov models (HMMs)
[Figure: an HMM for the phone [ih], modelling p(X|[ih]) over the frame sequence X = (x(1), . . . , x(6)).]
Speech recognition was performed by combining the acoustic model (thousands of HMM states) with the pronunciation dictionary and language model in a (very big) decoder network (a finite-state machine).
Back to today: End-to-end speech recognition
[Chan et al., arXiv’15]
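A sketch of what “end-to-end” means in practice: an encoder maps feature frames straight to character scores, trained with a sequence-level loss. This toy PyTorch example uses the CTC objective rather than the attention-based decoder of Chan et al.; the `CTCRecogniser` class and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CTCRecogniser(nn.Module):
    """Minimal end-to-end sketch: frames in, per-frame character scores out."""
    def __init__(self, n_feats=40, n_chars=30, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, n_chars + 1)  # +1 for CTC blank

    def forward(self, x):                  # x: (batch, frames, n_feats)
        h, _ = self.encoder(x)
        return self.output(h).log_softmax(dim=-1)

model = CTCRecogniser()
x = torch.randn(4, 200, 40)                # dummy batch of feature sequences
log_probs = model(x).transpose(0, 1)       # CTCLoss wants (frames, batch, chars)
targets = torch.randint(0, 30, (4, 12))    # dummy character targets (no blank)
loss = nn.CTCLoss(blank=30)(
    log_probs, targets,
    input_lengths=torch.full((4,), 200, dtype=torch.long),
    target_lengths=torch.full((4,), 12, dtype=torch.long),
)
loss.backward()
```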
Why did we talk about HMMs?
• Could we use a standard feedforward deep neural network (DNN) for ASR?
• Idea: Use the HMM to obtain frame alignments for the DNN!
• Hybrid model: DNN-HMM
• Can be seen as representation learning trained jointly with a classifier
[Figure: a DNN mapping input frames x(i) to posteriors y(i) over HMM states s1, s2, . . . , s9000.]
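Per frame, the hybrid idea reduces to ordinary classification. A minimal PyTorch sketch, assuming 40-dimensional features spliced over an 11-frame context window and the 9000 tied states from the slide (all sizes illustrative):

```python
import torch
import torch.nn as nn

# A hybrid DNN-HMM acoustic model is, at its core, a per-frame classifier:
# input is a window of acoustic frames, output is a distribution over states.
frame_classifier = nn.Sequential(
    nn.Linear(40 * 11, 2048), nn.ReLU(),  # 11-frame context window, D = 40
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 9000),                # one score per HMM state s1..s9000
)

frames = torch.randn(32, 40 * 11)         # dummy spliced input frames x(i)
state_logits = frame_classifier(frames)   # trained against HMM alignments y(i)
loss = nn.CrossEntropyLoss()(state_logits, torch.randint(0, 9000, (32,)))
loss.backward()
```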
What about convolutional neural networks?
[Figure: the same spectrograms as before, frequency (0 to 8000 Hz) against time (s), viewed as 2-D input images.]
Is end-to-end the best?
• End-to-end models are easier to implement [1]
• But do they give state-of-the-art performance?
• What do you think CLDNN-HMM [2] stands for?
[1] https://github.com/espnet/espnet   [2] [Sainath et al., ICASSP’15]
Summary: Speech recognition is important, but . . .
• A very important engineering endeavour: information access, illiteracy, assistance for the disabled
• But it is more: speech and language make us human
• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models
• And studies of how we perceive the world can tell us something about better engineering decisions
Rant 1: Do we always need/have ASR?
Examples of non-ASR speech processing
What if we do not have supervision?
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours of transcribed speech audio; ∼350M/560M words of text
• Can we do this for all 7000 languages spoken in the world?
• Many of these languages are endangered and unwritten
Example 1: Query-by-example search
Spoken query: [audio]
A useful speech system, not requiring any transcribed speech
[Jansen and Van Durme, Interspeech’12]
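Query-by-example systems of this kind typically score matches by aligning feature sequences with dynamic time warping (DTW), with no transcriptions involved. A minimal numpy sketch of the DTW alignment cost (the cited system uses a more efficient segmental variant; this is only the core idea):

```python
import numpy as np

def dtw_cost(query, search):
    """Dynamic time warping alignment cost between two feature sequences.

    query, search: arrays of shape (frames, dims). Lower cost = better match.
    """
    Q, S = len(query), len(search)
    dist = np.linalg.norm(query[:, None] - search[None, :], axis=-1)
    acc = np.full((Q, S), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(S):
            if i == j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1] / (Q + S)  # length-normalised alignment cost

# Rank utterances by sliding the query over each one and keeping the best cost.
```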
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/
Example 3: Teaching robots to understand speech
[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]
Rant 2: Taking inspiration from humans
Examples of local work
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Can we acquire language from audio alone?
Full-coverage segmentation and clustering
Unsupervised segmental Bayesian model
[Figure: the speech waveform is converted to acoustic frames y1:M; an embedding function fe(·) maps each hypothesised word segment yt1:t2 to a fixed-dimensional embedding xi = fe(yt1:t2); a Bayesian Gaussian mixture model p(xi|h−) clusters these embeddings, coupling acoustic modelling with word segmentation.]
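For the clustering component, here is a minimal sketch of a Bayesian Gaussian mixture over (dummy) acoustic word embeddings, assuming scikit-learn. Note that sklearn fits this variationally, whereas the segmental model above is inferred with Gibbs sampling; the sketch only illustrates the modelling idea.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Dummy stand-ins for the acoustic word embeddings x_i = f_e(y_{t1:t2}).
embeddings = np.random.randn(500, 50)

# A Dirichlet-process prior lets the model switch off unneeded components,
# so the number of discovered "word" clusters is effectively inferred.
bgmm = BayesianGaussianMixture(
    n_components=100,  # upper bound on the number of clusters
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
)
cluster_ids = bgmm.fit_predict(embeddings)  # one cluster label per embedding
```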
Listen to discovered clusters
• Small-vocabulary cluster 45: [audio]
• Large-vocabulary English cluster 1214: [audio]
• Large-vocabulary Xitsonga cluster 629: [audio]
Arrival
Using images for grounding language
Consider images paired with unlabelled spoken captions: [audio]
Map images and speech into common space
[Figure: an image passes through VGG to give yvis; the spoken caption X passes through conv, max-pooling and feedforward layers to give yspch; the two branches are trained on the distance d(yvis, yspch).]
[Harwath et al., NIPS’16]
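A common way to train such a joint space is a margin-based ranking loss: a matched image-caption pair should score higher than mismatched pairs from the same minibatch. A minimal PyTorch sketch with stand-in embeddings (the margin value and all dimensions are assumptions, not those of Harwath et al.):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_net = nn.Linear(1000, 512)  # stand-in for the conv speech branch
y_spch = F.normalize(speech_net(torch.randn(16, 1000)), dim=-1)
y_vis = F.normalize(torch.randn(16, 512), dim=-1)  # stand-in VGG outputs

sim = y_vis @ y_spch.T                       # (16, 16) pairwise similarities
pos = sim.diag().unsqueeze(1)                # matched pairs on the diagonal
hinge = (0.2 + sim - pos).clamp(min=0)       # impostor captions within margin
loss = (hinge * (1 - torch.eye(16))).mean()  # zero out the matched-pair terms
loss.backward()
```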
Visually grounded keyword spotting

Keyword   Example of matched utterance                          Type
beach     a boy in a yellow shirt is walking on a beach . . .   correct
behind    a surfer does a flip on a wave                        mistake
bike      a dirt biker flies through the air                    variant
boys      two children play soccer in the park                  semantic
large     . . . a rocky cliff overlooking a body of water       semantic
play      children playing in a ball pit                        variant
sitting   two people are seated at a table with drinks          semantic
yellow    a tan dog jumping over a red and blue toy             mistake
young     a little girl on a kid swing                          semantic

[Kamper et al., Interspeech’17]
Summary and conclusion
What did we chat about today?
• Supervised speech recognition: From HMMs all the way to CLDNNs
• Structure is still important in speech recognition
• Saw three examples of models that do not require ASR
• Looked at local work taking inspiration from humans
What’s next (specifically for us)?
• Still many, many unsolved core machine learning problems in unsupervised and low-resource speech processing
• Building speech search systems for (South) African languages
• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?
• What can we learn about language acquisition in humans?
• Language acquisition in robots
• Main take-away: Look at machine learning problems from different perspectives and angles
http://www.kamperh.com/
https://github.com/kamperh
Backup slides
Acoustic word embeddings (AWEs)
[Figure: variable-duration segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional acoustic word embeddings x1, x2 ∈ R^D.]
[Levin et al., ASRU’13]
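The simplest embedding function fe(·) in this line of work is uniform downsampling: keep a few evenly spaced frames and stack them, so segments of any duration land in the same fixed-dimensional space. A numpy sketch (frame counts and dimensions are arbitrary):

```python
import numpy as np

def downsample_embedding(segment, n_keep=10):
    """Map a variable-length segment (frames, dims) to a fixed vector.

    Keeps n_keep evenly spaced frames and stacks them: the simplest f_e(.),
    often used as a baseline before learned embedding functions.
    """
    idx = np.linspace(0, len(segment) - 1, n_keep).astype(int)
    return segment[idx].flatten()  # shape: (n_keep * dims,)

x1 = downsample_embedding(np.random.randn(83, 13))  # an 83-frame segment
x2 = downsample_embedding(np.random.randn(45, 13))  # a 45-frame segment
assert x1.shape == x2.shape                         # both now live in R^D
```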
Word similarity Siamese CNN
Use the idea of Siamese networks [Bromley et al., PatRec’93]
[Figure: two CNNs with tied weights map segments Y1 and Y2 to embeddings x1 = f(Y1) and x2 = f(Y2), trained with a distance loss l(x1, x2).]
[Kamper et al., ICASSP’15]
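A minimal PyTorch sketch of the Siamese training idea: one network with tied weights embeds word segments, and the loss pulls same-word pairs together relative to different-word pairs. The triplet margin loss here is a stand-in for the cosine-based hinge loss of the paper, and all sizes are illustrative (the 130-dimensional inputs could be the downsampled embeddings from the previous sketch).

```python
import torch
import torch.nn as nn

# One embedding network: calling it three times is what "tied weights" means.
embed = nn.Sequential(nn.Linear(130, 256), nn.ReLU(), nn.Linear(256, 64))

anchor = embed(torch.randn(32, 130))
same = embed(torch.randn(32, 130))       # another utterance of the same word
different = embed(torch.randn(32, 130))  # an utterance of a different word

# Push same-word pairs closer than different-word pairs by a margin.
loss = nn.TripletMarginLoss(margin=0.5)(anchor, same, different)
loss.backward()
```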
Retrieval in common (semantic) space
[Figure: image embeddings yvis and speech embeddings yspch live in the same D-dimensional space, y ∈ R^D, so retrieval is nearest-neighbour search.]
[Harwath et al., NIPS’16]
Word prediction from images and speech
[Figure: an image passes through VGG to give yvis, a vector of soft word probabilities (e.g. hat: 0.85, man: 0.8, shirt: 0.9); the spoken caption X passes through conv, max-pooling and feedforward layers to give f(X); a loss L trains f(X) to match yvis.]
f(X) ∈ R^W is a vector of word probabilities
I.e., a spoken bag-of-words (BoW) classifier
[Kamper et al., Interspeech’17]
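A minimal PyTorch sketch of such a spoken bag-of-words classifier: convolve over the caption’s feature frames, max-pool over time, and emit a sigmoid per word type, trained against the soft word probabilities from the vision branch. The vocabulary size, layer sizes and the `SpokenBoW` name are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

W = 1000  # vocabulary of word types the visual network can detect

class SpokenBoW(nn.Module):
    """Speech branch: conv over frames, pooled over time, sigmoid per word."""
    def __init__(self, n_feats=40):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, 256, kernel_size=9, padding=4)
        self.head = nn.Linear(256, W)

    def forward(self, x):                # x: (batch, n_feats, frames)
        h = torch.relu(self.conv(x)).max(dim=-1).values  # max-pool over time
        return torch.sigmoid(self.head(h))               # f(X): word probs

model = SpokenBoW()
f_X = model(torch.randn(8, 40, 500))     # dummy spoken captions
y_vis = torch.rand(8, W)                 # soft targets from the VGG branch
loss = nn.BCELoss()(f_X, y_vis)          # train speech to match vision
loss.backward()
```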