Deep learning for (more than) speech recognition
IndabaX Western Cape, UCT, Apr. 2018
Herman Kamper
E&E Engineering, Stellenbosch University
http://www.kamperh.com/
Success in automatic speech recognition (ASR)
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
Talk outline
1. State-of-the-art automatic speech recognition (ASR)
2. Examples of non-ASR speech processing (the first rant)
3. Examples of local work (a second rant)
State-of-the-art speech recognition
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Feature extraction for speech processing
[Figure: the waveform is cut into overlapping 25 ms frames with a 10 ms shift; each frame n is represented by a D-dimensional feature vector x(n) = (x1, x2, . . . , xD), giving the sequence X = (x(1), x(2), . . . , x(N)).]
Feature extraction for speech processing
[Figure: spectrograms of the example utterances, frequency (0 to 8000 Hz) against time (s).]
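To make this concrete, here is a minimal sketch of the feature extraction step, assuming Python with librosa (any STFT library would do). The 25 ms window and 10 ms shift come from the slide; the 40 mel bins and the helper name `logmel_frames` are illustrative assumptions.

```python
import numpy as np
import librosa  # assumed available; any STFT implementation works

def logmel_frames(wav_path, sr=16000, n_mels=40):
    """25 ms windows, 10 ms shifts: the X = (x(1), ..., x(N)) on the slide."""
    y, sr = librosa.load(wav_path, sr=sr)
    mels = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        n_mels=n_mels,               # D-dimensional feature vector per frame
    )
    return np.log(mels + 1e-8).T     # shape (N, D): one row per frame x(n)
```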
Name these networks
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Name these networks
[Figure: a feedforward network mapping input x(i) to output y(i).]
Name these networks
Image: http://deeplearning.net/tutorial/lenet.html
Name these networks
p(x(1), x(2), . . . , x(N) | [ih])
Hidden Markov models (HMMs)

W* = argmax_W P(W = w(1), w(2), . . . , w(M) | X = x(1), x(2), . . . , x(N))
   = argmax_W P(W | X)
   = argmax_W ∑_U P(W, U | X)        [ “without” = /w ih th aw t/ ]
   = argmax_W ∑_U p(W, U, X) / p(X)
   = argmax_W ∑_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | U) P(U | W) P(W)

p(X | U): acoustic model    P(U | W): pronunciation dictionary    P(W): language model
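The last two lines replace the sum over pronunciations U with a max, which is exactly what Viterbi decoding computes. A minimal numpy sketch of Viterbi over HMM states (toy log-probabilities assumed; a real recogniser runs this over the composed decoder network):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence: the max in the final approximation above.

    log_init:  (S,)    log initial state probabilities
    log_trans: (S, S)  log transition probabilities
    log_obs:   (N, S)  per-frame log observation likelihoods log p(x(n)|s)
    """
    N, S = log_obs.shape
    delta = log_init + log_obs[0]        # best log score ending in each state
    back = np.zeros((N, S), dtype=int)   # backpointers
    for n in range(1, N):
        scores = delta[:, None] + log_trans  # (previous state, next state)
        back[n] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[n]
    states = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):
        states.append(int(back[n][states[-1]]))
    return states[::-1], float(delta.max())
```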
Hidden Markov models (HMMs)
[Figure: an HMM for the phone [ih], modelling p(X|[ih]) over the frame sequence X = (x(1), . . . , x(6)).]
Speech recognition was performed by combining the acoustic model (thousands of HMM states) with the pronunciation dictionary and language model in a (very big) decoder network (a finite-state machine).
Back to today: End-to-end speech recognition
[Chan et al., arXiv’15]
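A sketch of what “end-to-end” means in practice: an encoder maps feature frames straight to character scores, trained with a sequence-level loss. This toy PyTorch example uses the CTC objective rather than the attention-based decoder of Chan et al.; the `CTCRecogniser` class and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CTCRecogniser(nn.Module):
    """Minimal end-to-end sketch: frames in, per-frame character scores out."""
    def __init__(self, n_feats=40, n_chars=30, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, n_chars + 1)  # +1 for CTC blank

    def forward(self, x):                  # x: (batch, frames, n_feats)
        h, _ = self.encoder(x)
        return self.output(h).log_softmax(dim=-1)

model = CTCRecogniser()
x = torch.randn(4, 200, 40)                # dummy batch of feature sequences
log_probs = model(x).transpose(0, 1)       # CTCLoss wants (frames, batch, chars)
targets = torch.randint(0, 30, (4, 12))    # dummy character targets (no blank)
loss = nn.CTCLoss(blank=30)(
    log_probs, targets,
    input_lengths=torch.full((4,), 200, dtype=torch.long),
    target_lengths=torch.full((4,), 12, dtype=torch.long),
)
loss.backward()
```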
Why did we talk about HMMs?
• Could we use a standard feedforward deep neural network (DNN) for ASR?
• Idea: Use the HMM to obtain frame alignments for the DNN!
• Hybrid model: DNN-HMM
• Can be seen as representation learning trained jointly with a classifier
[Figure: a DNN mapping input frames x(i) to posteriors y(i) over HMM states s1, s2, . . . , s9000.]
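Per frame, the hybrid idea reduces to ordinary classification. A minimal PyTorch sketch, assuming 40-dimensional features spliced over an 11-frame context window and the 9000 tied states from the slide (all sizes illustrative):

```python
import torch
import torch.nn as nn

# A hybrid DNN-HMM acoustic model is, at its core, a per-frame classifier:
# input is a window of acoustic frames, output is a distribution over states.
frame_classifier = nn.Sequential(
    nn.Linear(40 * 11, 2048), nn.ReLU(),  # 11-frame context window, D = 40
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 9000),                # one score per HMM state s1..s9000
)

frames = torch.randn(32, 40 * 11)         # dummy spliced input frames x(i)
state_logits = frame_classifier(frames)   # trained against HMM alignments y(i)
loss = nn.CrossEntropyLoss()(state_logits, torch.randint(0, 9000, (32,)))
loss.backward()
```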
What about convolutional neural networks?
[Figure: the same spectrograms as before, frequency (0 to 8000 Hz) against time (s), viewed as 2-D input images.]
Is end-to-end the best?
• End-to-end models are easier to implement [1]
• But do they give state-of-the-art performance?
• What do you think CLDNN-HMM [2] stands for?
[1] https://github.com/espnet/espnet   [2] [Sainath et al., ICASSP’15]
Summary: Speech recognition is important, but . . .
• A very important engineering endeavour: information access, illiteracy, assistance for the disabled
• But it is more: speech and language make us human
• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models
• And studies of how we perceive the world can tell us something about better engineering decisions
Rant 1: Do we always need/have ASR?
Examples of non-ASR speech processing
What if we do not have supervision?
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours of transcribed speech audio; ∼350M/560M words of text
• Can we do this for all 7000 languages spoken in the world?
• Many of these languages are endangered and unwritten
Example 1: Query-by-example search
Spoken query: [audio]
A useful speech system, not requiring any transcribed speech
[Jansen and Van Durme, Interspeech’12]
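Query-by-example systems of this kind typically score matches by aligning feature sequences with dynamic time warping (DTW), with no transcriptions involved. A minimal numpy sketch of the DTW alignment cost (the cited system uses a more efficient segmental variant; this is only the core idea):

```python
import numpy as np

def dtw_cost(query, search):
    """Dynamic time warping alignment cost between two feature sequences.

    query, search: arrays of shape (frames, dims). Lower cost = better match.
    """
    Q, S = len(query), len(search)
    dist = np.linalg.norm(query[:, None] - search[None, :], axis=-1)
    acc = np.full((Q, S), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(S):
            if i == j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1] / (Q + S)  # length-normalised alignment cost

# Rank utterances by sliding the query over each one and keeping the best cost.
```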
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/
Example 3: Teaching robots to understand speech
[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]
Rant 2: Taking inspiration from humans
Examples of local work
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Can we acquire language from audio alone?
Full-coverage segmentation and clustering
Unsupervised segmental Bayesian model
[Figure: the speech waveform is converted to acoustic frames y1:M; an embedding function fe(·) maps each hypothesised word segment yt1:t2 to a fixed-dimensional embedding xi = fe(yt1:t2); a Bayesian Gaussian mixture model p(xi|h−) clusters these embeddings, coupling acoustic modelling with word segmentation.]
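For the clustering component, here is a minimal sketch of a Bayesian Gaussian mixture over (dummy) acoustic word embeddings, assuming scikit-learn. Note that sklearn fits this variationally, whereas the segmental model above is inferred with Gibbs sampling; the sketch only illustrates the modelling idea.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Dummy stand-ins for the acoustic word embeddings x_i = f_e(y_{t1:t2}).
embeddings = np.random.randn(500, 50)

# A Dirichlet-process prior lets the model switch off unneeded components,
# so the number of discovered "word" clusters is effectively inferred.
bgmm = BayesianGaussianMixture(
    n_components=100,  # upper bound on the number of clusters
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
)
cluster_ids = bgmm.fit_predict(embeddings)  # one cluster label per embedding
```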
Listen to discovered clusters
• Small-vocabulary cluster 45: [audio]
• Large-vocabulary English cluster 1214: [audio]
• Large-vocabulary Xitsonga cluster 629: [audio]
Arrival
Using images for grounding language
Consider images paired with unlabelled spoken captions: [audio]
Map images and speech into common space
[Figure: an image passes through VGG to give yvis; the spoken caption X passes through conv, max-pooling and feedforward layers to give yspch; the two branches are trained on the distance d(yvis, yspch).]
[Harwath et al., NIPS’16]
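A common way to train such a joint space is a margin-based ranking loss: a matched image-caption pair should score higher than mismatched pairs from the same minibatch. A minimal PyTorch sketch with stand-in embeddings (the margin value and all dimensions are assumptions, not those of Harwath et al.):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_net = nn.Linear(1000, 512)  # stand-in for the conv speech branch
y_spch = F.normalize(speech_net(torch.randn(16, 1000)), dim=-1)
y_vis = F.normalize(torch.randn(16, 512), dim=-1)  # stand-in VGG outputs

sim = y_vis @ y_spch.T                       # (16, 16) pairwise similarities
pos = sim.diag().unsqueeze(1)                # matched pairs on the diagonal
hinge = (0.2 + sim - pos).clamp(min=0)       # impostor captions within margin
loss = (hinge * (1 - torch.eye(16))).mean()  # zero out the matched-pair terms
loss.backward()
```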
Visually grounded keyword spotting

Keyword   Example of matched utterance                          Type
beach     a boy in a yellow shirt is walking on a beach . . .   correct
behind    a surfer does a flip on a wave                        mistake
bike      a dirt biker flies through the air                    variant
boys      two children play soccer in the park                  semantic
large     . . . a rocky cliff overlooking a body of water       semantic
play      children playing in a ball pit                        variant
sitting   two people are seated at a table with drinks          semantic
yellow    a tan dog jumping over a red and blue toy             mistake
young     a little girl on a kid swing                          semantic

[Kamper et al., Interspeech’17]
Summary and conclusion
What did we chat about today?
• Supervised speech recognition: From HMMs all the way to CLDNNs
• Structure is still important in speech recognition
• Saw three examples of models that do not require ASR
• Looked at local work taking inspiration from humans
What’s next (specifically for us)?
• Still many, many unsolved core machine learning problems in unsupervised and low-resource speech processing
• Building speech search systems for (South) African languages
• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?
• What can we learn about language acquisition in humans?
• Language acquisition in robots
• Main take-away: Look at machine learning problems from different perspectives and angles
http://www.kamperh.com/
https://github.com/kamperh
Backup slides
Acoustic word embeddings (AWEs)
[Figure: variable-duration segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional acoustic word embeddings x1, x2 ∈ R^D.]
[Levin et al., ASRU’13]
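The simplest embedding function fe(·) in this line of work is uniform downsampling: keep a few evenly spaced frames and stack them, so segments of any duration land in the same fixed-dimensional space. A numpy sketch (frame counts and dimensions are arbitrary):

```python
import numpy as np

def downsample_embedding(segment, n_keep=10):
    """Map a variable-length segment (frames, dims) to a fixed vector.

    Keeps n_keep evenly spaced frames and stacks them: the simplest f_e(.),
    often used as a baseline before learned embedding functions.
    """
    idx = np.linspace(0, len(segment) - 1, n_keep).astype(int)
    return segment[idx].flatten()  # shape: (n_keep * dims,)

x1 = downsample_embedding(np.random.randn(83, 13))  # an 83-frame segment
x2 = downsample_embedding(np.random.randn(45, 13))  # a 45-frame segment
assert x1.shape == x2.shape                         # both now live in R^D
```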
Word similarity Siamese CNN
Use the idea of Siamese networks [Bromley et al., PatRec’93]
[Figure: two CNNs with tied weights map segments Y1 and Y2 to embeddings x1 = f(Y1) and x2 = f(Y2), trained with a distance loss l(x1, x2).]
[Kamper et al., ICASSP’15]
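A minimal PyTorch sketch of the Siamese training idea: one network with tied weights embeds word segments, and the loss pulls same-word pairs together relative to different-word pairs. The triplet margin loss here is a stand-in for the cosine-based hinge loss of the paper, and all sizes are illustrative (the 130-dimensional inputs could be the downsampled embeddings from the previous sketch).

```python
import torch
import torch.nn as nn

# One embedding network: calling it three times is what "tied weights" means.
embed = nn.Sequential(nn.Linear(130, 256), nn.ReLU(), nn.Linear(256, 64))

anchor = embed(torch.randn(32, 130))
same = embed(torch.randn(32, 130))       # another utterance of the same word
different = embed(torch.randn(32, 130))  # an utterance of a different word

# Push same-word pairs closer than different-word pairs by a margin.
loss = nn.TripletMarginLoss(margin=0.5)(anchor, same, different)
loss.backward()
```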
Retrieval in common (semantic) space
[Figure: image embeddings yvis and speech embeddings yspch live in the same D-dimensional space, y ∈ R^D, so retrieval is nearest-neighbour search.]
[Harwath et al., NIPS’16]
Word prediction from images and speech
[Figure: an image passes through VGG to give yvis, a vector of soft word probabilities (e.g. hat: 0.85, man: 0.8, shirt: 0.9); the spoken caption X passes through conv, max-pooling and feedforward layers to give f(X); a loss L trains f(X) to match yvis.]
f(X) ∈ R^W is a vector of word probabilities
I.e., a spoken bag-of-words (BoW) classifier
[Kamper et al., Interspeech’17]
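A minimal PyTorch sketch of such a spoken bag-of-words classifier: convolve over the caption’s feature frames, max-pool over time, and emit a sigmoid per word type, trained against the soft word probabilities from the vision branch. The vocabulary size, layer sizes and the `SpokenBoW` name are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

W = 1000  # vocabulary of word types the visual network can detect

class SpokenBoW(nn.Module):
    """Speech branch: conv over frames, pooled over time, sigmoid per word."""
    def __init__(self, n_feats=40):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, 256, kernel_size=9, padding=4)
        self.head = nn.Linear(256, W)

    def forward(self, x):                # x: (batch, n_feats, frames)
        h = torch.relu(self.conv(x)).max(dim=-1).values  # max-pool over time
        return torch.sigmoid(self.head(h))               # f(X): word probs

model = SpokenBoW()
f_X = model(torch.randn(8, 40, 500))     # dummy spoken captions
y_vis = torch.rand(8, W)                 # soft targets from the VGG branch
loss = nn.BCELoss()(f_X, y_vis)          # train speech to match vision
loss.backward()
```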