TSD 2015: 18th Int. Conf. on Text, Speech and DialogueSep. 14-17, Plzen, Czech Republic
The Statistical Approach to Speech Recognition and Natural LanguageProcessing: Achievements and Open Problems – where do we stand?
Hermann Ney
Human Language Technology and Pattern RecognitionLehrstuhl Informatik 6
RWTH Aachen University, Aachen, Germany
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 1 TSD, Plzen, 14-Sep-2015
My talk is based on joint work with members of my chair (Informatik 6):
• acoustic modelling:
Ralf Schlüter, Zoltan Tüske, Muhammad Ali Tahir, Simon Wiesler, Markus Nussbaum, ...
• language modelling:
Martin Sundermeyer, Kazuki Irie, ...
• conference papers:
INTERSPEECH and ICASSP
• public toolkits with open source code
for speech recognition and neural networks for acoustic and language modelling
webpage: RWTH Aachen, Informatik 6, Software
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 2 TSD, Plzen, 14-Sep-2015
Outline
1 Speech and Language Technology 4
2 Statistical Approach: Principles 8
3 From Generative to Discriminative Modelling 19
4 Language Modelling and ANN 26
5 Conclusions 40
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 3 TSD, Plzen, 14-Sep-2015
1 Speech and Language Technology
Speech Recognition
we want to preserve this great idea
Machine Translation
wir wollen diese große Idee erhalten
we want to preserve this great idea
Text Image Recognition
we want to preserve this great idea
three tasks for machine learning:
– automatic speech recognition
– text image recognition
– machine translation
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 4 TSD, Plzen, 14-Sep-2015
Speech and Language Technology
terminology: tasks in speech and natural language processing (NLP)
• automatic speech recognition (ASR)
• text image recognition (prined and handwritten text, offline)
• machine translation (MT) (of language and speech)
• ...
characteristic properties of these tasks:
• well-defined ’classification’ tasks:
– due to 5000-year history of (written!) language
– well-defined classes: letters or words of the language
• easy task for humans
(at least in their native language!)
• hard task for computers
(as the last 40 years have shown!)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 5 TSD, Plzen, 14-Sep-2015
Projects
activities of RWTH team in joint projects:
• TC-STAR 2004-2007: funded by EU
– first research system for speech-to-speech translation on real-life data (EU parliament)
– partners: KIT Karlsruhe, FBK Trento, LIMSI Paris, UPC Barcelona, IBM-US Research, ...
• GALE 2005-2011: funded by US DARPA
emphasis on Chinese and Arabic speech and text
• BOLT 2011-2015: funded by US DARPA
emphasis on colloquial text for Arabic and Chinese
• QUAERO 2008-2013: funded by OSEO France
European languages, more colloquial speech, handwriting
• BABEL 2012-2017: funded by US IARPA
spoken term detection with noisy and limited training data
• EU projects 2012-2014: EU-Bridge, TransLectures
emphasis on recognition and translation of lectures (academic, TED, ...)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 6 TSD, Plzen, 14-Sep-2015
Speech and Language: Characteristic Properties
typical situation:
input string → output string
tasks:
• speech recognition:
speech signal → string of words/letters
• recognition of image text (printed and written characters):
text image → string of words/letters
• machine translation:
string of source words → string of target words/letters
common property:
output string = string of words/letters in a natural language
terminology:
– compound decision theory
– contextual pattern recognition
– structured output
elementary pattern classification
and machine learning:
single class index
without any structure
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 7 TSD, Plzen, 14-Sep-2015
2 Statistical Approach: Principles
starting points:
• very complex problem:
no perfect knowledge of the dependencies in speech and language:
– different from conventional computer science
– like a problem in natural sciences (e.g. approximative models in physics)
• perfect solution will be difficult:
– we accept that the system will make errors
– but we try to find the best compromise
• fairly general view:
– input sequence over time t: X := x1...xt...xT
– output sequence: W := w1...wn...wN of unknown length N
• we need a generation mechanism:
X → W (X)
• to this purpose, we assume a
– posterior distribution pr(W |X)
– which can be extremely complex: both arguments are strings!
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 8 TSD, Plzen, 14-Sep-2015
Bayes Decision Rule
performance measure or loss function (e. g. edit or Levenshtein distance)
between true output sequence W and hypothesized output sequence W :
L[W, W ]
Bayes decision rule minimizes expected loss:
X → W (X) := argminW
{
∑
W
pr(W |X) · L[W ,W ]}
Under these two conditions:
L[W, W ] : satisfies triangle inequality
maxW
{pr(W |X)} > 0.5
we have [Schlüter et al. 2012]:
X → W (X) := argmaxW
{
pr(W |X)}
Since [IBM 1982], this approximation is widely used
for speech recognition, handwriting recognition, machine translation, ...
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 9 TSD, Plzen, 14-Sep-2015
Speech Recognition: Principles
Problem in Bayes decision rule:
– true posterior distribution: unknown
– to replace it, assume suitable model distributions with free parameters:
p(W |X) =p(W ) · p(X|W )
∑
W p(W ) · p(X|W )
– generative model: language model p(W ) and acoustic model p(X|W )
structure of spoken language implies a hierarchy of levels:
• sentence: concatenation of words W = w1...wn...wN
modelled by (statistical) language model p(W )
• words: concatenation of phonemes (sounds)
modelled by (conventional) pronunciation dictionary
• phonemes (sounds): time signal or waveform
– acoustic (spectral) analysis every 10 msec
– result: sequence of observed acoustic vectors (10-ms events) X = x1...xt...xT
– fundamental problem in modelling p(X|W ): variability in speaking rate
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 10 TSD, Plzen, 14-Sep-2015
Speech Input
AcousticAnalysis
Phoneme Inventory
Pronunciation Lexicon
Language Model
Global Search:
maximize
x1 ... xT
Pr(w1 ... wN) Pr(x1 ... xT | w1...wN)
w1 ... wN
RecognizedWord Sequence
over
Pr(x1 ... xT | w1...wN )
Pr(w1 ... wN)
Statistical Approach to Automatic
Speech Recognition (ASR)
(IBM: Mercer, Bahl, Jelinek 1982)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 11 TSD, Plzen, 14-Sep-2015
Hidden Markov Models (HMM)
• fundamental problem in ASR:
non-linear time alignment
• Hidden Markov Model:
– linear chain of states s = 1, ..., S
– transitions: forward, loop and skip
• trellis:
– unfold HMM over time t = 1, ..., T
– path: state sequence sT1 = s1...st...sT– observations: xT
1 = x1...xt...xT
STA
TE
IN
DE
X
TIME INDEX
2 31 5 64
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 12 TSD, Plzen, 14-Sep-2015
Hidden Markov Models (HMM)
The acoustic model p(X|W ) provides the link between
sentence hypothesis W and observations sequence X = xT1 = x1...xt...xT :
• acoustic probability p(xT1 |W ) using hidden state sequences sT1 :
p(xT1 |W ) =
∑
sT1
p(xT1 , s
T1 |W ) =
∑
sT1
∏
t
[p(st|st−1,W ) · p(xt|st,W )]
• two types of distributions:
– transition probability p(s|s′,W ): not important
– emission probability p(xt|s,W ): key quantity
realized by GMM: Gaussian mixtures models (trained by EM algorithm)
• phonetic labels (allophones, sub-phones): (s,W ) → α = αsW
p(xt|s,W ) = p(xt|αsW )
typical approach: phoneme models in triphone context:
decision trees (CART) for finding equivalence classes
• refinements:
– augmented feature vector: context window around position t
– subsequent LDA (linear discriminant analysis)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 13 TSD, Plzen, 14-Sep-2015
Recognition: From Speech to Characters
image text recognition:
– define vertical slots over horizontal axis
– result: image signal = (quasi) one-dim. structure like speech signal
Language Database Example
French RIMES
Arabic IfN/ENIT
English IAM
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 14 TSD, Plzen, 14-Sep-2015
From Speech Recognition to Machine Translation
En
vert
u de
les
nou
vell
es
pro
posi
tion
s ,
qu
el
est le
cou
t
pre
vu de
ad
min
istr
ati
on et
de
perc
ep
tion de
les
dro
its ?
What
is
the
anticipated
cost
of
administering
and
collecting
fees
under
the
new
proposal
?
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 15 TSD, Plzen, 14-Sep-2015
HMM: Speech Recognition vs. Machine Translation
speech recognition machine translation
Pr(xT1 |T,w) = Pr(fJ
1 |J, eI1) =
∑
sT1
∏
t
[p(st|st−1, Sw, w) p(xt|st, w)]∑
aJ1
∏
j
[p(aj|aj−1, I) p(fj|eaj)]
time t = 1, ..., T source positions j = 1, ..., J
observations xT1 observations fJ
1
with acoustic vectors xt with source words fj
states s = 1, ..., Sw target positions i = 1, ..., I
of word w with target words eI1path: t → s = st alignment: j → i = aj
always: monotone sometimes: montone
transition prob. p(st|st−1, Sw, w) alignment prob. p(aj|aj−1, I)
emission prob. p(xt|st, w) lexicon prob. p(fj|eaj)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 16 TSD, Plzen, 14-Sep-2015
Statistical Approach to Speech Recognition (and other NLP tasks)
TrainingData
TestData
ProbabilisticModels
Performance Measure(Loss Function)
Training Criterion
Bayes Decision Rule(Efficient Algorithm)
Output
ParameterEstimates
Evaluation
Optimization(Efficient Algorithm)
Speech Recognition = Modelling + Statistics + Efficient Algorithms
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 17 TSD, Plzen, 14-Sep-2015
Statistical Approach Revisited
four ingredients:
• performance measure: often edit distance
we have to decide how to judge the quality of the system
• probabilistic models (with a suitable strcture):
to capture the dependencies within and between X and W
– elementary observations: Gaussian mixtures, log-linear models, SVM, ANN, ...
– strings: n-gram Markov models, HMM, RNN, ...
• training criterion:
to learn the free parameters of the models
– ideally should be linked to performance criterion
– might result in complex mathematical optimization (efficient algorithms!)
• Bayes decision rule:
to generate the output word sequence
– combinatorial problem (efficient algorithms)
– should exploit structure of models
examples: DP and beam search, A∗ and heuristic search, ...
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 18 TSD, Plzen, 14-Sep-2015
3 From Generative to Discriminative Modelling
starting point:
• models pθ(W ) and pθ(X|W ) with unknown parameters θ
• training data: set of (audio, sentence) pairs (Xr,Wr), r = 1, ..., R
training criteria in ASR:
• generative model: maximum likelihood (along with EM/Viterbi algorithm):
F (θ) =∑
r
log pθ(Wr, Xr) =∑
r
log pθ(Wr) +∑
r
log pθ(Xr|Wr)
nice property: decomposition into two separate problems:
– language model pθ(W ): without annotation!
– acoustic model pθ(X|W ): with annotation!
• sentence posterior probability (MMI = maximum mutual information)
[1986 Mercer, 1991 Normandin]:
F (θ) =∑
r
log pθ(Wr|Xr) pθ(W |Xr) :=pθ(W ) pθ(Xr|W )
∑
W ′ pθ(W ′) pθ(Xr|W ′)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 19 TSD, Plzen, 14-Sep-2015
• (C.-H. Lee’s) MCE: minimum classification error
(old concept in pattern recognition) with smoothing parameter β
F (θ) =∑
r
1
1 +( pθ(Xr,Wr)
∑
W 6=Wr
pθ(Xr,W )
)2β
• (Povey’s) MWE/MPE: minimum word/phoneme error (= expected ’accuracy’):
F (θ) =∑
r
∑
W
pθ(W |Xr) · A(W,Wr)
with the accuracy A(W,Wr) of hypothesis W for correct sentence Wr:
= count of correct frame labels (e.g. phoneme: 1 out of 50)
remarks:
– initialization by maximum likelihood
– complex optimization problem: sum over all sentences in denominator
– approximation: word lattice, many shortcuts, ...
– experiments: relative improvement by 5-10% over maximum likelihood
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 20 TSD, Plzen, 14-Sep-2015
HMM and EM Revisited
baseline training of HMM:
– maximum likelihood by EM (expectation/maximization) algorithm
– looks like the ultimate and perfect solution
positive properties:
• FULL generative model: pθ(W,X) = pθ(W ) · pθ(X|W )
along with HMM for pθ(X|W ): describes the problem completely
• natural training criterion:
– maximum likelihood, i.e. maxθ
{∑
r log pθ(Wr, Xr)}
– virtually closed form solutions by EM algorithm
– nice from the mathematical point of view
negative properties:
• EM or maximum likelihood criterion
– solves a problem that is more complex than required, i.e. pθ(W,X) vs. pθ(W |X)
– VERY hard from the estimation point view
• well-known in classical pattern recognition, but ignored/overlooked in ASR:
density estimation, i.e. learning pθ(X|W ) or pθ(xt|α), is much harder than
classification, i.e. learning pθ(W |X) or pθ(α|xt)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 21 TSD, Plzen, 14-Sep-2015
Acoustic Modelling and ANN: Hybrid Approach
consider modelling the acoustic vector xt in an HMM:
• re-write the emission probability for label α and acoustic vector xt:
p(xt|α) = p(xt) ·p(α|xt)
p(α)
– prior probability p(α): estimated as relative frequencies
– for recognition purposes: the term p(xt) can be dropped
• result: model the label posterior probability by an ANN:
xt → p(α|xt)
rather than the state emission distribution p(xt|α)
• justification:
– easier learning problem: labels α = 1, ..., 5000 vs. vectors xt ∈ IRD=40
– well-known result in pattern recognition (but ignored in ASR!)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 22 TSD, Plzen, 14-Sep-2015
History: ANN in Acoustic Modelling
approaches in ASR:
• 1989 Waibel, Hanazawa, Hinton, Shikano, Lang:
Phoneme recognition using time-delay neural networks
• 1990 Bourlard & Wellekens:
– for squared error criterion, ANN outputs can be interpreted as
class posterior probabilities (rediscovered: Patterson & Womack 1966)
– they advocated the use of MLP outputs
to replace the emission probabilities in HMMs
• 1990 Bridle:
softmax operation for probability normalization in output layer
• 1994 Robinson: recurrent neural network
– competitive results on WSJ task
– work remained a singularity in ASR
• ...
experimental situation:
until recently, ANNs were never really competitive with GMMs
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 23 TSD, Plzen, 14-Sep-2015
History: ANN in Acoustic Modelling
related approaches:
• 1996 Fritsch, Finke, Waibel:
Hierarchical mixtures of experts
• 1997 Hochreiter & Schmidhuber:
Long short-term memory neural computation
• 1998 Yann LeCun: convolutional neural networks
(second) renaissance: concepts of deep learning:
• 2000 Hermansky & Ellis: tandem approach: ANN for feature extraction in ASR
• 2002 Utgoff & Stracuzzi: many-layered learning for symbolic processing
• 2006 Hinton: introduced what he called deep learning (belief nets)
• 2011 Seide, Deng, Yu (Microsoft):
– combined Hinton’s deep learning with hybrid approach
– significant improvement by deep MLP on a large-scale task
• since 2012: other teams confirmed reductions of WER by 20% to 30%
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 24 TSD, Plzen, 14-Sep-2015
(Simplified) Summary: What is Different Now after 25 Years?
comparison: today’s systems vs. 1989-1994:
• number of hidden layers: 10 (or more)
rather than 2-3
• number of output nodes: 5000 (or more)
rather than 50
• optimization strategy:
practical experience and heuristics,
e.g. layer-by-layer pretraining
• computation power: much more
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 25 TSD, Plzen, 14-Sep-2015
4 Language Modelling and ANN
so far: sub-symbolic processing:
speech/audio, text images, image/video (computer vision)
now: symbolic processing for language modelling (and translation):
• 1989 M. Nakamura & K. Shikano:
English word category prediction based on neural networks.
• 1997 J.M. Castro, F. Casacuberta, F. Prat:
Towards connectionist language models.
• 1997 A. Castano, F. Casacuberta:
A connectionist approach to machine translation.
• 2007 H. Schwenk:
Continuous space language models.
• 2006 H. Schwenk, M.R. Costa-jussa, J.A.R. Fonollosa:
Smooth bilingual n-gram translation.
• 2012 H. Son Le, A. Allauzen, F. Yvon:
Continuous space translation models with neural networks.
today: ANNs in language modelling/translation show competitive results.
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 26 TSD, Plzen, 14-Sep-2015
Review: Language Modelling
language model: word sequence wN1 := w1...wn...wN
language model: conditional probability p(wn|wn−10 )
p(wN1 ) =
N∏
n=1
p(wn|wn−10 )
with artificial start symbol w0
approaches to modelling p(wn|wn−10 )
• count models (Markov chain):
– limit history wn−10 to k predecessor words
– smooth relative frequencies (e.g. SRI toolkit)
• MLP models:
– limit history too
– use predecessor words as input to MLP
• RNN models: unlimited history!
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 27 TSD, Plzen, 14-Sep-2015
ANN for Language Modelling
structure of ANN for language modelling:
• input layer: k predecessor words with
1-of-V coding (V = vocabulary size)
• first layer: projection layer
– idea: dimension reduction (e.g. from 150k to 600!)
– a linear operation (matrix multiplication) without sigmoid activation
– shared accross all predecessor words of the history h
• output layer: conditional probability of language model p(w|h):
softmax operation for normalization
• training criterion:
– perplexity: equivalent to cross-entropy
– early stopping using cross-validation on dev corpus
• properties of softmax operation:
– computationally expensive (sum over full vocabulary)
– remedy: word classes (automatically trained)
– normalized outputs of softmax fit nicely into perplexity criterion
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 28 TSD, Plzen, 14-Sep-2015
MLP without and with Word Classes: Two Predecessor Words
factorization of conditional language model probability p(w|h) for each history h:
p(w|h) = p(g|h) · p(w|g, h)
using a unique word classe g for each word h
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 29 TSD, Plzen, 14-Sep-2015
RNN without and with Word Classes
ANN with memory for sequence processing:
left-to-right processing of word sequence w1...wn....wN :
p(wN1 ) =
∏
n
p(wn|wn−10 ) =
∏
n
p(wn|wn−1, hn−1)
with output hn−1 of hidden layer at position (n − 1): hn−1 = hn−1(wn−21 )
input to RNN: immediate predecessor word wn−1
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 30 TSD, Plzen, 14-Sep-2015
LSTM RNN
[LSTM: Hochreuter & Schmidhuber 1997]
refinement of RNN:
LSTM = long short-term memory
– RNN: problems with vanishing gradients
– remedy: cells with gates rather than nodes
– details: see literature
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 31 TSD, Plzen, 14-Sep-2015
Experimental Conditions
[Sundermeyer et al., IEEE ASL, March 2015]
• results on QUAERO English 2013 (like before):
– vocabulary size: 150k words
– training text: 50M words
– dev and eval sets: 39k and 35k words
• MLP: structure:
– projection layer: 300 nodes
– hidden layer: 600 nodes
– size of MLP is dominated
by input and output layers:
150k · 300 + 600 · 150k = 135M
• RNN (and LSTM RNN): structure
– projection and hidden layer: each 600 nodes
– size of RNN is dominated
by input and output layers:
150k · 600 + 600 · 150k = 180M
perplexity PPL on dev data:
approach PPL
count model 163.7
10-gram MLP 136.5
RNN 125.2
LSTM-RNN 107.8
10-gram MLP with 2 layers 130.9
LSTM-RNN with 2 layers 100.5
observation:
(huge) improvement by 40%
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 32 TSD, Plzen, 14-Sep-2015
Complexity: Computation Times
Training times (without GPUs!) for training corpus of 50 Million words:
Models PPL CPU Time (Order)
Count model 163.7 30 min
MLP 136.5 1 week
LSTM-RNN 107.8 3 weeks
• problem: high computation times
• remedy: two types of language models:
– count model: trained on a huge corpus: 3.1 Billion words
– ANN models: trained on a small corpus: 50 Million words
• resulting language model:
linear interpolation of TWO models
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 33 TSD, Plzen, 14-Sep-2015
Interpolated Language Models: Perplexity and WER
• linear interpolation of TWO models: count model + ANN model
• perplexity and word error rate on test data:
Models PPL WER[%]
count model 131.2 12.4
+ 10-gram MLP 112.5 11.5
+ Recurrent NN 108.1 11.1
+ LSTM-RNN 96.7 10.8
+ 10-gram MLP with 2 layers 110.2 11.3
+ LSTM-RNN with 2 layers 92.0 10.4
• experimental result:
– significant improvements by ANN language models
– best improvement in perplexity: 30% reduction (from 131 to 92)
– empirical observation:
power law between WER and perplexity (cube to square root)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 34 TSD, Plzen, 14-Sep-2015
Plot: Perplexity vs. Word Error Rate
10.4
10.8
11.2
11.6
12
12.4
90 100 110 120 130
Wor
d E
rror
Rat
e (%
)
Perplexity
Count-based
+ Feedforward
+ RNN
+ LSTM
Regression
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 35 TSD, Plzen, 14-Sep-2015
Extended Range: Perplexity vs. Word Error Rate
11 12
14
16 18 20 22
25 28 31
100 125 160 200 250 315 400 500 630 800 1000 1250 1600 2000
Wor
d E
rror
Rat
e (%
)
Perplexity
Count-based
+ Feedforward
+ RNN
+ LSTM
Regression
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 36 TSD, Plzen, 14-Sep-2015
Plot: Perplexity vs. Character Error Rate
6.5
6.8
7
7.2
7.4
7.6
90 100 110 120 130
Cha
ract
er E
rror
Rat
e (%
)
Perplexity
Count-based
+ Feedforward
+ RNN
+ LSTM
Regression
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 37 TSD, Plzen, 14-Sep-2015
Extended Range: Perplexity vs. Character Error Rate
6.5 7
7.5 8
9
10
12
14
16
18.5
100 125 160 200 250 315 400 500 630 800 1000 1250 1600 2000
Cha
ract
er E
rror
Rat
e (%
)
Perplexity
Count-based
+ Feedforward
+ RNN
+ LSTM
Regression
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 38 TSD, Plzen, 14-Sep-2015
Overall Improvements by ANNs: Experimental Results
conditions:
– QUAERO English Eval 2013
– baseline system: competitive with LIMSI and KIT
– acoustic input features: optimized for each model
Language Model PP Acoustic Model WER[%]
Count Fourgram 131.2 Gaussian Mixture 19.2
deep MLP 10.7
+ LSTM-RNN 92.0 Gaussian Mixture 16.5
deep MLP 9.3
remarks:
– overall improvements by ANNs: 50% (from 19.2% to 9.3%)
– lion’s share of improvement: acoustic model
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 39 TSD, Plzen, 14-Sep-2015
5 Conclusions
many tasks in speech (and other NLP) technology:
mapping from input string to output string
• statistical approach: four key ingredients
– choice of performance measure: errors at string, word, phoneme, frame level
– probabilistic models at these levels and the interaction between these levels
– training criterion along with an optimization algorithm
– Bayes decision rule along with an efficient implementation
• about recent work on artificial neural nets:
– they result in significant improvements
– they provide one more type of probabilistic models
– they are PART of the statistical approach
• specific future challenges for statistical approach in general:
– complex mathematical model that is difficult to analyze
– questions: can we find suitable mathematical approximations
with more explicit descriptions of the dependencies and level interactions
and of the performance criterion (error rate)?
• specific challenges for ANNs:
– can the HMM-based alignment mechanism be replaced?
– can we find ANNs with more explicit probabilistic structures?
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 40 TSD, Plzen, 14-Sep-2015
Statistical Approach to Speech Recognition (and other NLP tasks)
TrainingData
TestData
ProbabilisticModels
Performance Measure(Loss Function)
Training Criterion
Bayes Decision Rule(Efficient Algorithm)
Output
ParameterEstimates
Evaluation
Optimization(Efficient Algorithm)
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 41 TSD, Plzen, 14-Sep-2015
THE END
H. Ney: Statistical Approach to ASR and NLP c©RWTH Aachen 42 TSD, Plzen, 14-Sep-2015
The Blackslide
GoBack