The Main Concepts of Speech Recognition
by Johnny Coldsheep Yang
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
What is Speech Recognition?
● Input: a speech signal; output: a text transcription.
● Other names:
○ Automatic Speech Recognition (ASR)
○ Large Vocabulary Continuous Speech Recognition (LVCSR)
○ Speech to Text (STT)
● So, how do we do that with machines?
How do humans do the same task?
[Diagram: Pronunciation? → Candidate words? → Which word might come next? → Final text transcription]
How do humans do the same task?
[Diagram: the same steps mapped to ASR components. Pronunciation? → Acoustic Model; Candidate words? → Lexicon; Which word might come next? → Language Model; Final text transcription → Decoder]
The Basic ASR System
[Diagram: the system, split into an offline part and an online part]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
What is a Lexicon in ASR?
● A file which lists all the possible words.
● Each word is mapped to a phone sequence.
● There are about 160k words in the current system.
Example:
, sil
一一道来 ii i1 ii i1 d ao4 l ai2
我 uu uo3
你 n i3
喜欢 x i3 h uan1
...
Why a Lexicon with Phone Sequences?
Three reasons:
● Data Coverage
● Model Limitation
● New Word Extension
Why a Lexicon with Phone Sequences?
● Data Coverage
○ Our pronunciations are often influenced by context.
○ We would need a lot of training data to model them in different contexts.
○ Polyphonic words.
○ There are many more phone samples than word samples.
● Examples:
○ All three "s" sounds differ: "I like this. This is a book."
○ Polyphonic word "the": "the book" vs. "the earth"
Why a Lexicon with Phone Sequences?
● Model Limitation
○ There are only about 1k~5k phone-combination HMM states.
○ But there are 160k words in the current system.
○ Modeling 160k words with HMM states directly? Too large to train and predict.
● For example, for a DNN with 2048-node hidden layers:
○ 5k outputs: ~10M parameters
○ 160k outputs: ~327M parameters
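These parameter counts are essentially just the output-layer weight matrix of the network; a quick sanity check of the slide's numbers (assuming the 2048-node last hidden layer feeds the output directly):

```python
hidden = 2048  # last hidden layer width, from the slide

# With a large output layer, its weight matrix dominates the count:
params_5k = hidden * 5_000      # phone-state (pdf) outputs
params_160k = hidden * 160_000  # whole-word outputs

print(params_5k)    # 10240000  ~ 10M
print(params_160k)  # 327680000 ~ 327M
```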
Why a Lexicon with Phone Sequences?
● New Word Extension
○ With a phone-level system, we can just modify the Lexicon and Language Model.
○ Without a phone-level system, we would have to:
■ Modify the Lexicon and Language Model.
■ Collect enough speech data for the new word.
■ Re-train the Acoustic Model.
How to Establish a Lexicon in Chinese?
他出生在少林寺 → 他 出生 在 少林寺

Per-character segmentation:
他 t a1
出生 ch u1 sh eng1
在 z ai4
少 sh ao4
林 l in2
寺 s iy4

Word segmentation:
他 t a1
出生 ch u1 sh eng1
在 z ai4
少林寺 sh ao4 l in2 s iy4
Challenges in Lexicon Auto-Build
● Word Segmentation
○ What is the most reasonable way?
○ Ex: 一切正常 vs. 一切 正常
● Polyphonic words
○ 少 sh ao3
○ 少 sh ao4
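As a sketch of the segmentation problem, one classic baseline is greedy forward longest-match against the lexicon; a minimal version (the tiny `lexicon` set here is hypothetical, not the real 160k-word one):

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy forward longest-match word segmentation."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:  # single characters are always accepted
                words.append(cand)
                i += n
                break
    return words

lexicon = {"一切", "正常", "出生", "少林寺"}
print(longest_match_segment("一切正常", lexicon))       # ['一切', '正常']
print(longest_match_segment("他出生在少林寺", lexicon))  # ['他', '出生', '在', '少林寺']
```

Greedy longest-match fails on genuinely ambiguous strings, which is exactly why segmentation is listed as a challenge here.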
Then we have a Lexicon now!
[Diagram: the system with the Lexicon block filled in]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
[Diagram: the system with the Lexicon block filled in]
What is an Acoustic Model?
● A classifier that tries to identify the pronunciation you are speaking, based on the input acoustic features.
[Diagram: speech → Feature Extractor → Acoustic Model → per-pdf probabilities, e.g. pdf1 0.87, pdf2 0.11, pdf3 0.02, pdf4~pdf7 0, …]
Hidden Markov Model (HMM)
● Typically, we use a left-to-right HMM to model a phone.
○ Speech won't go backward.
○ A phone's pronunciation varies across its different phases.
● An example of a phone-HMM topology:
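As a sketch of such a topology, a 3-state left-to-right HMM can be written as a transition matrix with zeros below the diagonal, encoding "speech won't go backward" (the probability values are made up for illustration):

```python
# Rows = from-state, columns = to-state. Each state can only
# stay where it is or advance; zeros below the diagonal forbid
# moving backward.
transitions = [
    [0.6, 0.4, 0.0],  # state 0: stay, or advance to state 1
    [0.0, 0.7, 0.3],  # state 1: stay, or advance to state 2
    [0.0, 0.0, 1.0],  # state 2: final state (exit arc not shown)
]

# Each row must still be a valid probability distribution:
for row in transitions:
    assert abs(sum(row) - 1.0) < 1e-9
```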
If we have many HMMs already…
● When a speech frame comes in…
● How do we know which phone and which HMM state it is?
● We need an Acoustic Model to tell us!
○ HMM + Gaussian Mixture Model
○ Deep Neural Network
○ Other NNs, like CNN, TDNN, LSTM…
HMM + Gaussian Mixture Model (GMM)
● Use multiple Gaussians to model each HMM state.
○ For example: use 39 Gaussians, corresponding to the dimension of the MFCC features.
● A state might have different GMMs.
● Some states may share the same GMM.
Real example: phone 61 is "i2"; 2nd column: states; 3rd column: GMM pdf ids.
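As an illustration of how a GMM pdf scores a frame, here is a minimal diagonal-covariance mixture likelihood using only the standard library (the 2-dimensional frame and all parameter values are toys, not from the real system):

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a weighted mixture of diagonal Gaussians."""
    log_terms = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    mx = max(log_terms)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(t - mx) for t in log_terms))

frame = [0.2, -0.1]  # toy 2-dim "feature vector"
score = gmm_log_likelihood(
    frame,
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(score)
```

During recognition, each frame is scored like this against every candidate state's GMM, and the HMM search combines the scores over time.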
Force Alignment
● With the Lexicon built, we know the phone sequence for each wav file.
● Now we want to train an HMM-GMM for each phone, but…
● How do we know where the phones are in the speech data?
○ Mark them manually? Too slow!
○ We need Force Alignment.
他 出 生 在 少 林 寺
t a1 ch u1 sh eng1 z ai4 sh ao4 l in2 s iy4
Force Alignment
● Expectation-Maximization Algorithm (EM Algorithm)
○ Estimate: guess the boundaries first.
○ Measure: does the estimation reach the criterion? (Ex: maximum likelihood.)
■ YES: stop the loop.
■ NO: use the current estimation to train a model, then estimate again.
● A real example:
[Diagram: training pipeline from equal alignment to mono-phone alignment (35 iters) to tri-phone alignment (40 iters) and other alignments]
○ mono-phone labels: b ai4 t uo1
○ tri-phone labels: ?-b+ai4 b-ai4+t ai4-t+uo1 t-uo1+?
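The "equal alignment" starting point can be sketched as splitting the frames evenly across the phone sequence; the later EM iterations then refine these boundaries (a toy sketch, not an actual toolkit implementation):

```python
def equal_align(n_frames, phones):
    """Assign frames to phones in equal contiguous chunks."""
    per = n_frames / len(phones)
    return [
        phones[min(int(i / per), len(phones) - 1)]
        for i in range(n_frames)
    ]

# 8 frames spread evenly over the 4 phones of "拜托":
print(equal_align(8, ["b", "ai4", "t", "uo1"]))
# ['b', 'b', 'ai4', 'ai4', 't', 't', 'uo1', 'uo1']
```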
Deep Learning in Acoustic Model
● Use a NN to predict the probability of each GMM (pdf).
[Diagram: feature vector → Feature Extractor → DNN → per-pdf probabilities, e.g. pdf0 0.87, pdf1 0.11, pdf2 0.02, …]
NN Training as Acoustic Model
● With force alignment, we can start DNN supervised learning.
[Diagram: mfcc/fbank feature vector → DNN → pdf vector (e.g. 0.87, 0.54, 0.94, …), trained against the one-hot target vector (e.g. 0, 0, 1, 0, …, 0) given by Force Alignment]
● Input: acoustic features (mfcc/fbank)
● Output: the answer from Force Alignment
NN Training as Acoustic Model
● With force alignment, we can start DNN supervised learning.
● You can append other features to the input as well.
● For example:
○ Add speaker vectors or i-vectors: [fbank feature vector + speaker vector] → DNN → pdf vector
○ Add context-frame information: stack frames -3…+3 around the current frame → DNN → pdf vector
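Context-frame stacking can be sketched as concatenating each frame with its ±3 neighbors; edge frames here are padded by repeating the first/last frame (that padding choice is an assumption, conventions vary):

```python
def splice(frames, left=3, right=3):
    """Concatenate each frame with its neighbors (repeat-pad at edges)."""
    n = len(frames)
    out = []
    for t in range(n):
        ctx = []
        for d in range(-left, right + 1):
            # Clamp the index so edge frames repeat their nearest neighbor.
            ctx.extend(frames[min(max(t + d, 0), n - 1)])
        out.append(ctx)
    return out

frames = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # toy 1-dim features
spliced = splice(frames)
print(len(spliced[0]))  # 7  (3 left + center + 3 right)
print(spliced[2])       # [0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.5]
```

With real 40-dim fbank features, each spliced input would be a 280-dim vector.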
Then we have a DNN AM now!
[Diagram: the system with the Neural Network and Lexicon blocks filled in]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
[Diagram: the system with the Neural Network and Lexicon blocks filled in]
What is a Language Model (LM)?
● We usually use an n-gram LM.
○ RNN-LMs are only used for rescoring so far, due to speed issues.
● Given the previous words, it returns the probability of the next word.
● Some usage examples:
○ getLMProbability( [你, 好], 嗎 )
■ returns the 3-gram probability: 10^-0.428
○ getLMProbability( [好], 喔 )
■ returns the 2-gram probability: 10^-6.15
○ getLMProbability( [你, 好], 啊 )
■ not found in the 3-grams, so call getLMProbability( [好], 啊 )
■ not found in the 2-grams, so call getLMProbability( [], 啊 )
■ returns the 1-gram probability: 10^-1.432
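The backoff behavior in these examples can be sketched with dictionary lookups; note that a real ARPA-style LM also applies a backoff weight when it falls back, which this simplified `get_lm_probability` omits:

```python
# Log10 probabilities keyed by the full n-gram, mirroring the slide.
LM = {
    ("你", "好", "嗎"): -0.428,  # 3-gram
    ("好", "喔"): -6.15,         # 2-gram
    ("啊",): -1.432,             # 1-gram
}

def get_lm_probability(history, word):
    """Look up the longest matching n-gram, backing off to shorter ones."""
    for start in range(len(history) + 1):
        ngram = tuple(history[start:]) + (word,)
        if ngram in LM:
            return LM[ngram]
    return float("-inf")  # word not even in the 1-grams

print(get_lm_probability(["你", "好"], "嗎"))  # -0.428 (3-gram hit)
print(get_lm_probability(["你", "好"], "啊"))  # -1.432 (backs off to 1-gram)
```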
Example LM file:
\data\
ngram 1=189234
ngram 2=2668678
ngram 3=1555761

\1-grams:
-1.432 啊
...
\2-grams:
-6.15 好 喔
...
\3-grams:
-0.428 你 好 嗎
Why do we need a Language Model?
● Some words' pronunciations are really similar.
● Users may have unusual pronunciations.
● For example, given the same pronunciation, we can use the LM to decide which word sequence is more reasonable:
○ 下次 再見 ("see you next time": reasonable)
○ 下次 在 劍 ("next time at sword": obviously bad)
How to Train a Language Model (Simply)?
● Count words through all the transcriptions:
○ 你 好 1377
○ 你 好 啊 40
○ 你 好 冷漠 1
○ …
● Then we can get the 3-gram probability:
○ getLMProbability( [你, 好], 啊 ) = 40/1377 ≈ 0.02905, stored as log10 ≈ -1.5369
○ Save it into the LM file.
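The arithmetic above can be checked directly (counts taken from the slide):

```python
import math

counts = {
    ("你", "好"): 1377,
    ("你", "好", "啊"): 40,
}

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate: count(w1 w2 w3) / count(w1 w2)."""
    return counts[(w1, w2, w3)] / counts[(w1, w2)]

p = trigram_prob("你", "好", "啊")
print(round(p, 5))              # 0.02905
print(round(math.log10(p), 4))  # -1.5369, the value stored in the LM file
```

Real toolkits additionally smooth these raw counts (e.g. for unseen n-grams), which this sketch skips.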
Then we have the N-gram LM now!
[Diagram: the system with the DNN, Lexicon and N-gram blocks filled in]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
[Diagram: the system with the DNN, Lexicon and N-gram blocks filled in]
What is a Decoder?
● It integrates the Lexicon, Language Model and Acoustic Model to find possible paths for the input speech.
● The path with the highest score represents the most reasonable word sequence.
● Two types of Decoder in ASR:
○ Weighted Finite State Transducer: Kaldi
○ Viterbi: HDecode (HTK)
Example of WFST
● Start from state 0.
● Input: acca
● Output: yxxx
● Weighted value: 0.4 ⊗ 0.7 ⊗ 0.9 ⊗ 1.0 ⊗ 1.3
[Diagram: input arcs labeled with acoustic features or ϵ/tri-phone/phone/word symbols; the arc weights are the path scores given by the AM/LM]
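Walking this example path can be sketched as follows, assuming ⊗ means ordinary multiplication of the arc weights (in the tropical semiring commonly used in practice it would instead be addition of negative-log costs); the arc table is hypothetical, shaped like the slide's example:

```python
# Arcs: (from_state, input_label) -> (to_state, output_label, weight)
arcs = {
    (0, "a"): (1, "y", 0.4),
    (1, "c"): (2, "x", 0.7),
    (2, "c"): (3, "x", 0.9),
    (3, "a"): (4, "x", 1.0),
}
final_weight = {4: 1.3}  # extra weight on the accepting state

def transduce(inputs):
    state, outputs, weight = 0, [], 1.0
    for sym in inputs:
        state, out, w = arcs[(state, sym)]
        outputs.append(out)
        weight *= w  # the "⊗" accumulation along the path
    return "".join(outputs), weight * final_weight[state]

out, w = transduce("acca")
print(out)          # yxxx
print(round(w, 4))  # 0.3276  (= 0.4 * 0.7 * 0.9 * 1.0 * 1.3)
```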
4 Levels of HCLG in the WFST Decoder
● HCLG: 4 levels of states in the WFST Decoder.
● H (HMM): maps HMM states to context phones; the DNN AM provides the HMM-state probabilities.
● C (Context): maps context phones to phones.

Transducer  Input          Output
H           HMM state      context phone
C           context phone  phone
L           phone          word
G           word           word
4 Levels of HCLG in the WFST Decoder
● L (Lexicon): maps phone sequences to words.
Example Lexicon:
speech  s p iy ch
the     dh ax
the     dh iy
…
● G (Grammar): the N-gram Language Model.
Token & Beam Pruning
● The WFST is like a map.
● A Token is the traveler, which carries:
○ Its current position
○ Its history path (history words)
○ Its score
● Tokens travel through the map:
○ A token makes a copy of itself when facing a branch.
○ A token kills itself if its score falls too far behind.
Example token: Position: 1; History: give, a; Score: 0.87
Token & Beam Pruning
● The decoding loop:
○ Copy tokens at branches.
○ Update the AM/LM scores.
○ Record the highest_score over all tokens.
○ If Token.score < (highest_score - beam), kill it (beam pruning).
● Finally, we choose the token with the highest score.
● Output its history words as the answer, e.g. History: give, a, speech.
Got the answer! Ya!!
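The decoding loop can be sketched with tokens as dictionaries; the expansion rule, scores and beam width below are all toy values, not a real decoder:

```python
import copy

BEAM = 5.0  # toy beam width

def decode_step(tokens, expand):
    """One frame of token passing with beam pruning.

    `expand(token)` returns the token's successors (copies made at
    branches), each with an updated AM/LM score and history.
    """
    new_tokens = []
    for tok in tokens:
        new_tokens.extend(expand(tok))
    best = max(t["score"] for t in new_tokens)
    # Kill any token whose score falls too far behind the best one.
    return [t for t in new_tokens if t["score"] >= best - BEAM]

# Toy expansion: each token branches into two successors.
def expand(tok):
    a = copy.deepcopy(tok); a["score"] += 1.0; a["history"].append("speech")
    b = copy.deepcopy(tok); b["score"] -= 9.0; b["history"].append("spinach")
    return [a, b]

tokens = [{"score": 0.87, "history": ["give", "a"]}]
tokens = decode_step(tokens, expand)
print([t["history"] for t in tokens])  # [['give', 'a', 'speech']]
```

The "spinach" branch falls more than the beam width behind the best token, so pruning removes it, exactly as in the slide's loop.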
Summary
[Diagram: the human steps mapped to the final system. Pronunciation? → DNN AM; Candidate words? → Phone-Labeled Lexicon; Which word might come next? → N-gram LM; Final text transcription → WFST Decoder]
Thank You For Listening