The Main Concepts of Speech Recognition
by Johnny Coldsheep Yang
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
What is Speech Recognition?
● Input: a speech signal; output: a text transcription.
● Other names:
○ Automatic Speech Recognition (ASR)
○ Large Vocabulary Continuous Speech Recognition (LVCSR)
○ Speech to Text (STT)
● So, how do we do that with machines?
How do humans do the same task?
[Diagram: Pronunciation? → Candidate words? → Which word might come next? → Final text transcription]
How do humans do the same task?
[Diagram: the same steps mapped to ASR components. Pronunciation? → Acoustic Model; Candidate words? → Lexicon; Which word might come next? → Language Model; Final text transcription → Decoder]
The Basic ASR System
[Diagram: the system, split into an offline part and an online part]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
What is a Lexicon in ASR?
● A file which lists all the possible words.
● Each word is mapped to a phone sequence.
● There are about 160k words in the current system.
Example:
, sil
一一道来 ii i1 ii i1 d ao4 l ai2
我 uu uo3
你 n i3
喜欢 x i3 h uan1
...
Why a Lexicon with Phone Sequences?
Three reasons:
● Data Coverage
● Model Limitation
● New Word Extension
Why a Lexicon with Phone Sequences?
● Data Coverage
○ Our pronunciations are often influenced by context.
○ We would need a lot of training data to model them in different contexts.
○ Polyphonic words.
○ There are many more phone samples than word samples.
● Examples:
○ All three "s" sounds differ: "I like this. This is a book."
○ Polyphonic word "the": "the book" vs. "the earth"
Why a Lexicon with Phone Sequences?
● Model Limitation
○ There are only about 1k~5k phone-combination HMM states.
○ But there are 160k words in the current system.
○ Modeling 160k words with HMM states directly? Too large to train and predict.
● For example, for a DNN with 2048-node hidden layers:
○ 5k outputs: ~10M parameters
○ 160k outputs: ~327M parameters
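These parameter counts are essentially just the output-layer weight matrix of the network; a quick sanity check of the slide's numbers (assuming the 2048-node last hidden layer feeds the output directly):

```python
hidden = 2048  # last hidden layer width, from the slide

# With a large output layer, its weight matrix dominates the count:
params_5k = hidden * 5_000      # phone-state (pdf) outputs
params_160k = hidden * 160_000  # whole-word outputs

print(params_5k)    # 10240000  ~ 10M
print(params_160k)  # 327680000 ~ 327M
```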
Why a Lexicon with Phone Sequences?
● New Word Extension
○ With a phone-level system, we can just modify the Lexicon and Language Model.
○ Without a phone-level system, we would have to:
■ Modify the Lexicon and Language Model.
■ Collect enough speech data for the new word.
■ Re-train the Acoustic Model.
How to Establish a Lexicon in Chinese?
他出生在少林寺 → 他 出生 在 少林寺

Per-character segmentation:
他 t a1
出生 ch u1 sh eng1
在 z ai4
少 sh ao4
林 l in2
寺 s iy4

Word segmentation:
他 t a1
出生 ch u1 sh eng1
在 z ai4
少林寺 sh ao4 l in2 s iy4
Challenges in Lexicon Auto-Build
● Word Segmentation
○ What is the most reasonable way?
○ Ex: 一切正常 vs. 一切 正常
● Polyphonic words
○ 少 sh ao3
○ 少 sh ao4
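As a sketch of the segmentation problem, one classic baseline is greedy forward longest-match against the lexicon; a minimal version (the tiny `lexicon` set here is hypothetical, not the real 160k-word one):

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy forward longest-match word segmentation."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:  # single characters are always accepted
                words.append(cand)
                i += n
                break
    return words

lexicon = {"一切", "正常", "出生", "少林寺"}
print(longest_match_segment("一切正常", lexicon))       # ['一切', '正常']
print(longest_match_segment("他出生在少林寺", lexicon))  # ['他', '出生', '在', '少林寺']
```

Greedy longest-match fails on genuinely ambiguous strings, which is exactly why segmentation is listed as a challenge here.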
Then we have a Lexicon now!
[Diagram: the system with the Lexicon block filled in]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
[Diagram: the system with the Lexicon block filled in]
What is an Acoustic Model?
● A classifier that tries to identify the pronunciation you are speaking, based on the input acoustic features.
[Diagram: speech → Feature Extractor → Acoustic Model → per-pdf probabilities, e.g. pdf1 0.87, pdf2 0.11, pdf3 0.02, pdf4~pdf7 0, …]
Hidden Markov Model (HMM)
● Typically, we use a left-to-right HMM to model a phone.
○ Speech won't go backward.
○ A phone's pronunciation varies across its different phases.
● An example of a phone-HMM topology:
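As a sketch of such a topology, a 3-state left-to-right HMM can be written as a transition matrix with zeros below the diagonal, encoding "speech won't go backward" (the probability values are made up for illustration):

```python
# Rows = from-state, columns = to-state. Each state can only
# stay where it is or advance; zeros below the diagonal forbid
# moving backward.
transitions = [
    [0.6, 0.4, 0.0],  # state 0: stay, or advance to state 1
    [0.0, 0.7, 0.3],  # state 1: stay, or advance to state 2
    [0.0, 0.0, 1.0],  # state 2: final state (exit arc not shown)
]

# Each row must still be a valid probability distribution:
for row in transitions:
    assert abs(sum(row) - 1.0) < 1e-9
```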
If we have many HMMs already…
● When a speech frame comes in…
● How do we know which phone and which HMM state it is?
● We need an Acoustic Model to tell us!
○ HMM + Gaussian Mixture Model
○ Deep Neural Network
○ Other NNs, like CNN, TDNN, LSTM…
HMM + Gaussian Mixture Model (GMM)
● Use multiple Gaussians to model each HMM state.
○ For example: use 39 Gaussians, corresponding to the dimension of the MFCC features.
● A state might have different GMMs.
● Some states may share the same GMM.
Real example: phone 61 is "i2"; 2nd column: states; 3rd column: GMM pdf ids.
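As an illustration of how a GMM pdf scores a frame, here is a minimal diagonal-covariance mixture likelihood using only the standard library (the 2-dimensional frame and all parameter values are toys, not from the real system):

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a weighted mixture of diagonal Gaussians."""
    log_terms = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    mx = max(log_terms)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(t - mx) for t in log_terms))

frame = [0.2, -0.1]  # toy 2-dim "feature vector"
score = gmm_log_likelihood(
    frame,
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(score)
```

During recognition, each frame is scored like this against every candidate state's GMM, and the HMM search combines the scores over time.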
Force Alignment
● With the Lexicon built, we know the phone sequence for each wav file.
● Now we want to train an HMM-GMM for each phone, but…
● How do we know where the phones are in the speech data?
○ Mark them manually? Too slow!
○ We need Force Alignment.
他 出 生 在 少 林 寺
t a1 ch u1 sh eng1 z ai4 sh ao4 l in2 s iy4
Force Alignment
● Expectation-Maximization Algorithm (EM Algorithm)
○ Estimate: guess the boundaries first.
○ Measure: does the estimation reach the criterion? (Ex: maximum likelihood.)
■ YES: stop the loop.
■ NO: use the current estimation to train a model, then estimate again.
● A real example:
[Diagram: training pipeline from equal alignment to mono-phone alignment (35 iters) to tri-phone alignment (40 iters) and other alignments]
○ mono-phone labels: b ai4 t uo1
○ tri-phone labels: ?-b+ai4 b-ai4+t ai4-t+uo1 t-uo1+?
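The "equal alignment" starting point can be sketched as splitting the frames evenly across the phone sequence; the later EM iterations then refine these boundaries (a toy sketch, not an actual toolkit implementation):

```python
def equal_align(n_frames, phones):
    """Assign frames to phones in equal contiguous chunks."""
    per = n_frames / len(phones)
    return [
        phones[min(int(i / per), len(phones) - 1)]
        for i in range(n_frames)
    ]

# 8 frames spread evenly over the 4 phones of "拜托":
print(equal_align(8, ["b", "ai4", "t", "uo1"]))
# ['b', 'b', 'ai4', 'ai4', 't', 't', 'uo1', 'uo1']
```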
Deep Learning in Acoustic Model
● Use a NN to predict the probability of each GMM (pdf).
[Diagram: feature vector → Feature Extractor → DNN → per-pdf probabilities, e.g. pdf0 0.87, pdf1 0.11, pdf2 0.02, …]
NN Training as Acoustic Model
● With force alignment, we can start DNN supervised learning.
[Diagram: mfcc/fbank feature vector → DNN → pdf vector (e.g. 0.87, 0.54, 0.94, …), trained against the one-hot target vector (e.g. 0, 0, 1, 0, …, 0) given by Force Alignment]
● Input: acoustic features (mfcc/fbank)
● Output: the answer from Force Alignment
NN Training as Acoustic Model
● With force alignment, we can start DNN supervised learning.
● You can append other features to the input as well.
● For example:
○ Add speaker vectors or i-vectors: [fbank feature vector + speaker vector] → DNN → pdf vector
○ Add context-frame information: stack frames -3…+3 around the current frame → DNN → pdf vector
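Context-frame stacking can be sketched as concatenating each frame with its ±3 neighbors; edge frames here are padded by repeating the first/last frame (that padding choice is an assumption, conventions vary):

```python
def splice(frames, left=3, right=3):
    """Concatenate each frame with its neighbors (repeat-pad at edges)."""
    n = len(frames)
    out = []
    for t in range(n):
        ctx = []
        for d in range(-left, right + 1):
            # Clamp the index so edge frames repeat their nearest neighbor.
            ctx.extend(frames[min(max(t + d, 0), n - 1)])
        out.append(ctx)
    return out

frames = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # toy 1-dim features
spliced = splice(frames)
print(len(spliced[0]))  # 7  (3 left + center + 3 right)
print(spliced[2])       # [0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.5]
```

With real 40-dim fbank features, each spliced input would be a 280-dim vector.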
Then we have a DNN AM now!
[Diagram: the system with the Neural Network and Lexicon blocks filled in]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
[Diagram: the system with the Neural Network and Lexicon blocks filled in]
What is a Language Model (LM)?
● We usually use an n-gram LM.
○ RNN-LMs are only used for rescoring so far, due to speed issues.
● Given the previous words, it returns the probability of the next word.
● Some usage examples:
○ getLMProbability( [你, 好], 嗎 )
■ returns the 3-gram probability: 10^-0.428
○ getLMProbability( [好], 喔 )
■ returns the 2-gram probability: 10^-6.15
○ getLMProbability( [你, 好], 啊 )
■ not found in the 3-grams, so call getLMProbability( [好], 啊 )
■ not found in the 2-grams, so call getLMProbability( [], 啊 )
■ returns the 1-gram probability: 10^-1.432
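The backoff behavior in these examples can be sketched with dictionary lookups; note that a real ARPA-style LM also applies a backoff weight when it falls back, which this simplified `get_lm_probability` omits:

```python
# Log10 probabilities keyed by the full n-gram, mirroring the slide.
LM = {
    ("你", "好", "嗎"): -0.428,  # 3-gram
    ("好", "喔"): -6.15,         # 2-gram
    ("啊",): -1.432,             # 1-gram
}

def get_lm_probability(history, word):
    """Look up the longest matching n-gram, backing off to shorter ones."""
    for start in range(len(history) + 1):
        ngram = tuple(history[start:]) + (word,)
        if ngram in LM:
            return LM[ngram]
    return float("-inf")  # word not even in the 1-grams

print(get_lm_probability(["你", "好"], "嗎"))  # -0.428 (3-gram hit)
print(get_lm_probability(["你", "好"], "啊"))  # -1.432 (backs off to 1-gram)
```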
Example LM file:
\data\
ngram 1=189234
ngram 2=2668678
ngram 3=1555761

\1-grams:
-1.432 啊
...
\2-grams:
-6.15 好 喔
...
\3-grams:
-0.428 你 好 嗎
Why do we need a Language Model?
● Some words' pronunciations are really similar.
● Users may have unusual pronunciations.
● For example, given the same pronunciation, we can use the LM to decide which word sequence is more reasonable:
○ 下次 再見 ("see you next time": reasonable)
○ 下次 在 劍 ("next time at sword": obviously bad)
How to Train a Language Model (Simply)?
● Count words through all the transcriptions:
○ 你 好 1377
○ 你 好 啊 40
○ 你 好 冷漠 1
○ …
● Then we can get the 3-gram probability:
○ getLMProbability( [你, 好], 啊 ) = 40/1377 ≈ 0.02905, stored as log10 ≈ -1.5369
○ Save it into the LM file.
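The arithmetic above can be checked directly (counts taken from the slide):

```python
import math

counts = {
    ("你", "好"): 1377,
    ("你", "好", "啊"): 40,
}

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate: count(w1 w2 w3) / count(w1 w2)."""
    return counts[(w1, w2, w3)] / counts[(w1, w2)]

p = trigram_prob("你", "好", "啊")
print(round(p, 5))              # 0.02905
print(round(math.log10(p), 4))  # -1.5369, the value stored in the LM file
```

Real toolkits additionally smooth these raw counts (e.g. for unseen n-grams), which this sketch skips.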
Then we have the N-gram LM now!
[Diagram: the system with the DNN, Lexicon and N-gram blocks filled in]
Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder
The Basic ASR System
[Diagram: the system with the DNN, Lexicon and N-gram blocks filled in]
What is a Decoder?
● It integrates the Lexicon, Language Model and Acoustic Model to find possible paths for the input speech.
● The path with the highest score represents the most reasonable word sequence.
● Two types of Decoder in ASR:
○ Weighted Finite State Transducer: Kaldi
○ Viterbi: HDecode (HTK)
Example of WFST
● Start from state 0.
● Input: acca
● Output: yxxx
● Weighted value: 0.4 ⊗ 0.7 ⊗ 0.9 ⊗ 1.0 ⊗ 1.3
[Diagram: input arcs labeled with acoustic features or ϵ/tri-phone/phone/word symbols; the arc weights are the path scores given by the AM/LM]
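Walking this example path can be sketched as follows, assuming ⊗ means ordinary multiplication of the arc weights (in the tropical semiring commonly used in practice it would instead be addition of negative-log costs); the arc table is hypothetical, shaped like the slide's example:

```python
# Arcs: (from_state, input_label) -> (to_state, output_label, weight)
arcs = {
    (0, "a"): (1, "y", 0.4),
    (1, "c"): (2, "x", 0.7),
    (2, "c"): (3, "x", 0.9),
    (3, "a"): (4, "x", 1.0),
}
final_weight = {4: 1.3}  # extra weight on the accepting state

def transduce(inputs):
    state, outputs, weight = 0, [], 1.0
    for sym in inputs:
        state, out, w = arcs[(state, sym)]
        outputs.append(out)
        weight *= w  # the "⊗" accumulation along the path
    return "".join(outputs), weight * final_weight[state]

out, w = transduce("acca")
print(out)          # yxxx
print(round(w, 4))  # 0.3276  (= 0.4 * 0.7 * 0.9 * 1.0 * 1.3)
```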
4 Levels of HCLG in the WFST Decoder
● HCLG: 4 levels of states in the WFST Decoder.
● H (HMM): maps HMM states to context phones; the DNN AM provides the HMM-state probabilities.
● C (Context): maps context phones to phones.

Transducer  Input          Output
H           HMM state      context phone
C           context phone  phone
L           phone          word
G           word           word
4 Levels of HCLG in the WFST Decoder
● L (Lexicon): maps phone sequences to words.
Example Lexicon:
speech  s p iy ch
the     dh ax
the     dh iy
…
● G (Grammar): the N-gram Language Model.
Token & Beam Pruning
● The WFST is like a map.
● A Token is the traveler, which carries:
○ Its current position
○ Its history path (history words)
○ Its score
● Tokens travel through the map:
○ A token makes a copy of itself when facing a branch.
○ A token kills itself if its score falls too far behind.
Example token: Position: 1; History: give, a; Score: 0.87
Token & Beam Pruning
● The decoding loop:
○ Copy tokens at branches.
○ Update the AM/LM scores.
○ Record the highest_score over all tokens.
○ If Token.score < (highest_score - beam), kill it (beam pruning).
● Finally, we choose the token with the highest score.
● Output its history words as the answer, e.g. History: give, a, speech.
Got the answer! Ya!!
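The decoding loop can be sketched with tokens as dictionaries; the expansion rule, scores and beam width below are all toy values, not a real decoder:

```python
import copy

BEAM = 5.0  # toy beam width

def decode_step(tokens, expand):
    """One frame of token passing with beam pruning.

    `expand(token)` returns the token's successors (copies made at
    branches), each with an updated AM/LM score and history.
    """
    new_tokens = []
    for tok in tokens:
        new_tokens.extend(expand(tok))
    best = max(t["score"] for t in new_tokens)
    # Kill any token whose score falls too far behind the best one.
    return [t for t in new_tokens if t["score"] >= best - BEAM]

# Toy expansion: each token branches into two successors.
def expand(tok):
    a = copy.deepcopy(tok); a["score"] += 1.0; a["history"].append("speech")
    b = copy.deepcopy(tok); b["score"] -= 9.0; b["history"].append("spinach")
    return [a, b]

tokens = [{"score": 0.87, "history": ["give", "a"]}]
tokens = decode_step(tokens, expand)
print([t["history"] for t in tokens])  # [['give', 'a', 'speech']]
```

The "spinach" branch falls more than the beam width behind the best token, so pruning removes it, exactly as in the slide's loop.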
Summary
[Diagram: the human steps mapped to the final system. Pronunciation? → DNN AM; Candidate words? → Phone-Labeled Lexicon; Which word might come next? → N-gram LM; Final text transcription → WFST Decoder]
Thank You For Listening