
Page 1: The Main Concepts of Speech Recognition

The Main Concepts of Speech Recognition

by Johnny Coldsheep Yang

Page 2: The Main Concepts of Speech Recognition

Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder

Page 3: The Main Concepts of Speech Recognition

Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder

Page 4: The Main Concepts of Speech Recognition

What is Speech Recognition?
● Input: a speech signal. Output: a text transcription.
● Other synonyms:
  ○ Automatic Speech Recognition (ASR)
  ○ Large Vocabulary Continuous Speech Recognition (LVCSR)
  ○ Speech to Text (STT)
● So, how do we do that with machines?

Page 5: The Main Concepts of Speech Recognition

How do humans do the same task?

Pronunciation?

Which word might come next?

Candidate words?

Final text transcription

Mmm...

Page 6: The Main Concepts of Speech Recognition

How do humans do the same task?

Pronunciation?

Which word might come next?

Candidate words?

Final text transcription

Acoustic Model

Language Model

Lexicon

Decoder

Page 7: The Main Concepts of Speech Recognition

The Basic ASR System

Page 8: The Main Concepts of Speech Recognition

The Basic ASR System

Offline Part

Page 9: The Main Concepts of Speech Recognition

The Basic ASR System

Online Part

Offline Part

Page 10: The Main Concepts of Speech Recognition

Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder

Page 11: The Main Concepts of Speech Recognition

The Basic ASR System

Page 12: The Main Concepts of Speech Recognition

What is a Lexicon in ASR?
● A file which includes all the possible words.
● Each word is mapped to a phone sequence (a small lookup sketch follows the example below).
● There are about 160k words in the current system.

Example:

, sil

一一道来 ii i1 ii i1 d ao4 l ai2

我 uu uo3

你 n i3

喜欢 x i3 h uan1

...
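As a rough illustration, the following sketch maps a segmented word sequence to phones by looking it up in a toy lexicon shaped like the example above. The entries and the helper name are made up for this sketch; they are not the real 160k-word file.

```python
# Minimal sketch: map a segmented word sequence to its phone sequence
# using a toy lexicon in the same format as the example above.
# (Toy lexicon; a real system must also handle out-of-vocabulary words.)
lexicon = {
    "我": ["uu", "uo3"],
    "喜欢": ["x", "i3", "h", "uan1"],
    "你": ["n", "i3"],
}

def words_to_phones(words):
    """Concatenate the phone sequence of each word, in order."""
    phones = []
    for w in words:
        phones.extend(lexicon[w])
    return phones

print(words_to_phones(["我", "喜欢", "你"]))
# ['uu', 'uo3', 'x', 'i3', 'h', 'uan1', 'n', 'i3']
```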

Page 13: The Main Concepts of Speech Recognition

Why Lexicon with Phone Sequence?
Three Reasons:
● Data Coverage
● Model Limitation
● New Word Extension

Example:

, sil

一一道来 ii i1 ii i1 d ao4 l ai2

我 uu uo3

你 n i3

喜欢 x i3 h uan1

...

Page 14: The Main Concepts of Speech Recognition

Why Lexicon with Phone Sequence?
● Data Coverage
  ○ Our pronunciations are often influenced by context.
  ○ We need a lot of training data to model them in different contexts.
  ○ Polyphonic words.
  ○ There are far more phone samples than word samples.

All three "s" sounds are pronounced differently: "I like this. This is a book."
Polyphonic word "the": "the book" vs. "the earth"

Page 15: The Main Concepts of Speech Recognition

Why Lexicon with Phone Sequence?
● Model Limitation
  ○ There are about 1k~5k phone-combination HMM states.
  ○ And there are 160k words in the current system.
  ○ 160k words as HMM states? Too large to train and predict.
● For example, a DNN with a 2048-node hidden layer (a rough check of the numbers follows below):

DNN with 5k outputs: 10M parameters. DNN with 160k outputs: 327M parameters.
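A back-of-the-envelope check of those numbers, assuming they count only the output weight matrix of the 2048-unit hidden layer (an assumption for this sketch; biases and other layers are ignored):

```python
# Back-of-the-envelope: size of the output weight matrix alone for a DNN
# whose last hidden layer has 2048 units, with 5k vs. 160k output targets.
hidden_units = 2048
for n_outputs in (5_000, 160_000):        # phone-state pdfs vs. whole words
    weights = hidden_units * n_outputs    # output-layer weights only
    print(f"{n_outputs:>7} outputs -> {weights / 1e6:.1f}M weights")
# 5000 outputs -> 10.2M weights   (close to the 10M on the slide)
# 160000 outputs -> 327.7M weights (close to the 327M on the slide)
```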

Page 16: The Main Concepts of Speech Recognition

Why Lexicon with Phone Sequence?
● New Word Extension
  ○ With a phone-level system, we can just modify the Lexicon and the Language Model.
  ○ Without a phone-level system:
    ■ Modify the Lexicon and the Language Model.
    ■ Collect enough speech data for the new word.
    ■ Re-train the Acoustic Model again.

Page 17: The Main Concepts of Speech Recognition

How to Establish a Lexicon in Chinese?

他出生在少林寺 ("He was born at the Shaolin Temple")

Segmentation 1 (少林寺 split into single characters):
他 t a1
出生 ch u1 sh eng1
在 z ai4
少 sh ao4
林 l in2
寺 s iy4

Segmentation 2 (少林寺 kept as one word):
他 t a1
出生 ch u1 sh eng1
在 z ai4
少林寺 sh ao4 l in2 s iy4
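One simple way to produce such a segmentation automatically is greedy longest-match against the lexicon. The sketch below only illustrates that idea with a toy lexicon; it is not the actual segmenter used for the 160k-word system.

```python
# Illustrative greedy longest-match word segmentation against a lexicon.
# (Toy lexicon and word-length limit; a production system uses a real segmenter.)
lexicon = {"他", "出生", "在", "少", "林", "寺", "少林寺"}
MAX_WORD_LEN = 4

def segment(text):
    """Greedily take the longest lexicon word starting at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(segment("他出生在少林寺"))   # ['他', '出生', '在', '少林寺']
```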

Page 18: The Main Concepts of Speech Recognition

Challenges in Automatic Lexicon Building
● Word Segmentation
  ○ What is the most reasonable way?
  ○ Ex: 一切正常 vs. 一切 正常
● Polyphonic words
  ○ 少 sh ao3
  ○ 少 sh ao4

Page 19: The Main Concepts of Speech Recognition

Now we have a Lexicon

Lexicon

Page 20: The Main Concepts of Speech Recognition

Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder

Page 21: The Main Concepts of Speech Recognition

The Basic ASR System

Lexicon

Page 22: The Main Concepts of Speech Recognition

What is an Acoustic Model?
● A classifier that tries to identify the pronunciation you are speaking, based on the input acoustic features.

[Diagram: Feature Extractor → Acoustic Model → pdf posteriors, e.g. pdf1 0.87, pdf2 0.11, pdf3 0.02, pdf4 0, pdf5 0, pdf6 0, pdf7 0, …]

Page 23: The Main Concepts of Speech Recognition

Hidden Markov Model (HMM)
● Typically, we use a left-to-right HMM to model a phone.
  ○ Speech won't go backward.
  ○ A phone's pronunciation varies across different phases.
● An example of phone-HMM topology (a minimal stand-in sketch follows below):
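The topology on the original slide is a diagram; as a stand-in, here is a minimal 3-state left-to-right transition matrix with made-up probabilities, where each state may only stay put or move forward, never backward:

```python
import numpy as np

# Minimal 3-state left-to-right phone-HMM topology.
# Each state can only self-loop or advance to the next state;
# the probabilities are made up for illustration (exit transition omitted).
A = np.array([
    [0.6, 0.4, 0.0],   # state 0: beginning of the phone
    [0.0, 0.7, 0.3],   # state 1: middle of the phone
    [0.0, 0.0, 1.0],   # state 2: end of the phone
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a valid distribution
print(A)
```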

Page 24: The Main Concepts of Speech Recognition

If we already have many HMMs…
● When a speech frame comes in…
● How do we know which phone and which HMM state it belongs to?
● We need an Acoustic Model to tell us!
  ○ HMM-Gaussian Mixture Model
  ○ Deep Neural Network
  ○ Other NNs such as CNN, TDNN, LSTM…

Page 25: The Main Concepts of Speech Recognition

HMM-Gaussian Mixture Model (GMM)
● Use multiple Gaussians to model each HMM state.
  ○ For example: use 39 Gaussians, corresponding to the dimension of the MFCC features.
● A state might have different GMMs.
● Some states may share the same GMM.

Real example: phone 61 is "i2"; the 2nd column lists the HMM states, the 3rd column the GMM pdf index.
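To make the state-modelling idea concrete, here is a minimal diagonal-covariance GMM log-likelihood for a single 39-dimensional feature vector. The parameters are random toy values, not a trained model:

```python
import numpy as np

# Log-likelihood of one 39-dim feature vector under a diagonal-covariance GMM.
# Toy parameters only; a trained model supplies real weights, means, variances.
rng = np.random.default_rng(0)
dim, n_mix = 39, 4
weights = np.full(n_mix, 1.0 / n_mix)       # mixture weights, sum to 1
means = rng.normal(size=(n_mix, dim))
variances = np.ones((n_mix, dim))

def gmm_log_likelihood(x):
    # log N(x; mu_k, diag(var_k)) for every mixture component k
    log_norm = -0.5 * (dim * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = log_norm - 0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    # log sum_k w_k * N_k(x), computed stably
    return np.logaddexp.reduce(np.log(weights) + log_comp)

x = rng.normal(size=dim)                    # e.g. one MFCC frame
print(gmm_log_likelihood(x))
```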

Page 26: The Main Concepts of Speech Recognition

Forced Alignment
● With the built Lexicon, we know the phone sequence for each wav file.
● Now we want to train an HMM-GMM for each phone, but...
● How do we know where the phones are in the speech data?
  ○ Mark them manually? Too slow!
  ○ We need Forced Alignment.

他 出 生 在 少 林 寺
t a1 ch u1 sh eng1 z ai4 sh ao4 l in2 s iy4

Page 27: The Main Concepts of Speech Recognition

Forced Alignment
● The Expectation-Maximization (EM) algorithm: estimate, then measure.
  ○ Estimate: guess the boundaries first.
  ○ Measure: does the estimation meet the criterion (e.g. Maximum Likelihood)?
    ■ YES: stop the loop.
    ■ NO: use the current estimation to train a model, then estimate again.
● A real example:

[Diagram: the training pipeline runs from Equal Alignment through Mono-phone Alignment to Tri-phone Alignment and further alignments, with 35 and 40 training iterations between stages. Example: the mono-phones "b ai4 t uo1" become the tri-phones "?-b+ai4 b-ai4+t ai4-t+uo1 t-uo1+?".]
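As a hint of what the initial "Equal Alignment" stage does before any model exists, here is a minimal sketch that simply splits an utterance's frames evenly among its phones. This is an assumption about that stage for illustration; the real training scripts do more bookkeeping.

```python
# Minimal sketch of an "equal alignment" seed: split the frames of an
# utterance evenly among its phones before any acoustic model exists.
def equal_alignment(phones, n_frames):
    """Return (phone, start_frame, end_frame) tuples with frames split evenly."""
    per_phone = n_frames / len(phones)
    segments = []
    for i, phone in enumerate(phones):
        start = round(i * per_phone)
        end = round((i + 1) * per_phone)
        segments.append((phone, start, end))
    return segments

# "b ai4 t uo1" spoken over 100 frames (10 ms each, about one second)
print(equal_alignment(["b", "ai4", "t", "uo1"], 100))
# [('b', 0, 25), ('ai4', 25, 50), ('t', 50, 75), ('uo1', 75, 100)]
```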

Page 28: The Main Concepts of Speech Recognition

Deep Learning in the Acoustic Model
● Use a NN to predict the probability of each GMM (pdf).

[Diagram: Feature Extractor → feature vector → DNN → pdf posteriors, e.g. pdf0 0.87, pdf1 0.11, pdf2 0.02, pdf3 0, pdf4 0, pdf5 0, pdf6 0, …]

Page 29: The Main Concepts of Speech Recognition

NN Training as the Acoustic Model
● With forced alignment, we can start DNN supervised learning.

[Diagram: fbank/mfcc feature vectors (the input acoustic features) feed the DNN; its output pdf vector is trained against a one-hot target (e.g. 0 0 1 0 0 … 0) taken from the forced-alignment answer.]
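The diagram boils down to ordinary supervised classification: feature vectors in, pdf targets from the forced alignment out. A minimal sketch of one training step, assuming PyTorch and made-up dimensions and data:

```python
import torch
import torch.nn as nn

# Minimal DNN acoustic-model training step: features in, pdf targets out.
# Dimensions and data are made up; targets would come from forced alignment.
n_features, n_pdfs = 40, 5000            # e.g. 40-dim fbank, 5k pdf classes
model = nn.Sequential(
    nn.Linear(n_features, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, n_pdfs),             # logits over pdf ids
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()          # takes pdf ids as class-index targets

features = torch.randn(32, n_features)           # a mini-batch of frames
targets = torch.randint(0, n_pdfs, (32,))        # pdf ids from the alignment

optimizer.zero_grad()
loss = loss_fn(model(features), targets)
loss.backward()
optimizer.step()
print(loss.item())
```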

Page 30: The Main Concepts of Speech Recognition

NN Training as the Acoustic Model
● With forced alignment, we can start DNN supervised learning.
● You can append other features to the input as well.
● For example:
  ○ Add speaker vectors or i-vectors to the fbank feature vector.
  ○ Add context-frame information: splice frames -3, -2, -1, +1, +2, +3 around the current frame.
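A minimal sketch of both input tricks, splicing context frames and appending a speaker/i-vector, with made-up dimensions and random data:

```python
import numpy as np

# Build one spliced DNN input: frames t-3..t+3 plus a speaker/i-vector.
# Dimensions and data are made up for illustration.
n_frames, feat_dim, ivector_dim, context = 200, 40, 100, 3

fbank = np.random.randn(n_frames, feat_dim)   # one utterance of fbank frames
ivector = np.random.randn(ivector_dim)        # one vector per speaker/utterance

def spliced_input(t):
    """Concatenate frames t-3..t+3 and the i-vector into one input vector."""
    window = fbank[t - context : t + context + 1].reshape(-1)   # 7 * 40 = 280
    return np.concatenate([window, ivector])                    # 280 + 100 = 380

print(spliced_input(10).shape)   # (380,) -- boundary frames need extra padding
```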

Page 31: The Main Concepts of Speech Recognition

Now we have a DNN AM

Neural Network

Lexicon

Page 32: The Main Concepts of Speech Recognition

Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder

Page 33: The Main Concepts of Speech Recognition

The Basic ASR System

Neural Network

Lexicon

Page 34: The Main Concepts of Speech Recognition

What is a Language Model (LM)?
● Usually we use an n-gram LM.
  ○ RNN-LMs are only used for rescoring so far, due to speed issues.
● Given the previous words, it returns a probability for the next word.
● Some usage examples:
  ○ getLMProbability( [你, 好], 嗎 )
    ■ returns the 3-gram probability: 10^-0.428
  ○ getLMProbability( [好], 喔 )
    ■ returns the 2-gram probability: 10^-6.15
  ○ getLMProbability( [你, 好], 啊 )
    ■ not found in the 3-grams, so fall back to getLMProbability( [好], 啊 )
    ■ not found in the 2-grams, so fall back to getLMProbability( [], 啊 )
    ■ returns the 1-gram probability: 10^-1.432

\data\
ngram 1=189234
ngram 2=2668678
ngram 3=1555761

\1-grams:
-1.432 啊
...
\2-grams:
-6.15 好 喔
...
\3-grams:
-0.428 你 好 嗎
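A minimal sketch of the backoff lookup above, using the log10 probabilities from the ARPA snippet. Backoff weights are omitted, so this only shows the flavour of getLMProbability, not a full implementation:

```python
# Backoff n-gram lookup in the spirit of getLMProbability above.
# log10 probabilities copied from the ARPA snippet; backoff weights omitted.
log10_prob = {
    ("啊",): -1.432,              # 1-gram
    ("好", "喔"): -6.15,           # 2-gram
    ("你", "好", "嗎"): -0.428,    # 3-gram
}

def get_lm_probability(history, word):
    """Try the longest n-gram first, then back off to shorter histories."""
    for start in range(len(history) + 1):
        ngram = tuple(history[start:]) + (word,)
        if ngram in log10_prob:
            return 10 ** log10_prob[ngram]
    return 0.0                     # a real LM would never return exactly zero

print(get_lm_probability(["你", "好"], "嗎"))   # 10^-0.428, from the 3-gram
print(get_lm_probability(["你", "好"], "啊"))   # backs off to the 1-gram
```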

Page 35: The Main Concepts of Speech Recognition

Why do we need a Language Model?
● Some words' pronunciations are really similar.
● Users may have unusual pronunciations.
● For example, given the same pronunciation, we can use the LM to decide which word sequence is more reasonable:
  ○ 下次 再見 ("see you next time": reasonable)
  ○ 下次 在 劍 ("next time, at sword": obviously bad)

Page 36: The Main Concepts of Speech Recognition

How to Train a Language Model (Simply)?
● Count words through all the transcriptions:
  ○ 你 好 1377
  ○ 你 好 啊 40
  ○ 你 好 冷漠 1
  ○ …
● Then we get the 3-gram probability:
  ○ getLMProbability( [你, 好], 啊 ) = 40/1377 ≈ 0.02905, i.e. log10 ≈ -1.5369
  ○ Save it into the LM file.
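A minimal sketch of that counting step with the same toy counts. Real LM training adds smoothing and backoff weights, both skipped here:

```python
import math

# Maximum-likelihood 3-gram estimate from raw counts, as on the slide.
# Real LM training adds smoothing and backoff weights; this skips both.
count = {
    ("你", "好"): 1377,
    ("你", "好", "啊"): 40,
}

def trigram_log10(history, word):
    """log10 P(word | history) = log10(count(history + word) / count(history))."""
    return math.log10(count[tuple(history) + (word,)] / count[tuple(history)])

print(trigram_log10(["你", "好"], "啊"))   # about -1.5369, the value stored in the ARPA file
```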

Page 37: The Main Concepts of Speech Recognition

Now we have the N-gram LM!

DNN

Lexicon

N-gram

Page 38: The Main Concepts of Speech Recognition

Outline
● What is Speech Recognition?
● Lexicon
● Acoustic Model
● Language Model
● WFST Decoder

Page 39: The Main Concepts of Speech Recognition

The Basic ASR System

DNN

Lexicon

N-gram

Page 40: The Main Concepts of Speech Recognition

What is a Decoder?
● It integrates the Lexicon, Language Model and Acoustic Model, and finds possible paths for the input speech.
● The path with the highest score represents the most reasonable word sequence.
● Two types of Decoder in ASR:
  ○ Weighted Finite State Transducer (WFST): Kaldi
  ○ Viterbi: HDecode (HTK)

Page 41: The Main Concepts of Speech Recognition

Example of WFST
● Start from state 0.
● Input: acca
● Output: yxxx
● Weight: 0.4 ⊗ 0.7 ⊗ 0.9 ⊗ 1.0 ⊗ 1.3

[Diagram: each arc is labeled input/output/weight. In the decoding graph, the input is an acoustic feature, the output is ϵ / tri-phone / phone / word, and the weight is the score of the path given by the AM/LM.]

Page 42: The Main Concepts of Speech Recognition

4 Levels of HCLG in the WFST Decoder
● HCLG: 4 levels of states in the WFST Decoder.
● H (HMM):
● C (Context):

Transducer   Input           Output
H            HMM state       Context phone
C            Context phone   Phone
L            Phone           Word
G            Word            Word

[Diagram: the DNN AM supplies the probability of each HMM state.]

Page 43: The Main Concepts of Speech Recognition

4 Levels of HCLG in the WFST Decoder
● HCLG: 4 levels of states in the WFST Decoder.
● L (Lexicon):
● G (Grammar):

Example Lexicon:
speech   s p iy ch
the      dh ax
the      dh iy
…

N-gram Language Model

Page 44: The Main Concepts of Speech Recognition

Token & Beam Pruning
● The WFST is like a map.
● A Token is the traveler, which contains:
  ○ Current position
  ○ History path (history words)
  ○ Score
● Tokens travel through the map:
  ○ Make a copy when facing a branch.
  ○ Kill themselves if their score falls too far behind.

Position: 1
History: give, a
Score: 0.87

Page 45: The Main Concepts of Speech Recognition

Token & Beam Pruning
● The decoding loop:
  ○ Copy tokens
  ○ Update AM/LM scores
  ○ Record the highest_score over all tokens
  ○ If Token.score < (highest_score - beam), kill it.
● Finally, we choose the token with the highest score.
● Output its history words as the answer (a sketch of the pruning step follows below).

[Diagram: beam pruning; the surviving token's history "give, a, speech" is the answer. Got the answer!]
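A minimal sketch of the pruning rule described above. The Token fields mirror the slide, while the scores and beam width are made up:

```python
from dataclasses import dataclass, field

# Minimal beam pruning over a set of tokens, following the rule on the slide.
# Token fields mirror the slide; the scores and beam value are made up.
@dataclass
class Token:
    position: int
    history: list = field(default_factory=list)
    score: float = 0.0

def beam_prune(tokens, beam):
    """Keep only tokens whose score is within `beam` of the best score."""
    highest_score = max(t.score for t in tokens)
    return [t for t in tokens if t.score >= highest_score - beam]

tokens = [
    Token(1, ["give", "a"], score=0.87),
    Token(2, ["give", "an"], score=0.55),
    Token(3, ["gift"], score=0.10),
]
survivors = beam_prune(tokens, beam=0.5)          # drops the 0.10 token
best = max(survivors, key=lambda t: t.score)
print(best.history)   # ['give', 'a'] -- its history words are the answer
```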

Page 46: The Main Concepts of Speech Recognition

Summary

Pronunciation?

Which word might come next?

Candidate words?

Final text transcription

DNN AM

N-gram LM

Phone-Labeled Lexicon

WFST Decoder

Page 47: The Main Concepts of Speech Recognition

Thank You For Listening