Robust ASR

ROBUST SPEECH RECOGNITION USING DEEP LEARNING FOR FAR-FIELD SPEECH SIGNALS

ROBUST SPEECH RECOGNITION

• Achieving a low Word Error Rate (WER) in environments with low SNR
• In which situations does this occur?
  • Noisy environments (e.g., running automobiles, crowded rooms)
  • Faint speech signals due to attenuation (e.g., far-field microphone arrays)
  • Reverberant environments (e.g., large rooms, significant distance between speaker and microphone)
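WER is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch follows; the function name and the whitespace tokenization are assumptions made for illustration, not part of the slides.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing "turn the radio on" with "turn radio off" counts one deletion and one substitution against four reference words.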

SPEECH RECOGNITION PIPELINE

Feature Extraction → Acoustic Model → Language Model

FEATURE-BASED METHODS

• Adaptive Filters
  • Learn to cancel out noise by optimizing FIR or IIR filter parameters
  • Assume noise signals are stationary random processes
  • Some human speech (e.g., an unvoiced fricative) is similar to colored noise

• Spectral Subtraction
  • Use a filter bank to split a signal into frequency bands
  • Estimate the noise floor in each frequency band and subtract it from the input signal

• System Identification and Blind Deconvolution
  • Estimate the system composed of the speaker, room, and microphone
  • Remove noise and system artifacts to recover the original signal

FEATURE EXTRACTION

FFT → MFCC → Context → LDA

ADAPTIVE FILTER

• Find an FIR or IIR filter h such that: (h * x)[n] ≈ s[n], with * denoting convolution

• Where x is the noisy signal and s is the original clean signal
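One way to make the filter-fitting idea concrete is a least-squares FIR fit: given a noisy signal and the matching clean signal, choose the taps that minimize the squared error between the filtered noisy signal and the clean one. This is a sketch under assumptions. The function names, the tap count, and the synthetic sinusoid are hypothetical, and a deployed adaptive filter would normally update its taps online (for example with LMS) rather than solve one batch problem.

```python
import numpy as np

def fit_fir_least_squares(noisy, clean, num_taps=16):
    """Return FIR taps h minimizing ||clean - (h * noisy)||^2 (causal filter)."""
    padded = np.concatenate([np.zeros(num_taps - 1), noisy])
    # Row n holds noisy[n], noisy[n-1], ..., noisy[n-num_taps+1].
    X = np.stack([padded[n:n + num_taps][::-1] for n in range(len(noisy))])
    h, *_ = np.linalg.lstsq(X, clean, rcond=None)
    return h

def apply_fir(h, signal):
    """Filter the signal with taps h, keeping the original length."""
    return np.convolve(signal, h)[:len(signal)]

# Synthetic demonstration: a slow sinusoid buried in white noise.
rng = np.random.default_rng(0)
t = np.arange(2000)
clean = np.sin(2 * np.pi * 0.01 * t)
noisy = clean + 0.5 * rng.normal(size=t.size)

h = fit_fir_least_squares(noisy, clean)
estimate = apply_fir(h, noisy)
mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((estimate - clean) ** 2)
```

Because the identity filter is one of the candidate solutions, the fitted filter cannot do worse than the unfiltered signal on the data it was fitted to. The stationarity assumption shows up here as a single fixed set of taps for the whole recording.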

SPECTRAL SUBTRACTION

1. Apply a filter bank to the periodogram
2. Compute the energy in each frequency band
3. Use the minimum energy observed
4. Subtract the minimum energy
5. Use the difference as the observed filter bank output
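The five steps can be sketched in NumPy as follows. Several details are assumptions, since the slides do not give them: the frame length and hop, the Hann window, equal-width rectangular bands in place of a Mel filter bank, and taking the minimum over all frames of the recording.

```python
import numpy as np

def spectral_subtraction(signal, frame_len=256, hop=128, num_bands=20):
    """Return band energies with the per-band minimum (noise floor) removed."""
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    # Periodogram of each frame.
    periodogram = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_len
    # Steps 1-2: rectangular filter bank, then energy per band.
    edges = np.linspace(0, periodogram.shape[1], num_bands + 1).astype(int)
    band_energy = np.stack(
        [periodogram[:, lo:hi].sum(axis=1)
         for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
    # Step 3: minimum energy observed in each band across all frames.
    noise_floor = band_energy.min(axis=0)
    # Steps 4-5: subtract it and use the difference as the filter bank output.
    return band_energy - noise_floor
```

Subtracting the per-band minimum keeps every output non-negative, so no flooring step is needed in this form.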

BLIND DECONVOLUTION

• Estimate the input signal given the output signal and system characteristics

• System identification involves estimating the system response
  • Requires many input and output pairs
  • Complexity depends on the number of filter coefficients and the noise statistics
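A frequency-domain sketch of both halves: estimate the system response from input/output pairs, then invert it to recover an unseen input. The assumptions are strong and stated plainly: the system is linear and time-invariant, convolution is treated as circular, the pairs are noise-free, and the function names and the regularizer eps are hypothetical. Truly blind deconvolution, where no inputs are available, is considerably harder than this.

```python
import numpy as np

def identify_system(inputs, outputs, eps=1e-8):
    """Estimate the frequency response H from rows of paired inputs/outputs."""
    X = np.fft.fft(inputs, axis=-1)
    Y = np.fft.fft(outputs, axis=-1)
    # Average the cross-spectrum and input power spectrum over all pairs.
    cross = (Y * np.conj(X)).sum(axis=0)
    power = (np.abs(X) ** 2).sum(axis=0)
    return cross / (power + eps)

def deconvolve(output, H, eps=1e-8):
    """Recover the input from one output using a regularized inverse filter."""
    Y = np.fft.fft(output)
    return np.fft.ifft(Y * np.conj(H) / (np.abs(H) ** 2 + eps)).real
```

Averaging over more pairs is what makes the estimate of H stable, which matches the point that system identification requires many input and output pairs.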

WHY THESE METHODS OFTEN FAIL

• Noise does not always mean Additive White Gaussian Noise (AWGN)
  • Often interpreted as road noise, cocktail party noise, or background music
  • These signals are non-stationary
    • Their statistics change over time
    • Difficult to estimate meaningful bounds

• Changing model parameters often means retraining acoustic models
  • A feature-based method may work well for road noise but fail for cocktail party noise
  • Estimating parameters is difficult and requires a significant amount of training data
    • Gathering and transcribing noisy data is expensive

ACOUSTIC MODEL

GMM → HMM → Viterbi Search (using the Lexicon)

ACOUSTIC MODEL-BASED METHODS

• Maximum-Likelihood Linear Regression (MLLR)
  • Determines an affine transform for each acoustic state
  • Incorporates observed state likelihoods into the estimation procedure
  • Uses a small amount of data from a speaker or environment to determine the transformations

• Model Adaptation
  • Maximum a posteriori methods: assume that the model parameters have a prior distribution
  • Minimum classification error: uses the classification error as the loss function
  • Minimum phone error: uses the center phone error as the loss function
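The affine-transform idea behind MLLR can be illustrated with a heavily simplified sketch. It assumes hard frame-to-state alignments, identity covariances, and a single transform shared by all states, which reduces the maximum-likelihood estimate to ordinary least squares. Real MLLR weights each frame by state occupation probabilities and uses the model covariances. The function names are hypothetical.

```python
import numpy as np

def estimate_mllr_transform(means, frames, state_ids):
    """Fit W = [b | A] so that A @ mean + b matches the adaptation frames."""
    # Extended mean vector [1, mu] for the state aligned to each frame.
    extended = np.hstack([np.ones((len(state_ids), 1)), means[state_ids]])
    # Least squares: frames ≈ extended @ W.T
    W_T, *_ = np.linalg.lstsq(extended, frames, rcond=None)
    return W_T.T  # shape (dim, dim + 1)

def adapt_means(means, W):
    """Apply the affine transform to every state mean."""
    b, A = W[:, 0], W[:, 1:]
    return means @ A.T + b
```

Because one transform is shared, states that never appear in the adaptation data are still moved, which is why a small amount of data from a speaker or environment can be enough.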

LIMITATIONS OF MODEL-BASED METHODS

• An adaptation model is needed for every environment
  • For automotive systems, an adaptation model can improve performance
  • An adaptation model may be required for different vehicle types
  • Requires creating models for individual speakers in unknown environments

• Speaker Adaptive Training (SAT)
  • Determines a transformation for every speaker in the training set
  • Requires an enrollment period during which the model parameters are estimated for each new user

DEEP LEARNING APPLICATIONS

• Acoustic states are typically modeled as a mixture of Gaussian distributions
  • The parameters are estimated from labeled training data
  • The size of the model is limited to a finite number of states and a finite number of total parameters

• Artificial neural networks with a softmax output layer approximate the Gaussian mixtures
  • Many modern ASR systems replace the Gaussian mixture models with multilayer perceptrons
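A minimal forward pass shows what "softmax as output layer" means in practice: each feature frame is mapped to a probability distribution over acoustic states. The sigmoid hidden activation, the layer layout, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_state_posteriors(features, layers):
    """Map feature frames to state posteriors.

    features: array of shape (num_frames, feature_dim)
    layers:   list of (W, b) pairs; the last pair is the softmax output layer
    """
    h = features
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden layer
    W, b = layers[-1]
    return softmax(h @ W + b)
```

Each row of the result sums to one, so it can stand in for the per-state scores that the Gaussian mixtures would otherwise supply to the HMM.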

ADVANTAGES OF MULTILAYER PERCEPTRONS

• With the exception of the final layer, all parameters contribute to all outputs
  • The parameters are optimized over all training examples
  • This improves generalization over the entire training set
  • Classes with only a few observations show better performance

• Parameter estimates will not be biased towards a few speakers
  • Estimating a covariance matrix requires at least as many observations as the matrix dimension
  • Second-order neural network training algorithms use the Hessian of the objective function during optimization
  • Natural gradient methods use the Fisher information matrix computed over all training examples

DEEP LEARNING INNOVATIONS

• Pretraining
  • Each layer is pretrained on a small data set of randomly selected training examples
  • The pretrained output layer is discarded, but the newly trained hidden layer remains

• Rectified Linear Units
  • The ramp function is used instead of the sigmoid or hyperbolic tangent
  • Its derivative is the step function
  • This addresses the issue of vanishing gradients during training

• Dropout
  • There are symmetries between the layers
  • To break the symmetry, each unit is dropped with some probability during a training step, so no update is applied to its weights for that step
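The ramp function, its step-function derivative, and a dropout mask are each a line or two of NumPy. The sketch below uses the commonly described "inverted" form of dropout, which zeroes individual units and rescales the survivors; that choice and the function names are assumptions, not something the slides specify.

```python
import numpy as np

def relu(x):
    """Ramp function: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of the ramp: a step function (taken as 0 at x = 0)."""
    return (x > 0).astype(float)

def sigmoid_grad(x):
    """Derivative of the sigmoid, which shrinks toward 0 for large |x|."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def dropout(activations, drop_prob, rng):
    """Zero each unit with probability drop_prob and rescale the rest."""
    keep = rng.random(activations.shape) >= drop_prob
    return activations * keep / (1.0 - drop_prob)
```

Comparing relu_grad and sigmoid_grad at a large positive input shows the vanishing-gradient point: the ramp passes the gradient through unchanged, while the sigmoid scales it by a value close to zero.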

PRETRAINING

1. Start with a neural network with one hidden layer with random weights
2. Train the network with a small set of randomly selected training examples
3. Discard the output layer
4. Add an additional hidden layer and output layer with random weights
5. Go to step 2
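The loop above can be sketched as greedy layer-wise supervised pretraining. The details are assumptions: ReLU hidden layers, a softmax output trained with plain gradient descent on cross-entropy, all existing layers updated in every round, and a stopping rule (one round per entry in hidden_sizes) that the slides do not state. Function names and hyperparameters are hypothetical.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases, stored as a mutable [W, b] pair."""
    return [rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)]

def train(layers, x, y, lr=0.1, epochs=50):
    """Gradient descent on cross-entropy; the last layer is the softmax output."""
    for _ in range(epochs):
        acts = [x]
        for W, b in layers[:-1]:
            acts.append(np.maximum(0.0, acts[-1] @ W + b))
        logits = acts[-1] @ layers[-1][0] + layers[-1][1]
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - y) / len(x)  # gradient w.r.t. the output logits
        for i in range(len(layers) - 1, -1, -1):
            W, b = layers[i]
            grad_W, grad_b = acts[i].T @ grad, grad.sum(axis=0)
            if i > 0:  # backpropagate through the ReLU of the layer below
                grad = (grad @ W.T) * (acts[i] > 0)
            layers[i][0], layers[i][1] = W - lr * grad_W, b - lr * grad_b

def pretrain(x, y, hidden_sizes, rng, subset=200):
    """x: (n, features); y: one-hot labels (n, classes)."""
    # Step 1: one hidden layer plus an output layer, random weights.
    layers = [init_layer(rng, x.shape[1], hidden_sizes[0]),
              init_layer(rng, hidden_sizes[0], y.shape[1])]
    for depth in range(len(hidden_sizes)):
        # Step 2: train on a small random subset.
        idx = rng.choice(len(x), size=min(subset, len(x)), replace=False)
        train(layers, x[idx], y[idx])
        if depth + 1 < len(hidden_sizes):
            layers.pop()  # Step 3: discard the output layer.
            # Step 4: add a new hidden layer and a new output layer.
            layers.append(init_layer(rng, hidden_sizes[depth], hidden_sizes[depth + 1]))
            layers.append(init_layer(rng, hidden_sizes[depth + 1], y.shape[1]))
    return layers
```

The returned network would then normally be fine-tuned on the full training set, which is outside the scope of this sketch.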

DROPOUT JUSTIFICATION

• There are symmetries between each layer of a multilayer perceptron
  • The rows and columns of a weight matrix can be randomly permuted, provided the adjacent layer is permuted to match
  • However, the output would be unaltered
  • Similarly, the inputs and outputs of a layer could also be multiplied by -1 with the output unaltered (for an odd activation such as the hyperbolic tangent)
• Dropout helps by placing each layer on a slightly different trajectory in the parameter space
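Both symmetries are easy to check numerically on a small network. The sketch assumes one tanh hidden layer; the sign-flip symmetry relies on tanh being an odd function and does not hold for the ramp function. All names and sizes are illustrative.

```python
import numpy as np

def network(x, W1, b1, W2, b2):
    """One tanh hidden layer followed by a linear output layer."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)
baseline = network(x, W1, b1, W2, b2)

# Permutation symmetry: reorder the hidden units consistently in both layers.
perm = rng.permutation(6)
permuted = network(x, W1[:, perm], b1[perm], W2[perm, :], b2)

# Sign symmetry: negate a hidden unit's inputs and its outgoing weights.
signs = rng.choice([-1.0, 1.0], size=6)
flipped = network(x, W1 * signs, b1 * signs, W2 * signs[:, None], b2)

same_after_permutation = np.allclose(baseline, permuted)
same_after_sign_flip = np.allclose(baseline, flipped)
```

Many distinct weight settings therefore compute exactly the same function, which is the redundancy the slide refers to.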

LEARNING BETTER FEATURES

• Gaussian mixture models (GMM)
  • Use Mel Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) features
  • Combined in context with Linear Discriminant Analysis (LDA)

• Many DNN acoustic models use the Mel filter bank outputs or periodograms
  • Convolutional neural networks are used to learn better filter bank coefficients
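For reference, the fixed Mel filter bank that a learned convolutional front end would replace can be sketched as a matrix of triangular filters. The Hz-to-Mel formula below is the widely used 2595·log10(1 + f/700) variant; the function names, the absence of any filter normalization, and the parameter choices are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters, fft_size, sample_rate):
    """Triangular filters, shape (num_filters, fft_size // 2 + 1)."""
    # Filter edges are equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Centre frequency of each FFT bin.
    freqs = np.linspace(0.0, sample_rate / 2.0, fft_size // 2 + 1)
    bank = np.zeros((num_filters, len(freqs)))
    for m in range(num_filters):
        lo, centre, hi = hz_points[m:m + 3]
        rising = (freqs - lo) / (centre - lo)
        falling = (hi - freqs) / (hi - centre)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank
```

Multiplying a periodogram of shape (num_frames, fft_size // 2 + 1) by bank.T gives the filter bank outputs. A convolutional input layer can be read as learning the rows of such a matrix instead of fixing them in advance.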

TRADITIONAL SPEECH LEARNING PROCESS

FFT → MFCC → Context → LDA → GMM → HMM

DEEP LEARNING PROCESS

FFT → Filter Bank → Context → DNN → HMM

INCORPORATING CONVOLUTIONAL INPUT LAYERS

FFT → CNN → LSTM → DNN → HMM

LONG SHORT TERM MEMORY (LSTM)

• A recurrent neural network
• Has an input gate, an output gate, and a forget gate
• The current state of the memory is modified by the input and the prior state
• The gates control what is stored in the memory, how long it stays in the memory, and when the memory should output its values
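A single time step of an LSTM cell, written out in NumPy, makes the role of each gate explicit. This follows the standard formulation without peephole connections; the stacked weight layout and the names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step.

    x: input (D,); h_prev, c_prev: previous output and memory (H,)
    W: (4H, D), U: (4H, H), b: (4H,), stacked in the order
       input gate, forget gate, output gate, candidate values.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: what is stored in the memory
    f = sigmoid(z[H:2 * H])      # forget gate: how long it stays there
    o = sigmoid(z[2 * H:3 * H])  # output gate: when the memory is exposed
    g = np.tanh(z[3 * H:])       # candidate values from input and prior state
    c = f * c_prev + i * g       # updated memory
    h = o * np.tanh(c)           # output
    return h, c
```

Running this step over a sequence of feature frames, carrying h and c forward, gives the recurrent behavior described above.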

THE FAR-FIELD PROBLEM

• Far-field speech arrives at the microphone with lower energy
  • Speech energy is dispersed according to an inverse-square law
  • Speakers will use different inflections when they project their voice

• Reverberation
  • Time-delayed echoes will also be recorded by the microphone
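Both effects can be put into numbers with a short sketch. Under an inverse-square law, intensity falls as 1/d², so the level drop between two distances is 20·log10 of their ratio. Reverberation is modeled here, very crudely, as a handful of delayed and scaled copies of the signal. The function names and the free-field, discrete-echo model are assumptions.

```python
import numpy as np

def attenuation_db(near_distance, far_distance):
    """Level drop in dB when moving from near_distance to far_distance."""
    # Intensity ratio is (far / near)^2, so 10*log10 of it is 20*log10(far / near).
    return 20.0 * np.log10(far_distance / near_distance)

def add_echoes(signal, delays, gains):
    """Add time-delayed, scaled copies of the signal (delays in samples)."""
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()
    for delay, gain in zip(delays, gains):
        out[delay:] += gain * signal[:len(signal) - delay]
    return out
```

Doubling the distance costs about 6 dB, so a speaker 4 m from the microphone arrives roughly 18 dB below the level at 0.5 m, before any reverberation is considered.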

SOLUTIONS TO THE FAR-FIELD PROBLEM

• Boosting the gain
  • Amplifies speech and noise
  • The SNR remains nearly the same

• Removing reverberation
  • Equivalent to blind deconvolution without a priori knowledge of room geometry
  • Estimating the reverberation parameters from chirp frequency response

• Speech Enhancement
  • Model the room as a communication channel and attempt blind equalization
  • Create filters to remove noise before boosting the gain
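A short check shows why boosting the gain alone does not help: the same factor scales both the speech and the noise, so it cancels in the ratio. The signals below are synthetic placeholders and the names are illustrative.

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from separate speech and noise components."""
    return 10.0 * np.log10(np.sum(speech ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
speech = 0.05 * np.sin(2 * np.pi * 0.01 * np.arange(1000))  # faint far-field speech
noise = 0.02 * rng.normal(size=1000)

gain = 20.0
snr_before = snr_db(speech, noise)
snr_after = snr_db(gain * speech, gain * noise)
```

In a real system the SNR is only "nearly" the same, because amplifier noise and clipping add small differences that this idealized sketch leaves out.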

COLLECTING FAR-FIELD DATA

• Con: Expensive and time-consuming
  • Rent apartments
  • Pay people to speak
  • Pay transcriptionists to label speech data

• Pro: Representative data!
  • Record close-talking and far-field data simultaneously
  • Record from multiple locations in a room simultaneously by using multiple devices

MODEL ADAPTATION VS. DATA COLLECTION

• Adapting Models
  • Moderately improves performance when the model is used in a different environment
  • Requires a small data collection effort
  • Can improve WER from 30% to between 15% and 20%

• Data Collection
  • Significantly improves performance because the training data matches the environment
  • WER is typically less than 10%

CONCLUSIONS

• A model is only as good as the data used to train it
  • A mismatch between training data and field data will impact performance significantly
  • Collecting representative training data is expensive, but may be necessary

• Deep learning can improve performance
  • Use the right learning structures
  • Let the model learn its own optimal feature representation
  • Combine data from multiple environments into one model (e.g., close-talking, far-field)

QUESTIONS?

• Not all at once
• Raise your hand
