Robust ASR


Robust Speech Recognition of Far-Field Signals

Robust Speech Recognition: Using Deep Learning for Far-Field Speech Signals

Robust Speech Recognition

• Goal: achieve a low Word Error Rate (WER) in environments with low SNR
• In which situations does this occur?
  • Noisy environments (e.g., running automobiles, crowded rooms)
  • Faint speech signals due to attenuation (e.g., far-field microphone arrays)
  • Reverberant environments (e.g., large rooms, significant distance between speaker and microphone)
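For reference, WER is the word-level Levenshtein distance (substitutions + deletions + insertions) between the reference transcript and the recognizer output, divided by the number of reference words. A minimal Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: minimum number of word substitutions, deletions,
    and insertions turning the reference into the hypothesis, divided by
    the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein dynamic program over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> 1/6
```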

Speech recognition pipeline

Feature-Based Methods

• Adaptive filters
  • Learn to cancel out noise by optimizing FIR or IIR filter parameters
  • Assume noise signals are stationary random processes
  • Some human speech (e.g., an unvoiced fricative) is similar to colored noise
• Spectral subtraction
  • Use a filter bank to split a signal into frequency bands
  • Estimate the noise floor in each band and subtract it from the input signal
• System identification and blind deconvolution
  • Estimate the system composed of the speaker, room, and microphone
  • Remove noise and system artifacts to recover the original signal
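The spectral subtraction idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it uses FFT bins in place of an explicit filter bank, assumes the first few frames are speech-free for the noise-floor estimate, and omits windowing, overlap-add, and over-subtraction floors; the function name and defaults are ours.

```python
import numpy as np

def spectral_subtract(noisy, noise_frames=10, frame=256):
    """Estimate a per-bin noise floor from the leading (assumed
    speech-free) frames, subtract it from every frame's magnitude
    spectrum, and resynthesize using the noisy phase."""
    n = len(noisy) // frame
    frames = noisy[: n * frame].reshape(n, frame)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_floor = mag[:noise_frames].mean(axis=0)      # per-bin noise estimate
    clean_mag = np.maximum(mag - noise_floor, 0.0)     # half-wave rectify
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    return clean.reshape(-1)
```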

Feature Extraction

Adaptive Filter

Spectral Subtraction

Blind Deconvolution

Why These Methods Often Fail

• Noise does not always mean Additive White Gaussian Noise (AWGN)
  • In practice it is often road noise, cocktail-party noise, or background music
• These signals are non-stationary: their statistics change over time
  • This makes it difficult to estimate meaningful bounds
• Changing model parameters often means retraining acoustic models
  • A feature-based method may work well for road noise but fail for cocktail-party noise
• Estimating parameters is difficult and requires a significant amount of training data
  • Gathering and transcribing noisy data is expensive

Acoustic Model

Acoustic Model-Based Methods

• Maximum-Likelihood Linear Regression (MLLR)
  • Determines an affine transform for each acoustic state
  • Incorporates observed state likelihoods into the estimation procedure
  • Uses a small amount of data from a speaker or environment to determine the transformations
• Model adaptation
  • Maximum a posteriori (MAP) methods assume that model parameters have a prior distribution
  • Minimum classification error: uses the classification error as the loss function
  • Minimum phone error: uses the center-phone error as the loss function
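The affine-transform idea behind MLLR can be illustrated with a least-squares fit. This is only the algebra: real MLLR maximizes likelihood and weights each state by its occupation statistics, whereas this sketch just fits one transform mapping original Gaussian means to means observed in the new environment; the function name and shapes are hypothetical.

```python
import numpy as np

def estimate_affine(means, adapted_means):
    # Fit one affine transform (A, b) so that A @ mu + b approximates
    # the adapted mean for every state, in the least-squares sense.
    X = np.hstack([means, np.ones((len(means), 1))])  # augment with a bias column
    W, *_ = np.linalg.lstsq(X, adapted_means, rcond=None)
    return W[:-1].T, W[-1]                            # A, b

# usage: the fit recovers a known transform exactly
rng = np.random.default_rng(0)
mu = rng.normal(size=(50, 4))        # 50 hypothetical state means
A_true = rng.normal(size=(4, 4))
b_true = rng.normal(size=4)
A, b = estimate_affine(mu, mu @ A_true.T + b_true)
```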

Limitations of Model-Based Methods

• An adaptation model is needed for every environment
  • For automotive systems, an adaptation model can improve performance
  • A separate adaptation model may be required for different vehicle types
• Requires creating models for individual speakers in unknown environments
• Speaker Adaptive Training (SAT)
  • Determines a transformation for every speaker in the training set
  • Requires an enrollment period during which the model parameters are estimated for each new user

Deep Learning Applications

Advantages of Multilayer Perceptrons

• With the exception of the final layer, all parameters contribute to all outputs
  • The parameters are optimized over all training examples
  • This improves generalization over the entire training set
• Classes with only a few observations show better performance
  • Parameter estimates will not be biased towards a few speakers
  • By contrast, estimating a covariance matrix requires at least as many observations as the matrix dimension
• Neural network training algorithms use the Hessian of the objective function during optimization
  • Natural gradient methods use the Fisher information matrix computed over all training examples

Deep Learning Innovations

Pretraining

1. Start with a neural network that has one hidden layer and random weights
2. Train the network with a small set of randomly selected training examples
3. Discard the output layer
4. Add an additional hidden layer and output layer with random weights
5. Go to step 2
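The loop above can be sketched in plain NumPy. The MSE loss, sigmoid units, layer sizes, and minibatch size are illustrative choices, not part of the original recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W_h, W_o, X, Y, lr=0.5):
    """One SGD step on MSE loss for a one-hidden-layer network."""
    H = sigmoid(X @ W_h)
    err = H @ W_o - Y
    gW_o = H.T @ err / len(X)
    gH = (err @ W_o.T) * H * (1.0 - H)
    gW_h = X.T @ gH / len(X)
    return W_h - lr * gW_h, W_o - lr * gW_o

def greedy_pretrain(X, Y, n_layers=3, hidden=16, steps=200):
    frozen, inputs = [], X
    for _ in range(n_layers):
        # steps 1 & 4: fresh hidden + output layers with random weights
        W_h = rng.normal(0.0, 0.1, (inputs.shape[1], hidden))
        W_o = rng.normal(0.0, 0.1, (hidden, Y.shape[1]))
        # step 2: train on small randomly selected subsets
        for _ in range(steps):
            idx = rng.choice(len(X), size=8)
            W_h, W_o = train_step(W_h, W_o, inputs[idx], Y[idx])
        # step 3: discard the output layer, keep the hidden layer
        frozen.append(W_h)
        inputs = sigmoid(inputs @ W_h)   # its activations feed the next round
    return frozen, W_o
```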

Dropout Justification

• There are symmetries between each layer of a multilayer perceptron
  • The rows and columns of a weight matrix can be randomly permuted, yet the output is unaltered
  • Similarly, the inputs and outputs of a layer can be multiplied by -1 with the output unaltered
• Dropout helps by placing each layer on a slightly different trajectory in the parameter space
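Both symmetries can be checked numerically for a one-hidden-layer network; note the sign-flip symmetry holds for odd activations such as tanh:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))                       # a batch of inputs
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 3))

def forward(W1, W2):
    return np.tanh(X @ W1) @ W2                   # one hidden layer of tanh units

# permutation symmetry: reordering the hidden units (columns of W1,
# rows of W2) leaves the network function unchanged
perm = rng.permutation(8)
assert np.allclose(forward(W1, W2), forward(W1[:, perm], W2[perm, :]))

# sign symmetry: tanh is odd, so negating a unit's input and output
# weights also leaves the function unchanged
s = np.where(rng.random(8) < 0.5, -1.0, 1.0)
assert np.allclose(forward(W1, W2), forward(W1 * s, W2 * s[:, None]))
```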

Learning Better Features

• Gaussian mixture models (GMMs)
  • Use Mel-Frequency Cepstral Coefficients (MFCCs) or Perceptual Linear Prediction (PLP) features
  • Often combined with context via Linear Discriminant Analysis (LDA)
• Many DNN acoustic models use the Mel filter bank outputs or periodograms directly
• Convolutional neural networks can be used to learn better filter bank coefficients
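A sketch of the standard triangular Mel filter bank whose outputs such models consume. Multiplying a frame's power spectrum by this matrix yields the filter bank energies (a log is usually applied afterwards); the HTK-style mel formula is standard, the parameter defaults are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters, equally spaced on the mel scale, mapped back
    to the positive-frequency FFT bins."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)  # rising edge
        fb[i - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)  # falling edge
    return fb
```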

Traditional Speech Learning Process

Deep Learning Process

Incorporating Convolutional Input Layers

Long Short-Term Memory (LSTM)

• A recurrent neural network
• Has an input gate, an output gate, and a forget gate
• The current state of the memory is modified by the input and the prior state
• The gates control what is stored in the memory, how long it stays there, and when the memory outputs its values
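A single LSTM time step can be sketched as follows (a minimal cell without peephole connections; the weight layout and sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM time step: x is the input, h the previous output,
    c the previous memory state; W stacks all four gates' weights."""
    z = np.concatenate([x, h]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input / forget / output gates
    g = np.tanh(g)                                 # candidate memory update
    c = f * c + i * g        # forget gate decays old memory, input gate admits new
    h = o * np.tanh(c)       # output gate decides when the memory emits its value
    return h, c

# usage: run a short sequence through one cell of width 8
rng = np.random.default_rng(0)
n_in, n_h = 4, 8
W = rng.normal(0.0, 0.1, (n_in + n_h, 4 * n_h))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```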

The Far-Field Problem

Solutions to the Far-Field Problem

• Boosting the gain
  • Amplifies speech and noise alike
  • The SNR remains nearly the same
• Removing reverberation
  • Equivalent to blind deconvolution without a priori knowledge of the room geometry
  • Estimate the reverberation parameters from a chirp frequency response
• Speech enhancement
  • Model the room as a communication channel and attempt blind equalization
  • Create filters to remove noise before boosting the gain
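The first point is easy to verify numerically: boosting the gain scales the speech and noise energies by the same factor, so the SNR is unchanged. The signals here are random stand-ins for the two components:

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)        # stand-in for the speech component
noise = 0.5 * rng.normal(size=16000)   # stand-in for the noise component

def snr_db(s, n):
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(n ** 2))

# scaling both components by the same gain leaves their energy ratio,
# and hence the SNR, unchanged
gain = 4.0
assert abs(snr_db(speech, noise) - snr_db(gain * speech, gain * noise)) < 1e-9
```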

Collecting Far-Field Data

• Con: expensive and time-consuming
  • Rent apartments
  • Pay people to speak
  • Pay transcriptionists to label the speech data
• Pro: representative data!
  • Record close-talking and far-field data simultaneously
  • Record from multiple locations in a room simultaneously by using multiple devices

Model Adaptation vs. Data Collection

• Adapting models
  • Moderately improves performance when a model is used in a different environment
  • Requires only a small data-collection effort
  • Can improve WER from 30% to between 20% and 15%
• Collecting data
  • Significantly improves performance because the training data matches the environment
  • WER is typically less than 10%

Conclusions

• A model is only as good as the data used to train it
  • A mismatch between training data and field data will significantly impact performance
  • Collecting representative training data is expensive, but may be necessary
• Deep learning can improve performance
  • Use the right learning structures
  • Let the model learn its own optimal feature representation
  • Combine data from multiple environments into one model (e.g., close-talking, far-field)


• Not all at once
• Raise your hand