Transcript
Page 1: DeepNeural Networks for Acoustic …web.cse.ohio-state.edu/~wang.7642/homepage/files...DeepNeural Networks for Acoustic ModelinginSpeech Recognition Hinton, Geoffrey, et al. “Deep

Deep Neural Networks for Acoustic Modeling in Speech Recognition

Hinton, Geoffrey, et al. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.” Signal Processing Magazine, IEEE 29.6 (2012): 82-97.

Presented by Peidong Wang, 04/04/2016


Content

• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion


Speech Recognition System

• Goal
  • Converting speech to text
• A Mathematical Perspective

$$\hat{w} = \arg\max_{w} P(w \mid Y)$$

or

$$\hat{w} = \arg\max_{w} P(Y \mid w) P(w)$$
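The decision rule above can be sketched in a few lines of Python. This is a toy illustration rather than a real decoder: the candidate sentences and their scores are made-up numbers, and `acoustic_logp` and `lm_logp` are hypothetical names standing in for log P(Y | w) and log P(w).

```python
# Hypothetical candidate transcriptions with made-up scores:
# acoustic_logp ~ log P(Y | w), lm_logp ~ log P(w).
candidates = {
    "recognize speech": {"acoustic_logp": -12.0, "lm_logp": -3.0},
    "wreck a nice beach": {"acoustic_logp": -11.5, "lm_logp": -7.0},
}

def decode(cands):
    # argmax_w P(Y | w) P(w), computed as a sum of log probabilities.
    return max(cands, key=lambda w: cands[w]["acoustic_logp"] + cands[w]["lm_logp"])

best = decode(candidates)
```

The second candidate has the better acoustic score, but the language model term outweighs it, so the first candidate is selected.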


GMM-HMM Model

• GMM and HMM
  • GMM is short for Gaussian Mixture Model, and HMM is short for Hidden Markov Model.
• Predecessor of DNNs
  • Before Deep Neural Networks (DNNs), the most commonly used speech recognition systems consisted of GMMs and HMMs.


• HMM
  • HMM is used to deal with the temporal variability of speech.
• GMM
  • GMM is used to represent the relationship between HMM states and the acoustic input.


• Features
  • The features are typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) computed from the raw waveform and their first- and second-order temporal differences.
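As a rough sketch of how static coefficients can be extended with first- and second-order temporal differences, the snippet below uses simple finite differences from NumPy. This is an assumption made for brevity: speech toolkits commonly use a regression over a window of frames instead. The function name `add_deltas` and the random input are hypothetical.

```python
import numpy as np

def add_deltas(static):
    """static: (num_frames, num_coeffs) array of MFCC or PLP features."""
    # First-order temporal difference (finite differences along the time axis).
    delta = np.gradient(static, axis=0)
    # Second-order temporal difference.
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([static, delta, delta2], axis=1)

frames = np.random.default_rng(0).normal(size=(100, 13))  # stand-in for 13 MFCCs per frame
features = add_deltas(frames)
```

With 13 static coefficients per frame, the concatenated vector has 39 dimensions.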


• Shortcoming
  • GMM-HMM models are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space.
  • For example, modeling the set of points that lie very close to the surface of a sphere.


Training Deep Neural Networks

• Deep Neural Network (DNN)
  • A DNN is a feed-forward, artificial neural network that has more than one layer of hidden units between its inputs and its outputs.
  • With nonlinear activation functions, a DNN is able to approximate an arbitrary nonlinear function (a mapping from inputs to outputs). [*]

[*] Added by the presenter.


• Activation Function of the Output Units
  • The activation function of the output units is the “softmax” function.
  • The mathematical expression is as follows.

$$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}$$


• Objective Function
  • When using the softmax output function, the natural objective function (cost function) C is the cross-entropy between the target probabilities d and the outputs of the softmax, p.
  • The mathematical expression is as follows.

$$C = -\sum_j d_j \log p_j$$
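A minimal NumPy sketch of the softmax output and the cross-entropy cost follows. The input values are made up, and the max-subtraction inside `softmax` is a standard numerical-stability trick that is not part of the slides. The last line uses the well-known result that, for targets summing to one, the derivative of the cost with respect to the softmax inputs is p - d.

```python
import numpy as np

def softmax(x):
    # Subtracting the max leaves p unchanged but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(d, p):
    # C = -sum_j d_j log p_j
    return -np.sum(d * np.log(p))

x = np.array([2.0, 1.0, 0.1])   # made-up total inputs to the output units
d = np.array([1.0, 0.0, 0.0])   # one-hot target probabilities
p = softmax(x)
C = cross_entropy(d, p)
grad_x = p - d                  # dC/dx_j for softmax combined with cross-entropy
```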


• Weight Penalties and Early Stopping
  • To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or the learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse.
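Both ideas can be sketched on a toy problem. The example below is an illustration only: it fits a linear model by gradient descent rather than a DNN, the data are synthetic, and `lr`, `weight_decay`, and `patience` are hypothetical hyperparameters chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
w_true = rng.normal(size=5)
y_train = X_train @ w_true + 0.5 * rng.normal(size=80)
y_val = X_val @ w_true + 0.5 * rng.normal(size=40)

def val_loss(w):
    return np.mean((X_val @ w - y_val) ** 2)

w = np.zeros(5)
lr, weight_decay, patience = 0.05, 1e-3, 5   # hypothetical hyperparameters
best_w, best_loss, bad_epochs = w.copy(), val_loss(w), 0

for epoch in range(500):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    grad += 2 * weight_decay * w             # gradient of weight_decay * ||w||^2
    w -= lr * grad
    loss = val_loss(w)
    if loss < best_loss:
        best_w, best_loss, bad_epochs = w.copy(), loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # held-out loss stopped improving
            break
```

The weights kept at the end are `best_w`, the ones with the lowest held-out loss, rather than the final `w`.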


• Overfitting Reduction
  • Generally speaking, there are three methods.
  • Weight penalties and early stopping can reduce overfitting, but only by removing much of the modeling power.
  • Very large training sets can reduce overfitting, but only by making training very computationally expensive.
  • Generative pretraining


Generative Pretraining

• Purpose
  • The multiple layers of feature detectors (the result of this step) can be used as a good starting point for a discriminative “fine-tuning” phase during which backpropagation through the DNN slightly adjusts the weights and improves the performance.
  • In addition, this step can significantly reduce overfitting.


• Restricted Boltzmann Machine (RBM)
  • An RBM consists of a layer of stochastic binary “visible” units that represent binary input data, connected to a layer of stochastic binary hidden (latent) units that learn to model significant nonindependencies between the visible units.
  • There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections.


• Restricted Boltzmann Machine (RBM) (Cont’d)
  • The framework of an RBM is shown below.

[Figure: framework of an RBM. From: Slides in CSE 5526 Neural Networks]


• Restricted Boltzmann Machine (RBM) (Cont’d)
  • An RBM uses a single set of parameters, W, to define the joint probability of a vector of values of the observable variables, v, and a vector of values of the latent variables, h, via an energy function, E.

$$p(v, h; W) = \frac{1}{Z} e^{-E(v, h; W)}, \qquad Z = \sum_{v', h'} e^{-E(v', h'; W)}$$

$$E(v, h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i, j} v_i h_j w_{ij}$$
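For a very small RBM the partition function can be computed by brute-force enumeration, which makes the definitions above concrete. The sketch below assumes binary units, with a and b as the visible and hidden biases; the sizes and random parameters are made up, and enumeration is only feasible at toy scale.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 3, 2                      # tiny sizes so that Z can be enumerated
W = rng.normal(scale=0.5, size=(n_vis, n_hid))
a = rng.normal(scale=0.5, size=n_vis)    # visible biases
b = rng.normal(scale=0.5, size=n_hid)    # hidden biases

def energy(v, h):
    # E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -(a @ v) - (b @ h) - (v @ W @ h)

def binary_vectors(n):
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]

# Partition function: sum over every joint configuration (v', h').
Z = sum(np.exp(-energy(v, h))
        for v in binary_vectors(n_vis) for h in binary_vectors(n_hid))

def joint_prob(v, h):
    return np.exp(-energy(v, h)) / Z

total = sum(joint_prob(v, h)
            for v in binary_vectors(n_vis) for h in binary_vectors(n_hid))
```

Summing `joint_prob` over all 32 configurations gives 1 up to floating-point error, as a normalized distribution must.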


• Restricted Boltzmann Machine (RBM) (Cont’d)
  • The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors.

$$p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$

  • The derivative of the log probability of a training set with respect to a weight is surprisingly simple. The angle brackets denote expectations under the corresponding distribution.

$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log p(v^n)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$


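• Restricted Boltzmann Machine (RBM) (Cont’d)
  • The learning rule is thus as follows.

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$$

  • A much faster learning procedure is contrastive divergence (CD), which is shown below. The subscript “recon” denotes a step in CD when the states of visible units are assigned 0 or 1 according to the current states of the hidden units.

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$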
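One step of CD for a binary RBM might look like the sketch below. Several choices here are assumptions rather than details from the slides: a single reconstruction step (CD-1), hidden probabilities instead of sampled states when collecting the pairwise statistics, no bias updates, and made-up data and sizes. The name `cd1_update` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, eps=0.1):
    """One CD-1 weight update for a batch v_data of shape (batch, n_vis)."""
    # Positive phase: hidden probabilities and sampled states given the data.
    h_prob = sigmoid(v_data @ W + b)
    h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Reconstruction: visible units set to 0 or 1 given the hidden states.
    v_prob = sigmoid(h_state @ W.T + a)
    v_recon = (rng.random(v_prob.shape) < v_prob).astype(float)
    # Hidden probabilities given the reconstruction.
    h_recon = sigmoid(v_recon @ W + b)
    n = v_data.shape[0]
    pos = v_data.T @ h_prob / n      # estimate of <v_i h_j>_data
    neg = v_recon.T @ h_recon / n    # estimate of <v_i h_j>_recon
    return W + eps * (pos - neg)

n_vis, n_hid = 6, 4
W = 0.01 * rng.normal(size=(n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((10, n_vis)) < 0.5).astype(float)   # made-up binary data
W_new = cd1_update(batch, W, a, b)
```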


• Modeling Real-Valued Data
  • Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise, and the RBM energy function can be modified to accommodate such variables, giving a Gaussian-Bernoulli RBM (GRBM).

$$E(v, h) = \sum_{i \in \text{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i, j} \frac{v_i}{\sigma_i} h_j w_{ij}$$


• Stacking RBMs to Make a Deep Belief Network
  • After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM.
  • This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data.


• Stacking RBMs to Make a Deep Belief Network (Cont’d)

[Figure omitted. From: The paper]


• Interfacing a DNN with an HMM
  • In an HMM framework, the hidden variables denote the states of the phone sequence, and the “visible” variables denote the feature vectors. [*]

[*] Added by the presenter.

[Figure omitted. From: Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends in Signal Processing 1.3 (2008): 195-304.]


• Interfacing a DNN with an HMM (Cont’d)
  • To compute a Viterbi alignment or to run the forward-backward algorithm within the HMM framework, we require the likelihood p(AcousticInput | HMMstate).
  • A DNN, however, outputs probabilities of the form p(HMMstate | AcousticInput).


• Interfacing a DNN with an HMM (Cont’d)
  • The posterior probabilities that the DNN outputs can be converted into scaled likelihoods by dividing them by the frequencies of the HMM states in the forced alignment that is used for fine-tuning the DNN.
  • Forced alignment is a procedure used to generate labels for the training process. [*]

[*] Added by the presenter.


• Interfacing a DNN with an HMM (Cont’d)
  • All of the likelihoods produced in this way are scaled by the same unknown factor of p(AcousticInput).
  • Although the conversion from posteriors to scaled likelihoods appears to have little effect on some recognition tasks, it can be important for tasks where training labels are highly unbalanced.
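The posterior-to-scaled-likelihood conversion follows from Bayes' rule: p(input | state) = p(state | input) p(input) / p(state), so dividing the posterior by the state prior gives the likelihood up to the common factor p(input). A minimal sketch in the log domain follows; the function name, the labels, and the posteriors are all made up for illustration, and states that never occur in the alignment would need a floor on the prior, which is omitted here.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, alignment_labels, num_states):
    """posteriors: (frames, states) DNN outputs p(state | acoustic input).
    alignment_labels: one state index per training frame from the forced alignment."""
    counts = np.bincount(alignment_labels, minlength=num_states).astype(float)
    priors = counts / counts.sum()            # state frequencies, i.e. p(state)
    # log p(input | state) + const = log p(state | input) - log p(state)
    return np.log(posteriors) - np.log(priors)

labels = np.array([0, 0, 0, 0, 0, 0, 1, 2])   # made-up, deliberately unbalanced labels
post = np.array([[0.6, 0.3, 0.1]])            # made-up DNN output for one frame
scaled = scaled_log_likelihoods(post, labels, num_states=3)
```

In this unbalanced example the posterior favors state 0, but after dividing by the priors (0.75, 0.125, 0.125) state 1 has the largest scaled likelihood, which illustrates why the conversion can matter.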


Experiments

• Phonetic Classification and Recognition on TIMIT
  • The TIMIT dataset is a relatively small dataset that provides a simple and convenient way of testing new approaches to speech recognition.


• Phonetic Classification and Recognition on TIMIT (Cont’d)

[Figure or table omitted. From: The paper]


• Bing-Voice-Search Speech Recognition Task
  • This task used 24 h of training data with a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, etc.
  • The best DNN-HMM acoustic model achieved a sentence accuracy of 69.6% on the test set, compared with 63.8% for a strong, minimum phone error (MPE)-trained GMM-HMM baseline.


• Bing-Voice-Search Speech Recognition Task (Cont’d)

[Figure or table omitted. From: The paper]


• Other Large Vocabulary Tasks
  • Switchboard Speech Recognition Task (a corpus containing over 300 h of training data)
  • Google Voice Input Speech Recognition Task
  • YouTube Speech Recognition Task
  • English Broadcast News Speech Recognition Task


• Other Large Vocabulary Tasks (Cont’d)

[Figure or table omitted. From: The paper]


Discussion

• Convolutional DNNs for Phone Classification and Recognition
  • Although convolutional models along the temporal dimension achieved good classification results on the TIMIT corpus, applying them to phone recognition is not straightforward.
  • This is because temporal variations in speech can be partially handled by the dynamic programming procedure in the HMM component and hidden trajectory models.


• Speeding Up DNNs at Recognition Time
  • The time that a DNN-HMM system requires to recognize 1 s of speech can be reduced from 1.6 s to 210 ms, without decreasing recognition accuracy, by quantizing the weights down to 8 b and running on a CPU.
  • Alternatively, it can be reduced to 66 ms by using a graphics processing unit (GPU).
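The slides do not say which quantization scheme was used. As a generic illustration only, the sketch below applies symmetric linear quantization of a weight matrix to 8-bit integers; the function names and the random weights are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric linear quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4)).astype(np.float32)   # made-up weights
q, scale = quantize_int8(W)
max_err = np.max(np.abs(W - dequantize(q, scale)))
```

The rounding error per weight is at most half a quantization step, which is one intuition for why accuracy can survive the reduction in precision.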


• Alternative Pretraining Methods for DNNs
  • It is possible to learn a DNN by starting with a shallow neural net with a single hidden layer.
  • Once this net has been trained discriminatively, a second hidden layer is interposed between the first hidden layer and the softmax output units, and the whole network is again discriminatively trained.
  • This can be continued until the desired number of hidden layers is reached, after which full backpropagation fine-tuning is applied.


• Alternative Pretraining Methods for DNNs (Cont’d)
  • Purely discriminative training of the whole DNN from random initial weights works well, too.
  • Various types of autoencoder with one hidden layer can also be used in the layer-by-layer generative pretraining process.


• Alternative Fine-Tuning Methods for DNNs
  • Most DBN-DNN acoustic models are fine-tuned by applying stochastic gradient descent with momentum to small minibatches of training cases.
  • More sophisticated optimization methods can be used, but it is not clear that the more sophisticated methods are worthwhile since the fine-tuning process is typically stopped early to prevent overfitting.


• Using DBN-DNNs to Provide Input Features for GMM-HMM Systems
  • This class of methods uses neural networks to provide the feature vectors for the training process of the GMM in a GMM-HMM system.
  • The most common approach is to train a randomly initialized neural net with a narrow bottleneck middle layer and to use the activations of the bottleneck hidden units as features.


• Using DNNs to Estimate Articulatory Features for Detection-Based Speech Recognition
  • DBN-DNNs are effective for detecting subphonetic speech attributes (also known as phonological or articulatory features).


• Summary
  • Most of the gain comes from using DNNs to exploit information in neighboring frames and from modeling tied context-dependent states.
  • There is no reason to believe that the optimal types of hidden units or the optimal network architectures are used, and it is highly likely that both the pretraining and fine-tuning algorithms can be modified to reduce the amount of overfitting and the amount of computation.


Thank You!


Investigation of Speech Separation as a Front-End for Noise Robust Speech Recognition

Narayanan, Arun, and DeLiang Wang. "Investigation of speech separation as a front-end for noise robust speech recognition." Audio, Speech, and Language Processing, IEEE/ACM Transactions on 22.4 (2014): 826-835.

Presented by Peidong Wang, 04/04/2016


Content

• Introduction
• System Description
• Evaluation Results
• Discussion


Introduction

• Background
  • Although automatic speech recognition (ASR) systems have become fairly powerful, the inherent variability can still pose challenges.
  • Typically, ASR systems that work well in clean conditions suffer from a drastic loss of performance in the presence of noise.


• Feature-Based Methods
  • This class of methods focuses on feature extraction or feature normalization.
  • Feature-based techniques have the potential to generalize well, but do not always produce the best results.


• Two Groups of Feature-Based Methods
  • When stereo [*] data is unavailable, prior knowledge about speech and/or noise is used, as in spectral-reconstruction-based missing feature methods, direct masking methods, and feature enhancement methods.
  • When stereo data is available, feature mapping methods and recurrent neural networks have been used.

[*] By stereo we mean noisy and the corresponding clean signals.


• Model-Based Methods
  • The ASR model parameters are adapted to match the distribution of noisy or enhanced features.
  • Model-based methods work well when the underlying assumptions are met, but typically involve significant computational overhead.
  • The best performances are usually obtained by combining feature-based and model-based methods.


• Supervised Classification Based Speech Separation
  • Stereo training data is also used by supervised classification based speech separation algorithms.
  • Such algorithms typically estimate the ideal binary mask (IBM), a binary mask defined in the time-frequency (T-F) domain that identifies speech-dominant and noise-dominant T-F units.
  • The above method can be extended to the ideal ratio mask (IRM), which represents the ratio of speech to mixture energy.


System Description

• Block Diagram of the Proposed System

[Figure: block diagram of the proposed system. From: The paper]


• Addressing Additive Noise and Convolutional Distortion
  • The additive noise and the convolutional distortion are dealt with in two separate stages: noise removal followed by channel compensation.
  • Noise is removed via T-F masking using the IRM. To compensate for channel mismatch and the errors introduced by masking, the authors learn a non-linear mapping function that undoes these distortions.


• Time-Frequency Masking


• Time-Frequency Masking (Cont’d)
  • Here the authors perform T-F masking in the mel-frequency domain, unlike some of the other systems that operate in the gammatone feature domain.
  • To obtain the mel-spectrogram of a signal, it is first pre-emphasized and transformed to the linear frequency domain using a 320-channel fast Fourier transform (FFT). A 20 ms Hamming window is used. The 161-dimensional spectrogram is then converted to a 26-channel mel-spectrogram.
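The dimensions quoted above can be reproduced with a short NumPy sketch: a 320-point FFT of a real signal yields 320/2 + 1 = 161 frequency bins, which a 26-filter mel filterbank reduces to 26 channels. Several details are assumptions not stated in the slides: a 16 kHz sampling rate (so that 20 ms is 320 samples), a 10 ms hop, a pre-emphasis coefficient of 0.97, power spectra, and triangular filters spaced with the common 2595 log10(1 + f/700) mel formula.

```python
import numpy as np

SR, N_FFT, N_MELS = 16000, 320, 26   # SR is an assumption; 20 ms at 16 kHz = 320 samples
HOP, PREEMPH = 160, 0.97             # hypothetical hop size and pre-emphasis coefficient

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters whose edges are equally spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(SR / 2), N_MELS + 2))
    bin_hz = np.linspace(0.0, SR / 2, N_FFT // 2 + 1)       # 161 FFT bin frequencies
    fb = np.zeros((N_MELS, bin_hz.size))
    for m in range(N_MELS):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_spectrogram(signal):
    emphasized = np.append(signal[0], signal[1:] - PREEMPH * signal[:-1])
    n_frames = 1 + (len(emphasized) - N_FFT) // HOP
    window = np.hamming(N_FFT)
    frames = np.stack([emphasized[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2   # (frames, 161)
    return power, power @ mel_filterbank().T                     # (frames, 26)

signal = np.random.default_rng(0).normal(size=SR)   # 1 s of noise as a stand-in signal
spec, mel = mel_spectrogram(signal)
```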


• Time-Frequency Masking (Cont’d)
  • The authors use DNNs to estimate the IRM, as DNNs show good performance and training using stochastic gradient descent scales well compared to other nonlinear discriminative classifiers.


• Time-Frequency Masking (Cont’d)
  • Target Signal
    • The ideal ratio mask is defined as the ratio of the clean signal energy to the mixture energy at each time-frequency unit.
    • The mathematical expression is shown below.

$$\mathrm{IRM}(t, f) = \frac{10^{\mathrm{SNR}(t, f)/10}}{10^{\mathrm{SNR}(t, f)/10} + 1}, \qquad \mathrm{SNR}(t, f) = 10 \log_{10}\left(\frac{X(t, f)}{N(t, f)}\right)$$
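A small numeric check makes the definition concrete. The sketch assumes that X(t, f) and N(t, f) are the speech and noise energies in each T-F unit, which the slides do not state explicitly; under that reading the expression simplifies to X / (X + N). The energy values are made up.

```python
import numpy as np

def ideal_ratio_mask(speech_energy, noise_energy):
    snr_db = 10.0 * np.log10(speech_energy / noise_energy)
    ratio = 10.0 ** (snr_db / 10.0)
    return ratio / (ratio + 1.0)

X = np.array([[4.0, 1.0], [0.25, 9.0]])   # made-up speech energies per T-F unit
N = np.array([[1.0, 1.0], [1.0, 1.0]])    # made-up noise energies
irm = ideal_ratio_mask(X, N)
```

At 0 dB SNR (equal speech and noise energy) the mask value is 0.5, and it approaches 1 as speech dominates.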


• Time-Frequency Masking (Cont’d)
  • Target Signal
    • Rather than estimating the IRM directly, the authors estimate a transformed version of the SNR.
    • The mathematical expression of the sigmoidal transformation is shown below.

$$d(t, f) = \frac{1}{1 + \exp(-\alpha (\mathrm{SNR}(t, f) - \beta))}$$


• Time-Frequency Masking (Cont’d)
  • Target Signal
    • During testing, the values output from the DNN are mapped back to their corresponding IRM values.
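Mapping a DNN output back to an IRM value amounts to inverting the sigmoidal transform to recover the SNR in dB and then applying the IRM definition. The sketch below shows one way to do this; the values of `ALPHA` and `BETA` are hypothetical because the slides do not give them, and outputs of exactly 0 or 1 would need clipping, which is omitted.

```python
import numpy as np

ALPHA, BETA = 0.5, 0.0   # hypothetical values; the slides do not give alpha and beta

def snr_to_target(snr_db):
    # d(t, f) = 1 / (1 + exp(-alpha (SNR(t, f) - beta)))
    return 1.0 / (1.0 + np.exp(-ALPHA * (snr_db - BETA)))

def target_to_irm(d):
    # Invert the sigmoid to recover the SNR in dB, then apply the IRM definition.
    snr_db = BETA + np.log(d / (1.0 - d)) / ALPHA
    ratio = 10.0 ** (snr_db / 10.0)
    return ratio / (ratio + 1.0)

snr_db = np.array([-10.0, 0.0, 10.0])
d = snr_to_target(snr_db)
irm = target_to_irm(d)
```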


• Time-Frequency Masking (Cont’d)
  • Features
    • Feature extraction is performed both at the fullband and the subband level.
    • A combination of features is used: 31-dimensional MFCCs, 13-dimensional RASTA-filtered PLPs, and 15-dimensional amplitude modulation spectrogram (AMS) features.


• Time-Frequency Masking (Cont’d)
  • Features
    • The fullband features are derived by splicing together fullband MFCCs and RASTA-PLPs, along with their delta and acceleration components, and subband AMS features.
    • The subband features are derived by splicing together subband MFCCs, RASTA-PLPs, and AMS features. Some auxiliary components are also added.


• Time-Frequency Masking (Cont’d)
  • Supervised Learning
    • IRM estimation is performed in two stages. In the first stage, multiple DNNs are trained using fullband and subband features. The final estimate is obtained using an MLP that combines the output of the fullband and the subband DNNs.


• Time-Frequency Masking (Cont’d)
  • Supervised Learning
    • The fullband DNNs would be cognizant of the overall spectral shape of the IRM and the information conveyed by the fullband features, whereas the subband DNNs are expected to be more robust to noise occurring at frequencies outside their passband.


• Time-Frequency Masking (Cont’d)

[Figure omitted. From: The paper]


• Feature Mapping


• Feature Mapping (Cont’d)
  • Even after T-F masking, channel mismatch can still significantly impact performance.
  • This happens for two reasons. Firstly, the algorithm learns to estimate the ratio mask using mixtures of speech and noise recorded using a single microphone. Secondly, because channel mismatch is convolutional, speech and noise, which now includes both background noise and convolutive noise, are clearly not uncorrelated.


• Feature Mapping (Cont’d)
  • The goal of feature mapping in this work is to learn spectro-temporal correlations that exist in speech to undo the distortions introduced by unseen microphones and the first stage of the algorithm.


• Feature Mapping (Cont’d)
  • Target Signal
    • The target is the clean log-mel spectrogram (LMS). The “clean” LMS here corresponds to that obtained from the clean signals recorded using a single microphone in a single filter setting.


• Feature Mapping (Cont’d)
  • Target Signal
    • Instead of using the LMS directly as the target, the authors apply a linear transform to limit the target values to the range [0, 1], so that the sigmoidal transfer function can be used for the output layer of the DNN.
    • The mathematical expression is as follows.

$$X_d(t, f) = \frac{\ln(X(t, f)) - \min(\ln(X(\cdot, f)))}{\max(\ln(X(\cdot, f))) - \min(\ln(X(\cdot, f)))}$$


• Feature Mapping (Cont’d)
  • Target Signal
    • During testing, the output of the DNN is mapped back to the dynamic range of the utterances in the training set.
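The normalization of the log-mel targets and the mapping back at test time can be sketched as a per-channel min-max transform and its inverse. The sketch assumes that the minimum and maximum are taken per mel channel over all training frames, which is one reading of the slides; the function names and the random stand-in data are hypothetical.

```python
import numpy as np

def fit_range(log_mel_train):
    # Per-channel min and max of ln X(., f) over the training frames.
    return log_mel_train.min(axis=0), log_mel_train.max(axis=0)

def normalize(log_mel, lo, hi):
    # Linear transform of the log-mel values into [0, 1].
    return (log_mel - lo) / (hi - lo)

def denormalize(x_d, lo, hi):
    # Map DNN outputs in [0, 1] back to the training dynamic range.
    return x_d * (hi - lo) + lo

rng = np.random.default_rng(0)
log_mel_train = rng.normal(size=(200, 26))   # stand-in for clean training LMS frames
lo, hi = fit_range(log_mel_train)
targets = normalize(log_mel_train, lo, hi)
restored = denormalize(targets, lo, hi)
```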


• Feature Mapping (Cont’d)
  • Features
    • The authors use both the noisy and the masked LMS.
  • Supervised Learning
    • Unlike the DNNs used for IRM estimation, the hidden layers of the DNN for this task use rectified linear units (ReLUs). In addition, the output layer uses sigmoid activations.


System Description

• Feature Mapping (Cont’d)

From: The paper


System Description

• Acoustic Modeling


System Description

• Acoustic Modeling (Cont’d)
  • The acoustic models are trained using the Aurora-4 dataset.
  • Aurora-4 is a 5000-word closed-vocabulary recognition task based on the Wall Street Journal database. The corpus has two training sets, clean and multi-condition, both with 7138 utterances.


System Description

• Acoustic Modeling (Cont’d)
  • Gaussian Mixture Models
    • The HMMs and the GMMs are initially trained using the clean training set. The clean models are then used to initialize the multi-condition models; both clean and multi-condition models have the same structure and differ only in transition and observation probability densities.


System Description

• Acoustic Modeling (Cont’d)
  • Deep Neural Networks
    • The authors first align the clean training set to obtain senone labels at each time-frame for all utterances in the training set. DNNs are then trained to predict the posterior probability of senones using either clean features or features extracted from the multi-condition set.
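To make the training objective concrete, here is a minimal sketch of how frame-level senone posteriors and their cross-entropy against an aligned label could be computed. The softmax output layer is an assumption (the slide does not name the output nonlinearity), and the three-senone logits are toy values.

```python
import math

def softmax(logits):
    """Turn output-layer activations into a posterior distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_cross_entropy(logits, senone_label):
    """Negative log posterior of the aligned senone for one frame."""
    return -math.log(softmax(logits)[senone_label])

# Hypothetical activations for one frame over three senones.
posteriors = softmax([2.0, 1.0, 0.0])
loss = frame_cross_entropy([2.0, 1.0, 0.0], 0)
uniform_loss = frame_cross_entropy([0.0, 0.0, 0.0], 1)
```

Summing this per-frame loss over all aligned frames gives the quantity that training would minimize under these assumptions.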


System Description

• Diagonal Feature Discriminant Linear Regression


System Description

• Diagonal Feature Discriminant Linear Regression (Cont’d)
  • dFDLR is a semi-supervised feature adaptation technique.
  • The motivation for developing dFDLR is to address the problem of generalization to unseen microphone conditions in our dataset, which is where the DNN-HMM systems perform the worst.


System Description

• Diagonal Feature Discriminant Linear Regression (Cont’d)
  • To apply dFDLR, we first obtain an initial senone-level labeling for our test utterances using the unadapted models. Features are then transformed to minimize the cross-entropy error in predicting these labels.
  • The mathematical expressions are as follows.

$$\hat{O}_t(f) = w_f \cdot O_t(f) + b_f$$

$$\min \sum_t E\big(s_t, D_{out}(\hat{O}_{t-5} \ldots \hat{O}_{t+5})\big)$$
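Read this way, the first expression is a per-channel scale and offset applied to every frame, and the second feeds an 11-frame window (t-5 to t+5) of adapted features to the DNN. The sketch below illustrates both steps; the function names, the toy numbers, and the choice to repeat edge frames at utterance boundaries are assumptions.

```python
def dfdlr_transform(frames, w, b):
    """Apply a per-channel scale w[f] and offset b[f] to every frame."""
    return [[w[f] * o + b[f] for f, o in enumerate(frame)] for frame in frames]

def splice(frames, t, context=5):
    """Concatenate frames t-context .. t+context into one input vector.

    Assumption: frames beyond the utterance boundaries are replaced by
    the first or last frame.
    """
    last = len(frames) - 1
    window = []
    for k in range(t - context, t + context + 1):
        window.extend(frames[min(max(k, 0), last)])
    return window

# Hypothetical two-frame, two-channel utterance and transform parameters.
frames = [[1.0, 2.0], [3.0, 4.0]]
adapted = dfdlr_transform(frames, w=[2.0, 0.5], b=[1.0, 0.0])
dnn_input = splice(adapted, t=0)              # 11 frames of 2 channels each
small_window = splice(adapted, t=0, context=1)
```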


System Description

• Diagonal Feature Discriminant Linear Regression (Cont’d)
  • The parameters can easily be learned within the DNN framework by adding a layer between the input layer and the first hidden layer of the original DNN.
  • After initialization, the standard backpropagation algorithm is run for 10 epochs to learn the parameters of the dFDLR model.
  • During backpropagation, weights of the original hidden layers are kept unchanged and only the parameters in the dFDLR are updated.
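The idea of updating only the added layer can be sketched as follows. To keep the example short, a frozen linear softmax classifier (`V`, `c`) stands in for the original DNN, the context window is omitted, and the identity initialization, learning rate, frames, and labels are all assumptions made for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def logits_for(o, w, b, V, c):
    """Adapt one frame with the diagonal layer, then apply the frozen model."""
    adapted = [w[f] * o[f] + b[f] for f in range(len(o))]
    return [sum(V[k][f] * adapted[f] for f in range(len(o))) + c[k]
            for k in range(len(V))]

def total_loss(frames, labels, w, b, V, c):
    return sum(-math.log(softmax(logits_for(o, w, b, V, c))[s])
               for o, s in zip(frames, labels))

def adapt(frames, labels, V, c, epochs=10, lr=0.02):
    """Full-batch gradient descent on w and b only; V and c are never changed."""
    dim = len(frames[0])
    w, b = [1.0] * dim, [0.0] * dim  # assumed identity initialization
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, [0.0] * dim
        for o, s in zip(frames, labels):
            p = softmax(logits_for(o, w, b, V, c))
            delta = [p[k] - (1.0 if k == s else 0.0) for k in range(len(V))]
            for f in range(dim):
                # Error signal backpropagated to the adapted feature.
                g = sum(V[k][f] * delta[k] for k in range(len(V)))
                grad_w[f] += g * o[f]
                grad_b[f] += g
        w = [w[f] - lr * grad_w[f] for f in range(dim)]
        b = [b[f] - lr * grad_b[f] for f in range(dim)]
    return w, b

# Hypothetical frozen classifier and mismatched test frames with initial labels.
V, c = [[2.0, -2.0], [-2.0, 2.0]], [0.0, 0.0]
frames = [[1.5, 0.2], [0.6, 1.0], [1.4, 0.1], [0.7, 0.9]]
labels = [0, 1, 0, 1]
loss_before = total_loss(frames, labels, [1.0, 1.0], [0.0, 0.0], V, c)
w, b = adapt(frames, labels, V, c)
loss_after = total_loss(frames, labels, w, b, V, c)
```

Only four numbers (two scales, two offsets) are learned here, which is why a diagonal transform can be estimated from a small amount of test data.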


Content

• Introduction
• System Description
• Evaluation Results
• Discussion


Evaluation Results

From: The paper


Evaluation Results

From: The paper


Content

• Introduction
• System Description
• Evaluation Results
• Discussion


Discussion

• Several interesting observations can be made from the results presented in the previous section.
  • Firstly, the results clearly show that the speech separation front-end is doing a good job at removing noise and handling channel mismatch.
  • Secondly, with no channel mismatch, T-F masking alone worked well in removing noise.


Discussion

  • Finally, directly performing feature mapping from noisy features to clean features performs reasonably, but it does not perform as well as the proposed front-end.


Thank You!
