Deep Neural Networks for Acoustic Modeling in Speech Recognition
Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
Presented by Peidong Wang, 04/04/2016
1
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
2
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
3
Speech Recognition System
Goal: Converting speech to text
A Mathematical Perspective:
$$\hat{w} = \arg\max_{w} P(w \mid Y)$$
or
$$\hat{w} = \arg\max_{w} P(Y \mid w)\,P(w)$$
4
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
5
GMM-HMM Model
GMM and HMM: GMM is short for Gaussian Mixture Model, and HMM is short for Hidden Markov Model.
Predecessor of DNNs: Before Deep Neural Networks (DNNs), the most commonly used speech recognition systems consisted of GMMs and HMMs.
6
GMM-HMM Model
HMM: The HMM is used to deal with the temporal variability of speech.
GMM: The GMM is used to represent the relationship between HMM states and the acoustic input.
7
GMM-HMM Model
Features: The features are typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) computed from the raw waveform with their first- and second-order temporal differences.
8
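A minimal sketch of this kind of MFCC-plus-deltas feature extraction, assuming librosa is available; the sampling rate and number of coefficients are illustrative choices, not taken from the slides.

```python
# Sketch: MFCCs with first- and second-order temporal differences (deltas).
import numpy as np
import librosa

def mfcc_with_deltas(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)           # waveform at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta1 = librosa.feature.delta(mfcc, order=1)      # first-order differences
    delta2 = librosa.feature.delta(mfcc, order=2)      # second-order differences
    return np.concatenate([mfcc, delta1, delta2], axis=0)   # (3 * n_mfcc, frames)
```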
GMM-HMM Model
Shortcoming: GMM-HMM models are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space, for example, the set of points that lie very close to the surface of a sphere.
9
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
10
Training Deep Neural Networks
Deep Neural Network (DNN): A DNN is a feed-forward artificial neural network that has more than one layer of hidden units between its inputs and its outputs. With nonlinear activation functions, a DNN is able to model an arbitrary nonlinear function (projection from inputs to outputs). [*]
[*] Added by the presenter.
11
Training Deep Neural Networks
Activation Function of the Output Units: The activation function of the output units is the softmax function. The mathematical expression is as follows:
$$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}$$
12
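A minimal numpy sketch of the softmax defined above; subtracting the maximum is only for numerical stability and does not change the result.

```python
import numpy as np

def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k)."""
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow in exp
    return e / np.sum(e)
```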
Training Deep Neural Networks
Objective Function: When using the softmax output function, the natural objective (cost) function C is the cross-entropy between the target probabilities d and the outputs of the softmax, p. The mathematical expression is as follows:
$$C = -\sum_j d_j \log p_j$$
13
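A matching sketch of the cross-entropy objective between targets d and softmax outputs p; the small epsilon is an implementation detail to avoid log(0).

```python
import numpy as np

def cross_entropy(d, p, eps=1e-12):
    """C = -sum_j d_j * log(p_j)."""
    return -np.sum(d * np.log(p + eps))
```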
Training Deep Neural Networks
Weight Penalties and Early Stopping: To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse.
14
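A one-line sketch of how a squared-magnitude (L2) weight penalty enters a gradient-descent update: the penalty 0.5·λ‖W‖² contributes λW to the gradient. Early stopping is simply halting training once held-out loss stops improving.

```python
def sgd_step_with_weight_decay(W, grad, lr=0.1, lam=1e-4):
    # gradient of the squared-magnitude penalty is lam * W
    return W - lr * (grad + lam * W)
```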
Training Deep Neural Networks
Overfitting Reduction: Generally speaking, there are three methods. Weight penalties and early stopping can reduce overfitting, but only by removing much of the modeling power. Very large training sets can reduce overfitting, but only by making training very computationally expensive. The third method is generative pretraining.
15
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
16
Generative Pretraining
Purpose: The multiple layers of feature detectors (the result of this step) can be used as a good starting point for a discriminative fine-tuning phase, during which backpropagation through the DNN slightly adjusts the weights and improves performance. In addition, this step can significantly reduce overfitting.
17
Generative Pretraining
Restricted Boltzmann Machine (RBM): An RBM consists of a layer of stochastic binary visible units that represent binary input data, connected to a layer of stochastic binary hidden (latent) units that learn to model significant nonindependencies between the visible units. There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections.
18
Generative Pretraining
Restricted Boltzmann Machine (RBM) (Cont'd): The framework of an RBM is shown below.
From: Slides in CSE 5526 Neural Networks
19
Generative Pretraining
Restricted Boltzmann Machine (RBM) (Cont'd): An RBM uses a single set of parameters, W, to define the joint probability of a vector of values of the observable variables, v, and a vector of values of the latent variables, h, via an energy function, E:
$$p(v, h; W) = \frac{1}{Z} e^{-E(v, h; W)}, \qquad Z = \sum_{v', h'} e^{-E(v', h'; W)}$$
$$E(v, h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i, j} v_i h_j w_{ij}$$
20
Generative Pretraining
Restricted Boltzmann Machine (RBM) (Cont'd): The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors:
$$p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$
The derivative of the log probability of a training set with respect to a weight is surprisingly simple (the angle brackets denote expectations under the corresponding distribution):
$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log p(v_n)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$
21
Generative Pretraining
Restricted Boltzmann Machine (RBM) (Cont'd): The learning rule is thus as follows:
$$\Delta w_{ij} = \epsilon \left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\right)$$
A better learning procedure is contrastive divergence (CD), which is shown below. The subscript "recon" denotes a step in CD in which the states of the visible units are assigned 0 or 1 according to the current states of the hidden units:
$$\Delta w_{ij} = \epsilon \left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}\right)$$
22
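An illustrative CD-1 update for a binary-binary RBM, following the rule above; the variable names and the use of hidden probabilities in the correlation statistics are my choices for a compact sketch, not the paper's exact recipe.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, eps=0.01, rng=None):
    """One contrastive-divergence (CD-1) step on a minibatch v0 of shape (n, visible)."""
    rng = rng or np.random.default_rng()
    h0_prob = sigmoid(v0 @ W + b)                          # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T + a)                        # reconstruction of visibles
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    h1_prob = sigmoid(v1 @ W + b)
    n = v0.shape[0]
    W += eps * (v0.T @ h0_prob - v1.T @ h1_prob) / n       # <v h>_data - <v h>_recon
    a += eps * np.mean(v0 - v1, axis=0)                    # visible biases
    b += eps * np.mean(h0_prob - h1_prob, axis=0)          # hidden biases
    return W, a, b
```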
Generative Pretraining
Modeling Real-Valued Data: Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise, and the RBM energy function can be modified to accommodate such variables, giving a Gaussian-Bernoulli RBM (GRBM):
$$E(v, h) = \sum_{i \in \text{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i, j} \frac{v_i}{\sigma_i} h_j w_{ij}$$
23
Generative Pretraining
Stacking RBMs to Make a Deep Belief Network: After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM. This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data.
24
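A hedged sketch of the greedy layer-by-layer stacking described above, reusing the cd1_step and sigmoid helpers from the earlier RBM sketch; layer sizes, epochs, and learning rate are illustrative.

```python
import numpy as np

def pretrain_stack(data, layer_sizes, epochs=10, eps=0.01):
    """Greedy pretraining: each RBM's hidden activations become the next RBM's data."""
    rng = np.random.default_rng(0)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hidden))
        a, b = np.zeros(x.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_step(x, W, a, b, eps, rng)   # CD-1 sketch from the previous slide
        layers.append((W, b))
        x = sigmoid(x @ W + b)                         # inferred hidden states as new "data"
    return layers
```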
Generative Pretraining
Stacking RBMs to Make a Deep Belief Network (Cont'd)
From: The paper
25
Generative Pretraining
Interfacing a DNN with an HMM: In an HMM framework, the hidden variables denote the states of the phone sequence, and the visible variables denote the feature vectors. [*]
[*] Added by the presenter.
From: Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends in Signal Processing 1.3 (2008): 195-304.
26
Generative Pretraining
Interfacing a DNN with an HMM (Cont'd): To compute a Viterbi alignment or to run the forward-backward algorithm within the HMM framework, we require the likelihood p(AcousticInput | HMM state). A DNN, however, outputs probabilities of the form p(HMM state | AcousticInput).
27
Generative Pretraining
Interfacing a DNN with an HMM (Cont'd): The posterior probabilities that the DNN outputs can be converted into scaled likelihoods by dividing them by the frequencies of the HMM states in the forced alignment that is used for fine-tuning the DNN. Forced alignment is a procedure used to generate labels for the training process. [*]
[*] Added by the presenter.
28
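A small sketch of this conversion: dividing posteriors by state priors (state frequencies counted from the forced alignment), done here in the log domain as is common in practice.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """log p(x | state) up to a constant: log posterior minus log prior."""
    priors = state_counts / np.sum(state_counts)     # frequencies from the alignment
    return log_posteriors - np.log(priors + floor)
```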
Generative Pretraining
Interfacing a DNN with an HMM (Cont'd): All of the likelihoods produced in this way are scaled by the same unknown factor of p(AcousticInput). Although this appears to have little effect on some recognition tasks, it can be important for tasks where training labels are highly unbalanced.
29
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
30
Experiments
Phonetic Classification and Recognition on TIMIT: The TIMIT dataset is a relatively small dataset which provides a simple and convenient way of testing new approaches to speech recognition.
31
Experiments
Phonetic Classification and Recognition on TIMIT (Cont'd)
From: The paper
32
Experiments
Bing-Voice-Search Speech Recognition Task: This task used 24 h of training data with a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, etc. The best DNN-HMM acoustic model achieved a sentence accuracy of 69.6% on the test set, compared with 63.8% for a strong, minimum phone error (MPE)-trained GMM-HMM baseline.
33
Experiments
Bing-Voice-Search Speech Recognition Task (Cont'd)
From: The paper
34
Experiments
Other Large Vocabulary Tasks: Switchboard Speech Recognition Task (a corpus containing over 300 h of training data), Google Voice Input Speech Recognition Task, YouTube Speech Recognition Task, and English Broadcast News Speech Recognition Task.
35
Experiments
Other Large Vocabulary Tasks (Cont'd)
From: The paper
36
Content
Speech Recognition System, GMM-HMM Model, Training Deep Neural Networks, Generative Pretraining, Experiments, Discussion
37
Discussion
Convolutional DNNs for Phone Classification and Recognition: Although convolutional models along the temporal dimension achieved good classification results on the TIMIT corpus, applying them to phone recognition is not straightforward. This is because temporal variations in speech can be partially handled by the dynamic programming procedure in the HMM component and by hidden trajectory models.
38
Discussion
Speeding Up DNNs at Recognition Time: The time that a DNN-HMM system requires to recognize 1 s of speech can be reduced from 1.6 s to 210 ms, without decreasing recognition accuracy, by quantizing the weights down to 8 bits on a CPU. Alternatively, it can be reduced to 66 ms by using a graphics processing unit (GPU).
39
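A toy sketch of linear 8-bit weight quantization with a per-matrix scale; the slide does not spell out the scheme used, so this is only meant to illustrate the idea.

```python
import numpy as np

def quantize_int8(W):
    scale = max(np.max(np.abs(W)) / 127.0, 1e-12)    # per-matrix scale factor
    return np.round(W / scale).astype(np.int8), scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale
```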
Discussion
Alternative Pretraining Methods for DNNs: It is possible to learn a DNN by starting with a shallow neural net with a single hidden layer. Once this net has been trained discriminatively, a second hidden layer is interposed between the first hidden layer and the softmax output units, and the whole network is again discriminatively trained. This can be continued until the desired number of hidden layers is reached, after which full backpropagation fine-tuning is applied.
40
Discussion
Alternative Pretraining Methods for DNNs (Cont'd): Purely discriminative training of the whole DNN from random initial weights works well, too. Various types of autoencoder with one hidden layer can also be used in the layer-by-layer generative pretraining process.
41
Discussion
Alternative Fine-Tuning Methods for DNNs: Most DBN-DNN acoustic models are fine-tuned by applying stochastic gradient descent with momentum to small minibatches of training cases. More sophisticated optimization methods can be used, but it is not clear that they are worthwhile, since the fine-tuning process is typically stopped early to prevent overfitting.
42
Discussion
Using DBN-DNNs to Provide Input Features for GMM-HMM Systems: This class of methods uses neural networks to provide the feature vectors for training the GMM in a GMM-HMM system. The most common approach is to train a randomly initialized neural net with a narrow bottleneck middle layer and to use the activations of the bottleneck hidden units as features.
43
Discussion
Using DNNs to Estimate Articulatory Features for Detection-Based Speech Recognition: DBN-DNNs are effective for detecting subphonetic speech attributes (also known as phonological or articulatory features).
44
Discussion
Summary: Most of the gain comes from using DNNs to exploit information in neighboring frames and from modeling tied context-dependent states. There is no reason to believe that the optimal types of hidden units or the optimal network architectures are used, and it is highly likely that both the pretraining and fine-tuning algorithms can be modified to reduce the amount of overfitting and the amount of computation.
45
Thank You
46
Investigation of Speech Separation as a Front-End for Noise Robust Speech Recognition
Narayanan, Arun, and DeLiang Wang. "Investigation of speech separation as a front-end for noise robust speech recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.4 (2014): 826-835.
Presented by Peidong Wang, 04/04/2016
47
Content
Introduction, System Description, Evaluation Results, Discussion
48
Content
Introduction, System Description, Evaluation Results, Discussion
49
Introduction
Background: Although automatic speech recognition (ASR) systems have become fairly powerful, the inherent variability can still pose challenges. Typically, ASR systems that work well in clean conditions suffer from a drastic loss of performance in the presence of noise.
50
Introduction
Feature-Based Methods: This class of methods focuses on feature extraction or feature normalization. Feature-based techniques have the potential to generalize well, but do not always produce the best results.
51
Introduction
Two Groups of Feature-Based Methods: When stereo [*] data is unavailable, prior knowledge about speech and/or noise is used, such as in spectral-reconstruction-based missing-feature methods, direct masking methods, and feature enhancement methods. When stereo data is available, feature mapping methods and recurrent neural networks have been used.
[*] By stereo we mean noisy signals and the corresponding clean signals.
52
Introduction
Model-Based Methods: The ASR model parameters are adapted to match the distribution of noisy or enhanced features. Model-based methods work well when the underlying assumptions are met, but typically involve significant computational overhead. The best performances are usually obtained by combining feature-based and model-based methods.
53
Introduction
Supervised Classification Based Speech Separation: Stereo training data is also used by supervised classification based speech separation algorithms. Such algorithms typically estimate the ideal binary mask (IBM), a binary mask defined in the time-frequency (T-F) domain that identifies speech-dominant and noise-dominant T-F units. The above method can be extended to the ideal ratio mask (IRM), which represents the ratio of speech to mixture energy.
54
Content
Introduction, System Description, Evaluation Results, Discussion
55
System Description
Block Diagram of the Proposed System
From: The paper
56
System Description
Addressing Additive Noise and Convolutional Distortion: The additive noise and the convolutional distortion are dealt with in two separate stages: noise removal followed by channel compensation. Noise is removed via T-F masking using the IRM. To compensate for channel mismatch and the errors introduced by masking, we learn a nonlinear mapping function that undoes these distortions.
57
System Description
Time-Frequency Masking
58
System Description
Time-Frequency Masking (Cont'd): Here the authors perform T-F masking in the mel-frequency domain, unlike some of the other systems that operate in the gammatone feature domain. To obtain the mel-spectrogram of a signal, it is first pre-emphasized and transformed to the linear frequency domain using a 320-channel fast Fourier transform (FFT); a 20 ms Hamming window is used. The 161-dimensional spectrogram is then converted to a 26-channel mel-spectrogram.
59
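A sketch of this mel-spectrogram front end using librosa: pre-emphasis, a 320-point FFT (20 ms at 16 kHz, giving 161 frequency bins) with a Hamming window, and 26 mel channels. The hop length and pre-emphasis coefficient are my assumptions, not taken from the slide.

```python
import numpy as np
import librosa

def mel_spectrogram(y, sr=16000):
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])             # pre-emphasis (assumed 0.97)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=320, win_length=320, hop_length=160,
        window="hamming", n_mels=26, power=2.0)
    return S                                                # (26, frames) mel energies
```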
System Description
Time-Frequency Masking (Cont'd): The authors use DNNs to estimate the IRM, as DNNs show good performance and training using stochastic gradient descent scales well compared to other nonlinear discriminative classifiers.
60
System Description
Time-Frequency Masking (Cont'd): Target Signal: The ideal ratio mask is defined as the ratio of the clean signal energy to the mixture energy at each time-frequency unit. The mathematical expression is shown below:
$$\mathrm{IRM}(t, f) = \frac{10^{\mathrm{SNR}(t, f)/10}}{10^{\mathrm{SNR}(t, f)/10} + 1}, \qquad \mathrm{SNR}(t, f) = 10 \log_{10}\!\left(X(t, f) / N(t, f)\right)$$
61
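A direct implementation sketch of the IRM equations above, given clean speech energy X and noise energy N per T-F unit (available from the stereo training data); note that the expression reduces to X / (X + N).

```python
import numpy as np

def ideal_ratio_mask(X, N, eps=1e-12):
    snr_db = 10.0 * np.log10((X + eps) / (N + eps))
    lin = 10.0 ** (snr_db / 10.0)
    return lin / (lin + 1.0)        # equivalent to X / (X + N)
```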
System Description
Time-Frequency Masking (Cont'd): Target Signal: Rather than estimating the IRM directly, the authors estimate a transformed version of the SNR. The mathematical expression of the sigmoidal transformation is shown below (α and β control the slope and center of the sigmoid):
$$d(t, f) = \frac{1}{1 + \exp\!\left(-\alpha\,(\mathrm{SNR}(t, f) - \beta)\right)}$$
62
System Description
Time-Frequency Masking (Cont'd): Target Signal: During testing, the values output from the DNN are mapped back to their corresponding IRM values.
63
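A sketch of the sigmoidal training target and the test-time inversion back to IRM values described on the last two slides; alpha and beta are assumed hyperparameters for the slope and center of the sigmoid.

```python
import numpy as np

def snr_to_target(snr_db, alpha=1.0, beta=0.0):
    return 1.0 / (1.0 + np.exp(-alpha * (snr_db - beta)))     # d(t, f)

def target_to_irm(d, alpha=1.0, beta=0.0, eps=1e-7):
    d = np.clip(d, eps, 1.0 - eps)
    snr_db = beta + np.log(d / (1.0 - d)) / alpha             # invert the sigmoid
    lin = 10.0 ** (snr_db / 10.0)
    return lin / (lin + 1.0)                                  # back to IRM values
```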
System Description
Time-Frequency Masking (Cont'd): Features: Feature extraction is performed both at the full-band and the subband level. A combination of features is used: 31-dimensional MFCCs, 13-dimensional RASTA-filtered PLPs, and 15-dimensional amplitude modulation spectrogram (AMS) features.
64
System Description
Time-Frequency Masking (Cont'd): Features: The full-band features are derived by splicing together full-band MFCCs and RASTA-PLPs, along with their delta and acceleration components, and subband AMS features. The subband features are derived by splicing together subband MFCCs, RASTA-PLPs, and AMS features. Some auxiliary components are also added.
65
System Description
Time-Frequency Masking (Cont'd): Supervised Learning: IRM estimation is performed in two stages. In the first stage, multiple DNNs are trained using full-band and subband features. The final estimate is obtained using an MLP that combines the outputs of the full-band and the subband DNNs.
66
System Description
Time-Frequency Masking (Cont'd): Supervised Learning: The full-band DNNs would be cognizant of the overall spectral shape of the IRM and the information conveyed by the full-band features, whereas the subband DNNs are expected to be more robust to noise occurring at frequencies outside their passband.
67
System Description
Time-Frequency Masking (Cont'd)
From: The paper
68
System Description
Feature Mapping
69
System Description
Feature Mapping (Cont'd): Even after T-F masking, channel mismatch can still significantly impact performance. This happens for two reasons. Firstly, the algorithm learns to estimate the ratio mask using mixtures of speech and noise recorded using a single microphone. Secondly, because channel mismatch is convolutional, speech and noise, which now includes both background noise and convolutive noise, are clearly not uncorrelated.
70
System Description
Feature Mapping (Cont'd): The goal of feature mapping in this work is to learn spectro-temporal correlations that exist in speech to undo the distortions introduced by unseen microphones and by the first stage of the algorithm.
71
System Description
Feature Mapping (Cont'd): Target Signal: The target is the clean log-mel spectrogram (LMS). The clean LMS here corresponds to those obtained from the clean signals recorded using a single microphone in a single filter setting.
72
System Description
Feature Mapping (Cont'd): Target Signal: Instead of using the LMS directly as the target, the authors apply a linear transform to limit the target values to the range [0, 1], so that the sigmoidal transfer function can be used for the output layer of the DNN. The mathematical expression is as follows:
$$X_d(t, f) = \frac{\ln(X(t, f)) - \min(\ln(X(\cdot, f)))}{\max(\ln(X(\cdot, f))) - \min(\ln(X(\cdot, f)))}$$
73
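A sketch of the min-max normalization above, applied per frequency channel of the clean log-mel spectrogram; in the paper the minima and maxima would be taken over the training set, here they are taken over the given matrix for illustration.

```python
import numpy as np

def normalize_lms(X, eps=1e-12):
    """X: (channels, frames) clean mel energies; returns values scaled to [0, 1]."""
    log_X = np.log(X + eps)
    lo = log_X.min(axis=1, keepdims=True)    # min over time, per channel f
    hi = log_X.max(axis=1, keepdims=True)
    return (log_X - lo) / (hi - lo + eps)
```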
System Description
Feature Mapping (Cont'd): Target Signal: During testing, the output of the DNN is mapped back to the dynamic range of the utterances in the training set.
74
System Description
Feature Mapping (Cont'd): Features: The authors use both the noisy and the masked LMS.
Supervised Learning: Unlike the DNNs used for IRM estimation, the hidden layers of the DNN for this task use rectified linear units (ReLUs). In addition, the output layer uses sigmoid activations.
75
System Description
Feature Mapping (Cont'd)
From: The paper
76
System Description
Acoustic Modeling
77
System Description
Acoustic Modeling (Cont'd): The acoustic models are trained using the Aurora-4 dataset. Aurora-4 is a 5000-word closed vocabulary recognition task based on the Wall Street Journal database. The corpus has two training sets, clean and multi-condition, both with 7138 utterances.
78
System Description
Acoustic Modeling (Cont'd): Gaussian Mixture Models: The HMMs and the GMMs are initially trained using the clean training set. The clean models are then used to initialize the multi-condition models; both clean and multi-condition models have the same structure and differ only in transition and observation probability densities.
79
System Description
Acoustic Modeling (Cont'd): Deep Neural Networks: The authors first align the clean training set to obtain senone labels at each time frame for all utterances in the training set. DNNs are then trained to predict the posterior probability of senones using either clean features or features extracted from the multi-condition set.
80
System Description
Diagonal Feature Discriminant Linear Regression
81
System Description
Diagonal Feature Discriminant Linear Regression (Cont'd): dFDLR is a semi-supervised feature adaptation technique. The motivation for developing dFDLR is to address the problem of generalization to unseen microphone conditions in the dataset, which is where the DNN-HMM systems perform the worst.
82
System Description
Diagonal Feature Discriminant Linear Regression (Cont'd): To apply dFDLR, we first obtain an initial senone-level labeling for the test utterances using the unadapted models. Features are then transformed to minimize the cross-entropy error in predicting these labels. The mathematical expressions are as follows:
$$\hat{O}_t(f) = w_f \cdot O_t(f) + b_f$$
$$\min \sum_t E\!\left(s_t,\, D_{\text{out}}\!\left(\hat{O}_{t-5}, \ldots, \hat{O}_{t+5}\right)\right)$$
83
System Description
Diagonal Feature Discriminant Linear Regression (Cont'd): The parameters can easily be learned within the DNN framework by adding a layer between the input layer and the first hidden layer of the original DNN. After initialization, the standard backpropagation algorithm is run for 10 epochs to learn the parameters of the dFDLR model. During backpropagation, the weights of the original hidden layers are kept unchanged and only the parameters of the dFDLR layer are updated.
84
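A sketch of the dFDLR transform itself: a per-dimension scale and bias applied to every frame, initialized to the identity. In adaptation these two vectors would be the only parameters updated by backpropagating the cross-entropy error through the frozen DNN; the names here are illustrative.

```python
import numpy as np

def init_dfdlr(dims):
    return np.ones(dims), np.zeros(dims)        # identity initialization: w = 1, b = 0

def dfdlr_transform(O, w, b):
    """O: (frames, dims) spliced features; w, b: (dims,) per-dimension scale and bias."""
    return O * w + b
```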
Content
Introduction, System Description, Evaluation Results, Discussion
85
Evaluation Results
From: The paper
86
Evaluation Results
From: The paper
87
Content
Introduction, System Description, Evaluation Results, Discussion
88
Discussion
Several interesting observations can be made from the results presented in the previous section. Firstly, the results clearly show that the speech separation front-end is doing a good job at removing noise and handling channel mismatch. Secondly, with no channel mismatch, T-F masking alone worked well in removing noise.
89
Discussion
Finally, directly performing feature mapping from noisy features to clean features performs reasonably, but it does not perform as well as the proposed front-end.
90
Thank You
91