ICASSP review - Yu Zhang
people.csail.mit.edu/mitra/meetings/2017-April18-YuZhang.pdf (2017-04-18)

Page 1:

ICASSP review
Yu Zhang

Page 2:

• Advances in All-Neural Speech Recognition

• New symbol inventory

• “yes he has one” -> “YesHeHasOne” (see the sketch after this list)

• Position-dependent “space”

• Symbol-to-symbol mapping network

• Noisy input from CTC -> ground truth
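A minimal sketch of the new symbol inventory as illustrated by the example above (only an illustration of the slide's example, not the paper's exact inventory): spaces are dropped and each word-initial character becomes an upper-case variant, acting as a position-dependent "space" symbol.

```python
def encode(transcript: str) -> str:
    # "yes he has one" -> "YesHeHasOne": upper case marks a word-initial symbol.
    words = transcript.strip().lower().split()
    return "".join(w[0].upper() + w[1:] for w in words)

def decode(symbols: str) -> str:
    out = []
    for ch in symbols:
        if ch.isupper() and out:
            out.append(" ")        # a word-initial symbol implies a preceding space
        out.append(ch.lower())
    return "".join(out)

assert encode("yes he has one") == "YesHeHasOne"
assert decode("YesHeHasOne") == "yes he has one"
```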

Page 3:

• Really nice results: 14% WER on SWB (<10% for hybrid systems).

• Word-piece model vs. characters

• Fixed representation

• Can we learn it from the data?

• Context-dependent vs context-independent

• Symbol-to-symbol network

• Less burden on the decoder side

• Can we learn it jointly?

Page 4:

• Residual Memory Networks: a feed-forward approach to learning long temporal dependencies

• Adds multi-time-step delays to the residual connections

A combination of these two components allows RMN to learn long-term dependencies and higher-level abstractions simultaneously in a much simpler and more efficient way. A bi-directional RMN (BRMN) is also formulated in this work, which is a simple extension to RMN obtained by adding an extra connection with shared weights for learning future information. The computational complexity of BRMN is lower than that of BLSTM or bi-directional RNNs, as detailed in Section 2.1.

In Section 3, we describe the AMI corpus and the baseline model configurations used in our experiments. A detailed explanation of how to build the proposed RMN and BRMN models is given in Sections 3.3 and 3.4. An empirical evaluation is conducted in Section 4 to validate the structure of RMN for speech recognition tasks. A comparison of RMN with the best LVCSR systems in the literature is listed in Section 4.4, followed by the conclusion and future work.

2. Residual memory networks

RMN is composed of memory layers and residual connections as shown in Figure 2. The residual connection connects the previous output to the current input by skipping a few layers. Each memory layer contains two weight transforms: the first affine transform $W_l$ learns the current time step and is different for each layer $l = 1, 2, \ldots, L$ as in standard DNNs. The second weight transform $W_s$ is shared across all layers and learns past information by varying the delay in decreasing order. For example, in Figure 2, $W_s$ receives the $(t-T)$-th frame in the first layer, the second layer receives the $(t-(T-1))$-th frame, and the delay keeps decreasing as we proceed to higher layers. In RMN, $T$ is fixed based on the number of layers; for instance, an 18-layer network captures 18 time steps. In this network, the ReLU activation is used after each memory layer as it is efficient for training deeper networks [11, 10]. Thus, RMN can be represented as a variant of a deep feed-forward neural network which harnesses the important characteristics of unfolded RNNs and residual networks.

2.0.1. Forward propagation

Figure 2 shows the series of computations done in the RMN architecture, where the input $x(t)$, $t = 1, 2, \ldots, T$, at time instant $t$ is processed using the $W_l$ matrix in layer $l$ to get $h_l(t)$. The shared weight $W_s$ receives $h_l(t-m)$, obtained by delaying $h_l(t)$ by $m$ time steps. The feed-forward output after each memory layer is

$$y_l(t) = \sigma\big(x(t)\,W_l + h_l(t-m)\,W_s\big), \qquad l = 1, 2, \ldots, L \qquad (1)$$

where $h_l(t) = x(t)\,W_l$ and $\sigma$ is the ReLU activation output.
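A minimal sketch of the per-layer computation in Eq. (1) (PyTorch; the input projection, layer sizes and the decreasing delay schedule are illustrative assumptions, and the skip connections of Figure 2 are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMemoryNet(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_layers=18):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, hidden_dim)   # map features to the layer width (assumption)
        # W_l: one affine transform per memory layer.
        self.W = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)])
        # W_s: a single transform shared by all memory layers.
        self.W_s = nn.Linear(hidden_dim, hidden_dim)
        self.num_layers = num_layers

    def forward(self, x):                                # x: (batch, T, feat_dim)
        h = torch.relu(self.in_proj(x))
        T = h.size(1)
        for l, W_l in enumerate(self.W):
            delay = self.num_layers - l                  # delay decreases with depth: T, T-1, ..., 1
            h_cur = W_l(h)                               # h_l(t) = x(t) W_l
            h_del = F.pad(h_cur, (0, 0, delay, 0))[:, :T, :]  # h_l(t - m), zeros before t = 1
            h = torch.relu(h_cur + self.W_s(h_del))      # Eq. (1)
        return h

model = ResidualMemoryNet(feat_dim=40, hidden_dim=512)
out = model(torch.randn(4, 100, 40))                     # -> (4, 100, 512)
```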

2.0.2. Backward propagation

Backpropagation for computing the parameter $W_l$ is done in the same way as in standard DNNs. The gradient of the shared parameter $W_s$ is computed by taking into account error gradients from all $T$ time instants, which is exactly equal to the $L$ memory layers. The error derivative w.r.t. $W_s$ is

$$\frac{\partial E(t)}{\partial W_s} = \sum_{k=1}^{T} \frac{\partial E(t)}{\partial z(t)} \, \frac{\partial z(t)}{\partial h_l(t)} \, \frac{\partial h_l(t)}{\partial h_k(k)} \, \frac{\partial h_k(k)}{\partial W_s} \qquad (2)$$

where $z(t)$ is the softmax output, $\hat{z}(t)$ is the target label, and $E(\cdot)$ denotes the cross-entropy loss function.
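In an autograd framework, the sum over the $T$ terms in Eq. (2) falls out automatically when the same parameter object is reused in every memory layer. A minimal sketch, assuming the ResidualMemoryNet above and a placeholder loss:

```python
# self.W_s is a single nn.Linear reused in all L layers, so loss.backward()
# accumulates the per-layer gradient terms of Eq. (2) into W_s.weight.grad;
# no manual summation over layers or time steps is needed.
model = ResidualMemoryNet(feat_dim=40, hidden_dim=512, num_layers=18)
loss = model(torch.randn(4, 100, 40)).pow(2).mean()   # placeholder loss for illustration
loss.backward()
print(model.W_s.weight.grad.shape)                    # torch.Size([512, 512])
```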

2.1. Bi-directional residual memory network

In this section, the structure of the bi-directional RMN (BRMN) is discussed. The BRMN is an extension of RMN with one additional shared weight transform which receives future frames as input. The forward propagation output is given as

$$y_l(t) = \sigma\big(x(t)\,W_l + h_l(t-m)\,W_s + h_l(t+m)\,W_b\big) \qquad (3)$$

where $t-m$ is the time instant delayed by $m$ steps and $W_b$ is the shared weight across layers. Unlike the bi-directional RNN described in [12], BRMN does not require two separate recurrent units for training on future and past frames. The past and future frames are not treated as independent entities and are merged after each layer. A possible explanation for the bi-directional RNN having two separate layers is that an RNN tends to look over all frames during prediction, which leads to a performance drop [12]. In the case of BRMN, the network is constrained to a predefined context size based on the number of memory layers, and thus connecting the forward and backward states after each memory layer shows an improvement in performance. Also, BRMN requires only one extra weight transform over RMN and hence the number of parameters is significantly smaller when compared to bi-directional RNNs and BLSTMs [13, 12].
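A minimal sketch of the BRMN memory layer in Eq. (3), assuming the same conventions as the RMN sketch above ($W_b$ is one extra shared transform; the symmetric delay is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def brmn_layer(h_cur, W_s, W_b, delay):
    # h_cur holds h_l(t) = x(t) W_l for all t; W_s and W_b are shared nn.Linear transforms.
    T = h_cur.size(1)
    h_past = F.pad(h_cur, (0, 0, delay, 0))[:, :T, :]      # h_l(t - m), zeros before t = 1
    h_fut  = F.pad(h_cur, (0, 0, 0, delay))[:, delay:, :]  # h_l(t + m), zeros after t = T
    return torch.relu(h_cur + W_s(h_past) + W_b(h_fut))    # Eq. (3)
```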

[Figure 2 (diagram omitted): a stack of memory layers with residual "+" connections; the shared delayed inputs run from frame t-T in the first layer up to frame t-1 in the last layer.]

Figure 2: Architecture of the residual memory network (RMN) with number of memory layers L = 18. The memory layers can model a temporal context size of 18.

3. Experimental setup

The experiments were conducted on the AMI meeting conversation corpus^1, using the independent headset microphone (IHM) recordings. The database is composed of 77 hours of training data and 9 hours each of dev and eval data. 16 kHz sampled waveforms were used to extract 13-dimensional MFCC features. These features were mean normalized, spliced over 7 frames and projected down to 40 dimensions using linear discriminant analysis (LDA) obtained from an LDA+MLLT model. The LDA features were fed to speaker adaptive training (SAT) using speaker-based feature-space maximum likelihood linear regression (fMLLR) transforms to obtain fMLLR features. 80-dimensional log Mel-filterbank (fbank) features were also used for comparison. The standard GMM-HMM and DNN are trained following the Kaldi toolkit [14]. The LSTM and RMN are trained using the CNTK toolkit [15]. SAT alignments using 4006 tied states were used as targets for neural network training. Testing was done on the eval set with a trigram language model.

3.1. Baseline DNN and LSTM models

The DNN configuration includes 440-dimensional (40 x 11 splice) fMLLR features at the input and 4006 senones at the softmax output. The DNN, containing 6 hidden layers of 2048 neurons, was initialized using RBM pretraining and fine-tuned with mini-batch SGD frame-classification training. The LSTM training

^1 http://corpus.amiproject.org/

Page 5:

• Summary:

• 9.9% on SWB (with i-vector adaptation)

• Going deeper with fewer parameters than other highway variants.

• I-vector needs more data!!

Page 6:

• KNOWLEDGE DISTILLATION FOR SMALL-FOOTPRINT HIGHWAY NETWORKS

• Using a larger DNN to teach a small Highway DNN (by tying the weights); a distillation-loss sketch follows this list

• The gain may come from it just mimicking an RNN…
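A minimal sketch of a generic distillation objective (soft targets from the large teacher with a temperature, mixed with the hard senone-label loss); the temperature, mixing weight and tensor shapes are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: the student matches the teacher's tempered output distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard-target term: the usual cross-entropy against the senone labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```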

Page 7:

• But….

• Some recent papers show that, for deep residual networks, only a few parameters are actually modified

• Can we skip some layers directly during decoding?

Page 8:

• JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING

• Combine the CTC and seq2seq losses (a sketch of the multi-task objective follows this list)

• CTC -> monotonic alignment constraint

• Corpora: Japanese and Mandarin, shorter utterances!!
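A minimal sketch of the multi-task objective (an interpolation weight between the CTC branch and the attention/seq2seq branch; the weight value, tensor shapes and loss calls are illustrative assumptions):

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, input_lengths, target_lengths,
                             att_logits, targets, lam=0.2):
    # CTC branch: frame-level log-probs of shape (T, batch, vocab) against the label sequences.
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths)
    # Attention branch: per-output-step logits of shape (batch, U, vocab) against the same labels.
    att = F.cross_entropy(att_logits.transpose(1, 2), targets)
    return lam * ctc + (1.0 - lam) * att
```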

Page 9:

• UNSUPERVISED SPEAKER ADAPTATION OF BATCH NORMALIZED ACOUSTIC MODELS FOR ROBUST ASR

• Adapt the scaling and shifting factors for each speaker

• Very similar to LHUC (which applies after the non-linearity)

[...] recognition [17]. The key idea is to first normalize the input of each hidden layer using the mean and variance calculated from each mini-batch in the forward pass, and then linearly scale and shift the normalized input before applying the non-linear activations. Mathematically,

$$h^l = f\!\left(\gamma^l \odot \frac{W^l h^{l-1} - \mu^l}{\sigma^l} + \beta^l\right) \qquad (1)$$

where $h^l$ is the output of the $l$-th hidden layer, $f(\cdot)$ represents the non-linear activation function, $\mu^l$, $\sigma^l$, $\gamma^l$ and $\beta^l$ are the mean, standard deviation, scaling factor and shifting factor at the $l$-th hidden layer, respectively, and $h^0$ is just the input of the network. The bias term of each hidden layer is not included here as it would be cancelled out by the mean subtraction operation in the forward pass.

It is suggested in [15] that during DNN training, the distribution of the input of hidden layers could change frequently, as the parameters in previous layers change. As a result, the optimization process is slowed down significantly. This problem can be alleviated by performing layer-wise normalization. This way, much larger learning rates can be safely used, and thereby much faster convergence and potentially better results could be achieved.

The linear transformation terms, i.e. $\gamma^l$ and $\beta^l$, are critical in batch normalization. If they are not incorporated, the input of the non-linear functions would be concentrated around zero. This means that the activations would be close to limited linear transformations for sigmoidal units [15], and approximately half of the activations would be implicitly forced to be zero and half to be positive for ReLUs. With the linear transformation terms, the network can automatically choose which segments of the activation function to use for a better performance.

After the training is done, we feed-forward all the training data to the network and record the mean and variance at every hidden layer. We use the means and variances calculated this way, denoted as $\hat{\mu}^l$ and $\hat{\sigma}^l$, for layer-wise normalization at the test stage.
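A minimal sketch of the batch-normalized hidden layer in Eq. (1) (PyTorch; layer sizes are illustrative assumptions). In train() mode the normalization uses mini-batch statistics; in eval() mode it uses the recorded training-set means and variances, as described above:

```python
import torch
import torch.nn as nn

class BNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # bias is cancelled by the mean subtraction
        self.bn = nn.BatchNorm1d(out_dim)                      # holds gamma (weight) and beta (bias)

    def forward(self, h_prev):                                 # h_prev: (batch, in_dim)
        return torch.relu(self.bn(self.linear(h_prev)))        # Eq. (1) with f = ReLU

layer = BNLayer(440, 2048)
layer.train()   # normalize with mini-batch mean/variance during training
layer.eval()    # normalize with the recorded training-set statistics at test time
```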

2.2. Linear Input Network

The idea of the linear input network (LIN) approach is to learn a linear transformation of the input features of the acoustic model [4], [5], [18]. In this study, as the amount of adaptation data is limited, we constrain the number of parameters to be learned by forcing the linear transformation to be diagonal, with parameters tied across neighboring frames:

$$\hat{x}_{t,f} = a_f \, \frac{x_{t,f} - \mu_f}{\sigma_f} + b_f \qquad (2)$$

where $x_{t,f}$ denotes the un-normalized input feature, $\hat{x}_{t,f}$ represents the adapted feature, $\mu_f$ and $\sigma_f$ stand for the mean and standard deviation computed from the whole training data, and $t$ and $f$ index time and frequency, respectively. $a$ and $b$ are the parameters to be learned for each speaker. We initialize $a$ to be an all-one vector and $b$ to be an all-zero vector before adaptation.

This method is reasonable in the sense that the distribution of test data may be very different from that of training data. Therefore, a diagonal linear transformation is learned to scale and shift the normalized test data to better match the mean and variance of the training data.
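A minimal sketch of this diagonal LIN transform on globally normalized features (Eq. (2); the array layout and names are illustrative assumptions):

```python
import numpy as np

def lin_adapt(x, mu, sigma, a, b):
    """x: (T, F) un-normalized features for one speaker; mu, sigma: (F,) training-set statistics;
    a, b: (F,) per-speaker scale and bias, shared across all frames (Eq. (2))."""
    return a * (x - mu) / sigma + b

F_dim = 40
a = np.ones(F_dim)    # initialized to all ones before adaptation
b = np.zeros(F_dim)   # initialized to all zeros before adaptation
```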

2.3. Adaptation of Batch Normalized Models

One problem of the LIN approach is that it only tries to match the distribution of test data with that of training data at the input level. However, after many layers of affine transformations and non-linear operations, the distribution of the hidden activations of test data could become more and more mismatched with that of training data. Only matching the distribution at the input level would not be good because the LIN approach itself may not behave well. To deal with this problem, we learn a linear transformation for the input of every hidden layer so that the linearly transformed input can better match the distribution of the training data at every hidden layer.

Then, the problem is how to get the distribution of the training data at every hidden layer so that we can match the distribution of the test data onto it. Batch normalized acoustic models naturally provide us the $\hat{\mu}^l$ and $\hat{\sigma}^l$ at the $l$-th hidden layer. Therefore, in this study we adjust the scaling factor and the shifting factor at every hidden layer for batch normalized acoustic models:

$$\tilde{h}^l = f\!\left(\tilde{\gamma}^l \odot \frac{W^l \tilde{h}^{l-1} - \hat{\mu}^l}{\hat{\sigma}^l} + \tilde{\beta}^l\right) \qquad (3)$$

where $\tilde{h}^l$ is the adapted hidden activations, and $\tilde{\gamma}^l$ and $\tilde{\beta}^l$ are the only parameters to be adjusted for each speaker. The dimensions of $\tilde{\gamma}^l$ and $\tilde{\beta}^l$ are the same as the number of hidden units at the $l$-th hidden layer. The number of parameters to be learned is therefore limited. Note that we do not change $W^l$ as modifying it could destroy the well-learned filters in the weight matrix.

In this paper, we perform unsupervised adaptation for each speaker. We first decode all the utterances of each speaker in the test set using the speaker-independent batch normalized acoustic models to obtain the first-pass decoding results, from which we adjust the parameters to minimize the cross-entropy criterion using the back-propagation algorithm.
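A minimal sketch of this adaptation step (PyTorch; the model structure, the first-pass pseudo-labels and the optimizer settings are illustrative assumptions): freeze everything except the per-layer scaling and shifting factors and run a few gradient steps on the speaker's first-pass output.

```python
import torch
import torch.nn as nn

def adapt_bn_to_speaker(model, feats, pseudo_labels, lr=1e-3, steps=50):
    # Freeze all parameters, then re-enable only the batch-norm scaling (gamma,
    # stored as bn.weight) and shifting (beta, stored as bn.bias) factors.
    for p in model.parameters():
        p.requires_grad_(False)
    bn_params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm1d):
            m.weight.requires_grad_(True)   # gamma^l
            m.bias.requires_grad_(True)     # beta^l
            bn_params += [m.weight, m.bias]
    model.eval()                            # keep the recorded training-set statistics fixed
    opt = torch.optim.SGD(bn_params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(feats), pseudo_labels)
        loss.backward()
        opt.step()
    return model
```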

It should be mentioned that in [19] and [20], Pawel et al. propose an adaptation technique based on learning hidden unit contributions (LHUC), which essentially learns a weight for every hidden unit to re-combine all the activations for a target speaker or environment. In [21], a parameterized hidden activation function approach is proposed to re-weight the importance of each hidden unit for speaker adaptation. Different from these studies, our method is proposed for batch normalized acoustic models, and we adjust the shifting factors together with the scaling factors to recombine the activations before, rather than after, the non-linear function at all the hidden layers. This may better adapt an acoustic model to a target speaker or a target environment. In addition, our method does not make any assumptions on the type of activation functions.

3. EXPERIMENTAL SETUP

We evaluate our methods on the recently proposed CHiME-3 corpus [22]. The CHiME-3 dataset includes real and simulated data in four challenging daily environments, i.e. cafe, street junction, public transport, and pedestrian area, and consists of six-channel microphone array data. The real recordings are uttered by real speakers in the abovementioned environments, and are recorded using a specially designed tablet with five microphones mounted in the front and one in the rear. The simulated data is created by first convolving clean utterances [...]


Page 10:

• Summary:

• Nice gain on CHiME using a DNN

• I did the same thing on LSTM

• Batch norm degrades the performance of LSTMs even though it is helpful for adaptation

• For more robustness, it may make more sense:

• Domain adaptation through batch norm

Page 11:

• RECURRENT CONVOLUTIONAL NEURAL NETWORK FOR SPEECH PROCESSING

• Replacing the inner product in the RNN with a convolution

• Much less gain than Seq2Seq model

• Seq2Seq was easier to overfit

Page 12:

• Joint Optimization of Tandem Systems Using Gaussian Mixture Density Neural Network Discriminative Sequence Training

• The gradient for the NN comes from the HMM-GMM training

• The interesting part was in the details: variance flooring, I-smoothing, etc.

• A nice guide for reviewing the old HMM world (from the Cambridge speech group)

Page 13:

• IBM: Reaching new records in speech recognition

• Human parity was not 5.9%!

• We got 5.5% but we won’t claim we reached human performance

• Adversarial speaker adaptation

Page 14:

[Figure (diagram omitted): Speaker-invariant LSTM. A stack of LSTM layers over the input feeds two branches: a phone-label classifier with loss $L_d$, and a speaker-prediction branch ("Object: Predict Speakers") attached through a gradient reversal layer with loss $L_s$. During backprop the shared LSTM feature parameters $\theta_f$ receive $\frac{\partial L_d}{\partial \theta_f} - \frac{\partial L_s}{\partial \theta_f}$, while the speaker-branch parameters $\theta_s$ receive $\frac{\partial L_s}{\partial \theta_s}$.]
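A minimal sketch of a gradient reversal layer in PyTorch (the scaling constant and the place where it is inserted are illustrative assumptions): identity in the forward pass, negated and scaled gradient in the backward pass.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient flowing back into the shared
        # feature extractor; None corresponds to the lam argument.
        return -ctx.lam * grad_output, None

# Usage: speaker_logits = speaker_head(GradReverse.apply(lstm_features, 0.5))
```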

Page 15:

• Unreliable / noisy reversed gradients

• Speaker information was chunk-level instead of frame-based

• temporal pooling layer

• LSTM final hidden output

Page 16:

Preliminary Results

• Switchboard, 300 h training; SWB and CHM eval sets; filter-bank features

• SWB is supposed to be closer to the speakers in the training set

System          Features   SWB    CHM
LC-BLSTMP       fbank      11.1   23.1
ADS-LC-BLSTMP   fbank      11.0   22.2
LC-BLSTMP       fmllr      11.3   21.8
ADS-LC-BLSTMP   fmllr      11.4   21.7
