
Adaptation of Hidden Markov Models for Recognizing Speech of Reduced Frame Rate


IEEE TRANSACTIONS ON CYBERNETICS, VOL. 43, NO. 6, DECEMBER 2013

Lee-Min Lee, Member, IEEE, and Fu-Rong Jean

Manuscript received August 17, 2011; revised May 18, 2012, September 26, 2012, and December 13, 2012; accepted January 7, 2013. Date of publication February 13, 2013; date of current version November 18, 2013. This work was supported by Dayeh University under Grant ORD-9961. This paper was recommended by Associate Editor V. Murino.

L.-M. Lee is with the Department of Electrical Engineering, Dayeh University, Changhua 51591, Taiwan (e-mail: [email protected]).

F.-R. Jean is with the Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2013.2240450

Abstract—The frame rate of the observation sequence in distributed speech recognition applications may be reduced to suit a resource-limited front-end device. In order to use models trained using full-frame-rate data in the recognition of reduced-frame-rate (RFR) data, we propose a method for adapting the transition probabilities of hidden Markov models (HMMs) to match the frame rate of the observation. Experiments on the recognition of clean and noisy connected digits are conducted to evaluate the proposed method. Experimental results show that the proposed method can effectively compensate for the frame-rate mismatch between the training and the test data. Using our adapted model to recognize the RFR speech data, one can significantly reduce the computation time and achieve the same level of accuracy as that of a method that restores the frame rate using data interpolation.

Index Terms—Adaptation, distributed speech recognition (DSR), hidden Markov model (HMM), reduced frame rate (RFR).

I. INTRODUCTION

HUMAN speech is the most straightforward and convenient way of delivering commands to mobile devices in order to ask for services while on the move. Mobile devices are constrained by their small size, limited computing power and access speed, small memory capacity, and limited battery life. Therefore, reducing the significant computational burden of employing automatic speech recognition (ASR) on mobile devices leads to an attractive client-server architecture known as distributed speech recognition (DSR) [1]-[8]. In DSR applications, speech features are extracted and quantized at the user's end (the client end) and sent to a speech recognition server. Furthermore, mel-frequency cepstral coefficients (MFCCs) have proven to be very successful for ASR and are used in most DSR front ends. DSR standardization activities have been conducted by the Speech Processing, Transmission and Quality Aspects (STQ) AURORA DSR working group of the European Telecommunications Standards Institute (ETSI) in order to support and spread DSR applications.

Because the computation, memory, communication bandwidth, and power resources of a client device may be very limited [9]-[14], considerable efforts have been devoted to reducing the frame rate or the dimension of the feature vector in DSR systems. A detailed discussion of the capabilities and limitations of the client device can be found in [11]. Tan et al. investigated a half-frame-rate (HFR) front end for DSR and showed that the performance achieved by HFR is comparable with that of the full-frame-rate (FFR) front end [9]. They also proposed an adaptive multi-frame-rate scheme in which the DSR system switches between the HFR and FFR schemes. Flynn and Jones examined several feature selection methods that can reduce the bandwidth for data transmission from the client to the server [12]. They also proposed a transmission bandwidth reduction method that applies a wavelet decomposition to the feature vectors, discarding the detail coefficients and transmitting only the approximation coefficients to the receiving side [13].

The use of a variable frame rate instead of FFR processing of the speech features is another approach to reducing the computation and bandwidth requirements. Variable-frame-rate analysis can be traced back to the early 1990s [15], [16]. In this method, slowly changing frames that provide little information are discarded, and rapidly changing frames that are more relevant to speech perception are preserved. The obvious advantages of this method are the relatively small bandwidth for feature transmission and the relatively short processing time for recognition when the number of processed frames is reduced; alternatively, superior recognition accuracy can be achieved at a given frame rate. A measurement of the spectral change level of every frame is required to decide which frames are to be dropped [14]-[19].

A relatively simple way to decrease the data rate is to reduce the frame rate at the front-end stage. The reduced-frame-rate (RFR) observations can be obtained by using a constant multiple of the FFR frame shift in the extraction of the feature vectors. For example, HFR feature extraction uses a fixed frame shift of twice that of the FFR, and thus one half of the computation load for the feature extraction, as well as of the bandwidth, can be saved. However, some compensation is necessary to alleviate the performance degradation due to the mismatched frame rates between the received feature vectors and the pretrained models. There are two major approaches to compensating for the frame-rate mismatch. The first approach reconstructs an FFR feature vector sequence from the received RFR data and then decodes the reconstructed FFR data using the FFR models [10], [19]. In contrast to reconstructing the frame rate of the received data, the second approach adapts the state transition probabilities of the hidden Markov models (HMMs) to match the received RFR speech feature vector sequences [20].


Fig. 1. Overall system architecture of the frame-based interpolation approach.

Fig. 2. Overall system architecture of the proposed model adaptation approach.

The frame-rate reconstruction approach is illustrated in Fig. 1, in which D is a positive integer that represents the frame-rate reduction factor (decimation factor). Tan et al. [19] suggested repeating each frame of HFR data to construct an FFR data sequence and then using the original HMM to decode it. Deng et al. [10] exploited linear and spline interpolation methods for the missing frames in a study of DSR with a variable frame rate.

The model adaptation approach was proposed by the first author in a study on the recognition of HFR speech [20]. In that study, the recognition accuracy of HFR data using the model adaptation method was comparable with that of the FFR data. Fig. 2 depicts the system architecture of the model adaptation approach. In this paper, we extend the model adaptation method so that it can be used in the recognition of RFR data for general frame-rate reduction factors. Extensive experiments on the recognition of both clean and noisy speech were conducted to investigate the effectiveness of the proposed method. The computational cost of the proposed HMM adaptation is very low, and the adaptation can be conducted offline prior to the recognition phase. Furthermore, the decoding time for the RFR data is smaller than that for the FFR data since only a fraction of the FFR feature vectors need to be processed. In this paper, the accuracy and the recognition time of the proposed method are reported and compared with those of a frame-rate reconstruction approach. We also analyze the distribution of error types for the recognition of RFR data using the FFR models.

In practice, since the delta and acceleration feature sequences are supposed to be extracted from the transmitted RFR feature sequence at the back end, the preprocessing of the RFR feature sequence for generating both the delta and acceleration feature sequences has to be conducted prior to the RFR decoding. The method for dealing with this problem is discussed in detail in Section III-B. The rest of this paper is organized as follows. In Section II, the derivation of the HMM for the RFR data is described in detail. In Section III, a series of experiments on the recognition of RFR speech features is presented and analyzed. A discussion of the limitations of, and suggestions for, applying the proposed adaptation method to phone models with a small number of states is presented in Section IV. Concluding remarks are given in Section V.

II. ADAPTATION OF HMM FOR RFR DATA

A typical speech recognition system consists of a front-end feature extraction unit and a back-end decoding unit. The front-end unit divides the input signal into overlapped frames and converts each frame to a feature vector. The typical frame length and frame shift times are 25 and 10 ms, respectively. The most widely used feature vector consists of 12 MFCCs, the log energy, and their first- and second-order regression coefficients.

In the application of DSR with a resource-limited front-end device, the frame rate will be reduced. Therefore, only a decimated version of the FFR data is received by the recognizer.

Suppose that the frame-rate reduction factor is $D$; then, we shall derive the formulas to decode the RFR sequence $O_D, O_{2D}, \ldots, O_{TD}$, which can be assumed to be decimated from the FFR sequence $O_1, O_2, \ldots, O_{TD}$.

A. Speech Recognition of Full Frame Rate Data

Let $O_1, O_2, \ldots, O_{TD}$ be the FFR observation sequence to be recognized; a speech recognizer is to find a word sequence $\hat{W}$ such that

$$\hat{W} = \arg\max_{w \in V^*} P(w \mid O_1, O_2, \ldots, O_{TD}) = \arg\max_{w \in V^*} P(O_1, O_2, \ldots, O_{TD} \mid w)\, P(w) \qquad (1)$$

where $V^*$ represents the set of all possible word sequences over the vocabulary set $V$ of the recognition task, $P(O_1, O_2, \ldots, O_{TD} \mid w)$ is the acoustic model likelihood score, and $P(w)$ is the language model score. In this paper, we focus on the acoustic model. In the HMM of the acoustic likelihood, each observation $O_t$ is associated with an underlying unobservable state $q_t$. The Markovian property assumes that both the probability distribution of the observation at time $t$ and the probability distribution of the state at time $t+1$ are completely determined by the state at time $t$. When the HMM is applied to speech recognition, the transition structure of a model is usually arranged from left to right so that each state represents the characteristics of a time position. In addition, two nonemitting null states are often used for representing the boundary before the first observation and the boundary after the last observation, respectively.

We can use the Markovian property to derive an efficient recursive formula to calculate the acoustic model score. Suppose that an HMM has $N$ states, with states 1 and $N$ used for modeling the start and end boundaries, respectively. Let $a_{ij}$ denote the transition probability from state $i$ to state $j$ and $b_j(\cdot)$ denote the observation probability function of state $j$. Let the forward likelihood function be defined as the likelihood of the partial observations up to time $t$ under the condition that state $q_t = i$, i.e.,

$$\alpha_t[i] = P(O_1, O_2, \ldots, O_t, q_t = i \mid w). \qquad (2)$$

Then, the Markovian property allows us to derive a recursive formula for the forward likelihood function as follows:

$$\begin{aligned}
\alpha_{t+1}[j] &= P(O_1, O_2, \ldots, O_t, O_{t+1}, q_{t+1}=j \mid w) \\
&= \sum_{i=1}^{N} P(O_1, O_2, \ldots, O_t, O_{t+1}, q_t=i, q_{t+1}=j \mid w) \\
&= \sum_{i=1}^{N} \bigl\{ P(O_1, O_2, \ldots, O_t, q_t=i \mid w)\, P(q_{t+1}=j \mid q_t=i, w)\, P(O_{t+1} \mid q_{t+1}=j, w) \bigr\} \\
&= \sum_{i=1}^{N} \alpha_t[i]\, a_{ij}\, b_j(O_{t+1}). \qquad (3)
\end{aligned}$$

The likelihood of the complete observation sequence can be obtained as follows:

$$P(O_1, O_2, \ldots, O_{TD} \mid w) = \sum_{i=1}^{N} \alpha_{TD}[i]\, a_{iN}. \qquad (4)$$

The aforementioned formula is the total likelihood summed over all possible state sequences (paths). To efficiently search for the word sequence that best matches the speech, the likelihood of the most likely path is used instead of the total likelihood. The Viterbi algorithm is usually applied to decode the speech by simultaneously searching for the most likely path and its associated likelihood.
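To make the recursion concrete, the following is a minimal sketch of the forward computation in (2)-(4) for a single model. It assumes the log observation probabilities $\log b_j(O_t)$ have already been evaluated for the $N$ emitting states and that the two null boundary states are represented by an entry probability vector and an exit probability vector; all function and variable names are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_b, A, a_entry, a_exit):
    """Total likelihood log P(O_1..O_T | w) via the forward recursion (3)
    and the termination (4).

    log_b   : (T, N) array of log b_j(O_t) for the N emitting states
    A       : (N, N) transition probabilities a_ij among emitting states
    a_entry : (N,) transition probabilities out of the starting null state
    a_exit  : (N,) transition probabilities a_iN into the ending null state
    """
    with np.errstate(divide="ignore"):          # log(0) -> -inf is intended
        log_A = np.log(A)
        log_alpha = np.log(a_entry) + log_b[0]  # alpha_1[j]
        for t in range(1, len(log_b)):
            # alpha_{t+1}[j] = sum_i alpha_t[i] a_ij b_j(O_{t+1}), in log domain
            log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_b[t]
        return logsumexp(log_alpha + np.log(a_exit))
```

The log domain is used only to avoid numerical underflow over long utterances; the recursion itself is exactly (3).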

B. Adaptation of HMMs for RFR Data

Suppose that the frame-rate reduction factor is $D$ and only the decimated RFR data sequence $O_D, O_{2D}, \ldots, O_{TD}$ is available to the speech recognizer trained from the FFR data. The speech recognizer for the RFR data finds a word sequence $\hat{W}$ such that

$$\hat{W} = \arg\max_{w \in V^*} P(w \mid O_D, O_{2D}, \ldots, O_{TD}) = \arg\max_{w \in V^*} P(O_D, O_{2D}, \ldots, O_{TD} \mid w)\, P(w). \qquad (5)$$

To calculate the acoustic model score of the RFR data, we define the forward likelihood function as follows:

$$\alpha_{D,t}[i] = P(O_D, O_{2D}, \ldots, O_{tD}, q_{tD} = i \mid w). \qquad (6)$$

Note that $\alpha_{D,t}[i]$, the forward function at the $t$th time step in (6), is not the same as $\alpha_{Dt}[i]$, the forward function at the $(tD)$th time step defined by (2). Since the RFR data are generated by the same underlying physical model that generates the FFR data, we can consider that there is an underlying state at each time point of the FFR data. Therefore, a recursive formula to calculate the forward likelihood function can be derived as follows:

$$\begin{aligned}
\alpha_{D,t+1}[j] &= \sum_{i=1}^{N} \sum_{k_1=1}^{N} \cdots \sum_{k_{D-1}=1}^{N} P\bigl(O_D, O_{2D}, \ldots, O_{tD}, O_{(t+1)D},\; q_{tD}=i,\; q_{tD+1}=k_1, \ldots, q_{tD+D-1}=k_{D-1},\; q_{(t+1)D}=j \bigm| w\bigr) \\
&= \sum_{i=1}^{N} \sum_{k_1=1}^{N} \cdots \sum_{k_{D-1}=1}^{N} P(O_D, O_{2D}, \ldots, O_{tD}, q_{tD}=i \mid w)\, P(q_{tD+1}=k_1 \mid q_{tD}=i, w)\, P(q_{tD+2}=k_2 \mid q_{tD+1}=k_1, w) \cdots P\bigl(q_{(t+1)D}=j \mid q_{tD+D-1}=k_{D-1}, w\bigr)\, P\bigl(O_{(t+1)D} \mid q_{(t+1)D}=j, w\bigr) \\
&= \sum_{i=1}^{N} \sum_{k_1=1}^{N} \cdots \sum_{k_{D-1}=1}^{N} \bigl\{ \alpha_{D,t}[i]\, a_{ik_1}\, a_{k_1 k_2} \cdots a_{k_{D-1} j}\, b_j\bigl(O_{(t+1)D}\bigr) \bigr\} \\
&= \sum_{i=1}^{N} \alpha_{D,t}[i]\, \tilde{a}_{ij}\, b_j\bigl(O_{(t+1)D}\bigr) \qquad (7)
\end{aligned}$$

where

$$\tilde{a}_{ij} = \sum_{k_1=1}^{N} \cdots \sum_{k_{D-1}=1}^{N} a_{ik_1}\, a_{k_1 k_2} \cdots a_{k_{D-1} j}. \qquad (8)$$

Fig. 3. Relation between an original HMM and the corresponding adapted HMMs. (a) Original HMM for FFR data. (b) Adapted HMM for HFR data. (c) Adapted HMM for one-third-frame-rate data.

By comparing (7) with (3), one can see that (7) is equal to the forward likelihood function of an equivalent adapted HMM for the RFR data sequence, with the state transition probabilities given by (8) and with the same output probability distributions as those of the original FFR model. From (8), the state transition probabilities of the adapted HMMs for recognizing reduced-frame-rate data are easy to obtain and can be prepared offline prior to recognition, as mentioned earlier. The relation between an original FFR HMM and the corresponding adapted HMMs for a five-state (including the starting and ending null states) left-to-right HMM is shown in Fig. 3. Note that, although the structure of the original HMM is left to right without skips, the adapted HMMs include skip transitions. An RFR feature vector sequence may not have sufficient frames of data to cover all the states of the concatenated FFR models corresponding to that sequence. In such a case, decoding the RFR data sequence using the FFR HMMs will result in errors. In contrast, since the adapted models allow for state skips, the RFR data sequence can very likely fit into the corresponding concatenated adapted models.
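In matrix form, (8) says that the adapted transition probability $\tilde{a}_{ij}$ is the $(i, j)$ entry of the $D$th power of the original transition matrix, so the whole adaptation is a single matrix power. A minimal sketch, using the five-state left-to-right model of Fig. 3 with the illustrative initial transition values 0.6/0.4 quoted in Section III-C (the trained values would differ):

```python
import numpy as np

def adapt_transitions(A, D):
    """Adapted transition probabilities of (8): a~_ij = [A^D]_ij, where A is
    the full N x N transition matrix of the FFR HMM, null states included."""
    return np.linalg.matrix_power(np.asarray(A, dtype=float), D)

# Five-state model of Fig. 3(a): states 0 and 4 are the null boundary states.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # null start -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],   # emitting states: self-loop 0.6, next 0.4
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],   # last emitting state -> null end
    [0.0, 0.0, 0.0, 0.0, 0.0],   # null end state (no outgoing transitions)
])

A_hfr = adapt_transitions(A, 2)  # Fig. 3(b): A_hfr[1, 3] = 0.16, a new skip
A_3rd = adapt_transitions(A, 3)  # Fig. 3(c): skips of up to two states appear
```

With D = 2, the first emitting state acquires a direct transition to the third one (probability 0.4 x 0.4 = 0.16), which is exactly the extra skip arc of Fig. 3(b).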

III. EXPERIMENTS AND RESULTS

A series of experiments on the recognition of RFR speech data was conducted to investigate the effectiveness of the proposed HMM adaptation method. The effects of using various methods and system parameter sets were also inspected and compared.

A. Experimental Database

The AURORA 2 database [21], [22] was used in the experiments. This database was designed for the task of speaker-independent connected digit recognition in noisy environments. It contains clean and noisy speech. The clean speech was obtained by downsampling the TIDIGITS database to 8 kHz. There are two training sets in AURORA 2. One training set contains clean training speech. The other is a multicondition set that contains clean and noisy speech at several signal-to-noise ratio (SNR) levels (20, 15, 10, and 5 dB) and with several types of noise (subway, babble, car, and exhibition hall). The noisy training speech was obtained by first filtering the clean speech with a G.712 characteristic and then artificially adding noise to the filtered speech at several SNRs. There are, in all, ten test subsets contained in three test sets, i.e., test sets A and B contain four subsets each, and test set C contains two subsets. All the data in the three test sets contain clean and noisy speech at various SNR levels. There are, in total, 10 010 test utterances for each SNR level. The data in test sets A and B are filtered with the G.712 filtering characteristic. The data in test set C are filtered with the Modified Intermediate Reference System characteristic. The noise types added to set A are the same as those used for the multicondition training. The noise types added to test set B are four other types (restaurant, street, airport, and train station). Subway and street noises are added to test set C.

B. Feature Extraction

The frame length and the frame shift for the FFR observation sequence were 25 and 10 ms, respectively. Twelve MFCCs and one log energy were used as the static feature vector. Each frame of the speech signal was preemphasized using a coefficient of 0.97. The MFCCs were extracted from the energy of 23 frequency banks in the range from 64 Hz to 4 kHz. The MFCCs were then liftered with a coefficient of 22. The delta and acceleration coefficients of the static feature sequence were used as the dynamic feature vector in the experiments. The delta and acceleration window lengths were set to three and two frames, respectively. Since the dynamic features can be calculated at the back end, only the static features need to be calculated at the front end.

The static feature vector of the RFR data was obtained using a constant multiple of the FFR frame shift in the extraction of the feature vectors. This is equivalent to the downsampling of the FFR static feature vector sequence shown in Figs. 1 and 2. To obtain the dynamic features of the RFR data, the downsampled static feature sequence was first linearly interpolated to the FFR, and then the dynamic features at the time positions of the RFR were extracted from the interpolated static feature sequence, as indicated in Fig. 4. In addition to the FFR and RFR data, FFR data reconstructed from the RFR data using linear interpolation were used in the experiments.

Fig. 4. Extraction of the dynamic features from the RFR static feature sequence.
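The preprocessing just described can be sketched as follows. This assumes an HTK-style regression formula for the delta and acceleration coefficients with the window lengths given above; the function names and the edge padding are illustrative assumptions, and the exact AURORA front-end scripts may differ in detail:

```python
import numpy as np

def regression_coeffs(x, win):
    """HTK-style regression coefficients along the time axis:
    d_t = sum_{k=1..win} k * (x_{t+k} - x_{t-k}) / (2 * sum_{k=1..win} k^2)."""
    T = len(x)
    pad = np.concatenate([x[:1].repeat(win, 0), x, x[-1:].repeat(win, 0)])
    d = np.zeros_like(x)
    for k in range(1, win + 1):
        d += k * (pad[win + k:win + k + T] - pad[win - k:win - k + T])
    return d / (2 * sum(k * k for k in range(1, win + 1)))

def rfr_feature_vectors(static_rfr, D, delta_win=3, acc_win=2):
    """Linearly interpolate the RFR static features to the FFR grid, compute
    delta and acceleration there, and keep only the RFR time positions
    (the procedure of Fig. 4)."""
    T_rfr, dim = static_rfr.shape
    t_rfr = np.arange(T_rfr) * D            # FFR indices of the RFR frames
    t_ffr = np.arange(t_rfr[-1] + 1)        # full FFR time grid
    static_ffr = np.stack(
        [np.interp(t_ffr, t_rfr, static_rfr[:, i]) for i in range(dim)], axis=1)
    delta = regression_coeffs(static_ffr, delta_win)
    acc = regression_coeffs(delta, acc_win)
    return np.concatenate([static_ffr, delta, acc], axis=1)[t_rfr]
```

Only the rows at the RFR time positions are returned, so the decoder still processes a fraction 1/D of the frames.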

C. HMM Structure

The FFR HMM structure for each digit model in the experiments was left to right without skip transitions. The number of emitting states in each digit model was 16 in most of the experiments conducted to compare the performance of the proposed method with that of the other approaches. The performance of the proposed method was also investigated for a range of state numbers in each digit model. The FFR silence model was composed of three emitting states with left-to-right transitions, a skip transition from the first emitting state to the last emitting state, and a loop-back transition from the last emitting state to the first emitting state. The FFR short-pause model between digits had one emitting state that shared its output distribution with the middle state of the silence model and had a skip transition from the dummy start state to the dummy end state. For a state in an FFR digit model, the initial state transition probabilities to itself and to the next state were set to 0.6 and 0.4, respectively. The output probability distribution of each state was modeled as a mixture of Gaussian distributions with diagonal covariance matrices. The mixture number was set to three for the states in each digit model and six for the states in the silence and short-pause models.

D. Experimental Results

The Hidden Markov Model Toolkit [23] was used for training the models and decoding the speech. The training procedure followed the script supplied with the AURORA 2 database. Both the clean and multicondition trainings and all ten test subsets in the three test sets were used in the experiments. The word-loop network for speech decoding was the same as that provided by AURORA 2. The model parameters were estimated from the FFR data. The performance of the proposed HMM adaptation method for the recognition of RFR data was compared with that of three other configurations, i.e., using models trained from the FFR data to recognize FFR data, RFR data, and FFR data linearly interpolated from RFR data. The recognition rate reported is the average over the ten test subsets.

Using models trained from FFR clean data, we obtained the word accuracy for FFR and RFR (HFR in this case) data recognized by the various methods and have listed the values in Table I. For multicondition training, the word accuracy is given in Table II. A comparison of the word accuracy of the HMM adaptation and the data interpolation methods for several SNR levels and for frame-rate reduction factors of three and four is given in Table III. From these tables, we can see that the degradation is severe if the RFR data are recognized with models trained from the FFR data, without repetition or interpolation, and that both the proposed HMM adaptation method and the data interpolation method can effectively alleviate this degradation.

TABLE I. WORD ACCURACY FOR VARIOUS RECOGNITION METHODS AND SNR LEVELS (CLEAN TRAINING; FRAME-RATE REDUCTION FACTOR D = 2)

TABLE II. WORD ACCURACY FOR VARIOUS RECOGNITION METHODS AND SNR LEVELS (MULTICONDITION TRAINING; FRAME-RATE REDUCTION FACTOR D = 2)

Although both methods achieve similar recognition accuracy, the proposed method needs to process only a fraction of the frames during decoding, so its recognition time is significantly reduced compared with that of the data interpolation method. The computational load of the proposed method is merely 52.9%, 35.3%, and 29.6% of that of the data interpolation method for the frame-rate reduction factors D = 2, 3, and 4, respectively. The total decoding time (in minutes) for all three test sets of AURORA 2 using the various methods is shown in Fig. 5. The decoding time was measured on a personal computer with a 2.93-GHz Intel Core 2 Duo processor and 2 GB of memory. The platform used in the experiments was Cygwin on Windows.

TABLE III. COMPARISON OF WORD ACCURACY FOR THE HMM ADAPTATION AND DATA INTERPOLATION METHODS (MULTICONDITION TRAINING)

Fig. 5. Total decoding time for all three test data sets of AURORA 2, where FFR indicates the results for FFR data and FFR HMMs, ADP is for RFR data and adapted HMMs, INP is for interpolated data and FFR HMMs, and RFR is for RFR data and FFR HMMs (16-state digit HMMs; multicondition training).

TABLE IV. DISTRIBUTION OF DIFFERENT TYPES OF ERRORS (IN PERCENT) FOR THE RECOGNITION OF HFR DATA USING FFR MODELS (MULTICONDITION TRAINING)

The distribution of the different error types for the recognition of the RFR data using the FFR models was analyzed and is given in Table IV. One can see that deletion errors are the major cause of the accuracy degradation. Because a traversal of an HMM without skips consumes at least as many frames as there are states in that HMM, a deletion error must occur whenever the number of RFR frames in an utterance is less than the number of states in the concatenated model for that utterance. For example, with 16 emitting states per digit model, a three-digit utterance must supply at least 48 frames; an HFR version with fewer frames than this cannot be decoded without deletions.

Fig. 6. Word accuracy of the proposed HMM adaptation method versus the frame-rate reduction factor (multicondition training).

Fig. 7. Word accuracy of the proposed HMM adaptation method versus the number of states in each digit model (multicondition training; clean test speech).

We also investigated the word accuracy of the proposed HMM adaptation method for a wide range of frame-rate reduction factors. The experimental results for clean speech and for noisy speech with SNRs of 20 and 15 dB are shown in Fig. 6. From Fig. 6, we can see that the word accuracy decreased gradually for frame-rate reduction factors of less than four and dropped drastically for frame-rate reduction factors greater than five.

Since the proposed adaptation method changes the state transition scheme, one may want to know the recognition rate as a function of the number of states. We conducted experiments to examine the effect of the state number, and the results are shown in Figs. 7 and 8 for clean speech and for noisy speech with an SNR of 20 dB, respectively. From these figures, we find that the recognition rate of the proposed method was very close to that for the FFR data when the frame-rate reduction factor was small and the state number was large. We also find that the word accuracy decreased as the number of states in each digit model decreased for all frame rates, and the accuracy reduction was most prominent when the state number was small. Because the state number is usually set sufficiently large to obtain the best performance, the adverse effect of a small state number may not be a serious problem for the proposed method.

Since the computation time for an HMM is mostly spent on calculating Gaussian distribution functions, it is helpful to consider the computation time versus the number of states for the proposed method. The experimental results are shown in Fig. 9. From Fig. 9, one can see that the computation time is roughly proportional to the state number and to the frame rate.

Fig. 8. Word accuracy of the proposed HMM adaptation method versus the number of states in each digit model (multicondition training; noisy test speech with an SNR of 20 dB).

Fig. 9. Total decoding time of the proposed method versus the number of states in each digit model (multicondition training; clean test speech).

IV. DISCUSSION

The proposed approach is suitable for whole-word models of isolated and connected speech. For continuous speech, the word or subword models are connected to form a whole HMM network. If we want to obtain the exactly equivalent connected model for decoding the RFR data, we should apply the proposed adaptation method to the whole connected HMM network rather than applying it to the individual HMMs and then connecting the adapted HMMs. The network that connects individually adapted HMMs differs from the exactly adapted network in the transition probabilities of the last states of a model that is connected to other models. For example, in the case of HFR HMM adaptation, the proposed method allows a transition from the second-to-last state of a model to the second state of a following model, as shown by the path of thick arcs in Fig. 10(a). In this case, the transition skips two states, but an accurate adaptation method that takes the model context into account would allow skipping at most one state, as the thick arcs in Fig. 10(b) show. In this paper, since the word models have many states, the relaxation of the constraint on state transitions caused by applying the adaptation to the individual models is slight; thus, we did not employ the more complex accurate adaptation method. If phone models with small numbers of states are used in the decoding network, then the proposed method should be applied to the whole network in order to maintain the transition constraints. We shall apply the adaptation method to the whole network in future work.

Fig. 10. Two HMM adaptation schemes for HFR data. (a) Connection of individually adapted HMMs. (b) Result of applying the adaptation to connected HMMs.
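The boundary effect described above can be checked numerically. The sketch below builds two toy left-to-right models, composes them into a network with the junction null state folded out, and compares adapting the whole network against connecting individually adapted models for D = 2. The composition helper and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def left_to_right(n_emit, self_p=0.6):
    """(n_emit + 2)-state left-to-right model; states 0 and n_emit+1 are null."""
    n = n_emit + 2
    A = np.zeros((n, n))
    A[0, 1] = 1.0
    for i in range(1, n_emit + 1):
        A[i, i], A[i, i + 1] = self_p, 1.0 - self_p
    return A

def connect(A1, A2):
    """Compose two models, folding out the junction null state: transitions
    that entered model 1's end state are redirected through model 2's entry."""
    n1 = len(A1)
    n = n1 + len(A2) - 2
    A = np.zeros((n, n))
    A[:n1 - 1, :n1 - 1] = A1[:n1 - 1, :n1 - 1]
    A[:n1 - 1, n1 - 1:] = np.outer(A1[:n1 - 1, n1 - 1], A2[0, 1:])
    A[n1 - 1:, n1 - 1:] = A2[1:, 1:]
    return A

A = left_to_right(3)
exact = np.linalg.matrix_power(connect(A, A), 2)  # adapt the whole network
A_pow = np.linalg.matrix_power(A, 2)
naive = connect(A_pow, A_pow)                     # connect adapted models
# The naive scheme lets the second-to-last state of model 1 (index 2) jump to
# the second state of model 2 (index 5), skipping two states; the exact
# network adaptation does not allow this transition.
print(naive[2, 5], exact[2, 5])   # naive entry is nonzero (0.064), exact is 0
```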

V. CONCLUSION

In a DSR system, frame-rate reduction is important for a resource-limited front-end device. The loading of the network and the back-end server can also benefit from frame-rate reduction. For speech recognition over error-prone wireless networks, frame-rate reduction allows us to use more bits to protect the data. However, some compensation is necessary to alleviate the performance degradation due to the mismatched frame rates between the received RFR data and the models trained from FFR data. In this paper, we have derived an adaptation formula for the HMM transition probabilities for the recognition of RFR speech observations. Experimental results show that the proposed HMM adaptation method can effectively compensate for the frame-rate mismatch between the received data and the pretrained models. Using the proposed method, we can achieve the same level of recognition accuracy as that of the data interpolation method and, at the same time, significantly reduce the computation time. Since a back-end server in a DSR system recognizes utterances delivered from many client users, the proposed method enables the same back-end server to serve many more clients than the interpolation method. In other words, the proposed method increases the overall system capacity without any additional cost of installing new equipment.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments, which helped improve the quality of this paper. They would also like to thank the National Center for High-Performance Computing for computer time and facilities.


REFERENCES

[1] Z.-H. Tan and I. Varga, "Network, distributed and embedded speech recognition: An overview," in Automatic Speech Recognition on Mobile Devices and over Communication Networks, Z.-H. Tan and B. Lindberg, Eds. London, U.K.: Springer-Verlag, 2008, pp. 1-26.

[2] A. M. Peinado and J. C. Segura, Speech Recognition over Digital Channels: Robustness and Standards. Chichester, U.K.: Wiley, 2006, ch. 1.

[3] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms, ETSI Std. ES 201 108 V1.1.3, Sep. 2003.

[4] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms, ETSI Std. ES 202 050 V1.1.5, Jan. 2007.

[5] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Extended Front-End Feature Extraction Algorithm; Compression Algorithms; Back-End Speech Reconstruction Algorithm, ETSI Std. ES 202 211 V1.1.1, Nov. 2003.

[6] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Extended Advanced Front-End Feature Extraction Algorithm; Compression Algorithms; Back-End Speech Reconstruction Algorithm, ETSI Std. ES 202 212 V1.1.2, Nov. 2005.

[7] T. Ramabadran, A. Sorin, M. McLaughlin, D. Chazan, D. Pearce, and R. Hoory, "The ETSI extended distributed speech recognition (DSR) standards: Client side processing and tonal language recognition evaluation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 129-132.

[8] A. Sorin, T. Ramabadran, D. Chazan, R. Hoory, M. McLaughlin, D. Pearce, C. R. Wang, and Y. Zhang, "The ETSI extended distributed speech recognition (DSR) standards: Server-side speech reconstruction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 53-56.

[9] Z.-H. Tan, P. Dalsgaard, and B. Lindberg, "Adaptive multi-frame-rate scheme for distributed speech recognition based on a half frame-rate front-end," in Proc. IEEE 7th Workshop Multimedia Signal Process., 2005, pp. 1-4.

[10] H. Deng, D. O'Shaughnessy, J. Dahan, and W. F. Ganong, "Interpolative variable frame rate transmission of speech features for distributed speech recognition," in Proc. ASRU, 2007, pp. 591-595.

[11] N. Rajput and A. A. Nanavati, Speech in Mobile and Pervasive Environments. Hoboken, NJ: Wiley, 2012.

[12] R. Flynn and E. Jones, "Feature selection for reduced-bandwidth distributed speech recognition," Speech Commun., vol. 54, no. 6, pp. 836-843, Jul. 2012.

[13] R. Flynn and E. Jones, "Reducing bandwidth for robust distributed speech recognition in conditions of packet loss," Speech Commun., vol. 54, no. 7, pp. 881-892, Sep. 2012.

[14] Z.-H. Tan and B. Lindberg, "Low-complexity variable frame rate analysis for speech recognition and voice activity detection," IEEE J. Sel. Topics Signal Process., vol. 4, no. 5, pp. 798-807, Oct. 2010.

[15] K. M. Ponting and S. M. Peeling, "The use of variable frame rate analysis in speech recognition," Comput. Speech Lang., vol. 5, no. 2, pp. 169-179, Apr. 1991.

[16] P. Le Cerf and D. Van Compernolle, "A new variable frame rate analysis method for speech recognition," IEEE Signal Process. Lett., vol. 1, no. 12, pp. 185-187, Dec. 1994.

[17] Q. Zhu and A. Alwan, "On the use of variable frame rate analysis in speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Jun. 2000, pp. 1783-1786.

[18] H. You, Q. Zhu, and A. Alwan, "Entropy-based variable frame rate analysis of speech signals and its application to ASR," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 549-552.

[19] Z.-H. Tan, P. Dalsgaard, and B. Lindberg, "Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 4, pp. 1391-1403, May 2007.

[20] L.-M. Lee, "Adaptation of hidden Markov models for half frame rate observations," Electron. Lett., vol. 46, no. 10, pp. 723-724, May 2010.

[21] H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions," in Proc. ISCA ITRW Challenges Next Millenn., Paris, France, Sep. 2000.

[22] ELRA catalogue, AURORA Project Database, v2.0, catalogue reference: AURORA/CD0002. [Online]. Available: http://catalog.elra.info

[23] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland, The HTK Book, version 3.3. Cambridge, U.K.: Cambridge Univ. Eng. Dept., 2005.

Lee-Min Lee (M'03) graduated from the National Taipei Institute of Technology, Taipei, Taiwan, in 1982 and received the M.S. and Ph.D. degrees in electrical engineering from the National Tsing Hua University, Hsinchu, Taiwan, in 1989 and 1995, respectively.

From 1995 to 2002, he was with the faculty of the Mingchi Institute of Technology, Taishan, Taiwan, where he was an Associate Professor with the Department of Electrical Engineering. Since 2002, he has been with the Department of Electrical Engineering, Dayeh University, Changhua, Taiwan. His research interests include speech recognition, machine learning, and natural language processing.

Fu-Rong Jean received the B.S. degree from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 1985 and the M.S. and Ph.D. degrees in electrical engineering from the National Tsing Hua University, Hsinchu, Taiwan, in 1987 and 1995, respectively.

From 1987 to 1995, he was a Lecturer with the Department of Electrical Engineering, National Taipei Institute of Technology, Taipei. Since 1995, he has been an Associate Professor with the Department of Electrical Engineering, National Taipei University of Technology, Taipei. His main research interests include speech enhancement, speech and audio coding, and distributed speech recognition.
