Automatic generation of multiple pronunciations based on neural networks

Toshiaki Fukada *, Takayoshi Yoshimura, Yoshinori Sagisaka
ATR Interpreting Telecommunications Research Laboratory, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

Received 3 April 1998; accepted 24 September 1998

Speech Communication 27 (1999) 63-73

* Corresponding author. Tel.: +81 774 95 1301; fax: +81 774 95 1308; e-mail: fukada@itl.atr.co.jp

Abstract

We propose a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation. This method can generate multiple forms of realized pronunciations using the pronunciation network. For generating a sophisticated realized pronunciation dictionary, two techniques are described: (1) realized pronunciations with likelihoods and (2) realized pronunciations for word boundary phonemes. Experimental results on spontaneous speech show that the automatically derived pronunciation dictionaries give consistently higher recognition rates than a conventional dictionary. (c) 1999 Elsevier Science B.V. All rights reserved.

Zusammenfassung (German abstract, translated)

We propose a method for the automatic generation of a pronunciation lexicon whose pronunciation variants are generated by a neural network. The neural network can predict plausible alternative pronunciation variants from the original (canonical) pronunciations. Two techniques for the automatic generation of such a pronunciation lexicon are described here: (1) the generation of alternative pronunciation variants with likelihoods, and (2) the generation of alternative pronunciation variants for phonemes at word boundaries. Recognition experiments on spontaneous speech show that the use of automatically generated pronunciation lexica yields higher recognition rates than the use of the standard pronunciation lexicon. (c) 1999 Elsevier Science B.V. All rights reserved.

Resume (French abstract, translated)

We propose a method for automatically generating a pronunciation dictionary using a neural network that predicts the most plausible pronunciations (variants) from a standard pronunciation. The pronunciation network can generate multiple pronunciation variants. To generate a sophisticated dictionary of pronunciation variants, two techniques are described: (1) pronunciation variants with likelihood values, and (2) pronunciation variants for phonemes at word boundaries. The results of our experiments on spontaneous speech show that the automatically derived pronunciation dictionaries improve the recognition rate significantly compared with a conventional dictionary. (c) 1999 Elsevier Science B.V. All rights reserved.

Keywords: Pronunciation dictionary; Neural networks; Spontaneous speech; Speech recognition

1. Introduction

The creation of an appropriate pronunciation dictionary is widely acknowledged to be an important component of a speech recognition system. One of the earliest successful attempts, based on phonological rules, was made at IBM (Bahl et al., 1978). Generating a sophisticated pronunciation dictionary is still considered to be quite effective for improving system performance on large vocabulary continuous speech recognition (LVCSR) tasks (Lamel and Adda, 1996).
However, constructing a pronunciation dictionary manually or by a rule-based system requires time and expertise. Consequently, research efforts have been directed at constructing pronunciation dictionaries automatically. In the early 1990s, the emergence of phonetically transcribed (hand-labeled) medium-size databases (e.g. TIMIT and Resource Management) encouraged many researchers to explore pronunciation modeling (Randolph, 1990; Riley, 1991; Wooters and Stolcke, 1994). Although all of these approaches can automatically generate pronunciation rules, they require hand-labeled transcriptions by expert phoneticians. As a result, automatic phone transcriptions generated by a phoneme recognizer, which make it possible to cope with a large amount of training data, have been used in pronunciation modeling (Schmid et al., 1993; Sloboda, 1995; Imai et al., 1995; Humphries, 1997). Recently, LVCSR systems have started to treat spontaneous, conversational speech, such as the Switchboard corpus, and consequently pronunciation modeling has become an important topic, because word pronunciations vary more here than in read speech (Fosler et al., 1996; Weintraub et al., 1996; Byrne et al., 1997).

We have proposed a method for automatically generating a pronunciation dictionary on the basis of a spontaneous, conversational speech database (Fukada and Sagisaka, 1997). Our approach is based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation; most other approaches use decision trees for pronunciation modeling (Randolph, 1990; Riley, 1991; Fosler et al., 1996; Weintraub et al., 1996; Humphries, 1997; Byrne et al., 1997). In this paper, we mainly address the following issues in order to generate more sophisticated multiple pronunciations for improved speech recognition: (1) how to assign a score to a pronunciation variation; and (2) how to generate pronunciation variations for word-boundary phonemes.

We define canonical and realized pronunciations as follows.

Canonical pronunciation: Standard phoneme sequences assumed to be pronounced in read speech. Pronunciation variations such as speaker variability, dialect or coarticulation in conversational speech are not considered.

Realized pronunciation: Actual phoneme sequences pronounced in speech. Various pronunciation variations due to the speaker or to conversational speech can be included.

In the following sections, we first present training and generation procedures based on a pronunciation neural network. In Section 3, the proposed method is applied to the task of pronunciation dictionary generation for spontaneous speech recognition. Section 4 shows the results of recognition experiments and Section 5 gives a discussion of the presented work.

2. Automatic generation of a pronunciation dictionary

2.1. Pronunciation network

To predict realized pronunciations from a canonical pronunciation, we employ a multilayer perceptron as shown in Fig. 1. In this paper, a realized pronunciation A(m) for a canonical phoneme L(m) is predicted from the five phonemes (i.e. the quintphone) of the canonical pronunciation, L(m-2), ..., L(m+2). [Footnote 1: This network structure is similar to that employed in NETtalk (Sejnowski and Rosenberg, 1986), which can predict an English word pronunciation from its spelling. Note that the pronunciation network is designed to predict realized pronunciations, for the purpose of improving the performance of spontaneous speech recognition, while NETtalk is designed to predict canonical pronunciations for text-to-speech systems.]

Fig. 1. Pronunciation network.

Now we have two questions: (1) how to train a pronunciation network; and (2) how to generate multiple realized pronunciations using the trained pronunciation network. These questions are answered in the following sections.
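The paper gives no code for the network of Fig. 1. The following is a minimal NumPy sketch of such a multilayer perceptron, assuming the 130-100-53 sigmoid topology reported in Sections 2.2.2 and 3.2 (26 phonemes times five context positions at the input; 53 output units for deletion, substitution and insertion); class and function names are illustrative only.

```python
import numpy as np

# Hypothetical sketch of the pronunciation network of Fig. 1:
# a single-hidden-layer perceptron with sigmoid units, sized as
# reported in Section 3.2 (130 inputs, 100 hidden, 53 outputs).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PronunciationNetwork:
    def __init__(self, n_in=130, n_hidden=100, n_out=53, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights; +1 row in each matrix for the bias unit.
        self.w_ih = rng.normal(0.0, 0.1, size=(n_in + 1, n_hidden))
        self.w_ho = rng.normal(0.0, 0.1, size=(n_hidden + 1, n_out))

    def forward(self, x):
        """x: (130,) one-hot coding of the quintphone of canonical phonemes.
        Returns the 53 output activations in [0, 1]; they need not sum to one."""
        h = sigmoid(np.append(x, 1.0) @ self.w_ih)   # hidden layer with bias input
        y = sigmoid(np.append(h, 1.0) @ self.w_ho)   # output layer with bias input
        return y
```

With a bias unit feeding each layer, the weight count is 131 x 100 + 101 x 53 = 18,453, matching the figure given in Section 3.2.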
2.2. Training procedures

2.2.1. Training data preparation

To train a pronunciation network, we first have to prepare the training data, that is, input (canonical pronunciation) and output (realized pronunciation) pairs. The training data can be prepared by generating a realized pronunciation sequence and mapping it to the canonical pronunciation as follows.
1. Conduct phoneme recognition on the speech training data for dictionary generation. The recognized phoneme strings are taken as the realized pronunciation sequence.
2. Align the canonical pronunciation sequence to the realized pronunciation sequence using a dynamic programming algorithm.

For example, if the phoneme recognition result (i.e. the realized pronunciation) for the canonical pronunciation /a r a y u r u/ is /a w a u r i u/, the correspondence between the canonical pronunciation and the realized pronunciation can be determined as follows:

    a r a y u r u    (canonical pronunciation)
    a w a u r i u    (realized pronunciation)

where the second phoneme of the canonical pronunciation, /r/, is substituted by /w/, /y/ is deleted, and /i/ is inserted after the sixth phoneme of the canonical pronunciation, /r/. That is, L(2) = r, A(2) = w; L(4) = y, A(4) = x (deletion); L(6) = r, A(6) = {r, i} (/i/ is an insertion). Correctly recognized phonemes are also treated as substitutions (e.g. /a/ is substituted by /a/). Phoneme recognition is conducted on all of the training data, and the aligned results are used as the input and output data for the pronunciation neural network training (described in the following section). Note that both the phoneme recognition and the alignment are performed not for each word but for each utterance.

2.2.2. Structure of the pronunciation network

To train a pronunciation network, a context of five phonemes of the canonical pronunciation, L(m-2), ..., L(m+2), is given as the input; A(m), aligned to L(m), is given as the output. A total of 130 units (the 26 Japanese phonemes times five context positions) are used in the input layer. The representation of realized pronunciations at the output layer is localized, with one unit representing deletion, 26 units representing substitution, and 26 units representing insertion, for a total of 53 output units. [Footnote 2: In this paper, we do not treat insertions of more than two phonemes, because there are relatively very few of them and the number of weights can be reduced.]

In the previous example, the deletion, which corresponds to the fourth canonical phoneme /y/, is used as A(m), and /r a y u r/ is used as L(m-2), ..., L(m+2). Here, 1.0 is given to the deletion output unit and to the input units for /r/ in L(m-2), /a/ in L(m-1), /y/ in L(m), /u/ in L(m+1) and /r/ in L(m+2); 0.0 is given to all other input and output units.
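To make the coding scheme of Section 2.2.2 concrete, the sketch below builds one input/target pair from the aligned example of Section 2.2.1. The 26-phoneme inventory, the index layout of the 53 output units (deletion first, then substitution, then insertion) and all names are assumptions for illustration; the paper specifies only the unit counts.

```python
import numpy as np

# Hypothetical 26-phoneme Japanese inventory; the actual set is not listed in the paper.
PHONEMES = ['a', 'i', 'u', 'e', 'o', 'k', 'g', 's', 'z', 'sh', 't', 'd', 'ts',
            'ch', 'n', 'h', 'b', 'p', 'm', 'y', 'r', 'w', 'f', 'j', 'ng', 'q']
P2I = {p: i for i, p in enumerate(PHONEMES)}

def encode_input(quintphone):
    """130-dim one-hot coding of L(m-2)..L(m+2): 26 units per context position."""
    x = np.zeros(26 * 5)
    for pos, ph in enumerate(quintphone):
        x[pos * 26 + P2I[ph]] = 1.0
    return x

def encode_target(realized):
    """53-dim target: unit 0 = deletion, 1..26 = substitution, 27..52 = insertion.
    `realized` is 'x' for a deletion, a single phoneme for a substitution, or a
    (substituted, inserted) pair when one phoneme is inserted at position m."""
    t = np.zeros(53)
    if realized == 'x':                        # deletion
        t[0] = 1.0
    elif isinstance(realized, (list, tuple)):  # substitution plus single insertion
        t[1 + P2I[realized[0]]] = 1.0
        t[27 + P2I[realized[1]]] = 1.0
    else:                                      # plain substitution (incl. identity)
        t[1 + P2I[realized]] = 1.0
    return t

# Worked example from Section 2.2.1: canonical /a r a y u r u/, realized /a w a u r i u/.
x = encode_input(['r', 'a', 'y', 'u', 'r'])    # quintphone around L(4) = /y/
t = encode_target('x')                         # /y/ was deleted
```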
2.3. Generation procedures

2.3.1. Realized pronunciation generation

Assume that we want to find the best (i.e. most probable) realized pronunciation for a word W in terms of the pronunciation network outputs. Let the canonical pronunciation of W be denoted L = L(1), ..., L(|W|), where |W| is the number of phonemes of the canonical pronunciation (|W| >= 5). The realized pronunciation A = A(1), ..., A(|W|) for L can be obtained in the following steps.
1. Set i = 3, A(1) = L(1) and A(2) = L(2).
2. For the quintphone context of the ith phoneme, l = L(i-2), ..., L(i+2), input 1.0 to the corresponding input units of the pronunciation network.
3. Find the maximum unit U1_out over all of the output units.
   3.1. If U1_out is found among the substitution units, set A(i) to the phoneme of U1_out.
   3.2. If U1_out is found among the insertion units, find the maximum unit U2_out among the substitution units. Set A(i) to the phoneme pair given by U2_out and U1_out, in that order.
   3.3. If U1_out is the deletion unit, set A(i) = x.
4. Set i = i + 1.
5. Repeat steps 2 to 4 until i = |W| - 1.
6. Set A(|W|-1) = L(|W|-1) and A(|W|) = L(|W|).

2.3.2. Multiple pronunciations with likelihoods

Multiple realized pronunciations can be obtained by finding the N-best candidates based on the output values of the network. Suppose that the outputs for the canonical pronunciation /a r i g a t o o/ are obtained as shown in Table 1.

Table 1
Examples of inputs and outputs for the canonical pronunciation /a r i g a t o o/. /x/ denotes deletion
Input        Phoneme (raw output / normalized output)
             1st              2nd              3rd
a r i g a    i (0.9/1.0)      u (0.2/0.22)     o (0.1/0.11)
r i g a t    x (0.4/1.0)      g (0.3/0.75)     b (0.1/0.25)
i g a t o    a (0.8/1.0)      e (0.4/0.5)      o (0.2/0.25)
g a t o o    t (0.5/1.0)      d (0.3/0.6)      k (0.2/0.4)

Then, multiple realized pronunciations can be determined by multiplying the normalized outputs over all possible combinations and choosing the probable candidates. Although multiple pronunciations can be obtained by setting the number of candidates N, as in (Fukada and Sagisaka, 1997), in this paper we use a likelihood cut-off threshold on the multiplied normalized output. When the threshold is set to 0.4, we get the following six realized pronunciations for /a r i g a t o o/ (normalized pronunciation likelihoods are in brackets).
1. a r i a t o o (1.0)
2. a r i g a t o o (0.75)
3. a r i a d o o (0.6)
4. a r i e t o o (0.5)
5. a r i g a d o o (0.45)
6. a r i a k o o (0.4)
Note that in this example, 1.0 is given as the pronunciation likelihood for the word boundary phonemes, i.e. the beginning two phonemes (/a r/) and the ending two phonemes (/o o/).
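A minimal sketch of the enumeration in Section 2.3.2, using the normalized outputs of Table 1: per-position candidates are combined by multiplying their normalized likelihoods, and combinations below the cut-off threshold are discarded. Insertions are omitted for brevity and the data layout is illustrative; run as written, it reproduces the six pronunciations listed above.

```python
from itertools import product

# Normalized per-position candidates for /a r i g a t o o/ (Table 1).
# 'x' denotes deletion; boundary phonemes keep their canonical form with likelihood 1.0.
candidates = [
    [('a', 1.0)], [('r', 1.0)],                          # word-initial boundary phonemes
    [('i', 1.0), ('u', 0.22), ('o', 0.11)],
    [('x', 1.0), ('g', 0.75), ('b', 0.25)],
    [('a', 1.0), ('e', 0.5), ('o', 0.25)],
    [('t', 1.0), ('d', 0.6), ('k', 0.4)],
    [('o', 1.0)], [('o', 1.0)],                          # word-final boundary phonemes
]

def realized_pronunciations(candidates, threshold=0.4):
    results = []
    for combo in product(*candidates):
        score = 1.0
        for _, s in combo:
            score *= s                                   # multiply normalized outputs
        if score >= threshold:
            phones = [p for p, _ in combo if p != 'x']   # drop deleted phonemes
            results.append((' '.join(phones), score))
    return sorted(results, key=lambda r: -r[1])

for pron, score in realized_pronunciations(candidates):
    print(f'{pron}  ({score:.2f})')
```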
2.3.3. Integrating the pronunciation likelihood into speech recognition

In conventional speech recognition systems, the recognized word sequence W^, given an observation O, is obtained as W^ = argmax_W P(W | O). In this paper, we extend this formula by considering the realized pronunciation A for the word W as follows:

    \hat{W} = \arg\max_{W \in \mathcal{W}} \sum_{A \in W} P(A, W \mid O)    (1)

Using Bayes' rule, the right-hand side of Eq. (1) can be written as

    \arg\max_{W \in \mathcal{W}} \sum_{A \in W} P(O \mid A, W)\, P(W)\, P(A \mid W)    (2)

The first term in Eq. (2), P(O | A, W), is the probability of a sequence of acoustic observations, conditioned on the pronunciation and the word string. This probability can be computed using an acoustic model. The second term in Eq. (2), P(W), is the language model likelihood and can be computed using an n-gram word model. We call the third term in Eq. (2), P(A | W), the pronunciation model. In this paper, the pronunciation network is used as the pronunciation model.

We consider that multiple realized pronunciations mainly represent the pronunciation variability caused by speaker or context differences. That is, for a certain speaker and in a certain context, only one realized pronunciation can be taken for a word pronunciation. Therefore, we omit the summation in Eq. (2). Furthermore, by applying exponential weighting to the language probability and the pronunciation probability, the word sequence for the acoustic observation O is decoded according to the following equation:

    \arg\max_{W \in \mathcal{W},\, A \in W} P(O \mid A, W)\, P(W)^{\alpha}\, P(A \mid W)^{\beta}    (3)

where \alpha and \beta are weighting factors for the language model and the pronunciation model, respectively.

2.3.4. Realized pronunciations for word boundary phonemes

In the previous sections, as in our previous approach (Fukada and Sagisaka, 1997), the four phonemes at word boundaries, L(1), L(2), L(|W|-1) and L(|W|), are not predicted by the pronunciation network; the canonical pronunciations are simply used as the realized pronunciations for these phonemes, since the preceding and succeeding words of W are not known at the dictionary-generation stage. The optimal solution would be to apply the pronunciation network during the decoding stage to generate alternative pronunciations on the fly based on hypotheses, but this is technically difficult (Riley et al., 1995).

To avoid this, an N-best rescoring paradigm has been proposed, in which decision tree-based pronunciation models are applied to the hypotheses generated using the conventional dictionary (Fosler et al., 1996; Weintraub et al., 1996). Although this approach can evaluate pronunciation variations even for word boundary phonemes depending on the preceding and succeeding words, the improvement obtained is not significant. We suspect that this is mainly because the N-best hypotheses are obtained from a baseline dictionary, that is, the decision tree models are not applied in the first decoding pass.

In (Humphries, 1997), cross-word effects are roughly incorporated into the pronunciation modeling through the inclusion of word boundary information as an additional feature in the decision tree clustering. No improvement, however, is obtained by this implementation of cross-word pronunciation modeling. This implies that the contextual dependency for each word, such as a word A often being followed by a word B, has to be taken into account when predicting realized pronunciations for word boundary phonemes.

Therefore, we take the approach of generating a realized pronunciation dictionary whose variations cover not only within-word phonemes but also word boundary phonemes, and of using this dictionary in the first decoding pass. Pronunciation variations for word boundary phonemes can be taken into account on the basis of language statistics. As language statistics, we employ word bigram models here; their probabilities are used to generate realized pronunciations. Because word bigram models give all possible preceding and succeeding words and their frequencies for a certain word, the five-phoneme contexts (quintphones) of word boundary phonemes can be determined statistically.

Consider that we want to find realized pronunciations for the first canonical phoneme L_WC(1) of a word WC whose canonical pronunciation is L_WC = L_WC(1), ..., L_WC(|WC|), where |WC| is the number of phonemes of the canonical pronunciation. Let a word which can precede WC be denoted WP, with canonical pronunciation L_WP = L_WP(1), ..., L_WP(|WP|), where |WP| is the number of phonemes of its canonical pronunciation. Then the quintphone for L_WC(1) is fixed as L_WP(|WP|-1), L_WP(|WP|), L_WC(1), L_WC(2), L_WC(3), and the output values of the pronunciation network can be computed. By computing the output values for all possible preceding words of WC, the output value of the ith output unit, S_{WC,i}(1), is statistically computed as

    S_{W_C,i}(1) = \sum_{W_P \in \mathcal{W}} P(W_C \mid W_P)\, S_{W_C,W_P,i}(1)    (4)

where \mathcal{W} is the set of all possible words, P(W_C | W_P) is the conditional probability of W_C given W_P from the word bigram model, and S_{W_C,W_P,i}(1) is the output of the ith output unit computed from the quintphone input using W_C and W_P. Similarly, the output values for the other word boundary phonemes, e.g. L_WC(2), L_WC(|WC|-1) and L_WC(|WC|), can be computed statistically. Once the outputs for each output unit are computed, multiple realized pronunciations for WC can be obtained as shown in Section 2.3.2.
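The sketch below illustrates Eq. (4): the network outputs for the first phoneme of WC are summed over all possible preceding words WP, weighted by the bigram probability P(WC | WP). The lexicon and bigram data structures, and the reuse of the forward() and encode_input() helpers from the earlier sketches, are assumptions for illustration.

```python
import numpy as np

def boundary_outputs(net, wc, lexicon, bigram, encode_input):
    """Eq. (4): statistically weighted outputs for the first canonical phoneme of `wc`.

    net          -- trained pronunciation network (forward() as in the earlier sketch)
    wc           -- the word WC whose boundary phoneme is being predicted
    lexicon      -- dict: word -> canonical phoneme list (assumed |WP| >= 2, |WC| >= 3)
    bigram       -- dict: (wp, wc) -> conditional probability P(wc | wp)
    encode_input -- the 130-dim quintphone coder from the earlier sketch
    """
    l_wc = lexicon[wc]
    s = np.zeros(53)
    for wp, l_wp in lexicon.items():
        p = bigram.get((wp, wc), 0.0)
        if p == 0.0:
            continue                     # WP cannot precede WC; contributes nothing
        # Quintphone for L_WC(1): last two phonemes of WP, then L_WC(1..3).
        quint = [l_wp[-2], l_wp[-1], l_wc[0], l_wc[1], l_wc[2]]
        s += p * net.forward(encode_input(quint))
    return s
```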
2.3.5. Reliability weighting for pronunciation likelihoods

Although P(A | W) in Eq. (3), given as the normalized likelihood, can be used as the score for the pronunciation model, the reliability of P(A | W) for the following three kinds of realized pronunciations decreases in the following order: (1) obtained from the quintphone input (known), (2) obtained using language statistics (statistically known), and (3) substituted with the canonical pronunciation (unknown). Therefore, we introduce a modified pronunciation likelihood P'(A | W), computed by multiplying P(A | W) by a weighting factor k:

    P'(A \mid W) = k\, P(A \mid W)    (5)

where k is a |W|-dependent constant factor (0 < k < 1) defined as

    k = \frac{1}{|W|} \sum_{m=1}^{|W|} \prod_{i=-2}^{2} w_m(i)    (6)

where w_m(i) is defined heuristically as in Table 2. Here, the values for |i| = 2 are set larger than those for |i| = 1, because a phoneme two positions away from the center phoneme affects the output less than an adjacent phoneme.

Table 2
w_m(i) in Eq. (6)
Context               |i| = 0    |i| = 1    |i| = 2
Known                 1.0        1.0        1.0
Statistically known   -          0.8        0.9
Unknown               -          0.7        0.8
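A small sketch of the reliability weighting of Section 2.3.5, using the w_m(i) values of Table 2. The per-position, per-offset context labels ("known", "statistically known", "unknown") are assumed to be available from the dictionary-generation step, and the value used where Table 2 leaves a blank (|i| = 0 for the last two rows) is an assumption.

```python
# w_m(i) from Table 2, indexed by context status and |i| (offset from the center phoneme).
W_TABLE = {
    'known':               {0: 1.0, 1: 1.0, 2: 1.0},
    'statistically_known': {0: 1.0, 1: 0.8, 2: 0.9},   # |i| = 0 left blank in Table 2;
    'unknown':             {0: 1.0, 1: 0.7, 2: 0.8},   # 1.0 is assumed here
}

def reliability_factor(context_status):
    """Eq. (6): k = (1/|W|) * sum_m prod_{i=-2..2} w_m(i).

    context_status[m][i] gives the status of the context phoneme at offset i
    (i in -2..2) for position m of the word; len(context_status) == |W|.
    """
    total = 0.0
    for per_pos in context_status:
        prod = 1.0
        for i in (-2, -1, 0, 1, 2):
            prod *= W_TABLE[per_pos[i]][abs(i)]
        total += prod
    return total / len(context_status)

def weighted_likelihood(p_a_given_w, context_status):
    """Eq. (5): P'(A|W) = k * P(A|W)."""
    return reliability_factor(context_status) * p_a_given_w
```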
3. Pronunciation dictionary for spontaneous speech recognition

3.1. Conditions

A total of 230 speaker (100 male and 130 female) dialogues were used for the pronunciation network and acoustic model training. A 26-dimensional feature vector (12-dimensional mel-cepstrum plus power, and their derivatives) was computed using a 25.6 ms window duration and a 10 ms frame period. A set of 26 phonemes was used as the Japanese pronunciation representation.

Shared-state context-dependent HMMs (CD-HMMs) with five Gaussian mixture components per state (Ostendorf and Singer, 1997) were trained. The total number of states was set to 800. Using the CD-HMMs and Japanese syllabic constraints, phoneme recognition was performed on the training data. The phoneme sequences of the recognition results were taken as the realized pronunciations. For each utterance, these realized pronunciations were aligned to their canonical pronunciations, which had been transcribed by human experts.

3.2. Pronunciation network training

Canonical pronunciations with quintphone context and their corresponding realized pronunciations (about 120,000 samples in total) were used as the inputs and outputs for the pronunciation network training. The structure of the pronunciation network is shown in Fig. 1: 130 input units, 100 hidden units and 53 output units are used. There is also a bias that acts as an additional input constantly set to one. The total number of network weights including the biases is 18,453 (131 x 100 + 101 x 53). For the output and hidden units, the sigmoid function with the mean squared error (MSE) criterion is used, because each output produces a number between 0 and 1 but the outputs do not sum to one. The network was trained using 1000 batch iterations, and an intermediate network after 500 iterations was used in the following experiments. The differences in recognition performance for different numbers of iterations are discussed in Section 5.1. The phone recognition accuracy between the canonical pronunciation and the training data is 81.1%. To indicate how well the pronunciation network can predict pronunciation variation, we evaluated the network by the coincidence rate and by the MSE on the training data. Fig. 2 shows the coincidence rate between the target pronunciations and the estimated pronunciations (solid line), and the MSE between the targets and the estimates (dotted line), as functions of the number of training iterations. The coincidence rate between the target and the canonical pronunciation (shown as Original Correct in the figure) is 77.2%.

Fig. 2. Coincidence rates (solid line) and mean squared error (dotted line) of targets and estimates for the training data as a function of the number of training iterations. The coincidence rate for target and canonical pronunciation (Original Correct) is 77.2%.
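Section 3.2 states only that the network was trained with the sigmoid/MSE criterion for 1000 batch iterations. The following is a minimal full-batch gradient-descent sketch of one such update for the network of the earlier sketch; the learning rate and the use of plain gradient descent (rather than any momentum or second-order scheme) are assumptions.

```python
import numpy as np

def train_batch(net, X, T, lr=0.1):
    """One full-batch MSE gradient step for the two-layer sigmoid network.
    X: (N, 130) inputs, T: (N, 53) targets; updates net.w_ih / net.w_ho in place."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])       # append the bias input
    H = 1.0 / (1.0 + np.exp(-(Xb @ net.w_ih)))          # hidden activations
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    Y = 1.0 / (1.0 + np.exp(-(Hb @ net.w_ho)))          # outputs in [0, 1]

    # Backpropagation of the mean-squared-error criterion through the sigmoids.
    dY = (Y - T) * Y * (1.0 - Y)
    dH = (dY @ net.w_ho[:-1].T) * H * (1.0 - H)         # drop the bias row of w_ho
    net.w_ho -= lr * Hb.T @ dY / X.shape[0]
    net.w_ih -= lr * Xb.T @ dH / X.shape[0]
    return float(np.mean((Y - T) ** 2))                 # MSE for monitoring (cf. Fig. 2)
```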
3.3. Generation of the realized pronunciation dictionary

We applied the trained pronunciation network to the following two kinds of Japanese pronunciation dictionaries with 7484 word entries [Footnote 3: Multi-words, which were automatically generated by the language modeling (Masataki and Sagisaka, 1996), were also included in the entries.], developed for spontaneous speech recognition on a travel arrangement task (Nakamura et al., 1996).

Simple dictionary: Each word entry has a single standard canonical pronunciation. This dictionary is automatically generated from the word's reading represented by Japanese syllabic symbols (katakana characters). All entries can be followed by silence (i.e. a pause) in recognition.

Expert dictionary: This dictionary is constructed by human experts, considering pronunciation variabilities such as successive voicings [Footnote 4: Some Japanese word pronunciations change when a compound word is formed. For example, the conjunction of /k o d o m o/ (child) and /h e y a/ (room) is pronounced /k o d o m o b e y a/.], insertion and substitution of phonemes occurring in spontaneous speech, and possible insertions of a pause.

Table 3 shows examples of the pronunciations obtained for these dictionaries. // denotes silence; {|//} represents that both no silence and silence can be used.

Table 3
Examples of simple and expert pronunciation dictionaries
Example   Simple dictionary           Expert dictionary
1         h e y a {//}                {h|b} e y a {|//}
2         s u m i m a s e ng {//}     s u {{m|} i|ng} m a s e ng {|//}
3         h o sh i {//}               h o sh i

In example 2 of the expert dictionary, /s u {{m|} i|ng} m a s e ng {|//}/ represents the following six multiple pronunciations.
1. s u m i m a s e ng
2. s u i m a s e ng
3. s u ng m a s e ng
4. s u m i m a s e ng //
5. s u i m a s e ng //
6. s u ng m a s e ng //

By expanding the brackets shown in Table 3 as above and applying these canonical pronunciations to the pronunciation network, a realized pronunciation dictionary is automatically generated as described in Section 2.3. Table 4 shows the total number of multiple pronunciations for the 7484 word entries. In this table, Prop 1 and Prop 2 denote the proposed realized pronunciation dictionaries without and with the treatment of word boundary phonemes described in Section 2.3.4, respectively. The threshold for limiting the number of realized pronunciations was set to 0.4 in all cases. We used the expanded dictionaries as the baseline dictionaries in the following recognition experiments; note that the bracket expansion itself did not affect the recognition performance.

Table 4
Total number of multiple pronunciations for 7484 word entries
                                   Simple    Expert
Before bracket expansion           7484      7484
After bracket expansion            14,968    17,210
Prop 1 (w/o boundary phonemes)     28,663    33,198
Prop 2 (with boundary phonemes)    33,742    42,103

For example, the multiple realized pronunciations obtained from the pronunciation network for the word /w a z u k a/ are shown in Table 5. In this example, the word boundary phonemes (i.e. /w a/ and /k a/) are also processed by the pronunciation network.

Table 5
Example of realized pronunciations with normalized likelihoods for /w a z u k a/
Pronunciation   Normalized likelihood
w a z u k a     1.0
a z u k a       0.896
w a z u t a     0.662
a z u t a       0.593
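Before the pronunciation network is applied, the bracketed alternations of the expert dictionary (Table 3) are expanded into plain pronunciations. The recursive sketch below implements one plausible reading of that notation (space-separated phonemes, '|' for alternatives, nested braces, an empty alternative for omission, '//' for a pause); the parsing conventions are inferred from the examples and are not spelled out in the paper.

```python
def expand_brackets(entry):
    """Expand a dictionary entry with {a|b|...} alternations (possibly nested)
    into the list of plain pronunciations it represents."""
    # Find the first top-level {...} group.
    depth, start = 0, -1
    for pos, ch in enumerate(entry):
        if ch == '{':
            if depth == 0:
                start = pos
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                head, group, tail = entry[:start], entry[start + 1:pos], entry[pos + 1:]
                # Split the group on top-level '|' only, respecting nested braces.
                alts, buf, d = [], '', 0
                for c in group:
                    if c == '{':
                        d += 1
                    elif c == '}':
                        d -= 1
                    if c == '|' and d == 0:
                        alts.append(buf)
                        buf = ''
                    else:
                        buf += c
                alts.append(buf)
                out = []
                for alt in alts:
                    out.extend(expand_brackets(head + alt + tail))
                return out
    return [' '.join(entry.split())]        # no brackets left: normalize spacing

for p in expand_brackets('s u {{m|} i|ng} m a s e ng {|//}'):
    print(p)
```

Applied to example 2 of Table 3, this yields the six pronunciations listed above (in a different order).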
4. Spontaneous speech recognition experiments

To investigate the relative effectiveness of the proposed dictionaries generated in Section 3, we conducted continuous speech recognition experiments on a Japanese spontaneous speech database (Nakamura et al., 1996).

4.1. Experimental conditions

The same training data, front-end and acoustic model described in Section 3.1 were used. For the open test set, 42 speaker (17 male and 25 female) dialogues were used. Variable-order n-grams (Masataki and Sagisaka, 1996) were used as the language model. A multi-pass beam search technique was used for decoding (Shimizu et al., 1996). The language and pronunciation probability weights, \alpha and \beta in Eq. (3), were set equal.

Four kinds of realized pronunciation dictionaries were generated for each baseline dictionary (i.e. the simple and expert dictionaries):
1. A realized pronunciation dictionary with no pronunciation likelihoods and no language statistics for word boundary phonemes. This dictionary generation is equivalent to our previous approach (Fukada and Sagisaka, 1997) (Dict 1).
2. The same as Dict 1 but with the pronunciation likelihoods described in Section 2.3.2 (Dict 2).
3. A dictionary generated using the language statistics described in Section 2.3.4. The reliability weighting described in Section 2.3.5 is also applied (Dict 3).
4. The same as Dict 3 but with pronunciation likelihoods (Dict 4).
Note that the total numbers of multiple pronunciations for Dict 1 and Dict 2 are the same as Prop 1 in Table 4, and the totals for Dict 3 and Dict 4 are the same as Prop 2 in that table.

4.2. Recognition results

4.2.1. Simple dictionary

Recognition results in word error rate (WER, %) for the simple dictionary are shown in Table 6. All four proposed dictionaries outperform the baseline dictionary. Also, the application of pronunciation likelihoods or language statistics (Dict 2, Dict 3 or Dict 4) boosts the recognition performance over our previous approach (Dict 1). Note that the proposed dictionary generated using both pronunciation likelihoods and language statistics achieved about a 10% relative error reduction in word error rate compared to the baseline performance.

Table 6
Recognition results for the simple dictionary
Dictionary   Likelihood   Language stat.   WER (%)
Baseline     -            -                34.5
Dict 1       No           No               33.2
Dict 2       Yes          No               31.2
Dict 3       No           Yes              32.9
Dict 4       Yes          Yes              31.1

4.2.2. Expert dictionary

Recognition results in WER (%) for the expert dictionary are shown in Table 7. First, comparing the two baseline results for the simple and expert dictionaries in Tables 6 and 7, the expert dictionary (29.0%) is significantly superior to the simple dictionary (34.5%). Second, applying the proposed method to the baseline expert dictionary yields improvements similar to those achieved with the simple dictionary. Note that the proposed dictionary with pronunciation likelihoods and language statistics (Dict 4) achieved about a 9% relative error reduction in WER compared to the baseline performance.

Table 7
Recognition results for the expert dictionary
Dictionary   Likelihood   Language stat.   WER (%)
Baseline     -            -                29.0
Dict 1       No           No               27.9
Dict 2       Yes          No               26.0
Dict 3       No           Yes              27.4
Dict 4       Yes          Yes              26.4

4.2.3. Error analysis

From Tables 6 and 7, the proposed dictionaries gave consistently better performance than the baseline dictionaries. We also found from the recognition results that the numbers of insertion and substitution errors for the proposed dictionaries decreased significantly compared to the baseline dictionaries. We believe this is because the proposed dictionary reduces the errors in which a long word is incorrectly recognized as a sequence of short words, or in which a correct word is substituted by another word when the actual pronunciation differs slightly from that in the baseline dictionary.
5. Discussion

5.1. Number of iterations for NN training

The WER and the total number of realized pronunciations as functions of the number of neural network training iterations (50, 100, 200, 500 and 1000) are shown in Fig. 3. The experimental conditions were the same as those described in Section 4.1, except that the threshold for the normalized likelihood was set to 0.5. The baseline expert dictionary was used for generating the realized pronunciations. No pronunciation likelihoods or language statistics were used in this experiment. From these results, it can be seen that the WER decreased up to 500 iterations and then saturated, while the number of realized pronunciations kept increasing. Note that all created dictionaries outperformed the baseline dictionary.

Fig. 3. Word error rate and total number of realized pronunciations as functions of the number of neural network training iterations.

5.2. Probability cut-off threshold

The WER and the total number of realized pronunciations as functions of the threshold (in steps of 0.1 from 0.2 to 1.0) for limiting the number of realized pronunciations are shown in Fig. 4. The experimental conditions were the same as those described in Section 4.1. The baseline expert dictionary was used for generating the realized pronunciations. No pronunciation likelihoods or language statistics were used in this experiment. From these results, we can see that the WER shows a flat minimum around thresholds of 0.4-0.6; with smaller thresholds the WER becomes higher and the number of pronunciations grows exponentially. Again, note that all created dictionaries outperformed the baseline dictionary.

Fig. 4. Word error rate and total number of realized pronunciations as functions of the threshold for limiting the number of pronunciations.

5.3. Computational requirements

As the proposed dictionaries have significantly more entries than the baseline dictionaries (see Table 4), we investigated the computational requirements of the experiments run with the expert dictionary. The computational requirements and pronunciation entries for the expert dictionaries, normalized against those of the baseline system, are shown in Table 8. From these results, although the proposed dictionaries required some additional search time compared to the baseline system, the increase in search time was much smaller than the increase in the number of pronunciation entries in the dictionary. We believe this is because the proposed realized pronunciation dictionaries appropriately represent the pronunciations that actually occur in conversational speech. Note that the computational requirements of realized pronunciations depend strongly on the recognizer's representation of pronunciations. In our case, the multiple pronunciations are represented in a network with sharing of common phonemes, which is much faster than a linear lexicon.

Table 8
Computational requirements and pronunciation entries for the expert dictionaries, normalized against those of the baseline system
Dictionary   CPU time   Entries
Dict 1       1.04       1.93
Dict 2       1.20       1.93
Dict 3       1.19       2.45
Dict 4       1.33       2.45

6. Conclusion

In this paper, a method for automatically generating a pronunciation dictionary based on a pronunciation neural network has been proposed. We focused on two techniques: (1) realized pronunciations with likelihoods based on the neural network outputs, and (2) realized pronunciations for word boundary phonemes using word bigram-based language statistics; both are more sophisticated generation methods than our previous approach (Fukada and Sagisaka, 1997). Experimental results on spontaneous speech recognition show that the automatically derived pronunciation dictionaries give higher recognition rates than the conventional dictionary. We also confirmed that the proposed method can enhance the recognition performance of a dictionary constructed on the basis of expertise. In this paper, only a quintphone context is used for predicting pronunciation variations; that is, words whose quintphone contexts are the same have the same pronunciation variations. However, other factors (e.g. part-of-speech) can easily be incorporated into the pronunciation network by adding units for these factors. Although the proposed method requires a fixed input window (i.e. a context of five phonemes), this requirement could be relaxed by adding word boundary phones (pad phones) to the beginning and end of the word. In addition, we expect the multiple pronunciation dictionary to be a useful resource for acoustic model retraining by realigning the training data (Sloboda, 1995; Fosler et al., 1996; Byrne et al., 1997).
References

Bahl, L., Baker, J., Cohen, P., Jelinek, F., Lewis, B., Mercer, R., 1978. Recognition of a continuously read natural corpus. In: Proc. ICASSP-78, pp. 422-424.
Byrne, B., Finke, M., Khudanpur, S., McDonough, J., Nock, H., Riley, M., Saraclar, M., Wooters, C., Zavaliagkos, G., 1997. Pronunciation modelling for conversational speech recognition: A status report from WS97. In: Proc. 1997 IEEE Workshop on Speech Recognition and Understanding.
Fosler, E., Weintraub, M., Wegmann, S., Kao, Y.-H., Khudanpur, S., Galles, C., Saraclar, M., 1996. Automatic learning of word pronunciation from data. In: Proc. ICSLP-96, pp. 28-29 (addendum).
Fukada, T., Sagisaka, Y., 1997. Automatic generation of a pronunciation dictionary based on a pronunciation network. In: Proc. EUROSPEECH-97, pp. 2471-2474.
Humphries, J., 1997. Accent modelling and adaptation in automatic speech recognition. Ph.D. Thesis, University of Cambridge, Cambridge.
Imai, T., Ando, A., Miyasaka, E., 1995. A new method for automatic generation of speaker-dependent phonological rules. In: Proc. ICASSP-95, pp. 864-867.
Lamel, L., Adda, G., 1996. On designing pronunciation lexicons for large vocabulary, continuous speech recognition. In: Proc. ICSLP-96, pp. 6-9.
Masataki, H., Sagisaka, Y., 1996. Variable-order n-gram generation by word-class splitting and consecutive word grouping. In: Proc. ICASSP-96, pp. 188-191.
Nakamura, A., Matsunaga, S., Shimizu, T., Tonomura, M., Sagisaka, Y., 1996. Japanese speech databases for robust speech recognition. In: Proc. ICSLP-96, pp. 2199-2202.
Ostendorf, M., Singer, H., 1997. HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language 11, 17-41.
Randolph, M., 1990. A data-driven method for discovering and predicting allophonic variation. In: Proc. ICASSP-90, pp. 1177-1180.
Riley, M., 1991. A statistical model for generating pronunciation networks. In: Proc. ICASSP-91, pp. 737-740.
Riley, M., Pereira, F., Chung, E., 1995. Lazy transducer composition: A flexible method for on-the-fly expansion of context-dependent grammar networks. In: Proc. IEEE Automatic Speech Recognition Workshop, pp. 139-140.
Schmid, P., Cole, R., Fanty, M., 1993. Automatically generated word pronunciations from phoneme classifier output. In: Proc. ICASSP-93, pp. II-223-II-226.
Sejnowski, T., Rosenberg, C., 1986. NETtalk: A parallel network that learns to read aloud. The Johns Hopkins University, Electrical Engineering and Computer Science Tech. Report JHU/EECS-86/01.
Shimizu, T., Yamamoto, H., Masataki, H., Matsunaga, S., Sagisaka, Y., 1996. Spontaneous dialogue speech recognition using cross-word context constrained word graphs. In: Proc. ICASSP-96, pp. 145-148.
Sloboda, T., 1995. Dictionary learning: performance through consistency. In: Proc. ICASSP-95, pp. 453-456.
Weintraub, M., Fosler, E., Galles, C., Kao, Y.-H., Khudanpur, S., Saraclar, M., Wegmann, S., 1996. Automatic learning of word pronunciation from data. JHU Workshop-96 Project Report.
Wooters, C., Stolcke, A., 1994. Multiple-pronunciation lexical modeling in a speaker independent speech understanding system. In: Proc. ICSLP-94, pp. 1363-1366.
