Automatic generation of multiple pronunciations based on neural networks

Toshiaki Fukada *, Takayoshi Yoshimura, Yoshinori Sagisaka
ATR Interpreting Telecommunications Research Laboratory, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

Received 3 April 1998; accepted 24 September 1998

Speech Communication 27 (1999) 63-73

* Corresponding author. Tel.: +81 774 95 1301; fax: +81 774 95 1308; e-mail: fukada@itl.atr.co.jp

Abstract

We propose a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation. This method can generate multiple forms of realized pronunciations using the pronunciation network. For generating a sophisticated realized pronunciation dictionary, two techniques are described: (1) realized pronunciations with likelihoods and (2) realized pronunciations for word boundary phonemes. Experimental results on spontaneous speech show that the automatically derived pronunciation dictionaries give consistently higher recognition rates than a conventional dictionary. (c) 1999 Elsevier Science B.V. All rights reserved.

Zusammenfassung (German abstract, translated)

We propose a method for the automatic generation of a pronunciation lexicon whose pronunciation variants are generated by a neural network. The neural network can predict plausible alternative pronunciation variants from the original (canonical) pronunciations. Two techniques for the automatic generation of such a pronunciation lexicon are described here: (1) the generation of alternative pronunciation variants with likelihoods, and (2) the generation of alternative pronunciation variants for phonemes at word boundaries. Recognition experiments on spontaneous speech show that the use of automatically generated pronunciation lexica yields higher recognition rates than the use of the standard pronunciation lexicon. (c) 1999 Elsevier Science B.V. All rights reserved.

Resume (French abstract, translated)

We propose a method for automatically generating a pronunciation dictionary using a neural network that predicts the most plausible pronunciations (variants) from a standard pronunciation. The pronunciation network can generate multiple pronunciation variants. To generate a sophisticated dictionary of pronunciation variants, two techniques are described: (1) pronunciation variants with likelihood values, and (2) pronunciation variants for phonemes at word boundaries. The results of our experiments on spontaneous speech show that the automatically derived pronunciation dictionaries improve the recognition rate significantly compared with a conventional dictionary. (c) 1999 Elsevier Science B.V. All rights reserved.

Keywords: Pronunciation dictionary; Neural networks; Spontaneous speech; Speech recognition

1. Introduction

The creation of an appropriate pronunciation dictionary is widely acknowledged to be an important component of a speech recognition system. One of the earliest successful attempts, based on phonological rules, was made at IBM (Bahl et al., 1978). Generating a sophisticated pronunciation dictionary is still considered to be quite effective for improving system performance on large vocabulary continuous speech recognition (LVCSR) tasks (Lamel and Adda, 1996).
However, constructing a pronunciation dictionary manually or by a rule-based system requires time and expertise. Consequently, research efforts have been directed at constructing pronunciation dictionaries automatically. In the early 1990s, the emergence of phonetically transcribed (hand-labeled) medium-size databases (e.g. TIMIT and Resource Management) encouraged many researchers to explore pronunciation modeling (Randolph, 1990; Riley, 1991; Wooters and Stolcke, 1994). Although all of these approaches can automatically generate pronunciation rules, they require hand-labeled transcriptions by expert phoneticians. As a result, automatic phone transcriptions generated by a phoneme recognizer, which make it possible to cope with a large amount of training data, have been used in pronunciation modeling (Schmid et al., 1993; Sloboda, 1995; Imai et al., 1995; Humphries, 1997). Recently, LVCSR systems have started to treat spontaneous, conversational speech, such as the Switchboard corpus, and consequently pronunciation modeling has become an important topic, because word pronunciations vary more here than in read speech (Fosler et al., 1996; Weintraub et al., 1996; Byrne et al., 1997).

We have proposed a method for automatically generating a pronunciation dictionary on the basis of a spontaneous, conversational speech database (Fukada and Sagisaka, 1997). Our approach is based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation; most other approaches use decision trees for pronunciation modeling (Randolph, 1990; Riley, 1991; Fosler et al., 1996; Weintraub et al., 1996; Humphries, 1997; Byrne et al., 1997). In this paper, we mainly address the following issues in order to generate more sophisticated multiple pronunciations for improved speech recognition: (1) how to assign a score to a pronunciation variation; and (2) how to generate pronunciation variations for word-boundary phonemes.

We define canonical and realized pronunciations as follows.

Canonical pronunciation: Standard phoneme sequences assumed to be pronounced in read speech. Pronunciation variations such as speaker variability, dialect or coarticulation in conversational speech are not considered.

Realized pronunciation: Actual phoneme sequences pronounced in speech. Various pronunciation variations due to the speaker or to conversational speech can be included.

In the following sections, we first present training and generation procedures based on a pronunciation neural network. In Section 3, the proposed method is applied to the task of pronunciation dictionary generation for spontaneous speech recognition. Section 4 shows the results of recognition experiments and Section 5 gives a discussion of the presented work.

2. Automatic generation of a pronunciation dictionary

2.1. Pronunciation network

To predict realized pronunciations from a canonical pronunciation, we employ a multilayer perceptron as shown in Fig. 1. In this paper, a realized pronunciation A(m) for a canonical phoneme L(m) is predicted from the five phonemes (i.e. the quintphone) of the canonical pronunciation, L(m-2), ..., L(m+2). [Footnote 1: This network structure is similar to that employed in NETtalk (Sejnowski and Rosenberg, 1986), which can predict an English word pronunciation from its spelling. Note that the pronunciation network is designed to predict realized pronunciations, for the purpose of improving the performance of spontaneous speech recognition, while NETtalk is designed to predict canonical pronunciations for text-to-speech systems.]

Fig. 1. Pronunciation network.

Now we have two questions: (1) how to train a pronunciation network; and (2) how to generate multiple realized pronunciations using the trained pronunciation network. These questions are answered in the following sections.
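The paper gives no code for the network of Fig. 1. The following is a minimal NumPy sketch of such a multilayer perceptron, assuming the 130-100-53 sigmoid topology reported in Sections 2.2.2 and 3.2 (26 phonemes times five context positions at the input; 53 output units for deletion, substitution and insertion); class and function names are illustrative only.

```python
import numpy as np

# Hypothetical sketch of the pronunciation network of Fig. 1:
# a single-hidden-layer perceptron with sigmoid units, sized as
# reported in Section 3.2 (130 inputs, 100 hidden, 53 outputs).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PronunciationNetwork:
    def __init__(self, n_in=130, n_hidden=100, n_out=53, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights; +1 row in each matrix for the bias unit.
        self.w_ih = rng.normal(0.0, 0.1, size=(n_in + 1, n_hidden))
        self.w_ho = rng.normal(0.0, 0.1, size=(n_hidden + 1, n_out))

    def forward(self, x):
        """x: (130,) one-hot coding of the quintphone of canonical phonemes.
        Returns the 53 output activations in [0, 1]; they need not sum to one."""
        h = sigmoid(np.append(x, 1.0) @ self.w_ih)   # hidden layer with bias input
        y = sigmoid(np.append(h, 1.0) @ self.w_ho)   # output layer with bias input
        return y
```

With a bias unit feeding each layer, the weight count is 131 x 100 + 101 x 53 = 18,453, matching the figure given in Section 3.2.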
2.2. Training procedures

2.2.1. Training data preparation

To train a pronunciation network, we first have to prepare the training data, that is, input (canonical pronunciation) and output (realized pronunciation) pairs. The training data can be prepared by generating a realized pronunciation sequence and mapping it to the canonical pronunciation as follows.
1. Conduct phoneme recognition on the speech training data for dictionary generation. The recognized phoneme strings are taken as the realized pronunciation sequence.
2. Align the canonical pronunciation sequence to the realized pronunciation sequence using a dynamic programming algorithm.

For example, if the phoneme recognition result (i.e. the realized pronunciation) for the canonical pronunciation /a r a y u r u/ is /a w a u r i u/, the correspondence between the canonical pronunciation and the realized pronunciation can be determined as follows:

    a r a y u r u    (canonical pronunciation)
    a w a u r i u    (realized pronunciation)

where the second phoneme of the canonical pronunciation, /r/, is substituted by /w/, /y/ is deleted, and /i/ is inserted after the sixth phoneme of the canonical pronunciation, /r/. That is, L(2) = r, A(2) = w; L(4) = y, A(4) = x (deletion); L(6) = r, A(6) = {r, i} (/i/ is an insertion). Correctly recognized phonemes are also treated as substitutions (e.g. /a/ is substituted by /a/). Phoneme recognition is conducted on all of the training data, and the aligned results are used as the input and output data for the pronunciation neural network training (described in the following section). Note that both the phoneme recognition and the alignment are performed not for each word but for each utterance.

2.2.2. Structure of the pronunciation network

To train a pronunciation network, a context of five phonemes of the canonical pronunciation, L(m-2), ..., L(m+2), is given as the input; A(m), aligned to L(m), is given as the output. A total of 130 units (the 26 Japanese phonemes times five context positions) are used in the input layer. The representation of realized pronunciations at the output layer is localized, with one unit representing deletion, 26 units representing substitution, and 26 units representing insertion, for a total of 53 output units. [Footnote 2: In this paper, we do not treat insertions of more than two phonemes, because there are relatively very few of them and the number of weights can be reduced.]

In the previous example, the deletion, which corresponds to the fourth canonical phoneme /y/, is used as A(m), and /r a y u r/ is used as L(m-2), ..., L(m+2). Here, 1.0 is given to the deletion output unit and to the input units for /r/ in L(m-2), /a/ in L(m-1), /y/ in L(m), /u/ in L(m+1) and /r/ in L(m+2); 0.0 is given to all other input and output units.
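To make the coding scheme of Section 2.2.2 concrete, the sketch below builds one input/target pair from the aligned example of Section 2.2.1. The 26-phoneme inventory, the index layout of the 53 output units (deletion first, then substitution, then insertion) and all names are assumptions for illustration; the paper specifies only the unit counts.

```python
import numpy as np

# Hypothetical 26-phoneme Japanese inventory; the actual set is not listed in the paper.
PHONEMES = ['a', 'i', 'u', 'e', 'o', 'k', 'g', 's', 'z', 'sh', 't', 'd', 'ts',
            'ch', 'n', 'h', 'b', 'p', 'm', 'y', 'r', 'w', 'f', 'j', 'ng', 'q']
P2I = {p: i for i, p in enumerate(PHONEMES)}

def encode_input(quintphone):
    """130-dim one-hot coding of L(m-2)..L(m+2): 26 units per context position."""
    x = np.zeros(26 * 5)
    for pos, ph in enumerate(quintphone):
        x[pos * 26 + P2I[ph]] = 1.0
    return x

def encode_target(realized):
    """53-dim target: unit 0 = deletion, 1..26 = substitution, 27..52 = insertion.
    `realized` is 'x' for a deletion, a single phoneme for a substitution, or a
    (substituted, inserted) pair when one phoneme is inserted at position m."""
    t = np.zeros(53)
    if realized == 'x':                        # deletion
        t[0] = 1.0
    elif isinstance(realized, (list, tuple)):  # substitution plus single insertion
        t[1 + P2I[realized[0]]] = 1.0
        t[27 + P2I[realized[1]]] = 1.0
    else:                                      # plain substitution (incl. identity)
        t[1 + P2I[realized]] = 1.0
    return t

# Worked example from Section 2.2.1: canonical /a r a y u r u/, realized /a w a u r i u/.
x = encode_input(['r', 'a', 'y', 'u', 'r'])    # quintphone around L(4) = /y/
t = encode_target('x')                         # /y/ was deleted
```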
2.3. Generation procedures

2.3.1. Realized pronunciation generation

Assume that we want to find the best (i.e. most probable) realized pronunciation for a word W in terms of the pronunciation network outputs. Let the canonical pronunciation of W be denoted L = L(1), ..., L(|W|), where |W| is the number of phonemes of the canonical pronunciation (|W| >= 5). The realized pronunciation A = A(1), ..., A(|W|) for L can be obtained in the following steps.
1. Set i = 3, A(1) = L(1) and A(2) = L(2).
2. For the quintphone context of the ith phoneme, l = L(i-2), ..., L(i+2), input 1.0 to the corresponding input units of the pronunciation network.
3. Find the maximum unit U1_out over all of the output units.
   3.1. If U1_out is found among the substitution units, set A(i) to the phoneme of U1_out.
   3.2. If U1_out is found among the insertion units, find the maximum unit U2_out among the substitution units. Set A(i) to the phoneme pair given by U2_out and U1_out, in that order.
   3.3. If U1_out is the deletion unit, set A(i) = x.
4. Set i = i + 1.
5. Repeat steps 2 to 4 until i = |W| - 1.
6. Set A(|W|-1) = L(|W|-1) and A(|W|) = L(|W|).

2.3.2. Multiple pronunciations with likelihoods

Multiple realized pronunciations can be obtained by finding the N-best candidates based on the output values of the network. Suppose that the outputs for the canonical pronunciation /a r i g a t o o/ are obtained as shown in Table 1.

Table 1
Examples of inputs and outputs for the canonical pronunciation /a r i g a t o o/. /x/ denotes deletion
Input        Phoneme (raw output / normalized output)
             1st              2nd              3rd
a r i g a    i (0.9/1.0)      u (0.2/0.22)     o (0.1/0.11)
r i g a t    x (0.4/1.0)      g (0.3/0.75)     b (0.1/0.25)
i g a t o    a (0.8/1.0)      e (0.4/0.5)      o (0.2/0.25)
g a t o o    t (0.5/1.0)      d (0.3/0.6)      k (0.2/0.4)

Then, multiple realized pronunciations can be determined by multiplying the normalized outputs over all possible combinations and choosing the probable candidates. Although multiple pronunciations can be obtained by setting the number of candidates N, as in (Fukada and Sagisaka, 1997), in this paper we use a likelihood cut-off threshold on the multiplied normalized output. When the threshold is set to 0.4, we get the following six realized pronunciations for /a r i g a t o o/ (normalized pronunciation likelihoods are in brackets).
1. a r i a t o o (1.0)
2. a r i g a t o o (0.75)
3. a r i a d o o (0.6)
4. a r i e t o o (0.5)
5. a r i g a d o o (0.45)
6. a r i a k o o (0.4)
Note that in this example, 1.0 is given as the pronunciation likelihood for the word boundary phonemes, i.e. the beginning two phonemes (/a r/) and the ending two phonemes (/o o/).
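A minimal sketch of the enumeration in Section 2.3.2, using the normalized outputs of Table 1: per-position candidates are combined by multiplying their normalized likelihoods, and combinations below the cut-off threshold are discarded. Insertions are omitted for brevity and the data layout is illustrative; run as written, it reproduces the six pronunciations listed above.

```python
from itertools import product

# Normalized per-position candidates for /a r i g a t o o/ (Table 1).
# 'x' denotes deletion; boundary phonemes keep their canonical form with likelihood 1.0.
candidates = [
    [('a', 1.0)], [('r', 1.0)],                          # word-initial boundary phonemes
    [('i', 1.0), ('u', 0.22), ('o', 0.11)],
    [('x', 1.0), ('g', 0.75), ('b', 0.25)],
    [('a', 1.0), ('e', 0.5), ('o', 0.25)],
    [('t', 1.0), ('d', 0.6), ('k', 0.4)],
    [('o', 1.0)], [('o', 1.0)],                          # word-final boundary phonemes
]

def realized_pronunciations(candidates, threshold=0.4):
    results = []
    for combo in product(*candidates):
        score = 1.0
        for _, s in combo:
            score *= s                                   # multiply normalized outputs
        if score >= threshold:
            phones = [p for p, _ in combo if p != 'x']   # drop deleted phonemes
            results.append((' '.join(phones), score))
    return sorted(results, key=lambda r: -r[1])

for pron, score in realized_pronunciations(candidates):
    print(f'{pron}  ({score:.2f})')
```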
2.3.3. Integrating the pronunciation likelihood into speech recognition

In conventional speech recognition systems, the recognized word sequence W^, given an observation O, is obtained as W^ = argmax_W P(W | O). In this paper, we extend this formula by considering the realized pronunciation A for the word W as follows:

    \hat{W} = \arg\max_{W \in \mathcal{W}} \sum_{A \in W} P(A, W \mid O)    (1)

Using Bayes' rule, the right-hand side of Eq. (1) can be written as

    \arg\max_{W \in \mathcal{W}} \sum_{A \in W} P(O \mid A, W)\, P(W)\, P(A \mid W)    (2)

The first term in Eq. (2), P(O | A, W), is the probability of a sequence of acoustic observations, conditioned on the pronunciation and the word string. This probability can be computed using an acoustic model. The second term in Eq. (2), P(W), is the language model likelihood and can be computed using an n-gram word model. We call the third term in Eq. (2), P(A | W), the pronunciation model. In this paper, the pronunciation network is used as the pronunciation model.

We consider that multiple realized pronunciations mainly represent the pronunciation variability caused by speaker or context differences. That is, for a certain speaker and in a certain context, only one realized pronunciation can be taken for a word pronunciation. Therefore, we omit the summation in Eq. (2). Furthermore, by applying exponential weighting to the language probability and the pronunciation probability, the word sequence for the acoustic observation O is decoded according to the following equation:

    \arg\max_{W \in \mathcal{W},\, A \in W} P(O \mid A, W)\, P(W)^{\alpha}\, P(A \mid W)^{\beta}    (3)

where \alpha and \beta are weighting factors for the language model and the pronunciation model, respectively.

2.3.4. Realized pronunciations for word boundary phonemes

In the previous sections, as in our previous approach (Fukada and Sagisaka, 1997), the four phonemes at word boundaries, L(1), L(2), L(|W|-1) and L(|W|), are not predicted by the pronunciation network; the canonical pronunciations are simply used as the realized pronunciations for these phonemes, since the preceding and succeeding words of W are not known at the dictionary-generation stage. The optimal solution would be to apply the pronunciation network during the decoding stage to generate alternative pronunciations on the fly based on hypotheses, but this is technically difficult (Riley et al., 1995).

To avoid this, an N-best rescoring paradigm has been proposed, in which decision tree-based pronunciation models are applied to the hypotheses generated using the conventional dictionary (Fosler et al., 1996; Weintraub et al., 1996). Although this approach can evaluate pronunciation variations even for word boundary phonemes depending on the preceding and succeeding words, the improvement obtained is not significant. We suspect that this is mainly because the N-best hypotheses are obtained from a baseline dictionary, that is, the decision tree models are not applied in the first decoding pass.

In (Humphries, 1997), cross-word effects are roughly incorporated into the pronunciation modeling through the inclusion of word boundary information as an additional feature in the decision tree clustering. No improvement, however, is obtained by this implementation of cross-word pronunciation modeling. This implies that the contextual dependency for each word, such as a word A often being followed by a word B, has to be taken into account when predicting realized pronunciations for word boundary phonemes.

Therefore, we take the approach of generating a realized pronunciation dictionary whose variations cover not only within-word phonemes but also word boundary phonemes, and of using this dictionary in the first decoding pass. Pronunciation variations for word boundary phonemes can be taken into account on the basis of language statistics. As language statistics, we employ word bigram models here; their probabilities are used to generate realized pronunciations. Because word bigram models give all possible preceding and succeeding words and their frequencies for a certain word, the five-phoneme contexts (quintphones) of word boundary phonemes can be determined statistically.

Consider that we want to find realized pronunciations for the first canonical phoneme L_WC(1) of a word WC whose canonical pronunciation is L_WC = L_WC(1), ..., L_WC(|WC|), where |WC| is the number of phonemes of the canonical pronunciation. Let a word which can precede WC be denoted WP, with canonical pronunciation L_WP = L_WP(1), ..., L_WP(|WP|), where |WP| is the number of phonemes of its canonical pronunciation. Then the quintphone for L_WC(1) is fixed as L_WP(|WP|-1), L_WP(|WP|), L_WC(1), L_WC(2), L_WC(3), and the output values of the pronunciation network can be computed. By computing the output values for all possible preceding words of WC, the output value of the ith output unit, S_{WC,i}(1), is statistically computed as

    S_{W_C,i}(1) = \sum_{W_P \in \mathcal{W}} P(W_C \mid W_P)\, S_{W_C,W_P,i}(1)    (4)

where \mathcal{W} is the set of all possible words, P(W_C | W_P) is the conditional probability of W_C given W_P from the word bigram model, and S_{W_C,W_P,i}(1) is the output of the ith output unit computed from the quintphone input using W_C and W_P. Similarly, the output values for the other word boundary phonemes, e.g. L_WC(2), L_WC(|WC|-1) and L_WC(|WC|), can be computed statistically. Once the outputs for each output unit are computed, multiple realized pronunciations for WC can be obtained as shown in Section 2.3.2.
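The sketch below illustrates Eq. (4): the network outputs for the first phoneme of WC are summed over all possible preceding words WP, weighted by the bigram probability P(WC | WP). The lexicon and bigram data structures, and the reuse of the forward() and encode_input() helpers from the earlier sketches, are assumptions for illustration.

```python
import numpy as np

def boundary_outputs(net, wc, lexicon, bigram, encode_input):
    """Eq. (4): statistically weighted outputs for the first canonical phoneme of `wc`.

    net          -- trained pronunciation network (forward() as in the earlier sketch)
    wc           -- the word WC whose boundary phoneme is being predicted
    lexicon      -- dict: word -> canonical phoneme list (assumed |WP| >= 2, |WC| >= 3)
    bigram       -- dict: (wp, wc) -> conditional probability P(wc | wp)
    encode_input -- the 130-dim quintphone coder from the earlier sketch
    """
    l_wc = lexicon[wc]
    s = np.zeros(53)
    for wp, l_wp in lexicon.items():
        p = bigram.get((wp, wc), 0.0)
        if p == 0.0:
            continue                     # WP cannot precede WC; contributes nothing
        # Quintphone for L_WC(1): last two phonemes of WP, then L_WC(1..3).
        quint = [l_wp[-2], l_wp[-1], l_wc[0], l_wc[1], l_wc[2]]
        s += p * net.forward(encode_input(quint))
    return s
```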
2.3.5. Reliability weighting for pronunciation likelihoods

Although P(A | W) in Eq. (3), given as the normalized likelihood, can be used as the score for the pronunciation model, the reliability of P(A | W) for the following three kinds of realized pronunciations decreases in the following order: (1) obtained from the quintphone input (known), (2) obtained using language statistics (statistically known), and (3) substituted with the canonical pronunciation (unknown). Therefore, we introduce a modified pronunciation likelihood P'(A | W), computed by multiplying P(A | W) by a weighting factor k:

    P'(A \mid W) = k\, P(A \mid W)    (5)

where k is a |W|-dependent constant factor (0 < k < 1) defined as

    k = \frac{1}{|W|} \sum_{m=1}^{|W|} \prod_{i=-2}^{2} w_m(i)    (6)

where w_m(i) is defined heuristically as in Table 2. Here, the values for |i| = 2 are set larger than those for |i| = 1, because a phoneme two positions away from the center phoneme affects the output less than an adjacent phoneme.

Table 2
w_m(i) in Eq. (6)
Context               |i| = 0    |i| = 1    |i| = 2
Known                 1.0        1.0        1.0
Statistically known   -          0.8        0.9
Unknown               -          0.7        0.8
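A small sketch of the reliability weighting of Section 2.3.5, using the w_m(i) values of Table 2. The per-position, per-offset context labels ("known", "statistically known", "unknown") are assumed to be available from the dictionary-generation step, and the value used where Table 2 leaves a blank (|i| = 0 for the last two rows) is an assumption.

```python
# w_m(i) from Table 2, indexed by context status and |i| (offset from the center phoneme).
W_TABLE = {
    'known':               {0: 1.0, 1: 1.0, 2: 1.0},
    'statistically_known': {0: 1.0, 1: 0.8, 2: 0.9},   # |i| = 0 left blank in Table 2;
    'unknown':             {0: 1.0, 1: 0.7, 2: 0.8},   # 1.0 is assumed here
}

def reliability_factor(context_status):
    """Eq. (6): k = (1/|W|) * sum_m prod_{i=-2..2} w_m(i).

    context_status[m][i] gives the status of the context phoneme at offset i
    (i in -2..2) for position m of the word; len(context_status) == |W|.
    """
    total = 0.0
    for per_pos in context_status:
        prod = 1.0
        for i in (-2, -1, 0, 1, 2):
            prod *= W_TABLE[per_pos[i]][abs(i)]
        total += prod
    return total / len(context_status)

def weighted_likelihood(p_a_given_w, context_status):
    """Eq. (5): P'(A|W) = k * P(A|W)."""
    return reliability_factor(context_status) * p_a_given_w
```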
3. Pronunciation dictionary for spontaneous speech recognition

3.1. Conditions

A total of 230 speaker (100 male and 130 female) dialogues were used for the pronunciation network and acoustic model training. A 26-dimensional feature vector (12-dimensional mel-cepstrum plus power, and their derivatives) was computed using a 25.6 ms window duration and a 10 ms frame period. A set of 26 phonemes was used as the Japanese pronunciation representation.

Shared-state context-dependent HMMs (CD-HMMs) with five Gaussian mixture components per state (Ostendorf and Singer, 1997) were trained. The total number of states was set to 800. Using the CD-HMMs and Japanese syllabic constraints, phoneme recognition was performed on the training data. The phoneme sequences of the recognition results were taken as the realized pronunciations. For each utterance, these realized pronunciations were aligned to their canonical pronunciations, which had been transcribed by human experts.

3.2. Pronunciation network training

Canonical pronunciations with quintphone context and their corresponding realized pronunciations (about 120,000 samples in total) were used as the inputs and outputs for the pronunciation network training. The structure of the pronunciation network is shown in Fig. 1: 130 input units, 100 hidden units and 53 output units are used. There is also a bias that acts as an additional input constantly set to one. The total number of network weights including the biases is 18,453 (131 x 100 + 101 x 53). For the output and hidden units, the sigmoid function with the mean squared error (MSE) criterion is used, because each output produces a number between 0 and 1 but the outputs do not sum to one. The network was trained using 1000 batch iterations, and an intermediate network after 500 iterations was used in the following experiments. The differences in recognition performance for different numbers of iterations are discussed in Section 5.1. The phone recognition accuracy between the canonical pronunciation and the training data is 81.1%. To indicate how well the pronunciation network can predict pronunciation variation, we evaluated the network by the coincidence rate and by the MSE on the training data. Fig. 2 shows the coincidence rate between the target pronunciations and the estimated pronunciations (solid line), and the MSE between the targets and the estimates (dotted line), as functions of the number of training iterations. The coincidence rate between the target and the canonical pronunciation (shown as Original Correct in the figure) is 77.2%.

Fig. 2. Coincidence rates (solid line) and mean squared error (dotted line) of targets and estimates for the training data as a function of the number of training iterations. The coincidence rate for target and canonical pronunciation (Original Correct) is 77.2%.
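Section 3.2 states only that the network was trained with the sigmoid/MSE criterion for 1000 batch iterations. The following is a minimal full-batch gradient-descent sketch of one such update for the network of the earlier sketch; the learning rate and the use of plain gradient descent (rather than any momentum or second-order scheme) are assumptions.

```python
import numpy as np

def train_batch(net, X, T, lr=0.1):
    """One full-batch MSE gradient step for the two-layer sigmoid network.
    X: (N, 130) inputs, T: (N, 53) targets; updates net.w_ih / net.w_ho in place."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])       # append the bias input
    H = 1.0 / (1.0 + np.exp(-(Xb @ net.w_ih)))          # hidden activations
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    Y = 1.0 / (1.0 + np.exp(-(Hb @ net.w_ho)))          # outputs in [0, 1]

    # Backpropagation of the mean-squared-error criterion through the sigmoids.
    dY = (Y - T) * Y * (1.0 - Y)
    dH = (dY @ net.w_ho[:-1].T) * H * (1.0 - H)         # drop the bias row of w_ho
    net.w_ho -= lr * Hb.T @ dY / X.shape[0]
    net.w_ih -= lr * Xb.T @ dH / X.shape[0]
    return float(np.mean((Y - T) ** 2))                 # MSE for monitoring (cf. Fig. 2)
```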
3.3. Generation of the realized pronunciation dictionary

We applied the trained pronunciation network to the following two kinds of Japanese pronunciation dictionaries with 7484 word entries [Footnote 3: Multi-words, which were automatically generated by the language modeling (Masataki and Sagisaka, 1996), were also included in the entries.], developed for spontaneous speech recognition on a travel arrangement task (Nakamura et al., 1996).

Simple dictionary: Each word entry has a single standard canonical pronunciation. This dictionary is automatically generated from the word's reading represented by Japanese syllabic symbols (katakana characters). All entries can be followed by silence (i.e. a pause) in recognition.

Expert dictionary: This dictionary is constructed by human experts, considering pronunciation variabilities such as successive voicings [Footnote 4: Some Japanese word pronunciations change when a compound word is formed. For example, the conjunction of /k o d o m o/ (child) and /h e y a/ (room) is pronounced /k o d o m o b e y a/.], insertion and substitution of phonemes occurring in spontaneous speech, and possible insertions of a pause.

Table 3 shows examples of the pronunciations obtained for these dictionaries. // denotes silence; {|//} represents that both no silence and silence can be used.

Table 3
Examples of simple and expert pronunciation dictionaries
Example   Simple dictionary           Expert dictionary
1         h e y a {//}                {h|b} e y a {|//}
2         s u m i m a s e ng {//}     s u {{m|} i|ng} m a s e ng {|//}
3         h o sh i {//}               h o sh i

In example 2 of the expert dictionary, /s u {{m|} i|ng} m a s e ng {|//}/ represents the following six multiple pronunciations.
1. s u m i m a s e ng
2. s u i m a s e ng
3. s u ng m a s e ng
4. s u m i m a s e ng //
5. s u i m a s e ng //
6. s u ng m a s e ng //

By expanding the brackets shown in Table 3 as above and applying these canonical pronunciations to the pronunciation network, a realized pronunciation dictionary is automatically generated as described in Section 2.3. Table 4 shows the total number of multiple pronunciations for the 7484 word entries. In this table, Prop 1 and Prop 2 denote the proposed realized pronunciation dictionaries without and with the treatment of word boundary phonemes described in Section 2.3.4, respectively. The threshold for limiting the number of realized pronunciations was set to 0.4 in all cases. We used the expanded dictionaries as the baseline dictionaries in the following recognition experiments; note that the bracket expansion itself did not affect the recognition performance.

Table 4
Total number of multiple pronunciations for 7484 word entries
                                   Simple    Expert
Before bracket expansion           7484      7484
After bracket expansion            14,968    17,210
Prop 1 (w/o boundary phonemes)     28,663    33,198
Prop 2 (with boundary phonemes)    33,742    42,103

For example, the multiple realized pronunciations obtained from the pronunciation network for the word /w a z u k a/ are shown in Table 5. In this example, the word boundary phonemes (i.e. /w a/ and /k a/) are also processed by the pronunciation network.

Table 5
Example of realized pronunciations with normalized likelihoods for /w a z u k a/
Pronunciation   Normalized likelihood
w a z u k a     1.0
a z u k a       0.896
w a z u t a     0.662
a z u t a       0.593
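Before the pronunciation network is applied, the bracketed alternations of the expert dictionary (Table 3) are expanded into plain pronunciations. The recursive sketch below implements one plausible reading of that notation (space-separated phonemes, '|' for alternatives, nested braces, an empty alternative for omission, '//' for a pause); the parsing conventions are inferred from the examples and are not spelled out in the paper.

```python
def expand_brackets(entry):
    """Expand a dictionary entry with {a|b|...} alternations (possibly nested)
    into the list of plain pronunciations it represents."""
    # Find the first top-level {...} group.
    depth, start = 0, -1
    for pos, ch in enumerate(entry):
        if ch == '{':
            if depth == 0:
                start = pos
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                head, group, tail = entry[:start], entry[start + 1:pos], entry[pos + 1:]
                # Split the group on top-level '|' only, respecting nested braces.
                alts, buf, d = [], '', 0
                for c in group:
                    if c == '{':
                        d += 1
                    elif c == '}':
                        d -= 1
                    if c == '|' and d == 0:
                        alts.append(buf)
                        buf = ''
                    else:
                        buf += c
                alts.append(buf)
                out = []
                for alt in alts:
                    out.extend(expand_brackets(head + alt + tail))
                return out
    return [' '.join(entry.split())]        # no brackets left: normalize spacing

for p in expand_brackets('s u {{m|} i|ng} m a s e ng {|//}'):
    print(p)
```

Applied to example 2 of Table 3, this yields the six pronunciations listed above (in a different order).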
4. Spontaneous speech recognition experiments

To investigate the relative effectiveness of the proposed dictionaries generated in Section 3, we conducted continuous speech recognition experiments on a Japanese spontaneous speech database (Nakamura et al., 1996).

4.1. Experimental conditions

The same training data, front-end and acoustic model described in Section 3.1 were used. For the open test set, 42 speaker (17 male and 25 female) dialogues were used. Variable-order n-grams (Masataki and Sagisaka, 1996) were used as the language model. A multi-pass beam search technique was used for decoding (Shimizu et al., 1996). The language and pronunciation probability weights, \alpha and \beta in Eq. (3), were set equal.

Four kinds of realized pronunciation dictionaries were generated for each baseline dictionary (i.e. the simple and expert dictionaries):
1. A realized pronunciation dictionary with no pronunciation likelihoods and no language statistics for word boundary phonemes. This dictionary generation is equivalent to our previous approach (Fukada and Sagisaka, 1997) (Dict 1).
2. The same as Dict 1 but with the pronunciation likelihoods described in Section 2.3.2 (Dict 2).
3. A dictionary generated using the language statistics described in Section 2.3.4. The reliability weighting described in Section 2.3.5 is also applied (Dict 3).
4. The same as Dict 3 but with pronunciation likelihoods (Dict 4).
Note that the total numbers of multiple pronunciations for Dict 1 and Dict 2 are the same as Prop 1 in Table 4, and the totals for Dict 3 and Dict 4 are the same as Prop 2 in that table.

4.2. Recognition results

4.2.1. Simple dictionary

Recognition results in word error rate (WER, %) for the simple dictionary are shown in Table 6. All four proposed dictionaries outperform the baseline dictionary. Also, the application of pronunciation likelihoods or language statistics (Dict 2, Dict 3 or Dict 4) boosts the recognition performance over our previous approach (Dict 1). Note that the proposed dictionary generated using both pronunciation likelihoods and language statistics achieved about a 10% relative error reduction in word error rate compared to the baseline performance.

Table 6
Recognition results for the simple dictionary
Dictionary   Likelihood   Language stat.   WER (%)
Baseline     -            -                34.5
Dict 1       No           No               33.2
Dict 2       Yes          No               31.2
Dict 3       No           Yes              32.9
Dict 4       Yes          Yes              31.1

4.2.2. Expert dictionary

Recognition results in WER (%) for the expert dictionary are shown in Table 7. First, comparing the two baseline results for the simple and expert dictionaries in Tables 6 and 7, the expert dictionary (29.0%) is significantly superior to the simple dictionary (34.5%). Second, applying the proposed method to the baseline expert dictionary yields improvements similar to those achieved with the simple dictionary. Note that the proposed dictionary with pronunciation likelihoods and language statistics (Dict 4) achieved about a 9% relative error reduction in WER compared to the baseline performance.

Table 7
Recognition results for the expert dictionary
Dictionary   Likelihood   Language stat.   WER (%)
Baseline     -            -                29.0
Dict 1       No           No               27.9
Dict 2       Yes          No               26.0
Dict 3       No           Yes              27.4
Dict 4       Yes          Yes              26.4

4.2.3. Error analysis

From Tables 6 and 7, the proposed dictionaries gave consistently better performance than the baseline dictionaries. We also found from the recognition results that the numbers of insertion and substitution errors for the proposed dictionaries decreased significantly compared to the baseline dictionaries. We believe this is because the proposed dictionary reduces the errors in which a long word is incorrectly recognized as a sequence of short words, or in which a correct word is substituted by another word when the actual pronunciation differs slightly from that in the baseline dictionary.
5. Discussion

5.1. Number of iterations for NN training

The WER and the total number of realized pronunciations as functions of the number of neural network training iterations (50, 100, 200, 500 and 1000) are shown in Fig. 3. The experimental conditions were the same as those described in Section 4.1, except that the threshold for the normalized likelihood was set to 0.5. The baseline expert dictionary was used for generating the realized pronunciations. No pronunciation likelihoods or language statistics were used in this experiment. From these results, it can be seen that the WER decreased up to 500 iterations and then saturated, while the number of realized pronunciations kept increasing. Note that all created dictionaries outperformed the baseline dictionary.

Fig. 3. Word error rate and total number of realized pronunciations as functions of the number of neural network training iterations.

5.2. Probability cut-off threshold

The WER and the total number of realized pronunciations as functions of the threshold (in steps of 0.1 from 0.2 to 1.0) for limiting the number of realized pronunciations are shown in Fig. 4. The experimental conditions were the same as those described in Section 4.1. The baseline expert dictionary was used for generating the realized pronunciations. No pronunciation likelihoods or language statistics were used in this experiment. From these results, we can see that the WER shows a flat minimum around thresholds of 0.4-0.6; with smaller thresholds the WER becomes higher and the number of pronunciations grows exponentially. Again, note that all created dictionaries outperformed the baseline dictionary.

Fig. 4. Word error rate and total number of realized pronunciations as functions of the threshold for limiting the number of pronunciations.

5.3. Computational requirements

As the proposed dictionaries have significantly more entries than the baseline dictionaries (see Table 4), we investigated the computational requirements of the experiments run with the expert dictionary. The computational requirements and pronunciation entries for the expert dictionaries, normalized against those of the baseline system, are shown in Table 8. From these results, although the proposed dictionaries required some additional search time compared to the baseline system, the increase in search time was much smaller than the increase in the number of pronunciation entries in the dictionary. We believe this is because the proposed realized pronunciation dictionaries appropriately represent the pronunciations that actually occur in conversational speech. Note that the computational requirements of realized pronunciations depend strongly on the recognizer's representation of pronunciations. In our case, the multiple pronunciations are represented in a network with sharing of common phonemes, which is much faster than a linear lexicon.

Table 8
Computational requirements and pronunciation entries for the expert dictionaries, normalized against those of the baseline system
Dictionary   CPU time   Entries
Dict 1       1.04       1.93
Dict 2       1.20       1.93
Dict 3       1.19       2.45
Dict 4       1.33       2.45

6. Conclusion

In this paper, a method for automatically generating a pronunciation dictionary based on a pronunciation neural network has been proposed. We focused on two techniques: (1) realized pronunciations with likelihoods based on the neural network outputs, and (2) realized pronunciations for word boundary phonemes using word bigram-based language statistics; both are more sophisticated generation methods than our previous approach (Fukada and Sagisaka, 1997). Experimental results on spontaneous speech recognition show that the automatically derived pronunciation dictionaries give higher recognition rates than the conventional dictionary. We also confirmed that the proposed method can enhance the recognition performance of a dictionary constructed on the basis of expertise. In this paper, only a quintphone context is used for predicting pronunciation variations; that is, words whose quintphone contexts are the same have the same pronunciation variations. However, other factors (e.g. part-of-speech) can easily be incorporated into the pronunciation network by adding units for these factors. Although the proposed method requires a fixed input window (i.e. a context of five phonemes), this requirement could be relaxed by adding word boundary phones (pad phones) to the beginning and end of the word. In addition, we expect the multiple pronunciation dictionary to be a useful resource for acoustic model retraining by realigning the training data (Sloboda, 1995; Fosler et al., 1996; Byrne et al., 1997).
References

Bahl, L., Baker, J., Cohen, P., Jelinek, F., Lewis, B., Mercer, R., 1978. Recognition of a continuously read natural corpus. In: Proc. ICASSP-78, pp. 422-424.
Byrne, B., Finke, M., Khudanpur, S., McDonough, J., Nock, H., Riley, M., Saraclar, M., Wooters, C., Zavaliagkos, G., 1997. Pronunciation modelling for conversational speech recognition: A status report from WS97. In: Proc. 1997 IEEE Workshop on Speech Recognition and Understanding.
Fosler, E., Weintraub, M., Wegmann, S., Kao, Y.-H., Khudanpur, S., Galles, C., Saraclar, M., 1996. Automatic learning of word pronunciation from data. In: Proc. ICSLP-96, pp. 28-29 (addendum).
Fukada, T., Sagisaka, Y., 1997. Automatic generation of a pronunciation dictionary based on a pronunciation network. In: Proc. EUROSPEECH-97, pp. 2471-2474.
Humphries, J., 1997. Accent modelling and adaptation in automatic speech recognition. Ph.D. Thesis, University of Cambridge, Cambridge.
Imai, T., Ando, A., Miyasaka, E., 1995. A new method for automatic generation of speaker-dependent phonological rules. In: Proc. ICASSP-95, pp. 864-867.
Lamel, L., Adda, G., 1996. On designing pronunciation lexicons for large vocabulary, continuous speech recognition. In: Proc. ICSLP-96, pp. 6-9.
Masataki, H., Sagisaka, Y., 1996. Variable-order n-gram generation by word-class splitting and consecutive word grouping. In: Proc. ICASSP-96, pp. 188-191.
Nakamura, A., Matsunaga, S., Shimizu, T., Tonomura, M., Sagisaka, Y., 1996. Japanese speech databases for robust speech recognition. In: Proc. ICSLP-96, pp. 2199-2202.
Ostendorf, M., Singer, H., 1997. HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language 11, 17-41.
Randolph, M., 1990. A data-driven method for discovering and predicting allophonic variation. In: Proc. ICASSP-90, pp. 1177-1180.
Riley, M., 1991. A statistical model for generating pronunciation networks. In: Proc. ICASSP-91, pp. 737-740.
Riley, M., Pereira, F., Chung, E., 1995. Lazy transducer composition: A flexible method for on-the-fly expansion of context-dependent grammar networks. In: Proc. IEEE Automatic Speech Recognition Workshop, pp. 139-140.
Schmid, P., Cole, R., Fanty, M., 1993. Automatically generated word pronunciations from phoneme classifier output. In: Proc. ICASSP-93, pp. II-223-II-226.
Sejnowski, T., Rosenberg, C., 1986. NETtalk: A parallel network that learns to read aloud. The Johns Hopkins University, Electrical Engineering and Computer Science Tech. Report JHU/EECS-86/01.
Shimizu, T., Yamamoto, H., Masataki, H., Matsunaga, S., Sagisaka, Y., 1996. Spontaneous dialogue speech recognition using cross-word context constrained word graphs. In: Proc. ICASSP-96, pp. 145-148.
Sloboda, T., 1995. Dictionary learning: performance through consistency. In: Proc. ICASSP-95, pp. 453-456.
Weintraub, M., Fosler, E., Galles, C., Kao, Y.-H., Khudanpur, S., Saraclar, M., Wegmann, S., 1996. Automatic learning of word pronunciation from data. JHU Workshop-96 Project Report.
Wooters, C., Stolcke, A., 1994. Multiple-pronunciation lexical modeling in a speaker independent speech understanding system. In: Proc. ICSLP-94, pp. 1363-1366.
