Automatic generation of multiple pronunciations based on neural networks



  • Automatic generation of multiple pronunciations based on neural networks

    Toshiaki Fukada *, Takayoshi Yoshimura, Yoshinori Sagisaka

    ATR Interpreting Telecommunications Research Laboratory, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

    Received 3 April 1998; accepted 24 September 1998


    We propose a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation. This method can generate multiple forms of realized pronunciations using the pronunciation network. For generating a sophisticated realized pronunciation dictionary, two techniques are described: (1) realized pronunciations with likelihoods and (2) realized pronunciations for word boundary phonemes. Experimental results on spontaneous speech show that the automatically derived pronunciation dictionaries give consistently higher recognition rates than a conventional dictionary. © 1999 Elsevier Science B.V. All rights reserved.


    We propose a method for the automatic generation of a pronunciation lexicon whose pronunciation variants are generated by a neural network. The neural network can predict plausible alternative pronunciation variants from the original (canonical) pronunciations. For the automatic generation of a pronunciation lexicon, two techniques are described here: (1) the generation of alternative pronunciation variants with likelihoods and (2) the generation of alternative pronunciation variants for phonemes at word boundaries. Recognition experiments with spontaneous speech show that using automatically generated pronunciation lexica gives higher recognition rates than using the regular pronunciation lexicon. © 1999 Elsevier Science B.V. All rights reserved.


    We propose a method for automatically generating a pronunciation dictionary using a neural network that predicts the most plausible pronunciations (variants) from a standard pronunciation. The pronunciation network can generate multiple pronunciation variants. To generate a sophisticated dictionary of pronunciation variants, two techniques are described: (1) pronunciation variants with likelihood values, and (2) pronunciation variants for phonemes at word boundaries. The results of our experiments on spontaneous speech show that the automatically derived pronunciation dictionaries significantly improve the recognition rate compared with a conventional dictionary. © 1999 Elsevier Science B.V. All rights reserved.

    Keywords: Pronunciation dictionary; Neural networks; Spontaneous speech; Speech recognition

    Speech Communication 27 (1999) 63–73

    * Corresponding author. Tel.: +81 774 95 1301; fax: +81 774 95 1308; e-mail:

    0167-6393/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-6393(98)00066-1

  • 1. Introduction

    The creation of an appropriate pronunciation dictionary is widely acknowledged to be an important component for a speech recognition system. One of the earliest successful attempts based on phonological rules was made at IBM (Bahl et al., 1978). Generating a sophisticated pronunciation dictionary is still considered to be quite effective for improving the system performance on large vocabulary continuous speech recognition (LVCSR) tasks (Lamel and Adda, 1996). However, constructing a pronunciation dictionary manually or by a rule-based system requires time and expertise. Consequently, research efforts have been directed at constructing a pronunciation dictionary automatically. In the early 1990s, the emergence of phonetically transcribed (hand-labeled) medium-size databases (e.g. TIMIT and Resource Management) encouraged a lot of researchers to explore pronunciation modeling (Randolph, 1990; Riley, 1991; Wooters and Stolcke, 1994). Although all of these approaches are able to automatically generate pronunciation rules, hand-labeled transcriptions by expert phoneticians are required. As a result, automatic phone transcriptions generated by a phoneme recognizer, which enable one to cope with a large amount of training data, have been used in pronunciation modeling (Schmid et al., 1993; Sloboda, 1995; Imai et al., 1995; Humphries, 1997). Recently, LVCSR systems have started to treat spontaneous, conversational speech, such as the Switchboard corpus and, consequently, pronunciation modeling has become an important topic because word pronunciations vary more here than in read speech (Fosler et al., 1996; Weintraub et al., 1996; Byrne et al., 1997).

    We have proposed a method for automatically generating a pronunciation dictionary on the basis of a spontaneous, conversational speech database (Fukada and Sagisaka, 1997). Our approach is based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation; most other approaches use decision trees for pronunciation modeling (Randolph, 1990; Riley, 1991; Fosler et al., 1996; Weintraub et al., 1996; Humphries, 1997; Byrne et al., 1997). In this paper, we mainly address the following issues to generate more sophisticated multiple pronunciations for improved speech recognition: (1) how to assign a score to a pronunciation variation; and (2) how to generate pronunciation variations for word-boundary phonemes.

    We define canonical and realized pronunciations as follows.

    Canonical pronunciation: Standard phoneme sequences assumed to be pronounced in read speech. Pronunciation variations such as speaker variability, dialect or coarticulation in conversational speech are not considered.

    Realized pronunciation: Actual phoneme sequences pronounced in speech. Various pronunciation variations due to speaker or conversational speech can be included.

    In the following sections, we first present training and generation procedures based on a pronunciation neural network. In Section 3, the proposed method is applied to a task of pronunciation dictionary generation for spontaneous speech recognition. Section 4 shows results of recognition experiments and Section 5 gives a discussion of the presented work.

    2. Automatic generation of a pronunciation dictionary


    2.1. Pronunciation network

    To predict realized pronunciations from a canonical pronunciation, we employ a multilayer perceptron as shown in Fig. 1. In this paper, a realized pronunciation A(m) for a canonical pronunciation L(m) is predicted from the five phonemes (i.e. quintphone) of canonical pronunciations L(m−2), …, L(m+2).

    Fig. 1. Pronunciation network.

    Now we have two questions: (1) how to train a pronunciation network; and (2) how to generate multiple realized pronunciations by using the trained pronunciation network. These questions are answered in the following sections.

    2.2. Training procedures

    2.2.1. Training data preparation

    To train a pronunciation network, first we have to prepare the training data, that is, input (canonical pronunciation) and output (realized pronunciation) pairs. The training data can be prepared by generating a realized pronunciation sequence and mapping it to the canonical pronunciation as follows.

    1. Conduct phoneme recognition using speech training data for dictionary generation. The recognized phoneme strings are taken as the realized pronunciation sequence.

    2. Align the canonical pronunciation sequence to the realized pronunciation sequence using a dynamic programming algorithm.

    For example, if the phoneme recognition result (i.e. realized pronunciation) for the canonical pronunciation /a r a y u r u/ is /a w a u r i u/, the correspondence between the canonical pronunciation and the realized pronunciation can be determined as follows:

        a  r  a  y  u  r    u    (canonical pron.)
        a  w  a     u  r i  u    (realized pron.)

    where the second phoneme of the canonical pronunciation, /r/, is substituted by /w/, /y/ is deleted, and /i/ is inserted for the sixth phoneme of the canonical pronunciation, /r/. That is, L(2) = r, A(2) = w, L(4) = y, A(4) = x (deletion), L(6) = r and A(6) = {r, i} (/i/ is an insertion). The correctly recognized phonemes are also treated as substitutions (e.g. /a/ is substituted by /a/). Phoneme recognition is conducted using all of the training data, and the aligned results are used as the input and output data for the pronunciation neural network training (described in the following section). Note that the phoneme recognition and alignment procedures are performed not for each word but for each utterance.
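The alignment step can be illustrated with a minimal sketch. The text only says that a dynamic programming algorithm is used, so the unit-cost edit distance below is an assumption, as are the function name `align` and the convention of dropping insertions that precede the first canonical phoneme. For each canonical phoneme L(m), the sketch returns the substituted phoneme (or `None` for a deletion) together with any phonemes inserted after it.

```python
def align(canonical, realized):
    """Align realized phonemes to canonical ones (hypothetical helper).

    Assumes unit costs for substitution, deletion and insertion; the paper
    does not state its cost model. Returns one (substitute, insertions)
    pair per canonical phoneme, where substitute is None for a deletion.
    """
    n, m = len(canonical), len(realized)
    # cost[i][j]: minimum number of edits turning canonical[:i] into realized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(canonical[i - 1] != realized[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # substitution or match
                             cost[i - 1][j] + 1,             # deletion
                             cost[i][j - 1] + 1)             # insertion

    # Trace back from the end, attaching each insertion to the canonical
    # phoneme that precedes it.
    substitutes = [None] * n
    insertions = [[] for _ in range(n)]
    i, j = n, m
    while i > 0:
        mismatch = int(j > 0 and canonical[i - 1] != realized[j - 1])
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            substitutes[i - 1] = realized[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            insertions[i - 1].insert(0, realized[j - 1])
            j -= 1
        else:
            i -= 1  # canonical[i - 1] is deleted
    # Assumption: realized phonemes left over before the first canonical
    # phoneme (j > 0 here) are ignored.
    return list(zip(substitutes, insertions))


pairs = align("a r a y u r u".split(), "a w a u r i u".split())
for position, (substitute, inserted) in enumerate(pairs, start=1):
    print(position, substitute, inserted)
```

On the example from the text, this yields a substitution by /w/ at position 2, a deletion at position 4, and /r/ followed by an inserted /i/ at position 6.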

    2.2.2. Structure of pronunciation network

    To train a pronunciation network, a context of five phonemes in the canonical pronunciations, L(m−2), …, L(m+2), is given as the input; A(m), aligned to L(m), is given as the output. A total of 130 units (a set of 26 Japanese phonemes times five context positions) are used in the input layer. The representation of realized pronunciations at the output layer is localized, with one unit representing deletion, 26 units representing substitution, and 26 units representing insertion, providing a total of 53 output units.
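To make the layer sizes concrete, the following is a rough sketch of a forward pass through a multilayer perceptron with 130 inputs and 53 outputs. Only those two sizes come from the text; the single hidden layer of 100 units, the sigmoid activations, the random initial weights and all function names are assumptions made purely for illustration.

```python
import math
import random

N_IN, N_OUT = 130, 53   # layer sizes given in the text
N_HIDDEN = 100          # hypothetical; the hidden-layer size is not stated here


def init_layer(n_in, n_out, rng):
    """One weight row per output unit; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weights, inputs):
    """Sigmoid units (an assumed activation function)."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
            for row in weights]


def forward(hidden_weights, output_weights, x):
    return layer_forward(output_weights, layer_forward(hidden_weights, x))


rng = random.Random(0)
hidden_weights = init_layer(N_IN, N_HIDDEN, rng)
output_weights = init_layer(N_HIDDEN, N_OUT, rng)

# A dummy quintphone input: one active unit in each of the five 26-unit slots.
x = [0.0] * N_IN
for slot in range(5):
    x[slot * 26] = 1.0

scores = forward(hidden_weights, output_weights, x)
print(len(scores))  # one score per output unit
```

With untrained weights the scores carry no meaning; the sketch only shows how a 130-dimensional context vector maps to 53 output activations.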

    In the example of Section 2.2.1, when the deletion that corresponds to the fourth canonical phoneme, /y/, is used as A(m), /r a y u r/ are used as L(m−2), …, L(m+2). Here, 1.0 is given to the output unit for deletion and to the input units for /r/ in L(m−2), /a/ in L(m−1), /y/ in L(m), /u/ in L(m+1) and /r/ in L(m+2); 0.0 is given to the other input and output units.
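The input and output coding just described can be sketched as follows. The 26-symbol inventory `PHONEMES` is a placeholder invented for this illustration and is not the phoneme set used in the paper; the function names, the 0-based index `m`, and the choice to leave a slot all-zero when the context runs past the utterance boundary are likewise assumptions.

```python
# Placeholder inventory of 26 symbols (NOT the paper's actual phoneme set).
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "h", "m", "y", "r",
            "w", "g", "z", "d", "b", "p", "N", "q", "sh", "ch", "ts", "j", "f"]
INDEX = {p: k for k, p in enumerate(PHONEMES)}
N = len(PHONEMES)  # 26


def encode_input(canonical, m):
    """130-dim vector for L(m-2)..L(m+2); m is a 0-based index here.

    Assumption: slots that fall outside the utterance stay all-zero.
    """
    vec = [0.0] * (5 * N)
    for slot, pos in enumerate(range(m - 2, m + 3)):
        if 0 <= pos < len(canonical):
            vec[slot * N + INDEX[canonical[pos]]] = 1.0
    return vec


def encode_output(substitute, inserted):
    """53-dim vector: 1 deletion unit, 26 substitution units, 26 insertion units."""
    vec = [0.0] * (1 + 2 * N)
    if substitute is None:
        vec[0] = 1.0                      # deletion
    else:
        vec[1 + INDEX[substitute]] = 1.0  # substitution (including a match)
    for p in inserted:
        vec[1 + N + INDEX[p]] = 1.0       # insertion
    return vec


canonical = "a r a y u r u".split()
x = encode_input(canonical, 3)   # fourth phoneme /y/, context /r a y u r/
y = encode_output(None, [])      # /y/ is deleted
print(len(x), sum(x), len(y), sum(y))
```

For the sixth phoneme in the same example, `encode_output("r", ["i"])` would activate one substitution unit and one insertion unit.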

    2.3. Generation procedures

    2.3.1. Realized pronunciation generation

