
[IEEE 2011 International Conference on Electrical and Control Engineering (ICECE) - Yichang, China (2011.09.16-2011.09.18)] 2011 International Conference on Electrical and Control



Development of Hindi Speech Synthesizer for Metro Rail Information System

Archana Balyan1, S.S. Agrawal2 and Amita Dev3

1 Department of Electronics and Communication Engineering, Maharaja Surajmal Institute of Engineering, New Delhi, India

2 Advisor, CDAC & Director, KIIT, Gurgaon, India
3 Department of Computer Science, Ambedkar Institute of Technology, Delhi, India
{[email protected], [email protected], [email protected]}

Abstract

Building a natural conversational interface is a major challenge faced by researchers. Speech synthesis is a technology used for automatic information retrieval. In this paper, we describe the development of a unit selection speech synthesizer for the Hindi language. The aim of this study is to develop a speech synthesizer that uses the phoneme as the basic unit of Hindi, as applicable to a Metro Rail Passenger Information System (MRPIS). The phoneme has been chosen as the basic unit because a larger-domain synthesizer cannot be created using the syllable as the basic unit. We describe the build process and address the issue of speech segmentation using Hidden Markov Model (HMM) based techniques. We report a comparison of automatically segmented labels (speech units), obtained using a baseline HMM, with manually segmented labels in the context of Hindi speech.

Keywords: Concatenation, Hidden Markov Model, Unit selection synthesis, Mel Frequency Cepstral Coefficient, Automatic Phonetic Segmentation

Introduction

The function of a Text-to-Speech (T-T-S) system is to convert an arbitrary text into a spoken waveform. Speech may be synthesized in different ways. The simplest approach involves taking real recorded speech, cutting it into segments, and concatenating these segments back together during synthesis; this approach is called concatenative synthesis. In this paper, we have used a concatenative approach to develop the speech synthesizer. Since a concatenative approach is used, the database must be prepared very carefully in order to obtain high-quality natural speech, and the synthesized speech is expected to be intelligible. Unlike a standard pre-recorded database, unit selection provides multiple instances of the same unit in the corpus; these instances are differentiated by a unique combination of acoustic and linguistic values.

In a typical speech corpus, only one type of segment is used to represent the content of the corpus. In [2], the diphone is used to represent the speech corpus. Considerable work has been

reported in T-T-S conversion for Hindi in [3][4][5]. [3] uses the diphone as the speech unit. [2][3][4][5] report the development of synthesizers based on the Festvox voice-building framework. Our text-to-speech system consists of:
- a text analysis module that takes input text and produces basic units, which in our approach are phonemes;
- a unit matching module that generates an acoustic unit sequence according to the linguistic units detected from the input;
- a speech synthesis module that generates speech based on the acoustic unit sequence;
- a Hidden Markov Model (HMM) based module for automatic generation of the phoneme database from recorded sentences.

The paper is organized as follows: Section I describes various aspects of Indian languages, specifically Hindi. Section II describes our methodology for collecting speech corpora to build the database. Section III describes the phoneme segmentation of the speech signal using HMMs and the performance of the segmentation system. Section IV describes the statistics of the text information and the data preparation used in synthesizing speech. Section V describes the architecture of the implemented speech synthesizer, and in Section VI we discuss the advantages and difficulties that we faced during the development of the synthesizer, along with concluding remarks and future work.

1. Characteristics of Hindi Speech

1.1 Units of Speech Database

The basic units of the writing system in Indian languages are characters, which are an orthographic representation of speech sounds. A character in Indian language scripts is close to a syllable and can typically take the following forms: V, CV, VC, CCV and CVC, where C is a consonant and V is a vowel. In Hindi, there are ten vowels, two diphthongs, four semivowels, and 31 consonants. The Hindi script is phonetic in nature. There are 8 aspirated plosives and 2 aspirated affricates in addition to their unaspirated counterparts. Retroflexion is another prominent feature of Hindi. There are 4 places of articulation: velar, dental, retroflex and bilabial [6].

978-1-4244-8165-1/11/$26.00 ©2011 IEEE

2. Speech Database Development Methodology

The speech corpora design for the Metro Rail Information System consists of the following steps [7]:
- Selection of sentences pertaining to the railway information system.
- Recording of content.
- Automatic phoneme segmentation of the speech signal using HMMs.

2.1 Selection of sentences

The first step we followed in creating a speech database for building the speech synthesis system is the generation of an optimal set of spoken sentences relevant to the Metro Rail Information System for a few states in India. The selected sentences should be sufficient to convey all types of required information, minimal in order to reduce recording effort, and at the same time should contain enough occurrences of each type of sound to capture the acoustics of Hindi speech.
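The "sufficient but minimal with good sound coverage" requirement above can be sketched as a greedy set-covering selection. This is an illustrative approach only; the paper does not state which selection algorithm was actually used, and the function and data layout here are assumptions.

```python
# Greedy sentence selection for phoneme coverage: repeatedly pick the
# sentence that adds the most phonemes not yet covered, stopping once
# no candidate adds anything new. A sketch, not the authors' procedure.
def greedy_select(sentences_phonemes):
    """sentences_phonemes: {sentence: set of phonemes it contains}.
    Returns (chosen sentences in selection order, covered phoneme set)."""
    covered, chosen = set(), []
    remaining = dict(sentences_phonemes)
    while remaining:
        # Candidate with the largest number of still-uncovered phonemes.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:          # nothing left to gain -> minimal set reached
            break
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered
```

A sentence that contributes no new phonemes is never selected, which keeps the recording effort minimal while preserving coverage.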

2.1.1 Most frequent words

A significant portion of the recorded sentences is covered by the most frequently used words; such a database therefore helps in obtaining better-quality synthesized speech. In the present database, the 200 most frequent words have been identified, and they cover 40% of the speech.
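The coverage figure quoted above (the fraction of running text accounted for by the top-n words) can be computed as follows; the function name and toy data are illustrative, not from the paper.

```python
from collections import Counter

def coverage_of_top_words(corpus_words, n=200):
    """Fraction of the running text covered by the n most frequent words."""
    counts = Counter(corpus_words)
    top = counts.most_common(n)                 # [(word, count), ...]
    return sum(c for _, c in top) / len(corpus_words)
```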

2.1.2 Recording of Speech Corpus and Statistical Analysis

The text material contains 292 sentences spoken in Hindi. The sentences were recorded by a professional male speaker in order to maintain a constant pitch and avoid stress phenomena. The duration of the recorded sentences is 1 hour 30 minutes of speech including silence regions, where the average duration of each utterance is approximately 5 seconds, with silence regions ranging from 1 to 2 seconds. The recording environment was a noise-free and echo-cancelled studio. The speech samples were recorded at a rate of 16 kHz and stored in 16-bit PCM-encoded waveform format in mono mode. A statistical analysis of the text information of the speech database has been carried out; Table 1 below shows the statistical analysis of the recorded speech sentences.

Table 1: Statistics of recorded speech

S.No.  Description                                   Number
1      Total number of sentences recorded            292
2      Total number of names of stations recorded    200
3      Total number of words recorded                2400
4      Total number of different words recorded      370
5      Total number of different nouns recorded      175

3. Automatic Phoneme Segmentation of the Speech Signal Using Hidden Markov Models

Hidden Markov Models [8] may be used to represent the sequence of sounds within a section of speech. Each elemental speech sound, known as a phoneme, can be modeled by an individual HMM. The most frequent approach to automatic phonetic segmentation is to modify an HMM-based phonetic recognizer to adapt it to the task of automatic phonetic segmentation. The main modification consists in letting the recognizer know the phonetic transcription of the sentence to be segmented, by building a recognizer grammar for that transcription and performing forced alignment. The probability of the input speech feature vector matching the HMM is used to identify the words spoken. HMMs are stochastic state machines where the current state is not directly observable; an HMM emits an observable symbol per state. The probability of an HMM emitting a symbol is modeled by a mixture of Gaussian distributions, as described in equation (1):

b_j(x) = \sum_{m=1}^{M} C_{j,m} \, N(x; \mu_{j,m}, U_{j,m})          (1)

where x is a feature vector extracted from the speech (e.g., mel-frequency cepstral coefficients), and C_{j,m}, \mu_{j,m} and U_{j,m} are the mixture coefficient, mean vector and covariance of mixture component m in state j. A block diagram of the HMM-based automatic phonetic segmentation (APS) system used in this study is illustrated in Fig. 1.
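The mixture density of equation (1) can be sketched directly. This is a minimal diagonal-covariance illustration; the toy weights, means and variances below are hypothetical, not parameters of the trained system.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density N(x; mean, var)."""
    d = len(x)
    norm = 1.0 / np.sqrt(((2.0 * np.pi) ** d) * np.prod(var))
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

def emission_prob(x, weights, means, variances):
    """b_j(x) = sum_{m=1}^{M} C_{j,m} N(x; mu_{j,m}, U_{j,m}) -- Eq. (1)."""
    return sum(c * gaussian_pdf(x, mu, v)
               for c, mu, v in zip(weights, means, variances))

# Toy two-component mixture evaluated on a 2-D feature vector.
p = emission_prob(np.zeros(2),
                  weights=[0.6, 0.4],
                  means=[np.zeros(2), np.ones(2)],
                  variances=[np.ones(2), np.ones(2)])
```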



Fig. 1: Automatic Segmentation System (Base Model) used for segmentation

In the training module, mel-frequency cepstral coefficients (MFCCs) are obtained from the speech database using the mel-cepstral analysis technique [9]. Context-independent HMMs are trained using the obtained coefficients. The speech sentences are transcribed at the word level, so we perform an initial phone-level segmentation using automatic alignment of the speech signals and word transcriptions, based on monophone HMMs. The automatic segmentation is performed by forced alignment of the spoken sentence with the corresponding transcription using the Viterbi algorithm [10].
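The forced-alignment step can be illustrated with a minimal Viterbi pass over a strict left-to-right state sequence with no skips (the topology described in Section 3.1.2). This is a sketch, not the C++ implementation used in the study; the per-frame emission scores are assumed to be precomputed log-probabilities.

```python
import numpy as np

def forced_align(log_emissions):
    """Align T frames to S states in strict left-to-right order (no
    skips), maximizing total log-likelihood via Viterbi.

    log_emissions: (T, S) array of log b_s(x_t) for each frame/state.
    Returns one state index per frame."""
    T, S = log_emissions.shape
    assert T >= S, "need at least one frame per state"
    NEG = -np.inf
    delta = np.full((T, S), NEG)        # best score ending in state s at t
    back = np.zeros((T, S), dtype=int)  # predecessor state
    delta[0, 0] = log_emissions[0, 0]   # alignment must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]
            move = delta[t - 1, s - 1] if s > 0 else NEG
            if stay >= move:
                delta[t, s], back[t, s] = stay, s
            else:
                delta[t, s], back[t, s] = move, s - 1
            delta[t, s] += log_emissions[t, s]
    # Backtrace from the final state (alignment must end in state S-1).
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

State-change positions in the returned path give the segment boundaries.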

3.1.1. Feature Extraction Process

The process of spectral parameter extraction (parameterization) involves conversion of speech samples into feature vectors that provide spectral patterns of the speech. The speech waveform, sampled at 16 kHz, is pre-emphasized (with a pre-emphasis coefficient of 0.99) and segmented at a uniform interval of 10 ms, with an overlap factor of 2.6. A Hamming window is applied to each segment and the Fourier transform is taken for each frame of 292 samples. The resulting power spectrum is warped according to the mel scale to match the frequency resolution of the human ear. The spectrum is then divided into a number of critical bands by means of a bank of 24 overlapping triangular filters. A Discrete Cosine Transform (DCT) is applied to the logarithm of the filter bank outputs, which yields the raw MFCC vectors.
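The pipeline above can be sketched end to end. Where the text gives a value (16 kHz, pre-emphasis 0.99, 10 ms shift, overlap factor 2.6, 24 filters, 12 coefficients) the sketch uses it; the FFT size and the filter-edge rounding are assumptions, and the window length is derived from the shift and overlap factor rather than the 292-sample figure quoted in the text.

```python
import numpy as np

SR = 16000                 # sampling rate (Hz), as in the paper
SHIFT = SR // 100          # 10 ms frame shift = 160 samples
WIN = int(SHIFT * 2.6)     # overlap factor 2.6 -> 26 ms window (assumption)
N_FILT = 24                # overlapping triangular mel filters
N_CEPS = 12                # cepstral coefficients kept (c1..c12, c0 dropped)
NFFT = 512                 # FFT size (assumption)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal):
    # Pre-emphasis with coefficient 0.99.
    x = np.append(signal[0], signal[1:] - 0.99 * signal[:-1])
    # Overlapping frames, Hamming-windowed.
    n_frames = 1 + (len(x) - WIN) // SHIFT
    frames = np.stack([x[i * SHIFT:i * SHIFT + WIN] for i in range(n_frames)])
    frames = frames * np.hamming(WIN)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, NFFT)) ** 2
    # Mel-spaced triangular filter bank.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2.0), N_FILT + 2)
    bins = np.floor((NFFT + 1) * mel_to_hz(mel_pts) / SR).astype(int)
    fbank = np.zeros((N_FILT, NFFT // 2 + 1))
    for m in range(1, N_FILT + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log filter-bank energies; keep c1..c12 (drop c0).
    n = np.arange(1, N_CEPS + 1)
    k = np.arange(N_FILT)
    dct_mat = np.cos(np.pi * np.outer(n, k + 0.5) / N_FILT)
    return log_energy @ dct_mat.T
```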

3.1.2. HMM Structure and Estimation of HMM Parameters

This is the process of building a model for each phoneme in the spoken speech. The feature vector is a 12-dimensional vector consisting of the MFCC coefficients {c_1, c_2, c_3, ..., c_12}, with Cepstral Mean Normalization (CMN) and normalized log energy, excluding the power coefficient c_0. The number of states per phoneme was set at 3 or 5. For these models, a strict left-to-right topology was used with no skips. The observation distribution of each state is modeled by a continuous Gaussian probability density function with a mixture of 2 Gaussians for all the phoneme models in each system. The phoneme HMMs are initialized using "flat start" training, where all the models are initially equal. We performed 30 iterations of the Viterbi re-estimation algorithm to train the acoustic models and obtained optimized speaker-dependent HMMs for each phoneme. The results of the Viterbi algorithm are automatically segmented monophones, which are used as the input for the speech synthesis training.

3.1.3. Text Transcription and Phoneme Parser

The raw input text is pre-processed to convert the Unicode Devanagari characters into Roman characters, since the available keyboards generate ASCII codes for Roman characters only. INSROT (Indian Script Roman Transliteration) [11] is the transliteration scheme used. We manually performed the transliteration of all the sentences occurring in our database. For example, the text transliteration of the sentence <यह गाड़ी केन्द्रीय सचिवालय तक के सभी स्टेशनों पर रुकेगी।> is <file001 "yeh' gaad'ii kenadriiya sachivalaya tak kai sabhii st'eshno par rukegii">. The parser takes the text as input, extracts the phonemes, arranges them according to their sequence of occurrence in the given sentence, and outputs the list of phonemes. The phoneme parser has been implemented in Visual C++. Each phoneme is indexed as Vowel (V) or Consonant (C).
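The parser's behavior can be sketched as a greedy longest-match scan over a Romanized word, tagging each phoneme as Vowel (V) or Consonant (C). The abbreviated inventory below is hypothetical and much smaller than the 43-phoneme set actually used; the real parser was written in Visual C++.

```python
# Minimal phoneme-parser sketch: greedy longest-match against a small,
# illustrative subset of an INSROT-style inventory.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}
CONSONANTS = {"k", "kh", "g", "gh", "ch", "t", "th", "d", "dh",
              "n", "p", "ph", "b", "bh", "m", "y", "r", "l", "v",
              "s", "sh", "h"}
# Longest symbols first so "bh" is matched before "b", "ii" before "i".
INVENTORY = sorted(VOWELS | CONSONANTS, key=len, reverse=True)

def parse_phonemes(word):
    """Return [(phoneme, 'V' or 'C'), ...] for a transliterated word."""
    out, i = [], 0
    while i < len(word):
        for ph in INVENTORY:
            if word.startswith(ph, i):
                out.append((ph, "V" if ph in VOWELS else "C"))
                i += len(ph)
                break
        else:
            i += 1          # skip characters outside the inventory
    return out
```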

3.1.4. Phoneme Segmentation Performance Evaluation Method

The segmentation accuracy was evaluated in terms of tolerance, or deviation, i.e., the percentage of boundaries predicted within a distance of less than t milliseconds from the hand-labeled boundaries. The automatically segmented boundaries are compared with the manually segmented boundaries, and the percentage of boundaries whose error is within the tolerance is measured for a range of tolerances. For each phoneme occurring in the speech database, the difference between the estimated start and end points and the actual start and end points was computed and averaged. This is called the deviation and is used as a measure of the performance of the segmentation. A deviation of less than 20 ms is considered an acceptable limit for producing good-quality synthetic speech [12]. The corpus used for the experiment is divided into training data (150-200 sentences) and test data (80-100 sentences). We used HMM code written in C++ to perform the experiment. Table 2 shows the overall segmentation results in terms of the percentage of phonemes falling into different ranges of deviation.
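The deviation statistics reported in Table 2 can be computed as follows (the function name and the toy boundary values are illustrative, not the study's data):

```python
import numpy as np

def boundary_deviation_stats(auto_ms, manual_ms, tolerances=(20, 30, 40, 50)):
    """Percentage of automatically placed boundaries whose absolute
    deviation from the hand-labeled boundaries falls in each range
    (cf. Table 2). Inputs are boundary times in milliseconds."""
    dev = np.abs(np.asarray(auto_ms, float) - np.asarray(manual_ms, float))
    edges = (0,) + tuple(tolerances)
    stats = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        stats[f"{lo}-{hi}ms"] = 100.0 * np.mean((dev >= lo) & (dev < hi))
    stats[f">{edges[-1]}ms"] = 100.0 * np.mean(dev >= edges[-1])
    return stats
```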

[Fig. 1 block labels: Raw text, Text Transcription, Phoneme Parser, Phoneme Sequence; Speech Input, Feature Extraction (MFCC), Feature Vector; HMM Recognizer, Viterbi Time Alignment, Segmented Data]



Table 2: The overall segmentation results

Deviation   0-20 ms   20-30 ms   30-40 ms   40-50 ms   >50 ms
Phonemes    36.36%    24.24%     18.18%     18.18%     21.21%

4. Implementation

4.1 The implementation of the speech synthesizer is carried out in a sequence of three major stages, namely, segmentation at the phoneme level, unit (phoneme) selection, and synthesis. A block diagram of the process is shown in Fig. 2.

Fig. 2: The Principle of our phoneme-based speech synthesis system

4.2 Database Preparation and Phoneme Selection Process

The speech segmentation relevant to our system primarily segments the phonemes occurring in each speech sentence, and has been implemented using Hidden Markov Models. The input for database preparation is the set of segmented and labeled files obtained at the output of the recognizer. The database preparation module scans each segmented file for the presence of a phoneme and generates a file for that phoneme in the format phoneme.txt ("n.txt"), where 'n' is the phoneme and .txt is the extension and file type. In this study, 43 such files are generated and stored. While scanning a segmented file, the module proceeds sequentially and picks each specific speech unit, i.e., a phoneme along with its left and right context, and records its start time, end time, duration, and the name of the wave file from which the phoneme was picked. The output of the recognizer is an index representing the segmented phoneme. The unit selection module is responsible for selecting the best unit realization sequence from the many possible unit realization sequences in the database. The number of tokens generated for each phoneme in our database, along with the INSROT transliteration scheme used in this study, is shown in Table 3.
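The scan described above can be sketched as follows. Rather than writing the per-phoneme "n.txt" files, this sketch keeps the token records in memory; the input layout (one list of (phoneme, start, end) entries per wave file) and the field names are assumptions.

```python
# Database-preparation sketch: group every phoneme token with its left
# and right context, start/end time, duration, and source wave file.
from collections import defaultdict

def build_unit_database(label_files):
    """label_files: {wave_name: [(phoneme, start_s, end_s), ...]}.
    Returns {phoneme: [token record, ...]}."""
    units = defaultdict(list)
    for wave_name, segments in label_files.items():
        for i, (ph, start, end) in enumerate(segments):
            left = segments[i - 1][0] if i > 0 else "#"    # "#" = boundary
            right = segments[i + 1][0] if i < len(segments) - 1 else "#"
            units[ph].append({
                "left": left, "right": right,
                "start": start, "end": end,
                "duration": end - start,
                "wave": wave_name,
            })
    return units
```

The unit selection module can then rank the token list of each phoneme by how well the stored left/right contexts match the target sentence.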

5. Speech synthesis results

The text-to-speech test included 21 Hindi sentences. For each sentence, the speech in raw format, the pitch, and the duration were generated. Figure 4 shows the result of the generated speech for the sentence <text 001, "saevaa maai nahaii">.

Figure 4: Spectrograph and spectrogram of the synthesized speech signal for the utterance text 001

Figure 5 shows the spectrograph, spectrogram, and annotation of the corresponding part of the original signal. The annotation of fifteen sentences has been done manually using the PRAAT [13] speech tool.

[Fig. 2 block labels: Text Corpus, Recording, Pre-processing, Feature extraction, Phoneme segmentation (HMM), Speech database; Input text, Phoneme parser, Phoneme string, Unit selection or synthesizer, Synthesized speech output]



Fig. 6: Spectrograph, spectrogram and annotation of the original signal for sentence text001

6. Discussion and Conclusion

In this paper we presented the development of a T-T-S system based on the principle of concatenation of speech units, which in our study is the phoneme. The HMM approach used for phonetic segmentation is very effective in increasing the quality of the generated speech, and the intelligibility of the synthetic speech is reasonably good. However, some phonemes such as uu, d and r are missed in the synthesized speech. The reason is that the number of tokens generated for these phonemes is very small and that their segmentation error (deviation) is higher. This limitation can be overcome by recording more sentences and labeling them, so that a large number of tokens of each phoneme is generated with the required left and right phoneme contexts. In future work, in order to improve the context-dependent phone models used for synthesis, more Hindi speech material will be recorded and labeled.

Table 3: Number of tokens generated for each phoneme

INSROT*  Tokens  |  INSROT*  Tokens
K        755     |  J        220
Kh       43      |  Jh       5
G        216     |  ng'      0
Gh       17      |  nj'      0
t'       362     |  n'a      440
t'h      69      |  Na       567
D'       85      |  Ma       380
d'h      96      |  y        336
T        30      |  R        880
Th       82      |  L        294
D        290     |  l'       27
Dh       66      |  V        344
P        300     |  Sh       87
Ph       48      |  s'       208
B        168     |  S        466
Bh       77      |  h'       347
Ch       94      |  d'.      12
C'hh     12      |  d'h.     17
A        1795    |  oo       330
Aa       1106    |  ai       122
I        1284    |  au       55
ii       504     |  u        228
O        5       |  uu       67

References

[1] S. Lemmetty, "A Review of Speech Synthesis Technology", Master's Thesis, Department of Electrical and Communication Engineering, Helsinki University of Technology, Helsinki, Finland, March 1999.
[2] J. P. H. van Santen, R. W. Sproat, J. P. Olive and J. Hirschberg, "Progress in Speech Synthesis", Springer-Verlag, 1997.
[3] N. Sridhar, "T-T-S in Indian Languages".
[4] S. P. Kishore, Rajeev Sangal and M. Srinivas, "Building Hindi and Telugu Voices in Festvox", Language Technology Research Center, IIIT, Hyderabad.
[5] S. P. Kishore, R. Kumar and R. Sangal, "A Data-Driven Synthesis Approach for Indian Languages Using Syllable as Basic Unit", ICSLP 2000, China, October 2000.
[6] Samudravijaya K, P. V. S. Rao and S. S. Agrawal, "Hindi Speech Database", ICSLP 2000, China, October 2000, CDROM paper 00192.pdf.
[7] Archana Balyan, S. S. Agrawal and Amita Dev, "Building Syllable Dominated Hindi Speech Corpora for Metro Rail Information System", Proc. O-COCOSDA-2008, pp. 135-140.
[8] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[9] Sirko Molau, Michael Pitz, Ralf Schlüter and Hermann Ney, "Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum", in Proc. IEEE ICASSP, 2001.
[10] G. D. Forney, "The Viterbi Algorithm", Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, 1973.
[11] [email protected], INSROT.
[12] J. Matousek, D. Tihelka and J. Psutka, "Automatic Segmentation for Czech Concatenative Speech Synthesis Using Statistical Approach with Boundary-Specific Correction", Proceedings of Eurospeech '03, pp. 301-304, 2003.
[13] P. Boersma and D. Weenink, "Praat: A System for Doing Phonetics by Computer", (http://www.praat.org/), 2001.
