
Harmonic Plus Noise Model Based Speech Synthesis for Hindi

Sourav Nandy
Prologix Software Solutions Pvt. Ltd, Lucknow (UP) India 226016
[email protected]

Vibhu Agrawal
Prologix Software Solutions Pvt. Ltd, Lucknow (UP) India 226016
[email protected]

Abstract

In recent years speech synthesis has come into great prominence. There are two approaches to generating synthetic speech: waveform based and parameter based. The waveform based approach uses pre-recorded sentences of speech and plays parts of these sentences in a prescribed sequence to generate the desired speech output. The harmonic plus noise model (HNM) is a variant of the parameter based approach. In the parameter based approach synthetic speech is generated from parameters; there is no need for recorded wav or raw files. The harmonic plus noise model divides the spectrum of the speech into two sub-bands: one is modeled with harmonics of the fundamental frequency and the other is synthesized using random noise. The maximum voiced frequency is used to discriminate between the harmonic and noise parts, which are also known as the periodic and non-periodic parts.

1. Introduction

In HNM, the speech signal is assumed to be composed of two parts: harmonic and noise. The harmonic part accounts for the periodic components of the speech signal, while the noise part is responsible for the non-periodic components [1]. In the lower band, the signal is represented only by harmonically related sine waves with slowly varying amplitudes and frequencies.

The goal of speech synthesis is to enable a machine to transmit information orally to a user [2]. Figure 1 shows the waveform of a speech segment uttered by a female speaker (sampling frequency 16000 Hz), whereas figures 2 and 3 show the same speech segment below and above 4000 Hz respectively.

A parametric pitch-synchronous model, based on a harmonic plus noise representation of the speech signal, is used.

Figure 1. Waveform of speech segment

Figure 2. Waveform of speech segment below 4000 Hz

Figure 3. Waveform of speech segment above 4000 Hz


This twofold representation allows us to apply different modification methods to each part (harmonic and noise), yielding more natural synthesis [1]. The approach has attracted a lot of research effort in recent years, stimulated by the pioneering work of Griffin and Lim [3].

The proposed HNM assumes the speech signal to be composed of a harmonic part h(t) and a noise part n(t). For voiced speech, the signal is divided into two bands delimited by the so-called maximum voiced frequency $F_m(t)$, a time-varying parameter. The lower band of the spectrum (below the maximum voiced frequency) is represented by the harmonic part (a low-pass signal), while the upper band is represented by the noise part (a high-pass signal). Thus, the harmonic part accounts for the periodic (voiced) structure of the speech signal, which is a sum of harmonically related sinusoidal components with continuously time-varying amplitudes and phases:

$$h(t) = \sum_{k=1}^{K(t)} a_k(t)\cos\phi_k(t) \qquad (1)$$

where $a_k(t)$ and $\phi_k(t)$ are the amplitude and phase of the $k$th harmonic at time $t$.

The noise part $n(t)$ is obtained by filtering white Gaussian noise $u(t)$ through an all-pole filter $h(t,\tau)$. The filter $h(t,\tau)$ is evaluated at each time instant.

$$n(t) = u(t)\cdot h(t,\tau) \qquad (2)$$

The final synthetic speech signal $s(t)$ is the superposition of the harmonic and noise parts:

$$s(t) = h(t) + n(t) \qquad (3)$$
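To make equations (1)-(3) concrete, the following is a minimal NumPy sketch of the decomposition for a single frame. The fundamental frequency, the 1/k amplitude roll-off, the zero phases and the all-pole coefficients are all illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sampling frequency (Hz), as in Figure 1
f0 = 200.0                       # assumed fundamental frequency for illustration
t = np.arange(0, 0.02, 1 / fs)   # one 20 ms frame

# Harmonic part, eq. (1): sum of K harmonics with (here constant) amplitudes/phases
K = 10
amps = 1.0 / np.arange(1, K + 1)             # illustrative 1/k amplitude roll-off
phases = np.zeros(K)                          # illustrative zero phases
h = sum(a * np.cos(2 * np.pi * k * f0 * t + p)
        for k, (a, p) in enumerate(zip(amps, phases), start=1))

# Noise part, eq. (2): white Gaussian noise shaped by a toy all-pole filter 1/A(z)
a_lpc = np.array([1.0, -0.9])                 # assumed all-pole coefficients
u = np.random.randn(len(t))
n = lfilter([1.0], a_lpc, u)

# Final signal, eq. (3)
s = h + n
```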

2. System architecture

Relevant to this objective, the complete system architecture of the proposed system is shown below.

Speech signals are given to the initial pitch estimation module, where pitch values are determined by the auto-correlation method. These initial pitch values are given as input to the voiced/unvoiced decision making module, where the decision whether a frame is voiced or unvoiced is taken on the basis of a threshold [4]. The maximum voiced frequency (MVF) [6][7] is estimated only on the voiced frames. On the basis of the MVF, voiced frames are divided into harmonic and noise parts. The amplitudes, phases and envelope of the harmonic and noise parts are estimated in the parameter extraction module.

Figure 4. System architecture diagram of analysis

Figure 5. System architecture diagram of synthesis

In the synthesis phase, as in figure 5, the parameters extracted in the analysis phase of figure 4, i.e. the amplitudes, phases and envelope, are used to generate the harmonic and noise parts, which are given to the synthesis block that generates the output speech signal.

2.1 Initial pitch estimation

The initial pitch is determined by the auto-correlation method. It is based on the criterion of how close the synthesized speech is to the original speech [5].

$$E = \frac{\int_{-1/2}^{1/2}\left[\,|S_w(f)| - |\hat{S}_w(f)|\,\right]^2\,df}{\int_{-1/2}^{1/2}|S_w(f)|^2\,df} \qquad (4)$$

where $S_w(f)$ is the Fourier transform of a windowed segment of the speech signal $s(t)$ and $\hat{S}_w(f)$ is the Fourier transform of the synthetic speech generated from the fundamental frequency $f_0$.

Griffin and Lim [3] proposed to multiply the error criterion of equation 4 by a pitch-period-dependent correction factor:


$$E = \frac{\int_{-1/2}^{1/2}\left[\,|S_w(f)| - |\hat{S}_w(f)|\,\right]^2\,df}{\int_{-1/2}^{1/2}|S_w(f)|^2\,df\;\left[1 - P\displaystyle\sum_{t=-\infty}^{\infty} w^4(t)\right]} \qquad (5)$$

By replacing the integrals of continuous functions with summations over samples of those functions, Griffin and Lim [3] proposed an efficient method for computing equation 5; the resulting error function, used here for the initial pitch estimation, is:

$$E = \frac{\displaystyle\sum_{t=-\infty}^{\infty} s^2(t)\,w^2(t) \;-\; P\sum_{l} r(lP)}{\left[\displaystyle\sum_{t=-\infty}^{\infty} s^2(t)\,w^2(t)\right]\left[1 - P\displaystyle\sum_{t=-\infty}^{\infty} w^4(t)\right]} \qquad (6)$$

where $l$ indexes the harmonics, $P$ is the candidate pitch period (in samples), $s(t)$ is the speech signal and $w(t)$ is the analysis window, which satisfies the constraint

$$\sum_{t=-\infty}^{\infty} w^2(t) = 1 \qquad (7)$$

The function $r(k)$ is defined as

$$r(k) = \sum_{t=-\infty}^{\infty} s(t)\,w^2(t)\,s(t+k)\,w^2(t+k) \qquad (8)$$

Equation 8 is evaluated for pitch periods in the set $\left[\frac{f_s}{f_{0\,max}}, \ldots, \frac{f_s}{f_{0\,min}}\right]$, where $f_{0\,min}$ and $f_{0\,max}$ are the minimum and maximum fundamental frequencies. Typical ranges of fundamental frequency are 60 to 230 Hz for male voices and 180 to 400 Hz for female voices [1].
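As a concrete illustration of equations (6)-(8), the sketch below searches candidate pitch periods over one windowed frame. The Hann window, the frame-length requirement (at least about two pitch periods) and the default f0 search range are assumptions for the sketch, not prescriptions from the paper.

```python
import numpy as np

def initial_pitch(s, fs=16000, f0_min=60.0, f0_max=400.0):
    """Pitch-period search over one frame s using eq. (6)-(8).
    The frame should span at least ~2 pitch periods so that the
    correction term 1 - P*sum(w^4) stays positive."""
    w = np.hanning(len(s))
    w /= np.sqrt(np.sum(w ** 2))            # enforce eq. (7): sum w^2 = 1
    sw2 = s * w ** 2                         # s(t) w^2(t), reused in r(k)
    energy = np.sum((s * w) ** 2)            # sum s^2(t) w^2(t)

    def r(k):                                # eq. (8), truncated to the frame
        return np.sum(sw2[:len(s) - k] * sw2[k:])

    best_err, best_P = np.inf, None
    for P in range(int(fs / f0_max), int(fs / f0_min) + 1):
        harm = P * sum(r(l * P) for l in range(1, len(s) // P + 1))
        err = (energy - harm) / (energy * (1 - P * np.sum(w ** 4)))   # eq. (6)
        if err < best_err:
            best_err, best_P = err, P
    return best_P, fs / best_P               # pitch period (samples) and f0 (Hz)
```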

2.2 Voiced/Unvoiced decision

This phase decides whether a given segment of the speech waveform should be classified as voiced or unvoiced speech. A variety of approaches have been described in the literature for making this decision. The following are some measurements used for making the voiced/unvoiced decision:

• Energy of the signal

• Zero crossing rate of the signal

• Auto-correlation

• Linear Prediction Coefficients

In this work the auto-correlation method is used for making this decision. Using the initial pitch estimate, a synthetic signal $\hat{s}(t)$ is generated as the sum of harmonically related sinusoids with amplitudes and phases estimated by the fast Fourier transform (FFT). The FFT of the original signal is evaluated, and the synthetic spectrum is generated by taking the magnitudes of the FFT values as amplitudes and the angles of the FFT values as phases, at the harmonics of the estimated initial pitch. Denoting the synthetic spectrum by $|\hat{S}(f)|$ and the original spectrum by $|S(f)|$, the voiced/unvoiced decision is made by comparing the error over the first four harmonics of the estimated fundamental frequency to a given threshold (5 dB in this case):

$$E = \frac{\int_{0.7 f_0}^{4.3 f_0}\left(\,|S(f)| - |\hat{S}(f)|\,\right)^2\,df}{\int_{0.7 f_0}^{4.3 f_0}|S(f)|^2\,df} \qquad (9)$$

where $f_0$ is the initial fundamental frequency,

$$\text{Fundamental Frequency} = \frac{\text{Sampling Frequency}}{\text{Pitch Period}} \qquad (10)$$

The voiced/unvoiced decision is made by comparing the value of $E$ with the threshold: if $E$ is less than the threshold the frame is voiced, otherwise it is unvoiced.
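The decision rule of equation (9) can be sketched as follows, assuming the original and synthetic magnitude spectra have already been computed on a common FFT frequency grid (e.g. with np.fft.rfft and np.fft.rfftfreq); the function name and the dB conversion are assumptions of this sketch.

```python
import numpy as np

def is_voiced(orig_spec, synth_spec, freqs, f0, threshold_db=5.0):
    """Eq. (9): normalised spectral error over 0.7*f0 .. 4.3*f0 (the first
    four harmonics), compared in dB to the threshold; below the threshold
    the frame is declared voiced."""
    band = (freqs >= 0.7 * f0) & (freqs <= 4.3 * f0)
    num = np.sum((np.abs(orig_spec[band]) - np.abs(synth_spec[band])) ** 2)
    den = np.sum(np.abs(orig_spec[band]) ** 2)
    return 10.0 * np.log10(num / den) < threshold_db
```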

2.3 MVF estimation

The maximum voiced frequency is the frequency which delimits the harmonic and noise parts: the signal below the maximum voiced frequency is considered the harmonic part, and the signal above it is the noise part. To estimate the maximum voiced frequency $F_m$, the largest sine-wave amplitude (peak) is searched for in each frequency range $\left[\frac{(2n-1)f_0}{2}, \frac{(2n+1)f_0}{2}\right]$ over the complete frame.

The following conditions are applied when determining the MVF:

if

$$\frac{A_{mc}(f_c)}{A_{mc}(f_i)} > 2 \qquad (11)$$

or

$$A_m(f_c) - \max A_m(f_i) > 10\,\text{dB} \qquad (12)$$

then if


$$\frac{|f_c - L f_0|}{L f_0} < 15\% \qquad (13)$$

where $f_c$ is the frequency at which the frequency range has its maximum amplitude and $f_i$ are the frequencies of the other peaks. $A_{mc}(f_c)$ is the summation of the amplitude values from the valley preceding $f_c$ to the valley following it, and $A_{mc}(f_i)$ is the mean over the complete frequency range. $A_m(f_c)$ is the amplitude value at $f_c$, $A_m(f_i)$ are the amplitude values at the other peaks, and $L$ is the number of harmonics in the given frequency range.

If the above conditions are satisfied then $f_c$ is the MVF of that frequency range. The maximum value of $f_c$ over all the frequency ranges of the frame is the MVF of the frame and is denoted by $F_m$.
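An illustrative sketch of this band-by-band test is given below. The peak-picking details are simplified: a single peak-to-mean contrast in dB stands in for conditions (11) and (12), the harmonic centre n*f0 stands in for L*f0 in condition (13), and the number of bands scanned is an arbitrary cap.

```python
import numpy as np

def estimate_mvf(mag_db, freqs, f0, n_bands=40):
    """Scan the bands [(2n-1)f0/2, (2n+1)f0/2]; the highest band whose peak
    passes the (simplified) conditions gives the frame's MVF, F_m."""
    mvf = 0.0
    for n in range(1, n_bands + 1):
        lo, hi = (2 * n - 1) * f0 / 2, (2 * n + 1) * f0 / 2
        band = (freqs >= lo) & (freqs < hi)
        if not np.any(band):
            break
        a = mag_db[band]
        fc = freqs[band][np.argmax(a)]                       # frequency of the largest peak
        stands_out = np.max(a) - np.mean(a) > 10.0           # stand-in for (11)/(12), in dB
        near_harmonic = abs(fc - n * f0) / (n * f0) < 0.15   # condition (13)
        if stands_out and near_harmonic:
            mvf = max(mvf, fc)
    return mvf
```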

3. Parameters estimation

There are two types of parameters to be estimated: harmonic parameters and noise parameters. For the harmonic parameters, FFT bins are used, i.e. the peak values of the FFT spectrum with respect to the local pitch period, whereas for the noise parameters the envelope of the noise part is estimated.

The harmonic parameters are further divided into two parts: phases and amplitudes.

$$h(t) = \sum_{k=-L}^{L} a_k(t)\cos\phi_k(t) \qquad (14)$$

where $L$ is the number of harmonics, and $a_k$ and $\phi_k$ are the amplitudes and phases respectively.

$$\phi_k(t) = \phi_k(t_a^i) + 2\pi k f_0\,(t - t_a^i) \qquad (15)$$

The harmonic part can also be written as a sum of exponential functions, so equation 14 can be written as

$$h(t) = \sum_{k=-L}^{L} A_k(t)\,e^{\,j2\pi k f_0 (t - t_a^i)} \qquad (16)$$

where $t_a^i$ is the analysis time instant.

Converting equation 16 into matrix form,

$$h = Bx \qquad (17)$$

where $B$ is a $(2N+1)\times(2L+1)$ matrix

$$B = [\,b_{-L}\;\; b_{-L+1}\;\; b_{-L+2}\;\; \cdots\;\; b_{L}\,] \qquad (18)$$

$$b_k = \left[\,e^{\,j2\pi k f_0(-N)}\;\; e^{\,j2\pi k f_0(-N+1)}\;\; \cdots\;\; e^{\,j2\pi k f_0(N)}\,\right]^T \qquad (19)$$

where $T$ denotes transpose, and $x$ is a $(2L+1)\times 1$ vector which contains the amplitude values

$$x = [\,A_{-L}\;\; A_{-L+1}\;\; A_{-L+2}\;\; \cdots\;\; A_{L}\,]^T \qquad (20)$$

The solution to this least-squares problem is then given by the normal equation

$$\left(B^T W^T W B\right)x = B^T W^T W s \qquad (21)$$

where $W$ is a $(2N+1)\times(2N+1)$ diagonal matrix whose diagonal elements form the weight vector

$$w^T = [\,w(-N)\;\; w(-N+1)\;\; w(-N+2)\;\; \cdots\;\; w(N)\,] \qquad (22)$$

and $s$ is a $(2N+1)\times 1$ vector which contains the original samples

$$s^T = [\,s(-N)\;\; s(-N+1)\;\; s(-N+2)\;\; \cdots\;\; s(N)\,] \qquad (23)$$

Equation 21 can be rewritten as

$$Rx = P \qquad (24)$$

where $R = B^T W^T W B$ and $P = B^T W^T W s$. $R$ is a $(2L+1)\times(2L+1)$ matrix whose element $r_{ik}$ is given by

$$r_{ik} = \sum_{t=-N}^{N} w^2(t)\,e^{\,j2\pi(i-L-1)f_0 t \, - \, j2\pi(k-L-1)f_0 t} \qquad (25)$$

where $i = 1,\ldots,(2L+1)$ and $k = 1,\ldots,(2L+1)$. $P$ is a $(2L+1)\times 1$ vector whose element $p_k$ is given by

$$p_k = \sum_{t=-N}^{N} w^2(t)\,s(t)\,e^{-j2\pi(k-L-1)f_0 t} \qquad (26)$$

Since the complete work is based on auto-correlation, the interaction between the harmonics is insignificant; therefore $r_{ik} = 0$ for all $i \neq k$, and for $i = k$ equation 25 becomes

$$r_{kk} = \sum_{t=-N}^{N} w^2(t) \qquad (27)$$


Equation 24 can then be written as

$$x = R^{-1}P \qquad (28)$$

(an element-wise division, since $R$ is diagonal). Substituting the values of $x$, $P$ and $R$ into equation 28 gives

$$A_k = \frac{\displaystyle\sum_{t=-N}^{N} w^2(t)\,s(t)\,e^{-j2\pi k f_0 t}}{\displaystyle\sum_{t=-N}^{N} w^2(t)} \qquad (29)$$

The above estimate is equivalent to the amplitude estimated by the peak-picking algorithm, which is based on the FFT. One important point to note is that the complete estimation is done in the time domain.
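Equation (29) translates almost directly into code: the complex harmonic amplitudes are window-weighted inner products of the frame with the harmonic exponentials, computed entirely in the time domain. The Hann window and the centring of the frame on the analysis pitch mark are assumptions of this sketch.

```python
import numpy as np

def harmonic_amplitudes(frame, f0, fs, num_harmonics):
    """Eq. (29): A_k = sum w^2 s e^{-j2*pi*k*f0*t} / sum w^2, t in [-N, N]."""
    t = (np.arange(len(frame)) - (len(frame) - 1) / 2) / fs   # time axis centred on pitch mark
    w2 = np.hanning(len(frame)) ** 2                          # squared analysis window
    denom = np.sum(w2)
    A = np.array([np.sum(w2 * frame * np.exp(-2j * np.pi * k * f0 * t)) / denom
                  for k in range(1, num_harmonics + 1)])
    return np.abs(A), np.angle(A)        # amplitudes a_k and phases phi_k of eq. (14)
```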

Since the maximum voiced frequency is used to discriminate between the harmonic and noise parts, for each frame the spectral density function of the original signal is modeled by a $p$th-order all-pole filter $H(t,\tau)$, using a standard correlation-based method [11]. The noise part $n(t)$ is obtained by filtering white Gaussian noise $u(t)$ through the all-pole filter $h(t,\tau)$, which is evaluated at each time instant.

$$n(t) = u(t)\cdot h(t,\tau) \qquad (30)$$

The linear prediction (LP) envelope of the noise part is determined; instead of the complete noise part, only the envelope values are used to regenerate the noise part at synthesis. For a signal sampled at 16 kHz the order of the filter is set to 12.
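A sketch of the correlation-based LP analysis implied by [11] is shown below, with the 12th-order filter mentioned for 16 kHz signals. The Toeplitz solve stands in for the Levinson-Durbin recursion, and isolating the band above the MVF (e.g. by high-pass filtering before this step) is assumed to have been done already.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_envelope(noise_band, order=12):
    """Autocorrelation-method LPC: returns all-pole coefficients a (a[0] = 1)
    and gain g, so the noise envelope is modelled as g / A(z)."""
    r = np.correlate(noise_band, noise_band, mode="full")[len(noise_band) - 1:]
    coeffs = solve_toeplitz(r[:order], r[1:order + 1])    # Yule-Walker equations
    a = np.concatenate(([1.0], -coeffs))                  # A(z) = 1 - sum a_i z^-i
    g = np.sqrt(max(r[0] + np.dot(a[1:], r[1:order + 1]), 1e-12))  # prediction error power
    return a, g
```

At synthesis, the noise part of a frame can then be regenerated, for example, as `lfilter([g], a, np.random.randn(frame_len))`.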

4. Speech synthesis

Synthesis is performed in a pitch-synchronous way, as in analysis. The synthesis time instants $t_s^i$ (synthesis pitch marks) coincide with the analysis time instants $t_a^i$ (analysis pitch marks) when no prosodic modification is done. Since a set of amplitudes, phases and a fundamental frequency is estimated for each voiced frame, the harmonic part of the $i$th frame is generated using:

$$h(t) = \sum_{k=0}^{L(t_s^i)} a_k(t_s^i)\cos\!\left(\phi_k(t_s^i) + 2\pi k f_0(t_s^i)\,t\right) \qquad (31)$$

where $t = 0, 1, \ldots, N$ and $N$ is the length of the synthesis frame, which is the integer closest to the local pitch period at the time instant $t_s^i$.
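A per-frame sketch of equation (31) follows; the amplitudes, phases and f0 are assumed to come from the analysis stage (e.g. the harmonic_amplitudes sketch above), and the DC (k = 0) term is omitted here.

```python
import numpy as np

def synthesize_harmonic_frame(amps, phases, f0, fs):
    """Eq. (31): one pitch-synchronous harmonic frame of N samples,
    where N is the integer closest to the local pitch period."""
    N = int(round(fs / f0))                  # local pitch period in samples
    t = np.arange(N) / fs
    h = np.zeros(N)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        h += a * np.cos(p + 2 * np.pi * k * f0 * t)
    return h
```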

Synthesis of the noise part is done by filtering white Gaussian noise through an all-pole filter. The difference between white Gaussian noise and plain (uniform) random numbers, in simple terms, is that white Gaussian noise is zero-mean, roughly 0.5 less than uniform random numbers drawn from [0, 1], and takes values in both the positive and negative domains. The filter described in this work is an all-pole filter, i.e. a filter without zeros. In general, a rational transfer function has the form:

$$T(x) = \frac{Z(x)}{P(x)} \qquad (32)$$

where $Z(x)$ contains the zeros of the system and $P(x)$ the poles of the system. An all-pole filter has a frequency response that goes to infinity (poles) at specific frequencies, but there are no frequencies where the response is zero. Basically, the filter function (also called the transfer function) is a ratio with a constant in the numerator and a polynomial in the denominator.
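As a small numerical illustration of such an all-pole response (the coefficients are illustrative only, not estimated from speech):

```python
import numpy as np
from scipy.signal import freqz

b = [1.0]                            # constant numerator: no zeros
a = [1.0, -1.6, 0.95]                # two complex-conjugate poles -> one resonance
w, H = freqz(b, a, worN=512, fs=16000)
peak_hz = w[np.argmax(np.abs(H))]    # frequency where the response peaks (near the pole)
print(f"resonance near {peak_hz:.0f} Hz")
```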

Therefore, for a voiced or unvoiced frame, the noise part is generated by

$$n(t) = u(t)\cdot h(t,\tau) \qquad (33)$$

where $u(t)$ is white Gaussian noise of the same length as the voiced/unvoiced frame and $h(t,\tau)$ is the all-pole filter.

The final synthetic speech signal $s(t)$ is the summation of the harmonic and noise parts, obtained by summing equations 31 and 33:

$$s(t) = h(t) + n(t) \qquad (34)$$
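Tying the earlier sketches together, a minimal frame loop might look as follows. It reuses synthesize_harmonic_frame from above, assumes the listed per-frame parameters were produced by the analysis stage, and simply concatenates frames (no overlap-add), which is a simplification rather than the paper's exact procedure.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(frames, fs=16000):
    """frames: list of dicts with keys 'voiced', 'f0', 'amps', 'phases',
    'lpc_a', 'lpc_g' (assumed analysis outputs)."""
    out = []
    for fr in frames:
        n_len = int(round(fs / fr["f0"])) if fr["voiced"] else 160   # 10 ms fallback
        noise = lfilter([fr["lpc_g"]], fr["lpc_a"], np.random.randn(n_len))  # eq. (33)
        if fr["voiced"]:
            harm = synthesize_harmonic_frame(fr["amps"], fr["phases"], fr["f0"], fs)
            out.append(harm + noise)                                  # eq. (34)
        else:
            out.append(noise)
    return np.concatenate(out)
```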

5. Summary

The harmonic plus noise model decomposes a speech signal into its harmonic and random components. This decomposition provides a framework in which these signal components can be processed independently, allowing us to exploit the properties inherent in each one. In this paper we have used this model to yield more natural synthesis. We synthesized 20 different sentences and took the opinions of 15 persons around the organization. The MOS (mean opinion score) came to around 95% for intelligibility and 85% for naturalness. As a result, the synthetic speech produced by the proposed method gives more intelligibility and naturalness than other model based methods.


References

[1] Y. Stylianou, "Harmonic plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification," PhD Thesis, 1996.
[2] L. R. Rabiner, "Applications of Voice Processing to Telecommunications," Proc. IEEE, 82(2): 199-228, February 1994.
[3] D. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-36(2): 236-243, February 1988.
[4] B. Yegnanarayana, C. d'Alessandro and V. Darsinos, "An Iterative Algorithm for Decomposition of Speech Signals into Periodic and Aperiodic Components," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 1-11, January 1998.
[5] P. K. Lehana, P. C. Pandey and R. Gupta, "Use of Harmonic plus Noise Model for Reduction of Self Leakage in Electrolaryngeal Speech," International Conference on Systemics, Cybernetics and Informatics, February 2004.
[6] Y. Stylianou, "Modeling Speech Based on Harmonic plus Noise Models," LNCS, vol. 3445/2005, pp. 244-260, July 2005.
[7] Y. Stylianou, "A Pitch and Maximum Voiced Frequency Estimation Technique Adapted to Harmonic Models of Speech," IEEE Nordic Signal Processing Symposium, 1996.
[8] E. K. Kim, W. J. Han and Y. H. Oh, "A New Band Splitting Method for Two-Band Speech Model," IEEE Signal Processing Letters, vol. 8, no. 12, December 2001.
[9] M. T. Nagy, G. Rozinaj and A. Palenik, "A Hybrid Pitch Period Estimation Method Based on the HNM Model," 49th International Symposium ELMAR, September 2007.
[10] H. Gu and Y. Zhou, "Mandarin Syllable Signal Synthesis Using an HNM Based Scheme," ICALIP, IEEE 978-1-4244-1724-7, July 2008.
[11] S. M. Kay, "Modern Spectral Estimation," Prentice Hall, Englewood Cliffs, New Jersey, 1988.
