Harmonic Plus Noise Model Based Speech Synthesis for Hindi

Sourav Nandy, Prologix Software Solutions Pvt. Ltd, Lucknow (UP) India, [email protected]
Vibhu Agrawal, Prologix Software Solutions Pvt. Ltd, Lucknow (UP) India 226016, [email protected]
Abstract
In recent years speech synthesis has come into great prominence. There are two approaches to generating synthetic speech: waveform based and parameter based. The waveform based approach uses pre-recorded sentences of speech and plays parts of these sentences in a prescribed sequence to generate the desired speech output. The harmonic plus noise model (HNM) is a variant of the parameter based approach, in which synthetic speech is generated from parameters alone; there is no need for recorded wav or raw files. HNM divides the spectrum of the speech into two sub-bands: one is modeled with harmonics of the fundamental frequency and the other is synthesized using random noise. The maximum voiced frequency is used to discriminate between the harmonic and noise parts, also known as the periodic and non-periodic parts.
1. Introduction
In HNM, the speech signal is assumed to be composed of two parts: harmonic and noise. The harmonic part accounts for the periodic components of the speech signal while the noise part is responsible for the non-periodic components [1]. In the lower band, the signal is represented only by harmonically related sine waves with slowly varying amplitudes and frequencies.
The goal of speech synthesis is to enable a machine to transmit information orally to a user [2]. Figure 1 shows the waveform of a speech segment uttered by a female speaker (sampling frequency 16000 Hz), while figures 2 and 3 show the same speech segment below and above 4000 Hz respectively.
Figure 1. Waveform of speech segment
Figure 2. Waveform of speech segment below 4000 Hz
Figure 3. Waveform of speech segment above 4000 Hz

978-1-4244-5858-5/10/$26.00 ©2010 IEEE ICALIP2010

A parametric pitch-synchronous model, based on a harmonic plus noise representation of the speech signal, is used. This twofold representation allows us to apply different modification methods to each part (harmonic and noise), yielding more natural synthesis [1]. This has attracted a lot of research effort in recent years, stimulated by the pioneering work of Griffin and Lim [3].
The proposed HNM assumes the speech signal to be composed of a harmonic part h(t) and a noise part n(t). For voiced speech, the signal is divided into two bands delimited by the so-called maximum voiced frequency Fm(t), a time-varying parameter. The lower band of the spectrum (below the maximum voiced frequency) is represented by the harmonic part (low-pass signal), while the upper band is represented by the noise part (high-pass signal). Thus, the harmonic part accounts for the periodic (voiced) structure of the speech signal, which is a sum of harmonically related sinusoidal components with continuously time-varying amplitudes and phases:
h(t) = \sum_{k=1}^{K(t)} a_k(t) \cos\phi_k(t)    (1)

where a_k(t) and \phi_k(t) are the amplitude and phase at time t of the k-th harmonic.
The noise part n(t) is obtained by filtering white Gaussian noise u(t) with an all-pole filter h(t, \tau), which is evaluated at each time instant:

n(t) = u(t) \ast h(t, \tau)    (2)

The final synthetic speech signal s(t) is the superposition of the harmonic and noise parts:

s(t) = h(t) + n(t)    (3)
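To make the model of equations (1)-(3) concrete, the following sketch builds one synthetic frame as a harmonic sum plus shaped noise. All numeric choices (the pitch, number of harmonics, amplitudes, and the one-pole smoother standing in for h(t, τ)) are illustrative, not values from the paper.

```python
import numpy as np

fs = 16000                      # sampling frequency (Hz)
f0 = 200.0                      # illustrative fundamental frequency (Hz)
K = 10                          # illustrative number of harmonics
N = 320                         # frame length (20 ms at 16 kHz)
t = np.arange(N) / fs

# Harmonic part, eq. (1): h(t) = sum_k a_k cos(2*pi*k*f0*t + phi_k)
rng = np.random.default_rng(0)
amps = 1.0 / np.arange(1, K + 1)          # decaying amplitudes (illustrative)
phases = rng.uniform(-np.pi, np.pi, K)
h = np.zeros(N)
for k in range(1, K + 1):
    h += amps[k - 1] * np.cos(2 * np.pi * k * f0 * t + phases[k - 1])

# Noise part, eq. (2): white Gaussian noise shaped by a simple all-pole
# recursion y[n] = x[n] + 0.9*y[n-1], standing in for the filter h(t, tau)
u = rng.standard_normal(N)
n = np.zeros(N)
for i in range(N):
    n[i] = u[i] + (0.9 * n[i - 1] if i > 0 else 0.0)

# Final synthetic frame, eq. (3)
s = h + n
```

The two parts are generated independently and only summed at the end, which is exactly what lets HNM modify them independently.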
2. System architecture
In line with this objective, the complete system architecture of the proposed system is shown below. Speech signals are given to the initial pitch estimation module, where pitch values are determined by the auto-correlation method. These initial pitch values are given as input to the voiced/unvoiced decision module, where a decision is taken, on the basis of a threshold, whether each frame is voiced or unvoiced [4]. The maximum voiced frequency (MVF) [6][7] is estimated only on the voiced frames. On the basis of the MVF, voiced frames are divided into harmonic and noise parts. The amplitudes, phases and envelopes of the harmonic and noise parts are estimated in the parameter extraction module.
Figure 4. System architecture diagram of analysis
Figure 5. System architecture diagram of synthesis
In the synthesis phase, as in figure 5, the parameters extracted in the analysis phase of figure 4, i.e. the amplitudes, phases and envelopes, are used to generate the harmonic and noise parts, which are given to the synthesis block that generates the output speech signal.
2.1 Initial pitch estimation
The initial pitch is determined by the auto-correlation method, based on a criterion measuring how close the synthesized speech is to the original speech [5]:

E = \frac{\int_{-1/2}^{1/2} \left[ |S_w(f)| - |\hat{S}_w(f)| \right]^2 df}{\int_{-1/2}^{1/2} |S_w(f)|^2 df}    (4)

where S_w(f) is the Fourier transform of a windowed segment of the speech signal s(t) and \hat{S}_w(f) is the Fourier transform of the synthetic speech generated with fundamental frequency f_0.
Griffin and Lim [3] proposed to multiply the error criterion of equation 4 by a pitch-period-dependent correction factor:

E = \frac{\int_{-1/2}^{1/2} \left[ |S_w(f)| - |\hat{S}_w(f)| \right]^2 df}{\int_{-1/2}^{1/2} |S_w(f)|^2 df \left[ 1 - P \sum_{t=-\infty}^{\infty} w^4(t) \right]}    (5)
By replacing the integrals of continuous functions with summations over samples of these functions, Griffin and Lim [3] proposed an efficient method for computing equation 5; this is the error function used for the initial pitch estimation:

E = \frac{\sum_{t=-\infty}^{\infty} s^2(t) w^2(t) - P \sum_{l=-\infty}^{\infty} r(lP)}{\left[ \sum_{t=-\infty}^{\infty} s^2(t) w^2(t) \right] \left[ 1 - P \sum_{t=-\infty}^{\infty} w^4(t) \right]}    (6)

where l indexes the harmonics, s(t) is the speech signal, and w(t) is the analysis window, which satisfies the constraint

\sum_{t=-\infty}^{\infty} w^2(t) = 1    (7)

The function r(k) is defined as

r(k) = \sum_{t=-\infty}^{\infty} s(t) w^2(t)\, s(t+k) w^2(t+k)    (8)
Equation 8 is evaluated for pitch periods in the set [f_s/f_{0max}, \ldots, f_s/f_{0min}], where f_{0min} and f_{0max} are the minimum and maximum fundamental frequencies. Typical values are 60 to 230 Hz for a male voice and 180 to 400 Hz for a female voice [1].
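The pitch search above can be sketched as follows: the windowed autocorrelation r(k) of equation 8 is evaluated over the candidate pitch periods [fs/f0max, fs/f0min] and the best lag is kept. This is a simplified stand-in that maximizes r(k) directly rather than evaluating the full error of equation 6; the test signal and the Hann window are illustrative choices.

```python
import numpy as np

fs = 16000
f0_true = 100.0                          # known pitch of the test signal
N = 800
t = np.arange(N) / fs
s = np.cos(2 * np.pi * f0_true * t) + 0.5 * np.cos(2 * np.pi * 2 * f0_true * t)

w = np.hanning(N)
w /= np.sqrt(np.sum(w ** 2))             # normalise so sum w^2 = 1, eq. (7)
sw2 = s * w ** 2                         # the s(t) w^2(t) factor of eq. (8)

# Candidate pitch periods, using the male-voice range from the text
f0_min, f0_max = 60.0, 230.0
lags = np.arange(int(fs / f0_max), int(fs / f0_min) + 1)

# r(k) of eq. (8) for each candidate lag
r = np.array([np.sum(sw2[:-k] * sw2[k:]) for k in lags])

pitch_period = lags[np.argmax(r)]        # lag with the strongest periodicity
f0_est = fs / pitch_period               # fundamental frequency, cf. eq. (10)
```

For a 100 Hz test tone the recovered period lands near 160 samples, i.e. close to fs/f0.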
2.2 Voiced/Unvoiced decision
This phase decides whether a given segment of the speech waveform should be classified as voiced or unvoiced speech. A variety of approaches have been described in the literature for making this decision, based on measurements such as:
• Energy of the signal
• Zero crossing rate of the signal
• Auto-correlation
• Linear Prediction Coefficients
In this work the auto-correlation method is used for making this decision. Using the initial pitch estimate, a synthetic signal \hat{s}(t) is generated as the sum of harmonically related sinusoids, with amplitudes and phases estimated by the fast Fourier transform (FFT): the FFT of the original signal is evaluated, and the synthetic spectrum is generated by taking the magnitudes of the FFT values as amplitudes and the angles of the FFT values as phases, at the estimated initial pitch values. Denoting by |\hat{S}(f)| the synthetic spectrum and by |S(f)| the original spectrum, the voiced/unvoiced decision is made by comparing the error over the first four harmonics of the estimated fundamental frequency with a given threshold (5 dB in this case):
E = \frac{\int_{0.7 f_0}^{4.3 f_0} \left( |S(f)| - |\hat{S}(f)| \right)^2 df}{\int_{0.7 f_0}^{4.3 f_0} |S(f)|^2 df}    (9)

where f_0 is the initial fundamental frequency:

f_0 = \frac{\text{Sampling Frequency}}{\text{Pitch Period}}    (10)
The voiced/unvoiced decision is made by comparing the value of E with the threshold: if E is less than the threshold the frame is voiced, else it is unvoiced.
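The decision of equation 9 can be sketched with a small helper that compares the in-band spectral error against the threshold. The helper, its inputs, and the reading of the 5 dB threshold (error at least 5 dB below the in-band energy counts as voiced) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def vuv_decision(orig_mag, synth_mag, f0_bin, threshold_db=5.0):
    """Voiced/unvoiced decision per eq. (9): relative spectral error over
    roughly [0.7*f0, 4.3*f0], compared with a dB threshold."""
    lo, hi = int(0.7 * f0_bin), int(4.3 * f0_bin) + 1
    num = np.sum((orig_mag[lo:hi] - synth_mag[lo:hi]) ** 2)
    den = np.sum(orig_mag[lo:hi] ** 2)
    err_db = 10.0 * np.log10(num / den + 1e-12)
    return err_db < -threshold_db        # small error -> voiced

# A frame whose spectrum the harmonic model matches well: voiced.
spec = np.zeros(256)
spec[[10, 20, 30, 40]] = [1.0, 0.8, 0.6, 0.4]      # harmonics of bin 10
good_fit = spec * 0.98
voiced = vuv_decision(spec, good_fit, f0_bin=10)

# A noise-like frame where the harmonic model fits poorly: unvoiced.
rng = np.random.default_rng(1)
noisy = np.abs(rng.standard_normal(256))
bad_fit = np.zeros(256)
bad_fit[[10, 20, 30, 40]] = 1.0
unvoiced = vuv_decision(noisy, bad_fit, f0_bin=10)
```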
2.3 MVF estimation
The maximum voiced frequency is the frequency which delimits the harmonic and noise parts: the signal below the maximum voiced frequency is considered the harmonic part, and the signal above it is noise. To estimate the maximum voiced frequency F_m, the largest sine-wave amplitude (peak) is searched for in each frequency range [\frac{(2n-1) f_0}{2}, \frac{(2n+1) f_0}{2}] over the complete frame.
The following conditions are applied to determine the MVF:

if \frac{Amc(f_c)}{Amc(f_i)} > 2    (11)

or

Am(f_c) - \max Am(f_i) > 10\,\text{dB}    (12)

then if

\frac{|f_c - L f_0|}{L f_0} < 15\%    (13)
where f_c is the frequency at which the frequency range has its maximum amplitude and f_i are the frequencies of the other peaks. Amc(f_c) is the summation of the amplitude values from the valley preceding f_c to the valley following it, and Amc(f_i) is the mean over the complete frequency range. Am(f_c) is the amplitude value at f_c, Am(f_i) are the amplitude values at the peaks other than f_c, and L is the number of harmonics in the given frequency range. If the above conditions are satisfied then f_c is the MVF of that frequency range, and the maximum f_c over all the frequency ranges of the frame is the MVF of the frame, denoted F_m.
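Conditions (11)-(13) can be sketched as a predicate on one candidate peak: the peak must dominate its band (condition 11 or 12) and lie within 15% of the nearest harmonic of f_0 (condition 13). The helper and its inputs are illustrative; in particular, choosing L as the harmonic number nearest to f_c is an assumption about how L is picked.

```python
def is_voiced_peak(fc, amc_fc, amc_mean, am_fc_db, am_others_db, f0):
    """Accept fc as a voiced (harmonic) peak per conditions (11)-(13)."""
    cond_11 = amc_fc / amc_mean > 2.0                  # eq. (11)
    cond_12 = am_fc_db - max(am_others_db) > 10.0      # eq. (12)
    L = max(1, round(fc / f0))                         # nearest harmonic number
    cond_13 = abs(fc - L * f0) / (L * f0) < 0.15       # eq. (13)
    return (cond_11 or cond_12) and cond_13

# A peak near the 5th harmonic of f0 = 200 Hz that dominates its band:
accepted = is_voiced_peak(fc=1010.0, amc_fc=5.0, amc_mean=1.0,
                          am_fc_db=40.0, am_others_db=[20.0, 25.0], f0=200.0)

# A peak midway between harmonics is rejected even though it dominates:
rejected = is_voiced_peak(fc=300.0, amc_fc=5.0, amc_mean=1.0,
                          am_fc_db=40.0, am_others_db=[20.0, 25.0], f0=200.0)
```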
3. Parameter estimation

Two types of parameters are estimated: harmonic parameters and noise parameters. For the harmonic parameters, FFT bins are used, i.e. the peak values of the FFT spectrum with respect to the local pitch period, whereas for the noise parameters the envelope of the noise part is estimated. The harmonic parameters are in turn divided into two parts: phases and amplitudes.
h(t) = \sum_{k=-L}^{L} a_k(t) \cos\phi_k(t)    (14)

where L is the number of harmonics and a_k and \phi_k are the amplitudes and phases respectively, with

\phi_k(t) = \phi_k(t_a^i) + 2\pi k f_0(t_a^i)\, t    (15)
The harmonic part can also be written as a sum of exponential functions, so equation 14 becomes

h(t) = \sum_{k=-L}^{L} A_k(t)\, e^{j 2\pi k f_0 (t - t_a^i)}    (16)

where t_a^i is the analysis time instant. Converting equation 16 into matrix form,

h = Bx    (17)

where B is a (2N+1) \times (2L+1) matrix

B = [\, b_{-L} \;\; b_{-L+1} \;\; b_{-L+2} \;\; \ldots \;\; b_L \,]    (18)

b_k = [\, e^{j 2\pi k f_0 (-N)} \;\; e^{j 2\pi k f_0 (-N+1)} \;\; \ldots \;\; e^{j 2\pi k f_0 (N)} \,]^T    (19)

where T denotes transpose, and x is a (2L+1) \times 1 vector containing the amplitude values

x = [\, A_{-L} \;\; A_{-L+1} \;\; A_{-L+2} \;\; \ldots \;\; A_L \,]^T    (20)
The solution to the least-squares problem is then given by the normal equations

(B^T W^T W B)\, x = B^T W^T W s    (21)

where W is a (2N+1) \times (2N+1) diagonal matrix whose diagonal elements form the weight vector

[\, w(-N) \;\; w(-N+1) \;\; w(-N+2) \;\; \ldots \;\; w(N) \,]    (22)

and s is a (2N+1) \times 1 vector containing the original data

s^T = [\, s(-N) \;\; s(-N+1) \;\; s(-N+2) \;\; \ldots \;\; s(N) \,]    (23)

Equation 21 can be rewritten as

R x = P    (24)

where R = B^T W^T W B and P = B^T W^T W s; R is a (2L+1) \times (2L+1) matrix.
Its element r_{ik} is given by

r_{ik} = \sum_{t=-N}^{N} w^2(t)\, e^{j 2\pi (i-L-1) f_0 t - j 2\pi (k-L-1) f_0 t}    (25)

where i = 1, \ldots, (2L+1) and k = 1, \ldots, (2L+1). P is a (2L+1) \times 1 vector whose element p_k is given by

p_k = \sum_{t=-N}^{N} w^2(t)\, s(t)\, e^{-j 2\pi (k-L-1) f_0 t}    (26)

Since the complete work is based on auto-correlation, the interaction between the harmonics is insignificant, so r_{ik} = 0 for all i \neq k, and for i = k equation 25 becomes

r_{kk} = \sum_{t=-N}^{N} w^2(t)    (27)
Equation 24 can then be written as

x = \frac{P}{R}    (28)

and putting the values of x, P and R into equation 28,

A_k = \frac{\sum_{t=-N}^{N} w^2(t)\, s(t)\, e^{-j 2\pi k f_0 t}}{\sum_{t=-N}^{N} w^2(t)}    (29)

This estimate is equivalent to the amplitude estimated by the FFT-based peak-picking algorithm; one important point to note is that the complete estimation is done in the time domain.
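Equation 29 can be checked numerically: for a normalised analysis window, A_k reduces to a windowed inner product of the frame with e^{-j 2\pi k f_0 t}, and recovering the amplitude of a known test tone should return that amplitude. The frame length, window choice and tone parameters below are illustrative.

```python
import numpy as np

fs = 16000
f0 = 100.0                        # fundamental (Hz); k*f0 is the k-th harmonic
Nh = 160                          # half frame length, samples t = -N..N
t = np.arange(-Nh, Nh + 1) / fs

true_amp, k = 0.7, 3              # a single harmonic at 300 Hz
s = true_amp * np.cos(2 * np.pi * k * f0 * t)

w = np.hanning(2 * Nh + 1)
w2 = w ** 2

# eq. (29): A_k = sum w^2 s e^{-j 2 pi k f0 t} / sum w^2
Ak = np.sum(w2 * s * np.exp(-2j * np.pi * k * f0 * t)) / np.sum(w2)
est_amp = 2.0 * np.abs(Ak)        # a real cosine splits its energy over +/- k
```

The factor of 2 accounts for the fact that a real cosine contributes equal complex amplitude at harmonics +k and -k in the two-sided sum of equation 16.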
The maximum voiced frequency is used to discriminate the harmonic and noise parts. For each frame, the spectral density function of the original signal is modeled by a p-th order all-pole filter h(t, \tau), using a standard correlation-based method [11]. The noise part n(t) is obtained by filtering white Gaussian noise u(t) with this filter, evaluated at each time instant:

n(t) = u(t) \ast h(t, \tau)    (30)

The linear prediction (LP) envelope of the noise part is determined; instead of the complete noise part, only the envelope values are used to regenerate the noise at synthesis. For a signal sampled at 16 kHz, the order of the filter is set to 12.
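The envelope fit described above can be sketched with a plain autocorrelation-method LP solve. Solving the Toeplitz normal equations directly with `np.linalg.solve` stands in for the correlation-based method of [11] (which would typically use Levinson-Durbin); order 12 follows the text's choice for 16 kHz signals, and the AR(1) test frame is illustrative.

```python
import numpy as np

def lpc(frame, order=12):
    """All-pole (LP) coefficients a[1..order] by the autocorrelation method:
    solve the Yule-Walker normal equations R a = r."""
    r = np.array([np.sum(frame[:len(frame) - k] * frame[k:])
                  for k in range(order + 1)])
    # Toeplitz system: R[i, j] = r(|i - j|)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# Fit an AR(1) process u[n] = 0.9*u[n-1] + e[n]; the first LP
# coefficient should come out close to 0.9.
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
u = np.zeros(4000)
for i in range(1, 4000):
    u[i] = 0.9 * u[i - 1] + e[i]

a = lpc(u, order=12)
```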
4. Speech synthesis
Synthesis is performed in a pitch-synchronous way, as in analysis. The synthesis time instants t_s^i (synthesis pitch marks) coincide with the analysis time instants t_a^i (analysis pitch marks) when no prosodic modification is done. Since a set of amplitudes, phases and a fundamental frequency are estimated for each voiced frame, the harmonic part of the i-th frame is generated using

h(t) = \sum_{k=0}^{L(t_s^i)} a_k(t_s^i) \cos\!\left( \phi_k(t_s^i) + 2\pi k f_0(t_s^i)\, t \right)    (31)

where t = 0, 1, \ldots, N and N is the length of the synthesis frame, i.e. the integer closest to the local pitch period at the time instant t_s^i.
Synthesis of the noise part is done by filtering white Gaussian noise through an all-pole filter. The difference between white Gaussian noise and uniform random numbers, in simple terms, is that the mean of white Gaussian noise is 0.5 less than that of uniform random numbers in [0, 1]: white Gaussian noise takes values in both the positive and negative domains. The filter described in this work is an all-pole filter, i.e. one without zeros. In general, a rational transfer function has the form

T(x) = \frac{Z(x)}{P(x)}    (32)

where Z(x) contains the zeros of the system and P(x) the poles. An all-pole filter has a frequency response that goes to infinity (poles) at specific frequencies, but there are no frequencies where the response is zero: the filter function (also called the transfer function) is a ratio with a constant in the numerator and a polynomial in the denominator.
Therefore, for a voiced or unvoiced frame, the noise part is generated by

n(t) = u(t) \ast h(t, \tau)    (33)

where u(t) is white Gaussian noise of the same length as the voiced/unvoiced frame and h(t, \tau) is an all-pole filter. The final synthetic speech signal s(t) is the sum of the harmonic and noise parts, combining equations 31 and 33:

s(t) = h(t) + n(t)    (34)
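Equations 33 and 34 can be sketched with an explicit all-pole recursion: white Gaussian noise is pushed through 1 / (1 - sum a_i z^{-i}) sample by sample, then added to a harmonic frame. The filter coefficients and the one-harmonic stand-in frame are illustrative, not values from the paper.

```python
import numpy as np

def all_pole_filter(x, a):
    """Direct-form all-pole filtering: y[n] = x[n] + sum_i a[i]*y[n-1-i].
    No zeros in the numerator, matching the transfer function of eq. (32)."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = x[n]
        for i, ai in enumerate(a):
            if n - 1 - i >= 0:
                acc += ai * y[n - 1 - i]
        y[n] = acc
    return y

rng = np.random.default_rng(0)
N = 320
u = rng.standard_normal(N)            # white Gaussian noise u(t)
a = [0.5, -0.2]                       # stable illustrative pole coefficients
n_part = all_pole_filter(u, a)        # eq. (33)

t = np.arange(N) / 16000.0
h_part = 0.8 * np.cos(2 * np.pi * 200.0 * t)   # one-harmonic stand-in frame
s = h_part + n_part                   # eq. (34)
```

In practice a library routine such as an IIR filter function would replace the explicit loop; the recursion is written out here only to mirror the all-pole form of equation 32.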
5. Summary
The harmonic plus noise model decomposes a speech signal into its harmonic and random components. This decomposition provides a framework in which the two signal components can be processed independently, allowing us to exploit the properties inherent in each. In this paper we have used this model to yield more natural synthesis. We synthesized 20 different sentences and collected the opinions of 15 listeners around the organization; the MOS (mean opinion score) came out around 95% for intelligibility and 85% for naturalness. As a result, the synthetic speech produced by the proposed method gives higher intelligibility and naturalness compared with other model-based methods.
References
[1] Y. Stylianou, "Harmonic plus Noise Models for Speech combined with Statistical Methods for Speech and Speaker Modification," PhD Thesis, 1996.
[2] L. R. Rabiner, "Applications of Voice Processing to Telecommunications," Proc. IEEE, 82(2): 199-228, February 1994.
[3] D. W. Griffin and J. S. Lim, "Multiband-excitation vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-36(2): 236-243, February 1988.
[4] B. Yegnanarayana, C. d'Alessandro and V. Darsinos, "An iterative algorithm for decomposition of speech signals into periodic and aperiodic components," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 1-11, January 1998.
[5] P. K. Lehana, P. C. Pandey and R. Gupta, "Use of harmonic plus noise model for reduction of self leakage in electrolaryngeal speech," International Conference on Systemics, Cybernetics and Informatics, February 2004.
[6] Y. Stylianou, "Modeling speech based on harmonic plus noise models," LNCS, vol. 3445/2005, pp. 244-260, July 2005.
[7] Y. Stylianou, "A pitch and maximum voiced frequency estimation technique adapted to harmonic models of speech," IEEE Nordic Signal Processing Symposium, 1996.
[8] E. K. Kim, W. J. Han and Y. H. Oh, "A new band splitting method for two-band speech model," IEEE Signal Processing Letters, vol. 8, no. 12, December 2001.
[9] M. T. Nagy, G. Rozinaj and A. Palenik, "A hybrid pitch period estimation method based on HNM model," 49th International Symposium ELMAR, September 2007.
[10] H. Gu and Y. Zhou, "Mandarin syllable signal synthesis using an HNM based scheme," ICALIP, IEEE 978-1-4244-1724-7, July 2008.
[11] S. M. Kay, "Modern Spectral Estimation," Prentice Hall, Englewood Cliffs, New Jersey, 1988.