
Available online at www.sciencedirect.com

www.elsevier.com/locate/specom

Speech Communication 55 (2013) 40–50

Speech recovery based on the linear canonical transform☆

Wei Qiu, Bing-Zhao Li*, Xue-Wen Li

Mathematics Department of Beijing Institute of Technology, Beijing 100081, China

Received 16 May 2011; received in revised form 1 June 2012; accepted 4 June 2012; available online 9 July 2012

Abstract

As is well known, speech signal processing is one of the hottest directions in signal processing. There exist many speech signal models, such as the speech sinusoidal model, the STRAIGHT speech model, the AM–FM model, the Gaussian mixture model and so on. This paper investigates the AM–FM speech model using the linear canonical transform (LCT). The LCT can be considered as a generalization of the traditional Fourier transform and the fractional Fourier transform, and has proved to be a powerful tool for non-stationary signal processing. This has opened up the possibility of a new range of potentially promising and useful applications based on the LCT. Firstly, two novel recovery methods for speech based on the AM–FM model are presented in this paper: one depends on LCT-domain filtering; the other restores the speech signal in the LCT domain based on chirp signal parameter estimation. Then, experimental results are presented to verify the performance of the proposed methods. Finally, a summary and the conclusions of the paper are given.
© 2012 Published by Elsevier B.V.

Keywords: Linear canonical transform; Chirp signal; The AM–FM model; Speech; Signal reconstruction

1. Introduction

As is known, a speech signal is a typical non-stationary signal whose frequency components vary with time, and it consists of vowels, consonants and transient parts. Detecting its spectral components is very important in many speech processing applications. The traditional tool for this estimation is spectral analysis by means of the Fourier transform; however, this is a tool for stationary signal processing only, and it simply indicates the global frequency components without telling us when they occurred. This defect has seriously hindered the development of speech signal processing. Focusing on speech signal processing, many time–frequency analysis tools have been proposed, for example the short-time Fourier transform (STFT), the wavelet transform (WT) and the fractional Fourier transform (FRFT) (Yin et al., 2008).

0167-6393/$ - see front matter © 2012 Published by Elsevier B.V.

http://dx.doi.org/10.1016/j.specom.2012.06.002

☆ This work was supported by the National Natural Science Foundation of China (Nos. 60901058 and 61171195), and also partially supported by the Beijing Natural Science Foundation (No. 1102029).
* Corresponding author. Tel.: +86 10 68912131 16.

E-mail address: [email protected] (B.-Z. Li).

Based on the mechanism of speech production, the speech signal can be regarded as a special kind of signal composed of harmonic structures, and several speech models have been proposed, such as the speech sinusoidal model, the STRAIGHT speech model, the Gaussian mixture model and so on (Yin et al., 2008; Teager and Teager, 1989). Besides, speech can also be seen as a special kind of multicomponent chirp signal (Bovik et al., 1993). A chirp is an important signal in which the frequency increases ('up-chirp') or decreases ('down-chirp') with time. The chirp is a typical non-stationary signal that appears widely in systems such as communication, radar, sonar, biomedicine and earthquake exploration, and it is also used as a signal model for many natural phenomena. Chirp signal processing is therefore very important, and the detection, parameter estimation and time–frequency representation of multicomponent chirps remain an important research topic. In the 1980s, Teager and Teager (1989) discovered by experiments that vortices could be the secondary source that excites the channel and produces the speech signal. Therefore, speech should be composed of a plane-wave-based linear part and a vortex-based non-linear part. According


to such theory, Dimitriadis and Maragos (2005), Bovik et al. (1993), Santhanam and Maragos (2000), and Huang (1998) proposed an AM–FM modulation model for speech analysis, synthesis and coding. During the last thirty years, the four milestones among the best-practice methods associated with speech detection and recovery are MESA (Bovik et al., 1993), PASED (Santhanam and Maragos, 2000), the Hilbert–Huang transform (Huang, 1998) and the Gianfelici transform (Gianfelici et al., 2007).

The paper Bovik et al. (1993) develops a multiband or wavelet approach for capturing the AM–FM components of modulated signals immersed in noise. It is demonstrated that the performance of the energy operator/ESA approach is vastly improved if the signal is first filtered through a bank of bandpass filters and, at each instant, analyzed using the dominant local channel response. In the paper Santhanam and Maragos (2000), the authors present a nonlinear algorithm for the separation and demodulation of discrete-time multicomponent AM–FM signals; it avoids the shortcomings of previous approaches and works well for extremely small spectral separations of the components and for a wide range of relative amplitude/power ratios. The paper Gianfelici et al. (2007) develops another new method for analyzing nonlinear and non-stationary signal processes. This decomposition method is adaptive and therefore highly efficient. The approach presented in Gianfelici et al. (2007) rests on a rigorous mathematical formulation, and its validity has been proven by applications to both synthetic signals and natural speech.

In this paper, we introduce two novel methods to restore a speech signal with background noise based on the AM–FM model associated with the linear canonical transform (LCT). The LCT is also known as the ABCD transform, the generalized Fresnel transform (Aizenberg and Astola, 2006), the Collins formula (Fan and Lu, 2006) and the generalized Huygens integral (Kostenbauder, 1990). It was proposed in the early seventies by Moshinsky and Quesne (2006) and Collins (1970); the special case with complex parameters was proposed by Bargmann (1961). The LCT can be considered as a generalization of the traditional Fourier transform and the fractional Fourier transform. Like the fractional Fourier transform, the LCT was at first used for analyzing optical systems and solving differential equations. With the development of the fractional Fourier transform in the 1990s, the LCT began to be taken seriously in the signal processing community. In such circumstances we can analyze signals in the LCT domain, because from Pei and Ding (2002) and Sharma and Joshi (2008) we know that the one-dimensional LCT is a linear integral transform of a three-parameter class; the LCT can be more general and flexible than the Fourier transform as well as the fractional Fourier transform in some properties (Sharma and Joshi, 2008), and it can solve problems that cannot be dealt with well by the latter. It is shown in Tao et al. (2009), Li et al. (2012), Sharma and Joshi (2008), and Maragos et al. (1993) that the LCT is one of the powerful non-stationary signal

processing tools, especially suitable for time-varying chirp signal processing, so the LCT is very promising for speech signal processing based on the AM–FM model. So far the LCT has had many applications in the signal processing community; for example, it has been applied to filter design, to communication signals against multi-path propagation, and so on (Li et al., 2012; Sharma and Joshi, 2006; James and Agarwal, 1996).

The rest of the paper is organized as follows. In Section 2, the definition and some basic properties of the LCT are briefly introduced. In Section 3, the single chirp signal model is introduced, and then the AM–FM model of speech is described. In Section 4, based on the AM–FM model, we introduce two novel methods to restore the speech signal using the LCT. Section 5 presents the experimental results and discussion. Some conclusions and future research are given in Section 6.

2. Preliminary

When the parameters $(a,b,c,d)$ are real numbers, the LCT of a signal $f(t)$ is defined as follows (Tao et al., 2009):

$$F_{(a,b,c,d)}(u)=\begin{cases}\sqrt{\frac{1}{j2\pi b}}\int_{-\infty}^{\infty}K_{a,b,c,d}(u,t)f(t)\,dt, & b\neq 0,\\ \sqrt{d}\,\exp\left(j\frac{cd}{2}u^{2}\right)f(du), & b=0,\end{cases}\tag{1}$$

with $ad-bc=1$; the kernel function is defined as

$$K_{a,b,c,d}(u,t)=\exp\left(j\frac{d}{2b}u^{2}\right)\exp\left(-\frac{j}{b}ut\right)\exp\left(j\frac{a}{2b}t^{2}\right).\tag{2}$$

Eq. (1) can be written as $F_{(a,b,c,d)}(u)=L_{a,b,c,d}(f(t))=L_{A}[f](u)$ with the parameter matrix

$$A=\begin{pmatrix}a & b\\ c & d\end{pmatrix}.$$

By the above definition, it is proved in Moshinsky and Quesne (2006) that the LCT satisfies the additivity property of the parameters, that is,

$$L_{a_{2},b_{2},c_{2},d_{2}}\left[L_{a_{1},b_{1},c_{1},d_{1}}(f(t))\right]=L_{e,f,g,h}(f(t)),\tag{3}$$

where

$$\begin{pmatrix}e & f\\ g & h\end{pmatrix}=\begin{pmatrix}a_{2} & b_{2}\\ c_{2} & d_{2}\end{pmatrix}\begin{pmatrix}a_{1} & b_{1}\\ c_{1} & d_{1}\end{pmatrix}.$$

Besides, the reversibility property is derived in Tao et al. (2009):

$$L_{d,-b,-c,a}\left[L_{a,b,c,d}(f(t))\right]=f(t).\tag{4}$$

When the parameters reduce to

$$(a,b,c,d)=(\cos\alpha,\sin\alpha,-\sin\alpha,\cos\alpha),$$

the LCT becomes the FRFT multiplied by a fixed phase factor (Moshinsky and Quesne, 2006):

$$L_{\cos\alpha,\sin\alpha,-\sin\alpha,\cos\alpha}(f(t))=\sqrt{\exp(-j\alpha)}\,F_{\alpha}(f(t)).\tag{5}$$


When the parameters $(a,b,c,d)=(0,1,-1,0)$, the LCT becomes the Fourier transform multiplied by the fixed factor $\sqrt{-j}$. When $(a,b,c,d)=(1,0,s,1)$, the LCT reduces to a chirp multiplication operator. Most of the classical concepts and theorems in the Fourier domain have been generalized to the LCT domain, for example the sampling theorem, the uncertainty principle and the Poisson summation formulae (Li and Xu, 2012; Li et al., 2007, 2009; Zhao et al., 2009; Tao et al., 2008; Stern, 2008). The LCT has also found many advantages in real applications, such as filter design and signal synthesis, time–frequency analysis, electromagnetic wave propagation analysis, pattern recognition, communication signal modulation and multiplexing, encryption, etc. (Tao et al., 2009; Li et al., 2012; Sharma and Joshi, 2006; James and Agarwal, 1996; Stern, 2008). It is known that the LCT is one of the most effective methods for non-stationary signals, and speech is a typical non-stationary signal. However, to the best of our knowledge, no relevant paper has been published on speech signal analysis and processing using the LCT. It is therefore worthwhile and interesting to investigate speech signal processing in the LCT domain. In the following we analyze a particular speech model using the LCT.
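To make the definition concrete, the following sketch (our own illustration, not code from the paper) evaluates the $b\neq 0$ branch of Eq. (1) by direct Riemann-sum quadrature and checks the special case stated above, that $(a,b,c,d)=(0,1,-1,0)$ reduces the LCT to the Fourier transform multiplied by $\sqrt{-j}$. The Gaussian test signal and the grid sizes are arbitrary assumptions.

```python
import numpy as np

def lct(f, t, u, a, b, d):
    """Riemann-sum quadrature of the LCT integral in Eq. (1), b != 0 branch.
    (The parameter c does not enter the kernel; ad - bc = 1 fixes it.)"""
    dt = t[1] - t[0]
    K = (np.exp(1j * d * u[:, None]**2 / (2 * b))      # exp(j d u^2 / 2b)
         * np.exp(-1j * u[:, None] * t[None, :] / b)   # exp(-j u t / b)
         * np.exp(1j * a * t[None, :]**2 / (2 * b)))   # exp(j a t^2 / 2b)
    return np.sqrt(1 / (2j * np.pi * b)) * (K @ f) * dt

t = np.linspace(-8, 8, 4001)
u = np.linspace(-3, 3, 61)
f = np.exp(-t**2)                                      # Gaussian test signal

F_lct = lct(f, t, u, a=0, b=1, d=0)
# ordinary Fourier transform with the 1/sqrt(2*pi) convention
dt = t[1] - t[0]
F_ft = (f[None, :] * np.exp(-1j * np.outer(u, t))).sum(axis=1) * dt / np.sqrt(2 * np.pi)
ratio = F_lct / F_ft                                   # should be the constant sqrt(-j)
```

Since the Gaussian decays well inside the quadrature window, the ratio is constant to high accuracy and equals $\sqrt{-j}=e^{-j\pi/4}$.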

3. The AM–FM model of speech

In the 1980s, Teager and Teager (1989) discovered by experiments that vortices could be the secondary source that excites the channel and produces the speech signal. Therefore, speech should be composed of a plane-wave-based linear part and a vortex-based non-linear part. According to such theory, Dimitriadis and Maragos (2005) proposed an AM–FM modulation model for speech analysis, synthesis and coding. The AM–FM model represents the speech signal as a sum of formant resonance signals, each of which contains amplitude and frequency modulation (Maragos et al., 1993). In Maragos et al. (1993) we find that the AM–FM model of speech has the form of a multicomponent chirp signal model.

3.1. Chirp signal model

Assume a single chirp signal model:

$$f(t)=h\exp\left(j\frac{1}{2}a_{2}t^{2}+ja_{1}t+ja_{0}\right)+w(t),\tag{6}$$

where $h$ is the signal amplitude parameter, $a_{2}$ is the frequency-rate parameter of the linear FM signal, $a_{1}$ is the angular frequency parameter (also known as the initial or center frequency), $a_{0}$ is the initial phase parameter, and $w(t)$ is additive noise (Chew et al., 1994).

We consider a received signal $y(t)$ consisting of $M$ superimposed chirp signals embedded in noise, so the multi-component chirp signal model can be written as

$$y(t)=\sum_{i=1}^{M}h_{i}\exp\left(j\frac{1}{2}k_{i}t^{2}+jb_{i}t+jc_{i}\right)+w_{1}(t),\tag{7}$$

where the meaning of each component parameter is the same as in the single chirp signal model (6).
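As a concrete illustration (the parameter values below are our own arbitrary choices, not taken from the paper), the single- and multi-component models of Eqs. (6) and (7) can be synthesized directly:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(-2.0, 2.0, 0.01)              # observation window

# single chirp, Eq. (6): h, a2, a1, a0 are assumed illustrative values
h, a2, a1, a0 = 3.0, -8.0, 2.0, np.pi / 3
f = h * np.exp(1j * (0.5 * a2 * t**2 + a1 * t + a0))
f_noisy = f + 0.3 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))

# multi-component mixture, Eq. (7), with M = 2 components (h_i, k_i, b_i, c_i)
components = [(3.0, -8.0, 2.0, np.pi / 4), (4.0, -18.0, 5.0, np.pi / 3)]
y = sum(hi * np.exp(1j * (0.5 * ki * t**2 + bi * t + ci))
        for hi, ki, bi, ci in components)
```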

3.2. AM–FM model of speech

In speech recognition, coding and synthesis, pitch and formants are always the most important parameters. In traditional speech processing methods, these parameters are considered constant within a frame of 10–30 ms (Dimitriadis and Maragos, 2005; Yao and Zhang, 2002). It is known that voiced speech has a harmonic structure; sinusoidal modeling (Smith and Serra, 1992) has been widely used to represent the most salient aspects of this sound, and it can be modeled as:

$$x(t)=\sum_{k=1}^{\infty}a_{k}(t)\exp\left(j(kw_{0}t+\alpha_{k})\right),\tag{8}$$

where $a_{k}(t)$ is the time-varying amplitude signal, $w_{0}$ is the initial frequency, $\alpha_{k}$ is the initial phase, and $k$ is the harmonic index. A key component of sinusoidal modeling is the estimation of the parameters of the multiple sinusoids.

But in fact the frequency is always changing, even within a 10–30 ms frame, because of intonation. Considering the fluctuation of frequency and the harmonic structure, speech can be modeled as an AM–FM signal (Maragos et al., 1993)

$$x(t)=\sum_{k=1}^{\infty}a_{k}(t)\exp\left(jk\left(w_{0}t+\int_{0}^{t}q(s)\,ds\right)+j\alpha_{k}\right),\tag{9}$$

where $q(s)$ is the frequency modulation function. A reasonable simplification of $q(s)$ is $q(s)=w_{1}s$, so we can obtain the AM–FM model (Dimitriadis and Maragos, 2005):

$$x(t)=\sum_{k=1}^{\infty}a_{k}(t)\exp\left(jk\left(w_{0}t+\frac{1}{2}w_{1}t^{2}\right)+j\alpha_{k}\right).\tag{10}$$

Comparing the multi-component chirp signal model with the AM–FM model, we can conclude that the AM–FM model also has the form of a multi-component chirp signal model. In some applications the multi-component chirp signal model is considered as a finite truncation of the AM–FM model. So we can make another reasonable simplification: the AM–FM speech model can be described as (Yin et al., 2008):

$$x(t)=\sum_{k=1}^{M}A_{k}\exp\left(jkt\left(w_{0}+\frac{1}{2}w_{1}t\right)+j\alpha_{k}\right).\tag{11}$$
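For illustration, Eq. (11) can be synthesized as a sum of harmonics sharing one common chirp law. The sampling rate, pitch, sweep rate and harmonic amplitudes below are assumed values, not taken from the paper.

```python
import numpy as np

fs = 8000.0
t = np.arange(0, 0.03, 1 / fs)               # one 30 ms voiced frame
w0 = 2 * np.pi * 200.0                       # assumed fundamental (rad/s)
w1 = 2 * np.pi * 3000.0                      # assumed frequency sweep rate
A = [1.0, 0.6, 0.4, 0.25, 0.15]              # assumed harmonic amplitudes
alpha = [0.0] * len(A)                       # initial phases

# Eq. (11): x(t) = sum_k A_k exp(j k t (w0 + w1 t / 2) + j alpha_k)
x = sum(A[k] * np.exp(1j * ((k + 1) * t * (w0 + 0.5 * w1 * t) + alpha[k]))
        for k in range(len(A)))
```

Each harmonic is itself a chirp whose rate is a multiple of the common rate $w_1$, which is exactly the multi-component chirp structure of Eq. (7).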

From (11) we find that the AM–FM speech model has the multi-component chirp form. Speech signal recovery is a long-standing research subject; based on this speech model, if we can estimate all the parameters of the model, which is equivalent to estimating the parameters of a multi-component chirp signal, then the speech is restored. Besides, this signal is not band-limited and is non-stationary in the Fourier transform domain, so the traditional parameter estimation methods in the Fourier domain are not optimal, and new signal processing tools and methods should be investigated for this kind of non-stationary signal. Based on the properties of the LCT and the AM–FM speech model, two novel speech signal recovery methods in the LCT domain are proposed in the following sections.

4. Two novel recovery methods of the speech

4.1. The analysis background of speech signal

Speech can be regarded as a non-stationary signal, so speech signal analysis is very arduous. Speech signal de-noising and reconstruction has always been one of the hottest research focuses. Based on the speech sinusoidal model, we first review the speech signal using the traditional methods, and then reconstruct the speech signal.

The sinusoidal model of speech can be expressed as the sum of a group of sinusoidal waves. For an input signal $s(n)$ we seek $K$ sinusoidal signal components; the original signal is approximated by the sum of these $K$ components,

$$s(n)\approx\tilde{s}(n)=\sum_{k=1}^{K}d_{k}\cos(\bar{w}_{k}n+\phi_{k}),\quad n=0,1,2,\ldots,N-1,\tag{12}$$

where $d_{k}$ is the signal amplitude parameter, $\bar{w}_{k}$ is the frequency parameter and $\phi_{k}$ is the phase parameter; through such a model the original signal is expressed as a group of slowly varying sine waves. The analysis phase requires accurate estimation of all the parameters of the input signal; then, in the synthesis stage, we can use the model parameters extracted in the analysis phase to reconstruct the original speech signal. Spectral peak picking can estimate these parameters based on the STFT. If the speech signal is strictly periodic, then

$$s(n)=\sum_{k=1}^{K}d_{k}\cos(k\bar{w}_{0}n+\phi_{k}),\quad n=0,1,2,\ldots,N-1,\tag{13}$$

and the STFT of $s(n)$ is

$$S(\bar{w})=\sum_{n=-N/2}^{N/2}s(n)\exp(-jn\bar{w});\tag{14}$$

from Fourier analysis, the sinusoid parameters are $d_{k}=|S(k\bar{w}_{0})|$, $\bar{w}_{k}=k\bar{w}_{0}$ and $\phi_{k}=\arg(S(k\bar{w}_{0}))$. Through these parameters the original voice signal can be reconstructed.
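The peak-picking step can be sketched as follows for a strictly periodic synthetic frame in the form of Eq. (13). The sampling rate, frame length and harmonic parameters are our own assumptions, chosen so the harmonics fall on exact FFT bins and the peak values can be read off directly.

```python
import numpy as np

fs, N = 8000, 800
n = np.arange(N)
w0 = 2 * np.pi * 200 / fs                   # fundamental, rad/sample (bin 20)
d_true, phi_true = [1.0, 0.5], [0.3, -0.8]  # assumed harmonic amplitudes/phases
s = sum(dk * np.cos((k + 1) * w0 * n + pk)
        for k, (dk, pk) in enumerate(zip(d_true, phi_true)))

S = np.fft.rfft(s)
w_bins = 2 * np.pi * np.fft.rfftfreq(N)     # digital frequencies, rad/sample
k_peak = int(np.argmax(np.abs(S)))          # dominant spectral peak
w_hat = w_bins[k_peak]
d_hat = 2 * np.abs(S[k_peak]) / N           # cosine amplitude from peak height
phi_hat = float(np.angle(S[k_peak]))
```

For a real cosine at an exact bin, the FFT peak magnitude is $d_k N/2$ and its angle is the phase, so the dominant component's parameters are recovered exactly here.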

But in fact, the frequency of speech is always changingeven within a frame 10–30 ms because of the non-stationa-rity. Considering the fluctuation of frequency and the har-monic structure, speech can be modeled as an AM–FMsignal from Section 3, the AM–FM speech model can bedescribed as (18). If all the parameters of Eq. (18) can beestimated, the original voice signal will obtain the recon-

struction. Below we describe how to estimate theseparameters.

The maximum likelihood estimation (MLE) method for chirp signals can be expressed as (Friedlander, 1995; Abatzoglou, 1986)

$$L(a_{1},a_{2})=\left|\int_{-T/2}^{T/2}f(t)\exp\left(-j\left(a_{1}t+\frac{1}{2}a_{2}t^{2}\right)\right)dt\right|^{2},\quad -\frac{T}{2}\leq t\leq\frac{T}{2}.\tag{15}$$

When $L$ attains its maximum, the corresponding coordinates $(\hat{a}_{1},\hat{a}_{2})$ are, respectively, the estimates of $a_{1}$ and $a_{2}$. Then $a_{0}$ and $A$ are estimated as

$$\hat{a}_{0}=\arg\left\{\int_{-T/2}^{T/2}f(t)\exp\left(-j\left(\hat{a}_{1}t+\frac{1}{2}\hat{a}_{2}t^{2}\right)\right)dt\right\},\tag{16}$$

$$\hat{A}=\frac{1}{T}\left|\int_{-T/2}^{T/2}f(t)\exp\left(-j\left(\hat{a}_{0}+\hat{a}_{1}t+\frac{1}{2}\hat{a}_{2}t^{2}\right)\right)dt\right|.\tag{17}$$
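A brute-force discretization of Eqs. (15)–(17) can be sketched as follows (our own illustration: a coarse grid stands in for the continuous maximization, and the noiseless test chirp's parameters are arbitrary assumptions).

```python
import numpy as np

T, dt = 4.0, 0.01
t = np.arange(-T / 2, T / 2, dt)
a2, a1, a0, h = -8.0, 2.0, np.pi / 3, 3.0
f = h * np.exp(1j * (0.5 * a2 * t**2 + a1 * t + a0))   # noiseless chirp, Eq. (6)

a1_grid = np.arange(-5.0, 5.0, 0.1)
a2_grid = np.arange(-12.0, 0.0, 0.1)
# Eq. (15): |integral of f(t) exp(-j(a1 t + a2 t^2 / 2)) dt|^2 over the grid
L = np.array([[np.abs(np.sum(f * np.exp(-1j * (b1 * t + 0.5 * b2 * t**2))) * dt)**2
               for b1 in a1_grid] for b2 in a2_grid])
i, j = np.unravel_index(np.argmax(L), L.shape)
a2_hat, a1_hat = a2_grid[i], a1_grid[j]

# Eqs. (16)-(17): dechirp with the estimates, read off phase and amplitude
I = np.sum(f * np.exp(-1j * (a1_hat * t + 0.5 * a2_hat * t**2))) * dt
a0_hat = np.angle(I)
A_hat = np.abs(I) / T
```

At the matched grid point the dechirped integral equals $hT e^{ja_0}$, so the amplitude and phase estimates follow directly.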

In this way, the parameters of a single chirp signal can be estimated; through filter design, the signal parameters of multi-component chirps can also be estimated. Later appeared the polynomial phase transform (PPT), the chirp-Fourier transform (CFT), the quadratic phase transform (QPT), the fan-chirp transform (FCT), and so on. These are all continually developing tools for chirp signal analysis and processing, and they share many similarities.

Following the Radon–Wigner transform (RWT), the Wigner–Hough transform (WHT) and the Radon-ambiguity transform (RAT), a later method is chirp signal detection and parameter estimation using the FRFT (Qi and Tao, 2003). The FRFT is a special case of the LCT. LFM signals exhibit maximum peak points in the two-dimensional $(\alpha,u)$ plane, and the coordinates $(\alpha,u)$ of the maximum point can be obtained by a two-dimensional peak search. Let $X_{\alpha}(u)$ denote the FRFT of the original chirp signal $f(t)$, $-\frac{T}{2}\leq t\leq\frac{T}{2}$. The estimates can be described as

$$\{\alpha,u\}=\arg\max_{\alpha,u}\,|X_{\alpha}(u)|^{2},\tag{18}$$

$$\hat{a}_{2}=-\cot\alpha,\qquad \hat{a}_{1}=u\csc\alpha,$$

$$\hat{a}_{0}=\arg\frac{X_{\alpha}(u)}{\sqrt{\frac{1-j\cot\alpha}{2\pi}}\exp(j\pi u^{2}\cot\alpha)},\qquad \hat{A}=\frac{|X_{\alpha}(u)|}{T\left|\sqrt{\frac{1-j\cot\alpha}{2\pi}}\right|}.\tag{19}$$

This shows that we can reconstruct the original speech signal using the estimated parameters. We now put forward new methods to analyze this kind of speech signal, and introduce two methods with advantages over other approaches. Firstly, the LCT is a relatively new transform with clear benefits in signal processing. The speech signal is a typical non-stationary signal that may not be band-limited in the traditional Fourier domain, but in the LCT domain it can exhibit the good properties of a band-limited signal. Besides, Gaussian noise shows no focusing property in the LCT domain, while chirp signals do. In addition, the LCT has four parameters that control the LCT-domain properties of the speech signal, so such signal properties are more fully reflected in the LCT domain. Speech signal processing based on the AM–FM model is therefore well suited to the LCT domain.

4.2. The LCT analysis of speech (M = 1)

By the definition of the LCT, it has three free parameters, and these parameters will help us investigate the performance of the AM–FM speech model in the LCT domain. From the definition of the LCT as well as Eq. (11), the LCT of this speech signal ($M=1$) can be written as

$$\begin{aligned}L_{a,b,c,d}(f(t))&=\sqrt{\frac{1}{j2\pi b}}\,A\exp\left(j\left(a_{0}+\frac{d}{2b}u^{2}\right)\right)\int_{-\infty}^{\infty}\exp\left(-\frac{j}{b}ut\right)\exp\left(j\frac{a}{2b}t^{2}\right)\exp\left(j\frac{1}{2}a_{2}t^{2}+ja_{1}t\right)dt\\&=\sqrt{\frac{1}{j2\pi b}}\,A\exp\left(j\left(a_{0}+\frac{d}{2b}u^{2}\right)\right)\int_{-\infty}^{\infty}\exp\left(j\frac{1}{2}\left(\frac{a}{b}+a_{2}\right)t^{2}\right)\exp\left(jt\left(a_{1}-\frac{u}{b}\right)\right)dt;\end{aligned}\tag{20}$$

if we set

$$Z_{a,b,u}(t)=\exp\left(j\frac{1}{2}\left(\frac{a}{b}+a_{2}\right)t^{2}\right)\exp\left(jt\left(a_{1}-\frac{u}{b}\right)\right),\tag{21}$$

then

$$L_{a,b,c,d}(f(t))=\sqrt{\frac{1}{j2\pi b}}\,A\exp\left(j\left(a_{0}+\frac{d}{2b}u^{2}\right)\right)\int_{-\infty}^{\infty}Z_{a,b,u}(t)\,dt.\tag{22}$$

Because of the integral inequality

$$\left|\int_{-T}^{T}Z_{a,b,u}(t)\,dt\right|\leq\int_{-T}^{T}|Z_{a,b,u}(t)|\,dt,$$

where $T$ is sufficiently large, the largest peak occurs if and only if $\frac{a}{b}+a_{2}=0$ and $a_{1}-\frac{u}{b}=0$.

Because $L_{a,b,c,d}(f(t))$ now involves the four parameters $a,b,d,u$, two parameters must be fixed if we want to conduct a two-dimensional peak search. Firstly, by differentiation we can determine the range of values of $b$ and $d$; then, for each pair of $b,d$ values so obtained, we conduct a two-dimensional peak search over the $(a,u)$ points. We get different peaks for different $b,d$ values, compare these peaks, and select the largest one. The four parameter values corresponding to this peak are recorded as $(\tilde{a},\tilde{b},\tilde{d},\tilde{u})$, and the estimated values of $a_{2}$ and $a_{1}$ are

$$\hat{a}_{2}=-\frac{\tilde{a}}{\tilde{b}},\qquad \hat{a}_{1}=\frac{\tilde{u}}{\tilde{b}}.\tag{23}$$
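The $(a,u)$ search with $b,d$ fixed can be sketched numerically as follows. This is our own illustration: $b$ and $d$ are fixed at the values used in Section 5, the grids are coarse arbitrary choices, and the constant factors of the LCT are dropped since only the peak location matters.

```python
import numpy as np

dt = 0.01
t = np.arange(-2.0, 2.0, dt)
a2_true, a1_true = -8.0, 2.0                     # chirp to be estimated
f = 3.0 * np.exp(1j * (0.5 * a2_true * t**2 + a1_true * t + np.pi / 3))

b, d = 0.12533, -0.99211                         # fixed, as in Section 5.1
a_grid = np.arange(-1.2, 1.2, 0.01)
u_grid = np.arange(-10.0, 10.0, 0.05)

# |L_{a,b,c,d}[f](u)| up to constant factors: the peak occurs where
# a/b + a2 = 0 and a1 - u/b = 0 (Section 4.2)
E_u = np.exp(-1j * np.outer(u_grid, t) / b)      # reused for every a
mags = np.empty((a_grid.size, u_grid.size))
for i, a in enumerate(a_grid):
    g = f * np.exp(1j * a * t**2 / (2 * b))
    mags[i] = np.abs(E_u @ g) * dt

i, j = np.unravel_index(np.argmax(mags), mags.shape)
a_t, u_t = a_grid[i], u_grid[j]
a2_hat = -a_t / b                                # Eq. (23)
a1_hat = u_t / b
```

With these grids the search recovers estimates within roughly one grid step of the true values.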

If we set $a=\tilde{a}$, $b=\tilde{b}$, $u=\tilde{u}$, $a_{1}=\hat{a}_{1}$, $a_{2}=\hat{a}_{2}$ in Eq. (21), then

$$\widetilde{Z}_{\tilde{a},\tilde{b},\tilde{u}}(t)=\exp\left(j\frac{1}{2}\left(\frac{\tilde{a}}{\tilde{b}}+\hat{a}_{2}\right)t^{2}\right)\exp\left(jt\left(\hat{a}_{1}-\frac{\tilde{u}}{\tilde{b}}\right)\right)=1,\tag{24}$$

$$\left|L_{\tilde{a},\tilde{b},\tilde{c},\tilde{d}}[f(t)](\tilde{u})\right|=\left|\sqrt{\frac{1}{j2\pi\tilde{b}}}\,A\int_{-\infty}^{\infty}\widetilde{Z}_{\tilde{a},\tilde{b},\tilde{u}}(t)\,dt\right|,\tag{25}$$

then

$$\frac{L_{\tilde{a},\tilde{b},\tilde{c},\tilde{d}}[f(t)](\tilde{u})}{\sqrt{\frac{1}{j2\pi\tilde{b}}}\exp\left(j\frac{\tilde{d}}{2\tilde{b}}\tilde{u}^{2}\right)}=K\exp(ja_{0})\ (K\ \text{is the amplitude})=K(\cos(a_{0})+j\sin(a_{0})).\tag{26}$$

Let $H=L_{\tilde{a},\tilde{b},\tilde{c},\tilde{d}}[f(t)](\tilde{u})\Big/\left(\sqrt{\frac{1}{j2\pi\tilde{b}}}\exp\left(j\frac{\tilde{d}}{2\tilde{b}}\tilde{u}^{2}\right)\right)$, so

$$\hat{a}_{0}=\arctan\frac{\mathrm{Im}(H)}{\mathrm{Re}(H)},\tag{27}$$

and also

$$\hat{A}=\frac{\left|L_{\tilde{a},\tilde{b},\tilde{c},\tilde{d}}[f(t)](\tilde{u})\right|}{\left|\sqrt{\frac{1}{j2\pi\tilde{b}}}\right|\left|\int_{-\infty}^{\infty}\widetilde{Z}_{\tilde{a},\tilde{b},\tilde{u}}(t)\,dt\right|}.\tag{28}$$

When the LCT parameters reduce to $(a,b,c,d)=(\cos\alpha,\sin\alpha,-\sin\alpha,\cos\alpha)$, the LCT becomes the fractional Fourier transform multiplied by a fixed phase factor, and the estimates $\hat{a}_{2}$, $\hat{a}_{1}$, $\hat{a}_{0}$ and $\hat{A}$ become those estimated by the FRFT.

So

$$\hat{f}(t)=\hat{A}\exp\left(j\frac{1}{2}\hat{a}_{2}t^{2}+j\hat{a}_{1}t+j\hat{a}_{0}\right),\tag{29}$$

and the original speech signal ($M=1$) can be reconstructed. Below we present two speech reconstruction methods.

4.3. The first method

The first method makes direct use of the reversibility property of the linear canonical transform,

$$L_{d,-b,-c,a}\left[L_{a,b,c,d}(y(t))\right]=y(t).$$

When the speech signal based on the AM–FM model carries background Gaussian noise, most of its energy concentrates in a narrow band centered at the peak in the LCT domain, whereas the background Gaussian noise does not have a good time–frequency focusing property in the LCT domain (Moshinsky and Quesne, 2006; Almeida, 1994). To reconstruct the speech signal we can exploit this property: we design a band-pass filter centered at the peak point (Qi and Tao, 2003); passing the noisy samples through this filter retains most of the energy of the speech signal while filtering out most of the noise energy, and the original speech signal is then recovered through the inverse LCT (Almeida, 1994).
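A numerical sketch of this first method follows (our own construction, not the paper's code): the LCT integrals are approximated by quadrature, the chirp and transform parameters repeat the Section 5 experiment, and the filter window width is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.01
t = np.arange(-2.0, 2.0, dt)
a2, a1, a0, A = -8.0, 2.0, np.pi / 3, 3.0
clean = A * np.exp(1j * (0.5 * a2 * t**2 + a1 * t + a0))
y = clean + 0.3 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))

b, d = 0.12533, -0.99211
a = -a2 * b                       # matched: a/b + a2 = 0 concentrates the chirp

def kernel(aa, bb, dd, uu, tt):
    """LCT kernel K_{a,b,c,d}(u, t) of Eq. (2); c does not appear."""
    return (np.exp(1j * dd * uu[:, None]**2 / (2 * bb))
            * np.exp(-1j * uu[:, None] * tt[None, :] / bb)
            * np.exp(1j * aa * tt[None, :]**2 / (2 * bb)))

du = 0.005
u = np.arange(-1.0, 1.5, du)
# forward LCT (Eq. (1)); the chirp energy concentrates near u0 = a1 * b
F = np.sqrt(1 / (2j * np.pi * b)) * kernel(a, b, d, u, t) @ y * dt

# band-pass filter in the LCT domain around the detected peak
u0 = u[np.argmax(np.abs(F))]
F_filtered = np.where(np.abs(u - u0) < 0.8, F, 0.0)

# inverse LCT with parameters (d, -b, -c, a) (Eq. (4)); u and t swap roles
y_rec = np.sqrt(1 / (2j * np.pi * -b)) * kernel(d, -b, a, t, u) @ F_filtered * du
```

The reconstruction keeps the focused chirp energy and discards most of the unfocused noise; the recovered waveform correlates strongly with the clean chirp.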

Fig. 1. The original chirp signal to be restored, with noise.

Fig. 2. With a = −0.99211, b = 0.12533, d = −0.99211, search for the peak after the LCT.

Fig. 3. With b = 0.12533, d = −0.99211, the peak point after the LCT.

4.4. The second method

The AM–FM model of speech is used in the detection and recovery of multi-component chirp signals. In multi-component chirp signal detection, a strong component affects the weak components; to avoid such interference we combine the first method as follows. Firstly, set a threshold, and conduct a four-dimensional peak search using the quasi-Newton method to obtain the largest recorded peak value $(\tilde{a},\tilde{b},\tilde{d},\tilde{u})$; the strongest component is thereby detected, its parameters are estimated in the same way as for a single chirp signal, and the first component can be reconstructed. Secondly, to detect the sub-component, the interference of the strong component must be eliminated, so we design a sequential band-pass adaptive filter in the LCT domain to filter out the first strong component. The filtered speech signal is then transformed back to the time domain by the inverse LCT with parameters $(\tilde{d},-\tilde{b},-\tilde{c},\tilde{a})$. The second chirp component can then be detected and its parameters estimated. Repeat the above process until the magnitude of the last signal component falls below the predetermined threshold. Combining filtering and parameter estimation effectively avoids interference, and a better reconstruction of the speech signal can be obtained (Qi and Tao, 2003).

The MLE, the discrete PPT, the discrete CFT, the QPT and the recently proposed FCT are all chirp signal processing methods based on signal peak detection. The LCT is another method for chirp signal processing, and it is also based on signal peak detection. Besides, conversion relationships exist among these methods.
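The detect–estimate–subtract loop of the second method can be sketched as follows. This is our own simplification: a plain dechirp grid search stands in for the LCT peak search and band-pass filtering, the two component parameter sets are illustrative assumptions (not the Section 5 values), and the mixture is noiseless for clarity.

```python
import numpy as np

dt = 0.01
t = np.arange(-2.0, 2.0, dt)
T = 4.0
# two-component mixture in the form of Eq. (7), noiseless for clarity
y = (1.5 * np.exp(1j * (0.5 * -8.0 * t**2 + 2.0 * t + np.pi / 4))
     + 4.0 * np.exp(1j * (0.5 * -18.0 * t**2 + 5.0 * t + np.pi / 3)))

k_grid = np.arange(-25.0, 0.0, 0.25)     # chirp-rate candidates
b_grid = np.arange(-8.0, 8.0, 0.25)      # centre-frequency candidates

def strongest(sig):
    """Grid search for the dominant component's (k, b, c, h) by dechirping."""
    best_mag, best = -1.0, None
    for k in k_grid:
        dech = sig * np.exp(-1j * 0.5 * k * t**2)
        spec = np.array([np.sum(dech * np.exp(-1j * b1 * t)) * dt for b1 in b_grid])
        m = int(np.argmax(np.abs(spec)))
        if np.abs(spec[m]) > best_mag:
            best_mag = np.abs(spec[m])
            best = (k, b_grid[m], float(np.angle(spec[m])), np.abs(spec[m]) / T)
    return best

found, resid = [], y.copy()
for _ in range(2):                       # detect, reconstruct, subtract
    k, b1, c, h = strongest(resid)
    resid = resid - h * np.exp(1j * (0.5 * k * t**2 + b1 * t + c))
    found.append((h, k))
```

The strong component is found and removed first, so the weaker one can be detected in the residual without being masked.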

5. Simulation results

5.1. Single chirp component of the speech

The original single chirp signal ($M=1$) with Gaussian noise is

$$y(t)=3\exp\left(j\left(-\frac{1}{2}\cdot 8t^{2}+2t+\frac{\pi}{3}\right)\right)+w(t).$$

The SNR of this speech signal is 12 dB; the observation time of the original signal is [−2 s, 2 s] with sampling interval $\Delta t=0.01$ s, and the observation range in the LCT domain is [−10, 10] with sampling interval $\Delta u=0.05$. We apply the discrete LCT proposed in Aizenberg and Astola (2006), Li and Xu (2012), and Abatzoglou (1986); because of the effects of dimensionless normalization, the parameters must be adjusted appropriately (for example, the phase will change) to find the correct parameter estimates; for this method one can refer to Koc et al. (2008).

Fig. 1 is the time-domain graph of the original speech signal with Gaussian noise at an SNR of 12 dB. In Fig. 2, after fixing the parameters a = −0.99211, b = 0.12533, d = −0.99211, a peak value is found by searching. In Fig. 3, with b = 0.12533 and d = −0.99211 fixed, the graph shows the spectrum of the chirp signal in the LCT domain; the peak value is visible, and the quasi-Newton method can be used for the four-dimensional peak search. We then find the corresponding coordinates $(\tilde{a},\tilde{b},\tilde{d},\tilde{u})$. By parameter estimation, the modulation frequency, angular frequency, phase and amplitude are, respectively, $\hat{a}_{2}=-7.9158$, $\hat{a}_{1}=1.9947$, $\hat{a}_{0}=1.0071$ and $\hat{A}=3.0619$. Fig. 4 shows the original speech signal restored by the first method. Fig. 5 shows comparisons between the original chirp signal and the signals reconstructed by parameter estimation in the other four transform domains. Fig. 6 shows comparisons of the mean amplitude error rate between the proposed method and the other methods. From Fig. 6 we can see that the mean error rates of these methods are similar during [0 s, 1 s], and the differences are not very significant; during [−2 s, 0 s] and [1 s, 2 s], however, the mean error rate of the ML method is the largest, followed by the PPT and Dechirp methods, while the error rate of the LCT method is not the largest. Fig. 6 thus shows that the LCT is feasible for speech signal recovery compared with the other methods.

Fig. 5. Reconstructed signals using parameter estimation compared with the original speech.

5.2. Speech of multi-component chirps for M = 2

The speech signal (M = 2) with Gaussian noise is

y(t) = 3 exp[j(−(1/2)·8t² + 2t + π/4)] + 4 exp[j(−(1/2)·18t² + 5t + π/3)] + w(t).

The SNR of this speech signal is 22 dB. The observation time of the original signal is [−2 s, 2 s] with sampling interval ΔT = 0.01 s; the observation range in the LCT domain is [−10, 10] with sampling interval Δu = 0.05.
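The two-component test signal and its noise scaling can be written out explicitly. The complex-Gaussian noise scaled to the stated 22 dB SNR is an assumption; the paper's exact noise realization is unknown:

```python
import numpy as np

# The M = 2 test signal of Section 5.2 on the stated grid.
t = np.arange(-2.0, 2.0 + 0.01, 0.01)
s = (3 * np.exp(1j * (-0.5 * 8 * t**2 + 2 * t + np.pi / 4))
     + 4 * np.exp(1j * (-0.5 * 18 * t**2 + 5 * t + np.pi / 3)))

snr_db = 22.0
sig_power = np.mean(np.abs(s)**2)   # about 25 plus an oscillatory cross term
noise_power = sig_power / 10**(snr_db / 10)
rng = np.random.default_rng(1)
w = np.sqrt(noise_power / 2) * (rng.standard_normal(t.size)
                                + 1j * rng.standard_normal(t.size))
y = s + w
print(round(10 * np.log10(sig_power / np.mean(np.abs(w)**2)), 1))
```

Note that the two component envelopes (amplitudes 3 and 4) overlap completely in time; it is the distinct chirp rates (−8 and −18) that separate them in the LCT domain.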

[Fig. 4 plot: t vs. y; curves: original signal and reconstructed signal.]

Fig. 4. Reconstruction of the signal by the inverse LCT compared with the original signal.

Fig. 6. Mean error rate in amplitude using parameter estimation by the LCT compared with other methods, at an SNR of 12 dB.

Fig. 7 is the time-domain graph of the speech signal (M = 2) with Gaussian noise at an SNR of 22 dB. In Fig. 8, two peaks are identified after a four-dimensional peak search on the (a, u) plane. Fig. 9 shows the search for the second peak after inhibiting the first component. Fig. 10 shows the first method used to restore the speech signal. By the parameter estimation, the modulation frequencies, angular frequencies, phases and amplitudes are, respectively, a₁ = −7.9011, b₁ = 2.1047, c₁ = 0.7480, Â₁ = 2.9021 and a₂ = −18.1348, b₂ = 4.9024, c₂ = 1.0923, Â₂ = 4.1262. Fig. 11 shows comparisons between the original chirp signal and the signals reconstructed by parameter estimation in the LCT domain and in the other three transform domains. Fig. 12 shows comparisons of the mean error rate in amplitude between the proposed method and the other methods. From Fig. 12 we can see that the mean error rates of these methods are similar during [−0.5 s, 1 s], where the differences
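The "estimate the strongest component, inhibit it, search again" loop described here can be sketched generically. Below, a simple dechirp-and-FFT estimator stands in for the four-dimensional LCT peak search (a simplification, not the paper's estimator); the successive-cancellation structure of the loop is the same. The amplitudes and on-grid frequencies in the demo are chosen for clarity, not taken from the paper:

```python
import numpy as np

def estimate_one_chirp(y, t, rates):
    """Estimate the strongest chirp A*exp(j*(0.5*mu*t^2 + w*t + c)) by
    grid-searching the rate mu, dechirping, and picking the FFT peak."""
    dt, N = t[1] - t[0], t.size
    freqs = 2 * np.pi * np.fft.fftfreq(N, dt)
    best = (0.0, 0.0, 0.0, 0j)
    for mu in rates:
        Z = np.fft.fft(y * np.exp(-1j * 0.5 * mu * t**2)) / N
        k = int(np.argmax(np.abs(Z)))
        if np.abs(Z[k]) > best[0]:
            best = (np.abs(Z[k]), mu, freqs[k], Z[k])
    amp, mu, w, peak = best
    c = np.angle(peak) - w * t[0]          # FFT phase is referenced to t[0]
    return mu, w, (c + np.pi) % (2 * np.pi) - np.pi, amp

def estimate_chirps(y, t, M, rates):
    """Successively estimate M components, inhibiting each one found."""
    comps, resid = [], y.astype(complex).copy()
    for _ in range(M):
        mu, w, c, A = estimate_one_chirp(resid, t, rates)
        comps.append((mu, w, c, A))
        resid -= A * np.exp(1j * (0.5 * mu * t**2 + w * t + c))
    return comps

# Demo: two chirps with rates -18 and -8 (as in Section 5.2) but
# illustrative amplitudes/frequencies chosen on the FFT grid.
t = np.arange(-2.0, 2.0, 0.01)
y = (np.exp(1j * (0.5 * -8.0 * t**2 + np.pi * t + np.pi / 4))
     + 4.0 * np.exp(1j * (0.5 * -18.0 * t**2 + 2 * np.pi * t + np.pi / 3)))
comps = estimate_chirps(y, t, 2, np.arange(-20.0, 0.0, 1.0))
print(comps)  # strongest component (rate near -18) is found first
```

This illustrates why a strong component must be inhibited before the weak one can be estimated reliably: its sidelobes otherwise bias the second search, a point the conclusion returns to.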

[Fig. 7 plot: t vs. y for the noisy two-component speech signal.]

Fig. 7. The original speech signal.

Fig. 8. Search for the peaks of the two components.

Fig. 9. Inhibiting the first component in order to search for the second peak.

[Fig. 10 plot: t vs. y; curves: original signal and reconstructed signal.]

Fig. 10. Reconstruction of the speech by the first method.

[Fig. 11 plot: t vs. y; curves: original signal and reconstructions by ML, PPT, Dechirp and LCT.]

Fig. 11. Reconstruction of the speech by parameter estimation compared with the original speech.


are not very significant; but during the other intervals, the mean error rate of the ML method is the largest of

these methods, followed by the PPT and Dechirp methods, while the mean error rate of the LCT method is the smallest. Fig. 12 thus shows that the LCT is feasible for speech signal recovery compared with other methods.
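The "mean error rate in amplitude" plotted in Figs. 6 and 12 is not defined explicitly in this section. A plausible reading (an assumption on our part) is the relative amplitude error at each instant, averaged over noise realizations; a minimal sketch:

```python
import numpy as np

def mean_amplitude_error_rate(originals, reconstructions):
    """Relative amplitude error averaged over Monte Carlo runs.
    originals, reconstructions: complex arrays of shape (runs, len(t)).
    Returns one error-rate value per time sample (an assumed definition,
    not necessarily the paper's exact metric)."""
    err = np.abs(np.abs(reconstructions) - np.abs(originals))
    return np.mean(err / np.maximum(np.abs(originals), 1e-12), axis=0)
```

Under this reading, the curves in Figs. 6 and 12 compare, per time sample, how far each method's reconstructed envelope deviates from the true one.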

5.3. Experiments on a real speech signal

From the references, we know that speech, which can be described by the AM–FM model, is a special case of multicomponent chirp signals. Sections 5.1 and 5.2 presented experimental analysis on artificially constructed signals; we now use a real speech signal for experimental analysis. We collected the source speech "we" with a sampling rate of 44 kHz and an SNR of 40 dB by means of a notebook computer with a recorder. Fig. 13 is the time-domain graph of the source speech "we" with Gaussian noise at an SNR of 40 dB. Fig. 14 shows the first method used to restore the source speech "we"; the mean error rate of the reconstructed speech is 0.1520. Fig. 15 shows the speech signal "we" reconstructed by parameter estimation in the LCT domain; the mean error rate of the reconstructed speech is 0.1460. In general, M cannot be estimated perfectly, and we obtain different results if we use different threshold values. For example, when the threshold is set to 0.2, the second method estimates the number of chirp components M to be 21. The ideal estimation method, and how to choose the right threshold, will be one of our research directions in the future.

[Fig. 12 plot: mean error rate in amplitude vs. t for the ML, PPT, Dechirp and LCT methods.]

Fig. 12. Mean error rate in amplitude using parameter estimation by the LCT compared with other methods, at an SNR of 22 dB.

[Fig. 13 plot: t vs. y(t) for the original speech signal "we".]

Fig. 13. Source speech signal "we".

[Fig. 14 plot: t vs. y(t) for the speech reconstructed by the LCT.]

Fig. 14. Reconstruction of the speech "we" by the first method.

[Fig. 15 plot: t vs. y(t) for the speech reconstructed by the LCT.]

Fig. 15. Reconstruction of the speech "we" by parameter estimation.

[Fig. 16 plot: t vs. y between 0.5 s and 0.52 s; curves: original speech and reconstruction by the first method.]

Fig. 16. Reconstruction of the speech "we" between 0.5 s and 0.52 s using the first method.
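The threshold rule just mentioned can be sketched generically: keep extracting components while the residual's strongest normalized peak exceeds the threshold. Here a single-bin DFT extractor stands in for the LCT peak search, and the 0.2 stopping threshold mirrors the value discussed above; both are illustrative assumptions:

```python
import numpy as np

def strongest_tone(x):
    """Extract the single strongest DFT component of x (a stand-in for
    the LCT-domain peak extraction used by the second method)."""
    X = np.fft.fft(x)
    k = int(np.argmax(np.abs(X)))
    comp_spec = np.zeros_like(X)
    comp_spec[k] = X[k]
    return np.fft.ifft(comp_spec), np.abs(X[k])

def estimate_M(y, extract_strongest, threshold=0.2, max_M=64):
    """Count components: keep extracting while the residual's strongest
    peak, normalized by the signal's own strongest peak, exceeds threshold."""
    resid = np.asarray(y, dtype=complex).copy()
    ref = np.max(np.abs(np.fft.fft(resid)))
    M = 0
    while M < max_M:
        comp, peak = extract_strongest(resid)
        if peak < threshold * ref:
            break
        resid = resid - comp
        M += 1
    return M

# Four tones with relative strengths 1.0, 0.8, 0.5 and 0.05: only the
# first three exceed the 0.2 threshold.
n = np.arange(256)
y = (1.0 * np.exp(2j * np.pi * 10 * n / 256)
     + 0.8 * np.exp(2j * np.pi * 30 * n / 256)
     + 0.5 * np.exp(2j * np.pi * 55 * n / 256)
     + 0.05 * np.exp(2j * np.pi * 80 * n / 256))
print(estimate_M(y, strongest_tone))  # → 3
```

As the text notes, such a rule is sensitive to the threshold: a lower value counts weak components (or noise peaks) as chirps, which is how an estimate as large as M = 21 can arise on real speech.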

To see the differences between the original and reconstructed signals more clearly, Fig. 16 gives a magnified comparison between the original and reconstructed signals from 0.5 s to 0.52 s using the second method. From Fig. 16 we can see that the fundamental frequency of this region is 1/0.0075 s ≈ 133 Hz,

[Fig. 17 plot: t vs. y between 0.5 s and 0.52 s; curves: original speech and reconstructions by ML, PPT, Dechirp and LCT.]

Fig. 17. Reconstruction of the speech "we" between 0.5 s and 0.52 s by parameter estimation compared with the original speech.

[Fig. 18 plot: mean error rate in amplitude vs. t for the ML, PPT, Dechirp and LCT methods.]

Fig. 18. Mean error rate of the reconstructed speech "we" between 0.5 s and 0.52 s by the LCT compared with other methods, at an SNR of 40 dB.

because this is a real voiced speech segment, Fig. 16 should include at least 2.5 cycles in this 0.02 s segment, and it does; this result verifies the correctness of the methods proposed in this paper.

Fig. 17 shows comparisons between the original speech and the speech reconstructed by parameter estimation in the LCT domain and in the other three transform domains. Fig. 18 shows comparisons of the mean error rate in amplitude between the proposed method and the other methods. In Fig. 18, the mean error rate of the PPT method is the largest of these methods, followed by the Dechirp and ML methods, while the mean error rate of the LCT method is small compared to the others. These results show that the proposed method is also effective for real speech signal recovery.

6. Conclusion

In this paper, two recovery methods for speech based on the AM–FM model in the LCT domain are proposed. One depends on linear canonical transform domain filtering; the other is based on chirp signal parameter estimation to restore the speech signal. Experiments are also presented to show the performance of the methods. However, both methods have their own shortcomings; for example, the filtering may not remove enough of the noise energy. Besides, in multicomponent chirp signal detection, a strong component can interfere with the estimation of a weak component; how to avoid the impact of such interference is a problem that will be investigated in detail in our future work. At the same time, some problems remain in applying the proposed methods to real speech signal processing: for example, the ideal number of chirp components M in a real speech signal, in general, cannot be estimated perfectly. This problem should also be investigated and discussed in future work.

Acknowledgements

The authors would like to thank the anonymous reviewers and the Subject Editor for their valuable comments and suggestions for the improvement of the manuscript, especially for the discussions of the ability of the proposed methods to handle real speech signals. The authors would also like to thank Dr. Hai Jin of Beijing Institute of Technology for proofreading the paper.

References

Yin, H., Nadeu, C., Hohmann, V., Xie, X., Kuang, J.M., 2008. Order adaptation of the fractional Fourier transform using the intraframe pitch change rate for speech recognition. In: The 6th International Symposium on Chinese Spoken Language Processing, ISCSLP'08.

Teager, H.M., Teager, S.M., 1989. Evidence for nonlinear sound production mechanisms in the vocal tract. In: NATO Advanced Study Institute on Speech Production and Speech Modelling, Bonas, France.

Dimitriadis, D., Maragos, P., 2005. Robust AM–FM features for speech recognition. IEEE Signal Processing Letters 12 (9), 425–434.

Bovik, A.C., Maragos, P., Quatieri, T.F., 1993. AM–FM energy detection and separation in noise using multiband energy operators. IEEE Transactions on Signal Processing 41 (12), 3245–3265.

Santhanam, B., Maragos, P., 2000. Multicomponent AM–FM demodulation via periodicity-based algebraic separation and energy-based demodulation. IEEE Transactions on Communications 48 (3), 473–490.

Huang, N.E., et al., 1998. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London A 454 (1971), 903–995.

Gianfelici, F., Biagetti, G., Crippa, P., Turchetti, C., 2007. Multicomponent AM–FM representations: an asymptotically exact approach. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), 823–837.

Aizenberg, I., Astola, J.T., 2006. Discrete generalized Fresnel functions and transforms in an arbitrary discrete basis. IEEE Transactions on Signal Processing 54 (11), 4261–4270.

Fan, H.Y., Lu, H.L., 2006. Collins diffraction formula studied in quantum optics. Optics Letters 31 (17), 2622–2624.

Kostenbauder, A.G., 1990. Ray-pulse matrices: a rational treatment for dispersive optical systems. IEEE Journal of Quantum Electronics 26 (6), 1148–1157.

Moshinsky, M., Quesne, C., 2006. Linear canonical transformations and their unitary representations. Journal of Mathematical Physics 27 (4), 665–670.

Collins, S.A., 1970. Lens-system diffraction integral written in terms of matrix optics. Journal of the Optical Society of America 60, 1168–1177.

Bargmann, V., 1961. On a Hilbert space of analytic functions and an associated integral transform. Part I. Communications on Pure and Applied Mathematics 14, 187–214.

Pei, S.C., Ding, J.J., 2002. Eigenfunctions of linear canonical transform. IEEE Transactions on Signal Processing 50 (1), 11–26.

Sharma, K.K., Joshi, S.D., 2008. Uncertainty principle for real signals in the linear canonical transform domains. IEEE Transactions on Signal Processing 56 (7), 2677–2683.

Tao, Ran, Deng, Bing, Wang, Yue, 2009. Fractional Fourier Transform and its Applications. Tsinghua University Press, Beijing.

Li, Cui-Ping, Li, Bing-Zhao, Xu, Tian-Zhou, 2012. Approximating bandlimited signals associated with the LCT domain from nonuniform samples at unknown locations. Signal Processing 92 (7), 1658–1664.

Sharma, K.K., Joshi, S.D., 2006. Signal separation using linear canonical and fractional Fourier transforms. Optics Communications 265, 454–460.

James, D.F., Agarwal, G.S., 1996. The generalized Fresnel transform and its applications to optics. Optics Communications 126, 207–212.

Li, Bing-Zhao, Xu, Tian-Zhou, 2012. Spectral analysis of sampled signals in the linear canonical transform domain. Mathematical Problems in Engineering, 19.

Li, Bing-Zhao, Tao, Ran, Wang, Y., 2007. New sampling formulae related to the linear canonical transform. Signal Processing 87, 983–990.

Zhao, Juan, Tao, Ran, Li, Yan-Lei, Wang, Yue, 2009. Uncertainty principles for linear canonical transform. IEEE Transactions on Signal Processing 57 (7), 2856–2858.

Li, Bing-Zhao, Tao, Ran, Xu, Tian-Zhou, Wang, Yue, 2009. The Poisson sum formulae associated with the fractional Fourier transform. Signal Processing 89 (5), 851–856.

Tao, Ran, Zhang, Feng, Wang, Yue, 2008. Fractional power spectrum. IEEE Transactions on Signal Processing 56 (9), 4199–4206.

Stern, A., 2008. Uncertainty principles in linear canonical transform domains and some of their implications in optics. Journal of the Optical Society of America A 25 (3), 647–652.

Maragos, P., Quatieri, T., Kaiser, J.F., 1993. On amplitude and frequency demodulation using energy operators. IEEE Transactions on Signal Processing 41 (4), 1532–1550.

Chew, K.C., Soni, T., Zeidler, J.R., 1994. Tracking model of an adaptive lattice filter for a linear chirp FM signal in noise. IEEE Transactions on Signal Processing 42 (8), 1939–1951.

Yao, J., Zhang, Y.T., 2002. The application of bionic wavelet transform to speech signal processing in cochlear implants using neural network simulations. IEEE Transactions on Biomedical Engineering 49 (11), 1299–1309.

Smith III, J.O., Serra, X., 1992. PARSHL: a program for the analysis/synthesis of inharmonic sounds based on a sinusoidal representation. In: Proceedings of the ICMC'87. Available from: <http://ccrma.stanford.edu/~jos/parshl/>.

Friedlander, B.J.F., 1995. Estimation of amplitude and phase parameters of multicomponent signals. IEEE Transactions on Signal Processing 43 (4), 917–926.

Abatzoglou, T.J., 1986. Maximum likelihood joint estimation of frequency and frequency rate. IEEE Transactions on Aerospace and Electronic Systems 22 (6), 708–716.

Almeida, L.B., 1994. The fractional Fourier transform and time-frequency representations. IEEE Transactions on Signal Processing 42 (11), 3084–3091.

Qi, L., Tao, R., 2003. Multicomponent LFM signal detection and parameter estimation based on the fractional Fourier transform. Science in China Series E 33 (8), 749–759.

Koc, A., Ozaktas, H.M., Candan, C., Kutay, M.A., 2008. Digital computation of linear canonical transforms. IEEE Transactions on Signal Processing 56 (6), 2383–2394.