
A comparison of spectral smoothing methods for segment concatenation based speech synthesis

D.T. Chappell†, John H.L. Hansen

Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Campus Box 594 (Express Mail: 3215 Marine Street, Room E-265), Boulder, Colorado 80309-0594; phone: 303-735-5148; fax: 303-735-5072; [email protected]

† Department of Electrical Engineering, Duke University, P.O. Box 90291, Durham, NC 27708-0291; [email protected]


D.T. Chappell, J.H.L. Hansen, "A comparison of spectral smoothing methods for segment concatenation based speech synthesis," Speech Communication, Volume 36, Issues 3-4, March 2002, Pages 343-373.


A comparison of spectral smoothing methods for segment concatenation based speech synthesis

David T. Chappell b, John H.L. Hansen a,b,*

a Robust Speech Processing Laboratory (RSPL), Center for Spoken Language Research (CSLR), Room E265, University of Colorado, 3215 Marine St., P.O. Box 594, Boulder, CO 80309-0594, USA
b Department of Electrical Engineering, P.O. Box 90291, Duke University, Durham, NC 27708-0291, USA

* Corresponding author. Tel.: +1-303-735-5148; fax: +1-303-735-5072. E-mail address: [email protected] (J.H.L. Hansen). This work was supported in part by SPAWAR under grant No. N66001-92-0092.

Received 21 April 1999; received in revised form 24 May 2000; accepted 15 December 2000

Abstract

There are many scenarios in both speech synthesis and coding in which adjacent time-frames of speech are spectrally discontinuous. This paper addresses the topic of improving concatenative speech synthesis with a limited database by proposing methods to smooth, adjust, or interpolate the spectral transitions between speech segments. The objective is to produce natural-sounding speech via segment concatenation when formants and other spectral features do not align properly. We consider several methods for adjusting the spectra at the boundaries between waveform segments. Techniques examined include optimal coupling, waveform interpolation (WI), linear predictive parameter interpolation, and psychoacoustic closure. Several of these algorithms have been previously developed for either coding or synthesis, while others are enhanced. We also consider the connection between speech science and articulation in determining the type of smoothing appropriate for given phoneme–phoneme transitions. Moreover, this work incorporates the use of a recently-proposed auditory-neural based distance measure (ANBM), which employs a computational model of the auditory system to assess perceived spectral discontinuities. We demonstrate how actual ANBM scores can be used to help determine the need for smoothing. In addition, formal evaluation of four smoothing methods, using the ANBM and extensive listener tests, reveals that smoothing can distinctly improve the quality of speech but, when applied inappropriately, can also degrade the quality. It is shown that after proper spectral smoothing, or spectral interpolation, the final synthesized speech sounds more natural and has a more continuous spectral structure. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Speech synthesis; Speech coding; Spectral smoothing; Spectral interpolation

1. Introduction

When speech is produced naturally by a human, there is a measurable degree of continuity between phone segments. This degree of continuity is related to the physical movement and placement of the vocal system articulators. When speech is produced artificially, such as in segment-based synthesis or in low-bit-rate coding, the same phone-to-phone continuity may not exist.

Speech synthesis, coding, and voice transformation can benefit from improvements in spectral smoothing. There are a number of scenarios in which the spectral structure of speech at adjacent time observations is not smooth. Listeners can detect abrupt changes in the timbre of the speech that cause the speech to sound unnatural. Although there are several existing techniques to smooth or interpolate the spectral structure of speech (Slaney et al., 1996; Conkie and Isard, 1997; Plumpe et al., 1998), most studies consider only a single method with limited performance comparison. Hence, there is still much room for improvement.

Text-to-speech (TTS) systems based on concatenation produce continuous speech by selecting waveform units from speech databases. Many of these systems use databases with a large number (e.g., over 50,000) of available segments with varied characteristics. This approach can yield high-quality speech (Hunt and Black, 1996; Hirokawa and Hakoda, 1990), but such algorithms succeed primarily because of their large databases. These large-database synthesis schemes generally concentrate on segment choice and search algorithms since the corpora contain enough sample units to include a reasonably close match for each desired phoneme. In contrast, concatenative speech synthesis with a smaller database of only a few hundred to a thousand phone unit waveforms per speaker will yield more discontinuities at segment boundaries. With a smaller database, each speech segment must be modified to fit the desired properties. The spectral characteristics of the beginning and ending of each phone must be molded to smooth the transition between adjoining units. It is also important to know when spectral smoothing is needed. For example, there should be a high degree of continuity between /aa/ and /r/ in the word car, but more discontinuity between /k/ and /aa/ in the same word. While spectral smoothing can benefit speech synthesis based on both large and small databases, smoothing is more valuable for smaller databases.

In this paper, our emphasis is on small-database concatenative synthesis, with a focus on contrasting the strengths and weaknesses of spectral smoothing algorithms. We point out, however, that spectral smoothing algorithms are useful in other scenarios as well, such as LP-based speech coding and voice transformation. The basic waveform unit used here is the phone, with multiple adjacent phones used when appropriate matches are found in the speech database.

The paper outline is as follows. First, we consider relevant background on topics in concatenative speech synthesis, spectral smoothing, and articulation. We also review a recently-proposed auditory-based distance measure that can aid the smoothing process. Next, we present several spectral smoothing algorithms. Each algorithm's description is accompanied by a discussion and evaluation. In Section 4, we describe the method used to determine which approach to spectral smoothing to use for various phone-to-phone concatenation cases. Finally, in Section 5, we present formal listener tests to evaluate the quality of the various spectral smoothing algorithms considered. We also describe algorithm results and discuss our findings on the proper use of spectral smoothing methods.

2. Background

This section provides background on several topics important to the research discussed within this paper. Several sources discuss these topics in more detail (e.g., Deller et al., 2000; Flanagan, 1972; O'Shaughnessy, 1990; Witten, 1982).

2.1. Concatenative synthesis overview

There are a number of different techniques for synthesizing speech. The technique used in this work, concatenative synthesis, starts with a collection of speech waveform signals and concatenates individual segments to construct a new utterance. The concatenation process itself is straightforward, though large databases may require complex search algorithms, and signal processing may be used to modify the constructed signal to achieve some desired speaker characteristics. The final speech is more natural and the speaker is more recognizable than with other forms of synthesis since concatenative synthesis begins with a set of natural speech segments.

The database of recorded waveform segments is typically in one of two forms. Many concatenative speech synthesis systems produce continuous speech by selecting waveform segments from databases with a large number (i.e., over 50,000) of segments with varied characteristics (Hirokawa and Hakoda, 1990; Hunt and Black, 1996; Huang et al., 1997; Breen and Jackson, 1998). These large databases are usually not recorded with concatenative synthesis in mind but instead are generic speech corpora. Direct concatenation of segments from such a large database can yield high speech quality since the database contains enough sample segments to include a close match for each desired segment; however, such a technique is costly in terms of database collection, search requirements, and segment memory storage and organization. For databases that contain multiple instances of each speech unit, synthesizers commonly select segments based upon two cost functions: the target cost compares available segments with a theoretical ideal segment, and the concatenation cost measures the spectral continuity between potentially concatenated segments (Hunt and Black, 1996). A recent study has compared several spectral distance measures to determine which measures best predict audible discontinuities when used as concatenation costs (Klabbers and Veldhuis, 1998).

In comparison, other concatenative synthesis systems use a set of specially selected diphones with boundaries set at the phoneme centers where formants are stable. These databases are much smaller and contain only one example of each diphone in the language (about 1200 in English). Such diphone databases are typically recorded specifically for concatenative synthesis. In both database styles – generic corpora and diphone databases – the formants of concatenated speech segments may not align perfectly, but the spectral alignment is generally reasonable.

The limited amount of speech in any given database is unlikely to include segments that precisely match the desired reference segment, given the existence of any knowledge of reference segment characteristics. While most synthesizers simply take the nearest matching segment as it stands without additional processing, some systems will modify the segment before concatenation. Pitch-synchronous overlap and add (PSOLA) is often used to adjust the segment pitch and duration to match the desired reference. By manipulating pitch-synchronous analysis windows, PSOLA provides a simple mechanism for prosodic adjustment (Moulines and Charpentier, 1990; Moulines and Laroche, 1995). While a perfectly matched segment is desirable, modifying the available data is a practical method of achieving similar results. Modifying the three prosodic characteristics – pitch, duration, and power – allows a limited database to produce a wider range of speech segments for concatenation. Many implementations of PSOLA do not include spectral smoothing in order to minimize the computational complexity, but we have expanded upon the basic time-domain PSOLA algorithm to incorporate smoothing.
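To make the mechanism concrete, the fragment below is a minimal sketch of the kind of pitch-synchronous overlap-add manipulation described above; it is not the synthesizer used in this work. All names are ours, and the pitch marks are assumed to be supplied by a separate epoch detector.

```python
import numpy as np

def psola(x, marks, f0_scale=1.0, dur_scale=1.0):
    """Sketch of TD-PSOLA: two-period Hann-windowed frames taken at
    analysis pitch marks are overlap-added at rescaled synthesis epochs.
    x: speech samples; marks: increasing sample indices of pitch epochs."""
    y = np.zeros(int(len(x) * dur_scale))
    t = float(marks[0])                        # first synthesis epoch
    while t < len(y):
        # analysis mark that maps to this synthesis time under dur_scale
        i = int(np.searchsorted(marks, t / dur_scale))
        i = min(max(i, 1), len(marks) - 2)
        T = marks[i + 1] - marks[i]            # local pitch period
        lo, hi = marks[i] - T, marks[i] + T    # two-period analysis frame
        if lo < 0 or hi > len(x):
            t += T / f0_scale
            continue
        frame = x[lo:hi] * np.hanning(hi - lo)
        start = int(t) - T                     # center frame on the epoch
        for k in range(hi - lo):               # overlap-add into output
            if 0 <= start + k < len(y):
                y[start + k] += frame[k]
        t += T / f0_scale                      # epoch spacing sets new F0
    return y
```

Raising f0_scale packs the synthesis epochs closer together (higher pitch); raising dur_scale reuses analysis frames over a longer output (longer duration), which is the prosodic adjustment the text describes.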

2.2. Spectral smoothing

In both speech synthesis and audio coding, there are circumstances where subsequent data segments have audibly different spectra at their adjoining boundaries. Signal processing can be used to smooth the existing waveform or create new data to bridge the gap between segments resulting from compression or coding errors. Straightforward linear interpolation in the frequency domain does not yield acceptable results, and therefore alternative algorithms (see Section 3) are needed to provide more natural transitions. It is noted that spectral smoothing generally indicates modification of existing audio frames and spectral interpolation means the addition of frames; here we emphasize the addition of frames but do not distinguish between the two terms.

In the absence of spectral smoothing, unnatural spectral transitions will arise. Studies have shown that smooth changes in frequency are perceived as changes within a single speaker, whereas sudden changes are perceived as a change in speaker (Moore, 1997). Other research has shown that formant discontinuities are audible in TD-PSOLA synthesis (Donovan, 1996). Spectral smoothing can eliminate these audibly unnatural transitions. Therefore, the goal of this study is to explore several spectral-based smoothing and adjustment algorithms to address spectral discontinuity for segment-based concatenative synthesis and to explore ways to determine when and where the smoothing should be applied.

At present, spectral smoothing is most commonly used for speech and audio coding. Similar methods are sometimes used for speaker transformation (Savic and Nam, 1991; Mizuno and Abe, 1995; Slifka and Anderson, 1995). In comparison, spectral smoothing is only sometimes used for speech synthesis (Mizuno et al., 1993). Even though our experiments have focused on TD-PSOLA synthesis, other researchers have successfully applied spectral smoothing to other synthesis algorithms such as the Harmonic/Stochastic (H/S) model and multi-band resynthesis PSOLA (MBR-PSOLA or MBROLA) (Dutoit and Leich, 1993; Dutoit, 1994) as well as the harmonic plus noise model (HNM) (Syrdal et al., 1998). In some cases, spectral smoothing of concatenated speech can degrade synthesis quality rather than yield improvement (i.e., produce various artifacts such as suddenly appearing/disappearing narrowband peaks, spectral peaks fading and rising rather than shifting in frequency, and nonlinear peak frequency shifts (Goncharoff and Kaine-Krolak, 1995)). Spectral smoothing tends to perform best when the original spectra are similar to each other, such as in speech coding and concatenative synthesis with large or specially-designed databases.

2.3. Spectral distance measure

In a previous study, an auditory-neural based measure (ANBM) was proposed which aids in the selection of speech units for speech synthesis via segment concatenation (Chappell and Hansen, 1997; Hansen and Chappell, 1998). The ANBM measures the "distance" between the spectral characteristics of two adjacent time-slices in the speech signal. It differs from other spectral distance measures in being based upon a model of mammalian auditory perception.

The ANBM uses the output of a computational auditory model to generate one feature vector for each frame of speech. First, a computational model generates the average firing rates of synapses of auditory nerves. We use Carney's nonlinear auditory model, which is based upon and closely approximates measurements of auditory nerve (AN) fibers in cats (Carney, 1992). The auditory model calculates the time-varying spike rate for the synapse between an inner hair cell and an AN. Next, the analysis stage locates the primary, or modal, frequency at which each AN fires. To find the primary firing frequency for an AN channel, we first calculate the spectrum and then find the frequency for the corresponding peak absolute value. This dominant frequency is stored in the feature vector for that frame of speech. Both the auditory model and the measure's analysis stage operate on each AN channel separately; for each channel k for a given frame, the analysis stage stores the primary firing frequency value x_k within the feature data vector \vec{x}. Finally, the feature vectors are compared via the city-block metric shown below to estimate the perceived mismatch between frames of speech,

d_1(\vec{x}, \vec{y}) = \sum_{k=1}^{N} |x_k - y_k|.   (1)

A lower ANBM score implies less perceived auditory difference, while a larger score implies greater perceived discontinuity.

This measure can therefore provide information on the amount of perceptual segment mismatch to direct additional signal processing to smooth any discontinuities or disfluencies. Here, we consider its use for determining whether a concatenated segment boundary is sufficiently smooth, though it may also be useful for determining the degree to which a speech signal sounds natural or concatenated.

While the original formulation of the ANBM did not specify the number of auditory-nerve channels, we chose to use 32. Using the known characteristic frequencies for cat inner hair cells, 32 channels cover characteristic frequencies from 100 Hz to 3587 Hz (Liberman, 1982). Phase locking, which is the mechanism assumed in the technique of finding the modal firing frequency, is known to occur only below 4–5 kHz (Moore, 1997).
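As an illustration, the sketch below assumes the per-channel firing-rate signals from the auditory model (e.g., Carney's, which is far too involved to reproduce here) are already available; the function names are ours.

```python
import numpy as np

def modal_frequencies(rates, fs):
    """ANBM analysis stage (sketch): for each AN channel's firing-rate
    signal in `rates` (shape [channels, samples], e.g., 32 channels for
    characteristic frequencies from 100 to 3587 Hz), return the modal
    firing frequency, i.e., the location of the spectral peak."""
    spectra = np.abs(np.fft.rfft(rates, axis=1))
    freqs = np.fft.rfftfreq(rates.shape[1], d=1.0 / fs)
    return freqs[np.argmax(spectra, axis=1)]      # feature vector

def anbm(x_feat, y_feat):
    """City-block distance of Eq. (1); a larger score implies a greater
    perceived spectral discontinuity between the two frames."""
    return float(np.sum(np.abs(x_feat - y_feat)))
```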

2.4. Articulation

Articulation is rarely considered in spectral smoothing. Nonetheless, knowledge of manner of articulation and its acoustic correlates can aid in spectral smoothing. Speech articulation and its connection to acoustics are well understood and described in several texts (e.g., Coker, 1976; Deller et al., 2000; Ladefoged, 1975; Moore, 1997; O'Shaughnessy, 1990; Pickett, 1980).

The effects of coarticulation on formant movement represent a major cause of the need for spectral smoothing. This influence represents a serious challenge for segment-based synthesis when segment codebook sizes are small, since segments are more likely to be individual phones. To overcome these problems, many speech synthesis systems use diphone units, which are bordered at the relatively stable positions in the centers of phonemes, rather than phones, which are bordered at the unstable transition positions.

Coarticulation is caused by articulators moving smoothly into position and gradually returning to neutral positions over the course of one or more phones. When an articulator's motion is not directly involved in the production of a phone, it is free to move according to previous and subsequent phonemes. For example, labial phonemes allow the tongue to move freely, while lingual phonemes allow the lips to move. The articulators used for speech production are limited to different rates of motion (O'Shaughnessy, 1990; Zemlin, 1968), which implies that the transition periods between different phonemes should have different durations. Research using an articulatory model has demonstrated the effects of movement of articulatory organs on segmental duration (Shiga et al., 1998).

Acoustics and articulation are important for spectral smoothing as well as general synthesis due to the effects of coarticulation on formant positions. Some phonemes can yield similar steady-state spectra but differ in phone transitions (e.g., /d/ and /g/ versus /b/) (Parthasarathy and Coker, 1992). In nasalization and rhotacization, a consonant colors the spectrum of adjacent vowels in a predictable way.

In English, the first three formants largely determine the phoneme. F1 is high when the tongue constriction is nearer the glottis and when the mouth opening is large and unrounded. F2 generally increases as the point of constriction moves forward from the glottis, as the tongue constriction narrows, and as the size of the mouth opening increases. F3 increases as the constriction moves forward from the glottis and also as the mouth opening increases in size and becomes less rounded. Moreover, formant bandwidth depends upon the degree of constriction: open vowels have narrower formants than constricted vowels (Stevens and House, 1955). Fig. 1 illustrates the correlation between articulation and the first three formant positions for vowels (Ladefoged, 1981). Several sources (e.g., Fant, 1960; Witten, 1982; Deller et al., 2000; Ladefoged, 1975; O'Shaughnessy, 1990) cite average formant frequencies for various vowels and describe the spectral structure for consonants.

Table 1 summarizes the relationships between articulation and acoustic features (Ladefoged, 1975). The described acoustic features are intended only as rough guides, and the actual acoustic correlates depend on the exact combination of articulatory features.

3. Spectral smoothing algorithms

Four approaches to spectral smoothing are considered here, with details provided for only those methods which show encouraging results.

Fig. 1. Acoustic (formant) correlates of articulatory features for vowels.

Although several researchers have studied smoothing techniques (e.g., audio morphing (Slaney et al., 1996), HMM-based smoothing (Plumpe et al., 1998)), the field is emerging, and typically only common existing speech processing algorithms (e.g., linear prediction techniques) are employed. Several of these processing techniques were originally developed for other purposes, including interpolation for audio coding and voice transformation, and in general are not typically applied for spectral smoothing in concatenative synthesis. Here we focus only on their application to spectral smoothing.

One approach to smoothing is to interpolate between boundaries of adjoining segments. Therefore, these anchor frames should be good representatives of the sound. The approach taken here is to perform linear interpolation in different domains between the two anchor frames, though we also suggest cubic spline interpolation as an alternative. The frames are pitch-synchronous, where one frame is two pitch periods long; this synchronization is important for some interpolation methods.

One important issue in spectral smoothing is to determine for which circumstances smoothing should be performed. If two segments have a sufficiently close spectral match, then distortion introduced by smoothing may negate the performance gain. Moreover, many smoothing techniques are inappropriate for use with unvoiced speech.

Another issue is to determine the best time span over which to interpolate. The pitch will remain continuous if data is inserted equal to an integer number of pitch periods. Our experiments have shown that three to five periods generally work well; however, further study is needed to determine the proper number of pitch periods for different circumstances.

The remainder of this section describes the smoothing algorithms in detail. We focus on LP algorithms since they are commonly used and can yield good results. We also devote special attention to the continuity effect since it is a new approach for smoothing. In addition, we mention other spectral smoothing algorithms to complete the discussion. The smoothing algorithms we examine are (i) optimal segment coupling, (ii) waveform interpolation, (iii) LP techniques (pole shifting and LSF interpolation), and (iv) the continuity effect.

In the following four sections, we illustrate examples of spectral smoothing using speech spectrograms at the end of each section. Fig. 18 summarizes all sample speech spectrograms for the phrase "carry an oily rag". Fig. 18(a) represents the phrase produced naturally by a male speaker. Fig. 18(b) reflects the results from segment synthesis with no spectral smoothing from a codebook with a nominal size of 380 segments.

3.1. Optimal coupling

It is standard in concatenative synthesis that the boundaries of speech segments be fixed, but the optimal coupling technique allows the segment boundaries to move in order to improve the spectral match between adjacent segments (Conkie and Isard, 1997). At its simplest, the optimal coupling technique is rather straightforward. During synthesis, each segment's boundary for concatenation is chosen in order to fit best with the adjacent segments in the synthesized utterance. An objective measure of spectral mismatch is used to determine the level of spectral fit between segments at various possible boundaries. The measure of spectral mismatch is tested at a number of possible segment boundaries, and the minimum measure score indicates the location of the closest spectral match.

Table 1
Acoustic correlates of articulatory features

Vowel: formant frequencies typically contained in each of the frequency bands 0–500, 500–1500, 1500–2500 Hz, etc.
Bilabial: F2 and F3 comparatively low
Alveolar: F2 around 1700–1800 Hz
Velar: F2 usually high; common origin of F2 and F3 transitions
Retroflex: general lowering of F3 and F4
Stop: sharp beginning of formant structure
Fricative: random noise pattern dependent on point of articulation
Nasal: formant structure similar to vowels with formants around 250, 2500, 3250 Hz; F2 low amplitude; distinct antiresonance
Lateral: formant structure similar to vowels with formants around 250, 1200, 2400 Hz; higher formants reduced in amplitude
Approximant: formant structure similar to vowels; usually changing
Dental: F2 around 1600–1800 Hz; F3 around 2900–3050 Hz

If two segments are to be concatenated, where the end frame of the first segment is in the range x_i, ..., x_f, and the start frame of the second segment is in the range y_i, ..., y_f, then the distance measure function d(·) is evaluated at all possible boundary positions to find min_{a,b} d(x_a, y_b). For concatenation of this segment pair, the boundary frames x_a and y_b of the segments are selected such that the measured mismatch between frames is minimal. Fig. 2 shows an example scenario where moving the segment boundaries will noticeably change the spectral alignment of formants.
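A brute-force version of this search is sketched below under our own naming; the frame feature vectors (e.g., MFCCs or ANBM features) and the distance function d are supplied by the caller.

```python
import numpy as np

def optimal_coupling(end_frames, start_frames, d):
    """Exhaustive optimal-coupling search: end_frames holds feature
    vectors for candidate final frames x_i..x_f of the left segment,
    start_frames for candidate initial frames y_i..y_f of the right
    segment.  Returns the boundary pair (a, b) minimizing d(x_a, y_b)."""
    best_ab, best = (0, 0), np.inf
    for a, xa in enumerate(end_frames):
        for b, yb in enumerate(start_frames):
            score = d(xa, yb)
            if score < best:
                best_ab, best = (a, b), score
    return best_ab, best
```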

While any form of measure may be used to determine the amount of mismatch, for the sake of improving spectral quality, using a spectral discontinuity measure is appropriate. Measures considered here include the difference of mel-frequency cepstral coefficients (MFCC) and the auditory-neural based measure (ANBM) (Hansen and Chappell, 1998).

In simple frame mismatch, distance measures are calculated for frames ending at various possible segment boundaries. The distance measures take into account only the single audio frame from each speech segment which lies next to the boundary under consideration. More advanced variations of optimal coupling also consider the direction of spectral parameter motion. Although our studies used only a simple frame mismatch, more complex proposals include the use of regression coefficients and linear coefficient fit (Conkie and Isard, 1997).

There are a number of advantages to the optimal coupling technique. The algorithm is conceptually simple and easy to implement. It can be combined with other spectral smoothing techniques and need not stand alone. Optimal coupling can successfully complement other smoothing techniques because it causes formants to be naturally closer to each other at segment boundaries. Since coupling does not modify the existing speech, it does not introduce additional artifacts. For spectral matching purposes, it effectively expands the number of speech segments in the database.

Despite these advantages, there are also several disadvantages to optimal coupling. Finding the optimal coupling point requires a search for each segment joint, and an exhaustive search is required for full optimal coupling. Moving the segment boundaries carries the risk of accidentally cutting an important part of a sound or adding an inappropriate sound. Errors often occur by leaving out too much of a sound, though heuristic rules can reduce or eliminate this effect. More importantly, optimal coupling is limited in its effectiveness since it only works with existing speech and does not actually modify formant positions.

Fig. 2. Optimal segment coupling.

In summary, optimal segment coupling is a relatively simple approach for a reasonable benefit. It is easy to use either by itself or in combination with a more direct smoothing algorithm. In some forms of concatenative synthesis, segments are specifically designed or pre-selected from a database such that formants match smoothly at their edges, and in these cases optimal coupling will provide little if any gain. In comparison, optimal coupling clearly has no application to spectral smoothing for speech coding since the original signal is already naturally smooth.

Fig. 18(c) shows a spectrogram of the phrase "carry an oily rag" with optimal coupling between segments. In Section 5.4 we discuss this spectrogram and compare it with similar spectrograms resulting from other spectral smoothing algorithms.

3.2. Waveform interpolation

Waveform interpolation (WI) is a speech-coding technique which takes advantage of the gradual evolution of the shape of pitch-period waveforms. The WI coder operates on a frame-by-frame basis. In each segment, the pitch track is calculated and characteristic waveforms are extracted. Each characteristic waveform is typically one pitch period long, but the length may be an integer number of periods. In coding, characteristic waveforms are extracted from the original signal at regular time intervals. In order to conserve space in coding, a WI-coded signal is typically transmitted as quantized frequency coefficients for separate rapidly and slowly evolving components. On reconstruction, intermediate pitch-cycle waveforms between transmitted waveforms are approximated by interpolation. To produce an interpolated waveform, both the pitch period and waveform signal are interpolated in either the time domain (at a common period of 2π radians) or the frequency domain (Kleijn et al., 1996; Kleijn and Haagen, 1995). WI is essentially a form of smoothing intended for speech and audio coding.

Though developed for coding purposes, WI can also be adapted for use in spectral smoothing. In this case, the waveform is interpolated between frames at the edges of speech segments to create new inserted smoothed data. The concept is the same as for coding, but the end goal is different. For synthesis, the original waveform can be kept intact for interpolation rather than compressing the data via quantization. When the original waveforms are available, interpolating in either the time or the frequency domain yields identical results. A new pitch period of the desired length is constructed by averaging the amplitudes of the periods of natural speech at the same relative positions within the waveforms. Such a scheme has been used on frames with constant pitch in MBROLA synthesis (Dutoit and Leich, 1993).
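For the time-domain case, the averaging step can be sketched as follows (our own minimal formulation): both anchor periods are mapped to a common relative time axis, averaged with a position-dependent weight, and resampled to the desired new period length.

```python
import numpy as np

def interpolated_period(p1, p2, k, out_len):
    """Construct one interpolated pitch period between natural periods
    p1 and p2 (1-D arrays, possibly of different lengths).  k in [0, 1]
    is the relative position of the new period between the anchors;
    amplitudes are averaged at the same relative positions."""
    t = np.linspace(0.0, 1.0, out_len)
    a = np.interp(t, np.linspace(0.0, 1.0, len(p1)), p1)
    b = np.interp(t, np.linspace(0.0, 1.0, len(p2)), p2)
    return (1.0 - k) * a + k * b

# To insert N periods between two segments: k = n / (N + 1), n = 1..N.
```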

Fig. 3 shows an example of WI with two natural frames of speech (/aa/ and /ae/) and one interpolated frame. In addition, a performance example is shown in Fig. 18(d), which shows an example spectrogram of a phrase with smoothing performed via waveform interpolation.

We conclude that WI is generally better than no smoothing but has difficulty producing consistent results. In the simplified version, WI is conceptually simple, computationally fast, and easy to implement in the time domain. When spectral envelopes are similar, WI can give good results. However, it does not perform actual formant smoothing and thus yields only small improvements. WI generally produces smoother results for a large number of interpolated pitch periods and works best on vowel-to-vowel transitions. Although the results often sound smoother than with no interpolation, there are artifacts, and the general quality is not as good as with smoothing techniques that directly manipulate formant positions.

In addition to direct use for calculating smoothed speech frames, WI can also be applied for residual interpolation in linear prediction (LP) methods (Kleijn and Haagen, 1995). LP methods concentrate on interpolating the spectral envelope, but the residual signal must also be generated. Rather than using a generic pulsed excitation or a single residual appropriate for the speaker, we use WI to interpolate between the residuals of the bordering frames of natural speech.


Fig. 4 illustrates the application of WI to the residual in LP smoothing. This use of WI with LP follows a recent trend in speech synthesis towards mixing deterministic and noise components.

We performed evaluations using WI with LP on regions of speech. WI generated the residual to go along with the original LP spectrum. The resynthesized speech was in some cases practically indistinguishable from the original, while there were usually some small artifacts. With longer interpolated regions, the level of noticeable distortion was greater. When the interpolated region differs from the speech on either side (e.g., near silence or a short phone in the interpolated region), the distortion is more noticeable. As expected, the frames differ more from the original natural speech in the center of the interpolated region. The most common distortion was that the timing of pitch pulses was off from the original, giving a somewhat artificial quality to the speech. It is believed that this artifact is due at least in part to using a constant-length frame size for this test, and that a pitch-period-based frame size would improve the resulting speech quality. Despite occasional distortion, the interpolated region was generally acceptable. Thus, WI is an appropriate way to generate the residual for LP interpolation techniques.

Fig. 3. Example of waveform interpolation for a single inserted frame from /aa/ (anchor Frame 1 on top) to /ae/ (anchor Frame 2 on bottom).

Fig. 4. Waveform interpolation applied to residual of LP coding.

3.3. LP techniques

LP interpolation techniques are often used to smooth LP-filter coefficients in LP coding (LPC) and sometimes also for speech synthesis (Shadle and Atal, 1979). The basic strategy is to model the speech signal as separate spectral and residual (filter and source) components and to adjust each component separately. Here, we perform LP spectral parameter interpolation in one of several domains, while the residual is interpolated using WI.

If the LP spectra are directly interpolated, formants will rise and fall in the interpolated frames of speech rather than move smoothly in frequency, amplitude, and bandwidth as is desired. Fig. 5 shows an example of improper results from interpolating LP spectra (two anchors with four interpolated frames); for comparison with a more sophisticated interpolation algorithm, the anchor-frame spectra are the same as for Fig. 6. Thus, rather than LP spectra interpolation, we strongly recommend interpolating LP parameters in a domain where the parameters are closely linked to formant location. To perform spectral smoothing, the LP parameters should be interpolated and recombined with a separately-interpolated residual.

LPC analysis yields less frame-to-frame variation and smoother evolution of the coefficients when analysis is performed on pitch-synchronous windows (Paliwal and Kleijn, 1995). Thus, it works well in PSOLA-based systems.

LP interpolation has a number of advantages but also some disadvantages. Most importantly, LP methods allow direct manipulation of the spectral envelope and thereby indirect manipulation of formants in a way desirable for smoothing. On the downside, it is difficult to determine which parameters control which formants and how to match parameters between frames. Also, in some domains, the inherent ordering of parameters does not give the best matching of parameters. Additional algorithm processing is required to translate the signal into the desired LP domain and back unless the data is already LP-encoded. Since LP analysis is based on an all-pole model, it does not adequately model nasal consonants and nasalized vowels. Despite these limitations, we have found that LP interpolation techniques can provide good results which exceed those of the other algorithms tested in this study.

LP interpolation was examined in two different domains. The following two subsections give details on these approaches and discuss their individual strengths and weaknesses.

Fig. 5. Example of directly interpolating LP spectra.

3.3.1. LP pole shifting

In speech coding, the LP poles are rarely shifted directly in the z-plane because the parameters are usually stored and transmitted in another representation. Nonetheless, LP poles are a useful representation for interpolation. The two major problems involved with pole shifting are (1) matching the poles between the two anchor frames and (2) performing the interpolation.

Pole matching is not a simple problem, and it is related to the problem of finding formants based upon LP poles (Snell and Milinazzo, 1993). For coding natural speech, poles are unlikely to move very far between known values, and thus the matching problem is easier than for concatenative synthesis, where poles may be at widely different positions in adjacent segments. The obvious solution of simply aligning poles by frequency order fails to consider several situations, such as the presence of real poles or other poles that do not directly correspond to formants but instead contribute to overall spectral shape. Moreover, there may be cases where formants split or merge, or arise in such a way that there is not a one-to-one correspondence between poles in different frames.

A good approach is to search for the optimal pole match using an appropriate distance measure formula such as the following (Goncharoff and Kaine-Krolak, 1995):

D(p_0, p_1) = \left| \ln\frac{p_1}{p_0} \right| \cdot
\begin{cases}
  \dfrac{\ln[(1 - r_0^2)/(1 - r_1^2)]}{\ln(r_1/r_0)}, & r_0 \neq r_1, \\
  \dfrac{2r^2}{1 - r^2}, & r = r_0 = r_1,
\end{cases}   (2)

where p_i are complex pole positions and r_i are pole radii. Our experiments have shown that this distance formula gives insufficient weight to the radius and thus may match a formant-connected pole with an overall spectral shaping pole nearby in frequency. This distance measure has this weakness because it was derived from equations for frequency and bandwidth which are based on a single pole rather than a multi-pole system. An improved distance measure could lead to better automatic pole matching and thereby better spectral smoothing.
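In code, Eq. (2) can be realized directly; the sketch below uses our own naming and applies the analytic limit for the equal-radius case given in the second branch of the equation.

```python
import numpy as np

def pole_distance(p0, p1):
    """Pole-matching distance of Eq. (2) (Goncharoff and Kaine-Krolak,
    1995).  p0, p1: complex pole positions inside the unit circle."""
    r0, r1 = abs(p0), abs(p1)
    log_term = abs(np.log(p1 / p0))        # complex log covers radius and angle
    if np.isclose(r0, r1):                 # limiting case r = r0 = r1
        radius_term = 2.0 * r0**2 / (1.0 - r0**2)
    else:
        radius_term = np.log((1 - r0**2) / (1 - r1**2)) / np.log(r1 / r0)
    return log_term * radius_term
```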

In pole matching, a common problem arises when one frame of speech has more real poles than the adjoining speech segment frame. Fig. 6 illustrates this scenario, where four frames of speech are interpolated between the pole plots of Frames 1 and 2. One solution to the pole-matching problem is to convert the pole constellation to a domain where each pole has a complex conjugate and then use a distance measure to match poles (Goncharoff and Kaine-Krolak, 1995). Another approach is to first match conjugate pairs that result in the minimum overall distance between matched pairs. For each remaining unmatched conjugate pair, the nearest single real pole is selected as a match.

Fig. 6. Example of a successful LP pole shifting scenario.

Fig. 7 shows an important pole-matching scenario where improper matching yields poor results. Whether the poles are matched between Frames 1 and 2 by frequency ordering or by using Eq. (2), the first two poles become inappropriately criss-crossed over the four interpolated frames. With user assistance, the proper pole match could easily be made, but both automatic algorithms fail. As a result, the movement between Frames 1 and 2 is smooth for F2 and F3 and the overall spectral slope, but F1 (ca. 550 Hz) suddenly drops in amplitude for the interpolated frames and then suddenly rises in amplitude.

Fig. 7. Example of a failed LP pole shifting scenario from Frame 1 to Frame 2 with four interpolated frames.

Once poles have been appropriately matched between anchor frames, their positions must be interpolated. This should not be performed directly in the complex plane; instead, the magnitude and phase of the poles should be interpolated separately. Separate interpolation of real and imaginary components in the z-plane can produce values which are not truly intermediate, but interpolating in the magnitude-phase domain produces more reasonable results. It is known that the magnitude of a pole relates to formant bandwidth, while the angle relates to formant frequency. While the pole radii can be interpolated directly, bandwidth can be directly interpolated by using the standard formulae that relate the frequency F and bandwidth BW of a single pole p_i to its angle θ_i, its radius r_i, and the sampling period T_s,

F_i = \frac{\theta_i}{2\pi T_s}, \qquad BW_i = \frac{-\ln(r_i)}{\pi T_s}.   (3)

Thus, to linearly interpolate the frequency and the bandwidth, the angle should be interpolated linearly while the radius is interpolated geometrically. Specifically, if the new pole p_i = r_i∠θ_i is a fraction k_1 from a known pole p_1 = r_1∠θ_1 and a fraction k_2 from pole p_2 = r_2∠θ_2, where k_1 + k_2 = 1, then the components of p_i should be generated by the equations

\theta_i = k_1\theta_1 + k_2\theta_2 \quad \text{and} \quad r_i = r_1^{k_1} r_2^{k_2}.   (4)
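The interpolation of Eqs. (3) and (4) is compact in code; the sketch below (our naming) returns the new pole for a pair of already-matched poles.

```python
import numpy as np

def interpolate_pole(p1, p2, k2):
    """Interpolate between matched LP poles per Eq. (4): the angle
    (formant frequency, Eq. (3)) is interpolated linearly and the
    radius geometrically, so the bandwidth -ln(r)/(pi*Ts) varies
    linearly.  k2 in [0, 1] is the fraction of the way from p1 to p2."""
    k1 = 1.0 - k2
    theta = k1 * np.angle(p1) + k2 * np.angle(p2)
    r = abs(p1)**k1 * abs(p2)**k2
    return r * np.exp(1j * theta)
```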

Ideally, each LP pole pair would correspond to a single formant, but in practice multiple poles will affect the location and bandwidth of each formant, and some poles will contribute to overall spectral shape. Thus, although pole shifting does modify formants, it can have undesired effects such as formant bandwidth spreading. Quite often, the LP model order is selected with the notion that smoothing will be applied (i.e., for f_s = 8 kHz, studies will select an order of P = 9, corresponding to approximately four formant pole-pairs and one real overall shaping pole). Other research on waveform synthesis has been successful in separating individual formants based on poles for formant frequency modification (Mizuno et al., 1993). Key points of that work are that (1) formants are connected with only those poles with a low value of BW_i divided by F_i and (2) pole extraction is checked by comparing the target and calculated spectral intentions in an iterative procedure.

Fig. 18(e) shows an example spectrogram of a segment-concatenated phrase with smoothing performed via LP pole shifting.

In summary, LP pole manipulation has excellent spectral smoothing potential, yet several weaknesses are present. Shifting pole location gives the ability to shape the desired interpolated speech spectral structure. When each pole corresponds to a formant and the poles move little between anchor frames, the interpolation is simple and of high quality. In more complex situations, the relationship between pole location and spectral envelope must be considered to ensure that pole matching and interpolation give the desired results. The results can be quite good, but even more recent techniques are not sufficient to be applied in a completely unsupervised manner. In a minority of cases, pole interpolation can yield results which are worse than no smoothing. Future efforts should consider ways to automatically assess the success of the interpolated pole-shifted frames.

3.3.2. LSF interpolation

The line spectral frequency (LSF) representation, also known as the line spectrum pair (LSP), is often used for speech coding (Papamichalis, 1987). Interpolation between LSFs has been used not only for coding but also for synthesis and even spectral smoothing. LSFs are calculated from the LP poles in a technique that yields two sets of interleaved zeros on the unit circle. Representing the LPC filter in the LSF domain ensures its stability and is thus appropriate for coding and interpolation.

For coding, the LSF representation is generally accepted as giving the best performance in terms of spectral distortion, and it always yields stable filters after interpolation (Paliwal and Kleijn, 1995; Paliwal, 1995). Some comparative studies have shown that LSF interpolation gives better results than other representations when used for interpolation in coding, as measured by spectral distortion (Paliwal and Kleijn, 1995; Paliwal, 1995) or prediction error (Erkelens and Broersen, 1994). Other comparison studies showed no inherent advantage for LSFs (Atal et al., 1989).

LSFs can also be interpolated for speech synthesis. For waveform synthesis, however, there is limited benefit from the compression and quantization advantages which LSFs offer for coding.

The two major problems of pole shifting are trivial for LSF interpolation. Unlike pole shifting, LSF interpolation provides an inherent order for parameter matching. When LSF pairs are matched in the obvious sequence of increasing frequency, however, this sequence is not always in the order which yields the best results. As with pole matching, there are cases where a parameter that corresponds to a formant will be matched with a parameter that corresponds to general spectral slope. The interpolation process is also straightforward since there is only one dimension (frequency) involved. The two major interpolation methods are to either interpolate each P and Q zero separately or to first interpolate the P (position) zeros and then interpolate the difference parameters. Since the position parameters correspond to formant position while difference parameters roughly correspond to bandwidth, the latter approach is more intuitive. Specifically, if a new zero pair P_i, Q_i is a fraction k_1 from known zero pair P_1, Q_1 and a fraction k_2 from zero pair P_2, Q_2, where k_1 + k_2 = 1, then this form of interpolation yields

P_i = k_1 P_1 + k_2 P_2 \quad \text{and} \quad Q_i = P_i + k_1(Q_1 - P_1) + k_2(Q_2 - P_2).
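The same scheme in code is a one-liner per parameter set; this is a sketch with our own naming, and the P and Q arrays are assumed already matched in increasing frequency.

```python
import numpy as np

def interpolate_lsf(P1, Q1, P2, Q2, k2):
    """Interpolate matched LSF pairs: position zeros P linearly, then
    the difference parameters (Q - P), which roughly track bandwidth.
    k2 in [0, 1] is the fraction of the way from frame 1 to frame 2."""
    P1, Q1, P2, Q2 = map(np.asarray, (P1, Q1, P2, Q2))
    k1 = 1.0 - k2
    Pi = k1 * P1 + k2 * P2
    Qi = Pi + k1 * (Q1 - P1) + k2 * (Q2 - P2)
    return Pi, Qi
```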

Fig. 8 shows a scenario where LSF interpolation succeeds, while Fig. 9 shows an example where it performs poorly. In each figure, the z-plane plots for anchor Frames 1 and 2 show both the LP poles and the LSF zeros, while the plot of interpolated zeros shows the transition from the first frame's zeros (outermost ring) to the last frame's zeros (innermost ring). In Fig. 8, all formant peaks have moved smoothly across frequency as desired on the spectral envelope plot. In Fig. 9, the formants located near 800, 1400, and 1800 Hz do not move in frequency as desired but instead shift only in amplitude. As another example of performance, Fig. 18(f) shows a spectrogram of a phrase with smoothing performed via LSF interpolation.

Fig. 8. Example successful LSF interpolation scenario. Note that the two circles for interpolated zeros are used only for imaging purposes; all zeros for all frames are actually on the unit circle.

Fig. 9. Example failed LSF interpolation scenario. Note that the two circles for interpolated zeros are used only for imaging purposes; all zeros for all frames are actually on the unit circle.

Despite some obvious strengths, the use of LSFs for interpolation can also display some inherent drawbacks. The interpolation technique itself is simple: the zeros have an inherent order for matching, and the interpolation is in a single dimension. The zeros' inherent order does not always match between frames in the manner that could yield the best results after interpolation. More importantly, little can be done to improve the basic algorithm. As with LP pole manipulation, there are a small but noticeable number of scenarios in which LSF interpolation yields results which are worse than no smoothing. Although LSF interpolation is simple and can give good results, it does not hold the same potential as direct LP pole interpolation to be universally successful.

3.4. Continuity effect

The fourth smoothing approach does not perform audio signal interpolation but instead masks discontinuities. The continuity effect is a psychoacoustic phenomenon that is suggested here as a possible method for spectral smoothing. When two sounds are alternated, a less intense masked sound may be heard as continuous despite being interrupted by a more intense masking sound. The sensory evidence presented to the auditory system does not make it clear whether or not the obscured sound has continued. Psychologists call this effect "closure" (Bregman, 1990; Moore, 1997). Fig. 10 illustrates the phenomenon.

Perceptual closure occurs when a missing sound gap is filled by a noise or other sound that masks the missing sound. The visual counterpart to auditory closure is looking at a scene while moving past a picket fence; the observer assumes that the scene continues uninterrupted behind the fence boards even though only part of the scene is visible at any one time. In auditory perception, illusory continuity requires either that the masking sound contain the frequency content of the missing, theoretically masked sound or that the masking sound be near enough in frequency or time to the missing sound for simultaneous masking to occur according to the neural response of the peripheral auditory system.

The continuity effect has also been shown to work for speech signals alternated with noise. A series of studies has shown that irregularly spaced bursts of noise interrupting speech at the rate used in phone or diphone concatenation (about 6 per second) have nearly minimal effect on speech comprehension. Moreover, with this interruption frequency and the desired fraction of time spent on speech versus noise (91%), listener tests revealed a very high word articulation rate. In some circumstances, interrupting noise has been shown to actually increase intelligibility (Bregman, 1990; Moore, 1997). Similar perceptual studies have found that replacing a phone with an extraneous sound results in listeners reporting the presence of the phone, while replacing the phone with silence results in correct detection of the gap (Warren, 1970).

In the case of spectral smoothing, the continuity effect can be employed by adding noise between speech segments. Although closure has not been previously applied to speech synthesis, the concept is not entirely foreign: in some audio systems, large burst errors are sometimes filled with white noise. We extend the concept by spectrally shaping the noise so that it contains only the spectral envelope necessary to possibly contain any intermediate sound. The listener's perception fills in any gaps so that it seems as though speech is being produced within the noise, and the perceived speech is continuous with the preceding and following existing speech.

Fig. 11 shows an example of how a frequency-domain filter is obtained for inserted noise. The spectral envelopes of the two original speech frames are compared, and the filter is constructed to meet the maximum of the two envelopes at all points and to interpolate between any peaks (presumably formants) between the two spectra. To generate this spectral envelope for the noise, all peaks are found and sorted by frequency for the spectral envelopes of both of the original frames. For each frequency range between two spectral peaks, the anchor-frame envelopes are compared as follows. If the amplitude of one of the original envelopes is larger than the other at all frequencies in the range between two peaks, then that portion is directly used in the new envelope. Otherwise, that frequency range will have amplitude values constructed by interpolating between the two peaks. Once the new spectral envelope is constructed, Gaussian white noise is passed through the filter to create shaped noise that will mask any hypothetical speech between the two natural frames without introducing more noise than necessary for auditory masking.

Fig. 10. Illustration of the continuity effect.

Fig. 11. Example noise envelopes for continuity effect.
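A simplified rendering of this construction is sketched below; for brevity it takes the pointwise maximum of the two anchor envelopes rather than performing the peak-sorting and peak-to-peak interpolation described above, and all names are ours.

```python
import numpy as np

def shaped_noise(env1, env2, n_samples, rng=None):
    """Generate masking noise for the continuity effect.  env1, env2:
    magnitude envelopes of the two anchor frames, sampled on the rfft
    bins of an n_samples frame.  White Gaussian noise is shaped by the
    upper envelope so it could mask any intermediate speech sound."""
    rng = rng or np.random.default_rng()
    env = np.maximum(env1, env2)            # upper spectral envelope
    spectrum = np.fft.rfft(rng.standard_normal(n_samples)) * env
    return np.fft.irfft(spectrum, n=n_samples)
```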

Although the proposed use of the continuity effect does not provide great spectral smoothing in all situations, we have found that it performs well in some cases and has distinct potential for improvement. From our experiments, a section of inserted shaped noise about 45–125 ms long generally yields the best results. A section of noise that is too long gives the impression of either inserting a stop or simply playing noise. In most cases, it is clear that some noise is present in the signal, but it is also feasible that the speech continues smoothly in the background. In other cases, the shaped noise sounds worse than nothing, but in rare cases it sounds very good and natural. Our experiments revealed an optimal noise amplitude of about 1/4 the rms amplitude of the mean of the adjacent natural frames. A lower noise amplitude sounds more natural and acceptable, but it can also sound more often like an inappropriately inserted stop. Fig. 18(g) shows a sample spectrogram.

Inserting shaped noise is noticeably better than inserting white noise; it sounds more like a noisy signal with the correct spectrum. When smoothing is performed between concatenated speech segments, the use of closure may fail when the formants of the two segments are naturally too far apart. When noise is played intermittently with natural speech, the formants will typically be close enough together for the continuity effect to apply, but such situations do not always occur with concatenation. Still, many concatenative databases have segments selected such that their formants are nearly aligned. In summary, using shaped noise can provide perceptually effective spectral smoothing in some cases, but in other cases it can be very annoying. We have shown that the method is promising, especially for phonemes with frication where LP and WI techniques fail. Further research is warranted in determining the perceptually optimal spectral envelope for shaping the inserted noise.

4. Determining smoothing required

Rather than blindly applying the same spectral smoothing algorithm in the same manner at all concatenation points, we suggest that several methods may be necessary, since it is important to determine the proper type and amount of spectral smoothing required. Not all segment joints benefit from spectral smoothing, and no single smoothing algorithm performs best in all situations. Relevant factors – including phonemes, articulator positioning, and spectral perception – help determine the type, amount, and duration of modification required for smoothing. We have not emphasized F0 issues since standard methods such as PSOLA can modify the pitch to the desired frequency.

In this study we used knowledge from both speech science and signal processing to determine the smoothing required in different scenarios. We have compiled this data into a table (see Section 4.2 and Table 3) for use as an aid in smoothing. In addition, we provide further details on smoothing for several example phone pairs (see Section 4.3). Although we performed this analysis on phone segments, the same concepts are applicable to diphone concatenation.

One important issue of spectral smoothing is determining the circumstances under which the smoothing should be performed. If two segments have a sufficiently close spectral match, then the distortion introduced by smoothing techniques may sometimes outweigh the performance gain. On the other hand, spectral smoothing generally performs better on segments with similar spectral characteristics, and attempting to smooth very different spectral envelopes can yield poor results. Moreover, many smoothing techniques are inappropriate for use with unvoiced speech. The two pieces of data used in automatically determining whether smoothing is appropriate for a joint are (1) knowledge of the phonemes involved and (2) the ANBM score (see Section 2.3) for the joint.

Certain smoothing algorithms are better for certain phonemes. For example, LP techniques are not as successful for nasals and nasalizations because they employ all-pole models and thus do not reflect the anti-resonances.
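The following sketch shows how these two pieces of data might be combined into an automatic decision. The class pairs, algorithm names, and mean scores are lifted from Table 3; the function and dictionary names are ours, and a real system would cover all class pairs rather than the few shown.

    # Hypothetical decision sketch combining phoneme-class knowledge
    # with an ANBM threshold (Section 4.1 recommends the sample mean of
    # natural ANBM scores for the class pair as the threshold).
    RECOMMENDED = {            # (left class, right class) -> algorithm
        ("vowel", "vowel"): "lp",
        ("nasal", "vowel"): "lp",
        ("stop", "vowel"): "closure",
        ("fricative", "vowel"): "closure",
    }
    NATURAL_MEAN_ANBM = {      # illustrative values taken from Table 3
        ("vowel", "vowel"): 145,
        ("nasal", "vowel"): 215,
        ("stop", "vowel"): 191,
        ("fricative", "vowel"): 200,
    }

    def choose_smoothing(left_class, right_class, anbm_score):
        """Return None when the joint is already acceptably smooth,
        else the recommended smoothing algorithm for the class pair.
        Pairs not listed here default to 'already smooth' in this sketch."""
        key = (left_class, right_class)
        if anbm_score <= NATURAL_MEAN_ANBM.get(key, float("inf")):
            return None
        return RECOMMENDED.get(key, "closure")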


Another issue is determining the best time span over which to interpolate. The pitch will remain continuous if data is inserted in blocks equal to an integer number of pitch periods. Many of our experiments have used a pitch-synchronous synthesizer, and we have seen that inserting three to five interpolated periods generally works well. While we have considered experiments ranging between 0 and 10 inserted pitch periods, future studies should determine the optimal number of frames (i.e., pitch periods) of smoothing for specific phone-to-phone circumstances.
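In sample terms the rule is simple; this small helper (hypothetical names, assuming the local F0 is known) converts the recommended three-to-five-period insertion into a length in samples.

    def insertion_length(f0_hz, fs_hz, n_periods=4):
        """Samples in an interpolation region spanning an integer number
        of pitch periods; 3-5 periods generally worked well here."""
        return n_periods * int(round(fs_hz / f0_hz))

    # Example: a 120 Hz voice at fs = 8 kHz gives 4 * 67 = 268 samples,
    # roughly 33 ms.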

4.1. Method

To determine the desired spectral smoothing, we consider aspects of both speech science and signal processing. The articulation between phonemes gives an indication of the expected formant transitions. Analysis of ANBM scores for natural and concatenated phones indicates the approximate scores for perceptually smooth speech. Knowledge of the phonemes and smoothing algorithms leads to recommendations as to which algorithms should be used in various circumstances.

We propose using the ANBM to evaluate the perceived spectral smoothness between segments. In order to test the ANBM across phone boundaries in natural speech, we applied the measure to phone transitions from the TIMIT database (available from the Linguistic Data Consortium, http://www.ldc.upenn.edu/) resampled at 8 kHz. Using the phoneme labels supplied with TIMIT, the ANBM score was calculated for each phone-to-phone transition in the database. We recorded the measure scores and calculated relevant statistics such as the sample mean and the unbiased sample standard deviation. The resulting ANBM scores are used to assess phone-to-phone spectral transition information.
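The statistics themselves are conventional; a sketch of the bookkeeping (our names, assuming the per-transition ANBM scores have already been computed) would be:

    import numpy as np

    def transition_statistics(anbm_scores):
        """Summary statistics of the kind reported in Tables 2 and 3 for
        a collection of per-transition ANBM scores."""
        s = np.asarray(anbm_scores, dtype=float)
        return {"n": s.size,
                "mean": s.mean(),
                "std": s.std(ddof=1),   # unbiased sample standard deviation
                "min": s.min(),
                "max": s.max()}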

Table 2 shows the net ANBM results and statistics across all phone-to-phone transitions. It reports the sample mean and sample standard deviation for the entire data set. Measure analysis was performed on male speakers in the training portion of the database only. Fig. 12 shows a histogram of the resulting ANBM scores at phone transitions for the entire database.

Analyses with the ANBM have shown that measure scores vary according to several factors: naturalness, the specific phonemes concatenated, the speaker, and the dialect. Subjectively, different speakers yield different amounts of smoothness when their phonemes are concatenated, and statistical analysis of ANBM scores confirms this observation. Calculating the probability that the population mean difference d is at least 0, P(d ≥ 0) (Berry, 1996), indicates that several of the dialect regions have scores which are statistically significantly different from each other even at the 99% probability interval. Although there is no single threshold which is equally suited to all speakers, we have included data from speakers of different American English dialect regions in determining approximate thresholds for general use.

Fig. 13 shows the sample mean and sample standard deviation for those phoneme-pair junctions with 400 or more examples in the database. The first entry shows the sample mean and sample standard deviation for the overall data set, and individual transition results follow. This figure shows that different phoneme pairs produce shifted expected measure score ranges.

In order to use ANBM scores to determine whether smoothing is necessary, we have used a probe listener test to connect ANBM scores with subjective evaluations. For each possible phoneme class junction, we generated approximately 100 concatenated examples from TIMIT. We then used a nine-value subjective opinion score to measure the perceived spectral smoothness of both these concatenated examples and a small set of natural examples. Based on the subjective scores, we suggest using an ANBM threshold to determine


Table 2
Net ANBM results for TIMIT

Number of speakers             326
Number of phone transitions    124,193
Sample mean                    222.24
Sample standard deviation      100.27
Maximum score                  732
Minimum score                  6


when a concatenated joint is acceptably smooth. Joints with ANBM scores below the threshold are considered acceptable, while scores above the threshold indicate a need for spectral smoothing. We compared ANBM scores and opinion scores to establish relative correlation measures. The correlation coefficients for different phoneme class pairs varied, with better values for more strongly voiced data: the correlation coefficient for combinations of vowels (VL) is 0.21; for combinations of VL and/or diphthongs (DT) it is 0.17; and for VL, DT, and semi-vowels (LG; liquids and glides) it is 0.09.

Based on our evaluations and observations, we recommend a threshold equal to the sample mean of the natural ANBM scores for a given phoneme class pair. This is not a clear-cut threshold, and raising or lowering it changes the fraction of joints that are smoothed. Fig. 14 shows ROC curves that illustrate the trade-off between the probabilities of detection (Pd) and false alarm (Pf) for joints that need smoothing, for different sets of phoneme classes. Curves are shown for junctions involving three sets of phoneme classes: vowels and diphthongs; vowels, diphthongs, and semi-vowels; and all phoneme classes. Detection here means using the ANBM to find joints that listeners subjectively marked in the "poor" to "good" range, with "very good" and "excellent" joints accepted as already sufficiently smooth.
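Such curves can be traced by sweeping a threshold over the observed scores. The sketch below is our reconstruction of that bookkeeping, not the authors' code; it assumes each joint carries an ANBM score and a boolean listener verdict, and that both verdict classes occur in the data.

    import numpy as np

    def roc_points(anbm_scores, needs_smoothing):
        """Trace Fig. 14-style ROC points: at each candidate threshold,
        count how many joints listeners marked as needing smoothing are
        flagged (detections) versus how many acceptable joints are
        flagged (false alarms)."""
        scores = np.asarray(anbm_scores, float)
        truth = np.asarray(needs_smoothing, bool)   # from listener opinions
        pd, pf = [], []
        for thr in np.sort(scores):
            flagged = scores >= thr
            pd.append(np.mean(flagged[truth]))       # P(flag | needs it)
            pf.append(np.mean(flagged[~truth]))      # P(flag | acceptable)
        return np.array(pf), np.array(pd)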

In some practical situations, time constraints may prevent the use of spectral smoothing on all segment joints. It is suggested that the ANBM score be used to rank-order the perceived smoothness of joints. The ANBM score will list the joints in the same general order that a human listener would rank the perceived smoothness. The spectral smoothing system can then devote more processing resources to smoothing only the top X% of the joints, where X is smaller for larger-database synthesizers. There is a natural balance between resulting speech quality and segment synthesis database size.

Fig. 12. Histogram of ANBM scores for TIMIT.

Fig. 13. Sample mean and standard deviation of ANBM score for overall data set and for different phoneme pair junctions.


It is therefore fair to say that for small-database systems, larger amounts of segment smoothing will be necessary. It also stands to reason that very large-database systems may require little or no spectral smoothing: as the database size increases, the fraction of segment joints that require spectral smoothing decreases.
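A minimal sketch of the rank-and-budget idea (hypothetical names; a real synthesizer would key joints by their utterance position):

    def joints_to_smooth(joint_scores, fraction):
        """Rank joints by ANBM score (higher = perceptually rougher) and
        return the top `fraction` for smoothing; `fraction` shrinks as
        the synthesis database grows."""
        ranked = sorted(joint_scores, key=joint_scores.get, reverse=True)
        return ranked[:max(1, round(len(ranked) * fraction))]

    # e.g. joints_to_smooth({"j1": 310, "j2": 140, "j3": 255}, 0.33)
    # returns ["j1"], the roughest third of the joints.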

As demonstrated by the measurements we have reported, the ANBM score should not be taken as an absolute scale for determining smoothness. Instead, the score should be considered in the context of typical scores for the given phoneme transition and the given speaker. The relative score gives a better measurement of the perceived continuity.

The just-noticeable difference (JND), or difference limen (DL), of formant frequency, bandwidth, and intensity should also be considered in spectral smoothing. The JND for formant frequencies is 3–14% of the formant frequency value. The formant bandwidth JND is about 20–40%. For formant amplitudes, typical JNDs are approximately 1.5 dB for F1 and 3 dB for F2 (Flanagan, 1972; O'Shaughnessy, 1990; Rabiner and Juang, 1993). The precise JND depends upon whether the speech is natural or steady-state and whether one or more formants are changing simultaneously.
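These published JND ranges suggest a simple perceptual gate. The sketch below is our illustration, not part of the paper's method; it uses the 3–14% frequency JND as bracketing thresholds and assumes matched, nonzero formant estimates on either side of a joint.

    def exceeds_formant_jnd(f_a, f_b, jnd_fraction=(0.03, 0.14)):
        """Flag formant-frequency jumps at a joint relative to the JND,
        taken as 3-14% of the formant frequency.  f_a and f_b are matched
        lists of formant frequencies (Hz) from the two sides of the joint."""
        lo, hi = jnd_fraction
        verdicts = []
        for fa, fb in zip(f_a, f_b):
            rel = abs(fa - fb) / fa
            verdicts.append("audible" if rel > hi else
                            "maybe" if rel > lo else "inaudible")
        return verdicts

    # e.g. for /iy/ -> /aa/ nominal formants:
    # exceeds_formant_jnd([270, 2290, 3010], [730, 1090, 2440])
    # -> ["audible", "audible", "audible"]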

4.2. Recommendations table

Our experience with concatenating and smoothing various phonemes allows us to make recommendations as to the spectral smoothing algorithms which perform best in each case according to the classes of phonemes joined. The natural smoothness and expected amount of smoothing also follow from the given phoneme classes. Although each specific phoneme pair varies slightly, practical space limitations force us to list results by phoneme class.

Table 3 shows our recommendations for spectral smoothing according to phoneme class. We show results only for those phoneme class pairs that have at least 100 natural examples within TIMIT.

Fig. 14. ROC curves for detection of "very good" and "excellent" joints with the ANBM score for different phoneme class sets. Pd = probability of detection; Pf = probability of false alarm; VL = vowels; DT = diphthongs; LG = semi-vowels; the marked points show results for a threshold at the mean ANBM score.


Although some phonemes within each class (and certainly individual phones) may have different results, it is reasonable to generalize to phoneme classes. These recommendations are derived not only from our objective experiments (see Section 5) but also from our subjective experience as reflected in the specific examples shown in Section 4.3.

Table 3
Recommendations on spectral smoothing by phoneme class (a)

Phoneme pair              Natural ANBM score              Smoothing
                          Min.   Mean   Max.   S.D.       Alg.      Amount
Stop → stop                46    308    662    108        Closure   Large
Stop → nasal               43    253    610    107        Closure   Large
Stop → fricative           38    211    604     98        Closure   Large
Stop → semi-vowel          39    220    673     93        Closure   Large
Stop → whisper             48    181    395     72        Closure   Small
Stop → vowel               24    191    609     78        Closure   Large
Stop → diphthong           46    200    529     88        Closure   Large
Stop → affricate           54    249    565     98        Closure   Small
Nasal → stop               41    258    642    109        Closure   Small
Nasal → nasal              23    181    383     87        LP        Large
Nasal → fricative          36    228    527     90        Closure   Small
Nasal → semi-vowel         18    193    514     95        LP        Large
Nasal → whisper            40    220    510    100        Closure   Small
Nasal → vowel              16    215    604     90        LP        Large
Nasal → diphthong          37    233    531     90        LP        Large
Fricative → stop           52    233    560     81        Closure   Small
Fricative → nasal          64    221    503     80        Closure   Large
Fricative → fricative      48    179    510     73        Closure   Large
Fricative → semi-vowel     56    204    453     77        Closure   Small
Fricative → whisper        63    166    389     63        Closure   Small
Fricative → vowel          30    200    546     76        Closure   Large
Fricative → diphthong      51    209    537     78        Closure   Large
Semi-vowel → stop          51    283    662    109        Closure   Large
Semi-vowel → nasal         44    210    515     98        LP        Small
Semi-vowel → fricative     51    224    567     84        Closure   Small
Semi-vowel → semi-vowel    32    181    546     81        LP        Large
Semi-vowel → vowel         18    178    600     80        LP        Large
Semi-vowel → diphthong     27    194    515     69        LP        Large
Whisper → vowel            45    177    534     83        Closure   Small
Whisper → diphthong        58    179    441     76        Closure   Small
Affricate → stop           88    246    431     75        Closure   Large
Affricate → vowel          63    175    533     55        Closure   Small
Vowel → stop               35    260    661    103        Closure   Small
Vowel → nasal              22    216    647     92        LP        Large
Vowel → fricative          33    198    565     75        Closure   Small
Vowel → semi-vowel          9    172    596     81        LP        Large
Vowel → whisper            40    164    449     67        Closure   Small
Vowel → vowel               6    145    538     66        LP        Large
Vowel → diphthong          32    143    368     68        LP        Large
Diphthong → stop           44    243    608    101        Closure   Small
Diphthong → nasal          37    211    562     90        LP        Large
Diphthong → fricative      47    181    477     71        Closure   Small
Diphthong → semi-vowel     35    182    414     71        LP        Large
Diphthong → vowel          20    128    427     60        LP        Large

(a) ANBM scores are shown for natural joints, and suggestions are given for the algorithm and amount of smoothing to use.


For each phoneme class pair in the table, we show relevant statistics of the ANBM score calculated for naturally occurring examples of the joint from TIMIT. We show the range (minimum and maximum), sample mean, and sample standard deviation. These values can be useful in determining the relative smoothness of concatenated joints as well as in establishing phone-to-phone class thresholds for directing an appropriate smoothing method.

In addition, we make recommendations as to the type of smoothing appropriate for each phoneme class pair. Of the various algorithms considered in this study, both LP techniques (see Section 3.3) and closure (the continuity effect; see Section 3.4) give results that are broadly useful. Optimal coupling can be used to supplement either algorithm if desired.

Although it is difficult to give a quantitative representation of the extent of smoothing necessary for all situations in a category, we do give an indication of how much smoothing is typically needed for each class joint (i.e., large versus small levels of smoothing). The amount of smoothing specified indicates not only the typical perceived spectral distance between phone segments but also the relative amount of speech data that should be inserted in an interpolation region. When formants lie near each other between segments, the amount of smoothing is typically small, whereas large discontinuities require a larger amount of smoothing. The amount and duration of smoothing are related, since a longer duration is typically needed to smooth a larger discontinuity, but they are not necessarily the same. For example, despite the large discontinuity in formant position between a nasal and a vowel, a short smoothing duration is appropriate since the spectral change results from the rapid motion of the velum.

4.3. Specific examples

This section considers several specific examples of smoothing based on the previously described table and smoothing algorithms. We examine three phoneme-to-phoneme combinations and consider each smoothing algorithm to conclude which approach is most effective for each scenario. These three scenarios are considered to be examples rather than general categories.

For each example phoneme pair, we examine four or five sample cases from different speakers – three male and one female – selected arbitrarily from TIMIT. We extracted one sample phone for each phoneme for each speaker, and we concatenated pairs to make continuous speech. We then applied the previously described spectral smoothing algorithms and examined the results. We note that in the following examples, a frame consists of one pitch period of speech data.

Table 4 shows the compiled data for these specific examples. Fig. 15 shows a histogram of the resulting ANBM scores from TIMIT for each of the three phone-transition examples described below.

4.3.1. Vowel-to-vowel joint: /iy/–/aa/

As an example of a vowel-to-vowel transition, we considered /iy/ to /aa/. The phoneme /iy/ is a high front vowel, while /aa/ is a low back vowel. In this transition, articulation is due primarily to the tongue. In /iy/, formants are expected to lie at 270, 2290, and 3010 Hz, and in /aa/, they should lie near 730, 1090, and 2440 Hz. Thus, the first three formants

Table 4
Specific examples smoothing table

Phoneme pair   Natural ANBM score    Articulation               Smoothing
               Min.   Mean   Max.    Movement   Formants        Alg.      Amount   Duration
/m/ → /iy/      42    232    463     Lips       F2              LP        Large    23 ms
/d/ → /ah/     110    214    448     Tongue     F1, F2          Closure   Large    38 ms
/iy/ → /aa/     50    144    251     Tongue     F1, F2, F3      LP        Small    30 ms


are expected to be significantly offset between segments.

Of the smoothing algorithms under consideration, optimal coupling has the least noticeable effect on this phoneme pair. While the measure score improves with coupling, there is minimal perceived difference after coupling and minimal visual difference in the spectrograms.

Waveform interpolation provides a small amount of smoothing. Although formant locations do not actually shift in the interpolated region with WI, the transition sounds smoother for three of the four speakers. With a large number of WI frames the result sounds noisy; the best results come when around three or four frames are used.

In comparison, LP pole shifting does yield actual formant movement as desired and thus gives better-sounding smoothing. Results vary according to the order of LP analysis, with no one order working best in all cases. The pole-matching problem arises as previously mentioned, and poor matching can yield poor results. Four or five frames of smoothing typically work best, though only one or two frames were appropriate in the one sample where the formants were nearly aligned naturally.

LSF interpolation also moved the formants properly (in three of the four samples) and yielded audibly acceptable results. The optimal interpolation duration varies for each sample from two to seven frames. In two cases, long interpolation regions yielded noisy speech.

The continuity effect can yield feasible results for /iy/–/aa/, but only one of the tested cases gave good results. Generally four to six frames of shaped noise worked best, though the best case – in which the formants were naturally nearly aligned – had the best results for one to two frames. In two cases, the noisy region gave the false illusion of the presence of an extra phoneme.

Thus, we recommend using one of the LP interpolation methods for spectral smoothing of the /iy/–/aa/ transition. A typical good interpolation period is equal to about four pitch periods, or around 30 ms for a typical male speaker. While up to three formants may have to move a fair distance, the LP algorithms can provide appropriate smoothing.

4.3.2. Stop-to-vowel joint: /d/–/ah/

The /d/ to /ah/ transition is an example of a stop-to-vowel phoneme pair. The phoneme /d/ is a voiced alveolar oral stop, while /ah/ is a mid vowel. In this transition, the articulation is primarily with the tongue. Based on the phonemes involved, we expect F1 to rise in frequency in /d/, since constriction of the front of the oral cavity lowers F1. F2 and F3 should have a slight fall in frequency in /d/. In /ah/, formants are expected to lie at 640, 1190, and 2390 Hz.

Optimal coupling gave very little benefit to the smoothness of this joint. In four of the five cases we examined, there was no change at all from coupling, and the fifth case gave only minimal change with no perceived improvement in quality.

Fig. 15. Histogram of ANBM scores for examples.


Applying waveform interpolation to /d/–/ah/ tends to overly smooth the stop consonant. The /d/ sound can become indistinct or change to be perceived as /b/. Thus, WI should not be applied to this phoneme pair.

The LP interpolation methods can produce appropriate formant transitions as visible on spectrograms. The audible quality, however, is arguably worse than without interpolation. After interpolation, the /d/ phone can lose some of its stop qualities or sound more like /b/.

The continuity effect gives good results for this phoneme pair. In two cases the results are reasonable, while for two others it produced sufficient improvement to make a previously poor concatenation sound like a perfect match.

Therefore, of all the methods examined, the results using the continuity effect clearly outshine the others. A typical good interpolation period is equal to about five pitch periods, or around 38 ms for a typical male speaker. While the stop consonant loses its character under most forms of spectral smoothing, the presence of shaped noise can noticeably increase perceived smoothness and quality.

4.3.3. Nasal-to-vowel joint: /m/–/iy/

As a sample nasal-to-vowel pair, we examined /m/ to /iy/. The phoneme /m/ is a voiced bilabial nasal stop, while /iy/ is a high front vowel. In this transition, the articulation is primarily with the lips, though the tongue also plays some role. In /m/, we expect to find F1 near 250 Hz and F3 near 2200 Hz; F2 should be weak, and an antiresonance should lie around 750–1250 Hz. In /iy/, formants are generally located around 270, 2290, and 3010 Hz. Thus, formants are not likely to be too far apart between segments.

With several of the sample cases we examined, the formant intensity was so much stronger in the /iy/ phone than in the /m/ that it is difficult to determine spectral smoothness via visual inspection of spectrograms. Still, the most important judgement is auditory perception of smoothness.

Applying optimal coupling to this phoneme pair gives only slight improvement. Though the coupling algorithm yields larger shifts in phone boundary positions for this pair than for the other phoneme pairs examined in detail, the results were only marginally better in formant matching and perceived quality. In comparison, waveform interpolation also yields only slight improvement for this phoneme pair. None of the test cases gave lower performance with WI, but results ranged from no noticeable change to slight perceived improvement. The transition was sometimes smoother as formants faded in and out.

Both LP interpolation algorithms gave some smoothness improvement. An interpolation period which is too long yields a buzzy, scratchy quality, but the results otherwise generally sound smoother and more natural. The change in formant location and bandwidth is also noticeably smoother. LSF interpolation yielded slightly better performance than pole interpolation.

Using noise with the continuity effect does not give much smoothing for this junction. In three of the four cases, using shaped noise resulted in an unacceptable "hollow" sound. Using white noise was feasible in some cases but at times was worse than raw concatenation. Inserting too much noise could result in the /m/ being perceived as frication. In general, the continuity effect results were poor.

Therefore, the LP algorithms – LSF interpolation in particular – give the best spectral smoothing performance for /m/–/iy/. We found that a typical good interpolation period is around three pitch periods, or 23 ms for a typical male speaker. Incorporating optimal coupling as well can yield further improvement, but LSF interpolation provides the most useful impact.

5. Results and evaluations

The net results of the examined algorithms showed improvement over standard techniques applied to small databases. The final speech is more natural-sounding than direct concatenation of selected units with no spectral processing. Still, even the best smoothing algorithms sometimes yield poor results at certain joints. Thus, blind application of a smoothing algorithm to all segment joints can result in speech that is of worse net quality than


direct concatenation, but proper use of smoothing can noticeably increase the quality.

Table 5 summarizes the four major approaches considered in this study. To evaluate these spectral smoothing algorithms, we performed comparisons based on a perceptual listener test and an objective quality measure. In addition, we show and describe a set of sample spectrograms that compare the algorithms' performance. We combine these results with our own subjective observations from this study to draw conclusions about the algorithms' effectiveness.

5.1. Data tested

For these evaluations, we chose to use two different speech databases. Our informal tests and evaluations were primarily based on the TIMIT database, where each speaker provides only ten phonetically balanced sentences with approximately 400 phones. The phoneme labels for TIMIT include 60 distinct allophones.

In comparison, for our formal tests we used data collected specifically for this research study. For each speaker, the new database includes 35 phonetically balanced sentences and 114 words, for a total of approximately 2300 phones. The corpus includes 34 continuously read sentences, of which 18 are adapted from the TIMIT database and 16 are adapted from the Brown text corpus. There are also a large number of monosyllabic, isolated words: 19 words are read three times each, and 57 words are read once each. Finally, there are 10 s of continuous, spontaneous speech. This database was phoneme-labeled with the RSPL speech time-aligner (Pellom and Hansen, 1998; Pellom, 1998), which models 46 different phoneme units.

Table 5
Summary of interpolation algorithms

Optimal coupling
  Summary:        Adjust segment boundaries to improve spectral match
  Advantages:     Does not modify actual signal
  Disadvantages:  Limited benefit gained
  Results:        Most consistent improvement in quality
  Evaluation:     Better than nothing but not sufficient
  Recommendation: Useful as is

Waveform interpolation
  Summary:        Interpolate between two pitch periods
  Advantages:     Simple
  Disadvantages:  Does not consider formant locations
  Results:        Occasionally yields good results
  Evaluation:     Not good by itself but useful on LP residual
  Recommendation: Useful on LP residual

LP interpolation
  Summary:        Interpolate between linear predictive parameters
  Advantages:     Averages formants when parameters match formants
  Disadvantages:  Poor matching of parameters will give poor results
  Results:        Performance varies from good in many cases to poor
  Evaluation:     Often quite good; warrants more work
  Recommendation: Useful as is; warrants more research

Closure (continuity effect)
  Summary:        Insert noise shaped to match desired spectral envelope
  Advantages:     Spectrally matches surrounding signal
  Disadvantages:  Still possesses noisy quality
  Results:        Offers improvement primarily for transitions with noise-like sounds
  Evaluation:     Holds potential, but not good enough yet
  Recommendation: Warrants more research


For the comparison of smoothing algorithms, 14 sample words were created by constructing new words from the collected database. The first half of one word was matched with the second half of another word. Pitch periods were manually aligned between segments, and the segments were chosen so that the pitch varied by no more than 2.5 Hz at each joint. The original prosody of each segment was left unaltered. Spectrally smoothed frames were inserted via overlap-add.
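The overlap-add splice can be pictured with a short sketch. This is an illustrative reconstruction under our own assumptions (equal-length interpolated pitch periods, Hann windowing, a fixed hop), not the authors' implementation.

    import numpy as np

    def insert_frames_ola(left, right, frames, hop):
        """Splice interpolated frames between two segments by overlap-add.
        `frames` holds equal-length interpolated pitch periods; `hop` is
        the frame spacing in samples (typically about half a frame)."""
        frame_len = len(frames[0])
        window = np.hanning(frame_len)
        bridge = np.zeros(hop * (len(frames) - 1) + frame_len)
        for i, frame in enumerate(frames):
            bridge[i * hop:i * hop + frame_len] += window * np.asarray(frame, float)
        return np.concatenate([np.asarray(left, float), bridge,
                               np.asarray(right, float)])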

The concatenated words and their source contexts are listed in Table 6. We chose to test segment joints that are common in English. Our work also places an emphasis on voiced data, and therefore all combinations included transitions to or from vowels. The segment joints under test fall within seven of the nine most frequent phoneme-class transitions as measured from TIMIT (Pellom and Hansen, 1998): vowel → stop, semi-vowel (liquid/glide) → vowel, stop → vowel, fricative → vowel, vowel → nasal, vowel → semi-vowel, and nasal → vowel.

5.2. Listener test

A number of informal listener tests were performed – both subjective and objective – in the various stages of examining the spectral smoothing algorithms. The results of several of these tests were mentioned or incorporated into the preceding text. Here we report only the final, formal listener test results.

In order to reach a large number of potential listeners, the listener test was implemented across the World Wide Web (WWW). Both the instructions and the test itself were placed on the WWW, with speech available in multiple audio file formats to accommodate listeners on a variety of computers. Although direct control of the exact listening environment was not possible, we did make recommendations and asked listeners to report on the equipment used in performing the evaluation.

The test included several variations on 14 different words concatenated from segments in the database (see Section 5.1). Algorithm variations included the natural speech, raw concatenation without smoothing, optimal coupling, waveform interpolation, LP pole shifting, LSF interpolation, and shaped noise (continuity effect). In a brief training phase, listeners were presented with anchor speech signals to exemplify the high and low judgement categories. Listeners were asked to make a category judgement and give an opinion score for each word under test, with nine ratings on a 1.0–5.0 scale (Deller et al., 2000; Quackenbush et al., 1988).

A total of 33 listeners with no history of hearing problems performed this mean opinion score (MOS) evaluation. Using a five-point MOS scale, with half-step ratings allowed, each algorithm

Table 6
Words used for listener test (a)

Word   Phonemes      Source context
Bear   /b//eh/r/     /b/oa/t/, /hh/eh/r/
Dog    /d//ao/g/     /d/eh/r/, /w/ao/g/
Fear   /f//iy/r/     /f/ay/r/, /w/iy/r/
Hair   /hh/eh//r/    /hh/eh/d/, /h/ay/r/
Here   /hh//ih/r/    /hh/aa/d/, /m/ih/r/
Hide   /h/ay//d/     /h/ay/r/, /hh/aa/r/d/
Make   /m/ey//k/     /m/ey/n/, /w/ow/k/
Mat    /m//ae/t/     /m/ao/n/, /b/ae/t/
Moon   /m/uw//n/     /m/uw/n/, /m/aa/n/
Nut    /n//ah/t/     /n/aw/, /b/ah/t/
Wait   /w//ey/t/     /w/eh/g/, /b/ey/t/
Wine   /w/ay//n/     /w/ay/k/, /m/ow/n/
Wire   /w/ay//r/     /w/ay/k/, /hh/ih/r/
Wood   /w//uh/d/     /w/iy/k/, /hh/uh/d/

(a) Phones were extracted from two source words and combined at a single concatenation point to synthesize a new word. Phoneme labels are from the TIMIT set. In the given phonemic spellings, "//" indicates the point of concatenation between phones.


received at least one vote across all nine possible grades. This result points to the wide range of listener preference for all methods considered (e.g., from raw concatenation to each of the smoothing methods). Table 7 shows the mean opinion scores for each of the algorithms tested.

Fig. 16 plots these same MOS results. For each spectral smoothing technique, the mean opinion score is shown both for the entire test set and for each phoneme class. Each transition tested includes a vowel and a consonant (NA = nasal, LG = semi-vowel, ST = stop, or FR = fricative). The vertical bar indicates the sample standard deviation (SD) of the overall opinion scores, with tick marks at 0.5 and 1.0 SD.

Note that the optimal coupling scores given here are only for those concatenated words for which coupling did change the final speech signal. In 8 of the 14 words of the test, the point of concatenation was the same for the raw concatenation and the optimally coupled forms. Although including the unmodified data for these eight words with the coupled data decreases the MOS results for coupling, it still leaves coupling with a higher rating than raw concatenation without any smoothing.

Table 7
MOS results from listener test (a)

Algorithm                MOS     Better    Worse
Natural speech           4.13    N/A       N/A
Raw concatenation        3.53    N/A       N/A
Optimal coupling         3.82    77.0%     23.0%
Waveform interpolation   2.80    40.1%     59.9%
Pole shifting            2.69    38.4%     61.6%
LSF interpolation        3.14    39.4%     60.6%
Shaped noise (closure)   2.43    20.1%     79.9%

(a) "Better" and "Worse" give the percent of tokens rated better or worse than raw concatenation.

Fig. 16. Mean opinion scores from listener test. Evaluations are of transitions between vowels and specified phone classes.


The mean scores for each category indicate the general performance, but analysis was also performed on the relative scoring of each algorithm for each of the evaluated words. Table 7 also reports the fraction of cases in which each smoothing algorithm was rated better or worse than the same word with no smoothing. With the exception of optimal coupling, the opinion scores were generally lower after smoothing was performed. While the focus here was on the relative performance across the different smoothing methods, it is possible that the absolute MOS scores could change for a larger number of tested segment joints. Still, these results are in line with observations that smoothing can at times be useful or harmful depending upon the phoneme circumstances. Optimal coupling was the only algorithm which consistently performed well enough to receive the recommendation of general use without consideration of the situation.

5.3. Objective scoring

The ANBM (see Section 2.3) was used to provide an objective measure for assessing segment discontinuity. ANBM scores were calculated for 40 concatenated words smoothed with each of the techniques under evaluation. This word set includes all the examples in the listener test (see Section 5.2) in addition to other concatenated words. For each concatenated word, the ANBM score was obtained at multiple points around each joint, with a comparison made of the maximum ANBM scores.

Fig. 17 reports the ANBM measure scores in the same format as the MOS scores in Fig. 16 (see Section 5.2). The large standard deviations for several of the algorithms reflect how the results vary widely, from measurably improved to mildly degraded relative to pure concatenation. Note that the ANBM scores do vary by phoneme junction, and the standard deviation marks help indicate how each phoneme class performs for different algorithms compared with the overall test set. The white noise results are reported for comparison with shaped noise.

5.4. Spectrogram comparison

Fig. 18 shows one example spectrogram for each smoothing algorithm. The phrase "carry an oily rag" from TIMIT is used for these spectrograms.

Fig. 17. ANBM scores for smoothed joints. Examples are from transitions between vowels and specified phone classes.


The naturally produced speech in Fig. 18(a) demonstrates that most formants have smooth transitions between phones, yet some segment joints have rougher formant transitions.

Each subsequent spectrogram in the figure shows synthesized speech produced by concatenating phones, with the same phone segments used in each example. The ANBM was used as the concatenation cost in selecting phone segments, to attempt to find good spectral alignment in segments from the database. Note that the formants tend to be smoother and more continuous with several of the smoothing techniques, especially with LP pole shifting.

Fig. 18(b) does not include any spectral smoothing; although several joints have only small formant movement, others (e.g., at 0.62 s) have large and unnatural jumps in formant position. In Fig. 18(c) the /oy/ phone (0.4–0.6 s) and the /l/ phone (0.6–0.7 s) are clearly different than in the other examples shown; one result is a smoother formant transition at 0.6 s. In Fig. 18(d), several segments are noticeably different at the joints due to waveform interpolation; in the example at 0.6 s, the widely separated formants move towards each other compared with (b) but still show a rough transition. In the LP pole shifting example shown in Fig. 18(e), good spectral transitions are present for segments in the time region 0.1–0.4 s; however, poor smoothing is present near 0.6 s. In Fig. 18(f), LSF interpolation not only provides good smoothing for the time region 0.1–0.4 s but also distinctly improves the formant transitions near 0.6 s. It is important to note that processing via the continuity effect (as shown in Fig. 18(g)) is perceptually motivated, and as such the standard spectrogram will not display the perceived level of spectral smoothness.

5.5. Discussion

The evaluations presented here show that spectral smoothing can at times improve the quality of speech yet at other times can degrade it further. When smoothing succeeds, it can noticeably increase the continuity of speech. In some scenarios it is better to perform no processing at all.

Fig. 18. Spectrograms of the phrase "carry an oily rag": (a) naturally produced and (b)–(g) concatenated speech. Solid vertical lines mark actual phone segment boundaries.


Moreover, which algorithm (if any) is best to use depends upon the circumstances.

Successful spectral smoothing can reduce the disfluency of speech. Smoothing seems to affect naturalness more than intelligibility. The range of results from these evaluations does not imply that smoothing as a whole is not worthwhile; instead, it indicates that there is no single solution that can properly smooth all spectral discontinuities.

Thus, indiscriminate use of spectral smoothing is a poor choice because the results can introduce further discontinuities. With current techniques, it would generally be best to manually inspect each joint after smoothing, but such user-assisted labor is typically impractical. The results from MOS listener evaluations and ANBM scores show that no smoothing method is clearly superior and that more effective methods are necessary. We recommend use of an existing automated quality-checking procedure such as rating with the ANBM.

The evaluations presented here have emphasized applications in concatenative synthesis with a limited data set of phonemes. Many of the concepts described herein also apply to synthesis with diphone sets and large corpora, but the emphasis shifts in such situations. When spectral smoothing is used in speech and audio coding, the situation differs because the speech was originally spectrally continuous. For coding applications, smoothing is typically simpler and does not encounter some of the previously mentioned problems that accompany concatenative synthesis. For example, LP parameter matching has a higher success rate in coding applications, and the duration of smoothing is sometimes shorter. Moreover, the quality of the resulting smoothed speech is generally higher for coding.

6. Conclusions

In this study, we have focused on a comparison of four algorithms for spectral smoothing of concatenated speech. The algorithms considered include three major existing techniques for smoothing – optimal coupling, waveform interpolation, and LP interpolation – and one technique (application of the continuity effect) which was considered for spectral smoothing for the first time. In addition to performing extensive informal comparisons of the algorithms, we have reported results from a formal listening test and scoring with an auditory-based objective measure. These evaluations were performed in the context of a phoneme concatenation synthesizer with a small data set. The net result is that no method is clearly superior (see Table 5) and no single algorithm performs best in all phone joint circumstances. Application of smoothing methods to many of the segment joints demonstrates noticeable improvements over direct concatenation, while other joints are of noticeably worse quality after applying a smoothing algorithm. This study has shown that most segment-based smoothing methods are not universally successful for all segment joints and that the use of an objective measure of segment joint quality is necessary to direct more effective smoothing.

Although synthesis systems with smoothing typically apply a single algorithm indiscriminately, we recommend using a smart system. Knowledge of the phonemes involved in each joint enables selection of an appropriate smoothing algorithm. Scoring with an objective measure such as the ANBM enables automated evaluation of whether the smoothing has improved or degraded the perceived quality of each transition.

While using such a smart system enables improvements over current techniques, there is still room for enhancement in spectral smoothing. For example, LP pole shifting could greatly benefit from the derivation of a better distance measure for matching poles in the z-plane across frames. In addition, the proposed use of the continuity effect is novel and promising enough that there may be ways to improve it that we have not yet considered.
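As one hypothetical direction for such a distance measure – our illustration, not a measure proposed in this paper – a pole-matching cost could trade off the mismatch in pole angle (formant frequency) against the mismatch in pole radius (formant bandwidth):

    import numpy as np

    def pole_match_cost(p, q, bandwidth_weight=1.0):
        """Hypothetical cost for matching LP poles across frames: combine
        the difference in pole angle (formant frequency) with a weighted
        penalty on the difference in log radius (formant bandwidth).
        p and q are complex upper-half-plane poles from the two frames."""
        d_freq = abs(np.angle(p) - np.angle(q))        # frequency mismatch
        d_band = abs(np.log(abs(p)) - np.log(abs(q)))  # bandwidth mismatch
        return d_freq + bandwidth_weight * d_band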

When spectral smoothing is appropriately applied, the final speech has smoother, more continuous formants and is often more natural-sounding than direct concatenation of segments without processing. These improvements benefit speech coding by enabling appropriate generation of intermediate data between transmitted frames. These smoothing algorithms also improve the results of concatenative speech synthesis with a limited database.


Acknowledgements

This study was supported in part by SPAWAR under grant no. N66001-92-D-0092. The authors thank M. Ostendorf (now at the University of Washington) for providing C code for the Carney auditory-nerve model during the 1993 DoD Workshop on Robust Speech Recognition, Rutgers University. The authors also thank Bryan Pellom and Robert Ross, also with the Robust Speech Processing Laboratory, for their assistance in collecting and labeling some of the speech corpora used in this study.

References

Atal, B.S., Cox, R.V., Kroon, P., 1989. Spectral quantization and interpolation for CELP coders. In: Proc. 1989 IEEE ICASSP, Glasgow, Scotland, Vol. 1, pp. 69–72.
Berry, D.A., 1996. Statistics: A Bayesian Perspective. Duxbury, Belmont, CA.
Breen, A.P., Jackson, P., 1998. A phonologically motivated method of selecting non-uniform units. In: Proc. 1998 ICSLP, Sydney, Australia.
Bregman, A.S., 1990. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, Cambridge, MA.
Carney, L.H., 1992. A model for the responses of low-frequency auditory-nerve fibers in cat. Journal of the Acoustical Society of America 93, 401–417.
Chappell, D.T., Hansen, J.H.L., 1997. An auditory-based measure for improved phone segment concatenation. In: Proc. 1997 IEEE ICASSP, Munich, Germany, Vol. III, pp. 1639–1642.
Coker, C.H., 1976. A model of articulatory dynamics and control. Proc. IEEE 64, 452–460.
Conkie, A.D., Isard, S., 1997. Optimal coupling of diphones. In: van Santen, J.P.H., Sproat, R.W., Olive, J.P., Hirschberg, J. (Eds.), Progress in Speech Synthesis. Springer, New York, Chapter 23, pp. 293–304.
Deller Jr., J.R., Hansen, J.H.L., Proakis, J.G., 2000. Discrete-Time Processing of Speech Signals. IEEE Press, New York.
Donovan, R.E., 1996. Trainable speech synthesis. Ph.D. thesis, Department of Engineering, Cambridge University.
Dutoit, T., 1994. High quality text-to-speech synthesis: a comparison of four candidate algorithms. In: Proc. 1994 IEEE ICASSP, Adelaide, South Australia, Vol. 1, pp. 565–568.
Dutoit, T., Leich, H., 1993. MBR-PSOLA: text-to-speech synthesis based on an MBE re-synthesis of the segments database. Speech Communication 13, 435–440.
Erkelens, J.S., Broersen, P.M.T., 1994. Analysis of spectral interpolation with weighting dependent on frame energy. In: Proc. 1994 IEEE ICASSP, Adelaide, South Australia, Vol. 1, pp. 481–484.
Fant, G., 1960. Acoustic Theory of Speech Production. Mouton, The Hague.
Flanagan, J.L., 1972. Speech Analysis, Synthesis and Perception, second ed. Springer, New York.
Goncharoff, V., Kaine-Krolak, M., 1995. Interpolation of LPC spectra via pole shifting. In: Proc. 1995 IEEE ICASSP, Detroit, MI, Vol. 1, pp. 780–783.
Hansen, J.H.L., Chappell, D.T., 1998. An auditory-based distortion measure with application to concatenative speech synthesis. IEEE Transactions on Speech and Audio Processing 6, 489–495.
Hirokawa, T., Hakoda, K., 1990. Segment selection and pitch modification for high quality speech synthesis using waveform segments. In: Proc. 1990 ICSLP, Kobe, Japan, Vol. 1, pp. 337–340.
Huang, X., Acero, A., Hon, H., Ju, Y., Liu, J., Meredith, S., Plumpe, M., 1997. Recent improvements on Microsoft's trainable text-to-speech system – Whistler. In: Proc. 1997 IEEE ICASSP, Munich, Germany, Vol. II, pp. 959–962.
Hunt, A.J., Black, A.W., 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In: Proc. 1996 IEEE ICASSP, Atlanta, GA, pp. 373–376.
Klabbers, E., Veldhuis, R., 1998. On the reduction of concatenation artifacts in diphone synthesis. In: Proc. 1998 ICSLP, Sydney, Australia, Vol. 5, pp. 1983–1986.
Kleijn, W.B., Haagen, J., 1995. Waveform interpolation for coding and synthesis. In: Kleijn, W.B., Paliwal, K.K. (Eds.), Speech Coding and Synthesis. Elsevier, Amsterdam, Chapter 5, pp. 175–207.
Kleijn, W.B., Shoham, Y., Sen, D., Hagen, R., 1996. A low-complexity waveform interpolation coder. In: Proc. 1996 IEEE ICASSP, Atlanta, GA, Vol. 1, pp. 212–215.
Ladefoged, P., 1975. A Course in Phonetics, third ed. Harcourt Brace, New York.
Ladefoged, P., 1981. Preliminaries to Linguistic Phonetics. University of Chicago, Chicago.
Liberman, M.C., 1982. The cochlear frequency map for the cat: labeling auditory-nerve fibers of known characteristic frequency. Journal of the Acoustical Society of America 72, 1441–1449.
Mizuno, H., Abe, M., 1995. Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectrum tilt. Speech Communication 16, 153–164.
Mizuno, H., Abe, M., Hirokawa, T., 1993. Waveform-based speech synthesis approach with a formant frequency modification. In: Proc. 1993 ICASSP, Vol. 2, pp. 195–198.
Moore, B.C.J., 1997. An Introduction to the Psychology of Hearing, fourth ed. Academic Press, New York.
Moulines, E., Charpentier, F., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9, 453–467.
Moulines, E., Laroche, J., 1995. Non-parametric techniques for pitch-scale and time-scale modification of speech. Speech Communication 16, 175–205.
O'Shaughnessy, D., 1990. Speech Communication: Human and Machine. Addison-Wesley, New York.
Paliwal, K.K., 1995. Interpolation properties of linear prediction parametric representations. In: Proc. EuroSpeech'95, Madrid, Vol. 2, pp. 1029–1032.
Paliwal, K.K., Kleijn, W.B., 1995. Quantization of LPC parameters. In: Kleijn, W.B., Paliwal, K.K. (Eds.), Speech Coding and Synthesis. Elsevier, Amsterdam, pp. 433–466.
Papamichalis, P.E., 1987. Practical Approaches to Speech Coding. Prentice-Hall, Englewood Cliffs, NJ.
Parthasarathy, S., Coker, C.H., 1992. On automatic estimation of articulatory parameters in a text-to-speech system. Computer Speech and Language 6, 37–75.
Pellom, B.L., 1998. Enhancement, segmentation, and synthesis of speech with application to robust speaker recognition. Ph.D. thesis, Robust Speech Processing Laboratory, Dept. of Electrical Engineering, Duke University.
Pellom, B.L., Hansen, J.H.L., 1998. Automatic segmentation of speech recorded in unknown noisy channel characteristics. Speech Communication 25, 97–116.
Pickett, J.M., 1980. The Sounds of Speech Communication: A Primer of Acoustic Phonetics and Speech Perception. University Park Press, Baltimore.
Plumpe, M., Acero, A., Hon, H., Huang, X., 1998. HMM-based smoothing for concatenative speech synthesis. In: Proc. 1998 ICSLP, Sydney, Australia, Vol. 6, pp. 2751–2754.
Quackenbush, S.R., Barnwell, T.P., Clements, M.A., 1988. Objective Measures of Speech Quality. Prentice-Hall, Englewood Cliffs, NJ.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.
Savic, M., Nam, I.-H., 1991. Voice personality transformation. Digital Signal Processing 1, 107–110.
Shadle, C.H., Atal, B.S., 1979. Speech synthesis by linear interpolation of spectral parameters between dyad boundaries. Journal of the Acoustical Society of America 66, 1325–1332.
Shiga, Y., Matsuura, H., Nitta, T., 1998. Segmental duration control based on an articulatory model. In: Proc. 1998 ICSLP, Sydney, Australia, Vol. 5, pp. 2035–2038.
Slaney, M., Covell, M., Lassiter, B., 1996. Automatic audio morphing. In: Proc. 1996 IEEE ICASSP, Atlanta, GA, pp. 1001–1004.
Slifka, J., Anderson, T.R., 1995. Speaker modification with LPC pole analysis. In: Proc. 1995 IEEE ICASSP, Vol. 1, pp. 644–647.
Snell, R.C., Milinazzo, F., 1993. Formant location from LPC analysis data. IEEE Transactions on Speech and Audio Processing 1, 129–134.
Stevens, K.N., House, A.S., 1955. Development of a quantitative description of vowel articulation. Journal of the Acoustical Society of America 27, 484–493.
Syrdal, A., Stylianou, Y., Garrison, L., Conkie, A., Schroeter, J., 1998. TD-PSOLA versus harmonic plus noise model in diphone based speech synthesis. In: Proc. 1998 IEEE ICASSP, Seattle, WA, Vol. 1, pp. 273–276.
Warren, R.M., 1970. Perceptual restoration of missing speech sounds. Science 167, 392–393.
Witten, I.H., 1982. Principles of Computer Speech. Academic Press, New York.
Zemlin, W.R., 1968. Speech and Hearing Science: Anatomy and Physiology. Prentice-Hall, Englewood Cliffs, NJ.
