[WSOLA]an Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High-quality Time-scale Modifications of Speech

Embed Size (px)

Citation preview

  • 8/6/2019 [WSOLA]an Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High-quality Time-scale Modifications

    1/4

    AN OVERLAP-ADD TECHNIQUE BASED ON WAVEFORM SIMILARITY (WSOLA)FOR HIGH QUALITY TIME-SCALE MODIFICATION OF SPEECH

    Werner VERHELSTand Marc ROELANDSVrije U niversiteit Brussel

    Faculty of Applied Science, dept. ETRODSSPPleinlaan 2, B-1050 Brussels

    Belgium

    ABSTRACTA concept of wavefonn similarity is proposed for tackling theproblem of time-scale modification of speech, and is worked-out in the context of short-time Fourier transfonnrepresentations.The resulting WSOLA algorithm produces high quality speechoutput, is algorithmically and computationally efficient androbust, and allows for on-line processing with arbitrary time-scaling factors that may be specified in a time-varying fashionand that can be chosen over a wide continuous range of values.

    I. INTRODUCTIONAlgorithms for high quality time-scale modification of speechare important for applications such as voice-mail and dictation-tape playback or post synchronization of film and video, wherethe potentiality of controlling the apparent speaking rate is adesirable feature. The problem with time-scaling a speechsignal x ( n ) lies in realising the specified time-warp function~ ( n )n such a way as to affect the apparent speaking rate only,preserving other perceived aspects such as timbre, voicequality, and pitch. W e will therefore consider in this paper thatthe ideal time-scaling algorithm should produce a syntheticwaveform y ( n ) that maintains maximal local similarity to theoriginal wavefonn x ( m ) in corresponding neighbourhoods ofrelated sample indices n = T ( m ) . This could be expressedmathernatically as

    Vm : y(n+T(m)) .w(n) (=) x(n+m).w(n) , (1)where w(n) is a windowing function, and the symbol e)sdefined to hold the rather vague meaning maximally similarto.Assuming that the relation of maximal similarity persists afterFourier transformation, and defining the short-time Fouriertransfonn X(w,m) of a sequence x ( m ) by

    +-

    X ( o , m ) = C x ( n + m ) . w ( n ) . e - ,n=-Pexpression (1) can be rewritten as

    Y ( o , ~ ( m ) )) ( o , m ) . (2 )

    If we choose the effective length of w ( n ) in (1) to span at leastone pitch period, we can expect that the important perceptualcharacteristics of the signal can remain fairly unaffected by thetime-scaling operation, provided that Y(o,m) can be specifiedin accordance with a suitable similarity measure.In general, fmding an operational definition for (=I in (2 )amounts to solving the time-scaling problem based oninanipulation of short-time Fourier transforms. Section 2 ofthis paper discusses some of the problems encountered and theapproach taken in algorithms of the overlap-add (OLA) andsynchronized overlap-add (SOLA ) traditions. Section 3introduces our WSOLA approach as a variant in which weexplicitly pursued the idea of m aking operational the intuitivenotion of maximal local wavefonn similarity. Beforeconcluding the paper, we indicate in section4 that WSOLAproduces a natural sounding output and is algorithmically andcomputationally efficient and robust, and allows for on-lineprocessing with arbitrary time-scaling factors that may bespecified in a time-varying fashion and can be chosen over awide continuous range of values.11. TIME-SCALE MODIFICATION BASED ON OLA-

    TECHNIQUESLet X (w , z (L , ) )epresent a down-sampled version of the short-time Fourier transform (STFT) of the input signal x ( n ) , an dassume we force (=) to represent strict equality in equation (2)by specifying the 2-dimensional function

    f (w ,L, ) = x ( w , . c - ( L , ) ) . (3 )It will be clear that, except for som e trivial cases such as w(n)E0 or T(m) m , there will not generally exist a solution forequation (1) with equality required. This implies that there willnot generally exist a signal j ( n ) that has Y ^ ( o . L , ) as a STFTor, equivalently, that Y^ ( W . L , ) is not valid as a STFT.This kind of problem is liable to occur whenever the intent is tocreate a 1-dim ensional signal by constructing a 2-di~ne nsionalSTFT-representation of it. As a possible solution, the overlap-add technique [ l ] proposes to synthesize a signal y(n) whoseSTFT Y(o&,) is as close as possible to the desired Y ^ ( w . L , ) inthe least-squares (LS) sense.

    11-5540-7803-0946-4193 3.00 0 993 IEEE

    Authorized licensed use limited to: National Tsing Hua University. Downloaded on June 3, 2009 at 07:55 from IEEE Xplore. Restrictions apply.

  • 8/6/2019 [WSOLA]an Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High-quality Time-scale Modifications

    2/4

    In case Y ^ ( w . L , ) is a time-warped STFT, as in (3). th ecorresponding synthesis equation becom es

    C v ( n - L , ) . x ( n + z - ( L , ) - L, )1

    where v ( n )=w(n) is a windowing function and the Lk epresentconsecutive window positions. The basic OLA-synthesisoperation (numerator of eq . 4) then co nsists of cutting-out inputsegments around analysis instants I?(&), and repositioningthem at corresponding synthesis instants Lkbefore adding themtogether to fonn the output signal. While it can be noted thatstraightforward application of the OLA procedure to the time-warped STFT Y ^ ( w , L , ) corresponds to interpreting asmaximally close in LS-sense in Y(w Lk) (3 X(W,TI(L,)) (adown-sampled version of eq. (2 ) ) . t would not be an appropriatechoice in this case. It was shown in [ 2 ] hat this is not due tothe OLA procedure itself, but rather to the choice that was madefo r I; ( w . L , ) . Indeed, it was expressed in eq. (3 ) that individualoutput segments should ideally correspond to input segmentsthat have been repositioned according to the desired time-warpfunction. As a result, the OLA procedure in eq. (4) doesreposition the individual input segment5 with respect to eachother (destroying original phase relationships in the process)and constructs the output signal by interpolating between thesemisaligned segments. The resulting distortions are detrimentalfor signal quality and are illus trated in figure 1.

    Fig. 1. OLA-synthesisfrom li r t ime-warped STFT docs notsucceed to rep Iicatc the quasi-pe riodic structirre of tl w origincilsignal (a) n its output (6).To avoid pitch period discontinuities or phase jumps atwaveform-segment joins, [ 2 ] proposes to realign each inputsegment to the already fonned portion of the output signalbefore performing the OLA operation. The resultingsynchronized OLA algorithm (S OLA ) thus produces the signal

    kin a left-to-right fashion, where shift factors Ak (E [-A,,,,.A,,,])are chosen such as to maximize the cross-correlationcoefficient between v(n- L , +A,).& + z-l (L,)- L, +A,) an d

    /=--Another form of synchronization is obtained by applying atime-domain pitch-synchronized OLA technique (TD-PSOLA[3]). In that case the OLA procedure is performed pitch-synchronously (i.e., Lk-Lk-, quals the local pitch period) onsegments that are, accordingly, excised in a pitch synchronousway from an original x( n) .We can thus observe that both SOLA and TD-PSOLArecognize that a tolerance At is needed in order to ensure propersegment synchronization in OLA synthesis: while SOLA usesthis tolerance to allow for post-synchronization in

    TD-PSOLA uses it in a pre-synchronization step to obtain apitch synchronous ST FT on both sides of

    f(w,L,) X ( W , Z - ( L , ) + A ~ ) .

    LII. WSOLA: AN OVERLAP-ADD TECHNIQUE BASEDON WAVEFORM SIMILARITY

    It was shown in the preceding section that, if an O LA synthesisprocedure is to be used for time-scaling, one should allow for atolerance on the time-warping function that will actually berealised. In fact, this tolerance can be seen to give concretefonn to the words corresponding neighbourhoods that wereused in the introductory section to state that the ideal time-scaling algorithm should produce a synthetic waveform y ( n )that maintains maximal local similarity to the originalwaveform n(n) in corresponding neighbourhoods of relatedsample indices n = T ( m ) . lLike TD-PSOLA, WSOLA uses this timing tolerance forspecifying the input segments that are to be used in the OLAprocedure. Therefore, in both cases the basic synthesisequation is

    v(n- Lk.x (n + z-I(L, )+Ak- , )

    While in TD-PSOLA the A, are chosen such that pitchsynchronicity is maintained, WSOLA uses them to ensure thatthe time-scale modified waveform can maintain maximalsimilarity to the original (natural) waveform across its segmentjoins. In other words, WSO LA ensures sufficient signalcontinuity at segment joins by requiring maximal similarity tothe natural continuity that existed in the input signal. Based on

    better mathematical rendering of the intended meaningwould have been possible by usinginstead of eq. (1) .

    Vm: y ( n +m ).w (n) (=) x (n + ~ ( m )A,).w(n)

    Authorized licensed use limited to: National Tsing Hua University. Downloaded on June 3, 2009 at 07:55 from IEEE Xplore. Restrictions apply.

  • 8/6/2019 [WSOLA]an Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High-quality Time-scale Modifications

    3/4

    this idea, a variety of practical implementations can beconstructed. The operation of a basic version of the WSOLAtechnique is illustrated in figure 2and explained below.

    Fig. 2. Illustration of a W SOLA algorithm.By choosing regularly spaced synthesis instants L, = kL and asymmetric window such that

    (7 )synthesis equation (6 ) simplifies to

    y ( n )= C v ( n- L ) . x ( n+T-' (K)- L + A I ) . (8)Proceeding in a left-to-right fashion, assume segment (1) fromfigure 2was the last segment that was excised from the inputand added to the output at time instant Lk-, = (k-l)L, i.e.segment (a) = segment (1). WSOLA then needs to find asegment (b) that will overlap-add with (a) in a synchronizedway and can be excised from the input around time instantr ' ( k . L ) . As (1') would overlap-add with (1) = (a) in a naturalwa y to form a portion of the original input speech, WSOLA canselect (b) such that it resembles (1') as closely as possible andis located within the prescribed tolerance interval aroundr ' ( k L ) n the input wave. The position of this best segment (2 )is found by maximizing a similarity measure (such as the cross-correlation or the cross-AMDF) between the sample sequenceunderlying (1') and the input speech. After overlap-adding (b)with (a), WSOLA proceeds to the next output segment, where(2') now plays the same role as (1') in the previous step.Figure 3 illustrates in more detail how the position of a bestsegment m is detennined by finding the value 6 = A, that lieswithin a tolerance region [-4,,..A,,,,] around z - ' ( m . L ) andmaximizes the chosen similarity measure c(m,6) with respect tothe signal portion that would form a natural continuation for thepreviously chosen segment m-1.

    L

    2A 30 ms hanning window with 50% overlap, for example,satisfies this condition and is a fairly standard choice.

    Fig. 3. Illustration of similarity-based signalsegmentation in WSOLA.N representing the window length, som e examples of similaritymeasures that can be applied successfully are:

    a cross-correlation coefficientN - lc , ( m , 6 )=x x ( n + . c - ' ( ( m 1 ) L) + Am-, L ) . x ( n u r - ' ( m L ) + ) ,"-0

    a normalised cross-correlation coefficient

    or a cross-AMDF coefficientcA m , )=C i x ( n+T-' ( (m )L )+A,,,-' +L )- ( n+~ - ' ( d )64.N - l

    n= O

    IV . EVALUATIONThe performance of WSOLA was evaluated in extensiveinformal listening tests (many of them concerned WSOLA with20 ins hanning windowing, 50% overlap,4, 5 m s, c,(m,6)or cA(m,6),an d 10 W z sampling frequency). For all time-scaling factors tested (z(r)= a t , with a E 10.4 . 0.71U 11.3 ..2.01) we found the resulting speech quality to be very high andto be robust against background noises, including competingvoices. (Figure 4shows an example output waveform.)

    Fig. 4. (a) original speech fragment, (b )corresponding WSOLAoutput waveform when slowed down with a = 0.6.

    11-556

    Authorized licensed use limited to: National Tsing Hua University. Downloaded on June 3, 2009 at 07:55 from IEEE Xplore. Restrictions apply.

  • 8/6/2019 [WSOLA]an Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High-quality Time-scale Modifications

    4/4

    As mentioned before, many variants of the basic WSOLAtechnique are possible. These can be constructed for exampleby varying the windowing function, the similarity measure, orthe portion of x ( n ) that is to serve as the reference for naturalsignal continuity. This design flexibility can be used tooptimize the algorithm for implementation on a given targetsystem. W e found that all tested variants of the algorithmprovided similar high quality, from which we concluded thatwaveform-similarity is a real powerful principle for t ime-scaling. As an example, figure 5 illustrates the robustness ofWSOLA against the choice of distance measure.

    Synchronizing methodEffcctivc window length

    Namalhingdcmnlinsror=ccmstanrAlgailhmic&

    cmpl l a t i odefficiencyRobustness

    SprchqualityPitch modification

    E y o C m r a c ~ , ,,I. , -150100050 difference [samples]

    -50 -25 0 25 50Fig. 5 . Histogram of the difference between alignmentparameters Ak obtained from c,(m,6) and from cA( m,6 ).

    While SOLA and TD-PSOLA will produce an equally highspeech quality when operated properly, they each present somedisadvantages compared to WSOLA. Table 1 summarizes aqualitative comparison between the three methods.

    TD-FSOLA SOLA WSOLApitch epochs outputsimilarity input similarity

    pitch adaptive fixed (>4.pitch) f i xedno no yes

    low hidl very high

    low high highhigh high highY e Ix) no

    PSOLA. It can be noted that the variant of TD-PSOLA thatwas considered can be operated in much the same way as apitch excited vocoder [4]. While it could consequently be usedfor a more general prosodic modification of speech, i t remainstrue that WSOLA can be preferred when only time-scalemodification needs to be performed.

    CONCLUSIONA concept of waveform similarity was proposed for tackling theproblem of time-scale modification of speech, and was worked-out in the context of S TFT manipulation.The resulting WSO LA algorithm is designed in the tradition ofthe OLA, SOLA and TD-PSOL A techniques, and provides highquality output speech with high algorithmic and computationalefficiency and robustness.

    REFERENCESD.W. Griffin, J.S. Lim, Signal Estimation from M odifiedShort-Time Fourier Transfonns, IEE E Trans. on Acoust.,Speech, and Signal Processing, Vol. ASSP-32, No. 2, pp.S . Roucos, A. Wilgus, High quality Time-ScaleModification of Speech, ICASSP-85 , pp. 236-239,1985.E. Moulines, F. Charpentier, Pitch-SynchronousWaveform Processing Techniques for Text-to-SpeechSynthesis Using Diphones, Speech Commu nication Vol. 9(5/6), pp . 453467 , 1990.W. Verhelst, On the Quality of Speech Produced byImpulse Driven Linear Systems, ICASSP-91, pp . 501-504,1991.

    236-243, 1984.

    Authorized licensed use limited to: National Tsing Hua University Downloaded on June 3 2009 at 07:55 from IEEE Xplore Restrictions apply