Patrick-André Savard, Philippe Gournay and Roch Lefebvre Université de Sherbrooke, Québec, Canada

Hybrid Time-Scale Modification of Audio


Presentation content

Problem description
Prior art
◦ Synchronized overlap-add with fixed synthesis (SOLAFS)
◦ Improved phase vocoder
Hybrid time-scale modification
◦ High level algorithm
◦ Classification
◦ Main algorithm
◦ Mode transition
Performance evaluation
◦ Classification performance
◦ Subjective testing results


Problem description

What is time-scale modification?
Subject of interest:
◦ Subjective quality of time-scaled signals
Existing methods:
◦ Time vs frequency approaches
◦ High quality results on specific types of signals
TSM applied to various signal types
◦ Can be speech, music, or mixed-type signals
There is a need for a more “universal” method

Prior art: Synchronized overlap-add with fixed synthesis (SOLAFS)

[Figure: input signal and output signal, with labels Sa, Ss, WLEN and delay]
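The SOLAFS principle on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: only the names Sa, Ss and WLEN come from the slide, while the function name `solafs`, the default values of `wlen`, `ss` and `k_max`, the linear cross-fade, and the convention α = Ss/Sa (α > 1 stretches) are assumptions made for this example.

```python
import numpy as np

def solafs(x, alpha, wlen=400, ss=200, k_max=200):
    """Sketch of SOLAFS: fixed synthesis step Ss, searched analysis shift."""
    x = np.asarray(x, dtype=float)
    sa = int(round(ss / alpha))        # analysis step Sa (assumes alpha = Ss/Sa)
    wov = wlen - ss                    # overlap between successive windows
    fade = np.linspace(0.0, 1.0, wov, endpoint=False)
    y = x[:wlen].copy()                # first window is copied as is
    m = 1
    while m * sa + k_max + wlen <= len(x):
        tail = y[m * ss : m * ss + wov]        # output samples to be overlapped
        tail_energy = np.dot(tail, tail)
        base = m * sa
        best_r, best_k = -np.inf, 0
        # Search the shift k that maximizes the normalized cross-correlation
        for k in range(k_max + 1):
            seg = x[base + k : base + k + wov]
            denom = np.sqrt(tail_energy * np.dot(seg, seg))
            r = np.dot(tail, seg) / denom if denom > 0 else 0.0
            if r > best_r:
                best_r, best_k = r, k
        seg = x[base + best_k : base + best_k + wlen]
        # Cross-fade over the overlap, then append the rest of the window
        y[m * ss : m * ss + wov] = (1.0 - fade) * tail + fade * seg[:wov]
        y = np.concatenate([y, seg[wov:]])
        m += 1
    return y
```

On a strictly periodic input the search finds a shift with a cross-correlation close to 1, so the windows overlap in phase; this is the situation in which a time-domain method of this kind is expected to work well.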

Prior art: Improved phase vocoder

Based on the block-by-block STFT analysis/synthesis model
STFT phases are updated so as to preserve instantaneous frequencies
STFT amplitudes are preserved

Improvements (STFT modification stage):
◦ Peak detection
◦ Compute instantaneous frequencies for peaks
◦ Define regions of influence (ROIs)
◦ Update peak phases
◦ Apply phase locking to ROIs

[Figure: block diagram with FFT, STFT modification stage, IFFT, and overlap-add and gain control; labels N, Ra, Rs]
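The block-by-block STFT model with phase updates that preserve instantaneous frequencies can be sketched as follows. This is a basic phase vocoder only: the peak detection, regions of influence and phase locking listed as improvements are deliberately left out. The function name, the Hann window, the defaults for `n_fft` and `rs`, the convention α = Rs/Ra, and the gain normalization by the summed squared window are assumptions for illustration.

```python
import numpy as np

def phase_vocoder(x, alpha, n_fft=1024, rs=256):
    """Basic phase vocoder sketch (no peak detection or phase locking)."""
    x = np.asarray(x, dtype=float)
    ra = int(round(rs / alpha))                  # analysis hop Ra (assumes alpha = Rs/Ra)
    win = np.hanning(n_fft)
    # Centre frequency of each bin in radians per sample
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    n_frames = (len(x) - n_fft) // ra + 1
    y = np.zeros((n_frames - 1) * rs + n_fft)
    norm = np.zeros_like(y)
    prev_phase = None
    syn_phase = None
    for m in range(n_frames):
        spec = np.fft.rfft(win * x[m * ra : m * ra + n_fft])
        mag, phase = np.abs(spec), np.angle(spec)
        if m == 0:
            syn_phase = phase                    # first frame keeps its phases
        else:
            # Deviation from the expected phase advance, wrapped to [-pi, pi)
            dphi = phase - prev_phase - omega * ra
            dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
            inst_freq = omega + dphi / ra        # instantaneous frequency per bin
            syn_phase = syn_phase + inst_freq * rs
        prev_phase = phase
        # Amplitudes are preserved, only the phases are replaced
        frame = np.fft.irfft(mag * np.exp(1j * syn_phase), n_fft)
        y[m * rs : m * rs + n_fft] += win * frame
        norm[m * rs : m * rs + n_fft] += win ** 2
    # Simple gain control; the floor avoids blowing up the window edges
    return y / np.maximum(norm, 0.1)
```

For a stationary sinusoid the dominant frequency of the output matches that of the input while the duration is scaled by roughly α.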

Hybrid time-scale modification: High level algorithm

Uses a frame-by-frame model
Each frame goes through a classifier
Signals identified as monophonic are processed using SOLAFS
Signals identified as polyphonic or noisy are processed using the phase vocoder

[Flowchart: Read input frame → Classify signal → Process samples using SOLAFS (monophonic) or process samples using the phase vocoder (polyphonic, noisy) → Write output frame]
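The read, classify, process, write loop can be sketched as below. The classifier and the two processing steps are passed in as callables because the slides do not specify their interfaces; the names `hybrid_tsm`, `classify`, `solafs_step` and `pv_step`, the non-overlapping framing, and the label string `"monophonic"` are all hypothetical. The sketch also ignores the mode transitions, which a real implementation has to handle.

```python
import numpy as np

def hybrid_tsm(x, frame_len, classify, solafs_step, pv_step):
    """Frame-by-frame dispatch sketch: one classifier decision per frame."""
    x = np.asarray(x, dtype=float)
    out = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start : start + frame_len]          # read input frame
        if classify(frame) == "monophonic":           # classify signal
            out.append(solafs_step(frame))            # time-domain branch
        else:                                         # polyphonic or noisy
            out.append(pv_step(frame))                # frequency-domain branch
    # Write output frames back to back (no transition handling here)
    return np.concatenate(out) if out else np.zeros(0)
```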


Hybrid time-scale modification: Classification

Goal:
◦ Discriminate monophonic/polyphonic/noise signals
Method used:
◦ Test the maximum of the normalized cross-correlation (C.C.) measure in SOLAFS for each analysis window

[Figure: two panels, "Music Signal" and "Speech Signal" (unvoiced and voiced segments marked). Each panel plots amplitude (-1 to 1) against time (0 to 700 ms) and normalized cross-correlation (0 to 1) against synthesis window number (0 to 45).]

Voiced speech: High C.C.
Music: Low to medium C.C.
Unvoiced speech: Low & high C.C.
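The classification measure, the maximum of the normalized cross-correlation over the SOLAFS search range, can be sketched as follows. The function names, the argument layout and the lag range are assumptions for illustration; the default threshold of 0.6 is the value quoted in the performance evaluation slides.

```python
import numpy as np

def max_normalized_xcorr(ref, x, start, k_max):
    """Return (r_max, best_lag) for lags 0..k_max.

    ref   : overlap region already present in the output
    x     : input signal
    start : first candidate position in x
    """
    ref = np.asarray(ref, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(ref)
    ref_energy = np.dot(ref, ref)
    best_r, best_k = -1.0, 0
    for k in range(k_max + 1):
        seg = x[start + k : start + k + n]
        if len(seg) < n:                 # ran past the end of the input
            break
        denom = np.sqrt(ref_energy * np.dot(seg, seg))
        r = np.dot(ref, seg) / denom if denom > 0 else 0.0
        if r > best_r:
            best_r, best_k = r, k
    return best_r, best_k

def classify(r_max, threshold=0.6):
    """High maximum correlation -> SOLAFS, otherwise phase vocoder."""
    return "solafs" if r_max >= threshold else "phase_vocoder"
```

A periodic signal yields a maximum close to 1 at a lag that realigns the periods, whereas white noise yields a low maximum at every lag, which matches the voiced versus unvoiced behaviour described on the slide.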


Hybrid time-scale modification: Main algorithm

Default method: SOLAFS
Switches to phase vocoder when Rmax < Txcorr
Constraint on minimum length of a SOLAFS synthesis segment

[Figure: two timelines spanning Frame 1 and Frame 2, with segments labelled "SOLAFS processing", "Rmax < Txcorr", "Phase vocoder processing" and "discarded"]
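One possible reading of the switching rule is sketched below. The slides only state that SOLAFS is the default, that the switch happens when Rmax < Txcorr, and that a SOLAFS synthesis segment has a minimum length. How the constraint is enforced is not specified, so the interpretation here (a SOLAFS run shorter than the minimum is discarded and the whole frame goes to the phase vocoder) is an assumption, as are the function name `plan_frame`, the per-window granularity and the default values.

```python
def plan_frame(r_max_per_window, t_xcorr=0.6, min_solafs_windows=2):
    """Return one mode label per synthesis window of a frame (sketch)."""
    n = len(r_max_per_window)
    # SOLAFS is the default until the first window with Rmax < Txcorr
    switch_at = next(
        (i for i, r in enumerate(r_max_per_window) if r < t_xcorr), n
    )
    # Assumed constraint: a SOLAFS run that is too short is discarded
    if switch_at < n and switch_at < min_solafs_windows:
        switch_at = 0
    return ["solafs"] * switch_at + ["phase_vocoder"] * (n - switch_at)
```

For example, `plan_frame([0.9, 0.8, 0.3, 0.9])` keeps two SOLAFS windows and then hands the rest of the frame to the phase vocoder, while `plan_frame([0.9, 0.3, 0.9, 0.9])` processes the whole frame with the phase vocoder under the assumed minimum of two windows.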

Hybrid time-scale modification: SOLAFS to phase vocoder transition

Phase vocoder initialization:
◦ Synthesis padded with input samples
◦ Initialization based on matching input/output samples
Gain control:
◦ More padding needed
◦ Synthesis further padded and windowed to reproduce a phase vocoder output

[Figure: last SOLAFS synthesis window; output signal padded with input samples; initialization based on matching input/output samples; previously padded synthesis; more padding using input samples; resulting synthesis is windowed; first phase vocoder synthesis window overlaps coherently]

Hybrid time-scale modification: Phase vocoder to SOLAFS transition

Current frame’s first analysis window is out of phase with the current output signal
Assume that the current input frame contains a stationary signal
First input window is one phase vocoder analysis step ahead
First SOLAFS segment is overlap-added (OLA) at the last phase vocoder synthesis step
SOLAFS synthesis samples (after the first OLA region) replace synthesis samples obtained by the phase vocoder

[Figure: previous frame and current frame; synthesis signal (before transition); first SOLAFS synthesis window; subsequent SOLAFS synthesis windows; current frame’s first analysis window (not in phase with current output); window approximately in phase with current output]

Performance evaluation: Classification of a speech signal

Signal length = 1 second
Tmax = 0.6
Unvoiced speech is successfully detected
◦ Triggers phase vocoder processing

[Figure: time-scaled speech signal (α=2, Tmax=0.6) against time (0 to 1 s), with the classification results (SOLAFS or phase vocoder) against time (s)]

Performance evaluation: Classification of a music signal

Signal length = 25 seconds
Tmax = 0.6
Classification results:
◦ 91 % phase vocoder
◦ 9 % SOLAFS

[Figure: time-scaled music signal (α=2, Tmax=0.6) against time (0 to 25 s), with the classification results (SOLAFS or phase vocoder) against time (s)]

Performance evaluation: Subjective testing

A/B method
Speech, music and mixed content (speech over music) samples tested
Hybrid method compared to stand-alone techniques
Comparisons performed on compressed and expanded signals
Eight listeners took part in the test
Samples evaluated using a 5-step scale


Performance evaluation: Results
Hybrid vs SOLAFS, α=1.75

[Figure: bar chart with categories H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA; percentage axis from 0% to 70%; series: Speech, Music, Mixed]


Performance evaluation: Results
Hybrid vs Phase vocoder, α=1.75

[Figure: bar chart with categories H >> PV, H > PV, H = PV, H < PV, H << PV; percentage axis from 0% to 50%; series: Speech, Music, Mixed]






Conclusion

A hybrid TSM method is presented
◦ Uses a frame-by-frame classification stage
◦ Selects the best method based on the input signal's monophonic/polyphonic/noise character
◦ Mode transitions
High quality results are obtained
◦ Using speech, music and mixed-content signals
Future work
◦ Refine the classification criterion
◦ Use of phase flexibility to improve phase coherence would improve phase vocoder to SOLAFS transitions

Contact: [email protected]

Thank you.
