Hybrid Time-Scale Modification of Audio
Patrick-André Savard, Philippe Gournay and Roch Lefebvre
Université de Sherbrooke, Québec, Canada
Presentation content
Problem description
Prior art
◦ Synchronized overlap-add with fixed synthesis (SOLAFS)
◦ Improved phase vocoder
Hybrid time-scale modification
◦ High-level algorithm
◦ Classification
◦ Main algorithm
◦ Mode transition
Performance evaluation
◦ Classification performance
◦ Subjective testing results
Problem description
What is time-scale modification?
Subject of interest:
◦ Subjective quality of time-scaled signals
Existing methods:
◦ Time vs frequency approaches
◦ High quality results on specific types of signals
TSM applied to various signal types
◦ Can be speech, music, or mixed-type signals
There is a need for a more “universal” method
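Prior art: Synchronized overlap-add with fixed synthesis (SOLAFS)
[Figure: SOLAFS diagram showing the input signal, the output signal, the analysis step Sa, the synthesis step Ss, the window length WLEN and the window delays.]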
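The overlap-add principle named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the lag search range `kmax`, the overlap length `wlen - ss` and the linear cross-fade are all choices made for this example. Only Sa, Ss and WLEN come from the slide.

```python
import numpy as np

def normalized_xcorr(a, b):
    """Normalized cross-correlation of two equal-length segments."""
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def solafs(x, sa, ss, wlen, kmax):
    """Time-scale x by roughly ss/sa with a fixed synthesis step ss.

    Returns the output signal and, per synthesis window, the maximum
    normalized cross-correlation found during the lag search.
    """
    x = np.asarray(x, dtype=float)
    wov = wlen - ss  # overlap between consecutive synthesis windows (assumption)
    if wov <= 0:
        raise ValueError("wlen must be larger than ss")
    n_win = (len(x) - wlen - kmax) // sa + 1
    if n_win < 1:
        raise ValueError("input too short")
    fade_in = np.linspace(0.0, 1.0, wov, endpoint=False)
    y = np.zeros((n_win - 1) * ss + wlen)
    y[:wlen] = x[:wlen]  # first window is copied unchanged
    rmax = [1.0]
    for m in range(1, n_win):
        out_pos = m * ss
        target = y[out_pos:out_pos + wov].copy()  # already synthesized tail
        best_k, best_r = 0, -np.inf
        for k in range(kmax + 1):  # search the analysis offset
            start = m * sa + k
            r = normalized_xcorr(target, x[start:start + wov])
            if r > best_r:
                best_k, best_r = k, r
        seg = x[m * sa + best_k:m * sa + best_k + wlen]
        # Cross-fade over the overlap, then copy the rest of the window.
        y[out_pos:out_pos + wov] = (1.0 - fade_in) * target + fade_in * seg[:wov]
        y[out_pos + wov:out_pos + wlen] = seg[wov:]
        rmax.append(best_r)
    return y, np.array(rmax)
```

For a strictly periodic input whose period fits inside the search range, the lag search finds an in-phase segment for every window, so the per-window maximum stays close to 1.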
Prior art: Improved phase vocoder
Based on the block-by-block STFT analysis/synthesis model
STFT phases are updated so as to preserve instantaneous frequencies
STFT amplitudes are preserved
Improvements:
◦ Peak detection
◦ Compute instantaneous frequencies for peaks
◦ Define regions of influence (ROIs)
◦ Update peak phases
◦ Apply phase locking to ROIs
[Figure: phase vocoder block diagram: FFT, STFT modification stage, IFFT, overlap-add and gain control; labels N, Ra and Rs.]
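The phase update that preserves instantaneous frequencies can be sketched for one pair of consecutive frames. This shows the standard phase vocoder update applied to every bin, not the improved peak-based phase locking listed on the slide. It assumes that N is the FFT size and that Ra and Rs are the analysis and synthesis hop sizes; the function name is hypothetical.

```python
import numpy as np

def pv_phase_update(prev_phase_in, cur_phase_in, prev_phase_out, n_fft, ra, rs):
    """Standard phase vocoder phase propagation for one frame.

    prev_phase_in / cur_phase_in: input STFT phases of two consecutive frames.
    prev_phase_out: synthesis phases of the previous output frame.
    Returns (new synthesis phases, instantaneous frequencies in rad/sample).
    """
    k = np.arange(len(cur_phase_in))
    omega = 2.0 * np.pi * k / n_fft  # bin center frequencies
    # Heterodyned phase increment, wrapped to the principal interval.
    dphi = cur_phase_in - prev_phase_in - ra * omega
    dphi = dphi - 2.0 * np.pi * np.round(dphi / (2.0 * np.pi))
    inst_freq = omega + dphi / ra
    # Advance the output phase by the synthesis hop at that frequency.
    new_phase_out = prev_phase_out + rs * inst_freq
    return new_phase_out, inst_freq
```

The STFT amplitudes are left untouched; only the phases returned here would be combined with the original magnitudes before the IFFT.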
Hybrid time-scale modification: High-level algorithm
Uses a frame-by-frame model
Each frame goes through a classifier
Signals identified as monophonic are processed using SOLAFS
Signals identified as polyphonic or noisy are processed using the phase vocoder
[Figure: flowchart. Read input frame; classify signal; process samples using SOLAFS (monophonic) or using the phase vocoder (polyphonic, noisy); write output frame.]
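The flowchart amounts to a per-frame dispatch, which can be sketched as below. The function and parameter names are hypothetical, the classifier and the two processing routines are passed in as placeholders, and the sketch leaves out the mode transitions that the method needs when the classification changes between frames.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]

def hybrid_tsm(
    frames: Iterable[Frame],
    is_monophonic: Callable[[Frame], bool],
    solafs_step: Callable[[Frame], Sequence[float]],
    phase_vocoder_step: Callable[[Frame], Sequence[float]],
) -> List[float]:
    """Read each frame, classify it, process it, and write the output."""
    output: List[float] = []
    for frame in frames:
        if is_monophonic(frame):
            # Monophonic frames go to the time-domain method.
            output.extend(solafs_step(frame))
        else:
            # Polyphonic or noisy frames go to the phase vocoder.
            output.extend(phase_vocoder_step(frame))
    return output
```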
Hybrid time-scale modification: Classification
Goal:
◦ Discriminate monophonic/polyphonic/noise signals
Method used:
◦ Test the maximum of the normalized cross-correlation (C.C.) measure in SOLAFS for each analysis window
[Figure: two panels, "Music Signal" and "Speech Signal" (unvoiced and voiced segments marked). Each panel shows amplitude vs. time (ms, 0 to 700) and normalized cross-correlation vs. synthesis window number (0 to 45).]
Voiced speech: High C.C.
Music: Low to medium C.C.
Unvoiced speech: Low & high C.C.
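The classification test can be sketched as a search for the maximum normalized cross-correlation followed by a threshold comparison. This is an illustration under assumptions: the function names and the lag range `kmax` are hypothetical, and the default threshold of 0.6 is borrowed from the value shown on the evaluation slides, on the assumption that it is the same threshold.

```python
import numpy as np

def max_normalized_xcorr(target, x, start, kmax):
    """Return (Rmax, best lag) over candidate segments x[start+k : start+k+len(target)]."""
    target = np.asarray(target, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(target)
    e_target = np.dot(target, target)
    best_r, best_k = -1.0, 0
    for k in range(kmax + 1):
        cand = x[start + k:start + k + n]
        denom = np.sqrt(e_target * np.dot(cand, cand))
        r = np.dot(target, cand) / denom if denom > 0 else 0.0
        if r > best_r:
            best_r, best_k = r, k
    return float(best_r), best_k

def choose_method(rmax, t_xcorr=0.6):
    """Low maximum correlation suggests polyphonic or noisy content."""
    return "phase_vocoder" if rmax < t_xcorr else "solafs"
```

A periodic segment yields a maximum close to 1 and stays with SOLAFS, whereas uncorrelated noise yields a low maximum and triggers the phase vocoder.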
Hybrid time-scale modification: Main algorithm
Default method: SOLAFS
Switches to phase vocoder when Rmax < Txcorr
Constraint on minimum length of a SOLAFS synthesis segment
[Figure: timelines over Frame 1 and Frame 2. SOLAFS processing runs until Rmax < Txcorr is detected, after which phase vocoder processing takes over; one segment is marked "discarded".]
Hybrid time-scale modification: SOLAFS to phase vocoder transition
Phase vocoder initialization:
◦ Synthesis padded with input samples
◦ Initialization based on matching input/output samples
Gain control:
◦ More padding needed
◦ Synthesis further padded and windowed to reproduce a phase vocoder output
[Figure: transition steps. Last SOLAFS synthesis window; output signal padded with input samples; initialization based on matching input/output samples; previously padded synthesis; more padding using input samples; resulting synthesis is windowed; first phase vocoder synthesis window overlaps coherently.]
Hybrid time-scale modification: Phase vocoder to SOLAFS transition
Current frame’s first analysis window is out of phase with the current output signal
Assume that the current input frame contains a stationary signal
First input window is one phase vocoder analysis step ahead
First SOLAFS segment is overlap-added (OLA) at the last phase vocoder synthesis step
SOLAFS synthesis samples (after the first OLA region) replace synthesis samples obtained by the phase vocoder
[Figure: previous frame and current frame. Labels: synthesis signal (before transition); first SOLAFS synthesis window; subsequent SOLAFS synthesis windows; current frame’s first analysis window (not in phase with current output); approximately in phase with current output.]
Performance evaluation: Classification of a speech signal
Signal length = 1 second
Tmax = 0.6
Unvoiced speech is successfully detected
◦ Triggers phase vocoder processing
[Figure: time-scaled speech signal (α=2, Tmax=0.6) and classification results (SOLAFS or phase vocoder) vs. time (s), 0 to 1.]
Performance evaluation: Classification of a music signal
Signal length = 25 seconds
Tmax = 0.6
Classification results:
◦ 91 % phase vocoder
◦ 9 % SOLAFS
[Figure: time-scaled music signal (α=2, Tmax=0.6) and classification results (SOLAFS or phase vocoder) vs. time (s), 0 to 25.]
Performance evaluation: Subjective testing
A/B method
Speech, music and mixed content (speech over music) samples tested
Hybrid method compared to stand-alone techniques
Comparisons performed on compressed and expanded signals
Eight listeners took part in the test
Samples evaluated using a 5-step scale
Performance evaluation: Results, Hybrid vs SOLAFS, α=1.75
[Figure: bar chart, percentage of responses (0% to 70%) for speech, music and mixed samples over five categories: H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA.]
Performance evaluation: Results, Hybrid vs Phase vocoder, α=1.75
[Figure: bar chart, percentage of responses (0% to 50%) for speech, music and mixed samples over five categories: H >> PV, H > PV, H = PV, H < PV, H << PV.]
Performance evaluation: Results, Hybrid vs SOLAFS, α=0.75
[Figure: bar chart, percentage of responses (0% to 60%) for speech, music and mixed samples over five categories: H >> SOLA, H > SOLA, H = SOLA, H < SOLA, H << SOLA.]
Performance evaluation: Results, Hybrid vs Phase vocoder, α=0.75
[Figure: bar chart, percentage of responses (0% to 60%) for speech, music and mixed samples over five categories: H >> PV, H > PV, H = PV, H < PV, H << PV.]
Conclusion
A hybrid TSM method is presented
◦ Uses a frame-by-frame classification stage
◦ Selects the best method based on the input signal’s monophonic/polyphonic/noise character
◦ Mode transitions
High quality results are obtained
◦ Using speech, music and mixed-content signals
Future work
◦ Refine the classification criterion
◦ Use of phase flexibility to improve phase coherence would improve phase vocoder to SOLAFS transitions