VQ Source Models: Perceptual & Phase Issues

VQ Source Models - Ellis & Weiss 2006-05-16 - /121
1. Source Models for Separation 2. VQ with Perceptual Weighting 3. Phase and Resynthesis 4. Conclusions
VQ Source Models: Perceptual & Phase Issues
Dan Ellis & Ron Weiss Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,ronw}@ee.columbia.edu http://labrosa.ee.columbia.edu/
Single-Channel Scene Analysis
underconstrained: infinitely many decompositions time-frequency overlaps cause obliteration .. no obvious segmentation of sources (?)
time / s time / s
2
4
6
8
0 0.5 1 1.5 2 0
level / dB
combination physics source models
i.e. no projection can guarantee to remove overlaps
• Overlaps ⇒ Ambiguity scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically i.e. posteriors of sources {Si} given observations X: P({Si}| X) ∝ P(X |{Si}) P({Si})
• Better source models → better inference .. learn from examples?
3
Vector-Quantized (VQ) Source Models
• “Constraint” of source can be captured explicitly in a codebook (dictionary):
defines the ‘subspace’ occupied by source
• Codebook minimizes distortion (MSE) by k-means clustering
4
1000 200 300 400 500 600 700 800 codebook index
20
40
60
80
level / dB
Simple Source Separation • Given models for sources,
find “best” (most likely) states for spectra:
can include sequential constraints... different domains for combining c and defining
• E.g. stationary noise:
5
{i1(t), i2(t)} = argmaxi1,i2p(x(t)|i1, i2) p(x|i1, i2) = N (x;ci1+ ci2,Σ) combination
model
0 1 2
0 1 2
Codebook Size • Two (main) variables:
o number of codewords o amount of training data
• Measure average accuracy (distortion):
5.5
6
6.5
7
7.5
8
8.5
9
9.5
10
)
600 training frames 1000 training frames 2000 training frames 4000 training frames 8000 training frames 16000 training frames o main effect of
codebook size o larger codebooks need/allow more data o (large absolute distortion values)
Distortion Metric • Standard MSE gives equal weight by channel
excessive emphasis on high frequencies
• Try e.g. Mel spectrum approx. log spacing of frequency bins
• Little effect (?):
d
0 0.5 1 1.5 2time / s0 0.5 1 1.5 2 0
20
40
60
80
Linear Frequency Mel Frequency
VQ800 Codebook - Linear distortion measure VQ800 Codebook - Mel/cube root distance
codebook index codebook index
Resynthesis Phase • Codewords quantize spectrum magnitude
phase has arbitrary offset due to STFT grid
• Resynthesis (ISTFT) requires phase info use mixture phase? no good for filling-in
• Spectral peaks indicate common instantaneous frequency ( ) can quantize and cumulate in resynthesis .. like the “phase vocoder”
8
-40
-20
0
20
0 200 400 600 800 1000 0
200
400
600
800
1000
Resynthesis Phase (2) • Can also improve phase iteratively
repeat:
goal:
9
X (1)(t, f )=|X(t, f )| · exp{ jφ(1)(t, f )} x(1)(t)=istft{X (1)(t, f )}
φ(2)(t, f )=∠ ! stft{x(1)(t)}
0 1 2 3 4 5 6 7 83
4
5
6
7
8
Starting from random phase
Evaluating Model Quality • Low distortion is not really the goal;
models are to constrain source separation fit source spectra but reject non-source signals
• Include sequential constraints e.g. transition matrix for codewords .. or smaller HMM with distributions over codebook
• Best way to evaluate is via a task e.g. separating speech from noise
10 0 0.5 1 1.5 2 2.5
80 60 40 20 0
80 60 40 20 0
80 60 40 20 0
level / dB
time / s
fre q
/ b in
Direct VQ (magsnr = -0.6 dB)
HMM-smoothed VQ (magsnr = -2.1 dB)
Future Directions • Factorized codebooks
codebooks too large due to combinatorics separate codebooks for type, formants, excitation?
• Model adaptation many speaker-dependents model, or... single speaker-adapted model, fit to each speaker
• Using uncertainty enhancing noisy speech for listeners: use special tokens to preserve uncertainty
11
0 1000 2000 3000 4000 5000 6000 7000 8000 -40
-20
0
20
40
Summary • Source models permit separation of
underconstrained mixtures or at least inference of source state
• Explicit codebooks need to be large .. and chosen to optimize perceptual quality
• Resynthesis phase can be quantized .. using “phase vocoder” derivative .. iterative re-estimation helps more
12
Extra Slides
Other Uses for Source Models Projecting into the model’s space:
• Restoration / Extension inferring missing parts
• Generation / Copying: adaptation + fitting
Example 2: Mixed Speech Recog. • Cooke & Lee’s Speech Separation Challenge
short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4> e.g. "bin white at M 5 soon" t5_bwam5s_m5_bbilzp_6p1.wav
System Noise Condition
HTK 1.0 45.7 82.0 88.6 87.2
GDL-MAP I 2.0 33.2 68.6 85.4 87.3
GDL-MAP II 2.7 7.6 14.8 49.6 77.2
oracle 1.1 4.2 8.4 39.1 76.4
SDL 1.4 3.4 7.7 38.4 77.3
Table 2: Word error rates (percent) on the noisy development set.
The error rate for the “random-guess” system is 87%. The sys-
tems in the table are: 1) The default HTK recognizer, 2) IBM–
GDL MAP–adapted to the speech separation training data, 3)
MAP–adapted to the speech separation training data and artifi-
cially generated training data with added noise, 4) Oracle MAP
adapted Speaker dependent system with known speaker IDs, 5)
MAP adapted speaker dependent models with SDL
6. Experiments and Results
The Speech Separation Challenge [1] involves separating the
mixed speech of two speakers drawn from of a set of 34 speakers.
An example utterance is place white by R 4 now. In each record-
ing, one of the speakers says white while the other says blue, red
or green. The task is to recognize the letter and the digit of the
speaker that said white.
We decoded the two component signals under the assumption
that one signal contains white and the other does not, and vice
versa. We then used the association that yielded the highest com-
bined likelihood.
Log-power spectrum features were computed at a 15 ms rate.
Each frame was of length 40 ms and a 640 point FFT was used
producing a 3195 dimensional log-power-spectrum feature vector.
6 dB 3 dB 0 dB !3 dB !6 dB !9 dB 0
50
100 a) Same Talker
6 dB 3 dB 0 dB !3 dB !6 dB !9 dB 0
50
100 b) Same Gender
6 dB 3 dB 0 dB !3 dB !6 dB !9 dB 0
50
SDL Recognizer No dynamics Acoustic dyn. Grammar dyn. Human
Figure 1: Word error rates for the a) Same Talker, b) Same Gender
and c) Different Gender cases.
5the DC component was discarded
6 dB 3 dB 0 dB -3 dB -6 dB -9dB All
ST 29 42 47 47 46 55 44.3
SG 8 10 13 13 15 30 14.8
DG 9 8 11 18 22 36 17.3
All 16.0 21.2 25.0 26.8 28.8 41.2 26.5
Table 3: Word error rates (percent) for grammar and acoustic con-
straints. ST-Same Talker, SG-Same Gender, DG-Different Gen-
der. Conditions where our system outperformed human listeners
are bolded.
Figure 1 shows results for the 3 different conditions. Human
listener performance [1] is shown along with the performance of
the SDL recognizer without separation, GMM without dynam-
ics, using acoustic level dynamics, and using both grammar and
acoustic-level dynamics.
The top plot in Figure 1 shows word error rates (WER) for the
Same Talker condition. In this condition, two recordings from the
same speaker are mixed together. This conditions best illustrates
the importance of temporal constrains. By adding the acoustic
dynamics, performance is improved considerably. By combin-
ing grammar and acoustic dynamics, performance improves again,
surpassing human performance in the !3 dB condition. The second plot in Figure 1 shows WER for the Same Gender
condition. In this condition, recordings from two different speak-
ers of the same gender are mixed together. In this condition our
system surpasses human performance in all conditions except 6 dB and !9 dB.
The third plot in Figure 1 showsWER for the Different Gender
condition. In this condition, our system surpasses human perfor-
mance in the 0 dB and 3 dB conditions. Interestingly, temporal constraints do not improve performance relative to GMM without
dynamics as dramatically as in the same talker case, which indi-
cates that the characteristics of the two speakers in a short segment
are effective for separation. The performance of our best system, which uses both gram-
mar and acoustic-level dynamics, is summarized in Table 3. This system surpassed human lister performance at SNRs of 0 dB and !3 dB on average across all speaker conditions. Averaging across all SNRs, the system surpassed human performance in the Same Gender condition. Based on these initial results, we envision that super-human performance over all conditions is within reach.
7. References
[1] Martin Cooke and Tee-Won Lee, “Interspeech speech separation challenge,” http : //www.dcs.shef.ac.uk/ ! martin/SpeechSeparationChallenge.htm, 2006.
[2] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” ICASSP, 2004.
[3] S. Roweis, “Factorial models and refiltering for speech separation and denoising,” Eurospeech, pp. 1009–1012, 2003.
[4] P. Varga and R.K. Moore, “Hidden markov model decomposition of speech and noise,” ICASSP, pp. 845–848, 1990.
[5] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291– 298, 1994.
[6] Peder Olsen and Satya Dharanipragada, “An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,,” in Proceedings of Eurospeech 2003, Geneva, Switzerland, September 1-4 2003, vol. 4, pp. 2509–2512.
o Model individual speakers (512 mix GMM) o Infer speakers and gain o Reconstruct speech o Recognize as normal... • Grammar constraints a big help
Kristjansson et al. Interspeech’06
w er
Model-Based Separation • Central idea:
Employ strong learned constraints to disambiguate possible sources {Si} = argmaxSi P(X | {Si})
• e.g. fit speech-trained Vector-Quantizer to mixed spectrum:
separate via T-F mask (again)
16
MAXVQ Results: Denoising
time
time
time
time
time
fr eq u en cy
Training: 300sec of isolated speech (TIMIT) to fit 512 codewords, and 100sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0dB SNR
MAXVQ Results: Denoising
time
time
time
time
time
fr eq u en cy
Training: 300sec of isolated speech (TIMIT) to fit 512 codewords, and 100sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0dB SNR
fro m
R ow
Separation or Description? • Are isolated waveforms required?
clearly sufficient, but may not be necessary not part of perceptual source separation!
• Integrate separation with application? e.g. speech recognition
words output = abstract description of signal
17
vs.
Evaluation • How to measure separation performance?
depends what you are trying to do
• SNR? energy (and distortions) are not created equal different nonlinear components [Vincent et al. ’06]
• Intelligibility? rare for nonlinear processing to improve intelligibility listening tests expensive
• ASR performance? separate-then-recognize too simplistic; ASR needs to accommodate separation
18

Documents

VQ Source Models: Perceptual & Phase Issues